How to restore websites from the Web Archive - Part 3

Published: 2019-12-04

Choosing the “BEFORE” limit when restoring websites from the Web Archive

Last time we covered how to prepare a domain for restoration and how to open indexing through robots.txt. Today you will learn how to choose the date of a fully functional version of the old site in the Web Archive.

When a domain expires, a parking page from the domain registrar or hosting provider may appear in its place. If the Internet Archive crawls such a page, it saves it as a fully working capture and shows it on the calendar like any other. If you restore a website using such a date, you will get the parking page instead of the normal site. How can you avoid this and find the date when all pages of the website were still working?

Instructions for choosing the date, using an example domain

Enter the domain on the Web Archive main page to open its capture calendar.

Next, look for the latest working version, which is marked in blue. Open it and check how it looks. If it is a domain registrar parking page, keep searching. Note the time when the working content was archived.

Note that the timestamp in the link is the date and time when only the HTML code of this page was saved, not its CSS styles, images or scripts. Each of these elements has its own save date, sometimes significantly different from that of the HTML file. The web archive needs some time to save a page completely with all its elements: after the page code is saved, the remaining elements are stored with a delay ranging from several seconds to several days, and sometimes even more. Therefore, if you enter this exact timestamp as the “to” date of the domain, only part of the page will be restored.
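To make the gap between save dates concrete, here is a minimal sketch that reads the timestamp out of two capture links and compares them. It assumes the usual Wayback Machine link layout (a 14-digit timestamp after /web/, optionally followed by a two-letter modifier such as cs_ or im_). The two links and the function name are hypothetical examples, not data from the site discussed here.

```python
import re
from datetime import datetime

# Capture links look like:
# https://web.archive.org/web/<14-digit timestamp>[modifier]/<original URL>
CAPTURE_RE = re.compile(r"/web/(\d{14})([a-z]{2}_)?/(.+)$")

def capture_time(wayback_url):
    """Return (capture datetime, original URL) for a capture link."""
    match = CAPTURE_RE.search(wayback_url)
    if match is None:
        raise ValueError(f"not a capture link: {wayback_url}")
    stamp, _modifier, original = match.groups()
    return datetime.strptime(stamp, "%Y%m%d%H%M%S"), original

# Hypothetical captures of one page and of its stylesheet.
page = "https://web.archive.org/web/20180315101500/http://example.com/"
style = "https://web.archive.org/web/20180317080000cs_/http://example.com/styles.css"

page_time, _ = capture_time(page)
style_time, _ = capture_time(style)
delay = style_time - page_time
print(f"stylesheet saved {delay} after the HTML")
```

In this made-up case the stylesheet was saved almost two days after the page code, so a limit set exactly at the page timestamp would leave it out.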

Next, open DevTools to see where all the elements of the site are loaded from. Press Ctrl+Shift+I or F12 in your browser to open the developer tools panel.

Go to the Network tab and uncheck Persist Logs. Then press F5 to refresh the page.

Enter the desired domain in the filter field. The filter matches partial input, so even part of the domain name will return results, and anything typed after a space is treated as an additional filter term.

This means that if we enter the domain name Trimastera together with 201803 (the year and month of the last working capture, taken from the correct “blue” date on the calendar), we get all the links that match both terms, without external styles, images and other external resources.
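The filter logic described above can be sketched in a few lines. This is only an illustration of "every term separated by a space must match", not the browser's actual implementation. The request list and the function name are hypothetical.

```python
def matches_filter(url, filter_text):
    """True when every space-separated term occurs in the URL (case-insensitive)."""
    url = url.lower()
    return all(term in url for term in filter_text.lower().split())

# Hypothetical requests as they might appear in the Network tab.
requests = [
    "https://web.archive.org/web/20180315101500/http://example.com/",
    "https://web.archive.org/web/20180315101500cs_/http://example.com/styles.css",
    "https://web.archive.org/web/20170101000000im_/http://example.com/logo.png",
    "https://web.archive.org/web/20180315101502js_/http://cdn.example.net/lib.js",
]

# Domain part plus year and month: both terms must be present.
shown = [url for url in requests if matches_filter(url, "example.com 201803")]
for url in shown:
    print(url)
```

Only the first two requests remain: the third belongs to another month and the fourth is an external resource.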

For this domain we see many URLs with a 302 response code. The web archive assigns the page's timestamp to every element on the page, but when an element is requested directly, the web archive redirects to that element's original timestamp.

Take the styles.css file as an example. Inside the page, its link carries the same timestamp as the page itself, but when we open the URL of the stylesheet directly, we see a different timestamp that belongs to this particular element.

Now change the date 201803 in the filter to later and earlier months, one at a time, to check whether there were working pages in those periods, i.e. pages with a 200 response code. In this way we confirm that March 2018 is the last period in which the working website was indexed.
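A similar month-by-month check can be approximated outside the browser. The sketch below only builds query links and makes no requests. It assumes the Wayback Machine's public CDX endpoint and its url, matchType, from, to, filter, output and limit parameters. The domain example.com and the helper name are placeholders of our own. This is an alternative to the DevTools filter, not the method used in this guide.

```python
from urllib.parse import urlencode

# Assumed public CDX endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(domain, month, status="200"):
    """Build a CDX query for captures of `domain` in one month (YYYYMM)."""
    params = {
        "url": domain,
        "matchType": "domain",   # include subdomains
        "from": month,
        "to": month,
        "filter": f"statuscode:{status}",
        "output": "json",
        "limit": "50",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

# Step through neighbouring months, as with the manual filter.
for month in ("201802", "201803", "201804"):
    print(cdx_query_url("example.com", month))
```

Opening each printed link shows whether any captures with a 200 response code exist for that month. An empty result suggests the site was no longer working then.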

You can specify a wider period, for example only a year, in the filter field. But if you set such a value as the “to” date, the web archive will take the version as of the last second of 2018, and at that moment, as we have seen, the website no longer worked. On the other hand, a very precise value, down to the second, may miss individual media or text elements that were indexed later.
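The effect of a short “to” value can be shown with a small sketch. It assumes, as described above for a year, that a partial value is treated as the last second of the period it names. The function name is our own.

```python
import calendar

def expand_to_timestamp(partial):
    """Pad a partial YYYY[MM[DD[hh[mm[ss]]]]] value to the last second it covers."""
    if not partial.isdigit() or len(partial) not in (4, 6, 8, 10, 12, 14):
        raise ValueError(f"unsupported timestamp: {partial!r}")
    year = int(partial[0:4])
    month = int(partial[4:6]) if len(partial) >= 6 else 12
    # Last day of the month, taking leap years into account.
    last_day = calendar.monthrange(year, month)[1]
    day = int(partial[6:8]) if len(partial) >= 8 else last_day
    hour = int(partial[8:10]) if len(partial) >= 10 else 23
    minute = int(partial[10:12]) if len(partial) >= 12 else 59
    second = int(partial[12:14]) if len(partial) == 14 else 59
    return f"{year:04d}{month:02d}{day:02d}{hour:02d}{minute:02d}{second:02d}"

print(expand_to_timestamp("2018"))    # whole year
print(expand_to_timestamp("201803"))  # only March 2018
```

A year alone reaches to December 31, long after the site stopped working, while a year and month stay within the last working period and still leave room for elements saved a few days after the page code.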

Now open the Site Map tool. We see that for 2019 there are no internal pages of the website.

Open 2018. Now the website structure is displayed, which means that the website was indexed that year.

Open a few of the 2018 pages in new tabs as a spot check. As we can see, the timestamps of these pages were all recorded on March 15, although at different times. This confirms the month of the last working pages.

Then open the Summary tool for 2018, the year we chose earlier, and check the New URLs column in the table. No new links were indexed in 2018, which means that the restoration will download the latest website version from the web archive.

Open the Explore tool and analyze the URLs listed in the table. If the domain was prepared correctly with a robots.txt file, we will see a table with all the links for this year.

Let's test the hypothesis that the robots.txt file closed the site to indexing. We look at all versions of robots.txt in the calendar and see that on July 13, 2018 the site was closed.
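Whether a saved robots.txt version closes the site can also be checked mechanically. A minimal sketch using Python's standard urllib.robotparser follows. The user agent name ia_archiver is an assumption about which crawler name matters here, and both sample files are invented.

```python
from urllib.robotparser import RobotFileParser

def blocks_archiver(robots_txt, user_agent="ia_archiver"):
    """True when this robots.txt forbids the given crawler from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")

# Invented examples of a closed and an open robots.txt.
closed_version = "User-agent: *\nDisallow: /\n"
open_version = "User-agent: *\nDisallow:\n"

print(blocks_archiver(closed_version))
print(blocks_archiver(open_version))
```

Pasting the text of each saved robots.txt version into such a check shows from which capture onward the site was closed.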

If you face this situation, you need to prepare your domain by uploading a new robots.txt file, as described in the previous guide.

Once robots.txt has been prepared, use the Explore tool to sort all the links for the year of interest by the “to” date, so that the most recent ones are at the top. We are interested only in links that are not redirects, parking pages, or other external elements unrelated to the website contents. The date of the most recent such URL is the date of the last indexing (saving) of the current version of the website materials.
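The selection rule above can be written down as a short sketch. The record layout, field names and sample rows are hypothetical. In practice the values would be copied from the Explore table after checking each link by hand.

```python
from dataclasses import dataclass

@dataclass
class Link:
    url: str
    to_timestamp: str         # 14-digit "to" date of the link
    status: int               # response code seen when opening the link
    is_parking: bool = False  # True when the capture shows a parking page

def last_working_timestamp(links):
    """Latest "to" date among links that are neither redirects nor parking pages."""
    working = [link for link in links if link.status == 200 and not link.is_parking]
    if not working:
        return None
    # 14-digit timestamps sort chronologically as plain strings.
    return max(link.to_timestamp for link in working)

# Invented rows for illustration.
links = [
    Link("http://example.com/", "20180713090000", 200, is_parking=True),
    Link("http://example.com/about", "20180501120000", 302),
    Link("http://example.com/news", "20180315101500", 200),
    Link("http://example.com/contacts", "20180310080000", 200),
]

print(last_working_timestamp(links))
```

The parking page and the redirect are skipped even though they are newer, so the result is the March 15 capture.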

Remember that analysis with the Explore tool, as well as any work with the “to” date of the domain as a whole, should be carried out 24-72 hours after uploading the new robots.txt file, so that the web archive has time to index all website elements correctly.

In our example we cannot sort the files in the web archive, since no new robots.txt file was uploaded on the new domain and hosting.


Difficulties when searching for the “to” date

Example: website

We go through the same steps as in the previous example:

On the website's capture calendar we look for the “blue” save date (in our case, December 4, 2016), then check that it has no redirect or other errors. On the saved page version, open DevTools (F12, then refresh with F5). As before, enter part of the domain and a date later than the one under consideration (2017). Since there are no links with a 200 response code for that period, the website was last recorded as working in 2016.

Open the Site Map tool to check this hypothesis. In most cases, the last year for which the structure is displayed is the last year in which the website was working.

As we can see, 2016 is the last year in which the website was saved, which confirms our assumption. A working structure is displayed for exactly this period. Go to the Summary tool and look at the New URLs section for 2016. As you can see, 39 unique links were generated during this period, which means that this was the last year in which the site functioned with the current page versions.

Open the Explore tool. Since a results table is shown, the website is open for indexing, which means that robots.txt does not block access to bots. Sort the results starting from the most recent. Check whether the links work, excluding those with redirects and errors. For 2018, all links are redirects. For 2017, new materials are shown, indexed on March 3.

We check these links and see that they are redirect pages, so they should not be treated as working pages that need to be restored.

Check the links for 2016. They work, which confirms the hypothesis that December 2016 is the “to” date for this domain.

Example: website

As in the previous examples, we look at the latest version of the website, from 2018. We see that the website was last working in 2015.

Open DevTools again (F12, then F5) on the version we selected, dated March 18, 2015. As before, enter the domain name and try different dates to find the last period that returns a 200 response code. Going month by month, we see that March 2015 is the last period when the website was available. Go to the Summary tool and open 2015, the presumed year.

We see that 2 new files were generated during this period. If many files had been created, we would also look for them in April and May. With only 2 files, we look for them in the month already identified by setting the “to” date to 201503.

To confirm our assumptions, open the Explore tool for 2015. We see that the website data is no longer indexed (no table is loaded).

This means that robots.txt restrictions were configured on this website. We conclude that the date we determined is the last one on which the site was available.



How to restore websites from the Web Archive - Part 1

How to restore websites from the Web Archive - Part 2

The use of article materials is allowed only if the link to the source is posted:
