Recover a Website with the Help of Google

February 19th, 2009

Long ago, in a galaxy far, far away (OK, it was Las Vegas, but that seems very foreign to me), some servers were stolen. Vegas is a shady place, and sadly we were on the receiving end of some shadiness. At any rate, all we were left with were the files and reports people had saved to their personal computers. And yet, within two weeks and a lot of long days, we had sites up and running again with most of their data intact.

Part of the recovery came from the data people had saved on their machines, which was really helpful. Lots of database info such as prices, product codes, and names was sitting in Excel files, and we plopped it into a new database for the site. The rest, like what was on the home page, we retrieved from our soon-to-be new best friend, the Google cache. Fresher and more comprehensive than archive.org, Google’s cache had pretty much every missing page we needed to rebuild. The process went something like this:

1. Make a list of missing pages (basically the home page plus the product pages).

2. Find missing pages in Google by searching for the URL, the code, anything to get the page to show up.

3. Click on the cache link and save the page to an HTML file.

See the cache link by the red arrow? That shows the web page as Google spidered and indexed it, as opposed to the live web page you would see by clicking on the result link. Usually there’s not a big difference between the cached version and the live one, but there is a lag between Google’s last visit and a website’s most recent changes. That lag can be minutes (if the website is CNN) or months (if the site is less popular).

4. Give the HTML files to engineers, who parse out the content of all the important data fields and put it back into the database.

5. QA everything, then set it live again.

6. Pesky evildoers find themselves thwarted by technology!
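For the curious, step 4 might look roughly like the sketch below. It is not the code our engineers actually wrote; it only illustrates the idea using Python's standard library. The class names (`product-name`, `product-code`, `price`), the `products` table, and the function names are all made up for the example, and a real saved page would need its own mapping.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical mapping from CSS class names in the saved pages to database columns.
FIELD_CLASSES = {"product-name": "name", "product-code": "code", "price": "price"}


class ProductPageParser(HTMLParser):
    """Collects the text inside elements whose class appears in FIELD_CLASSES."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # column currently being captured, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in FIELD_CLASSES:
                self._current = FIELD_CLASSES[cls]
                self.fields[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the capture, so nested
        # markup inside a field is not supported by this sketch.
        self._current = None


def parse_product(html_text):
    """Return a dict of column name -> text for one saved page."""
    parser = ProductPageParser()
    parser.feed(html_text)
    parser.close()
    return {col: text.strip() for col, text in parser.fields.items()}


def load_products(conn, pages):
    """Parse each saved page and store complete rows in the products table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(code TEXT PRIMARY KEY, name TEXT, price TEXT)"
    )
    for html_text in pages:
        row = parse_product(html_text)
        # Skip pages where any of the three fields could not be found.
        if {"code", "name", "price"} <= row.keys():
            conn.execute(
                "INSERT OR REPLACE INTO products (code, name, price) "
                "VALUES (:code, :name, :price)",
                row,
            )
    conn.commit()
```

In practice you would read each saved file from disk and pass its contents to `load_products`, then spot-check the rows against the Excel exports before moving on to QA.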

Normally the cache is useful for figuring out Google’s lag time in updating, or for viewing a site that happens to be unavailable at the moment. But should you find that some web content has gone missing, it’s there to help you as well.