A few days ago I was browsing my own blog, reading some old entries. Following the links in those entries, I found that a good many of their targets didn’t exist anymore. Some returned 404 errors; others returned server errors or simply redirected to the main page of the site that used to host them.
Curious to see what proportion of the links on my blog were dead, I cooked up a small Python program to find out. The program was simple: it collected all the URLs found in the files of my blog (a local copy, of course) and asked the servers for the status of each page with a HEAD request, falling back to a GET request if HEAD was not allowed.
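For the curious, a check of this kind can be sketched in a few lines. This is not the original program, just a minimal illustration using only the standard library. The names `fetch_status` and `check_url`, the 10-second timeout, and the choice to treat 405 and 501 as "HEAD not allowed" are all assumptions of the sketch.

```python
import http.client
import urllib.error
import urllib.request


def fetch_status(url, method, timeout=10):
    """Send one request and return its HTTP status code.

    Returns None when no HTTP response is obtained at all
    (DNS failure, refused connection, timeout, malformed URL).
    """
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with a 4xx or 5xx status.
        return error.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return None


def check_url(url, fetch=fetch_status):
    """Try HEAD first; retry with GET if the server rejects HEAD."""
    status = fetch(url, "HEAD")
    if status in (405, 501):  # assumption: these mean "HEAD not supported here"
        status = fetch(url, "GET")
    return status
```

Calling `check_url("http://example.com/")` returns an integer status code, or `None` if the server could not be reached. The `fetch` parameter exists only so the fallback logic can be exercised without touching the network.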
After running the program and printing some statistics, I found that I had linked to roughly 1200 URLs in the two versions of my blog (Portuguese and English). A few URLs came from comments, but I included them in the statistics as well to save myself the trouble of separating them from my own links. Also, some URLs were parsed incorrectly, given the relative difficulty of properly identifying URLs in raw text.
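To give an idea of why raw-text extraction is error-prone, here is a rough sketch of a regular-expression approach. The pattern and the `extract_urls` helper are illustrative assumptions, not the code that was actually used. It deliberately stops at quotes, angle brackets and parentheses, and strips trailing punctuation, which means it will mangle legitimate URLs that contain parentheses.

```python
import re

# Assumption: only absolute http(s) URLs are of interest.
URL_PATTERN = re.compile(r"https?://[^\s<>\"'()]+")
TRAILING_PUNCTUATION = ".,;:!?"


def extract_urls(text):
    """Return the sorted, de-duplicated URLs found in a blob of text."""
    urls = set()
    for match in URL_PATTERN.findall(text):
        # A sentence-ending period or comma is almost never part of the URL.
        urls.add(match.rstrip(TRAILING_PUNCTUATION))
    return sorted(urls)
```

A stricter alternative would be to parse the HTML and read only the `href` attributes, at the cost of missing URLs written as plain text.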
The results were not much of a surprise, considering the impression I had from browsing the blog. Of the 1200 requests, 45 returned a 404 error. That is 3.75% of the links, a small share; considering the Web’s mobility, that is not bad. However, 48 other requests, another 4% of the links, returned various other error codes. Together, they add up to 7.75% of the links, which is a good part of them.
Checking the results, I also found that some sites serve customized 404 error pages with incorrect HTTP status codes. Some even return a 200 status code when the page is nothing more than a notice that the content doesn’t exist anymore. The main culprits are online newspapers and magazines.
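Pages like these cannot be caught by the status code alone. One crude way to flag them is to look at the body of a 200 response for telltale wording. The sketch below is only a heuristic: the phrase list, the 2000-character window and the function name `looks_like_soft_404` are all assumptions, and it will produce both false positives and false negatives.

```python
# Assumption: English-language phrases; a real list would need tuning per site.
NOT_FOUND_PHRASES = (
    "page not found",
    "no longer available",
    "does not exist",
)


def looks_like_soft_404(status, body, phrases=NOT_FOUND_PHRASES):
    """Guess whether a 200 response is really an error page in disguise."""
    if status != 200:
        return False  # a real error status needs no guessing
    # Only the start of the page is inspected, where titles and headings live.
    head = body[:2000].lower()
    return any(phrase in head for phrase in phrases)
```

Using it means fetching the page with GET rather than HEAD, since HEAD responses carry no body.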
The full (and completely unscientific) results were:
- 43 unreachable domains (3.58%)
  - Some of those errors are likely temporary conditions, although I checked a few and they really are expired domains (sites of political candidates, for instance).
- 1064 successful requests (88.67%)
  - Pages that were served correctly, plus some that had been moved elsewhere. As mentioned before, some of those results are incorrect, which means the real number of errors is higher.
- 93 unsuccessful requests (7.75%)
  - Those include pages no longer found on the servers, server errors, and pages that are now protected.
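The percentages in the list above are simple proportions of the 1200 requests. As a sketch of how such a tally could be produced from a list of status codes, assuming `None` stands for an unreachable domain and that anything from 200 to 399 counts as a success (the `classify` and `summarize` names are made up for this example):

```python
from collections import Counter


def classify(status):
    """Map one status code to one of the three buckets used in the list."""
    if status is None:
        return "unreachable"
    if 200 <= status < 400:
        return "successful"
    return "unsuccessful"


def summarize(statuses):
    """Return {bucket: (count, percentage of all requests)}."""
    counts = Counter(classify(status) for status in statuses)
    total = sum(counts.values())
    return {
        label: (count, round(100 * count / total, 2))
        for label, count in counts.items()
    }
```

Feeding it 43 unreachable results, 1064 successes, 45 responses with status 404 and 48 other errors gives back the three figures listed above.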
Considering the period over which the links were posted (one year and two months), a little over 10% of the pages I linked to in my blog are now returning errors or are inaccessible. Of every ten links in my blog, roughly one is going to result in an error. I think that’s too much. Obviously, it’s not reasonable to expect that everybody will preserve the URL spaces they created. Sometimes that’s not possible, or even desirable. On the other hand, sites that could be taking a lot more care with their data are simply allowing their links to break. Links are made with the expectation that they will point to permanent resources, yet the resources simply disappear after some time.
Anyway, it was an interesting experiment. As there is no way to prevent the problem, there is no measure to be taken. The only thing that remains is the feeling that all the links in this site are slowly being absorbed by a great black hole in the center of the Web.