In 1998 Tim Berners-Lee penned the now-historic essay “Cool URIs don’t change”, a treatise on keeping URIs (and URLs) working so content remains accessible and doesn’t just drop off the face of the CyberEarth because you wanted to reorganise your website. The essay resonates with me and feels more applicable now than ever as we enter an age of link rot, with companies deleting swathes of valuable information because they don’t care about the resources they acquire or just want to save a buck. A recent project of mine, described at the end of this post, sets up ArchiveBox to take backups of all the outward-bound links from my blog and test them regularly, so I know if they become inaccessible and retain a copy of what was once there.
This post is organised into the following sections:
- Link Rot, A Fecundity Of Absence - A brief overview of what link rot is, some examples, and why it is a growing problem.
- Way To Go, Wayback Machine - Web archiving through the Internet Archive and the self-hosted ArchiveBox.
- Archive Of My Own (Blog) - The technical side of what I built to test my blog links and be aware of link rot.
Link Rot, A Fecundity Of Absence
Link rot is the process by which content is no longer accessible by a URL that was previously a gateway to that content.
As in Tim’s essay, a common cause of link rot is people reorganising their CMS and choosing new URLs for content to match a new organisational strategy. In some cases this leads to a 404 Not Found error, but in arguably more insidious cases the old link keeps working yet is no longer updated - dangerous for pages that previously supplied severe weather notifications or health advice. But a problem we face more and more is content that is simply deleted entirely: articles, reviews, news pieces, instructions, and lesson plans disappearing off the cyberface of the world regardless of whether they might still be relevant.
In some cases it’s understandable to delete data. User privacy and the right to be forgotten mean someone might want to delete what they’ve posted, and that should be respected. In other cases information may have been dangerously wrong and deleting it was the responsible thing to do. But even in cases like the latter, an argument can be made for redirecting the original URLs to correct information (via a 301/302 redirect), or for keeping the page up with a prominent notice explaining what was incorrect and linking to updated or corrected data.
But the biggest plague we face right now is the mass deletion of content because the companies that own it no longer want to maintain it. In August 2024 GameStop shut down Game Informer, a website and magazine dedicated to gaming that had been publishing reviews online since 1996. All the staff were let go, publication of the physical edition was cancelled, and the entire website was taken down and replaced with the landing page linked above, all with no prior warning. 28 years’ worth of posts, many about games and IP that no longer exist or are unavailable, tranches of reviews and slices of history, all completely removed because GameStop no longer saw it as profitable enough to even keep up as an archive. All we have left is scraps archived by services like the Wayback Machine and a beautiful eulogy on Aftermath.
Running a website isn’t free; I understand that as someone who’s been doing it for a job for over a decade now. But a read-only, static version of Game Informer would have been a rounding error in GameStop’s IT budget. Instead the company decided to burn a repository of knowledge to the ground - another in a long line of companies burning the internet, much like what happened to the Library of Alexandria: a process of mismanagement and neglect rather than a single conflagration.
Last year we watched CNET delete thousands of old articles because they wanted to “improve their SEO (search engine optimisation)”, despite the actual advice from search engine giants being “don’t do that”. Paramount shut down MTV News with no warning and removed the entire archive; while they say the content is “archived”, they control that archive and provide no public access to it - another form of link rot taking down collective knowledge. Beyond the direct destruction of content there’s also the loss of access caused by link shorteners dying. The most recent example is Google, a company that made twenty billion dollars ($20,000,000,000) in profit last quarter but has decided to shut down its goo.gl link-shortening service. There is a massive difference between “stopping new links being shortened” and “burning the whole lot”, but apparently keeping the historical links alive, links found all over the world in tweets and blogs and press releases and government notices, is just too much of a cost and so it must be Killed By Google. An arguably worse crime was pulled by The Guardian, who, after tiring of running a link shortener for their news, sold the URL-shortening domain off to whoever might want to host malware on those old links, links found all over the web and even out in print.
On a positive note though, props to companies like AnandTech, who stopped publishing new articles but keep their old content up for future readers to enjoy.
As a side note, the MTV News shutdown also shows another insidious side of link rot: blocking off content that was previously publicly accessible and forcing it behind accounts or subscriptions. We see this massive “fuck you” to the public from platforms like Medium, which decide that content that was once public-facing now requires an account sign-up or subscription to read, and from creators who retroactively put paid subscriptions on their old content.
What can stop this festering canker of nothing from growing and taking more of the web from us?
Way To Go, Wayback Machine
The Internet Archive is one of the most beautiful institutions online for preservation. Books, movies, software, audio recordings, collections of pamphlets and posters - a vibrant community of people who submit material regularly that is out of circulation or no longer available so it can be enjoyed by future generations. They also run the aforementioned Wayback Machine, a platform that hosts snapshots of web pages submitted to its index or crawled steadily by its spider. If you’ve ever found a URL that’s now a 404, throw it in their search and see if it’s been captured before so you can see what existed before the rot set in.
Beyond a way to recover content lost to time and decay, it’s also an enlightening way to see what popular websites looked like throughout history. What did Apple’s website look like in 1996? Or in 2001 when the new PowerBook G4 launched, versus 2014 when the iPhone 6 arrived?
The Wayback Machine is not the only web archiving platform, but it is probably the biggest and most extensive. Other options include archive.is and CachedView, which let us capture snapshots of websites for posterity or show the versions last captured in search engine caches. But the one that caught my eye recently was ArchiveBox, a self-hosted solution for capturing web pages. Beyond archiving links from my blog, it would also help with my wife’s and my work by letting us keep private captures of sites we’ve built, so we can do before/after comparisons or keep work in our portfolios even if the original has since disappeared.
And so I configured an ArchiveBox instance on a server and thought about how I could connect it to my blog so that I’d always have an archive of any links I add to it.
Archive Of My Own (Blog)
The project was broken down into two parts:
- Pull a list of all the links from my blog whenever I update it.
- Test a small sample of those links every 2 hours to see if they’re still up.
Pull Links
As discussed previously, my blog is a Hugo static site hosted on S3+CloudFront and deployed from GitLab, so every time I update it the whole site is regenerated. I’ve added a new CI step to my build process that runs a small Python script (sketched after the list below) that does the following:
- Walks all the built HTML files in my blog directories (so not tags, RSS, etc.).
- For each HTML file, uses `lxml.etree.HTMLParser` and XPath to find all the links.
- If a link doesn’t point to my blog, a site I manage, or the Wayback Machine, writes it to a file.
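To make that concrete, here’s a minimal sketch of what that extraction step might look like. The build directory, output file name, and ignored hosts are placeholders for illustration, not the exact values my script uses.

```python
# Sketch of the link-extraction step (paths and domains are placeholders).
from pathlib import Path
from urllib.parse import urlparse

from lxml import etree

BUILD_DIR = Path("public/posts")  # Hugo output for blog posts only, not tags/RSS
IGNORED_HOSTS = {"example.com", "web.archive.org"}  # my blog, my sites, Wayback Machine

def extract_links(build_dir: Path) -> set[str]:
    parser = etree.HTMLParser()
    links: set[str] = set()
    for html_file in build_dir.rglob("*.html"):
        tree = etree.parse(str(html_file), parser)
        for href in tree.xpath("//a/@href"):
            host = urlparse(href).netloc
            # Relative links have no host, so internal links get skipped here too
            if host and host not in IGNORED_HOSTS:
                links.add(href)
    return links

if __name__ == "__main__":
    Path("links.txt").write_text("\n".join(sorted(extract_links(BUILD_DIR))) + "\n")
```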
This CI step runs after every push to main, covering any new blog posts or edits. Since it pulls all the links every time, any links I’ve removed or changed also drop out of the overall list.
Test Links
This acts as a scheduled pipeline in GitLab CI, one that runs every two hours and handles two functions:
- Any new links added to the blog that haven’t been seen before are sent to ArchiveBox to be captured (a rough sketch of this step follows the list).
- A random selection of 5 links is checked using the `requests` library with a User-Agent that mimics my usual web browser.
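For the first of those functions, a minimal sketch is below. It assumes the CI job can shell out to the `archivebox` CLI with access to the instance’s data directory (in practice you might submit URLs to a remote instance instead), and the file names are placeholders.

```python
# Sketch: submit links that haven't been archived yet to ArchiveBox.
# Assumes the `archivebox` CLI is available and run from (or pointed at)
# the ArchiveBox data directory; file names are placeholders.
import subprocess
from pathlib import Path

ALL_LINKS = Path("links.txt")        # produced by the build-time extraction step
SEEN_LINKS = Path("seen_links.txt")  # links already submitted to ArchiveBox

def archive_new_links() -> None:
    seen = set(SEEN_LINKS.read_text().splitlines()) if SEEN_LINKS.exists() else set()
    current = set(ALL_LINKS.read_text().splitlines())
    for url in sorted(current - seen):
        # `archivebox add <url>` queues a snapshot of the page
        subprocess.run(["archivebox", "add", url], check=True)
    SEEN_LINKS.write_text("\n".join(sorted(current)) + "\n")

if __name__ == "__main__":
    archive_new_links()
```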
I originally ran it as a weekly test of all links, but the script would take at least an hour to run as I’d have to wait at least ten (10) seconds between each check to ensure I wasn’t flooding my connection or accidentally hitting the same server too many times and getting 429 Too Many Requests errors. So now I create a queue of every link and then pop five (5) elements off each run to test. Any errors are logged and cause the CI stage to fail, meaning I get a handy email from GitLab to check said log. If the queue is empty, it will take the list of all links and shuffle it into a new queue.
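The checking half looks roughly like this - a sketch rather than my exact script. The queue lives in a plain text file between runs, and the User-Agent string is just an illustrative example.

```python
# Sketch: pop five links off a persistent queue, check them, and fail the
# CI job if any look broken. File names and the User-Agent are placeholders.
import random
import sys
import time
from pathlib import Path

import requests

ALL_LINKS = Path("links.txt")
QUEUE = Path("queue.txt")
IGNORE = set(Path("ignore.txt").read_text().splitlines()) if Path("ignore.txt").exists() else set()
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"}  # mimic a normal browser

def load_queue() -> list[str]:
    queue = QUEUE.read_text().splitlines() if QUEUE.exists() else []
    if not queue:
        # Queue exhausted: shuffle every known link into a fresh queue
        queue = ALL_LINKS.read_text().splitlines()
        random.shuffle(queue)
    return queue

def check_links(batch_size: int = 5, pause: int = 10) -> list[str]:
    queue = load_queue()
    batch, rest = queue[:batch_size], queue[batch_size:]
    failures = []
    for url in batch:
        if url in IGNORE:
            continue  # known false positives (bot detection, site quirks)
        try:
            resp = requests.get(url, headers=HEADERS, timeout=30)
            if resp.status_code >= 400:
                failures.append(f"{url} -> HTTP {resp.status_code}")
        except requests.RequestException as exc:
            failures.append(f"{url} -> {exc}")
        time.sleep(pause)  # be polite; avoid hammering any one server
    QUEUE.write_text("\n".join(rest))
    return failures

if __name__ == "__main__":
    problems = check_links()
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # non-zero exit fails the GitLab stage, which triggers the email
```

Exiting non-zero is what makes the CI stage fail and GitLab send me that handy email.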
I also keep a list of links that are safe to ignore because they will always fail due to bot detection or a quirk of the site.
Findings
I’ve only been running the scheduled task for about a fortnight so far, and the instances of broken links are already slowing to a trickle. Whenever it does find one, I pull a version from the Internet Archive or, if it doesn’t exist there, try to find an alternative source for the information. So far I’ve had to do this for 25-odd links from across the decade this blog has been running. In future, if more content drops off, I’ll be able to link to either the Internet Archive or the copy I took myself with ArchiveBox.
It’s also crystallised my thoughts on the duty and social responsibility of webmasters to keep content up, as well as made me think more about the current state of the internet and where it’s heading. Keeping content statically online should cost next to nothing, but aggressive crawlers from companies like Anthropic and OpenAI violate web standards such as robots.txt and cause pricey headaches for organisations hosting valuable information online.
But I guess the ultimate message from all this is: make sure to archive the sites, games, and other things you love so you have a copy of them if they ever disappear and can let other people experience the joy or knowledge that you treasured too.