AI boom is making web archiving more expensive

The AI boom is making web archiving more expensive, with the Internet Archive and the Wikimedia Foundation both feeling the squeeze as demand from the AI industry pushes up prices for large hard drives, NAND memory, and even the servers needed to store the internet’s leftovers.

According to 404 Media, some high-capacity HDDs have jumped to as much as triple their usual price. That is bad timing for projects whose entire job is to hoard data for the long haul, especially now that websites are also tightening access against bots of every kind. The irony is painful: the same internet being scraped for AI training is becoming harder to preserve for everyone else.

Internet Archive is running into 28-30 TB shortages

The Internet Archive is the clearest pressure point here. It already stores about 210 petabytes of data and adds roughly 100 terabytes a day, so even small changes in storage prices quickly turn into real money. Founder Brewster Kahle says finding suitable 28-30 TB drives has become a serious problem because they are either unavailable or far more expensive than expected.
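To give a rough sense of scale, the growth figure can be turned into a drive count. The sketch below is a back-of-the-envelope estimate only: the daily growth and drive sizes come from the figures above, while the baseline drive price is a made-up placeholder and the calculation ignores redundancy, replacement of failed drives, and everything else a real storage budget would include.

```python
import math

# Figures reported in the article
DAILY_GROWTH_TB = 100        # roughly 100 TB added per day
PRICE_MULTIPLIER = 3.0       # "as much as triple their usual price"

# Hypothetical value for illustration only; the article gives no prices
BASELINE_PRICE_USD = 500.0


def drives_per_year(daily_tb: float, drive_tb: float, copies: int = 1) -> int:
    """Drives needed to absorb one year of growth, counting raw capacity only."""
    return math.ceil(daily_tb * 365 * copies / drive_tb)


def annual_drive_cost(drive_count: int, unit_price: float) -> float:
    """Total spend for a given number of drives at a given unit price."""
    return drive_count * unit_price


if __name__ == "__main__":
    for capacity_tb in (28, 30):
        count = drives_per_year(DAILY_GROWTH_TB, capacity_tb)
        normal = annual_drive_cost(count, BASELINE_PRICE_USD)
        tripled = annual_drive_cost(count, BASELINE_PRICE_USD * PRICE_MULTIPLIER)
        print(f"{capacity_tb} TB drives: {count} per year, "
              f"${normal:,.0f} at the assumed price, ${tripled:,.0f} if tripled")
```

Even with a single copy of the data, that works out to more than 1,200 of the largest drives a year, so any change in the per-drive price is multiplied across the whole order.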

That is the kind of supply problem that rarely makes headlines until it starts warping public infrastructure. Archiving is not a flashy business, which means it has little leverage when component makers and cloud buyers start fighting over the same hardware.

Wikipedia’s backend costs are rising too

The Wikimedia Foundation is feeling the pain from another angle. It says the price pressure is hitting not just storage drives, but also server supply and the ability to plan future purchases with any confidence. For a nonprofit that depends on predictable infrastructure spending, that is a nasty combination.

Key figures:
The Internet Archive stores about 210 petabytes of data.
It adds roughly 100 terabytes a day.
High-capacity HDDs have reportedly risen to as much as triple their usual price.

Bot blocking is making preservation harder

The hardware bill is only half the story. More and more sites are blocking bots because their owners fear automated scraping for AI training, and those blocks are catching ordinary archive crawlers too. So the projects trying to preserve digital history are being squeezed from both sides: higher storage costs and less access to the pages they are trying to save.
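The article does not say which blocking mechanisms sites are using, but robots.txt is one common way to turn crawlers away, and it shows how a blanket rule sweeps up archive crawlers along with AI scrapers. The sketch below uses Python's standard `urllib.robotparser`; the crawler names `ExampleAIBot` and `ExampleArchiveBot` and the URL are hypothetical placeholders, not real user agents.

```python
from urllib.robotparser import RobotFileParser

# A blanket rule: every crawler is turned away, whatever its purpose.
BLANKET_RULES = """\
User-agent: *
Disallow: /
"""

# A targeted rule: only the (hypothetical) AI crawler is turned away.
TARGETED_RULES = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(rules: str, user_agent: str,
               url: str = "https://example.org/page.html") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for label, rules in (("blanket", BLANKET_RULES), ("targeted", TARGETED_RULES)):
        for agent in ("ExampleAIBot", "ExampleArchiveBot"):
            print(f"{label:8} {agent:18} allowed={is_allowed(rules, agent)}")
```

Under the blanket rule both crawlers are refused; under the targeted rule only the AI crawler is. A site has to name the bots it objects to if it wants archive crawlers to keep working.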

If this keeps spreading, archiving could become a slower, narrower version of itself, with gaps where yesterday’s web should have been. The likely winners are the storage vendors and the AI companies buying at scale; the losers are everyone who assumes the internet keeps a memory for free.

Source: Ixbt
