Reddit Moves to Restrict The Internet Archive from Accessing its Communities

A notable side effect of the new wave of online data protectionism, itself a response to AI tools scraping whatever data they can, is its impact on data access more broadly, and on our capacity to research historical material across the web.

Today, Reddit announced that it will start blocking bots from The Internet Archive’s “Wayback Machine,” due to concerns that AI projects have been accessing Reddit content through the archive, which is also a crucial reference point for many journalists and researchers online.

The Internet Archive is dedicated to keeping accurate records of all the content shared online, or as much of it as it can, which serves a valuable purpose in sourcing and cross-checking reference data. The not-for-profit project currently maintains records of some 866 billion web pages, and with 38% of the web pages that existed in 2013 no longer accessible, it plays a critical role in preserving our digital history.
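As a point of reference, the Wayback Machine offers a public “availability” API that journalists and researchers use to look up the closest archived snapshot of a URL. The Python sketch below shows that lookup; the endpoint is the Archive’s documented one, but the helper name and the minimal error handling are illustrative.

```python
# Minimal sketch: querying the Wayback Machine's public availability API
# for the closest archived snapshot of a URL. Standard library only.
import json
import urllib.parse
import urllib.request

def closest_snapshot(url: str) -> str | None:
    """Return the Wayback Machine URL of the closest snapshot, or None."""
    query = urllib.parse.urlencode({"url": url})
    endpoint = f"https://archive.org/wayback/available?{query}"
    with urllib.request.urlopen(endpoint) as resp:
        data = json.load(resp)
    # The API returns {"archived_snapshots": {"closest": {...}}} when a
    # snapshot exists, and an empty "archived_snapshots" object otherwise.
    snapshot = data.get("archived_snapshots", {}).get("closest")
    if snapshot and snapshot.get("available"):
        return snapshot["url"]
    return None

# Example: find the most recent archived copy of Reddit's homepage.
print(closest_snapshot("reddit.com"))
```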

And while the Internet Archive has faced various challenges in the past, this latest one could be a significant blow, as protecting data becomes a bigger consideration for online sources.

Reddit has already put a range of measures in place to control data access, including the overhaul of its API pricing back in 2023.

And now, it’s taking aim at other sources of data access.

As Reddit explained to The Verge:

“Internet Archive provides a service to the open web, but we’ve been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine.”

As a result, the Wayback Machine will no longer be able to crawl the details of Reddit’s various communities; it’ll only be able to index the Reddit.com homepage. That will significantly limit the archive’s coverage of the platform, and Reddit may be the first of many to implement tougher access restrictions.
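Reddit hasn’t spelled out the technical mechanism, but restrictions like this are typically enforced through a site’s robots.txt file, which tells well-behaved crawlers what they may fetch. Here’s a minimal Python sketch of how such a policy plays out, assuming a hypothetical rule set and the “archive.org_bot” user-agent string; neither is confirmed as Reddit’s actual configuration.

```python
# Sketch: a robots.txt policy that leaves the homepage crawlable while
# blocking community pages (subreddits live under /r/ on Reddit).
# The rules and user-agent string below are illustrative assumptions.
import urllib.robotparser

HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /r/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(HYPOTHETICAL_ROBOTS_TXT.splitlines())

# The homepage remains fetchable; community pages are disallowed.
print(parser.can_fetch("archive.org_bot", "https://www.reddit.com/"))           # True
print(parser.can_fetch("archive.org_bot", "https://www.reddit.com/r/science/")) # False
```

Note that robots.txt is advisory: it only restrains crawlers that choose to honor it, which is part of why platforms pair it with rate limiting and legal terms.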

Of course, some of the major social platforms have already locked down their user data as much as they can, in order to stop third-party tools from harvesting their insights and using them for other purposes.

LinkedIn, for example, recently won a court victory against a business that had been scraping user data and using it to power its own HR platform. Both LinkedIn and Meta have pursued several providers on this front, and those battles are establishing more definitive legal precedent against scraping and unauthorized access.

But the challenge remains with publicly posted content, and the legal questions around who owns material that’s freely available online.

The Internet Archive, and other projects like it, are free to access by design, and the fact that they scrape whatever pages and information they can does pose a level of risk in terms of data access. And if providers want to keep hold of their data, and control how it’s used, it makes sense that they would implement measures to shut down such access.

But it will also mean less transparency, less insight, and fewer historical reference points for researchers. And with more and more of our interactions happening online, that could be a significant loss over time.

But data is the new oil, and as more and more AI projects emerge, the value of proprietary data is only going to increase.

Market pressures look set to dictate this element, which could restrict researchers in their efforts to understand key shifts.


