In the short story “The Library of Babel” by Jorge Luis Borges, the narrator describes a library of seemingly infinite size, made up of hexagonal galleries full of bookshelves. The books contain every possible configuration of 22 letters and three other characters: the comma, the period, and the space. The narrator describes how he has spent a lifetime searching through them, hoping to find coherence and coming away with only a few meaningful lines.
“The Library of Babel” is a horror story. On the shelves are every imaginable truth, the solution to every problem, the answer to every question, but also falsifications of those truths, untruths that are impossible to tell from the truths, and an almost infinite supply of sheer nonsense. That the Internet resembles the Library of Babel is not a new observation. For decades, people have been posting whatever they want to share online, whether profound truths, untruths, or plain incoherent junk.
The Internet junk problem is getting much, much worse. In the late 1990s, the rise of the Google search engine revolutionized the way people navigated the global Internet. Searching the Internet requires a different strategy than searching through books. Before Google, most digital search engines relied on a simple heuristic: find web pages where the user’s search terms appeared with high frequency. Want to find bike reviews? Look for a document that uses the phrase “bike reviews” a lot! But that approach breaks down in a world where anyone can publish and where there are economic incentives to capture people’s attention. These early search engines were vulnerable to anyone who posted a page that just repeated “bike reviews” thousands of times.
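To make that weakness concrete, here is a minimal Python sketch of the term-frequency heuristic, using a made-up two-page corpus. Nothing here reflects any real engine’s code; it just shows why keyword stuffing wins under naive frequency ranking.

```python
# A minimal sketch of naive term-frequency ranking. The toy corpus and
# function names are illustrative, not any real search engine's code.

def term_frequency_score(page_text: str, query: str) -> int:
    """Count how often the query's terms appear in a page."""
    words = page_text.lower().split()
    return sum(words.count(term) for term in query.lower().split())

pages = {
    "honest-review.example": "we tested ten bikes and wrote up our honest bike reviews with photos",
    "spam.example": "bike reviews " * 1000,  # keyword stuffing
}

query = "bike reviews"
ranking = sorted(pages, key=lambda p: term_frequency_score(pages[p], query), reverse=True)
print(ranking)  # the keyword-stuffed page wins, which is exactly the problem
```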
Google refined the process by adding the idea of “authority”. For a page to appear at the top of search results, many other pages must link to it. The theory behind the PageRank algorithm, developed by Larry Page and Sergey Brin, was that authoritative pages attract many organic links across the web, while very few people would choose to link to a page that repeats a keyword tens of thousands of times.
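For readers who want to see the idea in code, here is a stripped-down power-iteration sketch of PageRank. The toy link graph is invented for illustration, and Google’s ranking has long since grown far beyond this, but the core intuition (pages pass their score along to the pages they link to) survives.

```python
# A stripped-down sketch of the PageRank idea: a page's score depends on
# the scores of the pages linking to it. The damping factor 0.85 is the
# value from Page and Brin's paper; the link graph is invented.

def pagerank(links: dict[str, list[str]], damping: float = 0.85, iters: int = 50) -> dict[str, float]:
    """Iteratively redistribute rank along links (power iteration)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / max(len(outlinks), 1)
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# "hub" is linked to by every other page; "stuffed" repeats keywords
# on-page but earns no inbound links, so it ends up ranked last.
graph = {
    "hub": ["a"],
    "a": ["hub"],
    "b": ["hub"],
    "stuffed": ["hub"],
}
print(pagerank(graph))  # "hub" gets the highest score
```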
When Google first appeared, it was a revelation. The search results were dramatically better. But it wasn’t long before “link farmers” (who often call themselves “experts in search engine optimization”) figured out how to fool Google too: by creating farms of pages that link to one another. Bobsbikereviews.com might now have 10,000 pages linking to it, with 10,000 pages linking to each of them, and so on. Google has since developed methods to counter this form of search engine optimization. But the task keeps getting harder, and recent developments in artificial intelligence have made it harder still.
ChatGPT, a system that generates text that is difficult to distinguish from human-written text, creates a perfect storm for search engines. For years, people have tried to manipulate Google by posting handcrafted spam en masse. Most of it is repetitive and easily filtered out by Google and its competitors. But it has now become much easier to generate masses of plausible-looking content and put it online to draw people to pages littered with ads or misleading offers. The search engine giants are already working on this problem, looking for signatures that a page was auto-generated and then penalizing it. The likely result is an escalating war between AI-generated sites and the algorithms designed to help search engines separate real human knowledge from artificial junk.
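What might such a “signature” look like? The search giants don’t publish their detectors, so the following is purely a guess at the crudest possible signal: mass-produced spam often reuses the same phrases within a page, which a repeated-n-gram check can flag. Modern AI text is far harder to catch, so treat this as an illustration of the arms race’s starting point, not its current state.

```python
# A crude, easily defeated heuristic of the kind that catches old-style
# mass-produced spam: measure how much of a page is repeated phrasing.
# Purely illustrative; real detection systems are more sophisticated.

from collections import Counter

def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of n-grams that occur more than once in the text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# A page that endlessly recycles the same phrase scores near 1.0.
print(repeated_ngram_ratio("the best bike reviews the best bike reviews the best bike reviews"))
```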
Unfortunately, even if Google can learn to distinguish between real and fake, people still have problems. Do you remember the Internet Research Agency (IRA)? It packed a building in St. Petersburg with people tasked with creating social media posts that promoted Putin’s agenda and escalated political tensions in the US. The IRA claimed as one of its successes the creation of two rival groups in Texas: one a right-wing populist group that pushed for state secession and campaigned for gun rights, the other a faith group, United Muslims of America, that campaigned for Hillary Clinton. In a remarkable feat of puppetry, these two Facebook groups, both controlled by the Russians, managed to get dozens of real Houstonians out on the streets to protest each other.
Running the IRA required paying hundreds of tech-savvy, English-speaking Russians to maintain online personas and write multiple posts a day in their voices. That process can now be fully automated. We should expect social media platforms like Facebook and Twitter to fill up with auto-generated propaganda promoting the viewpoints of controversial political figures.
Unfortunately, it’s difficult for people to navigate a landscape in which a tremendous amount of the content they’re exposed to seems to favor one point of view. When you are bombarded with posts claiming that the invasion of Ukraine is legitimate, you naturally begin to wonder whether your support for Kyiv is misinformed or ill-considered. Are these seemingly ordinary Russians and seemingly Putin-friendly Europeans right?
Keeping these new junk accounts in check will be a major challenge, and unfortunately, platforms have all the wrong incentives when it comes to fighting the problem. Elon Musk, dealing with the fallout from his mismanagement of Twitter, may welcome the arrival of bots posting controversial and highly engaging content, so long as his advertisers don’t complain that they’re wasting money showing ads to ChatGPT-powered bots.
How do we react when content is created not for our benefit, but to fool search engines or to push extreme points of view? I recently got a preview of one possible answer from a system called Otherweb, developed by AI programmer Alex Fink. Otherweb tries to sort through the day’s news and filter out “anti-news”: content created by professional news organizations that has no real news value. His favorite example is a headline from a credible source that reads, “Stop what you’re doing and watch this elephant play with bubbles.” This type of content is made to grab attention; it provides no useful information about the world, however diverting it may be for a while.
Anti-news is Fink’s bête noire, and he has put a lot of thought into building a news stream free of clickbait and other forms of anti-news. Every day I now receive a newsletter from Otherweb that has distilled thousands of news articles down to nine, chosen for their apparent neutrality and newsworthiness. The system works very well: in a few moments I get a quick overview of the day’s headlines, without anything trying to grab and divert my attention.
There is an irony in seeking AI’s help to find our way through a landscape of junk created by competing AI systems. We might have avoided this problem if OpenAI, the developer of ChatGPT, had been more responsible about releasing its tool to the public. It seems likely that in the very near future, users will be able to put ChatGPT or similar tools to work creating endless streams of junk for search engine optimization or propaganda generation. We can hope to see rapid innovation in tools that help us fight back.
We could also benefit from rethinking the incentives that drive the current Internet. Spam is a natural feature of an ad-supported Internet locked in constant competition for user attention. If we moved to something closer to a subscription model, material would have to be of high enough quality that users were willing to pay for it. And if systems like Reddit didn’t reward users simply for creating content that people happen to engage with, there would be less incentive to inflate post counts with junk.
Perhaps there is a way to create incentives that reward high-quality engagement and severely penalize posting AI-generated junk. But for now, it seems likely that this battle for our attention will lead us further into surreal, Borgesian territory, as we navigate an endless series of hexagonal galleries online, armed with tools to help us find those increasingly rare nuggets of genuine human insight.