search engine manipulation

How page ranking algorithms promised to save information, and ended up killing it  /  April 24, 2024


Chances are, if you're reading this, you didn't get here from a search engine. After all, how would you? I'm basically violating the whole dogma of search engine optimization: my paragraphs are wordy, my images sparse, my keywords cannibalized (more on that later), and I'm not planning on giving you a comprehensive guide to solving world hunger while simultaneously making $50,000 in passive income per month from a textile sweatshop in Bangladesh. No, I'm just sharing some thoughts.

It's no secret that search engines don't distribute information fairly. Algorithms like Google's PageRank impose deep epistemological restrictions on users and developers alike, with artificial standards of "helpfulness" dictating which results are buried five pages deep and which are enshrined as an indisputable authority. While this has been well-documented with regard to discrimination and "black-hat" SEO, its invisible effects on the norms of communication have been equally destructive. Supposedly neutral or even benign expectations—writing comprehensive articles, shortening paragraphs, improving on competitors' content to generate backlinks—are in truth a manifestation of the modern attention economy, and have been massively destructive to the Internet's role as an information ecosystem.


how are search results ranked?

We often treat search results as mathematical and objective. Google's PageRank algorithm, after all, is entirely driven by bots (or at least, so Google claims). With an amoral technology dictating results, the natural conclusion is that the selection, too, will be amoral. Only the best results will rise to the top, while scams and misinformation fall into obscurity. The domination of a certain result in the rankings is confirmation beyond any doubt that it is relevant, high-quality, and honest.

At least, that's what we all wish was the case.

Ever since its infamous meter was hidden from the public in 2013, PageRank's name has fallen into obscurity. The feature, once a staple on the Google toolbar, gave users a visual of a site's alleged reputability, and developers a measure of how likely they were to appear on that shining pedestal: the front page of search results. The system, as outlined in this MOZ article, was fairly simple: the more backlinks pointing to a site, the higher its rank. A page's "passable" PageRank was a function of its base rank, divvied evenly among all the sites it linked to. A backlink from the New York Times or Wikipedia, then, was much more valuable than one from your grandmother's gardening blog. This not only made sense, but seemed like a very practical solution: the users, and the users alone, would naturally and democratically decide the hierarchy of the Web.
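To make those mechanics concrete, here's a rough sketch of the classic scheme in code. The 0.85 damping factor is the commonly cited default, and the link graph is entirely my own invention rather than anything pulled from Google:

```python
# A minimal sketch of the classic PageRank idea: each page splits its rank
# evenly among the pages it links to, so a backlink from a well-linked hub
# passes on far more rank than one from an obscure blog.
# The damping factor and the toy graph below are illustrative only.

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {page: 1 / len(pages) for page in pages}

    for _ in range(iterations):
        # every page keeps a small baseline rank...
        new_rank = {page: (1 - damping) / len(pages) for page in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            share = rank[page] / len(outlinks)  # rank divvied evenly among outlinks
            for target in outlinks:
                if target in new_rank:
                    new_rank[target] += damping * share  # ...plus whatever its backlinks pass on
        rank = new_rank

    return rank

# Hypothetical graph: a big newspaper, an encyclopedia, a gardening blog, and "my-site".
graph = {
    "bignews.example":         ["my-site.example"],
    "encyclopedia.example":    ["my-site.example", "bignews.example"],
    "grandmas-garden.example": ["my-site.example"],
    "my-site.example":         ["encyclopedia.example"],
}
print(pagerank(graph))  # the backlinks from the big sites outweigh the one from the gardening blog
```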

But, as in any other democratic process, stakeholders found a way to take matters into their own hands. What was meant to be a consensus vote became a ruthless battle to buy, extract, and even steal backlinks. You could still wait for natural traffic, of course, but doing so was akin to running for president without a campaign. If you had the financial means, you could buy a boost. And if you didn't, well, there was always the option to send a disingenuous email template to dozens of well-meaning bloggers, pleading with them to shoehorn a backlink or two into their next article.

As Google matured in the decade that followed, so too did PageRank. No longer were backlinks the sole feature determining relevance: now, we had esoteric algorithms that scanned the content itself during the scoring process. And so, backlink farms yielded to a new industry: search engine optimization. Instead of paying someone to link to your page, you can now pay someone to align your site with all the ill-documented minutiae of Google's algorithm. You can cram as many keywords, top-ten lists, and meta tags as possible into your website, since that, apparently, is the definition of high-quality content. You can scrape the Web for the most verbose tutorials on the most popular skills, amalgamating them into a mess of words that simultaneously explains everything and nothing at all. Or, since you have ChatGPT at your disposal now, you can have a computer do that part too.


the user's standpoint

For all the manipulation among developers and bloggers, perhaps the most significant factor undermining the "democratic" model of the web—and the most overlooked—is that most people don't make websites. The majority of internet users are simply here to browse: to make a quick search, perhaps read an article or two, and continue about their day. They have no power to create backlinks, and in fact, they probably wouldn't be motivated to even if given the chance.

In our modern economy, information is the most abundant commodity: we sift through vast swaths of it from anonymous suppliers, extracting details and discarding byproducts without a second thought. Google's search algorithm is our only anchor amidst this mess, mediating the boundary between the user and the world of knowledge at their fingertips. It presents itself as unbiased, and since it constitutes nearly our entire means of knowledge acquisition, we have no reason to believe otherwise.

This fixation on authority is the root of complacency with search rankings. If the system remains neutral either way, yielding the mathematically optimal projection of the internet for any imaginable query, what reason is there to cast a vote? Why kill the messenger?

Well, maybe if they tamper with the message.

This is why, with nothing more than the promise of a 25% decrease in typing time, Google's autocomplete has managed to mainstream harmful conspiracy theories and induce widespread polarization, giving a platform to right-wing extremists. It's why Google's "featured snippets" are grossly overestimated in their credibility and hold significant sway over public opinion. Knowledge is traded for efficiency, rich contexts collapsed into digestible singularities. Valuable results may be filtered out, yes, but from the user's perspective, those that bubble to the top are usually good enough.


the developer's standpoint

However, the view that these algorithms only impose direct constraints, and that the same information is still "out there somewhere," is myopic. The model of information as a commodity applies just as much to the sellers of that commodity—developers and bloggers—as it does to the buyers.

the obvious (spamdexing, etc.)

It's impossible to make a good service without people trying to exploit it, and that's exactly what happened to search. So-called spamdexing, or black-hat SEO, is exactly what it sounds like: a deliberate and malign attempt to manipulate page ranking algorithms in a manner inconsistent with their purpose. While this description could be applied (at least to an extent) to most SEO-optimized sites, some are certainly more blatant than others. You've likely seen Amazon listings that look more like Midjourney prompts (think "Kids Dancing Talking Cactus Toys for Baby Boys and Girls, Singing Mimicking Recording Repeating What You Say Sunny Cactus Electronic Light Up Plush Toy with 120 English Songs Smart Toy," the actual first result for the query "toy"). These are the spawn of keyword stuffing, a tactic that involves cramming search terms into title tags, meta descriptions, URLs, even invisible text. Everyone knows that this is bad: Google has banned it in its spam policies, and even SEO blogs advise against it.
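For a concrete sense of what that looks like under the hood, here's a toy sketch. Every keyword and tag below is invented for illustration; no real listing or site is being quoted:

```python
# A toy illustration of keyword stuffing: jam the same search terms into the
# title, the meta description, the URL slug, and even invisible on-page text.
# All strings here are invented for illustration.

keywords = ["dancing cactus", "talking cactus", "cactus toy", "singing toy",
            "plush toy", "smart toy", "baby boys", "baby girls"]

title = " ".join(k.title() for k in keywords)              # stuffed <title> tag
meta_description = ", ".join(keywords * 3)                 # the same terms, three times over
slug = "-".join(k.replace(" ", "-") for k in keywords)     # stuffed URL slug
hidden_text = '<span style="display:none">' + " ".join(keywords) + "</span>"  # invisible text

print(title)
print(slug)
```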

And yet, it persists. And it's successful.

So too are other, equally malicious tactics. Scraper sites automatically extract and amalgamate content from other webpages, and article spinners narrow the scope further still, picking a single article to plagiarize and replacing just enough words to evade Google's spam detection algorithm. Countermeasures were deployed, of course—publishers were penalized for linking to spam content, the nofollow tag was introduced, and link-selling networks like SearchKing were pruned by Google's legal team—but like a bacterium, spamdexing evolved to survive.

With all of their legal and financial dangers, though, how many people are really willing to pry at these loopholes? Well, enough to make autocomplete suggestions look like this:

[Image: a search query for "black hat seo" with autofill suggestions for "forum", "techniques", "tools", and "affiliate marketing"]

Perhaps, though, this isn't a symptom of widespread malignity—perhaps it's just a manifestation of the bias enabling it.

the insidious (trends in knowledge presentation)

We've seen the way developers manipulate page rankings, but the more nuanced question remains: how do page rankings manipulate developers? And how has this manipulation led to the degradation of knowledge as a whole?

To answer this question, we once again look to keywords—in this case, keyword cannibalization. According to this blog post from SEO authority Ahrefs, keyword cannibalization occurs when multiple pages on a site target the same or similar keywords, causing them to "cannibalize" traffic from one another. In many ways, cannibalization can be seen as the nemesis of stuffing, discouraging developers from shoehorning the same search terms into every page to drive themselves up the rankings. In fact, columnist Patrick Stox argues that it is an opportunity, not an issue: by consolidating shorter webpages into "ultimate guide" style content, you'll drive search traffic while making it easier for users to find what they're looking for.

But keyword cannibalization is an issue—just not in Stox's sense. It is an issue in the way it dilutes information, discouraging shared-purpose networks and forcing exclusive ownership over online spaces. With everyone aiming to produce the ultimate guide, the culture shifts away from a collaborative effort to build an accessible foundation for knowledge and toward a competition to own the most online real estate. Few people have the knowledge or authority to create an encyclopedic overview on their own, so some corners must be cut: assumptions are made, content plagiarized, opinions thrown about. All the while, the knowledge that is truly meaningful is drowned out in an ocean of noise, leaving readers with everything and absolutely nothing at the same time.

For academics, this trend may seem eerily familiar. Even with seemingly endless funding pouring into research and more publications churned out than ever before, breakthrough innovation is on the decline. Radu Vranceanu, a professor of economics at ESSEC, attributes this phenomenon to a lesser propensity for risk-taking and a prioritization of quantity over quality:

At the origin of this ever-increasing number of published papers, one can find the “more is better” vision of scientific progress, which led to the generalization of quantitative measures of scientific performance, and incentives to publish no matter what.

Just like developers and bloggers, academics are inclined to "own the space" rather than merely contribute to it. You must be the authority in your field, and to appear as such, you need to demonstrate knowledge of everything there is to know. You must be the first to publish new advancements, because if you aren't, then one of the countless researchers competing against you to solve the same problem will. You need the most influential citations, since citations, like backlinks, must be an objective measure of quality—even if you're just citing yourself. At the very least, since you know more about the field than anyone else, the research doesn't actually need to be revelatory—good numbers will suffice.

Some underlying factor is driving these trends, and it isn't just ego.

It's no secret that the attention economy has come to underpin nearly every aspect of 21st-century capitalism. We've seen the timeline give way to the algorithmic feed, long-form content to short-form video. If one thing is certain, it's that attention is integral to success—perhaps more than anything else. This is why it is essential for scholars to cast a wide net: more eyes on their work means more citations, and more citations means better research (at least by Google Scholar's metrics). It's why bloggers are encouraged to cram as much information into each article as possible, because ranking for more keywords in the search results means more clicks, more backlinks, better content. When a quantitative measure like backlink count—or analogously, citation count—is used to determine rankings, attention becomes the sole metric of reputability. Even more than truth, authority, or utility, your article needs appeal: it should have a flashy hook to attract readers (there's a reason why manuscript titles are becoming longer and more humorous), while retaining enough padding to appeal to algorithms. Of course, there isn't much room left for valuable information in this equilibrium, but why should there need to be?

Keyword cannibalization is one manifestation of this issue, but there are countless others. Take a look at the Skyscraper technique, a popular SEO tactic where you "find link-worthy content, create something better, and reach out to the right people." Or, in more blatant terms, you scrape and reword popular articles to steal backlinks. Countless well-meaning experts accept this as part of the SEO canon without a second thought, inadvertently creating a culture where the space is emphasized more than the information itself, and the goal is contorted from education to outperformance. These expectations have even crept into the syntactical structures we use to present information: lists, as well as short sentences and paragraphs, yield a higher content grade than long-form content. As a result, information is spread thin, simultaneously restricted and diluted until it can no longer hold any context whatsoever.


should computers rank content at all?

This was the one question that kept nagging at me as I wrote this article. It's clear that search is broken, but is there any way to fix it? Or do we need to turn to other means altogether? After a healthy dose of pessimistic deliberation, I decided on the latter.

The crux of the issue is that computers have no gauge of truth. They have heuristics for truth, yes—measures of engagement and support, conciseness and comprehensiveness—but they cannot distinguish the plausible from the impossible, the moral from the immoral, the valuable from the useless. They do not experience the world in the same way that humans do, and thus are incapable of measuring value the way we measure it. Our definitions of value and truth are founded on implicit, abstract maxims, which is why every attempt to concretize these maxims into computer code has involved some sort of shortcut—some imperfect heuristic with holes just large enough for bad actors to seep through.

With the advent of GPTs (and now, fantastically, GPT-powered search engines), promises to tackle the issue of information retrieval have been made and broken. In an effort to fix our human-designed value heuristics, we've allowed robots to step in and design them for us. The only result has been a veil—a more convincing one, perhaps, backed by billions of dollars and thousands of tons of carbon belched into the atmosphere—but a veil nonetheless. These algorithms still have no way to discern whether their output is valuable or even true, just ever-more-esoteric ways to make it seem plausible. For all we know, they could have just as many holes as Google did, just as many ways to be exploited and to exploit.

If we want value—true, world-changing value, not the contorted intuition of a machine—we need to turn to the ones who define it: humans.