The Staggering Volumes of Web Content Spam Detected by Google

Spam is a problem. Our inboxes have email spam, but there is web spam too. It damages the user experience because nobody wants to click through to a poor-quality website or land on a webpage designed to deceive users. If search engine results were as poor as they were when I first started on the web back in 1997, I’d rebel. That’s why it matters that Google does such a good job, dedicating incredible resources to fighting web spam. After all, it is the dominant search engine for good reason: the results are excellent.

2018: SpamBrain

Since 2018, when Google introduced SpamBrain, the AI system that powers many of its spam detection algorithms, the amount of web content spam on the internet has grown dramatically. It was already a burgeoning problem that needed tackling; it’s just that spammers are relentless, so Google has had to work hard to counter them.

Whilst working for a local digital marketing agency on their top-10-ranked digital marketing blog, I used to dedicate some time every year to researching highlights of the previous 12 months in the digital world for an annual recap. One of my sources for new developments in digital marketing was always the official Google Search Central Blog (formerly the Google Webmaster Central Blog). Amongst all the subjects, from algorithm updates to crawling, indexing, and structured data, the one that always stood out was Google’s Webspam Report.

Webspam Report 2019

SpamBrain, Google’s major algorithm for detecting web spam, was introduced in 2018, but it wasn’t until the Webspam Report 2019, published in June 2020, that we saw the first update based upon this particular algorithm.

In that year’s report Google revealed that it was detecting 25 billion spammy web pages every day. Over a full year that comes to 9,125 billion, or over 9 trillion, spam webpages.

Google’s reaction to that number was:

“That’s a lot of spam and it goes to show the scale, persistence, and the lengths that spammers are willing to go.”

They continued:

“We’re very serious about making sure that your chance of encountering spammy pages in Search is as small as possible. Our efforts have helped ensure that more than 99% of visits from our results lead to experiences without spam.”

So it’s encouraging to see how much effort the search engine is putting into combating spam: it helps deliver relevant results to the user and thus helps Google maintain its position as the most popular search engine on earth.

Google also admitted that they’d not been able to reduce the amount of spam they found, almost as if they were hoping their systems were a deterrent when their Terms of Service (TOS) obviously weren’t.

Webspam Report 2020

The next year, the official Google report for 2020 said that the number of spam pages detected every day had risen to 40 billion. Over the course of the year, that was 14,600 billion or 14.6 trillion spam pages. That is 60% more than the previous year’s total.
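The yearly totals can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes a 365-day year and that each daily figure held steady all year, neither of which Google's reports state, and the names used are purely illustrative.

```python
DAYS_PER_YEAR = 365  # assumption: the totals in this article use a 365-day year

# Daily spam-page detections as quoted from the 2019 and 2020 Webspam Reports
daily_detections = {2019: 25_000_000_000, 2020: 40_000_000_000}


def annual_total(per_day: int, days: int = DAYS_PER_YEAR) -> int:
    """Scale a constant daily figure up to a yearly total."""
    return per_day * days


totals = {year: annual_total(count) for year, count in daily_detections.items()}
growth = totals[2020] / totals[2019]  # year-on-year ratio of the two totals

for year, total in totals.items():
    print(f"{year}: {total / 1e12:.3f} trillion spam pages")
print(f"2020 is {growth:.1f}x the 2019 total")
```

On these assumptions the 2020 total works out at 1.6 times the 2019 total, so a large jump but some way short of double.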

Google reiterated that the volumes were still on a massive scale, that their AI system was still fighting the problem, and that spam from hacked websites remained a challenge, despite improving detection rates of hacked spam by over 50%.

Webspam Report 2021

The report for 2021 was pivotal. How many billions of spam web pages had been detected daily in 2021? Google did not quote an actual figure but did say:

“In 2021, SpamBrain identified nearly six times more spam sites than in 2020”

Now that’s not really reporting in my book, unless the numbers are so large that publishing them would be frightening.

And there’s another clever twist too. Did you spot it? Previously Google reported the number of pages. Now they’re hinting at the number of websites. A near-sixfold increase in sites could mean a far larger increase in pages. Ten times? One hundred times? We just don’t know.

Webspam Report 2022

The pattern for the 2022 report is the same as for the previous year:

“SpamBrain detected 5 times more spam sites compared to 2021”

Again, Google mentioned websites but not volumes.
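Even without absolute numbers, the two multipliers can be chained to get a rough relative picture. The sketch below assumes that "N times more" means a factor of N, rounds "nearly six times" up to 6, and sets 2020 to an index of 1; none of that comes from Google, and the names are hypothetical.

```python
# Multipliers quoted in the 2021 and 2022 reports. Reading "N times more" as a
# factor of N (and "nearly six" as 6) is an assumption, not Google's definition.
yearly_multipliers = {2021: 6, 2022: 5}


def cumulative_index(multipliers, base_year=2020):
    """Chain year-on-year multipliers into an index relative to base_year = 1."""
    index = {base_year: 1.0}
    level = 1.0
    for year in sorted(multipliers):
        level *= multipliers[year]
        index[year] = level
    return index


site_index = cumulative_index(yearly_multipliers)
for year, level in site_index.items():
    print(f"{year}: about {level:.0f}x the spam sites detected in 2020")
```

Read that way, the spam sites detected in 2022 would be up to roughly 30 times the 2020 level. It says nothing about page counts, which is exactly the gap in the reporting.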

Webspam Report 2023

Google has typically published the web spam report between April and June of the following year, but no 2023 report appeared in 2024.

I was really hoping to see some data for this because of one really huge, in fact let’s call it seismic, development affecting web spam: in November 2022 OpenAI launched ChatGPT publicly and the world went mad for it. Everywhere I looked, people were positioning themselves as the new gurus in the use of this generative AI tool. To put the scale of the uptake in context, it was the most rapid software adoption in history.

The most frightening aspect, for me as a veteran digital marketer and content writer, was that people without skills were using AI to generate huge amounts of content.

2023, then, was the year when I expected the spam report figures to go sky high. But because we had 14.6 trillion spam pages in 2020, and Google then switched to whole websites as the metric in 2021, the real number of spam webpages and websites is now lost. Google must know, but they’re not telling us. I bet the figure is truly staggering and probably terrifying too, especially with gen AI “churning out” content.

2024

Now, there was an October 2023 Spam Update, which was designed to further improve Google’s spam detection system and reduce the amount of spam in the SERPs.

But now, after a year of unfettered AI malarkey, something big has happened…

The March 2024 Core Update.

Now THIS one was a big update. Google themselves publicly stated that it was “more complex than usual, involving changes to multiple core systems”. A big part of this core update was the March 2024 spam update which introduced two very specific new spam policies:

  • Expired domain abuse: Some spammers have been pushing the tactic of buying up expired domains and replacing what were once highly popular websites with tons of new spam content and links.
  • Scaled content abuse: Of course, whether they use an expired domain or not, there’s still too much mass produced, low-quality, often AI-generated content.

Other types of web spam still exist and remain a threat to the quality of the search results, including link spam, cloaking, and keyword stuffing.

The expired domain and scaled content abuse updates have been music to my ears as they are now addressing the sheer volumes of spam. However, we still don’t have the numbers.

But if the last official number of spam webpages detected every single day was 40 billion in 2020, what is that number now, four years later? We may never know.

Conclusion

Spam is a huge and evolving issue and, whilst Google took over a year to really react to the very real generative AI threat, it seems that they’re on top of it. I don’t tend to see spam in the results and none of the projects I’m working on have been affected. But it’s good to know that scaled content abuse is finally being dealt with.

I very much appreciate Google’s overall approach to fighting spam, both as an end-user searching the index and as a digital marketer, SEO consultant, and webmaster. Their combination of machine learning, manual review, and collaboration with website owners is what keeps more than 99% of visits from search results spam-free. This helps users retain trust in the search results.

If you see spam, deceptive, or low-quality web pages, remember that you can report them.
