Every once in a while we get a glimpse into rare and strange behavior that doesn’t involve a website being hacked, yet still causes major problems for its owners. We recently covered malicious referral spam in Google Analytics and Google Search Console being abused by attackers after they gain access to a website. Today we are going to look at 404 errors in Search Console caused by website spam.
Specifically, we’ll see how Googlebot ended up accidentally crashing a site after we cleaned up a large-scale spam infection. If you use Google Search Console / Webmaster Tools (and you should), we offer specific instructions below so you aren’t affected if you find yourself in a similar situation.
Indicators of Compromise (IoC)
Let’s start by analyzing the signs of this type of mass spam infection, which leads to the issue with Googlebot:
- It changes the titles and descriptions shown for your pages in Google search results.
- It can suddenly eat into the disk quota of your hosting account due to the large number of files being created.
- Once the spam is removed, 404 errors in Search Console / Webmaster Tools may remain, causing further damage.
Now let’s analyze how this Japanese spam campaign works:
- Attackers create doorway pages on an infected site in order to rank them in Google results for relevant search queries.
- When searchers click on these results the doorway redirects them to third-party sites that the hackers really want to promote.
Here’s where it gets interesting. Google will only rank the doorway pages if there are many incoming links to them. This is one of the main ways that Google identifies “good” search results as part of its algorithm.
You can’t expect anyone to link to doorway pages that only the hackers know about. That’s why the attackers place links to these doorways on other doorways that they have created on other hacked websites.
Here’s an example using Unmaskparasites to uncover one of those doorways and its external links from hacked sites:
Now let’s do the math:
- Typical spam campaigns infect around 3,000 sites.
- Each site, as we know, has at least 25,000 spam pages/doorways (usually more).
- Each doorway has at least 5 links to other hacked sites.
- This gives us around 125,000 outgoing links per hacked site.
- Since they are evenly distributed among all the compromised sites, each hacked site ends up with about 40 links to every other hacked site.
This means that, combined, all the hacked sites have around 125,000 links pointing to doorways on each individual hacked site. Even this is probably an underestimate, since the attackers usually create more than one directory of spam files, each containing 20,000+ pages.
As you can see, that is an enormous number of incoming links to your site, and Google can see them too.
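The arithmetic above can be sketched in a few lines. All the figures are the rough estimates used in this post, not measured values:

```python
# Rough estimates for a typical campaign, as described above.
sites = 3_000               # hacked sites in a typical campaign
doorways_per_site = 25_000  # spam pages/doorways on each site
links_per_doorway = 5       # links to doorways on other hacked sites

# Outgoing spam links created on each hacked site:
outgoing_per_site = doorways_per_site * links_per_doorway
print(outgoing_per_site)  # 125000

# Spread evenly across the other hacked sites, each site points
# roughly this many links at every other site in the campaign:
links_to_each_peer = outgoing_per_site // (sites - 1)
print(links_to_each_peer)  # 41, i.e. "about 40"

# Which means each site also *receives* on the order of:
print(links_to_each_peer * (sites - 1))  # 122959, roughly 125,000 incoming
```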
The Impact of Spam on Search Engine Optimization (SEO)
Now let’s take a look at how this problem of incoming spam links affects your SEO and what happens once you clean them up:
- As we calculated above, there are probably over 125,000 references on the web pointing to the spam on your website. This means Googlebot will eventually crawl those links on the other infected sites and then start crawling your website for the pages they point to.
- If the spam is not cleaned up promptly, it can cause a sharp drop in your rankings: the huge number of spam doorways drains your link juice and lowers your reputation.
- After all those files are cleaned up, Google will still try to crawl them, because the backlinks were most likely already posted elsewhere. This can create a tremendous number of 404 errors in your Google Search Console (Webmaster Tools) panel.
Mitigation and Recovery
Now let’s think about how we can mitigate this large number of 404 pages that Googlebot expects to find:
- If you want to get your hands dirty with a time-consuming process, you can use the URL Removal tool in Search Console. It’s a good method for a small number of links and it should show results quickly; however, it’s impractical to submit all the links one by one. In some cases Google even recommends using the robots.txt solution below instead.
- Robots.txt is a file on your server that tells bots which parts of your site they can crawl and which parts they should not. Since most of the spam pages live inside fake folders such as “fjhj”, “thg”, “gtg”, and “iut”, you just need to disallow robots (like Googlebot) from crawling those folders altogether. This will immediately remove the 404 errors for those pages from the “Crawl Errors” report in Search Console, since Googlebot will no longer even try to crawl them.
The Robots.txt Solution
Basically, you need to create a robots.txt file with the following content and put it in the root directory of your site:
User-agent: *
Disallow: /fjhj/
Disallow: /thg/
Disallow: /gtg/
Disallow: /iut/
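If you want a quick sanity check that these rules do what you expect, Python’s built-in robotparser can evaluate them locally. This only verifies the rule logic; it is not an official Google tool:

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt above:
rules = """User-agent: *
Disallow: /fjhj/
Disallow: /thg/
Disallow: /gtg/
Disallow: /iut/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Spam doorway folders are blocked for every bot, including Googlebot:
print(parser.can_fetch("Googlebot", "/fjhj/some-doorway.html"))  # False
# Legitimate content stays crawlable:
print(parser.can_fetch("Googlebot", "/blog/my-real-post/"))      # True
```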
If no action is taken to mitigate these 404 pages, Google will gradually show more and more 404 “Not Found” errors each day in your Search Console, and the count can reach the hundreds of thousands in a short amount of time.
Let’s take a deeper look at why Google will show more 404 pages after you remove the spam from your website:
- When the attackers infected your site, all the spam pages were there, but Googlebot might not have found any references to them yet. If it had, the pages still existed, so there was no 404 issue with them.
- Google doesn’t crawl all those 100,000+ new spam pages at the same time. To avoid causing performance issues for your server, Googlebot usually has a quota on the number of pages it can crawl daily, especially when a site is not known to produce thousands of new URLs every day.
- You can see this in “Crawl Stats” in Search Console. The default value there is assigned automatically based on Google’s analysis of your website’s response capability and network load. Google wants to keep a good balance of crawls without affecting the speed of your website.
- The same thing happened on the other hacked sites linking to your spam pages: over time, Google gradually increased the number of spam pages (and new links to your site) it crawled daily.
- Google will crawl every link it sees on external websites pointing to your site, so even after the infection on your website is cleaned there may be thousands of links in Google’s crawl “queue”. Google may also find more links on other infected websites that still point to the removed spam pages; those get added to the queue as well and, at this point, also throw 404 errors.
- Eventually Google begins to re-crawl the already-indexed doorways on your site and notices that they are all gone. This pushes the number of 404 errors close to its maximum. Google always tries to re-crawl deleted pages at least a few times over several weeks, to make sure they are really gone for good before removing them from the index. This ensures the 404 is not just a temporary maintenance issue.
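To get an intuition for why the 404 count climbs gradually instead of spiking all at once, here is a toy model of a growing daily crawl quota. The starting quota and growth rate are invented for illustration; Google does not publish these numbers:

```python
def days_to_recheck(total_urls: int, start_quota: int, daily_growth: float) -> int:
    """Toy model: Googlebot re-checks `quota` stale URLs per day,
    and the quota grows while the crawl keeps finding work."""
    checked, quota, days = 0, start_quota, 0
    while checked < total_urls:
        checked += quota
        quota = int(quota * daily_growth)
        days += 1
    return days

# 125,000 stale spam URLs, starting at 500 re-crawls/day, quota +10%/day:
print(days_to_recheck(125_000, 500, 1.10))
# With a flat quota it would simply take total/quota days:
print(days_to_recheck(125_000, 500, 1.00))  # 250
```

The growing quota is why the “Not Found” graph in Search Console keeps rising for weeks after the cleanup rather than showing all the errors on day one.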
It’s normal practice for Google to re-visit 404 pages and only remove them from the index after a few consecutive failed re-scans. Here’s what John Mueller (Google’s Senior Webmaster Trends Analyst) said about it just a few months ago:
… we’ll still occasionally check to see if these pages are still gone, especially when we spot a new link to them.
So, even though your site is clean, you are experiencing a long-term side effect of the hack, and you may see Google attempting to re-crawl thousands of non-existent spam pages for a few more weeks.
The Rare Accidental DDoS by Googlebot
Now, there are a few very rare cases where Google’s automatic performance analyzer fails to determine an accurate “optimal” crawl rate. This is usually caused by a temporary network issue on the website or on your server, and in these cases Google starts to crawl your website very intensively. This can bring your website down, a bit like a very rough DDoS attack.
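Before throttling Googlebot, it is worth confirming in your access logs that it really is the source of the load. A minimal sketch, assuming the common Apache/Nginx combined log format (the log lines here are made up for illustration):

```python
import re
from collections import Counter

# Made-up sample lines in combined log format:
log_lines = [
    '66.249.66.1 - - [10/Mar/2016:12:01:02 +0000] "GET /fjhj/a.html HTTP/1.1" 404 209 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Mar/2016:12:01:03 +0000] "GET /thg/b.html HTTP/1.1" 404 209 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Mar/2016:12:01:05 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

# Capture the timestamp down to the minute, e.g. "10/Mar/2016:12:01"
minute = re.compile(r'\[([^\]:]+:\d{2}:\d{2})')

googlebot_hits_per_minute = Counter(
    minute.search(line).group(1)
    for line in log_lines
    if "Googlebot" in line
)
print(googlebot_hits_per_minute.most_common())
```

Keep in mind that the user-agent string can be spoofed; Google recommends a reverse DNS lookup on the requesting IP to confirm that a crawler is genuinely Googlebot.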
If this happens to you, start by going into your Google Search Console (Webmaster Tools), open the site’s settings, and follow these steps:
On this screen we see a couple of settings, but we’re interested in the Crawl Rate settings.
The first setting is the default that every website receives as soon as you add it as a new property. Since the automatic rate is not working for the affected site, we will need to use the second option.
For most cases the best setting is slightly below the middle of the slider, but if that is still not enough to restore decent performance, feel free to lower it further until your website is stable.
Please note that this setting should only be changed if Google’s crawling is in fact slowing your website down, because it can significantly delay how quickly your content and updates appear in the Google Search index. By setting it low, we are asking Google to take longer to crawl your website.
It is important to note that this setting only stays in effect for 90 days, after which it reverts back to automatic mode. It only changes the speed of Googlebot’s requests during the crawl process; it does not affect how often Google crawls your site or how deeply your URL structure is crawled.