Robots (bots) have outnumbered people on the Internet for almost two years, and they browse much faster than your average visitor. Aside from spamming your comment systems and crawling for vulnerable websites to attack, bots can also cause a lot of confusion in your website traffic reporting systems.
If you use analytics software on your website, you may have already noticed some strange, inexplicable referrers in your reports. The scourge of malicious referrals and bad bots is becoming a real problem. Over the past six months, Google Trends shows an exponential increase in search engine queries involving “referral spam” and “google analytics spam.”
We get a lot of questions about these referrers, especially for common bots like Semalt. Recently, we have been noticing a rise in “ghost referrers” that bypass traditional methods of dealing with referral spam.
Good Bots
Bots are programs designed to automate tasks or pretend to be real visitors. They scan the internet as fast as they can, reading source code, crawling your links and executing functions. This sounds malicious already, doesn’t it?
Not all bots are bad. In fact, some bots are essential to the way we use the internet. Googlebot, for example, is responsible for indexing content online and making it available in search result pages. That being said, if you are a regular reader of our blog, you already know that malware can use fake user-agent strings to make its HTTP requests look like they are coming from Googlebot.
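Because a user-agent string is trivial to fake, a claim of being Googlebot is only worth trusting once the requesting IP has been checked. The sketch below is one way to do that, assuming the reverse-then-forward DNS check that Google describes for verifying its crawlers. The function names and the suffix list are our own illustration, and the lookup only handles IPv4.

```python
import socket

# Assumption: genuine Google crawlers reverse-resolve to one of these domains.
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_hostname(hostname: str) -> bool:
    """Return True if a reverse-DNS name ends in a Google crawler domain."""
    return hostname.lower().rstrip(".").endswith(GOOGLE_CRAWLER_SUFFIXES)


def is_real_googlebot(ip_address: str) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip_address)
        if not is_google_hostname(hostname):
            return False
        # The forward lookup must map the name back to the same IP,
        # otherwise anyone could publish a fake reverse DNS record.
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        # Covers failed reverse and forward lookups alike.
        return False
    return ip_address in addresses
```

A request whose user-agent says Googlebot but whose IP fails `is_real_googlebot` is a good candidate for blocking or closer inspection.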
Referral Spam and the Bad Bots
The bad robots are programmed by hackers to do all the heavy lifting, such as:
- Scrape website content for the purposes of plagiarizing it.
- Steal contact and payment details during transactions.
- Commit click fraud on competitors’ pay-per-click (PPC) advertisements.
- Visit your website to mess up your Google Analytics reports, aka referral spam.
Now this is not surprising; the risk of using bots is incredibly small compared to the potential payout. It is the last item on this list, referral spam, that we are going to focus on for the remainder of this post.
Referral spam has a number of malicious applications that can include:
- Inflating your website traffic to mess up your data.
- Tricking you into visiting malicious websites found in the referral reports.
- Generating backlinks from publicly accessible server logs.
- Hiding the real referrer headers while attacking the website.
Referral spam often has high bounce rates and low time-on-page metrics, which can significantly skew your traffic data. Over the years, referral spam bots have become pretty sophisticated. New ones are being created every day, making it almost futile to block these spambots one by one.
Hackers can infect hundreds of thousands of websites and personal computers, using their acquired resources and IP addresses to perform complex attacks. This is a robot network or botnet.
A globally distributed network of computers makes it much more difficult for IP blacklisting or rate-limiting to provide any protection for your website. This is exactly what makes Distributed Denial of Service attacks so effective, with “distributed” being the operative word. You block one bad referrer from your reports and another pops up in its place.
It’s like an endless game of whack-a-mole.
Ghost referral spam is especially painful. If you have Google Analytics, you will remember that each property in your account has a unique UA code (the tracking ID that starts with “UA-”).
If your Google Analytics code is hard-coded into your website, bots can scrape it from your source code and use your tracking code.
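To see how little effort the scraping takes, here is a minimal sketch you could run against your own pages to check whether a tracking ID is sitting in the markup. The regular expression is our assumption about the usual shape of a UA code, and the sample snippet and ID are made up.

```python
import re
from typing import List

# Assumed shape of a UA code: "UA-", an account number, "-", a property index.
UA_ID_PATTERN = re.compile(r"\bUA-\d{4,10}-\d{1,4}\b")


def find_tracking_ids(html: str) -> List[str]:
    """Return the distinct UA codes that appear in a page's source."""
    return sorted(set(UA_ID_PATTERN.findall(html)))


# Hypothetical page source with a hard-coded tracking snippet.
SAMPLE_HTML = """
<script>
  ga('create', 'UA-12345678-1', 'auto');
  ga('send', 'pageview');
</script>
"""

print(find_tracking_ids(SAMPLE_HTML))
```

If a dozen lines can find your ID, so can a bot.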
Once the bot has your UA code, it can send data directly to your Google Analytics account without ever visiting your website. These bots use the Google Analytics Measurement Protocol, which was designed to accept data from the Internet of Things (like a connected refrigerator), so they never need to load the tracking code on your pages.
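It helps to look at what such a hit contains. The sketch below decodes a made-up Measurement Protocol payload, assuming the Universal Analytics field names `tid` (tracking ID), `t` (hit type), `dh` (document hostname) and `dr` (document referrer). The point is that the hostname and referrer are simply whatever the sender typed in.

```python
from urllib.parse import parse_qs

# Hypothetical ghost hit: every value here was chosen by the sender.
SAMPLE_HIT = (
    "v=1&tid=UA-12345678-1&cid=555&t=pageview"
    "&dh=spam.example&dp=%2F&dr=http%3A%2F%2Fspam.example%2F"
)


def describe_hit(payload: str) -> dict:
    """Decode a URL-encoded hit and pull out the fields that matter here."""
    fields = {key: values[0] for key, values in parse_qs(payload).items()}
    return {
        "tracking_id": fields.get("tid"),
        "hit_type": fields.get("t"),
        "hostname": fields.get("dh"),
        "referrer": fields.get("dr"),
    }


print(describe_hit(SAMPLE_HIT))
```

Notice that the reported hostname is not one of yours. That mismatch is what makes ghost hits possible to filter.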
While our Website Firewall will block malicious referrals by default, these ghost referrals never really hit the website and do not show up in your server logs.
Stopping Ghost Referral Spam
Thankfully, there is a way to ignore ghost referrals and even restore your previous data integrity.
In Google Analytics, you can create a Hostname Filter on each view to make sure that only traffic coming from valid website properties is included in the sample. This eliminates ghost referrers from the data for those views going forward. Take that into account when comparing historical metrics.
If you already have views set up for each subdomain, you may be familiar with this process – we are going to use a Custom filter and regular expressions to add all of your valid hostnames.
How To Exclude Ghost Hostnames from Google Analytics
1. On the Reports tab, set the date range to cover at least a month.
2. Go to Audience > Technology > Network.
3. Select Hostnames as the Primary Dimension.
4. Write down all of your property’s valid hostnames.
   - e.g. blog.sucuri.net and sucuri.net
   - Most of the invalid hostnames are coming from ghost referrers. Don’t visit them!
5. On the Administration tab, click the View dropdown menu and select Create new view.
6. Name the new view.
   - This lets us test changes before messing with your existing data.
   - Pro Tip: Always keep at least one view for your raw, unfiltered data!
7. Select Filters from the options under the View column and click + NEW FILTER.
8. Select Custom > Include > Hostname.
9. Enter the Filter Pattern as a RegEx string containing all of the valid hostnames from Step 4.
10. Click Verify Filter to preview the changes.
   - I noticed this doesn’t really work with longer RegEx strings.
11. Click Save to apply the filter.
12. Wait at least 24 hours to see changes in the Hostnames report (Step 3).
13. Visit the Real Time reports to make sure traffic is still coming through to your new, filtered test view.
14. Confirm the filter is working and then safely apply it to your main view.
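If you have more than a couple of hostnames, it is easy to mistype the filter pattern. Here is a minimal sketch for building and testing it locally before pasting it into the filter. It assumes the filter field accepts standard anchors (`^`, `$`) and alternation (`|`); the helper names are ours, and the hostnames are the two examples from the steps above.

```python
import re
from typing import Iterable


def build_hostname_pattern(hostnames: Iterable[str]) -> str:
    """Join hostnames into one anchored pattern, escaping the dots."""
    escaped = sorted(
        {name.strip().lower().replace(".", r"\.") for name in hostnames}
    )
    return "^(" + "|".join(escaped) + ")$"


VALID_HOSTNAMES = ["sucuri.net", "blog.sucuri.net"]
PATTERN = build_hostname_pattern(VALID_HOSTNAMES)


def is_valid_hostname(hostname: str) -> bool:
    """Check a hostname from the report against the filter pattern."""
    return re.search(PATTERN, hostname.lower()) is not None


print(PATTERN)
```

Escaping the dots matters: an unescaped `.` matches any character, and the anchors stop a ghost from slipping through with a hostname that merely contains yours.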
This filter will only affect new traffic coming in. The old traffic still has ghosts. In order to view historical data without the analytics spam, you can create a Custom Segment.
How To Segment Valid Hostnames for Google Analytics
- On any Report click + Add Segment.
- Click the + NEW SEGMENT button.
- Select Conditions under Advanced on the sidebar.
- Select Hostnames.
- The Filter options above are set to Sessions & Include by default.
- Enter one hostname and click the OR button to enter the rest.
- Click Save when all valid hostnames have been added.
- You can apply the new segment to any data to ensure it only contains your hostnames.
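The same idea works outside the Analytics interface if you export your historical data. The sketch below assumes each exported row carries a `hostname` and a `sessions` value; the row layout and the numbers are made up for illustration.

```python
from typing import Dict, List, Set

VALID_HOSTNAMES = {"sucuri.net", "blog.sucuri.net"}


def keep_valid_sessions(rows: List[Dict], valid: Set[str]) -> List[Dict]:
    """Keep only the rows whose hostname is one of our own."""
    return [row for row in rows if str(row.get("hostname", "")).lower() in valid]


# Hypothetical export: two real hostnames and two ghosts.
ROWS = [
    {"hostname": "sucuri.net", "sessions": 120},
    {"hostname": "ghost-spam.example", "sessions": 45},
    {"hostname": "blog.sucuri.net", "sessions": 80},
    {"hostname": "(not set)", "sessions": 10},
]

clean_rows = keep_valid_sessions(ROWS, VALID_HOSTNAMES)
print(sum(row["sessions"] for row in clean_rows), "valid sessions")
```

Comparing the totals before and after gives you a rough idea of how much of your historical traffic was never real.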
It is important to remember that these are not permanent solutions, but they are quick and effective. Hackers get wise to our counter-measures, and there are cases of attackers setting fake hostnames in their referral headers.
Blocking Normal Referral Spam
The more common type of referral spam involves bots actually visiting your website, so it can be blocked by traditional means. In Google Analytics you can set up a similar Referral Exclusion filter, but it can be an exhausting process. Common offenders such as Semalt are the ones you will most often see creeping up in your Referral reports.
It’s recommended to add server configuration rules in your .htaccess, web.config or nginx.conf files to specifically exclude lists of known bad referrers. There are a lot of long lists out there that you can look for, and they keep on growing. If data integrity is important to you, this might be something you want to keep at the top of your priority list.
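The logic behind such a rule is the same whichever server you run. Here is a minimal sketch of the check in Python, the kind of thing you might drop into application middleware if you cannot edit the server configuration. The blocklist is illustrative only (a real one would come from a maintained list), and the function name is ours.

```python
from urllib.parse import urlsplit

# Illustrative entries only; real blocklists are far longer.
BLOCKED_REFERRER_DOMAINS = {"semalt.com", "spam-referrer.example"}


def is_blocked_referrer(referer_header: str, blocked=BLOCKED_REFERRER_DOMAINS) -> bool:
    """Return True if the Referer header points at a blocked domain or subdomain."""
    # A header without a scheme yields no hostname and is treated as not blocked.
    host = (urlsplit(referer_header).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in blocked)


print(is_blocked_referrer("http://semalt.semalt.com/crawler.php"))
print(is_blocked_referrer("https://www.google.com/"))
```

A request that fails this check can be answered with a 403 before it ever reaches your pages or your analytics. Matching on the domain and its subdomains, rather than on a substring, avoids blocking innocent sites that happen to contain a blocked name.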
I know it’s frustrating, but don’t give up! Layered security is one of the most important concepts of the 21st century and we need to get comfortable with the practice of protecting our data. This includes protecting our websites and the third-party tools we use to support them.
Have you encountered analytics referral spam? Let us know about your experience in the comments.