Robots (bots) have outnumbered people on the Internet for almost two years, and they browse much faster than your average visitor. Aside from spamming your comment systems and crawling for vulnerable websites to attack, bots can also cause a lot of confusion in your website traffic reporting systems.
If you use analytics software on your website, you may have already noticed some strange, inexplicable referrers in your reports. The scourge of malicious referrals and bad bots is becoming a real problem. Over the past six months, Google Trends shows an exponential increase in search engine queries involving “referral spam” and “google analytics spam.”
We get a lot of questions about these referrers, especially for common bots. Recently, we have been noticing a rise in “ghost referrers” that bypass traditional methods of dealing with referral spam.
Good Bots
Bots are programs designed to automate tasks or pretend to be real visitors. They scan the internet as fast as they can, reading source code, crawling your links and executing functions. This sounds malicious already, doesn’t it?
Not all bots are bad. In fact, some bots are essential to the way we use the internet. Googlebot, for example, is responsible for indexing all of the content online and making it available in search result pages. That being said, if you are a regular reader of our blog, you already know that malware can use fake user-agent strings to make its HTTP requests appear to come from Googlebot.
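Because a user-agent string is trivial to fake, a common way to check a self-declared Googlebot is a reverse DNS lookup on the visiting IP followed by a forward lookup to confirm it. The sketch below is a minimal illustration of that idea, not a complete verifier: the function names are our own, the accepted domain suffixes are an assumption you should check against Google's current guidance, and the forward lookup shown only covers IPv4.

```python
import socket

# Assumption: hostnames of genuine Google crawlers end in one of these suffixes.
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse_dns(ip):
    """Return the primary hostname registered for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def _forward_dns(hostname):
    """Return the list of IPv4 addresses a hostname resolves to."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_googlebot(ip, reverse_lookup=_reverse_dns, forward_lookup=_forward_dns):
    """Check an IP with a reverse lookup, then confirm it with a forward lookup.

    The lookup functions are parameters so the logic can be exercised
    without touching the network.
    """
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no reverse record: cannot be verified
    if not hostname.endswith(GOOGLE_CRAWLER_SUFFIXES):
        return False  # claims to be Googlebot, but the hostname is not Google's
    try:
        # The hostname must resolve back to the same IP, otherwise the
        # reverse record itself may be forged.
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

A request that sends a Googlebot user-agent but fails this check is a good candidate for closer inspection.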
Referral Spam and the Bad Bots
The bad robots are programmed by hackers to do all the heavy lifting, such as:
- Scrape website content for the purposes of plagiarizing it.
- Steal contact and payment details during transactions.
- Commit click fraud on competitors’ pay-per-click (PPC) advertisements.
- Visit your website to mess up your Google Analytics reports, aka referral spam.
Now this is not surprising; the risk of using bots is incredibly small compared to the potential payout. It is the last item on this list, referral spam, that we are going to focus on for the remainder of this post.
Referral spam has a number of malicious applications that can include:
- Inflating your website traffic to mess up your data.
- Tricking you into visiting malicious websites found in the referral reports.
- Generating backlinks from publicly accessible server logs.
- Hiding the real referrer headers while attacking the website.
Referral spam often has high bounce rates and low time-on-page metrics, and this has the potential to invalidate your traffic data in a significant way. Over the years, referral spam bots have become pretty sophisticated. New ones are being created every day, making blocking these spambots almost futile.
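If you export your referral report, those two signals can be used to shortlist referrers worth a closer look. The following is only a rough sketch: the `ReferrerStats` record, the function name and every threshold are hypothetical values picked for illustration, and a flagged referrer still needs a human to review it.

```python
from dataclasses import dataclass


@dataclass
class ReferrerStats:
    """One row of an exported referral report (hypothetical layout)."""
    referrer: str
    sessions: int
    bounces: int
    total_seconds: float


def suspicious_referrers(rows, min_sessions=20, bounce_threshold=0.95, max_avg_seconds=2.0):
    """Return referrers with a very high bounce rate and almost no time on page.

    All thresholds are illustrative assumptions, not recommended values.
    """
    flagged = []
    for row in rows:
        if row.sessions < min_sessions:
            continue  # too little data to judge either way
        bounce_rate = row.bounces / row.sessions
        avg_seconds = row.total_seconds / row.sessions
        if bounce_rate >= bounce_threshold and avg_seconds <= max_avg_seconds:
            flagged.append(row.referrer)
    return flagged
```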
Botnets
Hackers can infect hundreds of thousands of websites and personal computers, using their acquired resources and IP addresses to perform complex attacks. This is a robot network or botnet.
A globally distributed network of computers makes it much more difficult for IP blacklisting or rate-limiting to provide any protection for your website. This is exactly what makes Distributed Denial of Service attacks so effective, with “distributed” being the operative word. You block one bad referrer from your reports and another pops up in its place.
It’s like an endless game of whack-a-mole.
Ghost Referrals
This brand of referral spam is painful. If you have Google Analytics, you will remember that each account has a unique UA code:
_gaq.push(['_setAccount', 'UA-XXXXX-X']);
If your Google Analytics code is hard-coded into your website, bots can scrape it from your source code and use your tracking code.
Once the bot has your UA code, it can send data directly to your Google Analytics account without ever visiting your website. Bots do this through the Google Analytics Measurement Protocol, which was designed to accept data from the Internet of Things (like a connected refrigerator), so the sender does not need to have the tracking code installed on a website.
While our Website Firewall will block malicious referrals by default, these ghost referrals never really hit the website and do not show up in your server logs.
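To see why the hostname is the main giveaway, it helps to look at what a hit contains. A Measurement Protocol hit is a set of URL-encoded fields, including the tracking ID (`tid`), the hit type (`t`), the document hostname (`dh`) and the referrer (`dr`), and the sender fills in all of them. The sketch below is purely illustrative: Google processes these hits, not your server, so this is not something you can deploy on your site, and the sample hostnames and function names are our own.

```python
from urllib.parse import parse_qs

# Assumption: the hostnames that legitimately serve this tracking code.
VALID_HOSTNAMES = {"sucuri.net", "blog.sucuri.net"}


def hit_hostname(payload):
    """Return the 'dh' (document hostname) field of a hit payload, or None."""
    values = parse_qs(payload).get("dh")
    return values[0].lower() if values else None


def looks_like_ghost_hit(payload, valid_hostnames=VALID_HOSTNAMES):
    """A hit whose hostname is missing or not one of ours never came from our site."""
    return hit_hostname(payload) not in valid_hostnames


# A hit as a ghost referrer might craft it: the tracking ID is real, but the
# hostname and referrer are whatever the sender chose to type in.
ghost = "v=1&tid=UA-XXXXX-X&cid=555&t=pageview&dh=spam.example&dp=%2F&dr=http%3A%2F%2Fspam.example%2F"
print(looks_like_ghost_hit(ghost))
```

This is the same test the hostname filter described below applies inside Google Analytics.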
Stopping Ghost Referral Spam
Thankfully, there is a way to ignore ghost referrals and even restore your previous data integrity.
In Google Analytics, you can create a Hostname Filter on each view to make sure that only traffic coming from valid website properties is included in the sample. This means total elimination of ghost referrers in the data for those views going forward. Take that into account when comparing historical metrics.
If you already have views set up for each subdomain, you may be familiar with this process – we are going to use a Custom filter and regular expressions to add all of your valid hostnames.
How To Exclude Ghost Hostnames from Google Analytics
1. On the Reports tab, set the date to cover a month at least.
2. Go to Audience > Technology > Network.
3. Select Hostnames as the Primary Dimension.
4. Write down all of your property’s valid hostnames.
   - e.g. blog.sucuri.net and sucuri.net
   - Most of the invalid hostnames are coming from ghost referrers. Don’t visit them!
5. On the Administration tab, click the View dropdown menu and select Create new view.
6. Name the new view.
   - This lets us test changes before messing with your existing data.
   - Pro Tip: Always keep at least one view for your raw, unfiltered data!
7. Select Filters from the options under the View column and click + NEW FILTER.
8. Select Custom > Include > Hostname.
9. Enter the Filter Pattern as a RegEx string containing all of the valid hostnames from Step 4. Escape each dot with a backslash so that it only matches a literal dot.
   - e.g.
   ^www\.sucuri\.net$|^sucuri\.net$|^blog\.sucuri\.net$|^sitecheck\.sucuri\.net$|^kb\.sucuri\.net$|^performance\.sucuri\.net$|^login\.sucuri\.net$|^blog\.unmaskparasites\.com$
10. Click Verify Filter to preview the changes.
    - I noticed this doesn’t really work with larger strings, like the example above.
11. Click Save to apply the filter.
12. Wait at least 24 hours to see changes in the Hostnames report (Step 3).
13. Visit the Real Time reports to make sure traffic is still coming through to your new, filtered test view.
14. Confirm the filter is working and then safely apply it to your main view.
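If you have many hostnames, typing the filter pattern by hand is error-prone. A small script can build it for you and let you test it before pasting it into Google Analytics. This is a minimal sketch with hypothetical function names. It escapes each dot so that it matches only a literal dot, and it checks the result with Python's regex engine, which we assume behaves like the Analytics one for a pattern this simple.

```python
import re


def build_hostname_pattern(hostnames):
    """Join hostnames into one anchored, dot-escaped include pattern."""
    # Hostnames only contain letters, digits, hyphens and dots, so the dot
    # is the only character that needs escaping here.
    return "|".join("^" + host.replace(".", r"\.") + "$" for host in hostnames)


def matches_valid_hostname(pattern, hostname):
    """Return True if the hostname would pass the include filter."""
    return re.search(pattern, hostname) is not None


pattern = build_hostname_pattern(["sucuri.net", "blog.sucuri.net"])
print(pattern)
print(matches_valid_hostname(pattern, "blog.sucuri.net"))  # a valid hostname
print(matches_valid_hostname(pattern, "sucuriXnet"))       # only an unescaped dot would let this in
```

Run the hostnames you wrote down in Step 4 through it and confirm that every one of them matches before you save the filter.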
This filter will only affect new traffic coming in. The old traffic still has ghosts. In order to view historical data without the analytics spam, you can create a Custom Segment.
How To Segment Valid Hostnames for Google Analytics:
- On any Report click + Add Segment.
- Click the + NEW SEGMENT button.
- Select Conditions under Advanced on the sidebar.
- Select Hostnames.
- The Filter options above are set to Sessions & Include by default.
- Enter one hostname and click the OR button to enter the rest.
- Click Save when all valid hostnames have been added.
- You can apply the new segment to any data to ensure it only contains your hostnames.
It is important to remember that these are not permanent solutions, but they are quick and effective. Hackers get wise to our counter-measures, and there are cases of attackers setting fake hostnames in their referral headers.
Blocking Normal Referral Spam
The more common type of referral spam involves bots actually visiting your website, so this type can be blocked by traditional means. In Google Analytics you can set up a similar Referral Exclusion filter, but it can be an exhausting process. Here are some of the most common ones you may see creeping up in your Referral reports:
buttons-for-website\.com|blackhatworth\.com|anticrawler\.org
It’s recommended to add server configuration rules in your .htaccess, web.config or nginx.conf files to specifically exclude lists of known bad referrers. There are a lot of long lists out there that you can look for, and they keep on growing. If data integrity is important to you, this might be something you want to keep on top of your priority list.
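The same idea can be expressed at the application level if you cannot edit the server configuration. The sketch below is one hypothetical way to do it in a Python WSGI application, using the three domains listed above; the function names are ours, and a production blocklist would be far longer and loaded from a maintained source.

```python
import re
from urllib.parse import urlsplit

# The three example domains from this post; matches the domain and its subdomains.
BAD_REFERRERS = re.compile(
    r"(^|\.)(buttons-for-website\.com|blackhatworth\.com|anticrawler\.org)$",
    re.IGNORECASE,
)


def is_spam_referrer(referer_header):
    """Return True if the Referer header points at a blocklisted domain."""
    if not referer_header:
        return False
    if "//" not in referer_header:
        referer_header = "//" + referer_header  # tolerate a bare hostname
    try:
        host = urlsplit(referer_header).hostname or ""
    except ValueError:
        return False  # malformed header: leave the decision to other rules
    return bool(BAD_REFERRERS.search(host))


def block_spam_referrers(app):
    """Wrap a WSGI app so that requests from blocklisted referrers get a 403."""
    def middleware(environ, start_response):
        if is_spam_referrer(environ.get("HTTP_REFERER")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

Keep in mind that this only helps against bots that really visit your site. Ghost referrals never reach your server, so they still need the hostname filter.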
The Sucuri Website Firewall also blocks known bad referrers by default, and you can trust Sucuri Labs to keep these lists up-to-date. You can also add your own custom rules to the Firewall as needed.
Conclusion
I know it’s frustrating, but don’t give up! Layered security is one of the most important concepts of the 21st century and we need to get comfortable with the practice of protecting our data. This includes protecting our websites and the third-party tools we use to support them.
Have you encountered analytics referral spam? Let us know about your experience in the comments.
17 comments
Hi Alycia, thanks a lot for this article. I am using the spyderspanker plugin (which integrates with projecthoneypot as well) and it does a great job at blocking spyders and referral spam. It uses ioncube instead of .htaccess though. Wonder what you guys think of this though
Great article; funny how I kept thinking while reading you could have done the ubiquitous click bait headline by adding “ghost protocol” in there ala a certain movie title, but thankfully did not. Another great piece of information shared, thanks! 🙂
Thanks for these tips. We too are adding filters to Analytics to correct web site traffic inflated results, but you’ve given us more insight. You wonder that people don’t have better things to do with their time and surely the traffic they create will be low quality anyway!
There’s a really easy way to get rid of most of these without having to use filters.
Make sure the box is checked at:
Admin > View > View Settings > Exclude all hits from known bots and spiders.
In my opinion this gets rid of 99% of all of the trouble makers without having to do anything more.
Would this block google bots as well though? Or the “good” bots?
No bot should ever show up in your Analytics report. Not even Google Bot. This is only a setting for Analytics, it’s not instructing google to not crawl your website. It has no connection with Google Search. Google has never shown good bots in Analytics anyway. You’ll never see Google, MSN, Yahoo, Yandex etc in your Analytics reports.
Even then, why would you ever want to see a bot in an Analytics report. It would only provide false data.
Hope that helps clarify.
Ohh ok… yes, it does. Thanks a lot! I tried blocking out invalid hostnames like this article suggests and it did work however it also blocked out analytics from most media platforms and other campaigns we have running (which doesn’t make much sense to me). I’ll give this a try!
Awesome! Let me know how it goes.
Apart from setting filters, is there any way to stop these attacks? They appear to have a very bad effect of raising bounce rates and quite possibly, lowering Google search rankings.
Filters are the best way, but if you don’t want to use filters there are a few other things you could try. The “exclude all hits from known bots” checkbox, as mentioned by Todd below, will stop some of it, though not all for most people. Some people use a higher profile number (e.g. UA-XXX-8 instead of -1); this can also help a lot, though more of the spammers are trying higher-numbered profiles now as well.
They won’t affect your Google search ratings, so while annoying you don’t have to worry about that at least.
Great article! Thank you.
I’m not sure this is exactly related to the topic, but I’ve got a client who is using an “email marketing agency” to send out emails to a supposedly “proprietary database”. They won’t give any information on how they generate and maintain said database which is very sketchy—in my eyes. These emails have sent a TON of traffic to the site but with very low time on page and other very questionable stats.
I finally dug into the “Network > Service Provider” area in GA and found that the majority of the traffic is from the Google Inc. and Amazon networks, but the campaign source is the email company. Is it possible that they are using Google and Amazon services to run bots from to inflate their deliver, open and click-through rates?
There have been around 10k visits via the Google and Amazon networks in just ONE day.
Thanks for any and all thoughts!
Jon
Every time, I had to filter the ghost visits to exclude them from my Google Analytics. But this is such an amazing article and it is quite remarkable in the way you simplified this. Thanks @Alycia :). I have applied the filters and will comment again once I get success in this.
Hi Alycia. I am glad that the frequency of spam referral visits has been decreasing now. But the only problem is that I am getting two different numbers of visits from the social channel in my two Google Analytics views for the same website, though I have applied the same filter to both views. Could you please guide me on this issue?
Thanks for your consideration.
Regards
Jyoti
Although good tips in this article, I find it ridiculous that we should have to spend so much time making all these efforts to see clean stats. I simply do not believe that Google cannot deliver clean stats in the first place. Why should we pay (time is money) for something that is caused indirectly by them?
This is the one and only reason I have quit using Google Analytics.
I am not affiliated with Google at all and I would adamantly disagree with your comments. Google has no reason to benefit from showing known bots and spiders. There’s only one setting that you need to enable. It’s in the view settings and it’s called “Exclude all hits from known bots and spiders”. Once this is checked all known bots and spiders will be ignored. Obviously some will make it through before they are known to Google. Anyone can purchase a domain, easily setup a bot to scan for UA codes and then use those codes to manipulate data. Once Google spots them they are then blocked.
I’m attaching data for one of our accounts that has this option enabled in the settings. All of the referrals for a 30 day snapshot are clean. All of the domains are legit and there’s no segment on page as you can see. This site averages approximately 300-500 visits a day. Nothing has to be done to see this. A single check box when setting up the account is all that needed to be done.
If you’re spending time on this there’s something wrong. In my opinion claiming that this is caused by Google is not at all correct. If you use something else it may not have the same issue. Why you ask? Because bots aren’t searching for tracking codes for Piwik or whatever you might be using. When 15,429,942 sites (per BuiltWith) are using GA it’s the obvious target.
(edit: Sucuri has image upload in Disqus disabled)
not sure why you would be offended by my comment, Todd, sorry that you were, was not my intention.
I’m not one bit offended and I apologize if you feel attacked. We run multiple analytic platforms. I simply feel that you are incorrect by passing blame to Google. There are simple measures that they put in place to allow the option to remove the erroneous referral spam.