Web Crawler & User Agent Blocking Techniques

This is a simple script that allows hackers to block specific crawlers based on the user-agent string in the incoming request. It's useful when they don't want certain traffic to be able to load certain content – usually a phishing page or a malicious download.

if(preg_match('/bot|crawler|spider|facebook|alexa|twitter|curl/i', $_SERVER['HTTP_USER_AGENT'])) {
    // logger() is not a PHP built-in; it is presumably defined elsewhere in the
    // malware to record which requests were blocked
    logger("[BOT] {$_SERVER['REQUEST_URI']} - 500");

    // Send a fake 500 response and stop, so the crawler never sees the real content
    header('HTTP/1.1 500 Internal Server Error');
    exit();
}

Using preg_match, the script looks for known crawler strings in the user-agent. If it finds a match, then instead of serving the requested page it reports a 500 Internal Server Error to the detected crawler. It accomplishes this through PHP's header function, which sets the HTTP response headers sent back to the client.
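
For illustration, here is a standalone test (a hypothetical snippet, not part of the original malware) of the same case-insensitive pattern against a Googlebot and an iPhone user-agent string:

$pattern = '/bot|crawler|spider|facebook|alexa|twitter|curl/i';

$googlebot = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
$iphone    = 'Mozilla/5.0 (iPhone; CPU iPhone OS 9_2 like Mac OS X) AppleWebKit/601.1 (KHTML, like Gecko) CriOS/47.0.2526.70 Mobile/13C71 Safari/601.1.46';

var_dump(preg_match($pattern, $googlebot)); // int(1): "bot" matches, so the request would be blocked
var_dump(preg_match($pattern, $iphone));    // int(0): no match, so the page would be served normally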

This behavior can be verified by checking the website's HTTP access logs. Here is an example of a request sent from Googlebot, which receives a 500 Internal Server Error, and a request sent from an iPhone, which goes through successfully (a 200 response code instead of 500):

127.0.0.1 - - [09/Jul/2020:11:36:52 -0500] "GET /test.php HTTP/1.1" 500 185 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

127.0.0.1 - - [09/Jul/2020:11:37:10 -0500] "GET /test.php HTTP/1.1" 200 147 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 9_2 like Mac OS X) AppleWebKit/601.1 (KHTML, like Gecko) CriOS/47.0.2526.70 Mobile/13C71 Safari/601.1.46"
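
To reproduce entries like these (a hypothetical check, assuming the malicious snippet is deployed as test.php), you can send requests with spoofed user-agent strings, for example through PHP's HTTP stream wrapper:

// Hypothetical reproduction: request test.php with a spoofed Googlebot user-agent
$context = stream_context_create([
    'http' => [
        'user_agent'    => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
        'ignore_errors' => true, // still read the response even when the status is 500
    ],
]);

file_get_contents('http://127.0.0.1/test.php', false, $context);
echo $http_response_header[0]; // "HTTP/1.1 500 Internal Server Error"

// Sending the same request with the iPhone user-agent from the second log line
// would print "HTTP/1.1 200 OK" instead.

Note that a plain curl request would also be blocked, since curl's default user-agent contains the string "curl", which appears in the blocked pattern.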

These types of scripts can also be used to prevent bots or users from determining whether a phishing page or malicious download still exists.

To detect and prevent these issues, we highly recommend having file integrity monitoring in place, along with clean backups of your files and database. If your website becomes compromised, file integrity monitoring will help you identify indicators of compromise and malicious behavior within your environment, and clean backups make it possible to restore affected files.
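
As a rough illustration of what file integrity monitoring does behind the scenes, the following sketch (hypothetical paths and baseline file; dedicated monitoring tools handle this far more robustly) compares current file hashes against a previously saved baseline:

// Load a previously saved map of file path => SHA-256 hash (hypothetical baseline.json)
$baseline = json_decode(file_get_contents('baseline.json'), true);

$files = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator('/var/www/html', FilesystemIterator::SKIP_DOTS)
);

foreach ($files as $file) {
    $path = $file->getPathname();
    $hash = hash_file('sha256', $path);

    if (!isset($baseline[$path])) {
        echo "New file: $path\n";        // possible dropped file, e.g. a blocking script
    } elseif ($baseline[$path] !== $hash) {
        echo "Modified file: $path\n";   // possible injected code
    }
}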
