An XML sitemap is an important part of a website’s SEO and exists to help search engine crawlers index new URLs on your website. For example, if a site has a large number of pages that were recently updated and the owner wants Google to index their latest content, they could utilize a sitemap.xml containing all the URLs along with some metadata.
Unfortunately, blackhat SEO spammers are well aware of sitemap.xml’s abilities and use tools to assist in generating malicious sitemap files that direct search engine crawlers like Googlebot to prioritize their SEO spam content.
Indexing Unwanted Spam Content
This hacktool was found on a compromised website that had an SEO spam problem. Its only purpose is to generate a malicious sitemap.xml file so that the spam can be indexed.
...
$request_url = isset($_GET['url']) ? $_GET['url'] : '';
if (!empty($request_url)) {
    $sx_content .= "\n\t\t" . '<url>';
    $sx_content .= "\n\t\t\t" . '<loc>http://' . $_SERVER['SERVER_NAME'] . $url . '?' . $request_url . '</loc>';
    $sx_content .= "\n\t\t\t" . '<lastmod>' . date('Y-m-d') . '</lastmod>';
    $sx_content .= "\n\t\t\t" . '<changefreq>daily</changefreq>';
    $sx_content .= "\n\t\t\t" . '<priority>0.8</priority>';
    $sx_content .= "\n\t\t</url>";
} else {
    for ($j = 0; $j < 5000; $j++) {
        $str1 = random(8);
        $sx_content .= "\n\t\t" . '<url>';
        $sx_content .= "\n\t\t\t" . '<loc>http://' . $_SERVER['SERVER_NAME'] . $url . '?' . $str1 . ".html" . '</loc>';
        $sx_content .= "\n\t\t\t" . '<lastmod>' . date('Y-m-d') . '</lastmod>';
        $sx_content .= "\n\t\t\t" . '<changefreq>daily</changefreq>';
        $sx_content .= "\n\t\t\t" . '<priority>0.8</priority>';
        $sx_content .= "\n\t\t</url>";
    }
}
$sx_content .= "\n\t</urlset>";
$xr = fopen('sitemap_tea.xml', "w");
$xrr = fwrite($xr, $sx_content);
This PHP script, found at ./shun-tea.php, checks whether the incoming request has a GET parameter named url. If the parameter is present, its value is assigned to the $request_url variable; otherwise, $request_url is set to an empty string.
If $request_url is not empty, the attacker-supplied value is used as the query string of a single sitemap entry. If it is empty, the script instead loops 5,000 times. On each pass it calls random(8), a helper not shown in this excerpt, to produce an eight-character string, appends .html to it, and uses the result as the query string of a new sitemap entry.
Generating Spam URL Locations
The generated sitemap is named sitemap_tea.xml. It contains the standard sitemap XML markup along with a loc (location) element for each SEO spam URL, built either from the random strings produced by the PHP script or from the value supplied in the attacker’s GET request.
The sitemap’s URL metadata also contains the current date as the lastmod time.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>hxxp://localhost/shun-tea.php?6jxwa9sf.html</loc>
        <lastmod>2020-06-01</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>hxxp://localhost/shun-tea.php?8egbvh6e.html</loc>
        <lastmod>2020-06-01</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>hxxp://localhost/shun-tea.php?bdhxzfdh.html</loc>
        <lastmod>2020-06-01</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
    </url>
...
This file makes it incredibly simple for the attacker to expose massive numbers of spam URLs, and the outgoing links in their spam content, to search engine crawlers in order to boost the SEO rankings of third-party sites.
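When reviewing a sitemap you did not create, it can help to flag entries that resemble the sample output shown earlier. The following is a minimal sketch, not part of the original hacktool: the function name find_suspicious_locs is hypothetical, and the pattern assumes the random strings are eight lowercase letters or digits, as they are in the sample. It will not catch entries built from an attacker-supplied url parameter, since those can contain anything.

```php
<?php
// Hypothetical triage helper (not from the hacktool itself).
// Returns the <loc> values that look like the generated spam entries:
// a .php script followed by "?<8 chars>.html".
// A regular expression is used instead of an XML parser so the sketch
// depends only on PHP core; treat the result as a hint, not a verdict.
function find_suspicious_locs(string $sitemapXml): array
{
    // Pull the text of every <loc> element out of the sitemap.
    preg_match_all('~<loc>\s*(.*?)\s*</loc>~s', $sitemapXml, $matches);

    $suspicious = [];
    foreach ($matches[1] as $loc) {
        // Assumption: random part is 8 lowercase alphanumerics, per the sample.
        if (preg_match('~\.php\?[a-z0-9]{8}\.html$~', $loc) === 1) {
            $suspicious[] = $loc;
        }
    }
    return $suspicious;
}
```

For example, calling find_suspicious_locs(file_get_contents('sitemap_tea.xml')) on the file described above would be expected to return every randomly generated entry. A legitimate sitemap should normally return an empty array, although a site whose real URLs happen to follow this shape would need manual review.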
Cleanup Instructions
Sites that have been infected with this hacktool can refer to the following steps to remove the infection.
- Find and remove the PHP file generating the sitemap*.xml with the SEO spam links.
- Check for any suspicious sitemaps in Google Search Console (formerly Google Webmaster Tools) and remove any that you did not submit.
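The first step above can be partly automated. The sketch below is a rough heuristic under stated assumptions, not a complete scanner: the function name scan_for_sitemap_generators is hypothetical, and it only flags PHP files that contain sitemap urlset markup together with a file-writing call in plain text. Obfuscated malware will not match, and legitimate sitemap plugins will, so every result needs manual review before anything is deleted.

```php
<?php
// Hypothetical triage helper: walks $root recursively and returns two lists,
// 'sitemaps'   => files named sitemap*.xml
// 'generators' => PHP files that mention <urlset> markup and write files.
function scan_for_sitemap_generators(string $root): array
{
    $found = ['sitemaps' => [], 'generators' => []];

    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($iterator as $file) {
        if (!$file->isFile()) {
            continue;
        }
        $path = $file->getPathname();

        if (preg_match('~^sitemap.*\.xml$~i', $file->getFilename()) === 1) {
            $found['sitemaps'][] = $path;
        } elseif (strtolower($file->getExtension()) === 'php') {
            $source = (string) file_get_contents($path);
            // Heuristic 1: the source contains sitemap <urlset> markup.
            $mentionsSitemap = stripos($source, '<urlset') !== false
                || stripos($source, '</urlset>') !== false;
            // Heuristic 2: the source calls a common file-writing function.
            $writesFile = preg_match(
                '~\b(fopen|fwrite|file_put_contents)\s*\(~i',
                $source
            ) === 1;
            if ($mentionsSitemap && $writesFile) {
                $found['generators'][] = $path;
            }
        }
    }
    return $found;
}
```

Running it against the site's document root, for example scan_for_sitemap_generators(__DIR__), gives a short list of files to inspect by hand.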
After that, you can follow the mitigation steps found in our guide on how to clean a hacked website.
The best way to prevent unwanted (and off-topic) spam on your site is to mitigate risk in the first place. Practice strong password security, keep your components up to date, limit user access and permissions, and follow website security best practices.
If you need a hand removing SEO spam from a compromised website, we’re always happy to help.