A new client approached us recently with a sporadic problem on their Magento site. At certain points of the day, performance would fall off a cliff, yet according to Google Analytics there were no jumps in visitor numbers during these periods. Further investigation was required.
Once we were given access to the server, we were able to start checking the log files to get a feel for what was going on.
It turned out to be an issue with Baiduspider – the crawler for the Chinese search engine Baidu – which was hitting the site aggressively with 40-50 new sessions per minute for hours at a time. The hundreds of orphaned session files in the Magento var/sessions folder supported the theory.
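If you want to run the same check against your own logs, a small script along these lines will tally a crawler's requests per minute. It is only a sketch: it assumes a standard combined-format Nginx/Apache access log, and the log path is a placeholder you would swap for your own.

from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # placeholder - point this at your own access log

# Count Baiduspider requests per minute from a combined-format log line,
# e.g. ... [10/Oct/2016:13:55:36 +0000] "GET / HTTP/1.1" 200 ...
hits_per_minute = Counter()
with open(LOG_PATH) as log:
    for line in log:
        if "Baiduspider" not in line:
            continue
        try:
            timestamp = line.split("[", 1)[1].split("]", 1)[0]
        except IndexError:
            continue
        hits_per_minute[timestamp[:17]] += 1  # truncates to "10/Oct/2016:13:55"

for minute, hits in sorted(hits_per_minute.items()):
    print(minute, hits)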
This particular client sells within the UK only, so they have no need for their site to be indexed by a Chinese search engine. The easiest solution was to block the user agent from accessing the site. The polite way to achieve this is to place the following in your robots.txt file:
# Block Baidu
User-agent: Baiduspider
User-agent: baiduspider
User-agent: Baiduspider+
Disallow: /
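If you want to sanity-check the rule before relying on it, Python's built-in robots.txt parser can confirm that Baiduspider is refused while other crawlers are left alone. Nothing here is specific to Magento – it is just a quick standalone check.

from urllib.robotparser import RobotFileParser

# Parse the same rules as above and check who may fetch the site root
parser = RobotFileParser()
parser.parse([
    "User-agent: Baiduspider",
    "Disallow: /",
])

print(parser.can_fetch("Baiduspider", "/"))  # False - blocked
print(parser.can_fetch("Googlebot", "/"))    # True  - unaffected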
However, some of these rogue spiders do not check or honour robots.txt, so you need to be a bit more forceful and block their requests altogether.
For Nginx, this means placing the following in your vhost conf file:
location / {
    # Newer Baiduspider user agents don't start with "Baiduspider",
    # so match the name anywhere in the string rather than anchoring it
    if ($http_user_agent ~* Baiduspider) {
        return 403;
    }
}
For Apache, add this to your .htaccess file:
RewriteEngine On
# Return 403 Forbidden for any request whose user agent contains "Baiduspider" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule .* - [F,L]
Reload your web server configuration (Nginx needs a reload; .htaccess changes take effect straight away) and your spider problem has been squashed.
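To confirm the block is actually working, you can spoof the Baiduspider user agent and check that the server now answers with a 403. A rough sketch using Python's standard library – the URL is a placeholder for your own site:

from urllib.request import Request, urlopen
from urllib.error import HTTPError

URL = "https://www.example.com/"  # placeholder - use your own site
UA = "Baiduspider+(+http://www.baidu.com/search/spider.htm)"

# Send a request pretending to be Baiduspider and report the response code
request = Request(URL, headers={"User-Agent": UA})
try:
    response = urlopen(request, timeout=10)
    print("Not blocked: HTTP", response.getcode())
except HTTPError as err:
    print("Blocked as expected: HTTP", err.code)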