Security from Bad User Agents

One way to protect your site is blocking bad “User Agents”.

A user agent is the way your browser identifies itself to your server. Your server could use this to send different versions of your site to the browser; for example, not sending images to a Lynx browser that only displays text, or to a browser that is a screen reader for blind or low-vision people. My Firefox browser has this user agent:

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0

Search engines also make requests of your server and identify themselves with a user agent, as do hacker bots.

The User Agent can be changed. Some browsers pretend to be a version of Internet Explorer. Some hacker bots pretend to be any major browser they like, switching for different requests. Some pretend to be Google’s search engine bot.

Web site developers use browser plugins to send a different user agent string, so they can test what their web site sends to people with other browsers. (I’m using the Firefox “User Agent Overrider” plugin to change my browser’s user agent.) On my server I detect the user agent, which tells me the browser type and version, the operating system, whether it is a mobile device, and more, using a PHP script from http://techpatterns.com/downloads/php_browser_detection.php. I think this is much nicer than using JavaScript, since I only send the browser the HTML/CSS that suits it best.
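To give a flavor of what that kind of detection does, here is a bare-bones sketch of reading the user agent in PHP. This is only my illustration of the idea, not the techpatterns.com script, which does far more:

<?php
// A bare-bones illustration of server-side user agent detection.
// This is NOT the techpatterns.com script; it only shows the general idea.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Very rough checks for mobile devices and for the Firefox version.
$isMobile  = (bool) preg_match('/Mobile|Android|iPhone|iPad/i', $ua);
$isFirefox = (bool) preg_match('/Firefox\/(\d+)/', $ua, $match);
$firefoxVersion = $isFirefox ? (int) $match[1] : null;

// From here the server can choose which HTML/CSS to send to this browser.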

So the User Agent isn’t a good field to check for security. There are other parts of a request that bad bots can’t fake, for example the bad words and phrases in the URL they are looking for on your site.
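For a taste of that kind of check, here is a rough sketch using the same SetEnvIf technique I use below for user agents, applied to the requested URL instead (the phrases are just common probe patterns, not a recommended list):

SetEnvIfNoCase Request_URI (wp-config\.php|etc/passwd|\.\./) badRequest=$1

A rewrite or deny rule can then act on badRequest the same way the rules below act on badUserAgent.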

But checking the user agent is good for blocking nuisances that waste your site resources.

How Do You Block a User Agent?

Almost all shared hosting accounts run on Apache servers, and Microsoft IIS is much harder for most people to use, so I’ll give Apache .htaccess examples.

The “Better WP Security” plugin for WordPress has a good-sized list of bad User Agents that it can add to your site’s list.

Here’s some of their code for blocking user agents (all the other examples I’ve seen use the exact same approach):

RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [NC,OR]

The ^ means “at the start” of the User Agent; [NC] makes the match case-insensitive, and [OR] means the rule fires if any one of the conditions matches.
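That excerpt stops mid-chain: the plugin lists many more agents, and a chain in this style has to end with a condition that drops the OR, followed by a RewriteRule that refuses the request. Roughly like this (a sketch of the shape, not the plugin’s exact code):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [NC]
RewriteRule ^.* - [F,L]

The [F] flag answers the request with a 403 Forbidden.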

Here’s how I do the same thing, but in a way that lets me set exceptions to the security rule and log what specifically triggered it (the $1 saves the phrase that matched):

SetEnvIfNoCase User-Agent (eirgrabber|emailcollector|emailsiphon|emailwolf) badUserAgent=$1

When one rule covers several user agents (the | separates them) and a value contains special characters, you must escape them:
\s for a space, \. for a period; characters such as @ : ! can stay as they are.
For this User-Agent: Bot mailto:craftbot@yahoo.com use this in your .htaccess: Bot\smailto:craftbot@yahoo\.com
Then I send bad user agent requests to a page this way:

RewriteCond %{REQUEST_URI} !/shared/bad-webbot\.php$ [NC]
RewriteCond %{REQUEST_URI} !/shared/403\.php$ [NC]
RewriteCond %{ENV:badUserAgent} (.+)
RewriteRule (.*) /shared/bad-webbot\.php [E=badUserAgent:%0,L]

First I check that the file being requested isn’t one of my “error” pages. The user agent doesn’t change when I redirect to an error page, so if I don’t make an exception for my error pages, the server hits a security conflict (you told me to display this page because of that security rule, but the page itself violates the rule) and gives up with a bare “500: Server Error”. The last RewriteCond makes sure the rewrite only fires when badUserAgent was actually set. Then I redirect whatever file was requested to my “you’re a nuisance” page.

The E=badUserAgent:%0 sets an environment variable, using the value matched by the last RewriteCond, so my bad-webbot.php page can read what the trigger was.

The ,L says this is the Last rule to apply in the .htaccess; stop processing and go right to the page.

My bad-webbot.php just displays minimal HTML, with no style sheet. But it also logs the IP address and what makes me think they’re a nuisance, in this case which phrase in their user agent triggered the security script. If something is being blocked incorrectly, I can see it in the log.
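To give you an idea, here is a stripped-down sketch of such a page; it is not my actual file, and the log path is only a placeholder. After the internal rewrite, Apache may hand the variable to PHP with a REDIRECT_ prefix, so the sketch checks both names:

<?php
// Stripped-down sketch of a bad-webbot page (placeholder log path, not my real file).
// After the internal rewrite, the variable may arrive as REDIRECT_badUserAgent.
$trigger = isset($_SERVER['REDIRECT_badUserAgent']) ? $_SERVER['REDIRECT_badUserAgent']
    : (isset($_SERVER['badUserAgent']) ? $_SERVER['badUserAgent'] : 'unknown');
$ip    = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : 'unknown';
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Log the date, IP address, the phrase that triggered the rule, and the full user agent.
$line = date('c') . "\t" . $ip . "\t" . $trigger . "\t" . $agent . "\n";
file_put_contents(__DIR__ . '/bad-webbot.log', $line, FILE_APPEND | LOCK_EX);

// Minimal HTML, no style sheet (the 403 status line is optional).
header('HTTP/1.1 403 Forbidden');
header('Content-Type: text/html; charset=utf-8');
echo '<html><body><p>Automated requests from this client are blocked.</p></body></html>';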

One exception I found for user agents was for LinkedIn. “Jakarta” is one of the agents suggested for blocking, but LinkedIn’s bot announces itself with this user agent:

LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com)

Another I made an exception for is Alexa’s ia_archiver crawler. Here’s how I made both exceptions:

SetEnvIfNoCase User-Agent ia_archiver\ \(\+http://www\.alexa\.com/site/help/webmasters;\ crawler@alexa\.com\) !badUserAgent
SetEnvIfNoCase User-Agent LinkedInBot/1\.0\ \(compatible;\ Mozilla/5\.0;\ Jakarta\ Commons-HttpClient/3\.1\ \+http://www\.linkedin\.com\) !badUserAgent

Notice how I put a backslash before the spaces, periods, parentheses, and plus signs inside the user agent string, so Apache treats them as literal characters instead of splitting the line or reading them as regular-expression syntax.

The ! before the variable (badUserAgent) unsets the environment variable. Since these directives are processed in the order they appear, the exception lines go after the line that sets badUserAgent.
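Putting the pieces together, the order in the .htaccess matters: set the variable, unset it for the exceptions, then rewrite. Here is a condensed sketch built from the examples above (I added jakarta to the phrase list so the LinkedIn exception has something to override; your paths and phrase list will differ):

SetEnvIfNoCase User-Agent (eirgrabber|emailcollector|emailsiphon|emailwolf|jakarta) badUserAgent=$1
SetEnvIfNoCase User-Agent LinkedInBot/1\.0\ \(compatible;\ Mozilla/5\.0;\ Jakarta\ Commons-HttpClient/3\.1\ \+http://www\.linkedin\.com\) !badUserAgent

RewriteEngine On
RewriteCond %{REQUEST_URI} !/shared/bad-webbot\.php$ [NC]
RewriteCond %{REQUEST_URI} !/shared/403\.php$ [NC]
RewriteCond %{ENV:badUserAgent} (.+)
RewriteRule (.*) /shared/bad-webbot\.php [E=badUserAgent:%0,L]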

