Clever girl...

7pm, 17th November 2006 - Geek, Interesting, Web, Security, Sysadmin
Clever girl: One of the Velociraptors from Jurassic Park.

Just when you've got one in your sights, the other two attack you from the side.

I was surprised a couple of days ago to find a significant number of entries in the slow queries log on one of our web servers. Looking through them, we discovered that they were caused almost exclusively by someone searching for a fairly unusual string. The search string looked like this: "<A HREF = "http://example.com">example.com</A>". For all you non-web-geeks out there, that is the HTML code that creates a link to the site example.com, displayed as the text "example.com". A strange thing to enter into a search box, to be sure.

Why would anyone type that into a search box? My first thought was that it was referrer log spam, or some variation on it. The idea is that every request made to a website is stored in a log, along with the referrer the client claims to have come from, and some websites publish those logs, or statistics derived from them, as part of the website itself. Those published pages can be visited by real users, who may click on the link, or by search engine robots, which can increase the PageRank of the linked site. None of our websites do that, however, so that seemed unlikely to be the motivation.

I soon realised that the motivation was probably a little more clever, but clearly unscrupulous, so we decided to block his IP address. Strangely enough, he seemed to be coming from a whole range of IP addresses: a class-C range, to be precise, and pretty much at random within it. He also had a user-agent string of "Slurp". One quick reverse-DNS lookup later, we realised that this was the Yahoo! search engine's robot crawling our sites and running these searches, and therefore that blocking the IP addresses was not a good idea.
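For anyone who wants to script that kind of check, here is a rough Python sketch. It goes one step further than the single reverse lookup we did by hand: it also resolves the returned hostname forwards again and confirms it maps back to the same IP, since reverse-DNS records are controlled by whoever owns the address block. The function name, the pluggable `reverse` and `forward` resolvers, and the hostname suffix in the usage note are all my own assumptions for illustration; substitute whatever domain the crawler's operator actually documents.

```python
import socket


def is_verified_crawler(ip, allowed_suffixes, reverse=None, forward=None):
    """Return True if `ip` reverse-resolves to a host under one of
    `allowed_suffixes` AND that host resolves forwards to `ip` again.

    `reverse` and `forward` are hypothetical hooks so the lookups can be
    swapped out (for example in tests); by default they use real DNS.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        host = reverse(ip).lower().rstrip(".")
    except OSError:
        # No PTR record, or the lookup failed: treat as unverified.
        return False

    # The hostname must be the suffix itself or a subdomain of it.
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False

    try:
        # Forward-confirm: the claimed hostname must map back to this IP.
        return ip in forward(host)
    except OSError:
        return False
```

Calling `is_verified_crawler("192.0.2.10", ["crawl.example.net"])` would do the real lookups; both the address and the suffix there are placeholders, not Yahoo!'s actual values.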

So why was Yahoo! doing searches for random bits of HTML on our sites? The answer lay in another site, which we found via Google, that carried a large list of links, each pointing to a search results page on one of our sites. The idea was similar to referrer log spam, but rather than creating a bot with a link in its referrer string, this one used search engine bots to attempt to insert links into our search results, then index those pages and potentially increase the PageRank of the linked site. It's unlikely to fool real users, but they were not the target here; this was all about getting higher in Yahoo!'s and Google's search results pages.

We couldn't let this continue, and the easiest solution was simply to disallow robots from indexing search results pages. This had the added advantage of reducing the load on the server caused by running all those searches that no one was looking at anyway. Also, no one wants to find a search results page linked from Google. If you are using a search engine to search for a particular topic, you want pages on that topic, not pages that redirect you to pages that redirect you to pages on that topic. From now on, all search results pages that I deal with will be disallowed to all bots. Bots have no searches of their own to run, anyone who links to a search results page is likely to be up to no good, and there's no point in search engines indexing those pages anyway.
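To give an idea of what that looks like in practice, here is a small Python sketch that holds a two-line robots.txt and checks it with the standard library's `urllib.robotparser`. The `/search` path and the `example.com` URLs are assumptions for illustration only; the real rule has to match wherever your own site serves its search results.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: every bot is barred from anything under /search.
# The path is hypothetical; adjust it to your site's search URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def allowed(url, agent="Slurp"):
    """Return True if `agent` may fetch `url` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/search?q=spam",
                "http://example.com/about"):
        print(url, "->", "allowed" if allowed(url) else "blocked")
```

Bear in mind that robots.txt rules are prefix matches, so a rule like this would also cover a page such as /searching, and that only well-behaved bots honour the file at all. That is fine here, since the bots being abused were legitimate search engine crawlers.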

To finish off, I thought I'd leave you with another quote from the same movie that the quote in the title is from: "It's a UNIX system! I know this!"
