This obscure bot popped up on my radar earlier this month. The complete user-agent string is:
PRCrawler/Nutch-0.9 (data mining development project; crawler@projectrialto.com)
The description in the string contains several clues that this bot is a waste of my bandwidth. First, Nutch is an open source search engine written in Java, so anyone can build a crawler on top of it. Second, ‘data mining’ is not an exercise I am interested in assisting, especially with my server resources. Finally, ‘development’ and ‘project’ both hint that this crawler is experimental and may do the world no good at all. Here is how the creators of this bot explain its purpose:
Corey,
Project Rialto is a new online security services solution provider that monetizes its infrastructure investment via relevant advertising for its users. We accomplish this in a very unobtrusive and anonymous method. Our bot is crawling in order to understand the contents of web sites our users visit to assist in serving more relevant content.
We are currently in our initial development phases. As Project Rialto approaches its market launch we’ll provide more information about our offering.
We hope this addresses your concerns; please let us know if you have any other questions.
Regards,
Kelvin Edmison
Software Architect
Project Rialto
This loosely translates to, “we scraped your site to serve someone advertisements based on its content.” I found traces of this bot in one of my error database tables, so we are certainly seeing evidence of a development phase. IncrediBILL agrees that this bot will do no good for your site, and has compiled an IP address list in his usual “get lost” fashion.
Here is some robots.txt love from me to you that will block the bot user-agent that hit me:
User-agent: PRCrawler
Disallow: /
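If you want to sanity-check a rule like this before uploading it, here is a minimal sketch using the robots.txt parser from Python's standard library. It assumes the rule is keyed on the short product token PRCrawler (the part of the user-agent before the slash), since that is what this particular parser compares against. The example.com URL and the SomeOtherBot name are placeholders of mine, and this only shows how Python's parser reads the rule, not how the bot's own Nutch code will read it.

```python
from urllib.robotparser import RobotFileParser

# Full user-agent string as it appeared in my logs.
BOT_UA = ("PRCrawler/Nutch-0.9 "
          "(data mining development project; crawler@projectrialto.com)")

# Assumption: the rule names only the short product token.
RULES = """\
User-agent: PRCrawler
Disallow: /
"""


def is_allowed(rules_text, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://example.com/any/page.html"  # placeholder URL
    print("PRCrawler allowed:", is_allowed(RULES, BOT_UA, url))
    print("Other bot allowed:", is_allowed(RULES, "SomeOtherBot/1.0", url))
```

Run as a script, this should report that PRCrawler is not allowed while the unrelated bot is, which is the behavior you want from the rule.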
Robots.txt exclusion only works for bots that behave properly. Bad bots ignore it, and the most dependable way to keep them off your site is to block, at the server, the IP addresses the bot uses.
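To make the IP-blocking idea concrete, here is a rough sketch of the check in Python. The network ranges are reserved documentation ranges that I am using as stand-ins, not the bot's real addresses, and the names BLOCKED_NETWORKS, BLOCKED_UA_TOKENS, and is_blocked are hypothetical. The user-agent check is an optional extra of mine; it only catches a bot that identifies itself honestly.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), NOT the bot's real IPs.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Optional extra: lowercase substrings to reject in the user-agent header.
BLOCKED_UA_TOKENS = ("prcrawler",)


def is_blocked(remote_addr, user_agent=""):
    """Return True if the request should be refused."""
    addr = ipaddress.ip_address(remote_addr)
    # Primary check: does the client address fall inside a blocked range?
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    # Secondary check: does the user-agent contain a blocked token?
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)
```

In practice you would more likely put the same list into your web server or firewall configuration, so the request is refused before it ever reaches your application code.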