[Robots.txt] Matching user-agent names does not conform to robots.txt RFC
http://www.csurams.com/robots.txt contains the lines:

    User-agent: Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8
    Disallow: /
SimpleRobotRulesParser parses this as disallowing access to any agent name that starts with "Mozilla/5.0". Based on the robots.txt RFC, this should not be the case: the rule applies only if the whole name of the agent is contained in the User-agent line. The problem stems from the fact that the User-agent line is split into "words", which seems to be contrary to the RFC.
Thanks, @YossiTamari.
The intention was obviously to exclude Topsy's "Butterfly" bot. Of course, just copying the user-agent string from the web server logs into the robots.txt isn't the right approach and wasn't the intention of the "User-agent line name token" in the robots.txt RFC:
3.2.1 The User-agent line
Name tokens are used to allow robots to identify themselves via a simple product token. Name tokens should be short and to the point. The name token a robot chooses for itself should be sent as part of the HTTP User-agent header, and must be well documented.
Btw., there is also no reason why the user-agent name sent in the HTTP header must be the same one used for robots.txt parsing: in Nutch you would set http.robots.agents to butterfly and http.agent.name to Mozilla/5.0 .... That may solve your problem in the concrete case, provided you have chosen a unique agent name. You hardly want to name your polite bot "Mozilla"; cf. Googlebot's full user agent strings.
From the comments in SimpleRobotRulesParser I guess that the actual implementation tries to cover edge cases such as the Butterfly one. The intention of the prefix match may be to match Butterfly/1.0 against the line User-agent: butterfly. However, that's dangerous in combination with splitting the user-agent line into words: User-agent: a b c bot would apply to all agents whose name starts with one of the letters a, b, or c. Either we should not split the user-agent line into words (as @YossiTamari suggested) or at least require that a full word is matched. Any thoughts?
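To illustrate the hazard, here is a minimal sketch of the problematic logic. This is hypothetical code, not the actual SimpleRobotRulesParser implementation: it splits the User-agent line into words and treats each word as a prefix of the crawler's name.

```java
// Hypothetical sketch of split-then-prefix-match (NOT the real
// SimpleRobotRulesParser code): each word of the User-agent line is
// checked as a prefix of the crawler's name.
public class SplitPrefixMatch {

    static boolean matches(String userAgentLine, String robotName) {
        String name = robotName.toLowerCase();
        for (String token : userAgentLine.toLowerCase().split("\\s+")) {
            if (name.startsWith(token)) {
                return true; // any word that prefixes the name matches
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "User-agent: a b c bot" unintentionally matches any crawler
        // whose name starts with 'a', 'b' or 'c':
        System.out.println(matches("a b c bot", "anybot")); // true
        System.out.println(matches("a b c bot", "curl"));   // true
        System.out.println(matches("a b c bot", "mybot"));  // false
    }
}
```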
@sebastian-nagel, the documentation for http.robots.agents says:
Any other agents, apart from 'http.agent.name', that the robots parser would look for in robots.txt.
So I don't think your work-around would work (or maybe I misunderstand your suggestion). The code in RobotRulesParser seems to match that description as well.
BTW, my actual use case is to use a short unique agent name in http.agent.name, but, for historical reasons, also support the longer name in http.robots.agents. I may be able to engineer a http.robots.agents value that will work with the current code, but it will be a hack.
My thinking on the correct implementation of this feature is that there are two cases:
- robots.txt was built using the short name token. In that case, there is no need to split the value; a prefix match against the user-agent header value should be enough.
- robots.txt was built by copying the full line. In this case, again, there is no need to split the value. Either a full match is required (this was the intention of the person writing the file), or a substring match is required.
Taking these two together, it seems that either a prefix match (strict) or a substring match (greedy) of the user-agent header against the user-agent line should be the correct solution.
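A sketch of the two suggested strategies, without any word splitting. This is hypothetical code, and the matching direction (the robot's name looked up in the robots.txt User-agent line) is an assumption:

```java
// Hypothetical sketch: compare the robot's name against the whole
// robots.txt User-agent line, with no splitting into words.
public class LineMatch {

    // strict: the User-agent line starts with the robot's name
    static boolean prefixMatch(String uaLine, String robotName) {
        return uaLine.toLowerCase().startsWith(robotName.toLowerCase());
    }

    // greedy: the User-agent line contains the robot's name anywhere
    static boolean substringMatch(String uaLine, String robotName) {
        return uaLine.toLowerCase().contains(robotName.toLowerCase());
    }

    public static void main(String[] args) {
        // a line built by copying the full HTTP User-agent string
        String copiedLine = "Mozilla/5.0 (compatible; Butterfly/1.0; "
                + "+http://labs.topsy.com/butterfly/)";
        System.out.println(substringMatch(copiedLine, "butterfly")); // true
        System.out.println(prefixMatch(copiedLine, "butterfly"));    // false
        // a line using the short name token: strict matching suffices
        System.out.println(prefixMatch("Butterfly/1.0", "butterfly")); // true
    }
}
```

With the greedy variant, the Butterfly rule from the original report would apply to the Butterfly bot and to no one else.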
The new RFC draft, section 2.2.1 is very specific:
Crawlers set a product token to find relevant groups. The product token MUST contain only "a-zA-Z_-" characters. [...]
Crawlers MUST find the group that matches the product token exactly, and then obey the rules of the group. If there is more than one group matching the user-agent, the matching groups' rules MUST be combined into one group. The matching MUST be case-insensitive.
In short: there should be just a case-insensitive comparison of the two user-agent names (the one passed to the parser and the one from the robots.txt), and:
- no splitting of the user-agent into tokens
- no prefix match
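A sketch of what matching per the new RFC draft would look like. This is hypothetical code, not a proposed patch: the product token is validated against the allowed character set, and the comparison is an exact, case-insensitive equality check.

```java
// Hypothetical sketch of new-RFC-draft matching: validate the product
// token, then compare for exact, case-insensitive equality. No word
// splitting, no prefix or substring matching.
public class ExactMatch {

    // product tokens MUST contain only "a-zA-Z_-" (RFC draft, 2.2.1)
    static boolean isValidProductToken(String token) {
        return token.matches("[a-zA-Z_-]+");
    }

    static boolean matches(String uaLineValue, String productToken) {
        return uaLineValue.trim().equalsIgnoreCase(productToken);
    }

    public static void main(String[] args) {
        System.out.println(isValidProductToken("examplebot"));       // true
        System.out.println(isValidProductToken("examplebot/1.0"));   // false
        System.out.println(matches("ExampleBot", "examplebot"));     // true
        System.out.println(matches("examplebot/1.0", "examplebot")); // false
        System.out.println(matches("a b c bot", "anybot"));          // false
    }
}
```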
Notes:
- the RFC unit tests (#360) include a couple of tests to verify that only valid "product tokens" are accepted. Blocks with an invalid user-agent are ignored, or (specific to Googlebot) only the part up to the first space is used.
- matching the user-agent from the robots.txt by prefix may cause the wrong block to be selected, see the unit test in #362. This especially applies if multiple blocks are merged (see #351).
- this would contradict the Javadoc of BaseRobotsParser.parseContent(...):
An agent name is considered a match if it's a prefix match on the provided robot name.
However, the provided example
if you pass in "Mozilla Crawlerbot-super 1.0", this would match "crawlerbot"
does not state that `crawler` or `c` are also considered matches. Additionally, the documentation of the parameter `robotNames` gives the instruction to pass "just the name portion, w/o version or other details".
To conform with the RFC:
- shouldn't we assume that users pass (a list of) complete user-agent names which are not split into tokens?
- for the robots.txt we could (optionally) accept multi-token user-agent lines, so that the old RFC is still honored:
The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring.
(reopened: this is not fixed by #351, cf. unit tests in #362)
Fixed by #362 if exact user-agent matching is enabled (now the default), see also the corresponding unit test.