
[Robots.txt] Matching user-agent names does not conform to robots.txt RFC

Open YossiTamari opened this issue 8 years ago • 4 comments

http://www.csurams.com/robots.txt contains the lines:

    User-agent: Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly/) Gecko/2009032608 Firefox/3.0.8
    Disallow: /

SimpleRobotRulesParser parses this as disallowing access to any agent whose name starts with “Mozilla/5.0”. Based on the robots.txt RFC, this should not be the case: the rule applies only if the whole name of the agent is contained in the User-agent line. The problem stems from the fact that the User-agent line is split into "words", which seems to be contrary to the RFC.

YossiTamari avatar Jan 02 '18 12:01 YossiTamari

Thanks, @YossiTamari.

The intention was obviously to exclude Topsy's "Butterfly" bot. Of course, just copying the user-agent string from the webserver logs to the robots.txt isn't the right approach, and wasn't the intention of the "User-agent line name token" in the robots.txt RFC:

3.2.1 The User-agent line

Name tokens are used to allow robots to identify themselves via a simple product token. Name tokens should be short and to the point. The name token a robot chooses for itself should be sent as part of the HTTP User-agent header, and must be well documented.

Btw., there is no reason why the user-agent name sent in the HTTP header must also be used for robots.txt parsing: in Nutch you would set http.robots.agents to butterfly and http.agent.name to Mozilla/5.0 .... That may solve your problem in this concrete case, provided you have chosen a unique agent name. You hardly want to name your polite bot "Mozilla"; cf. Googlebot's full user agent strings.

From the comments in SimpleRobotRulesParser I guess that the actual implementation tries to cover edge cases such as the Butterfly one. The intention of the prefix match may be to match Butterfly/1.0 with the line User-agent: butterfly. However, that's dangerous in combination with splitting the user-agent line into words: User-agent: a b c bot would apply to all agents whose names start with one of the letters "a", "b" or "c". Either we should not split the user-agent line into words (as @YossiTamari suggested), or at least require that a full word is matched. Any thoughts?
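The dangerous combination of word splitting and prefix matching can be sketched like this (a hypothetical illustration of the matching logic described above, not the actual SimpleRobotRulesParser code):

```java
import java.util.Arrays;

public class WordSplitMatch {
    // Split the robots.txt User-agent line on whitespace and treat every
    // resulting token as a prefix pattern for the crawler's agent name.
    static boolean matches(String userAgentLine, String agentName) {
        String name = agentName.toLowerCase();
        return Arrays.stream(userAgentLine.toLowerCase().split("\\s+"))
                     .anyMatch(name::startsWith);
    }

    public static void main(String[] args) {
        // "a b c bot" unintentionally matches any agent whose name
        // starts with "a", "b" or "c"
        System.out.println(matches("a b c bot", "anybot"));  // true
        System.out.println(matches("a b c bot", "crawler")); // true
    }
}
```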

sebastian-nagel avatar Jan 11 '18 10:01 sebastian-nagel

@sebastian-nagel, the documentation for http.robots.agents says:

Any other agents, apart from 'http.agent.name', that the robots parser would look for in robots.txt.

So I don't think your work-around would work (or maybe I misunderstand your suggestion). The code in RobotRulesParser seems to match that description as well.

BTW, my actual use case is to use a short unique agent name in http.agent.name, but, for historical reasons, also support the longer name in http.robots.agents. I may be able to engineer a http.robots.agents value that will work with the current code, but it will be a hack.

My thinking on the correct implementation of this feature is that there are two cases:

  1. robots.txt was built using the short name token. In that case, there is no need to split the value; a prefix match against the user-agent header value should be enough.
  2. robots.txt was built by copying the full line. In this case, again, there is no need to split the value. Either a full match is required (this was the intention of the person writing the file), or a substring match.

Taking these two together, it seems that either a prefix match (strict) or a substring match (greedy) of the user-agent header against the user-agent line should be the correct solution.
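Assuming the match direction is "find the crawler's user-agent header in the (unsplit) robots.txt User-agent line", the two proposed options could be sketched as follows (hypothetical helpers, not crawler-commons API):

```java
public class UnsplitMatch {
    // Strict option: the crawler's UA header is a prefix of the
    // robots.txt User-agent line (case-insensitive, no word splitting).
    static boolean prefixMatch(String robotsLine, String uaHeader) {
        return robotsLine.toLowerCase().startsWith(uaHeader.toLowerCase());
    }

    // Greedy option: the crawler's UA header appears anywhere in the line.
    static boolean substringMatch(String robotsLine, String uaHeader) {
        return robotsLine.toLowerCase().contains(uaHeader.toLowerCase());
    }

    public static void main(String[] args) {
        String line = "Mozilla/5.0 (compatible; Butterfly/1.0)";
        System.out.println(prefixMatch(line, "mozilla/5.0"));      // true
        System.out.println(substringMatch(line, "butterfly/1.0")); // true
        System.out.println(prefixMatch(line, "butterfly/1.0"));    // false
    }
}
```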

YossiTamari avatar Jan 11 '18 11:01 YossiTamari

The new RFC draft, section 2.2.1, is very specific:

Crawlers set a product token to find relevant groups. The product token MUST contain only "a-zA-Z_-" characters. [...]

Crawlers MUST find the group that matches the product token exactly, and then obey the rules of the group. If there is more than one group matching the user-agent, the matching groups' rules MUST be combined into one group. The matching MUST be case-insensitive.

In short: there should be just a case-insensitive comparison of the two user-agent names (the one passed to the parser and the one from the robots.txt), and

  1. no splitting of the user-agent into tokens
  2. no prefix match
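The draft's rules could be sketched like this (a minimal illustration; the PRODUCT_TOKEN pattern follows the quoted "a-zA-Z_-" requirement, and groupMatches is a hypothetical helper name):

```java
import java.util.regex.Pattern;

public class ExactMatch {
    // Valid product tokens per the draft: only a-z, A-Z, "_" and "-".
    static final Pattern PRODUCT_TOKEN = Pattern.compile("[a-zA-Z_-]+");

    static boolean isValidProductToken(String token) {
        return PRODUCT_TOKEN.matcher(token).matches();
    }

    // A group matches iff the User-agent value equals the crawler's
    // product token, compared case-insensitively: no word splitting,
    // no prefix match.
    static boolean groupMatches(String userAgentLine, String productToken) {
        return userAgentLine.equalsIgnoreCase(productToken);
    }

    public static void main(String[] args) {
        System.out.println(groupMatches("ExampleBot", "examplebot")); // true
        System.out.println(groupMatches("example", "examplebot"));    // false: no prefix match
        System.out.println(isValidProductToken("Butterfly/1.0"));     // false: "/", "." and digits not allowed
    }
}
```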

Notes:

  • the RFC unit tests (#360) include a couple of tests to verify that only valid "product tokens" are accepted. Blocks with an invalid user-agent are ignored; alternatively (specific to Googlebot), only the part up to the first space is used.

  • matching the user-agent from the robots.txt by prefix may cause the wrong block to be selected, see the unit test in #362. This especially applies if multiple blocks are merged (see #351).

  • this would contradict the Javadoc of BaseRobotsParser.parseContent(...):

    An agent name is considered a match if it's a prefix match on the provided robot name.

    However, the provided example

    if you pass in "Mozilla Crawlerbot-super 1.0", this would match "crawlerbot"

    does not state that crawler or c would also be considered matches. Additionally, the documentation of the parameter robotNames instructs users to pass

    just the name portion, w/o version or other details

To conform with the RFC:

  • shouldn't we require that users pass (a list of) complete user-agent names which are not split into tokens?
  • for the robots.txt we could (optionally) accept multi-token user-agent lines, so that the rule of the old RFC is still followed:

    The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring.
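Combining both, a matcher could (hypothetically) default to the new draft's exact matching while optionally keeping the old RFC's substring rule as a legacy mode; the flag name and helper below are illustrative only:

```java
public class CombinedMatch {
    // exactMatching = true  -> new draft: case-insensitive equality,
    //                          no splitting, no prefix/substring match
    // exactMatching = false -> old RFC: the crawler's name token is a
    //                          substring of the User-agent line
    static boolean matches(String userAgentLine, String nameToken,
                           boolean exactMatching) {
        if (exactMatching) {
            return userAgentLine.equalsIgnoreCase(nameToken);
        }
        return userAgentLine.toLowerCase().contains(nameToken.toLowerCase());
    }

    public static void main(String[] args) {
        String line = "Mozilla/5.0 (compatible; Butterfly/1.0)";
        System.out.println(matches(line, "butterfly", false)); // true (substring)
        System.out.println(matches(line, "butterfly", true));  // false (exact)
    }
}
```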

sebastian-nagel avatar Feb 20 '22 21:02 sebastian-nagel

(reopened: this is not fixed by #351, cf. unit tests in #362)

sebastian-nagel avatar Aug 11 '22 12:08 sebastian-nagel

Fixed by #362 if exact user-agent matching is enabled (now the default), see also the corresponding unit test.

sebastian-nagel avatar Apr 24 '23 15:04 sebastian-nagel