robots.js
Don't return first rule match for canFetch
Currently the first matching rule is returned, but I don't think that's a good idea.
For example, this will always be true:

User-Agent: *
Allow: /
Disallow: /admin/
Disallow: /redirect/
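To make the failure mode concrete, here is a minimal stand-alone sketch of first-match evaluation. It is not the library's actual code: each rule is reduced to a path prefix plus an allowance flag, and appliesTo() is approximated with a simple prefix test.

// Rules in the order they appear in the robots.txt above.
var rules = [
  { path: '/',          allowance: true  },  // Allow: /
  { path: '/admin/',    allowance: false },  // Disallow: /admin/
  { path: '/redirect/', allowance: false }   // Disallow: /redirect/
];

// First-match semantics: stop at the first rule whose path prefixes the URL.
function firstMatchAllowance(url) {
  for (var i = 0; i < rules.length; i++) {
    if (url.indexOf(rules[i].path) === 0) {
      return rules[i].allowance;
    }
  }
  return true; // no rule applies
}

firstMatchAllowance('/admin/users'); // true -- 'Allow: /' matches first, so 'Disallow: /admin/' is never reached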
Change /lib/entry.js to:
Entry.prototype.allowance = function(url) {
  ut.d('* Entry.allowance, url: ' + url);
  // Default to allowed, then let every applicable rule overwrite the result,
  // so the last matching rule wins.
  var ret = true;
  for (var i = 0, len = this.rules.length, rule; i < len; i++) {
    rule = this.rules[i];
    if (rule.appliesTo(url)) {
      ret = rule.allowance;
    }
  }
  return ret;
};
...this will return the allowance of the last matching rule instead of the first.
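Assuming Rule.appliesTo() amounts to a prefix test against the rule's path, the patched loop gives the intended results for the example file above. A compact, self-contained demonstration of the same last-match logic:

// Same hypothetical rules as in the first-match sketch above.
var rules = [
  { path: '/',          allowance: true  },
  { path: '/admin/',    allowance: false },
  { path: '/redirect/', allowance: false }
];

// Last-match semantics, mirroring the patched allowance() above.
function lastMatchAllowance(url) {
  var ret = true;
  for (var i = 0; i < rules.length; i++) {
    if (url.indexOf(rules[i].path) === 0) {
      ret = rules[i].allowance; // keep overwriting: the last applicable rule wins
    }
  }
  return ret;
}

lastMatchAllowance('/index.html');  // true  -- only 'Allow: /' applies
lastMatchAllowance('/admin/users'); // false -- 'Disallow: /admin/' is the last applicable rule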
This is a big issue, but I'm not totally sure returning the last matching rule really is the fix. Is there any way to determine whether one rule is more specific than another? Ultimately we want the most specific rule to win.
...it depends on how you want to interpret the rules. In most ACL cases you write something like:
- Disallow something
- Explicitly allow some specific thing
- Disallow some more specific thing that would normally be allowed by the rule before
Or the reverse case:
- Allow all
- Disallow a specific thing
- Allow a more specific thing that would normally be disallowed by the rule before
For robots.txt there is officially no "Allow" command; only the "Disallow" command is standard. So a robots.txt should normally contain only "Disallow" commands to ensure correct interpretation. In that case the current "return on first matching rule" behaviour is correct, and the fastest approach, as long as the file only has "Disallow" commands or you only respect "Disallow". But most big search engines also interpret an "Allow" command so they can crawl more pages, and in that case the last matching command rules, because it is always the most specific one; see the samples above.
And remember: robots.txt is NOT a "you should not crawl" command; it is more a "please, don't crawl" or "crawling of ... is not necessary".
So in my eyes it's up to the creator of the ACL to ensure the correct order of rules and the use of "Allow", and there is no way to determine a "more specific rule". It's like the army: the last order rules, if you respect the "Allow" command.
Very true... but I guess it really depends on whether you want it to accurately interpret all robots.txt files or just the ones that strictly follow the spec (practically none of them).
At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined.
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?csw=1#order-of-precedence-for-group-member-records
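For completeness, here is a hedged sketch of that "most specific rule wins" behaviour: pick the applicable rule with the longest path instead of the first or last match. It is a stand-alone illustration, not the library's code; rule.path and the prefix test are assumptions standing in for whatever the real Rule objects and appliesTo() do, and wildcard handling is left out since its precedence is undefined anyway.

// Sketch only: the longest matched path ('most specific') wins, per the Google
// documentation linked above. rule.path and prefix matching are assumptions.
function mostSpecificAllowance(rules, url) {
  var best = null;
  for (var i = 0; i < rules.length; i++) {
    var rule = rules[i];
    if (url.indexOf(rule.path) === 0) {              // stands in for rule.appliesTo(url)
      if (best === null || rule.path.length > best.path.length) {
        best = rule;                                 // keep the longest (most specific) match
      }
    }
  }
  return best === null ? true : best.allowance;      // no applicable rule => allowed
}

mostSpecificAllowance(
  [ { path: '/',          allowance: true  },
    { path: '/admin/',    allowance: false },
    { path: '/redirect/', allowance: false } ],
  '/admin/users');                                   // false: '/admin/' is longer than '/'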