Common "robots" values are undocumented
warcinfo records may contain optional fields. One such field is robots. The only suggested value is classic. No other values are documented. This may lead to implementations choosing different values with the same meaning, making the field more difficult for consumers to interpret.
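For context, here is roughly how a consumer might read this field today. This is a minimal sketch, assuming the warcinfo payload is a block of `name: value` lines. The helper name `parse_warc_fields` and the sample payload are made up for illustration and are not taken from any particular crawler.

```python
def parse_warc_fields(payload: str) -> dict[str, str]:
    """Parse 'name: value' lines from a warcinfo body into a dict.

    Field names are lower-cased. Continuation lines are not handled.
    """
    fields = {}
    for line in payload.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'name: value'
        fields[name.strip().lower()] = value.strip()
    return fields


# Hypothetical warcinfo body, for illustration only.
SAMPLE = (
    "software: ExampleCrawler/1.0\r\n"
    "format: WARC File Format 1.1\r\n"
    "robots: classic\r\n"
)

fields = parse_warc_fields(SAMPLE)
robots_policy = fields.get("robots")  # None when the optional field is absent
```

A consumer gets back a free-form string here, which is why undocumented values are hard to interpret.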
I think this suggestion predates RFC 9309, which is the new hotness in the robots.txt space.
@notcancername Could you give examples of other used or suggested values, and what they would mean?
@wumpus I don't see how RFC 9309 solves this problem. What did you mean?
Heritrix, which is probably where the 'classic' value originally came from, currently has these policies:
classic (alias obey); ignore; robotsTxtOnly (obeys robots.txt but ignores the robots meta tag)
Another common configuration that might be useful to record is following robots.txt at the page level but ignoring it for subresources.
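The policies above could be modeled roughly as follows. This is only a sketch under assumptions: it assumes classic obeys both robots.txt and the robots meta tag (implied by the contrast with robotsTxtOnly) and that ignore obeys neither. The name `pageOnly` is hypothetical, invented here for the page-level configuration, and is not a Heritrix policy. Whether that configuration honours the meta tag is also a guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RobotsPolicy:
    obeys_robots_txt: bool
    obeys_meta_tag: bool
    applies_to_subresources: bool = True


POLICIES = {
    "classic": RobotsPolicy(True, True),        # assumed: obeys both
    "ignore": RobotsPolicy(False, False),       # assumed: obeys neither
    "robotsTxtOnly": RobotsPolicy(True, False),
    # Hypothetical name, not a Heritrix policy: robots.txt for pages only.
    "pageOnly": RobotsPolicy(True, True, applies_to_subresources=False),
}
ALIASES = {"obey": "classic"}


def lookup(name: str) -> RobotsPolicy:
    """Resolve an alias, then return the policy (KeyError if unknown)."""
    return POLICIES[ALIASES.get(name, name)]


def robots_txt_applies(name: str, is_subresource: bool) -> bool:
    """Return True if robots.txt should be consulted for this fetch."""
    policy = lookup(name)
    if not policy.obeys_robots_txt:
        return False
    return policy.applies_to_subresources or not is_subresource
```

Even this small table has three independent dimensions, which suggests a single string value may need a documented vocabulary to be useful.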
You can also use the value "obey" and, in org.archive.crawler.prefetch.PreconditionEnforcer, specify calculateRobotsOnly=true.
Then robots.txt is NOT obeyed, but excluded URIs are annotated in the crawl.log for use in presentation and usage decisions later.
robots.txt was only recently standardized, by RFC 9309.
Before that, practically every search engine crawler set its own standard, modifying the original RFC proposal from 1994.
My interpretation of classic was that it refers to the 1994 version. Since summer 2018, the Common Crawl WARC files instead refer to the version of the robots.txt parser library used by the crawler at the time.
Given multiple robots.txt standards (or specs) and implementations, a value of obey isn't really precise.
robotsTxtOnly (obeys robots.txt but ignores the robots meta tag)
Good point. Information about the meta tags should also be there. But again: interpretation of nofollow, noindex, noarchive, nosnippet etc. is not standardized; there's only an expired draft.
@sebastian-nagel is getting good at reading my mind! I was suggesting "rfc9309" as a value, since it's different from "classic".
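To make the suggestion concrete, a consumer could interpret the field against a small documented vocabulary, roughly like this. It is a sketch only: "rfc9309" is the value proposed in this thread and is not documented anywhere, the function name `describe_robots_value` is made up, and case-insensitive matching is an assumption.

```python
# Only "classic" is suggested by the spec; "rfc9309" is the proposal from this thread.
KNOWN_VALUES = {
    "classic": "suggested by the spec; exact meaning debated in this thread",
    "rfc9309": "proposed in this thread: robots.txt handled per RFC 9309",
}


def describe_robots_value(value):
    """Return a human-readable note for a warcinfo 'robots' value."""
    if value is None:
        return "not recorded"
    cleaned = value.strip()
    # Assumption: values are compared case-insensitively.
    return KNOWN_VALUES.get(cleaned.lower(), f"unrecognized value: {cleaned!r}")
```

Anything outside the table, such as a parser library version string, falls through as unrecognized, which is the interoperability problem this issue describes.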