heritrix3

Support full wildcard syntax in robots.txt directives

Open anjackson opened this issue 6 years ago • 3 comments

At present we only support trailing * wildcards. Ideally we should support the full wildcard syntax (a * anywhere in the path, plus the $ end-of-URL anchor) as defined in https://developers.google.com/search/reference/robots_txt

The code to modify would be:

https://github.com/internetarchive/heritrix3/blob/05811705ed996122bea1f4e034c1ed5f7a07240f/modules/src/main/java/org/archive/modules/net/RobotsDirectives.java#L40-L42

Matching the wildcards themselves is not that difficult, but getting the precedence between competing Allow and Disallow rules right is harder. Perhaps we can use an existing library, e.g. the crawler-commons code?
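
To make the two parts of the problem concrete, here is a minimal sketch in Java. It is not the Heritrix or crawler-commons implementation. The class and method names (`RobotsWildcardSketch`, `matches`, `isAllowed`) are hypothetical. The precedence rule is an assumption taken from Google's documented behaviour: the longest matching pattern wins, and a tie goes to Allow.

```java
import java.util.List;

/** Hypothetical sketch; not part of Heritrix or crawler-commons. */
final class RobotsWildcardSketch {

    /**
     * Returns true if the robots.txt path pattern matches the URL path.
     * '*' matches any run of characters; a trailing '$' anchors the
     * pattern to the end of the path. Without '$' the pattern is a prefix.
     */
    static boolean matches(String pattern, String path) {
        boolean anchored = pattern.endsWith("$");
        String p = anchored ? pattern.substring(0, pattern.length() - 1) : pattern;
        if (!anchored) {
            p = p + "*"; // a prefix match is the same as an implicit trailing wildcard
        }
        int pi = 0, si = 0, star = -1, mark = 0;
        while (si < path.length()) {
            if (pi < p.length() && p.charAt(pi) == '*') {
                star = pi++;   // remember the wildcard position
                mark = si;     // and where in the path it started matching
            } else if (pi < p.length() && p.charAt(pi) == path.charAt(si)) {
                pi++;
                si++;
            } else if (star >= 0) {
                pi = star + 1; // backtrack: let the last '*' absorb one more character
                si = ++mark;
            } else {
                return false;
            }
        }
        while (pi < p.length() && p.charAt(pi) == '*') {
            pi++; // trailing wildcards may match the empty string
        }
        return pi == p.length();
    }

    /** Length of the longest non-empty pattern that matches, or -1 if none does. */
    static int longestMatch(String path, List<String> patterns) {
        int best = -1;
        for (String pattern : patterns) {
            if (!pattern.isEmpty() && matches(pattern, path)) {
                best = Math.max(best, pattern.length());
            }
        }
        return best;
    }

    /** Assumed precedence: longest matching pattern wins; ties go to Allow. */
    static boolean isAllowed(String path, List<String> allow, List<String> disallow) {
        return longestMatch(path, allow) >= longestMatch(path, disallow);
    }

    public static void main(String[] args) {
        List<String> allow = List.of("/public/*.php$");
        List<String> disallow = List.of("/*.php");
        System.out.println(isAllowed("/public/index.php", allow, disallow));
        System.out.println(isAllowed("/private/index.php", allow, disallow));
    }
}
```

The matching half is a standard backtracking glob match, which is the "not that difficult" part. The precedence half is the single comparison in `isAllowed`, and that is where libraries tend to differ, so any replacement would need to be checked against the behaviour we want.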

anjackson avatar Mar 29 '19 17:03 anjackson

See https://github.com/crawler-commons/crawler-commons/blob/bef1b8437e63930bcbab82a3f754bf835cda5cca/src/main/java/crawlercommons/robots/SimpleRobotRules.java#L153

anjackson avatar Mar 29 '19 17:03 anjackson

This may be of interest: https://github.com/google/robotstxt

jr-ewing avatar May 10 '23 17:05 jr-ewing

There's a Java port of Google's parser too, https://github.com/google/robotstxt-java/, but unfortunately it doesn't seem to be in Maven Central.

ato avatar May 11 '23 07:05 ato