Support full wildcard syntax in robots.txt directives
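At present we only support trailing * wildcards. Ideally we should support the full syntax (* matching any sequence of characters anywhere in the path, and $ anchoring the end of the URL) as defined in https://developers.google.com/search/reference/robots_txt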
The code to modify would be:
https://github.com/internetarchive/heritrix3/blob/05811705ed996122bea1f4e034c1ed5f7a07240f/modules/src/main/java/org/archive/modules/net/RobotsDirectives.java#L40-L42
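As a rough illustration of the matching side only, here is a minimal sketch that translates a robots.txt path pattern into a `java.util.regex.Pattern`. The class and method names (`RobotsWildcardSketch`, `compile`, `matches`) are hypothetical and are not part of Heritrix; it assumes the semantics in the Google document linked above, where `*` matches any run of characters and a trailing `$` anchors the end of the URL.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not existing Heritrix code.
final class RobotsWildcardSketch {

    /** Translate a robots.txt path pattern into a regex matched against the whole path. */
    static Pattern compile(String robotsPattern) {
        // A trailing '$' means the pattern must match up to the end of the URL.
        boolean anchoredAtEnd = robotsPattern.endsWith("$");
        String body = anchoredAtEnd
                ? robotsPattern.substring(0, robotsPattern.length() - 1)
                : robotsPattern;

        StringBuilder regex = new StringBuilder();
        // Limit -1 keeps trailing empty strings, so a trailing '*' is preserved.
        String[] literals = body.split("\\*", -1);
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*");                    // each '*' matches any run of characters
            }
            regex.append(Pattern.quote(literals[i]));  // everything else is matched literally
        }
        if (!anchoredAtEnd) {
            regex.append(".*");                        // without '$', the pattern is a prefix match
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** True if the path (including any query string) matches the robots.txt pattern. */
    static boolean matches(String robotsPattern, String path) {
        return compile(robotsPattern).matcher(path).matches();
    }
}
```

With this sketch, `/*.php` matches `/folder/filename.php?parameters`, while `/*.php$` matches `/filename.php` but not `/filename.php?parameters`. An empty pattern matches everything here, so the "empty Disallow means allow all" case would still need to be handled by the caller.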
Matching the wildcards themselves is not that difficult, but getting the precedence between allow and disallow rules right is harder. Perhaps we could use an existing library, e.g. the crawler-commons code?
See https://github.com/crawler-commons/crawler-commons/blob/bef1b8437e63930bcbab82a3f754bf835cda5cca/src/main/java/crawlercommons/robots/SimpleRobotRules.java#L153
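To make the precedence question concrete, here is a minimal sketch of the rule described in the Google document: the matching rule with the longest pattern wins, and when an allow and a disallow rule tie on length, the allow rule wins. This is not the crawler-commons or Heritrix implementation; `RobotsPrecedenceSketch`, `Rule`, and `prefixRule` are hypothetical names, and the `matcher` predicate stands in for whatever wildcard matching is eventually used.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not existing Heritrix or crawler-commons code.
final class RobotsPrecedenceSketch {

    /** One Allow/Disallow line from a robots.txt group. */
    static final class Rule {
        final String pattern;             // the path pattern as written in robots.txt
        final boolean allow;              // true for Allow, false for Disallow
        final Predicate<String> matcher;  // assumed: decides whether a path matches the pattern

        Rule(String pattern, boolean allow, Predicate<String> matcher) {
            this.pattern = pattern;
            this.allow = allow;
            this.matcher = matcher;
        }
    }

    /** Plain prefix rule without wildcards, for illustration only. */
    static Rule prefixRule(String pattern, boolean allow) {
        return new Rule(pattern, allow, path -> path.startsWith(pattern));
    }

    /**
     * Longest matching pattern wins; on equal length an Allow beats a Disallow;
     * a path that matches no rule is allowed.
     */
    static boolean isAllowed(List<Rule> rules, String path) {
        Rule best = null;
        for (Rule rule : rules) {
            if (!rule.matcher.test(path)) {
                continue;  // rule does not apply to this path
            }
            boolean longer = best == null
                    || rule.pattern.length() > best.pattern.length();
            boolean tieWonByAllow = best != null
                    && rule.pattern.length() == best.pattern.length()
                    && rule.allow;
            if (longer || tieWonByAllow) {
                best = rule;
            }
        }
        return best == null || best.allow;
    }
}
```

Because the winner is chosen by pattern length rather than by position in the file, the order of the rules does not affect the outcome. For example, with `Allow: /p` and `Disallow: /`, the path `/page` is allowed because `/p` is the longer matching pattern.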
Possibly of interest: Google's own parser, https://github.com/google/robotstxt
There's also a Java port of Google's parser, https://github.com/google/robotstxt-java/, but unfortunately it doesn't seem to be in Maven Central.