cc-index-table Improve extraction of host names and registered domains

Open sebastian-nagel opened this issue 1 year ago • 0 comments

- No host name is extracted in the following situations:
  - The URL contains 4 slashes after the protocol: https:////example.org/ - java.net.URL extracts an empty host name, while Nutch's OkHttp-based protocol plugin seems to fetch the resource as if there were only two slashes.
  - Similarly, java.net.URL and OkHttp behave differently if there is an overlong (or even invalid?) userinfo before the host name (scheme://userinfo@hostname/).
- IP addresses are not recognized as such if they end in a dot: https://123.123.123.123./robots.txt
- The extraction of registered domains (done by crawler-commons' EffectiveTldFinder) does not return anything if the host name is equal to a public suffix (gov.uk or kharkov.ua, for example).
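
To make the first two observations easier to reproduce, here is a minimal sketch in plain Java. The class name HostExtractionSketch and the methods rawHost, lenientHost and isIpv4 are hypothetical and not part of cc-index-table, Nutch or crawler-commons. Collapsing extra slashes and ignoring one trailing dot are assumptions about what a more lenient extraction could do, not the actual fix.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class HostExtractionSketch {

    // "scheme:" followed by three or more slashes (assumed normalization rule).
    private static final Pattern EXTRA_SLASHES =
            Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*:)/{3,}");

    // Dotted-quad IPv4 address, each octet in 0-255, without a trailing dot.
    private static final Pattern IPV4 = Pattern.compile(
            "^(25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)"
            + "(\\.(25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)){3}$");

    // Host as reported by java.net.URL; null if the URL cannot be parsed.
    @SuppressWarnings("deprecation") // new URL(String) is deprecated since Java 20
    static String rawHost(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    // Collapses "scheme:///..." to "scheme://" before parsing, mimicking a
    // client that treats the extra slashes as if there were only two.
    static String lenientHost(String url) {
        return rawHost(EXTRA_SLASHES.matcher(url).replaceFirst("$1//"));
    }

    // Treats a host as an IPv4 address even if it ends in a single dot.
    static boolean isIpv4(String host) {
        if (host == null) {
            return false;
        }
        String h = host.endsWith(".") ? host.substring(0, host.length() - 1) : host;
        return IPV4.matcher(h).matches();
    }

    public static void main(String[] args) {
        String fourSlashes = "https:////example.org/";
        System.out.println("raw host:     [" + rawHost(fourSlashes) + "]");
        System.out.println("lenient host: [" + lenientHost(fourSlashes) + "]");

        String dottedIp = rawHost("https://123.123.123.123./robots.txt");
        System.out.println("is IPv4:      " + isIpv4(dottedIp));
    }
}
```

Whether such normalization is desirable is a separate question: it makes the extracted host name agree with what the fetcher apparently requested, but the stored URL string and the extracted host name then no longer correspond one-to-one.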

Apr 04 '23 13:04 sebastian-nagel