cc-index-table
Improve extraction of host names and registered domains
- No host name is extracted in the following situations:
  - The URL contains four slashes after the protocol: `https:////example.org/`. While `java.net.URL` extracts an empty host name, Nutch's OkHttp-based protocol seems to fetch the resource as if there were only two slashes.
  - Similarly, `java.net.URL` and OkHttp behave differently if there is an overlong (or even invalid?) userinfo before the host name (`scheme://userinfo@hostname/`).
- IP addresses are not recognized as such if they end in a dot: `https://123.123.123.123./robots.txt`
- The extraction of registered domains (done by crawler-commons' `EffectiveTldFinder`) does not extract anything if the host name is equal to a public suffix (`gov.uk` or `kharkov.ua`, for example).
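
The first and third cases above can be reproduced with plain `java.net.URL`. The following is only a minimal sketch, not the project's code: the class `HostNameSketch` and its helper methods are hypothetical names introduced for illustration, and stripping a single trailing dot before the IPv4 check is an assumed fix, not necessarily the one the project will adopt.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class HostNameSketch {

    // Dotted-quad IPv4 address, each octet in the range 0-255.
    private static final Pattern IPV4 = Pattern.compile(
        "(?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\\.){3}"
        + "(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])");

    /** Removes a single trailing dot (the DNS root label), if present. */
    static String stripTrailingDot(String host) {
        return host.endsWith(".") ? host.substring(0, host.length() - 1) : host;
    }

    /** Assumed fix: check for an IPv4 address after dropping a trailing dot. */
    static boolean isIpv4(String host) {
        return IPV4.matcher(stripTrailingDot(host)).matches();
    }

    /** Host name as seen by java.net.URL, or null if the URL is malformed. */
    static String hostOf(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Four slashes: java.net.URL treats the authority as empty.
        System.out.println("[" + hostOf("https:////example.org/") + "]");

        // Trailing dot: the host keeps the dot, so a strict IPv4 check on
        // the raw host fails unless the dot is removed first.
        String host = hostOf("https://123.123.123.123./robots.txt");
        System.out.println(host + " is IPv4: " + isIpv4(host));
    }
}
```

The userinfo and public-suffix cases are left out on purpose, since they depend on the behavior of OkHttp and `EffectiveTldFinder` and cannot be shown with the standard library alone.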