Improve extraction of host names and registered domains

Open sebastian-nagel opened this issue 1 year ago • 0 comments

  • no host name is extracted in the following situations:
    • the URL contains four slashes after the protocol (https:////example.org/): java.net.URL extracts an empty host name, while Nutch's OkHttp-based protocol plugin seems to fetch the resource as if there were only two slashes
    • similarly, java.net.URL and OkHttp behave differently if there is an overlong (or even invalid?) userinfo before the host name (scheme://userinfo@hostname/)
  • IP addresses are not recognized as such if they end in a dot: https://123.123.123.123./robots.txt
  • the extraction of registered domains (done by crawler-commons' EffectiveTldFinder) does not extract anything if the host name is equal to a public suffix (gov.uk, kharkov.ua for example)
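
To make the first and third cases easier to reproduce, here is a minimal sketch using only the JDK. The class name `HostNameSketch` and the helpers `stripTrailingDot`, `isIpv4` and `hostOf` are hypothetical names chosen for illustration; they are not part of cc-index-table, Nutch or crawler-commons. Stripping a single trailing dot before the IP check is one possible fix, not necessarily the one this project should adopt.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Pattern;

public class HostNameSketch {

    // Dotted-quad IPv4 address, each octet in the range 0..255 (illustrative pattern).
    private static final Pattern IPV4 = Pattern.compile(
        "(?:(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)\\.){3}"
        + "(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)");

    // Remove one trailing dot (the DNS root label), if present.
    static String stripTrailingDot(String host) {
        return host.endsWith(".") ? host.substring(0, host.length() - 1) : host;
    }

    // Assumed fix: ignore a trailing dot when checking for an IPv4 address.
    static boolean isIpv4(String host) {
        return IPV4.matcher(stripTrailingDot(host)).matches();
    }

    // Host name as seen by java.net.URL (the constructor is deprecated since Java 20
    // but still available).
    static String hostOf(String url) throws MalformedURLException {
        return new URL(url).getHost();
    }

    public static void main(String[] args) throws MalformedURLException {
        // Four slashes: the issue reports that java.net.URL yields an empty host name here.
        System.out.println("[" + hostOf("https:////example.org/") + "]");

        // IP address followed by a dot: check it with the trailing dot ignored.
        String host = hostOf("https://123.123.123.123./robots.txt");
        System.out.println(host + " -> isIpv4: " + isIpv4(host));
    }
}
```

The public suffix case is left out on purpose, since it depends on the behavior of EffectiveTldFinder and the suffix list it loads.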

sebastian-nagel · Apr 04 '23 13:04