
WARC writer: unit tests for conversion of URLs to URIs

Open sebastian-nagel opened this issue 2 years ago • 0 comments

Nutch uses instances of the class java.net.URL to represent the URLs being crawled. WARC files require URIs for the WARC-Target-URI header. While the conversion to a URI is unproblematic for most URLs, there are some issues:

  1. there are instances of java.net.URL which fail to be converted to java.net.URI, see URL.toURI(). Note: the URLs were successfully fetched!
  2. the conversion of java.net.URI to an ASCII-only URI is not free of pitfalls (see #20)
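Both issues can be reproduced with the JDK alone. The following is a minimal sketch, not Nutch code: the class name `UrlToUriDemo` and the example URLs are made up for illustration. It relies on `java.net.URL` accepting characters in the path (such as a space) that `URL.toURI()` then rejects, and on `URI.toASCIIString()` percent-encoding non-ASCII characters as UTF-8.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class UrlToUriDemo {

    /** Returns true if the string parses as java.net.URL but URL.toURI() rejects it. */
    static boolean failsToConvert(String spec) {
        URL url;
        try {
            // The URL constructor is lenient: it does not validate path or query characters.
            // (It is deprecated in recent JDKs but still available.)
            url = new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("not even a valid URL: " + spec, e);
        }
        try {
            url.toURI(); // strict RFC 2396 parsing
            return false;
        } catch (URISyntaxException e) {
            return true;
        }
    }

    /** Parses the string as URI and returns its ASCII-only form. */
    static String toAscii(String spec) {
        try {
            return new URI(spec).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("not a valid URI: " + spec, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(failsToConvert("http://example.com/a b"));   // space in path
        System.out.println(failsToConvert("http://example.com/a%20b")); // already escaped
        System.out.println(toAscii("http://example.com/caf\u00e9"));    // non-ASCII path
    }
}
```

Note that `toASCIIString()` only percent-encodes; it does not convert internationalized host names to Punycode, which is one reason the ASCII conversion needs tests of its own.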

It would be good to have unit tests that reproduce and verify these issues, ideally together with "solutions" that make the conversion from URL to URI succeed. E.g.,

  • non-ASCII / Unicode components in URLs, including IDNs
  • encoding of white space in the URL path or query
  • encoding of characters invalid in URIs but valid in URLs
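One possible shape for such a "solution", together with the cases a unit test could assert, is sketched below. This is an assumption about how the conversion might be made more lenient, not the behavior of the existing WARC writer: the class `LenientUrlToUri` and its methods are hypothetical. The sketch first tries `URL.toURI()` and, only if that fails, rebuilds the URI from the URL's components using the multi-argument `java.net.URI` constructor (which quotes illegal characters) and `java.net.IDN.toASCII` for the host.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class LenientUrlToUri {

    /** Hypothetical helper: strict conversion first, component-wise fallback second. */
    static URI toUri(URL url) throws URISyntaxException {
        try {
            return url.toURI();
        } catch (URISyntaxException e) {
            String host = url.getHost();
            if (host == null || host.isEmpty()) {
                host = null; // no authority component, e.g. for file: URLs
            } else {
                host = IDN.toASCII(host); // internationalized host name to Punycode
            }
            // The multi-argument constructor quotes characters that are illegal in a URI.
            return new URI(url.getProtocol(), url.getUserInfo(), host, url.getPort(),
                    url.getPath(), url.getQuery(), url.getRef());
        }
    }

    /** Convenience wrapper for tests: URL string in, ASCII-only URI string out. */
    static String toAsciiUri(String spec) {
        try {
            return toUri(new URL(spec)).toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("cannot convert: " + spec, e);
        }
    }

    public static void main(String[] args) {
        String[] specs = {
            "http://example.com/a b",           // white space in path
            "http://example.com/?q=a b",        // white space in query
            "http://example.com/a|b",           // valid for URL, invalid in URI
            "http://example.com/caf\u00e9",     // non-ASCII path
            "http://m\u00fcller.example/a b"    // IDN host plus white space
        };
        for (String spec : specs) {
            System.out.println(spec + " -> " + toAsciiUri(spec));
        }
    }
}
```

A known limitation of this fallback: the multi-argument constructors always quote the percent character, so a URL that mixes existing escapes with illegal characters (for example `/a%20b c`) would come out double-encoded (`/a%2520b%20c`). IPv6 literal hosts are not handled either. Both would be worth covering explicitly in the tests.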

sebastian-nagel, Jul 12 '23 15:07