WARC writer: unit tests for conversion of URLs to URIs
Nutch uses instances of the class java.net.URL to represent the URLs being crawled. WARC files require URIs for the WARC-Target-URI header. While the conversion to a URI is unproblematic for most URLs, there are some issues:
- some instances of java.net.URL cannot be converted to java.net.URI, see URL.toURI(). Note: these URLs were successfully fetched!
- the conversion of java.net.URI to an ASCII-only URI is not free of pitfalls (see #20)
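Both issues can be reproduced with the JDK alone. The following is a minimal sketch, not code from Nutch: the class name UrlToUriDemo, its helper methods, and the example.com URLs are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical demo class, not part of Nutch.
final class UrlToUriDemo {

    /** Returns true if the string parses as java.net.URL and URL.toURI() accepts it. */
    static boolean convertsToUri(String urlString) {
        try {
            // new URL(String) is deprecated since Java 20 but still available
            new URL(urlString).toURI();
            return true;
        } catch (URISyntaxException e) {
            // java.net.URL accepted the string, java.net.URI rejected it
            return false;
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** ASCII-only form of an already valid URI string. */
    static String asciiForm(String uriString) {
        return URI.create(uriString).toASCIIString();
    }

    public static void main(String[] args) {
        // white space in the path: valid for URL, invalid for URI
        System.out.println(convertsToUri("http://example.com/a b"));     // false
        // '|' in the query: valid for URL, invalid for URI
        System.out.println(convertsToUri("http://example.com/?q=a|b"));  // false
        // already percent-encoded: accepted by both
        System.out.println(convertsToUri("http://example.com/a%20b"));   // true
        // non-ASCII path characters are percent-encoded as UTF-8
        System.out.println(asciiForm("http://example.com/caf\u00e9"));
    }
}
```

A unit test could assert these results directly, one test case per problematic URL.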
It would be good to have unit tests that reproduce and verify these issues, ideally together with "solutions" that make the conversion from URL to URI succeed. E.g.,
- non-ASCII / Unicode components in URLs, including IDNs
- encoding of white space in the URL path or query
- encoding of characters invalid in URIs but valid in URLs
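One possible shape for such a "solution" is sketched below. It is only an assumption about how a lenient conversion could look, not the approach chosen in Nutch: the class LenientUrlToUri and its methods are hypothetical. The sketch first tries URL.toURI() and falls back to the multi-argument URI constructor, which quotes illegal characters, with the host converted via java.net.IDN. Known limitation: the multi-argument constructor always quotes '%', so a URL that mixes existing percent-escapes with illegal characters would be double-encoded in the fallback path.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper, not part of Nutch.
final class LenientUrlToUri {

    static URI toUri(URL url) throws URISyntaxException {
        try {
            URI strict = url.toURI();
            // A non-null host means the authority was parsed as server-based.
            if (strict.getHost() != null) {
                return strict;
            }
        } catch (URISyntaxException e) {
            // fall through to the lenient conversion
        }
        String host = url.getHost();
        // IDN.toASCII may throw IllegalArgumentException for invalid host names
        host = (host == null || host.isEmpty()) ? null : IDN.toASCII(host);
        // The multi-argument constructor quotes characters illegal in each
        // component, but it also quotes every '%' (see limitation above).
        return new URI(url.getProtocol(), url.getUserInfo(), host, url.getPort(),
                url.getPath(), url.getQuery(), url.getRef());
    }

    /** Convenience wrapper for tests: returns the ASCII-only URI string. */
    static String toAsciiUri(String urlString) {
        try {
            return toUri(new URL(urlString)).toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toAsciiUri("http://example.com/a b?q=x y"));
        System.out.println(toAsciiUri("http://b\u00fccher.example/"));
        System.out.println(toAsciiUri("http://example.com/?q=a|b"));
        System.out.println(toAsciiUri("http://example.com/a%20b"));
    }
}
```

Each call in main corresponds to one of the cases listed above and could become a separate unit test.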