epubcheck issues a warning on legitimate and correct IDNs in href attributes

Open sermo-de-arboribus opened this issue 4 years ago • 2 comments
If an EPUB file contains a link to a website with an Internationalized Domain Name (IDN), epubcheck issues a warning, e.g.

Couldn’t parse host of URL "https://www.jüdische-gemeinden.de/index.php/gemeinden/u-z/2477-wilna-vilnius-litauen/" (probably due to disallowed characters or missing slashes after the protocol)

For some years now, many registries have allowed domain names with umlauts or other non-ASCII characters, yet java.net.URI cannot handle them directly, so epubcheck's call to uri.getHost() returns null. Java also offers the class java.net.IDN for handling IDNs. Perhaps it should be used to sanitize the href string before checking it with uri.getHost()?
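A minimal sketch of that idea, not epubcheck's actual code: the class name `IdnHostSketch` and the helper `hostOf` are hypothetical, and the sketch assumes the authority has the simple form `[userinfo@]host[:port]`. When `getHost()` returns null, it falls back to the authority component and converts the host part to its ASCII (Punycode) form with `java.net.IDN.toASCII`.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not part of epubcheck.
class IdnHostSketch {

    // Returns the host of href in ASCII form, or null if none can be determined.
    static String hostOf(String href) {
        URI uri;
        try {
            uri = new URI(href);
        } catch (URISyntaxException e) {
            return null;
        }
        if (uri.getHost() != null) {
            return uri.getHost(); // plain ASCII host, nothing to convert
        }
        String authority = uri.getAuthority();
        if (authority == null) {
            return null;
        }
        // Assumption: authority looks like [userinfo@]host[:port].
        int at = authority.lastIndexOf('@');
        String hostPort = at >= 0 ? authority.substring(at + 1) : authority;
        int colon = hostPort.lastIndexOf(':');
        String host = colon >= 0 ? hostPort.substring(0, colon) : hostPort;
        try {
            return IDN.toASCII(host); // Unicode labels become "xn--..." labels
        } catch (IllegalArgumentException e) {
            return null; // not convertible to a valid ASCII domain name
        }
    }

    public static void main(String[] args) {
        // \u00fc is "ü", escaped so the source file stays ASCII-only.
        String href = "https://www.j\u00fcdische-gemeinden.de/index.php/";
        System.out.println("URI.getHost(): " + URI.create(href).getHost());
        System.out.println("hostOf():      " + hostOf(href));
    }
}
```

A real fix would also have to decide whether to run the rest of the checks on the original or on the converted URL; the sketch only shows how a usable host could be recovered.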

sermo-de-arboribus avatar Sep 08 '21 13:09 sermo-de-arboribus

Thanks for the report Kai. I'm surprised this hasn't come up earlier 🤔 I don't think IDNs are disallowed in EPUB (pinging @mattgarrish for double-check), so it looks like a bug.

There's a distinct security risk with homographs, but that's not specific to EPUB and is better dealt with by editors or reading systems.

rdeltour avatar Sep 08 '21 13:09 rdeltour

> I don't think IDNs are disallowed in EPUB

No, EPUB 3.2 references RFC 3987, which allows most Unicode characters outside the private use areas (and even those in query strings). The URL standard referenced in 3.3 is similar.

Both warn about spoofing, as you say, but that's not an authoring concern.

mattgarrish avatar Sep 09 '21 12:09 mattgarrish