epubcheck
epubcheck issues a warning on legitimate and correct IDNs in href attributes
If an EPUB file contains a link to a website with an Internationalized Domain Name (IDN), epubcheck issues a warning, e.g.
Couldn’t parse host of URL "https://www.jüdische-gemeinden.de/index.php/gemeinden/u-z/2477-wilna-vilnius-litauen/" (probably due to disallowed characters or missing slashes after the protocol)
Many registries have allowed domain names with umlauts or other non-ASCII characters for some years now, yet java.net.URI cannot handle them directly, so epubcheck's call to uri.getHost() returns null. Java also offers the class java.net.IDN for handling IDNs. Perhaps it should be used to sanitize the href string before checking it with uri.getHost()?
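To illustrate the suggestion, here is a minimal sketch, not epubcheck's actual code. The class `IdnHostSketch` and the helper `asciiHost` are hypothetical names; only `java.net.URI` and `java.net.IDN.toASCII` are real JDK APIs. It assumes the href has a plain `host[:port]` authority, and it ignores IPv6 literals.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;

class IdnHostSketch {

    // Hypothetical helper (not part of epubcheck): returns the host of an
    // href, converting an internationalized host name to its ASCII
    // (Punycode) form when java.net.URI cannot parse it as a server host.
    static String asciiHost(String href) throws URISyntaxException {
        URI uri = new URI(href);
        if (uri.getHost() != null) {
            return uri.getHost(); // plain ASCII host, nothing to convert
        }
        String authority = uri.getRawAuthority();
        if (authority == null) {
            return null; // no authority component at all
        }
        // Strip optional user info and port (simplified: no IPv6 literals).
        String host = authority.substring(authority.lastIndexOf('@') + 1);
        int colon = host.lastIndexOf(':');
        if (colon >= 0) {
            host = host.substring(0, colon);
        }
        // Throws IllegalArgumentException if the name is not a valid IDN.
        return IDN.toASCII(host);
    }

    public static void main(String[] args) throws URISyntaxException {
        // The u-umlaut is written as an escape so the example does not
        // depend on the encoding of the source file.
        String href = "https://www.j\u00fcdische-gemeinden.de/index.php/";
        System.out.println(new URI(href).getHost()); // prints "null"
        System.out.println(asciiHost(href));         // an ASCII "xn--" host name
    }
}
```

Whether the converted host should only be used for validation or also be written back into the reported URL is a design decision left open here.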
Thanks for the report Kai. I'm surprised this hasn't come up earlier 🤔 I don't think IDNs are disallowed in EPUB (pinging @mattgarrish for double-check), so it looks like a bug.
There's a distinct security risk with homographs, but that's not specific to EPUB and is better dealt with by editors or reading systems.
I don't think IDNs are disallowed in EPUB
No, EPUB 3.2 references RFC 3987, which allows most Unicode characters outside the PUAs (and even those in query strings). The URL standard referenced in EPUB 3.3 is similar.
Both warn about spoofing, as you say, but that's not an authoring concern.