resolveEntity being passed incorrect systemid
To my surprise, I've found an actual issue!
We have an HTML file with this DOCTYPE:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
In org.ccil.cowan.tagsoup.Parser.decl, this results in the "systemid" variable being set to a blank string (Parser.java line 851), which is fine. However, if an EntityResolver is set, these lines just after:
    if (theScanner instanceof Locator) { // Must resolve systemid
        theDoctypeSystemId = ((Locator)theScanner).getSystemId();
        try {
            theDoctypeSystemId = new URL(new URL(theDoctypeSystemId), systemid).toString();
        } catch (Exception e) {}
    }
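result in resolveEntity being called with the System ID from the Locator (i.e. the URL of the HTML document itself) set as the system ID of the entity. This is clearly incorrect: the DOCTYPE specifies no system ID at all, and resolving a blank string against a base URL simply yields the base URL.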
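To illustrate the URL behaviour behind this, here is a minimal standalone sketch. It is not TagSoup code: the class name, the `resolve` helper and the example.com URL are made up for the demonstration. It only mirrors the `new URL(new URL(base), systemid)` step quoted above.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class EmptySpecDemo {
    // Hypothetical helper mirroring the resolution step in Parser.decl:
    // resolve `systemid` against `base` using java.net.URL.
    static String resolve(String base, String systemid) {
        try {
            return new URL(new URL(base), systemid).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up document URL, standing in for the Locator's system ID.
        String base = "http://example.com/docs/page.html";

        // Blank systemid: the result is the document URL itself.
        System.out.println(resolve(base, ""));
        // prints "http://example.com/docs/page.html"

        // A real relative systemid resolves as expected.
        System.out.println(resolve(base, "strict.dtd"));
        // prints "http://example.com/docs/strict.dtd"
    }
}
```

So with a blank systemid, the EntityResolver is asked to resolve the HTML document's own URL as if it were the DTD.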
The fix is to change line 864 to:
if (systemid != null && systemid.length() > 0 && theScanner instanceof Locator) {
i.e. don't try to resolve the entity systemid against the Locator's systemid unless it's actually specified.
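For reference, the guarded logic can be sketched in isolation as below. This is only an illustration under assumptions: `resolveDoctypeSystemId` is a hypothetical helper rather than anything in TagSoup, and returning the systemid unchanged when it is blank or cannot be resolved is my assumption about the desired fallback.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class GuardedResolveDemo {
    // Hypothetical helper (not part of TagSoup): resolve the DOCTYPE's
    // systemid against the document URL only if one was actually specified.
    static String resolveDoctypeSystemId(String documentUrl, String systemid) {
        if (systemid == null || systemid.length() == 0) {
            // Nothing to resolve: do not substitute the document's own URL.
            return systemid;
        }
        try {
            return new URL(new URL(documentUrl), systemid).toString();
        } catch (MalformedURLException e) {
            // Assumption: fall back to the unresolved value on a bad URL.
            return systemid;
        }
    }

    public static void main(String[] args) {
        String doc = "http://example.com/docs/page.html"; // made-up URL

        // Blank systemid stays blank instead of becoming the document URL.
        System.out.println("[" + resolveDoctypeSystemId(doc, "") + "]");
        // prints "[]"

        // A relative systemid is still resolved against the document URL.
        System.out.println(resolveDoctypeSystemId(doc, "strict.dtd"));
        // prints "http://example.com/docs/strict.dtd"
    }
}
```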
(cross-posting to both the "orbeon" and "jukka" forks of tagsoup on github, to try to keep things in sync)