jsoup
Bug: DOM elements not being placed in (X)HTML namespace.
The description page for jsoup says:
jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
The WHATWG HTML5 Specification § 2.1.3 XML compatibility says:
To ease migration from HTML to XML, user agents conforming to this specification will place elements in HTML in the http://www.w3.org/1999/xhtml namespace, at least for the purposes of the DOM and CSS.
In other words, jsoup should be placing HTML elements in the http://www.w3.org/1999/xhtml namespace, even in the absence of an xmlns declaration. But it's not.
I'm using org.jsoup:jsoup:1.15.3 with Java 17. I have a test HTML document that looks something like this:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Foobar</title>
</head>
<body>
…
</body>
In a JUnit 5/Hamcrest unit test I read that document:
URL testResourceUrl = getClass().getResource("foobar.html");
try (InputStream inputStream = testResourceUrl.openStream()) {
    org.jsoup.nodes.Document jsoupDocument = Jsoup.parse(inputStream, null, testResourceUrl.toString());
    org.w3c.dom.Document domDocument = new W3CDom().fromJsoup(jsoupDocument);
    assertThat(domDocument.getDocumentElement().getNamespaceURI(), is("http://www.w3.org/1999/xhtml"));
}
Unfortunately this test fails:
java.lang.AssertionError:
Expected: is "http://www.w3.org/1999/xhtml"
but: was null
…
Thus jsoup has a bug: it is not following the HTML5 specification and placing HTML elements in the http://www.w3.org/1999/xhtml namespace, but instead assigning them the null namespace.
To reiterate, yes, I realize that the HTML document in question has no default namespace specified. But as an HTML document it isn't required or even expected to. The HTML5 specification says that the HTML elements should be placed in the http://www.w3.org/1999/xhtml namespace regardless.
While this bug is being fixed, does anyone have a workaround that would allow me to force the W3CDom class to put the DOM Document elements into the correct namespace? Right now this is breaking my XPath expressions (which are namespace aware and rely on the HTML namespace being properly defined). Thanks.
(Please don't let my mentioning XPath get this discussion sidetracked into how XPath processes namespaces or whether I should be turning off namespace processing or whatever. This is a bug because it directly violates a central requirement of the WHATWG HTML specification, whether I'm using XPath or traversing the DOM manually.)
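For readers who want to see the symptom in isolation, here is a minimal sketch using only the JDK's DOM and XPath APIs (no jsoup involved). The class and method names are hypothetical and exist only for this illustration. It builds the same tiny tree twice, once with elements in the XHTML namespace and once with no namespace, and evaluates a namespace-aware expression against both.

```java
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; not part of jsoup.
final class NamespaceAwareXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Builds html > body with both elements in the given namespace (null means no namespace). */
    static Document buildDocument(String namespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            Element html = doc.createElementNS(namespace, "html");
            html.appendChild(doc.createElementNS(namespace, "body"));
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the local name of the node matched by /h:html/h:body, or "" if nothing matches. */
    static String findBody(Document doc) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the prefix "h" to the XHTML namespace for this expression only.
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }

            @Override
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "h" : null;
            }

            @Override
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                        ? Collections.singletonList("h").iterator()
                        : Collections.<String>emptyIterator();
            }
        });
        try {
            return xpath.evaluate("local-name(/h:html/h:body)", doc);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Elements in the XHTML namespace are found by the namespace-aware expression.
        System.out.println("namespaced: '" + findBody(buildDocument(XHTML_NS)) + "'");
        // Elements with no namespace are not, which mirrors the failure described in this ticket.
        System.out.println("no namespace: '" + findBody(buildDocument(null)) + "'");
    }
}
```

The second call matches nothing because the expression asks for elements in the XHTML namespace and the tree contains only elements with no namespace.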
At least part of the problem seems to be here in W3CDom:
Element el = namespace == null && tagName.contains(":") ?
    doc.createElementNS("", tagName) : // doesn't have a real namespace defined
    doc.createElementNS(namespace, tagName);
…
If the element is an HTML element, it should be given a namespace of http://www.w3.org/1999/xhtml, not the empty string.
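As a rough illustration of the behaviour being asked for (not jsoup's actual code), the sketch below falls back to the XHTML namespace whenever no namespace has been declared for an unprefixed element. HtmlNamespaceFallback, newDocument and createHtmlElement are hypothetical names, and prefixed tag names are deliberately left out of scope.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper; a sketch of the desired fallback, not the W3CDom implementation.
final class HtmlNamespaceFallback {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Creates an empty, namespace-aware W3C DOM document. */
    static Document newDocument() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Creates an unprefixed element. If no namespace was declared for it,
     * the XHTML namespace is used instead of null or the empty string.
     */
    static Element createHtmlElement(Document doc, String declaredNamespace, String tagName) {
        String namespace = declaredNamespace != null ? declaredNamespace : XHTML_NS;
        return doc.createElementNS(namespace, tagName);
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        // No declared namespace: the element lands in the XHTML namespace.
        System.out.println(createHtmlElement(doc, null, "p").getNamespaceURI());
        // An explicit declaration (SVG here) still wins over the fallback.
        System.out.println(createHtmlElement(doc, "http://www.w3.org/2000/svg", "svg").getNamespaceURI());
    }
}
```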
And in case someone says, "But W3CDom is processing as namespace-aware, which requires that namespaces be declared", the response is that you've confused "parsing namespace-aware" and "parsing using the XML syntax". Because in reality the HTML5 DOM (according to the specification) always should be namespace-aware, and the HTML elements should always go in the http://www.w3.org/1999/xhtml namespace. However if the document is being parsed as XML (i.e. using the XML syntax, i.e. "XHTML") then yes there are certain rules about namespaces being declared.
But I am not referring to the XML syntax. (If I were using the XML syntax, then I would just use an XML parser and have no need for jsoup.) I am referring to the HTML syntax, which does not require namespace declarations but nevertheless places the elements in the http://www.w3.org/1999/xhtml namespace, as explained in the description of this bug ticket.
Unfortunately the following attempt at a workaround doesn't work. ☹️
…
jsoupDocument.attr("xmlns", "http://www.w3.org/1999/xhtml");
org.w3c.dom.Document domDocument = new W3CDom().fromJsoup(jsoupDocument);
…
Ah, here's a workaround. The workaround above didn't work because even though jsoup considers the org.jsoup.nodes.Document an org.jsoup.nodes.Element (which is confusing because that's different from the W3C model), W3CDom skips the jsoup document node itself (naturally) when converting to elements, starting instead at jsoupDocument.child(0).
So this is a workaround:
…
jsoupDocument.child(0).attr("xmlns", "http://www.w3.org/1999/xhtml");
org.w3c.dom.Document domDocument = new W3CDom().fromJsoup(jsoupDocument);
…
Note that this is only a temporary workaround; the actual bug needs to be fixed.
From examining the code visually, I would suggest that the following in W3CDom might fix this bug. Of course we wouldn't know without running regression tests. (I haven't looked at the existing tests; I wouldn't be surprised if some of the tests themselves were wrong, though, and need to be updated.)
public W3CBuilder(Document doc) {
    this.doc = doc;
    namespacesStack.push(new HashMap<>());
    namespacesStack.peek().put("", "http://www.w3.org/1999/xhtml"); // fix for #1837
    …
}
(You could also configure the initial HashMap<> before it's used to create the initial namespace context on the stack.)
This would make elements by default have the correct HTML namespace (bringing it into WHATWG spec compliance) by providing an ultimate fallback value for the default namespace. It would also allow elements to override the default namespace for a subtree by explicitly setting a default namespace declaration on an element.
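The idea of a stack whose bottom entry supplies the fallback default namespace can be sketched in isolation as follows. This is a standalone illustration under the assumption that each element pushes a copy of its parent's prefix map; NamespaceScopeSketch and its methods are hypothetical names and do not reproduce the real W3CBuilder.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a namespace scope stack; not jsoup's W3CBuilder.
final class NamespaceScopeSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    static final String SVG_NS = "http://www.w3.org/2000/svg";

    // Top of the deque is the innermost element's prefix-to-URI map.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    NamespaceScopeSketch() {
        // Root scope: the empty prefix (default namespace) falls back to XHTML.
        Map<String, String> root = new HashMap<>();
        root.put("", XHTML_NS);
        scopes.push(root);
    }

    /** Enters an element: inherit the parent's bindings, then apply this element's xmlns declarations. */
    void enterElement(Map<String, String> declarations) {
        Map<String, String> scope = new HashMap<>(scopes.peek());
        scope.putAll(declarations);
        scopes.push(scope);
    }

    /** Leaves an element, restoring the parent's bindings. The root scope is never removed. */
    void leaveElement() {
        if (scopes.size() > 1) {
            scopes.pop();
        }
    }

    /** The namespace an unprefixed element would get in the current scope. */
    String defaultNamespace() {
        return scopes.peek().get("");
    }

    public static void main(String[] args) {
        NamespaceScopeSketch ns = new NamespaceScopeSketch();
        System.out.println(ns.defaultNamespace()); // XHTML fallback from the root scope

        ns.enterElement(Collections.singletonMap("", SVG_NS)); // e.g. <svg xmlns="...">
        System.out.println(ns.defaultNamespace()); // overridden for this subtree

        ns.leaveElement();
        System.out.println(ns.defaultNamespace()); // back to the XHTML fallback
    }
}
```

Copying the parent map on every push keeps lookups simple at the cost of some allocation; a real implementation could instead walk the stack on lookup.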
You can assign me to this ticket. I've already started working on it.
Looks like several unit tests, particularly in W3CDomTest, are breaking. This is because the unit tests are simply wrong. For example:
org.opentest4j.AssertionFailedError: expected: <null> but was: <http://www.w3.org/1999/xhtml>
at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
at org.junit.jupiter.api.AssertNull.failNotNull(AssertNull.java:50)
at org.junit.jupiter.api.AssertNull.assertNull(AssertNull.java:35)
at org.junit.jupiter.api.AssertNull.assertNull(AssertNull.java:30)
at org.junit.jupiter.api.Assertions.assertNull(Assertions.java:275)
at org.jsoup.helper.W3CDomTest.convertsGoogle(W3CDomTest.java:89)
…
Node htmlEl = wDoc.getChildNodes().item(1);
assertNull(htmlEl.getNamespaceURI());
As explained above, there is no question both in the specification and in browser implementations that the default namespace in an HTML document, if parsed as HTML (and not as XML), must be http://www.w3.org/1999/xhtml.
If you don't believe me, load https://example.com/ in any modern browser, go into the developer tools (e.g. in Chrome you can create a snippet), and enter:
document.body.firstElementChild.namespaceURI
Your browser will proudly tell you:
http://www.w3.org/1999/xhtml
You'll note also that on this page there are no namespace declarations in the HTML source code.
As I mentioned, I'll fix this and submit a pull request. I'm a little surprised, though, that there is as of yet no response from the original developer(s) regarding such a fundamentally breaking bug. I hope if I go to this effort that it can quickly be accepted and integrated back into the code base so that others can benefit.
This is almost finished, but I need some clarification about usage of W3CDom.asString(Document doc) and W3CDom.asString(Document doc, @Nullable Map<String, String> properties). This is used in a lot of the tests to convert DOM to a string. Now that HTML documents default to the correct namespace after my changes, the returned strings have xmlns="http://www.w3.org/1999/xhtml" on the <html> element, not because the DOM has that, but because W3CDom.asString() is using a Transformer which only thinks in terms of XML, and adds the namespace declaration.
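The effect can be reproduced without jsoup. The sketch below is a JDK-only illustration with hypothetical names (TransformerNamespaceDemo, serialize); it assumes the JDK's built-in identity Transformer and the "xml" output method. It builds a small DOM whose elements carry a namespace but no xmlns attribute node, then serializes it.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; it does not call W3CDom.asString().
final class TransformerNamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Builds html > body in the given namespace (null means none) and serializes it as XML. */
    static String serialize(String namespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();

            // Note: no xmlns attribute is ever set on these elements.
            Element html = doc.createElementNS(namespace, "html");
            html.appendChild(doc.createElementNS(namespace, "body"));
            doc.appendChild(html);

            // Identity transform: copies the DOM to the output stream.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // With namespaced elements the serializer adds the declaration itself.
        System.out.println(serialize(XHTML_NS));
        // With no namespace there is nothing to declare.
        System.out.println(serialize(null));
    }
}
```

The serializer has to emit the declaration in the first case, because otherwise re-parsing the output as XML would yield elements in a different namespace than the DOM it was given.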
We can leave it like that or play tricks to get rid of it, but it depends on the purpose and usage of W3CDom.asString(). Is it just used for testing? Or is it part of the jsoup API for turning DOM into a string? Presumably once it's in DOM form, developers already have their preferred ways for serialization; jsoup is about parsing, not pretty-printing, after all. In this view, W3CDom.asString() is more for testing that the DOM is correct, in which case we can simply update the unit tests to expect the xmlns="http://www.w3.org/1999/xhtml".
On the other hand, is W3CDom.asString() meant to provide a round-trip string representation of the original "tidied" HTML? If that is the case, we would want to remove the xmlns="http://www.w3.org/1999/xhtml" from the output, unless the user explicitly specified W3CDom.OutputXml(). This is not trivial, because as mentioned jsoup is using a Transformer, which is a Java feature from eons ago, before HTML5, and only thinks in terms of XML. (See my question Java XSLT transformer with default namepace without xmlns on Stack Overflow.)
I went with the assumption that you want W3CDom.asString() to continue providing "clean" HTML5 output and not showing an xmlns default namespace attribute. The workaround is a bit of a kludge; see Java XSLT transformer with default namepace without xmlns for more discussion. (Ideally we would improve the serializer to output clean HTML5 to begin with, even now that the default namespace has been fixed.) But it's not a mysterious kludge; it is simple and low risk, with unit tests provided.
The only serialization change that should affect existing code (I don't know if anybody even called W3CDom.asString() outside the unit tests) is that if someone called it explicitly using W3CDom.OutputXml(), then they will now get a string including the xmlns=\"http://www.w3.org/1999/xhtml\", because XML requires the namespace declaration and the document now has that default namespace.
Is anyone still working on this library and approving submitted bug fixes? I haven't received any feedback on the pull request.
Thanks @garretwilson - I have added some notes to your PR.
@jhy thank you for looking at the PR. I've added some responses. Please respond and let me know how you would like to proceed and I'll update the PR.