W3CDom.fromJsoup(document) produces invalid XML DOM (multiple roots) - not usable for XPath evaluation
I'm trying to build an HTML-to-XHTML sanitizer. I have absolutely no control over the source HTML, since it comes from a web data scraper, and the output is later passed to the FlyingSaucer library for conversion to PDF.
Loading the HTML from a FileInputStream works fine:
org.jsoup.nodes.Document document = Jsoup.parse(fis, null, "./");
document.outputSettings().syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml);
document.outputSettings().escapeMode(EscapeMode.xhtml);
The conversion to org.w3c.dom appears to succeed, but the resulting document has two root-level nodes, and every later XPath evaluation fails:
W3CDom w3cDom = new W3CDom();
Document doc = w3cDom.fromJsoup(document);
The resulting document has two children: a DocumentTypeImpl and an ElementNSImpl (which contains the real data).
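To confirm exactly which children a converted document carries, a small diagnostic along these lines may help. It is only a sketch that uses the JDK's DOM API: the class name `ChildNodeDump` and the helpers `describeChildren` and `standInDocument` are hypothetical, and the stand-in document is built by hand to mimic the doctype-plus-element layout described above, not produced by jsoup.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class ChildNodeDump {

    // Lists the direct children of a W3C Document as "KIND:nodeName".
    static List<String> describeChildren(Document doc) {
        List<String> kinds = new ArrayList<>();
        NodeList children = doc.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            kinds.add(kindOf(child) + ":" + child.getNodeName());
        }
        return kinds;
    }

    // Maps the node types of interest here to readable labels.
    static String kindOf(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_TYPE_NODE: return "DOCUMENT_TYPE";
            case Node.ELEMENT_NODE:       return "ELEMENT";
            case Node.COMMENT_NODE:       return "COMMENT";
            default:                      return "OTHER";
        }
    }

    // Assumption: a hand-built stand-in with a doctype and an <html> root,
    // used in place of the output of W3CDom.fromJsoup(document).
    static Document standInDocument() {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .getDOMImplementation();
            DocumentType doctype = impl.createDocumentType("html", null, null);
            return impl.createDocument(null, "html", doctype);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        Document doc = standInDocument();
        System.out.println(describeChildren(doc));
        // getDocumentElement() returns the element child, not the doctype.
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Passing the document returned by `w3cDom.fromJsoup(document)` to `describeChildren` instead of the stand-in would show whether the doctype is the only extra child, which may help narrow down where the XPath failure originates.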
To get a valid XML DOM, I have to serialize the document with document.toString() and parse the result again. I think the problem comes from W3CDom.convert: when a "document" node is passed, it skips to firstChild, which in my case is the "DOCTYPE" declaration, while the main document body is in the next child. I may be doing something completely wrong, but if so, the behavior is very counter-intuitive.
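The re-parsing workaround mentioned above can be sketched roughly as follows, using only the JDK's XML parser and XPath API. The class name `ReparseWorkaround`, the helpers `reparse` and `evaluate`, and the hard-coded string are hypothetical; the string merely stands in for what `document.toString()` might return with XML output syntax enabled.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ReparseWorkaround {

    // Parses serialized XHTML into a fresh W3C DOM using the JDK parser.
    static Document reparse(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not re-parse serialized XHTML", e);
        }
    }

    // Evaluates an XPath expression and returns its string value.
    static String evaluate(Document doc, String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Assumption: stand-in for the string produced by document.toString().
        String serialized =
                "<!DOCTYPE html><html><head><title>Demo</title></head>"
                + "<body><p>Hello</p></body></html>";
        Document doc = reparse(serialized);
        System.out.println(evaluate(doc, "/html/body/p")); // prints Hello
    }
}
```

This sketch assumes the serialized output is well-formed XML and contains no named HTML entities that the XML parser does not know; the parser will reject such input.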
I'm using jsoup 1.18.1 from Maven, with OpenJDK 17.
(Optionally) auto-skipping non-xml declarations for XML output would be nice.