
W3CDom.fromJsoup(document) produces invalid XML DOM (multiple roots) - not usable for XPath evaluation

Open jelinj8 opened this issue 4 months ago • 0 comments

I'm trying to build an HTML->XHTML sanitizer. I have absolutely no control over the source HTML, because this is meant to be a web data scraper whose output is later converted to PDF with the FlyingSaucer library.

Loading HTML from FileInputStream is OK:

org.jsoup.nodes.Document document = Jsoup.parse(fis, null, "./");
document.outputSettings().syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml);
document.outputSettings().escapeMode(EscapeMode.xhtml);

The conversion to org.w3c.dom completes without error, but the resulting document has two root elements, and any later XPath evaluation fails:

W3CDom w3cDom = new W3CDom();
Document doc = w3cDom.fromJsoup(document);

The resulting document has two children: a DocumentTypeImpl and an ElementNSImpl (containing the real data).

As a workaround, I have to re-parse the output of document.toString() to get a valid XML DOM. I think the problem comes from W3CDom.convert: when a "document" node is passed, it skips to firstChild, which in my case is the "DOCTYPE" declaration. The main document body is in the next child. I may be doing something completely wrong, but if so, the behavior is very counter-intuitive.
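The re-parse workaround might look roughly like the following, as a minimal sketch that uses only the JDK's XML APIs. The SAMPLE string is a hypothetical stand-in for the output of document.toString(), and the names ReparseSketch, reparse and evaluate are illustrative only and not part of jsoup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ReparseSketch {
    // Hypothetical stand-in for the string returned by document.toString()
    // on the jsoup document (XML syntax, DOCTYPE followed by the html element).
    static final String SAMPLE =
        "<!DOCTYPE html><html><head><title>t</title></head>"
        + "<body><p>Hello</p></body></html>";

    // Parse the serialized XHTML again with the JDK parser to obtain
    // an org.w3c.dom.Document.
    static Document reparse(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    // Evaluate an XPath expression against the document and return
    // the string value of the result.
    static String evaluate(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        Document doc = reparse(SAMPLE);
        System.out.println(doc.getDocumentElement().getNodeName());
        System.out.println(evaluate(doc, "/html/body/p"));
    }
}
```

The sample deliberately has no xmlns attribute. If the real output declares the XHTML namespace, the XPath expressions would presumably need a NamespaceContext or local-name() tests.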

I'm using jsoup 1.18.1 from Maven, on OpenJDK 17.

(Optionally) auto-skipping non-xml declarations for XML output would be nice.

jelinj8 • Oct 14 '24 15:10