jsdom
No elements in result if there is a :not in the selector
Basic info:
- Node.js version: v12.21
- jsdom version: v19.0.0
Minimal reproduction case
const jsdom = require('jsdom');
const { JSDOM } = jsdom;
const dom = new JSDOM(`
<html>
  <head></head>
  <body>
    <p>
      <table>
        <tbody>
          <tr>
            <td>
              <p></p>
            </td>
          </tr>
        </tbody>
      </table>
    </p>
    <p></p>
  </body>
</html>
`);
const result = dom.window.document.querySelectorAll('p:not(table p)');
console.log(Object.entries(result));
Actual output:
[]
Expected output:
[
[ '0', HTMLParagraphElement {} ],
[ '1', HTMLParagraphElement {} ]
]
How does similar code behave in browsers?
There are two HTMLParagraphElements in the result.
https://jsfiddle.net/7v8dzgfh/6/
I have a script that runs on a certain website (through the dev tools console), and I was trying to move it to a Node.js backend. I first tried the linkedom library, but found that it has no window.getComputedStyle(). Then I tried this package, but for some reason it fails on the :not selector in a case like this. I ended up using puppeteer to get the same behavior as the browser so that my script runs properly.
One aspect here is that the HTML is not strictly valid according to the standard, so the result might depend on how each HTML parser repairs the broken markup:
https://stackoverflow.com/questions/10086912/why-is-table-not-allowed-inside-p
In linkedom and css-select, a related bug that I opened was closed, because neither project supports broken layouts. I don't know whether compatibility with Firefox and Chrome in these edge cases is a design goal for jsdom. Either way, how the parser auto-corrects the markup is relevant here, since the example is technically broken HTML.
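To check how a given parser actually arranged the broken markup, one option is to print the ancestor chain of every matched element and compare the output from jsdom with the output from a browser console. The sketch below is only an illustration: `ancestorPath` and `describeMatches` are hypothetical helper names, not part of jsdom, and they rely only on the standard `querySelectorAll`, `parentElement`, and `tagName` DOM members.

```javascript
// Hypothetical helper (not a jsdom API): returns the chain of tag names
// from the topmost element down to `el`, joined with " > ".
function ancestorPath(el) {
  const names = [];
  for (let node = el; node; node = node.parentElement) {
    names.unshift(node.tagName);
  }
  return names.join(' > ');
}

// Hypothetical helper: one ancestor path per element matching `selector`.
function describeMatches(root, selector) {
  return Array.from(root.querySelectorAll(selector), (el) => ancestorPath(el));
}
```

With the reproduction above, `describeMatches(dom.window.document, 'p')` in Node.js and `describeMatches(document, 'p')` in a browser can be compared side by side. If the paths differ, the parsers built different trees; if they match, the difference is more likely in selector matching.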
Yeah, I'm aware this is not valid HTML but I have no control over it in this case at least.
To be more specific: I am trying to extract data from an old website made with Microsoft Word, and its markup is really bad. I had to write a script that extracts the data in a structured way and normalizes the markup (converting elements to appropriate semantic tags based on how they are styled, etc.). It took me a good while to figure out that this package does not select elements consistently with actual browsers (at least for invalid HTML like my example).
If one wants/needs to have element selection behavior consistent with the browser, I suggest using something like puppeteer instead.
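For anyone who wants to stay in Node.js, a possible workaround is to avoid the complex selector inside :not and filter by ancestor in plain JavaScript instead. This is a minimal sketch, assuming the problem lies in how `:not(table p)` is matched rather than in the parsed tree; I have not verified it against this exact jsdom version. `selectOutside` is a hypothetical helper name, and it uses only the standard `querySelectorAll`, `parentElement`, and `closest` DOM members.

```javascript
// Hypothetical helper (not a jsdom API): emulates
// `selector:not(ancestorSelector selector)` by matching `selector` first and
// then dropping every element that has an ancestor matching `ancestorSelector`.
function selectOutside(root, selector, ancestorSelector) {
  return Array.from(root.querySelectorAll(selector)).filter((el) => {
    // Start from the parent so that the element itself is never counted
    // as its own ancestor.
    const parent = el.parentElement;
    return parent === null || parent.closest(ancestorSelector) === null;
  });
}
```

For the reproduction above, the call would be `selectOutside(dom.window.document, 'p', 'table')`. The result still depends on the tree the parser produced, so it may differ from a browser if the parsers repair the markup differently.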
If possible, maybe an option could be added that makes the parsing closer to the browser's actual behavior.