No elements in result if there is a :not in the selector

Open jp06 opened this issue 3 years ago • 2 comments

Basic info:

  • Node.js version: v12.21
  • jsdom version: v19.0.0

Minimal reproduction case

const jsdom = require('jsdom');
const { JSDOM } = jsdom;

const dom = new JSDOM(`
  <html>
    <head></head>
    <body>
      <p>
        <table>
          <tbody>
            <tr>
              <td>
                <p></p>
              </td>
            </tr>
          </tbody>
        </table>
      </p>
      <p></p>
    </body>
  </html>
`);

const result = dom.window.document.querySelectorAll('p:not(table p)');

console.log(Object.entries(result));

Actual output:

[]

Expected output:

[
  [ '0', HTMLParagraphElement {} ],
  [ '1', HTMLParagraphElement {} ]
]

How does similar code behave in browsers?

There are two HTMLParagraphElements in the result.

https://jsfiddle.net/7v8dzgfh/6/

I have a script that runs on a certain website (through the dev tools console), and I was trying to move it to a Node.js backend. I first tried the linkedom library, but found that it has no window.getComputedStyle(). Then I tried this package, but for some reason it fails on the :not selector in a case like this. I ended up using puppeteer so that my script runs with the same behavior as in the browser.

jp06 avatar Feb 18 '22 13:02 jp06

One aspect here is that the HTML is not strictly valid according to the standard, so the result might depend on how HTML parsers fix broken layouts:

https://stackoverflow.com/questions/10086912/why-is-table-not-allowed-inside-p

In linkedom and css-select, a related bug that I opened was closed, as broken layouts are not supported by either project. I don't know whether compatibility with Firefox and Chrome in these edge cases is a design goal for jsdom. Still, the layout in the example and the parser's auto-correction logic are relevant here, since the example is technically broken HTML.
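To see how a parser actually arranged the broken markup, one option is to print the element tree after parsing. The following is only a sketch: `outline` is a hypothetical helper, not part of jsdom, and it relies solely on the standard `tagName` and `children` properties of DOM elements.

```javascript
// Hypothetical helper: returns an indented outline of an element tree,
// one lowercase tag name per line, two spaces of indentation per level.
function outline(element, depth = 0) {
  const line = '  '.repeat(depth) + element.tagName.toLowerCase();
  const childLines = Array.from(element.children).map((child) =>
    outline(child, depth + 1)
  );
  return [line, ...childLines].join('\n');
}
```

With the `dom` object from the reproduction above, `console.log(outline(dom.window.document.body))` shows whether the table ended up inside the first p or next to it. Running the same function in a browser console on the jsfiddle page would allow a direct comparison of the two trees.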

philipp-classen avatar Apr 08 '22 19:04 philipp-classen

Yeah, I'm aware this is not valid HTML but I have no control over it in this case at least.

To be more specific, I am trying to extract data from an old website made with Microsoft Word, and its markup is really bad. I had to write a script that extracts the data in a structured way, normalizes the markup (converting elements to appropriate semantic tags based on how they are styled), and so on. It took me a good while to figure out that element selection here is not consistent with actual browsers (at least for invalid HTML like my example).

If one wants/needs to have element selection behavior consistent with the browser, I suggest using something like puppeteer instead.

If it's possible, maybe they can add an option or something that makes the parsing closer to the browser's actual behavior.
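If staying in jsdom is preferable, one possible workaround is to avoid the complex selector inside :not() and filter the paragraphs in plain JavaScript instead. This is only a sketch under an assumption that has not been confirmed in this thread, namely that the failure comes from the `table p` argument to :not() and not from the parsed tree itself. `paragraphsOutsideTables` is a hypothetical helper name; `querySelectorAll` and `Element.closest` are standard DOM methods.

```javascript
// Hypothetical helper: collects every <p> under `root` that has no <table> ancestor.
// Intended as a stand-in for querySelectorAll('p:not(table p)').
function paragraphsOutsideTables(root) {
  return Array.from(root.querySelectorAll('p')).filter(
    (p) => p.closest('table') === null
  );
}
```

With the reproduction above, this would be called as `paragraphsOutsideTables(dom.window.document)`. The result still depends on how the parser nested the elements, so it is worth comparing against the browser on real pages.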

jp06 avatar Apr 09 '22 06:04 jp06