node-html-parser Wrong output on malformed HTML

Open amartini opened this issue 1 year ago • 0 comments

I know it's hard to predict every possible kind of malformed HTML, but I came across this case while scraping a website. The misplaced apostrophe before the > of the <a> tag makes the parser skip the rest of the row. Browsers render this markup correctly (the invalid token is discarded). If you remove the ', the code runs correctly.

import { parse } from 'node-html-parser';

const html = `
<table id="mytable">
<tr class="myrow">
  <td>1</td>
  <td><a href="#" 2'>x</a></td>
  <td>2</td>
</tr>
<tr class="myrow">
  <td>3</td>
  <td><a href="#" 2'>x</a></td>
  <td>4</td>
</tr>
</table>
`;

const root = parse(html);

for (let tr of root.querySelectorAll("#mytable tr.myrow")) {
  console.log(tr.querySelectorAll(":scope > td").map(e => e.innerText));
}
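
Until the parser handles this case, one possible workaround is to clean the markup before parsing. The sketch below uses a hypothetical helper, stripStrayQuotes (not part of node-html-parser), that drops quote characters inside a start tag when they do not delimit an attribute value. It assumes attribute values never contain < or >, and it has not been checked against every kind of malformed input.

```javascript
// Hypothetical helper (not part of node-html-parser): removes quote
// characters inside a start tag that do not delimit an attribute value.
// Assumption: attribute values contain no "<" or ">" characters.
function stripStrayQuotes(html) {
  // Match start tags only; text content and closing tags are left untouched.
  return html.replace(/<[a-zA-Z][^<>]*>/g, (tag) =>
    // Keep well-formed quoted values (="..." or ='...'); drop any other quote.
    tag.replace(/=\s*("[^"]*"|'[^']*')|["']/g, (match, quoted) =>
      quoted !== undefined ? match : ""
    )
  );
}

const cleaned = stripStrayQuotes(`<td><a href="#" 2'>x</a></td>`);
console.log(cleaned); // <td><a href="#" 2>x</a></td>
```

The cleaned string can then be passed to parse(...) in place of the original html. Whether that fully restores the skipped cells should be verified against the real page.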

Feb 21 '24 20:02 amartini