node-html-parser
Wrong output on malformed HTML
I know it's hard to anticipate every kind of malformed HTML, but I came across this case while scraping a website. A stray apostrophe just before the > of the <a> tag makes the parser skip the rest of the row. Browsers render this markup correctly (the invalid token is discarded). If you remove the ', the code runs correctly.
import { parse } from 'node-html-parser';
const html = `
<table id="mytable">
<tr class="myrow">
<td>1</td>
<td><a href="#" 2'>x</a></td>
<td>2</td>
</tr>
<tr class="myrow">
<td>3</td>
<td><a href="#" 2'>x</a></td>
<td>4</td>
</tr>
</table>
`;
const root = parse(html);
for (const tr of root.querySelectorAll("#mytable tr.myrow")) {
  console.log(tr.querySelectorAll(":scope > td").map(e => e.innerText));
}
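Until the parser handles this case, one possible workaround is to clean the markup before parsing. The sketch below relies only on the observation above that the code runs correctly once the ' is removed. `stripStrayQuotes` is a hypothetical helper, not part of node-html-parser. It assumes that a quote character inside a start tag is legitimate only when it delimits an attribute value directly after `=`, and it drops any other quote.

```typescript
// Hypothetical pre-processing helper (NOT part of node-html-parser).
// Assumption: inside a start tag, a quote is valid only as the delimiter of
// an attribute value that directly follows "=". Any other quote is dropped.
function stripStrayQuotes(html: string): string {
  // A start tag: "<" + a letter, then any mix of properly quoted attribute
  // values (which may contain ">") or other non-">" characters, then ">".
  const startTag = /<[a-zA-Z][^\s>]*(?:=\s*"[^"]*"|=\s*'[^']*'|[^>])*>/g;

  // Inside a tag: either a properly quoted value (captured and kept as-is)
  // or a lone quote character (removed).
  const quotedOrStray = /(=\s*"[^"]*"|=\s*'[^']*')|["']/g;

  return html.replace(startTag, (tag: string) =>
    tag.replace(quotedOrStray, (_match: string, quoted: string | undefined) =>
      quoted !== undefined ? quoted : ""
    )
  );
}

// The malformed cell from the report, with the stray apostrophe removed.
const cleaned = stripStrayQuotes(`<td><a href="#" 2'>x</a></td>`);
console.log(cleaned);
```

In the reproduction above, this would be applied as `parse(stripStrayQuotes(html))`. This is a regex-based sketch, so it will also rewrite tag-like text inside `<script>` blocks or comments, and it has not been checked against other kinds of malformed markup.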