node-html-parser

Regression: Versions >= v5.3.2 are unable to parse specific link

Open stalgiag opened this issue 1 year ago • 4 comments

I work on a project that validates its links using this library. One of the links we validate frequently is the HTML spec at https://html.spec.whatwg.org/. That page is one of the larger HTML files on the web, but up to and including release 5.3.1, node-html-parser parsed it successfully in approximately 23 seconds on my local machine. Starting with release 5.3.2, parsing never completes.

Consider this example:

const HTMLParser = require('node-html-parser');
const nFetch = require('node-fetch');

async function parseHTMLSpec() {
  try {
    const response = await nFetch('https://html.spec.whatwg.org/');
    const html = await response.text();

    console.log('Fetched HTML. Attempting to parse...');
    console.time('parseHTMLSpec');
    const parsedHTML = HTMLParser.parse(html);
    console.timeEnd('parseHTMLSpec');

    console.log('HTML parsed successfully.');
    console.log('Title:', parsedHTML.querySelector('title').text);
  } catch (error) {
    console.error('Error occurred:', error);
  }
}

parseHTMLSpec();

With node-html-parser 5.3.1, this outputs the following:

Fetched HTML. Attempting to parse...
parseHTMLSpec: 23.415s
HTML parsed successfully.
Title: HTML Standard

With node-html-parser 5.3.2, the same script hangs indefinitely, printing only the following even after running for hours:

Fetched HTML. Attempting to parse...
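Because `parse` is synchronous, a hang like this blocks the event loop, so a `setTimeout` or `Promise.race` in the same thread can never fire. As a stopgap for a link validator, the parse can be run in a worker thread that is terminated after a deadline. The following is only a minimal sketch, not something node-html-parser provides: `runWithTimeout`, `parseWorkerSource`, and `parseWithTimeout` are hypothetical names, and it assumes the hang is ordinary JavaScript execution that `worker.terminate()` can interrupt and that `node-html-parser` is resolvable from the worker.

```javascript
const { Worker } = require('worker_threads');

// Hypothetical helper (not part of node-html-parser): evaluates `workerSource`
// in a worker thread and resolves with the first message it posts. If nothing
// arrives within `timeoutMs`, the worker is terminated and the promise rejects.
function runWithTimeout(workerSource, workerData, timeoutMs) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData });
    let settled = false;

    // Settle exactly once, then clean up the timer and the worker.
    const finish = (settle, value) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      worker.terminate();
      settle(value);
    };

    const timer = setTimeout(
      () => finish(reject, new Error(`Timed out after ${timeoutMs} ms`)),
      timeoutMs
    );

    worker.once('message', (value) => finish(resolve, value));
    worker.once('error', (err) => finish(reject, err));
    worker.once('exit', (code) =>
      finish(reject, new Error(`Worker exited with code ${code} before replying`))
    );
  });
}

// Worker body for this issue's scenario: parse the HTML and send back only the
// title, since the parsed tree itself cannot be posted across threads as-is.
const parseWorkerSource = `
  const { parentPort, workerData } = require('worker_threads');
  const { parse } = require('node-html-parser');
  const root = parse(workerData.html);
  const title = root.querySelector('title');
  parentPort.postMessage({ title: title ? title.text : null });
`;

// Assumed usage: reject instead of hanging if parsing exceeds the deadline.
function parseWithTimeout(html, timeoutMs) {
  return runWithTimeout(parseWorkerSource, { html }, timeoutMs);
}
```

In the repro above, `HTMLParser.parse(html)` could then be swapped for something like `await parseWithTimeout(html, 60000)`, which would surface the 5.3.2 behavior as a rejected promise rather than a process that never exits. The 60-second deadline is an arbitrary choice for illustration.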

stalgiag • Sep 24 '24 00:09