node-html-parser
100% CPU while parsing document
This simple code hangs, causing Kubernetes to kill the pod:
import { parse } from 'node-html-parser';
// html: the page source of https://www.a1supplements.com/ (loading code omitted)
const root = parse(html);
I use:
"node-html-parser": "^6.1.10",
Host: https://www.a1supplements.com/, HTML size: 4,620,497 characters
I have these from node inspect:
break in node_modules/node-html-parser/dist/nodes/html.js:1192
1190 oneBefore.removeChild(last);
1191 last.childNodes.forEach(function (child) {
>1192 oneBefore.appendChild(child);
1193 });
1194 }
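The hot line in the debugger output sits in a loop that removes an element (`last`) and re-appends each of its children to `oneBefore`. One way such a loop can become slow is if every re-append first detaches the child with a linear scan of its old parent's child list, so that hoisting n children costs on the order of n² steps. The sketch below is a toy model of that idea only: `ToyNode`, `detach`, and `hoistCost` are hypothetical names, and whether node-html-parser actually behaves this way is an assumption that the thread does not confirm.

```typescript
// Toy model only: this is NOT node-html-parser's implementation.
class ToyNode {
  parent: ToyNode | null = null;
  children: ToyNode[] = [];
}

// Counts how many array elements are visited while detaching nodes.
let scanned = 0;

// Assumption: detaching a child scans the old parent's whole child list.
function detach(child: ToyNode): void {
  const parent = child.parent;
  if (parent === null) return;
  parent.children = parent.children.filter((c) => {
    scanned++;
    return c !== child;
  });
  child.parent = null;
}

function appendChild(parent: ToyNode, child: ToyNode): void {
  detach(child); // a node can only have one parent
  parent.children.push(child);
  child.parent = parent;
}

// Same shape as the loop in the debugger output: drop `last`,
// then move each of its n children up to `oneBefore`.
function hoistCost(n: number): number {
  const oneBefore = new ToyNode();
  const last = new ToyNode();
  appendChild(oneBefore, last);
  for (let i = 0; i < n; i++) appendChild(last, new ToyNode());

  scanned = 0; // measure only the hoisting step
  detach(last);
  // Iterate over a copy because detach() replaces last.children.
  for (const child of [...last.children]) appendChild(oneBefore, child);
  return scanned;
}

console.log(hoistCost(100), hoistCost(1000)); // 5051 500501
```

In this model a tenfold increase in children costs roughly a hundred times more work, which is the kind of growth that could turn a multi-megabyte page with many unclosed tags into minutes of parsing.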
The contents (in case the website gets updated):
Actually, it finished parsing in 7 minutes on my MacBook Pro with an i7. Is that considered correct?
As a workaround I use this (I mostly need data from only part of the page):
const root = parse(html, {
  parseNoneClosedTags: false,
  fixNestedATags: false,
  blockTextElements: {
    'div': true,
    'p': true,
    'pre': true
  }
});
const title = option(root.querySelector('title'))
  .map(x => x.text)
  .filter(x => !!x && x.trim().length > 0);
I'm sorry, but I could not find any clue about your use case. I could not even find a title element. I don't have a MacBook either. But I parsed the file you uploaded and it finished parsing immediately.
Using:
...
"dependencies": {
  "axios": "^1.6.2",
  "node-html-parser": "^6.1.10"
}
...
This is how to reproduce it (using default options in parse):
import parse from 'node-html-parser';
import axios from 'axios';

async function runImpl() {
  const url = 'https://www.a1supplements.com/';
  const resp = await axios.get(url);
  const html = resp.data;
  // Time only the parse call, not the download.
  const start = Date.now();
  parse(html);
  const duration = Date.now() - start;
  console.log(`Parsing took: ${duration.toLocaleString()}ms, document size: ${html.length.toLocaleString()} chars`);
}

runImpl().then(() => console.log('done'));
result:
Parsing took: 383,073ms, document size: 4,705,196 chars
done
Using custom options in parse:
...
const start = Date.now();
parse(html, {
  parseNoneClosedTags: false,
  fixNestedATags: false,
  blockTextElements: {
    'div': true,
    'p': true,
    'pre': true,
    script: true,
    noscript: true,
    style: true,
  }
});
const duration = Date.now() - start;
...
results:
Parsing took: 284ms, document size: 4,705,196 chars
done
I believe that adding div to blockTextElements is what generally fixes the issue.
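The two measurements above can be collected in a single run with a small harness. This is a minimal sketch rather than code from the thread: `compareParseOptions`, `ParseFn`, and `Timing` are hypothetical names, and the parser is passed in as a parameter so the harness itself has no dependency on the library.

```typescript
// Hypothetical harness: times one parse function under several named option sets.
type ParseFn = (html: string, options?: Record<string, unknown>) => unknown;

interface Timing {
  label: string;
  ms: number;
}

function compareParseOptions(
  parseFn: ParseFn,
  html: string,
  optionSets: Record<string, Record<string, unknown> | undefined>,
): Timing[] {
  const timings: Timing[] = [];
  for (const [label, options] of Object.entries(optionSets)) {
    // Same Date.now() approach as the reproduction above.
    const start = Date.now();
    parseFn(html, options);
    timings.push({ label, ms: Date.now() - start });
  }
  return timings;
}

// Option sets copied from the two runs reported in this thread.
const optionSets = {
  defaults: undefined,
  custom: {
    parseNoneClosedTags: false,
    fixNestedATags: false,
    blockTextElements: {
      div: true,
      p: true,
      pre: true,
      script: true,
      noscript: true,
      style: true,
    },
  },
};
```

With the real library this would presumably be called as `compareParseOptions(parse, html, optionSets)` after `import { parse } from 'node-html-parser'`, printing the result with `console.table`.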
I don't think we should block div elements. Maybe the HTML is broken; in that case the option parseNoneClosedTags: true will speed things up.
parseNoneClosedTags: true
Not sure if it was related, but had a big html page and adding this fixed it.
Curious what this means, @taoqf. I couldn't find any documentation or issues around it.
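For reference, the two approaches discussed in this thread can be written side by side. The comments reflect my reading of the node-html-parser README and of the replies above, so treat them as assumptions rather than authoritative documentation. `ParseOptionsSketch`, `keepUnclosedTags`, and `textOnlyDivs` are illustrative names, not part of the library.

```typescript
// Local type covering only the option names that appear in this thread.
interface ParseOptionsSketch {
  parseNoneClosedTags?: boolean;
  fixNestedATags?: boolean;
  blockTextElements?: Record<string, boolean>;
}

// Maintainer's suggestion. As I understand the README, this closes
// unclosed tags instead of removing them.
const keepUnclosedTags: ParseOptionsSketch = {
  parseNoneClosedTags: true,
};

// Workaround from earlier in the thread: keep the content of these
// elements as text instead of parsing it into child elements.
// Assumed consequence: elements nested inside a <div> can no longer
// be reached with querySelector.
const textOnlyDivs: ParseOptionsSketch = {
  parseNoneClosedTags: false,
  fixNestedATags: false,
  blockTextElements: { div: true, p: true, pre: true },
};
```

Either object would be passed as the second argument, for example `parse(html, keepUnclosedTags)`. The first keeps the document tree queryable, which is probably why the maintainer prefers it over blocking div elements.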