article-extractor
Crashes on Pinterest and a lot of other websites
Pages to test on:
- https://www.pinterest.ca/variamsingh87/
- https://www.pinterest.com.au/seriako/
Code:
import { extract } from '@extractus/article-extractor'
const input = 'https://www.pinterest.ca/variamsingh87/'
await extract(input)
Error:
TypeError: Cannot read properties of null (reading 'tagName')
at Readability._grabArticle (/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:1150:37)
at Readability.parse (/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:2277:31)
at default (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/extractWithReadability.js:18:25)
at file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/parseFromHtml.js:88:14
at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:38
at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:40
at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:40
at default (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/parseFromHtml.js:98:19)
at extract (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/main.js:24:10)
at processTicksAndRejections (node:internal/process/task_queues:96:5)
I presume the bug is somewhere inside the linkedom package, in its DOMParser class.
@koresar unfortunately, this library works best with simple article formats. Complex layouts like lists and grids can be tricky for it to handle.
I believe it would work fine if linkedom were less buggy.
> I believe it would work fine if the linkedom was less buggy.
maybe you can try to use happydom for testing
@SettingDust @koresar glad to hear about happydom, I will try it.
People generally recommend using jsdom with readability. FYI
> People generally recommend using jsdom with readability. FYI
It's too heavy. We used to use it. The Happy DOM slogan is "JSDOM alternative":
> Happy DOM focuses heavily on performance and can be used as an alternative to JSDOM.
After some tests: Happy DOM doesn't support many pseudo-classes like :not and :is, which makes it not an option for now. I'll look into linkedom and readability for this issue.
As https://github.com/mozilla/readability/issues/836#issuecomment-1911720374 says, the HTML returned by Pinterest is malformed, so I think Readability won't work as expected.
I'm going to wrap the content parser in a try/catch and leave out the content property when it errors. What do you think about this? @ndaidong
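A minimal sketch of that approach (a hypothetical helper, not the library's actual internals; a stub stands in for the Readability call, and everything is synchronous for brevity):

```javascript
// Hypothetical helper: run the content parser inside try/catch; when
// it throws, return the metadata we already have, just without a
// `content` property.
function parseArticle(meta, parseContent) {
  try {
    return { ...meta, content: parseContent() }
  } catch (err) {
    // e.g. Readability's "Cannot read properties of null (reading 'tagName')"
    return { ...meta }
  }
}

// Stub parsers standing in for Readability:
const meta = { title: 'Some page', url: 'https://example.com/' }
const ok = parseArticle(meta, () => '<p>article body</p>')
const broken = parseArticle(meta, () => { throw new TypeError('boom') })
console.log('content' in ok)      // true
console.log('content' in broken)  // false
```

So a page that crashes the parser would still yield title, description, and the other metadata, just with no content.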
@SettingDust thank you. I've tried happydom in another project but wasn't happy with it!
Anyway, if the HTML structure is not well-formed, no library can help. And yes, I think we should catch the error like this.
> Anyway, if the HTML structure is not well-formed, no library can help. And yes, I think we should catch the error like this.
Browsers, JSDOM, and happydom will handle them, but not linkedom. JSDOM and happydom will wrap the stray tags in body XD; linkedom just keeps them as they are.
> I've tried happydom in another project but not happy with it!
Aw. I'm a little curious, what's the problem?
I would like to use JSDOM, if it behaves well with Pinterest. (Not sure how heavy it is, but I don't care much; I'd rather have reliable slow code than unreliable but fast code.)
Or can I pass my DOM object instead of the HTML as a string?
> Aw. I'm a little curious, what's the problem?
@SettingDust I'm crawling content from some websites. When I replaced linkedom with happydom, the rate of failed cases increased by about 30%, so I had to keep going with linkedom. There was no time to try a new solution in this short project.
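Since each parser fails on a different subset of pages, one way to hedge between them is a fallback chain. A sketch of the idea, using stub parse functions (in real code the entries could be linkedom, happydom, and JSDOM wrappers — none of which are assumed here):

```javascript
// Hypothetical fallback chain: try each HTML parser in order and use
// the first one that succeeds and returns a document.
function parseWithFallback(html, parsers) {
  const errors = []
  for (const parse of parsers) {
    try {
      const doc = parse(html)
      if (doc) return doc
    } catch (err) {
      errors.push(err)
    }
  }
  throw new AggregateError(errors, 'all parsers failed')
}

// Stub parsers: the first one "crashes" the way linkedom does on
// Pinterest's malformed HTML, the second one is a sturdy fake.
const crashy = () => { throw new TypeError('Cannot read properties of null') }
const sturdy = (html) => ({ body: html, querySelector: () => null })

const doc = parseWithFallback('<p>hi</p>', [crashy, sturdy])
console.log(doc.body) // '<p>hi</p>'
```

The trade-off is obvious: a failed page gets parsed twice (or more), so it only makes sense if the failure rate is low.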
@koresar the main purpose of this lib is to extract article content only. If you want to extract a special kind of content, such as Pinterest, it's better to fetch the content directly and parse it with whatever DOM utilities you like.
I am not after a special kind of content; I'm after very random pages on the internet. :) It fails not only on Pinterest but on a large number of other websites.
I don't see JSDOM as a performance bottleneck. Do you?
> Or can I pass my DOM object instead of the HTML as a string?
That's a point I mentioned before. I'm going to do that in the new extractus project.
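That future API could look something like this — purely a sketch of the idea, not the actual extractus interface. The caller passes either raw HTML or an already-parsed Document-like object (from JSDOM, linkedom, a browser, ...), and `parse` stands in for whatever DOMParser-style function the library is configured with:

```javascript
// Hypothetical input normalizer: accept an HTML string or anything
// Document-like (detected here by a querySelector method).
function toDocument(input, parse) {
  if (typeof input === 'string') return parse(input)
  if (input && typeof input.querySelector === 'function') return input
  throw new TypeError('expected an HTML string or a Document-like object')
}

// Usage with a stub parser and a stub pre-parsed document:
const stubParse = (html) => ({ html, querySelector: () => null })
const fromString = toDocument('<p>hi</p>', stubParse)
const fakeDoc = { querySelector: () => null }
const fromDoc = toDocument(fakeDoc, stubParse)
console.log(fromString.html)     // '<p>hi</p>'
console.log(fromDoc === fakeDoc) // true
```

Accepting a pre-parsed document would let callers clean up malformed markup with the parser they trust before the extractor ever sees it.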
> I am not after special kind of content. I'm after very random pages on the internet. :) It fails not only on the pinterest, but a large amount of other websites.
Can you provide the list of failed websites plz?
> I don't see JSDOM as a performance bottleneck. Do you?
JSDOM is slower and allocates more memory than linkedom: https://github.com/WebReflection/linkedom#benchmarks
linkedom benchmark for ./html.html
initial heap: 7.73 MB
parsing cold: 1.282s
document heap: 347.37 MB
html.normalize() cold: 31.981ms
crawling childNodes cold: 91.346ms
crawling childNodes hot: 77.313ms
total childNodes 580234
crawling children cold: 54.539ms
crawling children hot: 50.959ms
total children 270313
after crawling heap: 350.01 MB
html.cloneNode(true) cold: 1.129s
cloning: OK
outerHTML: OK
outcome: OK
after cloning heap: 829.71 MB
querySelectorAll("div") cold: 37.924ms
querySelectorAll("div") hot: 35.58ms
total div 3714
getElementsByTagName("p") cold: 34.354ms
getElementsByTagName("p") hot: 32.025ms
total p 18800
after querying heap: 839.56 MB
removing divs: 2.116ms
div count cold: 19.291ms
total div 0
p count cold: 22.322ms
total p 15212
after removing divs heap: 837.12 MB
html.cloneNode(true) cold: 600.18ms
cloneNode: OK
html.outerHTML cold: 190.948ms
html.outerHTML hot: 177.362ms
outcome: OK
html.innerHTML: 5.756s
total benchmark time: 11.510s
total heap memory: 1383.01 MB
linkedom cached benchmark for ./html.html
initial heap: 7.06 MB
parsing cold: 1.340s
document heap: 348.8 MB
html.normalize() cold: 57.576ms
crawling childNodes cold: 212.709ms
crawling childNodes hot: 100.948ms
total childNodes 580234
crawling children cold: 151.29ms
crawling children hot: 50.785ms
total children 270313
after crawling heap: 415.36 MB
html.cloneNode(true) cold: 1.329s
cloning: OK
outerHTML: OK
outcome: OK
after cloning heap: 891.2 MB
querySelectorAll("div") cold: 43.107ms
querySelectorAll("div") hot: 0.02ms
total div 3714
getElementsByTagName("p") cold: 49.835ms
getElementsByTagName("p") hot: 0.092ms
total p 18800
after querying heap: 894.28 MB
removing divs: 3.51ms
div count cold: 26.572ms
total div 0
p count cold: 31.736ms
total p 15212
after removing divs heap: 901.92 MB
html.cloneNode(true) cold: 642.068ms
cloneNode: OK
html.outerHTML cold: 211.345ms
html.outerHTML hot: 202.012ms
outcome: OK
html.innerHTML: 9.180s
total benchmark time: 15.569s
total heap memory: 1468.02 MB
JSDOM benchmark for ./html.html
initial heap: 21.84 MB
parsing cold: 5.250s
document heap: 1144.18 MB
html.normalize() cold: 440.291ms
crawling childNodes cold: 1.408s
crawling childNodes hot: 602.199ms
total childNodes 580243
crawling children cold: 631.032ms
crawling children hot: 311.591ms
total children 270322
after crawling heap: 1542.26 MB
html.cloneNode(true) cold: 2.320s
cloning: OK
outerHTML: OK
outcome: OK
after cloning heap: 2642.37 MB
querySelectorAll("div") cold: 180.664ms
querySelectorAll("div") hot: 38.099ms
total div 3714
getElementsByTagName("p") cold: 177.417ms
getElementsByTagName("p") hot: 4.433ms
total p 18800
after querying heap: 2549.03 MB
removing divs: 8.520s
div count cold: 61.379ms
total div 0
p count cold: 66.741ms
total p 15212
after removing divs heap: 2561.58 MB
html.cloneNode(true) cold: 1.646s
cloneNode: OK
html.outerHTML cold: 278.583ms
html.outerHTML hot: 276.089ms
outcome: OK
html.innerHTML: 6.017s
total benchmark time: 30.761s
total heap memory: 3368.16 MB
JSDOM is slower than happydom https://github.com/capricorn86/happy-dom#performance
Indeed. The numbers are wild. Okay. Got ya.