article-extractor Crashes on Pinterest and a lot of other websites

Pages to test on:

https://www.pinterest.ca/variamsingh87/
https://www.pinterest.com.au/seriako/

Code:

import { extract } from '@extractus/article-extractor'
const input = 'https://www.pinterest.ca/variamsingh87/'
await extract(input)

Error:

TypeError: Cannot read properties of null (reading 'tagName')
    at Readability._grabArticle (/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:1150:37)
    at Readability.parse (/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:2277:31)
    at default (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/extractWithReadability.js:18:25)
    at file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/parseFromHtml.js:88:14
    at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:38
    at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:40
    at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:40
    at default (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/parseFromHtml.js:98:19)
    at extract (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/main.js:24:10)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

I presume that the bug is somewhere inside the linkedom package, DOMParser class.

Jan 24 '24 00:01 koresar

@koresar unfortunately, this library works best with simple article formats. Complex layouts like lists and grids can be tricky for it to handle.

Jan 24 '24 03:01 ndaidong

I believe it would work fine if the linkedom was less buggy.

Jan 24 '24 07:01 koresar

I believe it would work fine if the linkedom was less buggy.

maybe you can try to use happydom for testing

Jan 24 '24 07:01 SettingDust

@SettingDust @koresar glad to hear about happydom, I will try it.

Jan 24 '24 07:01 ndaidong

People generally recommend using jsdom with readability. FYI

Jan 25 '24 02:01 koresar

People generally recommend using jsdom with readability. FYI

It's too heavy. We used to use it. The Happy DOM slogan is JSDOM alternative

Happy DOM focuses heavily on performance and can be used as an alternative to JSDOM.

Jan 25 '24 03:01 SettingDust

After some tests. Happy DOM doesn't support many pseudo classes like :not, :is make it isn't an option for now. I'll look into the linkedom and readability for this issue

Jan 26 '24 09:01 SettingDust

Like https://github.com/mozilla/readability/issues/836#issuecomment-1911720374 said. The pinterest returned html is wrong

Readability won't work as expected I think.

I'm going to try catch the content parser and don't include content property when error. How are you thinking about this? @ndaidong

Jan 26 '24 09:01 SettingDust

@SettingDust thank you. I've tried happydom in another project but not happy with it!

Anw, if the HTML structrure is not well-formed, no library can help. And yes, I think we should catch the error like this.

Jan 26 '24 09:01 ndaidong

Anw, if the HTML structrure is not well-formed, no library can help. And yes, I think we should catch the error like this.

Browser, JSDOM, or happy dom will handle them. But not linkedom JSDOM and happydom will wrap the tags in body XD

linkedom just keep them

I've tried happydom in another project but not happy with it!

Aw. I'm a little curious, what's the problem?

Jan 26 '24 09:01 SettingDust

I would like to use JSDOM, if it behaves well with pinterest. (Not sure how heavy it is, but I do not care much. I'd better have reliable slow code rather unreliable but fast.)

Or can I pass my DOM object instead of the HTML as a string?

Jan 29 '24 00:01 koresar

Aw. I'm a little curious, what's the problem?

@SettingDust I'm crawling content from some websites. When I replace linkedom with happydom, the rate of failed cases will increase about 30%. So I had to continue going with linkedom. No time to try new solution in this short project.

Screenshot from 2024-01-29 09-34-54

@koresar the main purpose of this lib is extract article content only. If you want to extract a special kind of content, such as Pinterest, it's better to fetch the content directly and parse them with any DOM utils you can.

Jan 29 '24 02:01 ndaidong

I am not after special kind of content. I'm after very random pages on the internet. :) It fails not only on the pinterest, but a large amount of other websites.

I don't see JSDOM as a performance bottleneck. Do you?

Jan 29 '24 03:01 koresar

Or can I pass my DOM object instead of the HTML as a string?

That's a point I mentioned before. I'm going to do that in new extractus project.

Jan 29 '24 04:01 SettingDust

I am not after special kind of content. I'm after very random pages on the internet. :) It fails not only on the pinterest, but a large amount of other websites.

Can you provide the list of failed websites plz?

I don't see JSDOM as a performance bottleneck. Do you?

JSDOM is slower and larger memory alloc than linkedom. https://github.com/WebReflection/linkedom#benchmarks

 linkedom  benchmark for  ./html.html             

initial heap: 7.73 MB

 parsing cold: 1.282s

document heap: 347.37 MB

 html.normalize() cold: 31.981ms

 crawling childNodes cold: 91.346ms
 crawling childNodes  hot: 77.313ms
 total childNodes 580234

 crawling children cold: 54.539ms
 crawling children  hot: 50.959ms
 total children 270313

after crawling heap: 350.01 MB

 html.cloneNode(true) cold: 1.129s
 cloning: OK
 outerHTML: OK
 outcome: OK

after cloning heap: 829.71 MB

 querySelectorAll("div") cold: 37.924ms
 querySelectorAll("div")  hot: 35.58ms
 total div 3714

 getElementsByTagName("p") cold: 34.354ms
 getElementsByTagName("p")  hot: 32.025ms
 total p 18800

after querying heap: 839.56 MB

 removing divs: 2.116ms

 div count cold: 19.291ms
 total div 0

 p count cold: 22.322ms
 total p 15212

after removing divs heap: 837.12 MB

 html.cloneNode(true) cold: 600.18ms
 cloneNode: OK

 html.outerHTML cold: 190.948ms
 html.outerHTML  hot: 177.362ms
 outcome: OK

 html.innerHTML: 5.756s

total benchmark time: 11.510s
total heap memory: 1383.01 MB

 linkedom cached  benchmark for  ./html.html             

initial heap: 7.06 MB

 parsing cold: 1.340s

document heap: 348.8 MB

 html.normalize() cold: 57.576ms

 crawling childNodes cold: 212.709ms
 crawling childNodes  hot: 100.948ms
 total childNodes 580234

 crawling children cold: 151.29ms
 crawling children  hot: 50.785ms
 total children 270313

after crawling heap: 415.36 MB

 html.cloneNode(true) cold: 1.329s
 cloning: OK
 outerHTML: OK
 outcome: OK

after cloning heap: 891.2 MB

 querySelectorAll("div") cold: 43.107ms
 querySelectorAll("div")  hot: 0.02ms
 total div 3714

 getElementsByTagName("p") cold: 49.835ms
 getElementsByTagName("p")  hot: 0.092ms
 total p 18800

after querying heap: 894.28 MB

 removing divs: 3.51ms

 div count cold: 26.572ms
 total div 0

 p count cold: 31.736ms
 total p 15212

after removing divs heap: 901.92 MB

 html.cloneNode(true) cold: 642.068ms
 cloneNode: OK

 html.outerHTML cold: 211.345ms
 html.outerHTML  hot: 202.012ms
 outcome: OK

 html.innerHTML: 9.180s

total benchmark time: 15.569s
total heap memory: 1468.02 MB

 JSDOM  benchmark for  ./html.html             

initial heap: 21.84 MB

 parsing cold: 5.250s

document heap: 1144.18 MB

 html.normalize() cold: 440.291ms

 crawling childNodes cold: 1.408s
 crawling childNodes  hot: 602.199ms
 total childNodes 580243

 crawling children cold: 631.032ms
 crawling children  hot: 311.591ms
 total children 270322

after crawling heap: 1542.26 MB

 html.cloneNode(true) cold: 2.320s
 cloning: OK
 outerHTML: OK
 outcome: OK

after cloning heap: 2642.37 MB

 querySelectorAll("div") cold: 180.664ms
 querySelectorAll("div")  hot: 38.099ms
 total div 3714

 getElementsByTagName("p") cold: 177.417ms
 getElementsByTagName("p")  hot: 4.433ms
 total p 18800

after querying heap: 2549.03 MB

 removing divs: 8.520s

 div count cold: 61.379ms
 total div 0

 p count cold: 66.741ms
 total p 15212

after removing divs heap: 2561.58 MB

 html.cloneNode(true) cold: 1.646s
 cloneNode: OK

 html.outerHTML cold: 278.583ms
 html.outerHTML  hot: 276.089ms
 outcome: OK

 html.innerHTML: 6.017s

total benchmark time: 30.761s
total heap memory: 3368.16 MB

JSDOM is slower than happydom https://github.com/capricorn86/happy-dom#performance

Jan 29 '24 04:01 SettingDust

Indeed. The numbers are wild. Okay. Got ya.

Jan 29 '24 05:01 koresar

article-extractor article-extractor copied to clipboard

Crashes on Pinterest and a lot of other websites

article-extractor
article-extractor copied to clipboard