Crashes on Pinterest and a lot of other websites

Open • koresar opened this issue 5 months ago • 16 comments

Pages to test on:

  • https://www.pinterest.ca/variamsingh87/
  • https://www.pinterest.com.au/seriako/

Code:

import { extract } from '@extractus/article-extractor'
const input = 'https://www.pinterest.ca/variamsingh87/'
await extract(input)

Error:

TypeError: Cannot read properties of null (reading 'tagName')
    at Readability._grabArticle (/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:1150:37)
    at Readability.parse (/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:2277:31)
    at default (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/extractWithReadability.js:18:25)
    at file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/parseFromHtml.js:88:14
    at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:38
    at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:40
    at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:40
    at default (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/parseFromHtml.js:98:19)
    at extract (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/main.js:24:10)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

I presume the bug is somewhere inside the linkedom package, in its DOMParser class.
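
Until the root cause is found, the crash can at least be contained on the caller's side. Below is a minimal sketch, assuming it is acceptable to get `null` back instead of an uncaught TypeError. `safeExtract` is a hypothetical helper name, not part of @extractus/article-extractor; it takes the extractor function as a parameter so it makes no assumptions about the library beyond `extract(input)` returning a promise.

```javascript
// Hypothetical helper (not part of @extractus/article-extractor).
// Calls the given extractor and turns any thrown error into null,
// so one bad page does not bring down the whole process.
async function safeExtract(extractFn, input) {
  try {
    return await extractFn(input)
  } catch (err) {
    // Log the failure so it stays visible, then fall back to null.
    console.error(`extract failed for ${input}: ${err.message}`)
    return null
  }
}
```

With `extract` imported as in the code above, the call would look like `await safeExtract(extract, 'https://www.pinterest.ca/variamsingh87/')`. This only hides the symptom; the TypeError inside Readability still needs a proper fix.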

koresar, Jan 24 '24 00:01