Request to support PDF scraping

Open Zenpenguin opened this issue 2 years ago • 2 comments

Hi, thank you for this amazing repo. I am trying to use it on a website that also hosts hundreds of PDFs. The crawler is unable to get the content from the PDFs and fails with this error:

PlaywrightCrawler: Request failed and reached maximum retries. page.goto: net::ERR_ABORTED

It would be great if support for crawling PDFs could be added as well.

Nov 30 '23 02:11 Zenpenguin

How can I skip files that the crawler comes across while parsing?

Dec 07 '23 09:12 Gorchakov-Pressure

How can I skip files that the crawler comes across while parsing?

List the extensions you want to exclude in the resourceExclusions option of the config.ts file: resourceExclusions: []
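
As a rough illustration of the suggestion above, a config with a populated resourceExclusions list might look like the sketch below. Only the file name config.ts and the resourceExclusions option come from this thread. The CrawlConfig interface, the other field names, the example.com URLs, the extension format (no leading dot), and the hasExcludedExtension helper are all assumptions made for illustration, and the thread does not say how the crawler applies the list internally.

```typescript
// Hypothetical, trimmed-down stand-in for the project's config type.
// Only `resourceExclusions` is named in this thread; the other fields are assumed.
interface CrawlConfig {
  url: string;
  match: string;
  maxPagesToCrawl: number;
  outputFileName: string;
  resourceExclusions?: string[];
}

const exampleConfig: CrawlConfig = {
  url: "https://example.com/docs",
  match: "https://example.com/docs/**",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  // Assumption: extensions are listed without a leading dot.
  resourceExclusions: ["pdf", "zip", "png", "jpg"],
};

// Hypothetical helper (not part of the project): returns true when a link's
// path ends in one of the excluded extensions, ignoring case and query strings.
function hasExcludedExtension(link: string, exclusions: string[] = []): boolean {
  const pathname = new URL(link).pathname.toLowerCase();
  return exclusions.some((ext) => pathname.endsWith(`.${ext.toLowerCase()}`));
}
```

A helper like this could be used to check a link before it is queued, for example hasExcludedExtension(link, exampleConfig.resourceExclusions). Whether the crawler needs such a check in addition to the config option is not confirmed here.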

Dec 07 '23 11:12 isarikaya