crawlsite.js crashes on PDFs

Open minthemiddle opened this issue 6 years ago • 10 comments

When the script reaches a PDF, it crashes.

Example:

(node:23872) UnhandledPromiseRejectionWarning: Error: net::ERR_ABORTED at https://code.design/files/code-design-magazine-001.pdf
    at navigate (/Users/martin/Sites/crawlsite/node_modules/puppeteer/lib/Page.js:539:37)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:23872) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:23872) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
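
The trace above shows the rejection from Puppeteer's navigation going unhandled. A minimal sketch of catching it so the crawl can continue is below. `tryGoto` is a hypothetical helper, not something from crawlsite.js, and it assumes only that `page` exposes an async `goto(url)` method, as a Puppeteer `Page` does.

```javascript
// Hypothetical wrapper (not part of crawlsite.js).
// Assumes `page` has an async goto(url) method, e.g. a Puppeteer Page.
async function tryGoto(page, url) {
  try {
    // Resolves with whatever page.goto() resolves with.
    return await page.goto(url);
  } catch (err) {
    // Navigation failed (for example net::ERR_ABORTED on a PDF link).
    // Log and return null instead of leaving the rejection unhandled.
    console.warn(`Skipping ${url}: ${err.message}`);
    return null;
  }
}
```

A caller would then skip any URL for which `tryGoto` returns `null`. This only stops the crash; it does not identify why a navigation was aborted.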

minthemiddle avatar Jun 07 '18 16:06 minthemiddle

Good catch. Do you have the starting page you were running it on? That'll help me debug.

ebidel avatar Jun 07 '18 16:06 ebidel

Yes, my non-profit: https://code.design

minthemiddle avatar Jun 07 '18 16:06 minthemiddle

@ebidel Any progress on the PDF crash issue? This is a really cool project!

aamakerlsa avatar Oct 13 '18 02:10 aamakerlsa

I found a way around this by making the following modification:

.filter(el => el.localName === 'a' && el.href && el.href.indexOf('.pdf') < 0) // element is an anchor with an href.

... basically it checks to make sure the href of the a tag does NOT contain .pdf
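
A slightly stricter variant of the same idea, as a sketch: `indexOf('.pdf')` also matches URLs such as `report.pdf.html` and misses `FILE.PDF`, so this version looks only at the end of the URL path and ignores case. `hasPdfExtension` is a hypothetical helper name, and it assumes the href is an absolute URL (which `el.href` is in the DOM).

```javascript
// Hypothetical helper (not part of crawlsite.js).
// Returns true when the URL path ends in ".pdf", ignoring case,
// query strings, and fragments.
function hasPdfExtension(href) {
  try {
    const {pathname} = new URL(href);
    return pathname.toLowerCase().endsWith('.pdf');
  } catch (err) {
    // Not a parseable absolute URL; treat it as a non-PDF link.
    return false;
  }
}
```

The filter condition would then become `el.href && !hasPdfExtension(el.href)`.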

aamakerlsa avatar Oct 13 '18 03:10 aamakerlsa

@aamakerlsa Right, it would be something like that. However, not every PDF link contains ".pdf" in the name :)
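
One option that does not depend on the file name is the `Content-Type` response header. The sketch below only parses a header value. `isPdfContentType` is a hypothetical helper, and how the header is obtained (for example from a separate HEAD request) is left open as an assumption. Servers can also mislabel files, so this is a heuristic rather than a guarantee.

```javascript
// Hypothetical helper (not part of crawlsite.js).
// Takes a raw Content-Type header value and reports whether its
// media type is application/pdf, ignoring parameters and case.
function isPdfContentType(contentType) {
  if (typeof contentType !== 'string') return false;
  // "application/pdf; charset=binary" -> "application/pdf"
  const mediaType = contentType.split(';')[0].trim().toLowerCase();
  return mediaType === 'application/pdf';
}
```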

ebidel avatar Oct 16 '18 16:10 ebidel

Can I work on this issue?

TruptiM18 avatar Jan 13 '19 09:01 TruptiM18

Sure

ebidel avatar Jan 14 '19 19:01 ebidel

@ebidel Thanks. Could we just read the header of the file that the href points to, in hex, and figure out whether it is a PDF or not? (Pdf File Format Basic Structure)
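
As a rough sketch of that header check: a PDF file starts with the ASCII signature `%PDF-` (hex `25 50 44 46 2D`). `startsWithPdfSignature` is a hypothetical helper that assumes the first bytes of the response have already been fetched into a `Buffer` or `Uint8Array`; fetching them is not shown. It checks offset 0 only.

```javascript
// Hypothetical helper (not part of crawlsite.js).
// "%PDF-" as bytes: 0x25 0x50 0x44 0x46 0x2D.
const PDF_SIGNATURE = Buffer.from('%PDF-', 'latin1');

// Returns true when the given bytes begin with the PDF signature.
function startsWithPdfSignature(bytes) {
  const buf = Buffer.from(bytes);
  if (buf.length < PDF_SIGNATURE.length) return false;
  return buf.subarray(0, PDF_SIGNATURE.length).equals(PDF_SIGNATURE);
}
```

Only the first five bytes are needed for this check, so the full body would not have to be held in memory if the first chunk can be read on its own.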

TruptiM18 avatar Jan 21 '19 02:01 TruptiM18

Hi @ebidel, Did you get a chance to look into the above query? Thanks.

TruptiM18 avatar Jan 29 '19 05:01 TruptiM18

Not sure if that would work but you could try. You'd have to read the response body of every request though :(

ebidel avatar Jan 29 '19 21:01 ebidel