
save_website/crawl() does not download PDF

Open · chstrehlow opened this issue 5 years ago · 4 comments

I tried to clone a complete website and noticed that the PDF files were skipped. This is the code I currently use:

config.setup_config(
    project_url=URL,
    project_folder=ProjectFolder,
    project_name=ProjectName,
    bypass_robots=True,
)

crawler = Crawler()
crawler.crawl()

But calling save_website directly:

save_website(
    url='http://example-site.com/index.html',
    project_folder='path/to/downloads',
    **kwargs
)

produced the same result.

One of the URLs I tested was https://www.akkufit-berger.de/kataloge/#akkus. As far as I can see, the PDF extension is part of “safe_file_exts”, which is the default option.

Even if I point the URL directly at the PDF file, it just downloads an HTML file that has a different file size than the original PDF and cannot be opened in a browser or a PDF viewer.

chstrehlow avatar Jan 22 '20 09:01 chstrehlow

The PDFs are not downloaded because they are not on the same domain as the project URL, so the process marks them as external and skips them entirely.

rajatomar788 avatar Jan 23 '20 12:01 rajatomar788

But the links point to the same domain: https://www.akkufit-berger.de/kataloge/#akkus and https://www.akkufit-berger.de/wp-content/uploads/2018/10/EndkundenKatalog-Back-Up-Akkus.pdf. The IP address is also the same?

I already noticed this “domain behavior” on a different site. There was a link pointing to the same server, but it was missing the “www” (http://example.com/file.ext instead of http://www.example.com/file.ext), and it seems it was treated as an external link. Is there a way to whitelist external domains or to use placeholders?
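
For illustration only (this is not pywebcopy's actual check, just a sketch of the general problem using the example.com placeholders above): a strict hostname comparison treats the www and non-www forms as different hosts, which would explain why such links get flagged as external.

from urllib.parse import urlsplit

def host(url):
    return urlsplit(url).hostname or ""

project_host = host("http://www.example.com/index.html")
link_host = host("http://example.com/file.ext")

# A strict equality check sees two different hosts, so the second URL would be
# classified as external even though it points at the same server.
print(project_host == link_host)        # False: 'www.example.com' != 'example.com'

# A more tolerant check could strip a leading "www." before comparing.
def strip_www(hostname):
    return hostname[4:] if hostname.startswith("www.") else hostname

print(strip_www(project_host) == strip_www(link_host))   # True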

chstrehlow avatar Jan 23 '20 12:01 chstrehlow

I just checked the project again. No, it doesn't allow PDF downloading as of now, to avoid bandwidth issues. It could be available in future versions.

rajatomar788 avatar Jan 23 '20 15:01 rajatomar788

Whitelisting is not available in the current version.

But there is a hack I built for making URLs absolute, so that you can download any of the PDFs manually by just clicking on them.

https://drive.google.com/file/d/0B6XyXxdVDjXIQTYwSVpmaF9ETldTcnNQeXVKZ0VKNUFBQVhN/view?usp=sharing
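
The shared script isn't reproduced in this thread; the following is only a rough sketch of the same "make links absolute" idea, assuming lxml is installed and using illustrative file names.

# Sketch of making every href/src absolute against the page's original URL,
# so that clicking a PDF link in the saved copy fetches it from the live server.
# File names and the use of lxml are assumptions, not the actual shared script.
from lxml import html

PAGE_URL = "https://www.akkufit-berger.de/kataloge/"   # original page address

tree = html.parse("saved_page.html")                   # hypothetical saved copy

# Rewrite every relative link against the live URL.
tree.getroot().make_links_absolute(PAGE_URL)

with open("saved_page_absolute.html", "wb") as fh:
    fh.write(html.tostring(tree, encoding="utf-8"))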

rajatomar788 avatar Jan 23 '20 15:01 rajatomar788

Checking on the status of auto-buffers, so that websites would not flag the network traffic as such (or maybe have distributed crawlers to help out)?

BradKML avatar Apr 02 '23 16:04 BradKML

OK, so in the new pywebcopy 7 you can just create a new GenericResource, which could download the PDFs after checking the content type of the response. You would have to read the elements.py file to do it manually.
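
GenericResource lives in elements.py and its exact hooks aren't shown in this thread; as a rough, library-agnostic sketch of the same idea (fetch the URL, check the response's Content-Type header, and only save the bytes when it is a PDF), assuming the requests package:

import requests

def save_if_pdf(url, out_path):
    """Download url and write it to out_path only if the server says it is a PDF."""
    resp = requests.get(url, stream=True, timeout=30)
    resp.raise_for_status()
    if "application/pdf" not in resp.headers.get("Content-Type", ""):
        return False                      # not a PDF, skip it
    with open(out_path, "wb") as fh:
        for chunk in resp.iter_content(chunk_size=8192):
            fh.write(chunk)
    return True

# Example: grab the catalogue PDF mentioned earlier in this thread.
save_if_pdf(
    "https://www.akkufit-berger.de/wp-content/uploads/2018/10/EndkundenKatalog-Back-Up-Akkus.pdf",
    "EndkundenKatalog-Back-Up-Akkus.pdf",
)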

rajatomar788 avatar Apr 03 '23 02:04 rajatomar788