Crawler downloads PDFs to project root directory

Open augchan42 opened this issue 1 year ago • 1 comments

Describe the bug

crawler = Crawler(output_dir="crawled_files") works OK. The defaults are a bit odd (extract_hidden_text=True, really?), but the crawler also ends up following links to PDFs and downloading them, and they aren't placed in output_dir. I believe the underlying Selenium driver is just doing its thing, and the PDF-link corner case isn't handled.

Error message

No error, but if you crawl sites with PDF brochures for a while, you will invariably see a bunch of PDFs in your root folder.

Expected behavior

Ideally, a way to handle the PDFs, convert them to documents, and specify how you want them converted.

Additional context

The default extract_hidden_text=True doesn't make sense: it will extract JavaScript code, which is usually not what you want in your documents.

To Reproduce

Just crawl the following URLs with the defaults; you will see a bunch of PDFs appear in your project root folder.

travel_insurance_urls = [
    "https://www.hsbc.com.hk/insurance/products/travel",
    "https://www.aig.com.hk/personal/travel-insurance",
    "https://www.zurich.com.hk/en/products/travel",
    "https://www.bluecross.com.hk/en/Travel-Smart/Information",
    "https://www.moneysmart.hk/en/travel-insurance",
    "https://www.moneyhero.com.hk/en/travel-insurance?psCollapse=true",
]

FAQ Check

System:

  • OS:
  • GPU/CPU:
  • Haystack version (commit or version number): 1.24
  • Crawler - from haystack.nodes.connector import Crawler

augchan42, Jan 26 '24 09:01

Minimal reproducible example

# pip install farm-haystack[crawler]

from haystack.nodes import Crawler

crawler = Crawler(output_dir="crawled_files")
docs = crawler.crawl(urls=["https://www.hsbc.com.hk/insurance/products/travel"])

Some PDFs are created in the working directory.
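Until the downloads land in the right place, one possible stopgap is to sweep the working directory after a crawl. The sketch below is only an illustration: move_stray_pdfs is a hypothetical helper, not part of Haystack, and it assumes the stray files sit directly in the working directory and end in .pdf.

```python
import shutil
from pathlib import Path


def move_stray_pdfs(working_dir, output_dir):
    """Move *.pdf files found directly in working_dir into output_dir.

    Hypothetical helper for illustration only; not part of Haystack.
    Returns the list of destination paths that were written.
    """
    src = Path(working_dir)
    dst = Path(output_dir)
    dst.mkdir(parents=True, exist_ok=True)

    moved = []
    # sorted() materializes the listing, so moving files while looping is safe.
    for entry in sorted(src.iterdir()):
        if not entry.is_file() or entry.suffix.lower() != ".pdf":
            continue
        target = dst / entry.name
        if target.exists():
            # Do not overwrite a file that is already in output_dir.
            continue
        shutil.move(str(entry), str(target))
        moved.append(target)
    return moved
```

Calling move_stray_pdfs(".", "crawled_files") right after crawler.crawl(...) would tidy up the project root, at the cost of also moving any unrelated PDFs that happen to live there.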

Solutions

Making sure that PDF files are created in the output_dir: this involves investigating how Selenium handles downloaded files and shouldn't take much effort. @augchan42, if you want to do this, feel free to open a PR.
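For anyone picking this up, here is a rough sketch of how a Chrome-based Selenium driver can be pointed at a download folder. It relies on the Chrome "prefs" experimental option and the download.default_directory preference; build_download_prefs and make_driver are hypothetical names, and this is an assumption about one workable approach, not a description of how the Crawler is or will be implemented.

```python
from pathlib import Path


def build_download_prefs(output_dir):
    """Build Chrome preferences that send downloads to output_dir.

    Hypothetical helper for illustration; Chrome expects an absolute path.
    """
    download_dir = Path(output_dir).resolve()
    download_dir.mkdir(parents=True, exist_ok=True)
    return {
        "download.default_directory": str(download_dir),
        # Save files without showing a "Save as" dialog.
        "download.prompt_for_download": False,
    }


def make_driver(output_dir):
    """Create a headless Chrome driver whose downloads go to output_dir."""
    # Imported lazily so build_download_prefs works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")
    options.add_experimental_option("prefs", build_download_prefs(output_dir))
    return webdriver.Chrome(options=options)
```

With something like make_driver("crawled_files"), any PDF that Chrome decides to download while following a link should end up next to the crawled JSON files rather than in the project root.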

Enhancements

Handle the pdfs, convert them to documents, and specify how you want them converted.

I would not prioritize this: it's a big change.

anakin87, Feb 05 '24 11:02

fixed in #7335

anakin87, Mar 11 '24 17:03