Crawler downloads PDFs to project root directory
Describe the bug
crawler = Crawler(output_dir="crawled_files") works OK. The defaults are a bit screwy (extract_hidden_text=True, really?), but the crawler also ends up following links to PDFs and downloading them. The PDFs aren't placed in the output_dir. I believe the underlying Selenium driver is just doing its thing, and the PDF-link corner case isn't handled.
Error message
No error, but if you crawl for a while, you will invariably see a bunch of PDFs in your root folder when crawling sites that have PDF brochures.
Expected behavior
Ideally, a way to handle the PDFs: convert them to documents and specify how you want them converted.
Additional context
The default extract_hidden_text=True doesn't make sense: it extracts JavaScript code, which is usually not what you want in your documents.
To Reproduce
Steps to reproduce the behavior
Just crawl the following URLs with the defaults; you will see a bunch of PDFs appear in your project root folder.
travel_insurance_urls = [
    "https://www.hsbc.com.hk/insurance/products/travel",
    "https://www.aig.com.hk/personal/travel-insurance",
    "https://www.zurich.com.hk/en/products/travel",
    "https://www.bluecross.com.hk/en/Travel-Smart/Information",
    "https://www.moneysmart.hk/en/travel-insurance",
    "https://www.moneyhero.com.hk/en/travel-insurance?psCollapse=true",
]
FAQ Check
- [x] Have you had a look at our new FAQ page?
System:
- OS:
- GPU/CPU:
- Haystack version (commit or version number): 1.24
- Crawler - from haystack.nodes.connector import Crawler
Minimal reproducible example
# pip install farm-haystack[crawler]
from haystack.nodes import Crawler
crawler = Crawler(output_dir="crawled_files")
docs = crawler.crawl(urls=["https://www.hsbc.com.hk/insurance/products/travel"])
Some PDFs are created in the working directory.
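Until the download location is handled properly, a stopgap is to record which PDFs exist in the working directory before the crawl and move any new ones afterwards. This is only a rough sketch: the helper names snapshot_pdfs and move_new_pdfs are hypothetical, not part of Haystack, and the approach assumes nothing else writes PDFs to the working directory while the crawl runs.

```python
import shutil
from pathlib import Path
from typing import List, Set


def snapshot_pdfs(directory: Path) -> Set[Path]:
    """Return the PDF files currently sitting directly in `directory`."""
    return {
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    }


def move_new_pdfs(before: Set[Path], working_dir: Path, output_dir: Path) -> List[Path]:
    """Move PDFs that appeared in `working_dir` since `before` into `output_dir`."""
    output_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for pdf in sorted(snapshot_pdfs(working_dir) - before):
        target = output_dir / pdf.name
        shutil.move(str(pdf), str(target))
        moved.append(target)
    return moved


# Hypothetical usage around the crawl call from the example above:
#   before = snapshot_pdfs(Path.cwd())
#   docs = crawler.crawl(urls=["https://www.hsbc.com.hk/insurance/products/travel"])
#   move_new_pdfs(before, Path.cwd(), Path("crawled_files"))
```

PDFs that were already present before the crawl are left where they are, so only files produced during the crawl get relocated.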
Solutions
Make sure that PDF files are created in the output_dir. This involves investigating how Selenium handles the downloaded files; it shouldn't take much effort.
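As a starting point for that investigation, one common way to steer downloads with plain Selenium and Chrome is to set Chrome's download preferences on the driver options. The sketch below is not how the Crawler is wired internally and not necessarily what the eventual fix does; chrome_download_prefs and make_driver are hypothetical helper names, and headless Chrome versions may differ in how they honor these preferences.

```python
from pathlib import Path
from typing import Any, Dict


def chrome_download_prefs(output_dir: str) -> Dict[str, Any]:
    """Build Chrome preferences that send downloads to `output_dir`."""
    # Chrome expects an absolute path for the download directory.
    download_dir = Path(output_dir).resolve()
    download_dir.mkdir(parents=True, exist_ok=True)
    return {
        "download.default_directory": str(download_dir),
        # Save without showing a "Save as" dialog.
        "download.prompt_for_download": False,
        # Download PDFs instead of opening them in the built-in viewer.
        "plugins.always_open_pdf_externally": True,
    }


def make_driver(output_dir: str):
    """Create a Chrome driver whose downloads land in `output_dir`.

    Assumes Selenium and a matching Chrome/chromedriver are installed.
    """
    from selenium import webdriver  # imported lazily; only needed here

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_experimental_option("prefs", chrome_download_prefs(output_dir))
    return webdriver.Chrome(options=options)
```

Calling make_driver("crawled_files") would, under these assumptions, yield a driver that saves followed PDF links next to the crawled files rather than in the working directory.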
@augchan42 if you want to do this, feel free to open a PR.
Enhancements
Handle the PDFs: convert them to documents, and let the user specify how they are converted.
I would not prioritize this: it's a big change.
Fixed in #7335.