Enable PDF Scraping and Return Both PDF and MD Versions

Open jmontoyavallejo opened this issue 1 year ago • 1 comments
It would be great if crawl4ai could scrape PDF files from websites and return both the original PDF and a Markdown (MD) version of its content. An example of the kind of PDF I mean: https://arxiv.org/pdf/2402.06196

Detect and download PDF files. Convert PDF content into MD format. Return both the PDF and MD files.
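
To make the three requested steps concrete, here is a minimal sketch in plain Python. It is not crawl4ai's API: the names `looks_like_pdf`, `pages_to_markdown`, `PdfResult`, and `scrape_pdf` are hypothetical, and the actual PDF text extraction is left to a caller-supplied `extract_pages` function (an assumption; any PDF library could fill that role).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional
from urllib.request import Request, urlopen

PDF_MAGIC = b"%PDF-"  # every PDF file starts with this signature


def looks_like_pdf(content_type: str, body: bytes) -> bool:
    """Detect a PDF via the Content-Type header or the leading magic bytes."""
    mime = content_type.split(";")[0].strip().lower()
    return mime == "application/pdf" or body.startswith(PDF_MAGIC)


def pages_to_markdown(title: str, pages: List[str]) -> str:
    """Render already-extracted page texts as a simple Markdown document."""
    parts = [f"# {title}"]
    for number, text in enumerate(pages, start=1):
        parts.append(f"## Page {number}\n\n{text.strip()}")
    return "\n\n".join(parts) + "\n"


@dataclass
class PdfResult:
    """Hypothetical result type holding both requested outputs."""
    pdf_bytes: bytes
    markdown: str


def scrape_pdf(
    url: str,
    extract_pages: Callable[[bytes], List[str]],
) -> Optional[PdfResult]:
    """Download `url`; if it is a PDF, return the raw bytes plus Markdown.

    `extract_pages` is supplied by the caller and turns PDF bytes into a
    list of per-page text strings (assumption: a PDF library does this).
    """
    request = Request(url, headers={"User-Agent": "pdf-sketch/0.1"})
    with urlopen(request) as response:
        content_type = response.headers.get("Content-Type", "")
        body = response.read()
    if not looks_like_pdf(content_type, body):
        return None  # not a PDF, nothing to convert
    markdown = pages_to_markdown(url, extract_pages(body))
    return PdfResult(pdf_bytes=body, markdown=markdown)
```

Checking the magic bytes as well as the header matters because some servers serve PDFs with a generic or wrong Content-Type. Returning both `pdf_bytes` and `markdown` in one object mirrors the request to get the PDF and the MD version together.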

jmontoyavallejo avatar Oct 15 '24 21:10 jmontoyavallejo

It should also work with the LLM extraction strategy.

jmontoyavallejo avatar Oct 15 '24 21:10 jmontoyavallejo

@jmontoyavallejo Thanks for the suggestion. Crawling PDFs and media files (video, audio) is in the backlog, hopefully coming soon.

unclecode avatar Oct 16 '24 06:10 unclecode