Question: Does `process_pdf` only read PDFs from S3?

Open derekhsu opened this issue 9 months ago • 2 comments

It seems that the `process_pdf` method in pipeline.py only processes PDFs from S3 storage, but README.md says I can specify one or more local PDFs using the --pdfs parameter. So where is the code that processes local PDFs?

Mar 06 '25 06:03 derekhsu

I'm wondering the same thing. Why can't this leverage local PDFs without having to host them in S3 storage?

Mar 20 '25 23:03 sbarham

Yes, I ran into the same problem. After reading the source code, I found that it only accepts PDFs hosted in cloud storage, which is pretty frustrating.

Mar 26 '25 10:03 YYH211

Newer versions have supported local files for a while now.

Jul 24 '25 16:07 jakep-allenai
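
For readers wondering what "supporting both" can look like in general, here is a minimal sketch of a loader that dispatches on the path prefix. It is not taken from pipeline.py and does not claim to match how the project implements it. The names `read_pdf_bytes`, `split_s3_uri`, and `fetch_s3` are hypothetical and exist only for this illustration. The only assumption about S3 is the standard boto3 `get_object` call.

```python
from pathlib import Path
from typing import Callable, Optional, Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key/parts.pdf' into ('bucket', 'key/parts.pdf')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def read_pdf_bytes(
    path: str,
    fetch_s3: Optional[Callable[[str, str], bytes]] = None,
) -> bytes:
    """Return the raw bytes of a PDF given a local path or an s3:// URI.

    `fetch_s3` is a hypothetical hook (bucket, key) -> bytes, useful for
    tests; when it is omitted, boto3 is used for s3:// paths.
    """
    if path.startswith("s3://"):
        bucket, key = split_s3_uri(path)
        if fetch_s3 is not None:
            return fetch_s3(bucket, key)
        # Imported lazily so that local-only use does not require boto3.
        import boto3

        response = boto3.client("s3").get_object(Bucket=bucket, Key=key)
        return response["Body"].read()
    # Anything without the s3:// prefix is treated as a local file path.
    return Path(path).read_bytes()
```

With a helper like this, a caller can pass either `read_pdf_bytes("docs/sample.pdf")` or `read_pdf_bytes("s3://some-bucket/sample.pdf")`, and the rest of the processing code does not need to know where the bytes came from.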