
Need support for PDFs with images

Open sridharaiyer opened this issue 9 months ago • 5 comments

Uploading Fake-Police-Report-HighRes.pdf… I want to read from a PDF that contains text inside an image. See attached. I am able to do this with Langchain after running pip install rapidocr-onnxruntime. I would like the PDFReader class (from phi.document.reader.pdf import PDFReader) to support this as well.

Any suggestions for how I can do RAG with PDFs that contain images while this request is being implemented?

sridharaiyer avatar May 13 '24 18:05 sridharaiyer

@sridharaiyer agreed, this is critical. @ysolanky @jacobweiss2305 any takers?

ashpreetbedi avatar May 13 '24 21:05 ashpreetbedi

@sridharaiyer mind sharing a good PDF to test this with?

ashpreetbedi avatar May 13 '24 21:05 ashpreetbedi

@sridharaiyer The PDF Image reader is ready. It would be great if you could share an ideal PDF for your use case so that we can test further before releasing :)

ysolanky avatar May 14 '24 00:05 ysolanky

@sridharaiyer Good day, team phidata! Thank you for your prompt response and willingness to address this critical feature request. I appreciate your dedication to continuously improving the library.

To aid in your testing, I would like to share the following PDF files that contain a mix of scanned images (text):

  1. https://www.deped.gov.ph/wp-content/uploads/DO_s2024_005.pdf
  2. https://www.deped.gov.ph/wp-content/uploads/DO_s2024_002.pdf

These PDFs are representative of the types of documents I frequently work with, and having the ability to extract text from images within PDFs would greatly enhance my workflow.

I have been using phidata for my personal projects and have found it to be an invaluable tool. Additionally, I have been following the progress of the library and the insightful video demos led by Sir @ashpreetbedi on Twitter.

Thank you once again for your efforts in making phidata an even more powerful and versatile library. I look forward to testing the PDF Image reader feature once it is released.

llegomark avatar May 14 '24 01:05 llegomark

@sridharaiyer The PDFImageReader is now live in v2.4.8

You can import it with: from phi.document.reader.pdf import PDFImageReader
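
Until the docs land, here is a rough sketch of how it might be used. It assumes PDFImageReader exposes a read() method like PDFReader does, that the returned documents carry their text in a content attribute, and that rapidocr-onnxruntime is installed for the OCR step. The helper name extract_text and the optional reader parameter are hypothetical and only there for illustration.

```python
from pathlib import Path
from typing import Any, List, Optional


def extract_text(pdf_path: str, reader: Optional[Any] = None) -> List[str]:
    """Return the text of every document the reader produces for a PDF.

    `extract_text` and its `reader` parameter are illustrative names,
    not part of phidata.
    """
    if reader is None:
        # Imported lazily so the helper can also be tried with a stub reader.
        # Assumes phidata >= 2.4.8 and rapidocr-onnxruntime are installed.
        from phi.document.reader.pdf import PDFImageReader

        reader = PDFImageReader()

    # Assumption: read() accepts a path and returns objects with `.content`.
    documents = reader.read(Path(pdf_path))
    return [doc.content for doc in documents]
```

Calling extract_text("Fake-Police-Report-HighRes.pdf") should then return the extracted text, including anything recognised inside embedded images, if the assumptions above hold.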

@ysolanky please add some docs to help :)

@llegomark thank you for sharing the PDFs and your help with the product.

We appreciate your help with this so much.

ashpreetbedi avatar May 14 '24 07:05 ashpreetbedi

Thanks a lot team!! The PDF with an image containing text is indeed being read properly. So I am closing this ticket.

However, I have another question regarding the knowledge base strategy. I will open another thread. Thanks, once again!!

sridharaiyer avatar May 16 '24 18:05 sridharaiyer