ask-multiple-pdfs

Replace PyPDF2 with pypdfium2

Open yiwei-ang opened this issue 1 year ago • 2 comments

I really appreciate @alejandro-ao for creating a good video demonstrating the perfect blend of OpenAI, PDF readers, and Streamlit!

I've tried the tool on several PDFs and found an issue with text extraction quality when using PyPDF2: the contents of a PDF are not extracted fully.

After looking into https://github.com/py-pdf/benchmarks, it seems we can switch to pypdfium2, which offers similar functionality while providing better text extraction quality and faster extraction (verified on my end!).
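
For illustration, the swap could look roughly like the sketch below. It assumes pypdfium2 4.x and a helper that receives a list of file paths or binary file objects (such as Streamlit uploads); the name `get_pdf_text` is hypothetical and may not match the project's actual code.

```python
from typing import Iterable


def get_pdf_text(pdf_docs: Iterable) -> str:
    """Concatenate the text of every page of every PDF in pdf_docs.

    Each item may be a file path or a binary file-like object.
    The function name is an assumption, not the project's confirmed API.
    """
    # Imported here so the module still loads where pypdfium2 is missing.
    import pypdfium2 as pdfium

    chunks = []
    for source in pdf_docs:
        pdf = pdfium.PdfDocument(source)
        try:
            for index in range(len(pdf)):
                page = pdf[index]
                textpage = page.get_textpage()
                # With no arguments, get_text_range() returns the whole page.
                chunks.append(textpage.get_text_range())
                # Release child objects before closing the document.
                textpage.close()
                page.close()
        finally:
            pdf.close()
    return "\n".join(chunks)
```

The rest of the pipeline (chunking, embeddings, chat) should not need to change, since the helper still returns one plain string.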

yiwei-ang avatar Aug 23 '23 05:08 yiwei-ang

As a side note, LangChain also supports pypdfium2 as a document loader: https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf#using-pypdfium2
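
A minimal sketch of that loader, assuming a recent LangChain where loaders live in `langchain_community.document_loaders` (older releases used `langchain.document_loaders`); the wrapper name `load_pdf_pages` is hypothetical:

```python
def load_pdf_pages(path: str) -> list:
    """Return LangChain Document objects for the PDF at path, via pypdfium2."""
    # Imported lazily; the import path is an assumption about the installed
    # LangChain version and may differ in older releases.
    from langchain_community.document_loaders import PyPDFium2Loader

    loader = PyPDFium2Loader(path)
    return loader.load()
```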

IlianP avatar Sep 08 '23 14:09 IlianP

I have added this important feature to my larger pull request (my first one ever). I gave you credit there, but I'm not sure this is the right way to do it.

costabm avatar Nov 02 '23 14:11 costabm