ask-multiple-pdfs
Replace PyPDF2 with pypdfium2
I really appreciate @alejandro-ao for creating a great video demonstrating the perfect blend of OpenAI, PDF readers, and Streamlit!
I've tried the tool on several PDFs and found a text extraction quality issue with PyPDF2: the contents of a PDF are not always extracted completely.
After looking into https://github.com/py-pdf/benchmarks, it seems we could switch to pypdfium2, which offers similar functionality while providing better text extraction quality and faster extraction (verified on my end!).
As a side note, LangChain also supports pypdfium2 as a document loader: https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf#using-pypdfium2
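For reference, here is a rough sketch of what the replacement could look like. It is not the code from this repo: `extract_text_pypdfium2` and `join_page_texts` are hypothetical names I made up for illustration, and I'm assuming pypdfium2 version 4 or later, where a text page exposes `get_text_range()`. I'm also assuming each input is a file path, raw bytes, or a binary file object.

```python
from typing import BinaryIO, Iterable, Union

# Assumption: each PDF arrives as a path, raw bytes, or a binary file object.
PdfInput = Union[str, bytes, BinaryIO]


def join_page_texts(page_texts: Iterable[str]) -> str:
    """Join per-page text with newlines, skipping pages that yielded nothing."""
    return "\n".join(t.strip() for t in page_texts if t and t.strip())


def extract_text_pypdfium2(pdf_docs: Iterable[PdfInput]) -> str:
    """Hypothetical helper: concatenate the text of every page of every PDF."""
    # Imported lazily so the module still loads when pypdfium2 is missing.
    # Requires: pip install pypdfium2 (assumed version >= 4).
    import pypdfium2 as pdfium

    texts = []
    for source in pdf_docs:
        pdf = pdfium.PdfDocument(source)
        try:
            for index in range(len(pdf)):
                page = pdf[index]
                textpage = page.get_textpage()
                # get_text_range() with no arguments returns the whole page text.
                texts.append(textpage.get_text_range())
                # Release the native PDFium handles, child before parent.
                textpage.close()
                page.close()
        finally:
            pdf.close()
    return join_page_texts(texts)
```

Calling `extract_text_pypdfium2(["report.pdf"])` (the file name is a placeholder) should return one string that can be passed to the text splitter in place of the PyPDF2 output. Empty pages are skipped so they don't add blank chunks.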
I have added this important feature to my larger pull request (my first one ever). I gave you credit there, but I'm not sure this is the right way to do it.