
PDF Document Loader

Open rikuthinks opened this issue 2 years ago • 3 comments

For PDFs:

https://github.com/kartik1998/pdf-images https://github.com/naptha/tesseract.js#tesseractjs

Spent many hours experimenting with the best way to extract text from PDFs. I tried a couple of different libraries, and they all had problems preserving whitespace. This became a real problem when I went to query embeddings of the text: the incorrect formatting carried over into the answers, which won't do.

The best solution in practice turned out to be converting the PDFs to images and then using OCR to extract text from the images. I have this implemented in Python for now but will be rewriting it in TS for the production app, so I can contribute that code in the future if someone else doesn't pick it up first.
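For anyone who wants to try the idea, here is a minimal sketch of the PDF-to-images-to-OCR flow in Python. It is not the implementation mentioned above, just an illustration under some assumptions: it uses `pdf2image` (which needs poppler installed) to render pages and `pytesseract` (which needs the Tesseract binary) for OCR, neither of which is named in this issue. The function name `extract_pdf_text` and its `render_pages` / `ocr_page` parameters are hypothetical; they exist so the two steps can be swapped out, for example for the libraries linked above.

```python
from typing import Any, Callable, Iterable, Optional


def extract_pdf_text(
    pdf_path: str,
    render_pages: Optional[Callable[[str], Iterable[Any]]] = None,
    ocr_page: Optional[Callable[[Any], str]] = None,
    page_separator: str = "\n\n",
) -> str:
    """Render each PDF page to an image, OCR it, and join the page texts.

    Hypothetical helper: the defaults assume pdf2image and pytesseract
    are installed, but both steps can be replaced by the caller.
    """
    if render_pages is None:
        # Assumption: pdf2image + poppler are available.
        from pdf2image import convert_from_path

        def render_pages(path: str) -> Iterable[Any]:
            # A higher DPI usually helps OCR accuracy at the cost of speed.
            return convert_from_path(path, dpi=300)

    if ocr_page is None:
        # Assumption: pytesseract + the Tesseract binary are available.
        import pytesseract

        ocr_page = pytesseract.image_to_string

    # OCR every page, trim surrounding whitespace, and skip empty pages.
    texts = (ocr_page(image).strip() for image in render_pages(pdf_path))
    return page_separator.join(text for text in texts if text)
```

Calling `extract_pdf_text("paper.pdf")` would return one string with pages separated by blank lines, which can then be chunked and embedded. Keeping the render and OCR steps injectable also makes it easy to compare OCR engines on the same document.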

rikuthinks Feb 22 '23 15:02

Adding this to the roadmap! Would love other people to chime in about their use cases here too!

cfortuner Feb 22 '23 15:02

This could be applied to use cases such as semantic search over research papers or books, which are often distributed as PDFs.

rikuthinks Feb 22 '23 17:02

I have the same use case as @rikuthinks. Another use case is extracting specific information, for example pulling names, addresses, etc. from something like an invoice.

yassinebridi Feb 22 '23 17:02