langchainrb
Improve PDF parsing
Is your feature request related to a problem? Please describe.
I noticed that on some complex PDFs, especially ones with tables, the system pdftotext tool produces better results than the pdf-reader gem.
I started this PR on an open-source project using langchainrb: https://github.com/nosia-ai/nosia/pull/20/files
Describe the solution you'd like
Maybe offer an alternative to pdf-reader as an option here:
https://github.com/patterns-ai-core/langchainrb/blob/2054ef0e9c925215e2be696dbedc3876d21530af/lib/langchain/processors/pdf.rb#L17
Or an improvement to pdf-reader itself: https://github.com/yob/pdf-reader
Describe alternatives you've considered
pdftotext: https://www.xpdfreader.com/pdftotext-man.html
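To make the idea concrete, here is a rough sketch of what a pdftotext-based extraction with a fallback could look like. The method name `extract_pdf_text` and its `binary:` and `fallback:` parameters are hypothetical and are not part of langchainrb. It assumes a `pdftotext` executable is on the PATH and uses its `-layout` flag and `-` (stdout) output target.

```ruby
require "open3"

# Hypothetical helper, not part of langchainrb: try the system pdftotext
# binary first, and use a caller-supplied fallback if it is missing or fails.
def extract_pdf_text(path, binary: "pdftotext", fallback: nil)
  # "-layout" keeps the physical layout (helps with tables);
  # "-" as the output file sends the text to stdout.
  stdout, stderr, status = Open3.capture3(binary, "-layout", path.to_s, "-")
  return stdout if status.success?

  raise "#{binary} failed: #{stderr.strip}" unless fallback
  fallback.call(path)
rescue Errno::ENOENT
  # Raised when the binary is not installed or not on the PATH.
  raise unless fallback
  fallback.call(path)
end
```

The fallback could be the current pdf-reader approach, for example `->(p) { PDF::Reader.new(p).pages.map(&:text).join("\n\n") }`, so existing behaviour is kept wherever pdftotext is not installed.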