Improve PDF parsing

Open cbldev opened this issue 1 year ago • 0 comments

Is your feature request related to a problem? Please describe. I noticed that on some complex PDFs, especially ones with tables, the system `pdftotext` tool produces better results than the pdf-reader gem. I started this PR in an open-source project that uses langchainrb: https://github.com/nosia-ai/nosia/pull/20/files

Describe the solution you'd like Perhaps an alternative to pdf-reader could be offered here: https://github.com/patterns-ai-core/langchainrb/blob/2054ef0e9c925215e2be696dbedc3876d21530af/lib/langchain/processors/pdf.rb#L17 Or pdf-reader itself could be improved: https://github.com/yob/pdf-reader

Describe alternatives you've considered pdftotext: https://www.xpdfreader.com/pdftotext-man.html
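
To make the pdftotext alternative concrete, here is a minimal sketch of shelling out to `pdftotext` and falling back to pdf-reader when the binary is missing or fails. The `PdfTextExtractor` module and its method names are hypothetical and not part of langchainrb. The sketch assumes `pdftotext` is on the `PATH` and that the `-layout` flag (which tries to preserve the physical layout, often helpful for tables) suits the documents in question.

```ruby
require "open3"

# Hypothetical helper for illustration only; not part of langchainrb.
module PdfTextExtractor
  module_function

  # Builds the argv for pdftotext. Passing "-" as the output file
  # makes pdftotext write the extracted text to stdout.
  def command(path, layout: true)
    args = ["pdftotext"]
    args << "-layout" if layout
    args + [path, "-"]
  end

  # True when an executable named pdftotext is found on the PATH.
  def pdftotext_available?
    ENV.fetch("PATH", "").split(File::PATH_SEPARATOR).any? do |dir|
      candidate = File.join(dir, "pdftotext")
      File.file?(candidate) && File.executable?(candidate)
    end
  end

  # Returns the extracted text, or nil when pdftotext is missing or
  # exits with a non-zero status.
  def extract(path)
    return nil unless pdftotext_available?

    stdout, _stderr, status = Open3.capture3(*command(path))
    status.success? ? stdout : nil
  end

  # Tries pdftotext first, then falls back to the pdf-reader gem.
  def extract_with_fallback(path)
    extract(path) || begin
      require "pdf-reader" # loaded lazily so the gem is only needed for the fallback
      PDF::Reader.new(path).pages.map(&:text).join("\n\n")
    end
  end
end
```

A caller could then use `PdfTextExtractor.extract_with_fallback("report.pdf")`, where `report.pdf` is a placeholder path. Passing the arguments as an array to `Open3.capture3` avoids going through a shell, so file names with spaces or special characters need no quoting.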

cbldev • Jun 29 '24 12:06