feat: Add converter based on pdfminer

Open sjrl opened this issue 1 year ago • 4 comments

Is your feature request related to a problem? Please describe.

We have found the https://github.com/pdfminer/pdfminer.six package to perform well for text extraction from PDFs, especially in cases that involve two-column layouts.

Additionally, pdfminer offers quite a few options that can be passed to its command-line tool (docs here), giving users a lot of flexibility to tune the text extraction for their needs.

For example, in the past we ran into a two-column PDF from a client that required the following command

pdf2txt.py FILE.pdf --line-margin 0.6 > out.txt

to properly extract the text in the correct reading order.
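For reference, roughly the same extraction can be done from Python instead of the command line. The sketch below assumes pdfminer.six is installed and uses its `extract_text` helper together with `LAParams`; the wrapper name `extract_two_column_text` is hypothetical and only there for illustration.

```python
def extract_two_column_text(pdf_path: str, line_margin: float = 0.6) -> str:
    """Extract text from a PDF using a custom pdfminer line margin.

    Hypothetical helper: it mirrors the pdf2txt.py call above, with 0.6 as
    the value that worked for the client PDF mentioned in this issue.
    """
    # Imported lazily so this module can be loaded even where pdfminer.six
    # is not installed; the import only happens when the function is called.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # LAParams holds the layout-analysis options; only line_margin is
    # changed here, everything else keeps pdfminer's defaults.
    laparams = LAParams(line_margin=line_margin)
    return extract_text(pdf_path, laparams=laparams)
```

Calling `extract_two_column_text("FILE.pdf")` and writing the result to `out.txt` should correspond to the command above. The other layout options are also fields on `LAParams`, which is what a converter could expose to users.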

Describe the solution you'd like

Implementation of a PDFMinerToDocument converter.

Describe alternatives you've considered

Continue to use the PyPDFToDocument converter, but it comes with fewer user-specifiable options out of the box. According to the pypdf docs, customizing the text extraction would require users to:

  1. Create their own pypdf converter in Haystack. For example, our default one is https://github.com/deepset-ai/haystack/blob/c0b67432e4da0af41a25cdbb74115c6cad1c6abf/haystack/components/converters/pypdf.py#L27-L35
  2. Create their own custom visitor functions, as shown in this example from their docs: https://pypdf2.readthedocs.io/en/latest/user/extract-text.html#example-1-ignore-header-and-footer
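To give a sense of what step 2 involves, here is a minimal sketch of a header/footer-skipping visitor in the spirit of the linked pypdf example. The helper names `make_body_visitor` and `extract_body_text` and the y-range defaults are assumptions for illustration. Reading the vertical position from `tm[5]` is also an assumption; which matrix carries the page position can vary between PDFs and library versions.

```python
from typing import Any, Callable, List, Sequence, Tuple


def make_body_visitor(
    y_min: float = 50.0, y_max: float = 720.0
) -> Tuple[Callable[..., None], List[str]]:
    """Build a visitor that keeps only text whose y position is in (y_min, y_max)."""
    parts: List[str] = []

    def visitor_body(
        text: str,
        cm: Sequence[float],
        tm: Sequence[float],
        font_dict: Any,
        font_size: float,
    ) -> None:
        # Assumption: the sixth entry of the text matrix is the vertical
        # position of the text on the page.
        y = tm[5]
        if y_min < y < y_max:
            parts.append(text)

    return visitor_body, parts


def extract_body_text(pdf_path: str, page_index: int = 0) -> str:
    """Extract one page's text while skipping the header and footer bands."""
    # Lazy import so the visitor logic can be used without pypdf installed.
    from pypdf import PdfReader

    visitor, parts = make_body_visitor()
    reader = PdfReader(pdf_path)
    # pypdf calls the visitor once per text fragment it encounters.
    reader.pages[page_index].extract_text(visitor_text=visitor)
    return "".join(parts)
```

Even this small case needs per-document tuning of the y thresholds in code, which is the kind of effort a converter with configurable options would avoid.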

I think adding support for pdfminer would be a great way to provide more user flexibility without requiring users to write custom code.

sjrl avatar Jan 18 '24 08:01 sjrl

The original discussion, summarized above, is in the private repo: https://github.com/deepset-ai/haystack-private/issues/5

sjrl avatar Jan 18 '24 08:01 sjrl

@sjrl Is a PyMuPDF-based converter also in the works?

devinsaini avatar Feb 23 '24 18:02 devinsaini

@devinsaini I don’t believe there is work planned for that currently. I would recommend opening a new issue for that as a feature request!

sjrl avatar Feb 26 '24 07:02 sjrl

I'll take this on.

medsriha avatar Apr 24 '24 20:04 medsriha

Hey @dfokina, is it safe to close this issue thanks to @medsriha's PR? Or do we need to add the respective docs before closing it?

sjrl avatar May 13 '24 08:05 sjrl

Hey @sjrl, thank you for tagging me here! I'll assign it to myself to work on the docs, and then close it 👍

dfokina avatar May 13 '24 12:05 dfokina