haystack feat: Add converter based on pdfminer

Is your feature request related to a problem? Please describe.

We have found that the https://github.com/pdfminer/pdfminer.six package performs well for text extraction from PDFs, especially in cases that involve two-column layouts.

Additionally, pdfminer offers quite a few options that can be passed to its command-line tool (docs here), giving users a lot of flexibility to tune the text extraction for their needs.

For example, in the past we ran into a two-column PDF from a client that required the following command

pdf2txt.py FILE.pdf --line-margin 0.6 > out.txt

to properly extract the text in the correct reading order.
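For reference, the same tuning can be done from Python rather than the command line. The following is a minimal sketch, assuming pdfminer.six is installed; the function name `extract_two_column_text` is hypothetical, and 0.6 is simply the value from the command above.

```python
def extract_two_column_text(pdf_path: str, line_margin: float = 0.6) -> str:
    """Extract text from a PDF with a custom line margin (hypothetical helper)."""
    # Imported lazily so this module can be loaded without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # LAParams holds the layout-analysis settings, including the line margin.
    laparams = LAParams(line_margin=line_margin)
    return extract_text(pdf_path, laparams=laparams)
```

Calling `extract_two_column_text("FILE.pdf")` and writing the result to `out.txt` should correspond to the command above.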

Describe the solution you'd like

Implementation of a PDFMinerToDocument converter.
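To make the request more concrete, here is a rough sketch of what such a converter could expose. It is not the Haystack implementation: the class name `PDFMinerToDocumentSketch`, the plain-dict output, and the `run` signature are all assumptions for illustration. The field names and defaults mirror pdfminer's `LAParams`.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class PDFMinerToDocumentSketch:
    """Hypothetical converter that forwards layout options to pdfminer."""

    # These fields mirror the keyword arguments of pdfminer's LAParams.
    line_overlap: float = 0.5
    char_margin: float = 2.0
    line_margin: float = 0.5
    word_margin: float = 0.1
    boxes_flow: Optional[float] = 0.5
    detect_vertical: bool = False
    all_texts: bool = False

    def layout_kwargs(self) -> Dict[str, Any]:
        """Return the layout options as keyword arguments for LAParams."""
        return asdict(self)

    def run(self, sources: List[str]) -> Dict[str, List[Dict[str, Any]]]:
        """Convert each PDF path into a dict with 'content' and 'meta' keys."""
        # Imported lazily so the class can be defined without pdfminer.six installed.
        from pdfminer.high_level import extract_text
        from pdfminer.layout import LAParams

        laparams = LAParams(**self.layout_kwargs())
        documents = [
            {"content": extract_text(src, laparams=laparams), "meta": {"file_path": src}}
            for src in sources
        ]
        return {"documents": documents}
```

With this shape, the two-column case above would only need `PDFMinerToDocumentSketch(line_margin=0.6)`, with no custom extraction code.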

Describe alternatives you've considered

Continue to use the PyPDFToDocument converter, but it comes with fewer user-specifiable options out of the box. According to the pypdf docs, customizing the text extraction would require users to:

Create their own pypdf converter in Haystack. For example, our default one is https://github.com/deepset-ai/haystack/blob/c0b67432e4da0af41a25cdbb74115c6cad1c6abf/haystack/components/converters/pypdf.py#L27-L35
Create their own custom visitor functions, as shown in this example from their docs: https://pypdf2.readthedocs.io/en/latest/user/extract-text.html#example-1-ignore-header-and-footer
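To illustrate how much custom code the pypdf route involves, here is a sketch in the spirit of the linked header/footer example. The helper names and the y-coordinate thresholds (50 and 720) are illustrative assumptions. Which matrix carries the vertical position may depend on the pypdf version; this sketch reads it from `tm[5]`.

```python
from typing import Callable, List


def make_body_visitor(
    parts: List[str], y_min: float = 50.0, y_max: float = 720.0
) -> Callable[..., None]:
    """Build a visitor that keeps only text located between y_min and y_max."""

    def visitor(text, cm, tm, font_dict, font_size) -> None:
        # Assumption: tm[5] is the vertical position of the text fragment.
        y = tm[5]
        if y_min < y < y_max:
            parts.append(text)

    return visitor


def extract_body_text(pdf_path: str) -> str:
    """Extract text from every page, skipping header and footer regions."""
    # Imported lazily so the visitor logic can be used without pypdf installed.
    from pypdf import PdfReader

    pages = []
    for page in PdfReader(pdf_path).pages:
        parts: List[str] = []
        # pypdf calls the visitor once per text fragment it encounters.
        page.extract_text(visitor_text=make_body_visitor(parts))
        pages.append("".join(parts))
    return "\n".join(pages)
```

Every user would have to write, tune, and wire such a function into their own converter.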

I think adding support for pdfminer would be a great way to give users more flexibility without requiring them to write custom code.

Jan 18 '24 08:01 sjrl

The original discussion from the private repo, which is summarized above: https://github.com/deepset-ai/haystack-private/issues/5

Jan 18 '24 08:01 sjrl

@sjrl Is a PyMuPDF-based converter also in the works?

Feb 23 '24 18:02 devinsaini

@devinsaini I don’t believe there is work planned for that currently. I would recommend opening a new issue for that as a feature request!

Feb 26 '24 07:02 sjrl

I'll take this on.

Apr 24 '24 20:04 medsriha

Hey @dfokina, is it safe to close this issue thanks to @medsriha's PR? Or do we need to add the corresponding docs before we close it?

May 13 '24 08:05 sjrl

Hey @sjrl, thank you for tagging me here! I'll assign it to myself to work on the docs, and then close it 👍

May 13 '24 12:05 dfokina