feat: Add converter based on pdfminer
Is your feature request related to a problem? Please describe.
We have found the https://github.com/pdfminer/pdfminer.six package to perform well for text extraction from PDFs, especially in cases that involve two-column layouts.
Additionally, pdfminer offers quite a few options that can be passed to its command-line tool (docs here), giving users a lot of flexibility to tune the text extraction for their needs.
For example, in the past we ran into a two-column PDF from a client that required the following command
pdf2txt.py FILE.pdf --line-marg 0.6 > out.txt
to properly extract the text in the correct reading order.
Describe the solution you'd like
Implementation of a PDFMinerToDocument converter.
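As a rough illustration of what such a converter could look like, here is a minimal sketch. It assumes pdfminer.six's `extract_text` and `LAParams` from its Python API; the `Document` dataclass is a hypothetical stand-in for Haystack's own class, and the constructor signature and `run` return shape are assumptions for illustration, not a proposed final design.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Union


@dataclass
class Document:
    """Hypothetical stand-in for Haystack's Document, to keep the sketch self-contained."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


class PDFMinerToDocument:
    """Sketch of a converter that forwards layout options to pdfminer.six."""

    def __init__(self, line_margin: float = 0.5, **layout_options: Any) -> None:
        # Extra keyword arguments (e.g. char_margin, word_margin, boxes_flow)
        # are stored as-is and later handed to pdfminer's LAParams.
        self.layout_options: Dict[str, Any] = {"line_margin": line_margin, **layout_options}

    def run(self, sources: List[Union[str, Path]]) -> Dict[str, List[Document]]:
        # Imported lazily so the class can be constructed without pdfminer.six installed.
        from pdfminer.high_level import extract_text
        from pdfminer.layout import LAParams

        laparams = LAParams(**self.layout_options)
        documents = []
        for source in sources:
            text = extract_text(str(source), laparams=laparams)
            documents.append(Document(content=text, meta={"file_path": str(source)}))
        return {"documents": documents}


# Mirrors the line margin used in the pdf2txt.py command above.
converter = PDFMinerToDocument(line_margin=0.6)
```

With pdfminer.six installed, `converter.run(["FILE.pdf"])` would be the in-pipeline counterpart of the command-line call, with the layout options set once at construction time.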
Describe alternatives you've considered
Continue to use the PyPDFToDocument converter, but it comes with fewer user-specifiable options out of the box. According to their docs, customizing the text extraction would require users to:
- Create their own pypdf converter in Haystack. For example, our default one is https://github.com/deepset-ai/haystack/blob/c0b67432e4da0af41a25cdbb74115c6cad1c6abf/haystack/components/converters/pypdf.py#L27-L35
- Create their own custom visitor functions, as shown in this example from their docs: https://pypdf2.readthedocs.io/en/latest/user/extract-text.html#example-1-ignore-header-and-footer
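To give a sense of the custom code the second option involves, here is a minimal sketch of a visitor function in the spirit of the linked header/footer example. The `make_body_visitor` helper and its thresholds are hypothetical. Reading the vertical position from index 5 of the text matrix is an assumption that may not hold for every PDF or pypdf version.

```python
from typing import Any, Callable, List, Tuple


def make_body_visitor(y_min: float, y_max: float) -> Tuple[Callable[..., None], List[str]]:
    """Build a visitor that keeps only text positioned strictly between y_min and y_max."""
    parts: List[str] = []

    def visitor(text: str, cm: Any, tm: Any, font_dict: Any, font_size: Any) -> None:
        # Assumption: tm is the text matrix and index 5 is its vertical translation.
        y = tm[5]
        if y_min < y < y_max:
            parts.append(text)

    return visitor, parts


# Illustrative thresholds for dropping a header and footer.
visitor, parts = make_body_visitor(y_min=50, y_max=720)

# With pypdf installed, usage would look roughly like:
#   from pypdf import PdfReader
#   page = PdfReader("FILE.pdf").pages[0]
#   page.extract_text(visitor_text=visitor)
#   body = "".join(parts)
```

Each user would need to write and tune a function like this per document layout, which is the overhead the pdfminer options are meant to avoid.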
I think adding support for pdfminer would be a great way to give users more flexibility without requiring them to write custom code.
The above summarizes the original discussion from the private repo: https://github.com/deepset-ai/haystack-private/issues/5
@sjrl Is a PyMuPDF-based converter also in the works?
@devinsaini I don’t believe there is work planned for that currently. I would recommend opening a new issue for that as a feature request!
I'll take this on.
Hey @dfokina is it safe to close this issue thanks to @medsriha's PR? Or do we need to add respective docs for this before we should close it?
Hey @sjrl, thank you for tagging me here! I'll assign it to myself to work on the docs, and then close it 👍