haystack feat: add converter based on pdfminer

Related Issues

#6763

Proposed Changes:

The default PDF converter may not extract text correctly from PDFs with complex layouts, such as those with multiple text columns. To address this, this PR introduces PDFMinerToDocument, which lets users customize text extraction through pdfminer's native arguments, for example to preserve the reading order.
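As a rough illustration of what "pdfminer native arguments" refers to, the sketch below passes layout-analysis settings to pdfminer.six directly. It is not the PDFMinerToDocument implementation from this PR: `LayoutOptions` and `extract_pdf_text` are hypothetical names used only for this example, and the default values mirror what I understand to be the defaults of pdfminer's `LAParams`.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LayoutOptions:
    """Hypothetical container mirroring the keyword arguments of pdfminer's LAParams."""

    line_overlap: float = 0.5
    char_margin: float = 2.0
    line_margin: float = 0.5
    word_margin: float = 0.1
    # Between -1.0 and 1.0, or None to disable advanced layout analysis.
    boxes_flow: Optional[float] = 0.5
    detect_vertical: bool = False
    all_texts: bool = False


def extract_pdf_text(path: str, options: LayoutOptions) -> str:
    """Extract text from the PDF at `path` using the given layout options."""
    # Imported lazily so this module can be imported without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    return extract_text(path, laparams=LAParams(**asdict(options)))
```

A call such as `extract_pdf_text("paper.pdf", LayoutOptions(boxes_flow=None))` (with a placeholder file name) would then run the extraction with advanced layout analysis disabled. Which values work best for a given multi-column document is something to determine experimentally.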

How did you test it?

Tested using several unit tests

Notes for the reviewer

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

Apr 27 '24 02:04 medsriha

Pull Request Test Coverage Report for Build 8901937116

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.07%) to 90.195%

Totals
Change from base Build 8896610003:	0.07%
Covered Lines:	6384
Relevant Lines:	7078

💛 - Coveralls

Apr 27 '24 22:04 coveralls

Nice! The PR looks great, but there are some linting issues that need to be fixed. I see pylint, mypy and black failing; these should be easy fixes. You can run them locally with hatch run test:lint and hatch run test:types, and some lint failures can be fixed automatically with hatch run lint-fix.

Also, when adding new lazy imports, remember to add the packages they need to the test dependencies; otherwise the tests will always fail.
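For context on why a missing test dependency makes the tests fail, here is a minimal, generic sketch of the lazy-import idea using only the standard library. It is not Haystack's own helper: `OptionalDependency` is a hypothetical class written for this illustration. The import error is swallowed at import time and raised only when the dependency is actually needed, which is what happens in every test that instantiates the component.

```python
import importlib
from types import ModuleType
from typing import Optional


class OptionalDependency:
    """Hypothetical helper that defers an ImportError until the module is needed."""

    def __init__(self, module_name: str, install_hint: str) -> None:
        self.module_name = module_name
        self.install_hint = install_hint
        self._module: Optional[ModuleType] = None
        self._error: Optional[ImportError] = None
        try:
            self._module = importlib.import_module(module_name)
        except ImportError as exc:
            # Remember the failure instead of raising at import time.
            self._error = exc

    def check(self) -> ModuleType:
        """Return the module, or raise with an install hint if it is missing."""
        if self._module is None:
            raise ImportError(
                f"Could not import '{self.module_name}'. {self.install_hint}"
            ) from self._error
        return self._module
```

If the package is absent from the test environment, every call to `check()` raises, so any test that exercises the component fails regardless of the code under test.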

Apr 29 '24 14:04 silvanocerza

Also, I suggest rebasing on main or merging main into your branch to bring in PR #7215, as I recently changed which checks are required to merge.

Apr 29 '24 15:04 silvanocerza

I didn't notice at all that we were missing tests; this causes coverage to go down a bit. Could you add some? 👀

Apr 30 '24 07:04 silvanocerza