haystack
feat: add converter based on pdfminer
Related Issues
- #6763
Proposed Changes:
The default PDF converter may not extract text correctly from PDFs with complex layouts, such as those containing multiple text columns. To address this, PDFMinerToDocument is being introduced so that users can customize text extraction from PDF files through pdfminer's native arguments. Users can then configure the converter to retain the reading order, among other options.
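To illustrate what "pdfminer native arguments" refers to, here is a minimal sketch that calls pdfminer.six directly. It is not the PDFMinerToDocument implementation from this PR: the helper names `layout_options` and `extract_pdf_text` are hypothetical, and the defaults are assumed to mirror those of pdfminer.six's `LAParams`. The `boxes_flow` argument is the one that influences the order in which text boxes are read.

```python
from typing import Any, Dict, Optional


def layout_options(
    boxes_flow: Optional[float] = 0.5,
    char_margin: float = 2.0,
    line_margin: float = 0.5,
    word_margin: float = 0.1,
    detect_vertical: bool = False,
    all_texts: bool = False,
) -> Dict[str, Any]:
    """Collect keyword arguments for pdfminer's LAParams (hypothetical helper)."""
    # pdfminer accepts None (no flow-based ordering) or a value in [-1.0, 1.0].
    if boxes_flow is not None and not -1.0 <= boxes_flow <= 1.0:
        raise ValueError("boxes_flow must be None or within [-1.0, 1.0]")
    return {
        "boxes_flow": boxes_flow,
        "char_margin": char_margin,
        "line_margin": line_margin,
        "word_margin": word_margin,
        "detect_vertical": detect_vertical,
        "all_texts": all_texts,
    }


def extract_pdf_text(path: str, **overrides: Any) -> str:
    """Extract text from a PDF file using the given layout overrides."""
    # Imported inside the function so this module still loads
    # when pdfminer.six is not installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    return extract_text(path, laparams=LAParams(**layout_options(**overrides)))
```

A call such as `extract_pdf_text("report.pdf", boxes_flow=None)` (the file name is a placeholder) would try a different ordering strategy; which value works best for a given multi-column layout has to be checked against the actual document.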
How did you test it?
Tested using several unit tests
Notes for the reviewer
- I have read the contributors guidelines and the code of conduct
- I have updated the related issue with new insights and changes
- I added unit tests and updated the docstrings
- I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
- I documented my code
- I ran pre-commit hooks and fixed any issue
Pull Request Test Coverage Report for Build 8901937116
Details
- 0 of 0 changed or added relevant lines in 0 files are covered.
- No unchanged relevant lines lost coverage.
- Overall coverage increased (+0.07%) to 90.195%
Totals | |
---|---|
Change from base Build 8896610003: | 0.07% |
Covered Lines: | 6384 |
Relevant Lines: | 7078 |
💛 - Coveralls
Nice! The PR looks great, but there are some linting issues that need to be fixed. I see pylint, mypy and black failing; these should be easy fixes. You can run them locally with hatch run test:lint and hatch run test:types, and some lint failures can be fixed automatically with hatch run lint-fix.
Also, when adding new lazy imports, remember to add the packages they need to the test dependencies, otherwise the tests will always fail.
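For readers unfamiliar with the pattern, a lazy import defers loading an optional dependency until the feature that needs it is used, and fails with an actionable message if the package is missing. The sketch below uses only the standard library; `require` is a hypothetical helper and is not the mechanism Haystack itself uses.

```python
import importlib
from types import ModuleType


def require(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency on first use (hypothetical helper)."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a hint so the user knows how to fix the problem.
        raise ImportError(
            f"'{module_name}' is required for this feature. Try: {install_hint}"
        ) from exc
```

Because the import only happens when the feature is exercised, a test environment that lacks the package fails as soon as a test touches that code path, which is why the test dependencies need to list it.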
I also suggest rebasing on main, or merging main into your branch, to bring in PR #7215, as I recently changed which checks are required to merge.
I didn't notice at all that we were missing tests, which causes coverage to go down a bit. Could you add some? 👀