
feat: add converter based on pdfminer

Open medsriha opened this issue 10 months ago • 4 comments

Related Issues

  • #6763

Proposed Changes:

The default PDF converter may not extract text correctly from PDFs with complex layouts, such as those containing multiple text columns. To address this, PDFMinerToDocument is introduced so that users can customize text extraction from PDF files through pdfminer's native arguments. Users can then configure the converter to retain the reading order, among other options.
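For context, here is a minimal sketch of the kind of layout tuning that pdfminer exposes. It calls pdfminer.six directly rather than the PDFMinerToDocument component from this PR, whose exact signature is not shown here. The helper names `layout_options` and `extract_pdf_text` are hypothetical, and the values are pdfminer's documented defaults, not a recommendation for multi-column PDFs.

```python
from typing import Any, Dict


def layout_options(
    boxes_flow: float = 0.5,
    char_margin: float = 2.0,
    line_margin: float = 0.5,
) -> Dict[str, Any]:
    """Collect keyword arguments for pdfminer's LAParams (hypothetical helper).

    boxes_flow weighs horizontal against vertical position when ordering
    text boxes; char_margin and line_margin control how characters and
    lines are grouped together.
    """
    return {
        "boxes_flow": boxes_flow,
        "char_margin": char_margin,
        "line_margin": line_margin,
    }


def extract_pdf_text(path: str, **overrides: Any) -> str:
    """Extract text from a PDF with custom layout parameters (hypothetical helper)."""
    # Imported inside the function so this module still loads when
    # pdfminer.six is not installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    return extract_text(path, laparams=LAParams(**layout_options(**overrides)))
```

Calling `extract_pdf_text("paper.pdf", boxes_flow=-1.0)` would, for example, order text boxes by horizontal position only. Which values work best depends on the document, so treat any setting as something to verify against real files.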

How did you test it?

Tested using several unit tests

Notes for the reviewer

medsriha avatar Apr 27 '24 02:04 medsriha

Pull Request Test Coverage Report for Build 8901937116

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.07%) to 90.195%

Totals Coverage Status
Change from base Build 8896610003: 0.07%
Covered Lines: 6384
Relevant Lines: 7078

💛 - Coveralls

coveralls avatar Apr 27 '24 22:04 coveralls

Nice! The PR looks great, but there are some linting issues that need to be fixed. I see pylint, mypy and black failing; these should be easy fixes. You can run the checks locally with hatch run test:lint and hatch run test:types, and some lint failures can be fixed automatically with hatch run lint-fix.

Also, when adding new lazy imports, remember to add the required packages to the test dependencies; otherwise the tests will always fail.

silvanocerza avatar Apr 29 '24 14:04 silvanocerza

Also, I suggest rebasing on main or merging main into your branch to bring in PR #7215, as I recently changed which checks are required to merge.

silvanocerza avatar Apr 29 '24 15:04 silvanocerza

I didn't notice at all that we were missing tests; this causes coverage to go down a bit. Could you add some? 👀

silvanocerza avatar Apr 30 '24 07:04 silvanocerza