Update version of `pdfminer-six` to `20240706`

Open ValentinaGalataAA opened this issue 1 year ago • 10 comments

Please update the version of pdfminer-six to 20240706.

ValentinaGalataAA avatar Jul 08 '24 08:07 ValentinaGalataAA

There seems to be a bug in the latest release — https://github.com/pdfminer/pdfminer.six/issues/1004 — which also happens to be throwing errors in pdfplumber's test suite. I'll keep an eye out for pdfminer.six's next release, which hopefully fixes the bug.

jsvine avatar Jul 14 '24 15:07 jsvine

I fixed the bug :) https://github.com/pdfminer/pdfminer.six/pull/1027 hopefully it gets released soon!

dhdaines avatar Jul 31 '24 21:07 dhdaines

@dhdaines Wonderful, thanks!

jsvine avatar Jul 31 '24 22:07 jsvine

@jsvine would you consider upgrading this dependency before the next release of pdfminer.six?

  • pdfminer.six has a release cycle of about 5-6 months, so it could be another 5 months until the next release, which is a bit too long in my opinion
  • the current version throws similar errors too, which is what I encountered (please see below)

The project I'm working on uses pdfplumber in production, and when parsing the following PDF https://www.ge.com/sites/default/files/ge2021_sustainability_report.pdf, it raises TypeError: 'PDFObjRef' object is not iterable

I verified locally that pdfminer.six 20240706 resolves the issue. (I forced pdfplumber 0.10.2 and pdfminer.six 20240706 to coexist in order to verify it. However, I couldn't do that in the project code because Poetry is used there.)

chenxi-briink avatar Aug 13 '24 11:08 chenxi-briink

Hi @chenxi-briink, can you try upgrading pdfplumber to the latest version, 0.11.3? Using that version, I'm able to parse the PDF you've cited with no problems/errors.

jsvine avatar Aug 18 '24 23:08 jsvine

Hi @jsvine,

Sorry, I mistyped the version number in my previous message.

I forced pdfplumber 0.10.2 and pdfminer.six 20240706

should be: I forced pdfplumber 0.11.3 and pdfminer.six 20240706 to coexist.

Yes, that combination works for me.

However, the issue is that pdfplumber's requirements.txt pins pdfminer.six 20231228, and it is that version that throws this exception:

File ~/foo/bar/.venv/lib/python3.11/site-packages/pdfminer/pdftypes.py:373, in PDFStream.decode(self)
    371     raise PDFNotImplementedError("Unsupported filter: %r" % f)
    372 # apply predictors
--> 373 if params and "Predictor" in params:
    374     pred = int_value(params["Predictor"])
    375     if pred == 1:
    376         # no predictor

TypeError: argument of type 'PDFObjRef' is not iterable

In my production environment, where Poetry is used, I couldn't override the pinned pdfminer.six version 20231228.
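
For readers unfamiliar with this failure mode, here is a minimal sketch of why the `in` test in the traceback raises. `FakeObjRef`, `resolve`, and `has_predictor` are hypothetical stand-ins, not pdfminer.six classes or functions, and the actual fix in pdfminer.six may look quite different. The only point is that Python's `in` operator needs a container, so an unresolved reference object has to be turned into the dictionary it points to first.

```python
class FakeObjRef:
    """Hypothetical stand-in for an unresolved indirect PDF object reference."""

    def __init__(self, target):
        self.target = target


def resolve(obj):
    """Follow hypothetical references until a plain value is reached."""
    while isinstance(obj, FakeObjRef):
        obj = obj.target
    return obj


def has_predictor(params):
    """Resolve first, then test membership on the resulting dict."""
    params = resolve(params)
    return bool(params) and "Predictor" in params


params = FakeObjRef({"Predictor": 12})

try:
    "Predictor" in params  # membership test on a non-container object
except TypeError as exc:
    print(f"TypeError: {exc}")

print(has_predictor(params))
```

The unguarded membership test fails because `FakeObjRef` defines none of `__contains__`, `__iter__`, or `__getitem__`; resolving the reference before the test avoids the error.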

chenxi-briink avatar Aug 19 '24 02:08 chenxi-briink

Hi @chenxi-briink and thanks for the clarification. That's strange; I'm running the exact same combination and seeing no error. First, I set up this fresh environment:

python -m venv venv
source venv/bin/activate
pip install pdfplumber==0.11.3
pip freeze | grep pdf

... which outputs:

pdfminer.six==20231228
pdfplumber==0.11.3
pypdfium2==4.30.0

Then I ran this:

import pdfplumber

pdf = pdfplumber.open("./ge2021_sustainability_report.pdf")

# Force every page to be parsed; each page should yield at least one object
for page in pdf.pages:
    assert len(page.objects)

... which completed without error.

jsvine avatar Aug 19 '24 14:08 jsvine

Hi @jsvine,

Gee, by trying to replicate what you posted, I realised that the file I have is actually a modified version of the publicly available one I shared with you. With this modified file, the exception occurs when I follow the same steps you shared. (Sorry that I didn't double-check; I didn't expect there to be a modified version.)

I uploaded this file to a publicly accessible GDrive folder. It is basically a shortened version of the original GE 2021 Sustainability Report. A PDF viewer can render it without problems.

chenxi-briink avatar Aug 19 '24 16:08 chenxi-briink

Thanks for providing the updated PDF, @chenxi-briink. Using that one, I can indeed replicate the error.

In this case, however, I don't plan on upgrading the dependency until at least the next pdfminer.six release. Although doing so might fix your situation, it would likely break others (as confirmed by pdfplumber's test suite). @dhdaines's fix in https://github.com/pdfminer/pdfminer.six/pull/1027 handles your PDF well; perhaps you can use his fork in the meantime?

As context: pdfminer.six is a pinned dependency in pdfplumber because changes to that library can break this one. I realize it can cause issues when someone wants to use a different specific version of pdfminer.six, but that tradeoff is preferable to all new installations of pdfplumber breaking.
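
To see how the pin plays out in a given environment, a small sketch along these lines may help. It uses only the standard library's `importlib.metadata`; the helper names `installed_version` and `declared_requirements` are made up for illustration, and the exact format of the requirement strings depends on the installed package metadata.

```python
from importlib.metadata import PackageNotFoundError, requires, version


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def declared_requirements(name, dependency_prefix):
    """Return the requirement strings of `name` that start with `dependency_prefix`."""
    try:
        reqs = requires(name) or []
    except PackageNotFoundError:
        return []
    prefix = dependency_prefix.lower()
    return [r for r in reqs if r.lower().startswith(prefix)]


print("installed pdfminer.six:", installed_version("pdfminer.six"))
print("declared by pdfplumber:", declared_requirements("pdfplumber", "pdfminer"))
```

If the installed version differs from the one pdfplumber declares, the environment was probably modified after pdfplumber was installed.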

jsvine avatar Aug 19 '24 23:08 jsvine

Hi @jsvine, I totally understand the rationale for not upgrading. Thanks for explaining and for pointing me to @dhdaines's fork; I might find some time to give it a try.

chenxi-briink avatar Aug 20 '24 06:08 chenxi-briink

Any news on this?

PhorstenkampFuzzy avatar Jan 31 '25 09:01 PhorstenkampFuzzy

Any update on this request? I have a project in which the packages I use rely on two different versions of pdfminer.six: unfortunately, pdfplumber relies on the older version while the other package relies on the newer version.

Any idea when pdfplumber will be updated to support the 20240706 version of pdfminer-six?

thivagar-manickam avatar Feb 07 '25 08:02 thivagar-manickam

You can use PAVÉS now; it is mostly a drop-in replacement for pdfminer, except that it fixes a bunch of problems and is also somewhat faster.

Can you try this PR? https://github.com/jsvine/pdfplumber/pull/1272

dhdaines avatar Feb 07 '25 18:02 dhdaines

Thanks for the suggestion, @dhdaines; let me see if I can give it a try.

thivagar-manickam avatar Feb 15 '25 05:02 thivagar-manickam

For what it is worth:

I previously installed the unstructured-ingest project (which has gone for-profit now) in the hope of mining PDFs of court decisions whose caption boxes are drawn with ASCII characters, which makes for problematic ingestion.

I installed pdfplumber, but got a warning suggesting that my unstructured-ingest install would be broken. I upgraded pdfminer-six, and got a warning that I was now breaking pdfplumber. Re-installing pdfplumber re-installed the older version of pdfminer-six. So the workaround, I suppose, if you're going to use either tool, is to run a pip install ahead of time to set the pdfminer-six version.

(map) jlpoole@ryzdesk ~ $ pip install pdfplumber
Collecting pdfplumber
Downloading pdfplumber-0.11.5-py3-none-any.whl.metadata (42 kB)
Collecting pdfminer.six==20231228 (from pdfplumber)
Downloading pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Requirement already satisfied: Pillow>=9.1 in ./map/lib/python3.12/site-packages (from pdfplumber) (10.4.0)
Requirement already satisfied: pypdfium2>=4.18.0 in ./map/lib/python3.12/site-packages (from pdfplumber) (4.30.1)
Requirement already satisfied: charset-normalizer>=2.0.0 in ./map/lib/python3.12/site-packages (from pdfminer.six==20231228->pdfplumber) (3.3.2)
Requirement already satisfied: cryptography>=36.0.0 in ./map/lib/python3.12/site-packages (from pdfminer.six==20231228->pdfplumber) (44.0.1)
Requirement already satisfied: cffi>=1.12 in ./map/lib/python3.12/site-packages (from cryptography>=36.0.0->pdfminer.six==20231228->pdfplumber) (1.17.1)
Requirement already satisfied: pycparser in ./map/lib/python3.12/site-packages (from cffi>=1.12->cryptography>=36.0.0->pdfminer.six==20231228->pdfplumber) (2.22)
Downloading pdfplumber-0.11.5-py3-none-any.whl (59 kB)
Downloading pdfminer.six-20231228-py3-none-any.whl (5.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.6/5.6 MB 12.7 MB/s eta 0:00:00
Installing collected packages: pdfminer.six, pdfplumber
Attempting uninstall: pdfminer.six
Found existing installation: pdfminer.six 20240706
Uninstalling pdfminer.six-20240706:
Successfully uninstalled pdfminer.six-20240706
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
unstructured-inference 0.8.7 requires pdfminer-six==20240706, but you have pdfminer-six 20231228 which is incompatible.
Successfully installed pdfminer.six-20231228 pdfplumber-0.11.5
(map) jlpoole@ryzdesk ~ $ pip install pdfminer-six -U
Requirement already satisfied: pdfminer-six in ./map/lib/python3.12/site-packages (20231228)
Collecting pdfminer-six
Using cached pdfminer.six-20240706-py3-none-any.whl.metadata (4.1 kB)
Requirement already satisfied: charset-normalizer>=2.0.0 in ./map/lib/python3.12/site-packages (from pdfminer-six) (3.3.2)
Requirement already satisfied: cryptography>=36.0.0 in ./map/lib/python3.12/site-packages (from pdfminer-six) (44.0.1)
Requirement already satisfied: cffi>=1.12 in ./map/lib/python3.12/site-packages (from cryptography>=36.0.0->pdfminer-six) (1.17.1)
Requirement already satisfied: pycparser in ./map/lib/python3.12/site-packages (from cffi>=1.12->cryptography>=36.0.0->pdfminer-six) (2.22)
Using cached pdfminer.six-20240706-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer-six
Attempting uninstall: pdfminer-six
Found existing installation: pdfminer.six 20231228
Uninstalling pdfminer.six-20231228:
Successfully uninstalled pdfminer.six-20231228
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pdfplumber 0.11.5 requires pdfminer.six==20231228, but you have pdfminer-six 20240706 which is incompatible.
Successfully installed pdfminer-six-20240706
(map) jlpoole@ryzdesk ~ $ pip install pdfplumber
Requirement already satisfied: pdfplumber in ./map/lib/python3.12/site-packages (0.11.5)
Collecting pdfminer.six==20231228 (from pdfplumber)
Using cached pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Requirement already satisfied: Pillow>=9.1 in ./map/lib/python3.12/site-packages (from pdfplumber) (10.4.0)
Requirement already satisfied: pypdfium2>=4.18.0 in ./map/lib/python3.12/site-packages (from pdfplumber) (4.30.1)
Requirement already satisfied: charset-normalizer>=2.0.0 in ./map/lib/python3.12/site-packages (from pdfminer.six==20231228->pdfplumber) (3.3.2)
Requirement already satisfied: cryptography>=36.0.0 in ./map/lib/python3.12/site-packages (from pdfminer.six==20231228->pdfplumber) (44.0.1)
Requirement already satisfied: cffi>=1.12 in ./map/lib/python3.12/site-packages (from cryptography>=36.0.0->pdfminer.six==20231228->pdfplumber) (1.17.1)
Requirement already satisfied: pycparser in ./map/lib/python3.12/site-packages (from cffi>=1.12->cryptography>=36.0.0->pdfminer.six==20231228->pdfplumber) (2.22)
Using cached pdfminer.six-20231228-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer.six
Attempting uninstall: pdfminer.six
Found existing installation: pdfminer.six 20240706
Uninstalling pdfminer.six-20240706:
Successfully uninstalled pdfminer.six-20240706
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
unstructured-inference 0.8.7 requires pdfminer-six==20240706, but you have pdfminer-six 20231228 which is incompatible.
Successfully installed pdfminer.six-20231228
(map) jlpoole@ryzdesk ~ $
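
The conflict in the pip output above comes down to two packages pinning the same dependency to different exact versions. A rough sketch of detecting that from requirement strings follows; the helpers `normalize`, `exact_pins`, and `conflicts` are hypothetical, handle only simple `name==version` strings, and are not how pip's resolver works internally.

```python
import re


def normalize(name):
    """Normalize a distribution name so 'pdfminer.six' equals 'pdfminer-six'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def exact_pins(requirements):
    """Map each requiring package to its exact (==) pin, skipping other specifiers."""
    pins = {}
    pattern = r"\s*([A-Za-z0-9._-]+)\s*==\s*([^\s;]+)\s*"
    for owner, requirement in requirements.items():
        match = re.fullmatch(pattern, requirement)
        if match:
            pins[owner] = (normalize(match.group(1)), match.group(2))
    return pins


def conflicts(requirements):
    """Return dependencies that are pinned to more than one version."""
    versions = {}
    for owner, (dep, pinned) in exact_pins(requirements).items():
        versions.setdefault(dep, {})[owner] = pinned
    return {
        dep: owners
        for dep, owners in versions.items()
        if len(set(owners.values())) > 1
    }


# Requirement strings copied from the pip output above.
reqs = {
    "pdfplumber 0.11.5": "pdfminer.six==20231228",
    "unstructured-inference 0.8.7": "pdfminer-six==20240706",
}
print(conflicts(reqs))
```

With two exact pins like these, no single installed version can satisfy both packages, which is why each `pip install` above undoes the previous one.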

jlpoolen avatar Mar 02 '25 16:03 jlpoolen

Unstructured.io monkey-patches pdfminer.six to get around some issues that they themselves introduced into the codebase... They used to "repair" PDFs that weren't broken, because their change to pdfminer.six caused them to fail to extract: https://github.com/Unstructured-IO/unstructured/issues/3815

Now it seems that they used to include pdfplumber (even though they didn't actually use it for anything) to get a transitive dependency on pdfminer.six and then changed it to pin the exact version at 20240706, thus preventing anybody from installing pdfplumber: https://github.com/Unstructured-IO/unstructured-inference/pull/406

Then they "fixed" this to just make it a lower bound, which of course solves no problems, since that version still has bugs that prevent it from working on a large number of real-world PDFs, and there is no newer version on the horizon: https://github.com/Unstructured-IO/unstructured-inference/pull/410

I see no indication that Unstructured as a company has any idea what they're doing with their open-source code releases.

dhdaines avatar Mar 02 '25 19:03 dhdaines

Just released a new version.

pietermarsman avatar Mar 24 '25 07:03 pietermarsman

Thanks, @pietermarsman! Just pushed pdfplumber==0.11.6, which updates the pinned pdfminer.six version.

jsvine avatar Mar 28 '25 03:03 jsvine