pdfplumber
Update version of `pdfminer-six` to `20240706`
Please update the version of pdfminer-six to 20240706.
There seems to be a bug in the latest release — https://github.com/pdfminer/pdfminer.six/issues/1004 — which also happens to be throwing errors in pdfplumber's test suite. I'll keep an eye out for pdfminer.six's next release, which hopefully fixes the bug.
I fixed the bug :) https://github.com/pdfminer/pdfminer.six/pull/1027 hopefully it gets released soon!
@dhdaines Wonderful, thanks!
@jsvine would you consider upgrading this dependency before the next release of pdfminer.six?
- pdfminer has a release cycle of about 5-6 months, so it could mean another 5 months until the next release, which is a bit too long imo
- the current version throws similar errors too, which is what I encountered (please see below)
The project I'm working on uses pdfplumber in production, and when parsing the following PDF
https://www.ge.com/sites/default/files/ge2021_sustainability_report.pdf, it raises TypeError: 'PDFObjRef' object is not iterable
I tested locally that pdfminer.six 20240706 could solve the issue. (I forced pdfplumber 0.10.2 and pdfminer.six 20240706 to coexist in order to verify it. However I couldn't do that in the project code because poetry is used there)
Hi @chenxi-briink, can you try upgrading pdfplumber to the latest version, 0.11.3? Using that version, I'm able to parse the PDF you've cited with no problems/errors.
Hi @jsvine,
Sorry, I mistyped the version number in my previous message.
I forced pdfplumber 0.10.2 and pdfminer.six 20240706
should be: I forced pdfplumber 0.11.3 and pdfminer.six 20240706 to coexist.
Yes, that combination works for me.
However, the issue is that pdfplumber's requirements.txt pins pdfminer.six 20231228, and it is the latter that throws this exception:
File ~/foo/bar/.venv/lib/python3.11/site-packages/pdfminer/pdftypes.py:373, in PDFStream.decode(self)
371 raise PDFNotImplementedError("Unsupported filter: %r" % f)
372 # apply predictors
--> 373 if params and "Predictor" in params:
374 pred = int_value(params["Predictor"])
375 if pred == 1:
376 # no predictor
TypeError: argument of type 'PDFObjRef' is not iterable
In my production environment, where Poetry is used, I couldn't override the pinned pdfminer.six version 20231228.
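The failing line in the traceback applies the `in` operator to `params`. Below is a minimal sketch of why that raises when `params` is still an unresolved reference. It uses a hypothetical stand-in class, `FakeObjRef`, and a hypothetical `resolve` helper; neither is pdfminer.six's actual code, and the real fix may differ.

```python
class FakeObjRef:
    """Hypothetical stand-in for an indirect PDF object reference."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


def resolve(obj):
    """Follow references until a concrete value is reached (illustrative only)."""
    while isinstance(obj, FakeObjRef):
        obj = obj.resolve()
    return obj


def has_predictor(params):
    # Membership test on the raw value, mirroring the line in the traceback.
    return bool(params) and "Predictor" in params


def has_predictor_resolved(params):
    # Resolve the reference first, then test membership on the dict.
    params = resolve(params)
    return bool(params) and "Predictor" in params


ref = FakeObjRef({"Predictor": 12})

try:
    has_predictor(ref)
except TypeError as exc:
    # The stand-in defines no __contains__ or __iter__, so `in` fails.
    print("unresolved:", type(exc).__name__)

print("resolved:", has_predictor_resolved(ref))
```

The reference object is truthy, so the `params and ...` guard does not short-circuit, and the membership test is what fails.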
Hi @chenxi-briink and thanks for the clarification. That's strange; I'm running the exact same combination and seeing no error. First, I set up this fresh environment:
python -m venv venv
source venv/bin/activate
pip install pdfplumber==0.11.3
pip freeze | grep pdf
... which outputs:
pdfminer.six==20231228
pdfplumber==0.11.3
pypdfium2==4.30.0
Then I ran this:
import pdfplumber
pdf = pdfplumber.open("./ge2021_sustainability_report.pdf")
for page in pdf.pages:
    assert len(pdf.objects)
... which completed without error.
Hi @jsvine,
Gee, by trying to replicate what you posted, I realised that the file I had turned out to be a modified version of the publicly available one I shared with you. With this modified file, the exception occurs when following the same steps you shared. (Sorry that I didn't double-check, because I didn't expect there to be a modified version.)
I uploaded this file to a publicly accessible GDrive folder; basically, it's a shortened version of the original GE 2021 Sustainability Report. A PDF viewer can render it without problems.
Thanks for providing the updated PDF, @chenxi-briink. Using that one, I can indeed replicate the error.
In this case, however, I don't plan on upgrading the dependency until at least the next pdfminer.six release. Although doing so might fix your situation, it will likely break others (as confirmed by pdfplumber's test suite). @dhdaines's fix in https://github.com/pdfminer/pdfminer.six/pull/1027 handles your PDF well; perhaps you can use his fork in the meantime?
As context: pdfminer.six is a pinned dependency in pdfplumber because changes to that library can have breaking changes for this one. I realize it can cause issues when someone wants to use a different specific version of pdfminer.six, but that tradeoff is preferable to all new installations of pdfplumber breaking.
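For anyone wanting to confirm which pdfminer.six version actually ended up in an environment, here is a minimal sketch using the standard library's `importlib.metadata`. The helper names `installed_version` and `satisfies_pin` are hypothetical, and the pin value is the one reported earlier in this thread for pdfplumber 0.11.3; adjust it for other releases.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def satisfies_pin(installed, pinned):
    """True only when the installed version equals the exact pin."""
    return installed is not None and installed == pinned


# Pin reported in this thread for pdfplumber 0.11.3 (an assumption for other releases).
PINNED = "20231228"
found = installed_version("pdfminer.six")
print("installed:", found, "| matches pin:", satisfies_pin(found, PINNED))
```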
Hi @jsvine, I totally understand the rationale for not upgrading. Thanks for explaining and pointing me to @dhdaines's fork; I might find some time to give it a try.
Any news on this?
Any update on this request? I have a project in which the packages I use rely on two different versions of pdfminer.six and unfortunately pdfplumber relies on the older version while the other package relies on newer version.
Any idea when pdfplumber will be updated to support the 20240706 version of pdfminer-six?
You can use PAVÉS now, it is mostly a drop-in replacement for pdfminer, except that it fixes a bunch of problems and is also somewhat faster.
Can you try this PR? https://github.com/jsvine/pdfplumber/pull/1272
Thanks for the suggestion, @dhdaines; let me see if I can give it a try.
For what it is worth:
I previously installed unstructured-ingest project (which has gone for-profit now) with the hope to mine PDFs of court decisions which have caption boxes using ASCII: that makes for problematic ingestion.
I installed pdfplumber, but got a warning suggesting that my unstructured-ingest would be broken. I upgraded pdfminer-six, and got a warning that I was now breaking pdfplumber. Reinstalling pdfplumber reinstalled the older version of pdfminer-six. So the workaround, I suppose, if you're going to use either tool, is to run a pip install ahead of time to set the pdfminer-six version.
(map) jlpoole@ryzdesk ~ $ pip install pdfplumber
Collecting pdfplumber
Downloading pdfplumber-0.11.5-py3-none-any.whl.metadata (42 kB)
Collecting pdfminer.six==20231228 (from pdfplumber)
Downloading pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Requirement already satisfied: Pillow>=9.1 in ./map/lib/python3.12/site-packages (from pdfplumber) (10.4.0)
Requirement already satisfied: pypdfium2>=4.18.0 in ./map/lib/python3.12/site-packages (from pdfplumber) (4.30.1)
Requirement already satisfied: charset-normalizer>=2.0.0 in ./map/lib/python3.12/site-packages (from pdfminer.six==20231228->pdfplumber) (3.3.2)
Requirement already satisfied: cryptography>=36.0.0 in ./map/lib/python3.12/site-packages (from pdfminer.six==20231228->pdfplumber) (44.0.1)
Requirement already satisfied: cffi>=1.12 in ./map/lib/python3.12/site-packages (from cryptography>=36.0.0->pdfminer.six==20231228->pdfplumber) (1.17.1)
Requirement already satisfied: pycparser in ./map/lib/python3.12/site-packages (from cffi>=1.12->cryptography>=36.0.0->pdfminer.six==20231228->pdfplumber) (2.22)
Downloading pdfplumber-0.11.5-py3-none-any.whl (59 kB)
Downloading pdfminer.six-20231228-py3-none-any.whl (5.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.6/5.6 MB 12.7 MB/s eta 0:00:00
Installing collected packages: pdfminer.six, pdfplumber
Attempting uninstall: pdfminer.six
Found existing installation: pdfminer.six 20240706
Uninstalling pdfminer.six-20240706:
Successfully uninstalled pdfminer.six-20240706
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
unstructured-inference 0.8.7 requires pdfminer-six==20240706, but you have pdfminer-six 20231228 which is incompatible.
Successfully installed pdfminer.six-20231228 pdfplumber-0.11.5
(map) jlpoole@ryzdesk ~ $ pip install pdfminer-six -U
Requirement already satisfied: pdfminer-six in ./map/lib/python3.12/site-packages (20231228)
Collecting pdfminer-six
Using cached pdfminer.six-20240706-py3-none-any.whl.metadata (4.1 kB)
Requirement already satisfied: charset-normalizer>=2.0.0 in ./map/lib/python3.12/site-packages (from pdfminer-six) (3.3.2)
Requirement already satisfied: cryptography>=36.0.0 in ./map/lib/python3.12/site-packages (from pdfminer-six) (44.0.1)
Requirement already satisfied: cffi>=1.12 in ./map/lib/python3.12/site-packages (from cryptography>=36.0.0->pdfminer-six) (1.17.1)
Requirement already satisfied: pycparser in ./map/lib/python3.12/site-packages (from cffi>=1.12->cryptography>=36.0.0->pdfminer-six) (2.22)
Using cached pdfminer.six-20240706-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer-six
Attempting uninstall: pdfminer-six
Found existing installation: pdfminer.six 20231228
Uninstalling pdfminer.six-20231228:
Successfully uninstalled pdfminer.six-20231228
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pdfplumber 0.11.5 requires pdfminer.six==20231228, but you have pdfminer-six 20240706 which is incompatible.
Successfully installed pdfminer-six-20240706
(map) jlpoole@ryzdesk ~ $ pip install pdfplumber
Requirement already satisfied: pdfplumber in ./map/lib/python3.12/site-packages (0.11.5)
Collecting pdfminer.six==20231228 (from pdfplumber)
Using cached pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Requirement already satisfied: Pillow>=9.1 in ./map/lib/python3.12/site-packages (from pdfplumber) (10.4.0)
Requirement already satisfied: pypdfium2>=4.18.0 in ./map/lib/python3.12/site-packages (from pdfplumber) (4.30.1)
Requirement already satisfied: charset-normalizer>=2.0.0 in ./map/lib/python3.12/site-packages (from pdfminer.six==20231228->pdfplumber) (3.3.2)
Requirement already satisfied: cryptography>=36.0.0 in ./map/lib/python3.12/site-packages (from pdfminer.six==20231228->pdfplumber) (44.0.1)
Requirement already satisfied: cffi>=1.12 in ./map/lib/python3.12/site-packages (from cryptography>=36.0.0->pdfminer.six==20231228->pdfplumber) (1.17.1)
Requirement already satisfied: pycparser in ./map/lib/python3.12/site-packages (from cffi>=1.12->cryptography>=36.0.0->pdfminer.six==20231228->pdfplumber) (2.22)
Using cached pdfminer.six-20231228-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer.six
Attempting uninstall: pdfminer.six
Found existing installation: pdfminer.six 20240706
Uninstalling pdfminer.six-20240706:
Successfully uninstalled pdfminer.six-20240706
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
unstructured-inference 0.8.7 requires pdfminer-six==20240706, but you have pdfminer-six 20231228 which is incompatible.
Successfully installed pdfminer.six-20231228
(map) jlpoole@ryzdesk ~ $
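The pip output above shows two exact pins on the same distribution, written once as `pdfminer.six` and once as `pdfminer-six`. A minimal sketch of why these cannot both be satisfied: the names normalize to the same project (per PEP 503), and two different `==` versions then collide. The helper names are hypothetical, and the parser handles only simple `name==version` strings, not full requirement syntax.

```python
import re


def normalize(name):
    """Normalize a distribution name as package indexes do (PEP 503)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_exact_pin(requirement):
    """Split 'name==version' into (normalized name, version)."""
    name, _, ver = requirement.partition("==")
    return normalize(name.strip()), ver.strip()


def conflicting_pins(requirements):
    """Map each name pinned to more than one version to its sorted versions."""
    seen = {}
    for req in requirements:
        name, ver = parse_exact_pin(req)
        seen.setdefault(name, set()).add(ver)
    return {n: sorted(v) for n, v in seen.items() if len(v) > 1}


# The two exact pins reported in the pip output above.
pins = ["pdfminer.six==20231228", "pdfminer-six==20240706"]
print(conflicting_pins(pins))
# → {'pdfminer-six': ['20231228', '20240706']}
```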
Unstructured.io monkey-patches pdfminer.six to get around some issues that they themselves introduced into the codebase... They used to "repair" PDFs that weren't broken, because their change to pdfminer.six caused them to fail to extract: https://github.com/Unstructured-IO/unstructured/issues/3815
Now it seems that they used to include pdfplumber (even though they didn't actually use it for anything) to get a transitive dependency on pdfminer.six and then changed it to pin the exact version at 20240706, thus preventing anybody from installing pdfplumber: https://github.com/Unstructured-IO/unstructured-inference/pull/406
Then they "fixed" this to just make it a lower bound, which of course solves no problems, since that version still has bugs that prevent it from working on a large number of real-world PDFs, and there is no newer version on the horizon: https://github.com/Unstructured-IO/unstructured-inference/pull/410
I see no indication that Unstructured as a company has any idea what they're doing with their open-source code releases.
Just released a new version.
Thanks, @pietermarsman! Just pushed pdfplumber==0.11.6, which updates the pinned pdfminer.six version.