NotImplementedError: unsupported filter /JBIG2Decode

Open MartinThoma opened this issue 11 months ago • 18 comments

Explanation

I found an example for the /JBIG2Decode filter :-)

Code Example

PDF: https://github.com/py-pdf/pypdf/files/12090692/New.Jersey.Coinbase.staking.securities.charges.2023-0606_Coinbase-Penalty-and-C-D.pdf

from pypdf import PdfReader, __version__

print(f"pypdf=={__version__}")

reader = PdfReader("New.Jersey.Coinbase.staking.securities.charges.2023-0606_Coinbase-Penalty-and-C-D.pdf")

page = reader.pages[0]
for img in page.images:
    print(img.name)

gives

pypdf==3.12.2
Traceback (most recent call last):
  File "/home/moose/Downloads/pyissue/main.py", line 8, in <module>
    for img in page.images:
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2604, in __iter__
    yield self[i]
          ~~~~^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2600, in __getitem__
    return self.get_function(lst[index])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 522, in _get_image
    imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/filters.py", line 844, in _xobj_to_image
    data = x_object_obj.get_data()  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/generic/_data_structures.py", line 919, in get_data
    decoded._data = decode_stream_data(self)
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/filters.py", line 634, in decode_stream_data
    raise NotImplementedError(f"unsupported filter {filter_

Jul 20 '23 16:07 MartinThoma

PDF found in https://github.com/py-pdf/pypdf/issues/1983

Jul 20 '23 16:07 MartinThoma

from #951

Here is the pdfminer.six implementation: https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/jbig2.py

ItDoesntWorkScan.pdf

Jul 20 '23 18:07 pubpub-zz

Another example: https://github.com/py-pdf/pypdf/issues/2502#issuecomment-1980190505

Mar 06 '24 21:03 MartinThoma

This might need a general design decision, if I am not mistaken: Pillow does not seem to support JBIG2, while our implementation currently assumes that every image can be loaded as a PIL.Image.Image. (pdfminer.six does not use Pillow for saving images.)

AFAIK there is only jbig2dec, which would have to be run in a subprocess to get a "good" image format from the JBIG2 image embedded in the PDF file. This would first require adding the missing bytes, specifically "the JBIG2 file header, end-of-page segments, and end-of-file segment", which are not part of the XObject according to section 7.4.7 of the PDF 2.0 spec. It might also cause issues with masks etc. (jbig2dec itself is licensed under AGPL-3.0-or-later; given its strong copyleft effect (including SaaS), it is rather unlikely to become part of Pillow.) The alternative would be to parse the essential aspects, like the pixel data, from the JBIG2 image ourselves.
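To make the "missing bytes" step concrete, here is a rough sketch of the subprocess route. It is not something pypdf does today. The helper names (wrap_embedded_jbig2, decode_with_jbig2dec) and the next_segment_number parameter are hypothetical; the byte layout follows my reading of the JBIG2 sequential file organisation. The caller has to supply a segment number higher than any used in the stream, because the sketch does not parse the existing segment headers. The jbig2dec flags (-t, -o) are what I believe the CLI accepts and should be checked against jbig2dec --help.

```python
import struct
import subprocess
import tempfile
from pathlib import Path

# 8-byte ID string that starts every standalone JBIG2 file.
JBIG2_FILE_ID = b"\x97JB2\r\n\x1a\n"
SEG_END_OF_PAGE = 49
SEG_END_OF_FILE = 51


def _empty_segment(number: int, seg_type: int, page: int) -> bytes:
    """Segment header with no referred-to segments and zero-length data.

    Layout: segment number (4 bytes), flags/type (1), referred-to count (1),
    page association (1), data length (4), all big-endian.
    """
    return struct.pack(">IBBBI", number, seg_type, 0, page, 0)


def wrap_embedded_jbig2(
    data: bytes, globals_data: bytes = b"", *, next_segment_number: int
) -> bytes:
    """Turn an embedded JBIG2 stream into a standalone single-page file.

    data: decoded bytes of the image XObject stream.
    globals_data: bytes of the /JBIG2Globals stream, if the image has one.
    next_segment_number: assumption, must exceed every segment number
        already present in globals_data and data.
    """
    # Flags 0x01: sequential organisation, number of pages known (= 1).
    header = JBIG2_FILE_ID + b"\x01" + struct.pack(">I", 1)
    end_of_page = _empty_segment(next_segment_number, SEG_END_OF_PAGE, page=1)
    end_of_file = _empty_segment(next_segment_number + 1, SEG_END_OF_FILE, page=0)
    return header + globals_data + data + end_of_page + end_of_file


def decode_with_jbig2dec(jbig2_file: bytes) -> bytes:
    """Run the external jbig2dec binary and return PNG bytes.

    Assumes jbig2dec is installed and on PATH; the flags are unverified.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "image.jb2"
        dst = Path(tmp) / "image.png"
        src.write_bytes(jbig2_file)
        subprocess.run(
            ["jbig2dec", "-t", "png", "-o", str(dst), str(src)], check=True
        )
        return dst.read_bytes()
```

The resulting PNG bytes could then be opened with Pillow like any other image, which would keep the PIL.Image.Image assumption intact at the cost of an external AGPL-licensed tool. Masks and colour inversion are not handled here.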

Apr 14 '24 19:04 stefan6419846