
Failure to extract image as Pillow image ("Not enough image data")

Open lisch opened this issue 2 years ago • 2 comments

lisch avatar Jul 21 '22 20:07 lisch

I can't upload the Python script as a .py file, so I added a .txt extension. Running the script against the indicated file produces the traceback shown below.

$ ./test.py pdfreader-fail-1.pdf 
ERROR:root:Skipping broken stream
Traceback (most recent call last):
  File "/home/rhlisch/.local/lib/python3.7/site-packages/pdfreader/filters/lzw.py", line 29, in decode
    data = decompress(data)
  File "/home/rhlisch/.local/lib/python3.7/site-packages/pdfreader/filters/lzw.py", line 44, in decompress
    return decoder.decodefrombytes(compressed_bytes)
  File "/home/rhlisch/.local/lib/python3.7/site-packages/pdfreader/filters/lzw.py", line 72, in decodefrombytes
    clearbytes = self._decoder.decode(codepoints)
  File "/home/rhlisch/.local/lib/python3.7/site-packages/pdfreader/filters/lzw.py", line 199, in decode
    decoded += self._decode_codepoint(cp)
  File "/home/rhlisch/.local/lib/python3.7/site-packages/pdfreader/filters/lzw.py", line 227, in _decode_codepoint
    raise ValueError("End of information code not supported directly by this Decoder")
ValueError: End of information code not supported directly by this Decoder
Traceback (most recent call last):
  File "./test.py", line 9, in <module>
    image = viewer.canvas.images[name].to_Pillow()
  File "/home/rhlisch/.local/lib/python3.7/site-packages/pdfreader/pillow.py", line 82, in to_Pillow
    img = Image.frombytes(cs, size, bytes(self.filtered))
  File "/home/rhlisch/.local/lib/python3.7/site-packages/PIL/Image.py", line 2843, in frombytes
    im.frombytes(data, decoder_name, args)
  File "/home/rhlisch/.local/lib/python3.7/site-packages/PIL/Image.py", line 798, in frombytes
    raise ValueError("not enough image data")
ValueError: not enough image data

lisch avatar Jul 21 '22 20:07 lisch

@lisch The image is LZW-encoded, and the LZW decoder fails here: https://github.com/maxpmaxp/pdfreader/blob/30818a2083b22624310fa83eb0101aefea60741c/pdfreader/filters/lzw.py#L227

Support for the END_OF_INFO_CODE symbol needs to be added. Feel free to contribute :)
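For anyone picking this up, here is a rough standalone sketch of how an LZW decoder can treat the end-of-information code as a normal stop condition instead of raising. It is not pdfreader's code: the function name `lzw_decode` and the `early_change` parameter are made up for illustration, and the constants (clear-table code 256, end-of-data code 257, 9 to 12 bit codes packed most significant bit first) are assumed from the usual PDF LZWDecode conventions.

```python
CLEAR_TABLE = 256   # assumed PDF LZWDecode clear-table code
END_OF_INFO = 257   # assumed PDF LZWDecode end-of-data code


def lzw_decode(data: bytes, early_change: int = 1) -> bytes:
    """Hypothetical helper: decode an LZW stream, stopping at END_OF_INFO."""
    # Entries 0..255 are single bytes; 256 and 257 are placeholders
    # for the two control codes so that new entries start at 258.
    table = [bytes([i]) for i in range(256)] + [b"", b""]
    width = 9           # current code width in bits
    prev = None         # previously emitted byte string
    out = bytearray()
    bitbuf = 0          # unread bits, most significant bit first
    bitcount = 0

    for byte in data:
        bitbuf = (bitbuf << 8) | byte
        bitcount += 8
        while bitcount >= width:
            bitcount -= width
            code = bitbuf >> bitcount
            bitbuf &= (1 << bitcount) - 1

            if code == CLEAR_TABLE:
                # Reset the dictionary and go back to 9-bit codes.
                del table[258:]
                width = 9
                prev = None
                continue
            if code == END_OF_INFO:
                # Normal termination: ignore any padding bits that follow.
                return bytes(out)

            if prev is None:
                # First code after a clear must be a single-byte entry.
                if code > 255:
                    raise ValueError("invalid first LZW code: %d" % code)
                entry = table[code]
            elif code < len(table):
                entry = table[code]
                table.append(prev + entry[:1])
            elif code == len(table):
                # Code not in the table yet: previous string plus its first byte.
                entry = prev + prev[:1]
                table.append(entry)
            else:
                raise ValueError("invalid LZW code: %d" % code)

            out += entry
            prev = entry

            # Widen the code size (one entry early when early_change is 1).
            if width < 12 and len(table) + early_change >= (1 << width):
                width += 1

    # Stream ended without END_OF_INFO: return what was decoded so far.
    return bytes(out)
```

As a quick check, the five bytes `80 10 48 50 10` pack the 9-bit codes 256, 65, 66, 257, so `lzw_decode(bytes([0x80, 0x10, 0x48, 0x50, 0x10]))` yields `b"AB"`. Returning the partial output when the stream ends without an end-of-data code is a design choice in this sketch; a stricter decoder might raise instead.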

maxpmaxp avatar Aug 12 '22 00:08 maxpmaxp