pypdf Stream has ended unexpectedly error on certain PDF files

Open LunkRat opened this issue 10 years ago • 30 comments

We process dozens of PDF files per day in our automated script that uses PyPDF2 version 1.21 as part of its process. A few files have been failing with the error pasted below. I can provide the PDF file that is having this error, just let me know how you would like me to send it. Thanks!

PdfReadWarning: Invalid stream (index 0) within object 62 0: Stream has ended unexpectedly [pdf.py:1128]
Traceback (most recent call last):
  File "d:\scripts\mtx-coverpage\mtx-coverpage.py", line 99, in <module>
    addpage.write(outfile)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\merger.py", line 209, in write
    self.output.write(fileobj)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 277, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 365, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 341, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 365, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 341, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 350, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 365, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 341, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 379, in _sweepIndirectReferences
    newobj = self._sweepIndirectReferences(externMap, newobj)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 341, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 370, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 1149, in getObject
    retval = self._getObjectFromStream(indirectReference)
  File "D:\bin\Python27\lib\site-packages\PyPDF2\pdf.py", line 1131, in _getObjectFromStream
    raise utils.PdfReadError("Can't read object stream: %s"%e)
PyPDF2.utils.PdfReadError: Can't read object stream: Stream has ended unexpectedly

May 13 '14 15:05 LunkRat

The PDF file that triggered this error can be found here: https://drive.google.com/file/d/0B_P1mlgsZIJpRjNnQkxCenUzTkU/edit?usp=sharing

May 13 '14 18:05 LunkRat

Thank you for the detailed bug report! I will try to track down the issue soon, though I will be unavailable for the following week.

May 14 '14 22:05 mstamy2

Hello, Try passing strict = False to PdfFileReader(). I still need to track down exactly why the exception is thrown, but this should produce the output without error. (If not, let us know)

May 26 '14 21:05 mstamy2

Thanks for the response. I am using PdfFileMerger() in this script, and the error occurs on .write. How do I pass strict = False in this case, since I am not calling PdfFileReader() directly?

May 27 '14 18:05 LunkRat

PdfFileMerger() constructor takes a strict parameter as well. It should throw a warning instead of an exception when set to False.

May 27 '14 21:05 mstamy2

Thanks! I set strict = False in PdfFileMerger() (thought I had tried that already but must have been my error) and it solved our issue. I still get a warning, but the output file is written as expected so that's great. I'll let you close the issue when you feel that the underlying cause is resolved. Thanks again.

May 27 '14 21:05 LunkRat
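
A minimal sketch of the workaround confirmed above, assuming the PyPDF2 1.x API used in this thread (`PdfFileMerger` and its `strict` keyword). The function name `merge_pdfs_leniently` and its parameters are illustrative and not part of PyPDF2. Later comments in this thread report files for which this setting does not help.

```python
def merge_pdfs_leniently(input_paths, output_path):
    """Merge the given PDFs with strict checking turned off."""
    # Imported lazily so this module loads even where PyPDF2 is missing.
    # Assumes the PyPDF2 1.x API discussed in this thread.
    from PyPDF2 import PdfFileMerger

    merger = PdfFileMerger(strict=False)  # warn instead of raising
    handles = []
    try:
        for path in input_paths:
            handle = open(path, "rb")
            handles.append(handle)
            merger.append(handle)
        with open(output_path, "wb") as out:
            merger.write(out)
    finally:
        merger.close()
        for handle in handles:
            handle.close()
```

A call such as `merge_pdfs_leniently(["cover.pdf", "report.pdf"], "combined.pdf")` (hypothetical file names) may still print a PdfReadWarning while writing the output file.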

Hello,

I have been recently getting a similar error.

can you please post me an example on how/where to implement the fix?

Thank you!

May 30 '14 17:05 andrewstolarz

Hello,

Just to give an update: what I did was manually edit the pdf.py file to set strict = False. (I was hoping not to do it this way, as I don't want to run into issues later on when I upgrade.)

However, after running the script again with strict set to False, it splits the PDFs with no problem, but it still returns an error:

PdfReadWarning: Invalid stream (index 77) within object 1444 0: Stream has ended unexpectedly [pdf.py:1162]
PdfReadWarning: Invalid stream (index 62) within object 2696 0: Stream has ended unexpectedly [pdf.py:1162]

Any ideas?

Jun 02 '14 15:06 andrewstolarz

Well, you don't have to change pdf.py in order to set strict to False. You can set the value of strict when you first create your PdfFileMerger() or PdfFileReader() object in its constructor, and it defaults to true if you don't specify a value. To specify False, use input = PdfFileReader([your file], strict = False)

When in strict mode, PyPDF2 quits when encountering this stream error and throws a PdfReadError. When strict is False, it ignores this error but instead gives a warning like you saw (then continues with rest of program as normal).

Ignoring the error doesn't seem to harm the output in any way (as you noticed), so we need to investigate why the error is thrown at all (maybe PyPDF2 is too strict on slightly 'irregular' PDFs?). Or maybe the error is significant but the output PDFs haven't displayed any symptoms?

Hope that made a little sense.

Jun 02 '14 20:06 mstamy2
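
The strict/lenient behaviour described above can be sketched with the standard library alone. This is a toy model, not PyPDF2's code: `PdfReadError`, `PdfReadWarning`, and `load_object` are stand-ins defined here, and the "declared length" check is an assumed simplification of whatever makes a real stream end early. It follows the sequence visible in the first traceback, where the warning is printed and the exception is raised afterwards.

```python
import warnings


class PdfReadError(Exception):
    """Stand-in for PyPDF2.utils.PdfReadError (not the real class)."""


class PdfReadWarning(UserWarning):
    """Stand-in for PyPDF2.utils.PdfReadWarning (not the real class)."""


def load_object(data, declared_length, strict=True):
    """Return data, or None if it is too short and strict is off."""
    if len(data) >= declared_length:
        return data
    reason = "Stream has ended unexpectedly"
    # The warning is emitted in both modes.
    warnings.warn("Invalid stream: %s" % reason, PdfReadWarning)
    if strict:
        raise PdfReadError("Can't read object stream: %s" % reason)
    return None  # lenient mode: carry on with a placeholder
```

In this model, `load_object(b"abc", 10, strict=False)` warns and returns `None`, while the same call with `strict=True` raises `PdfReadError`.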

Maybe that works for PdfFileReader, but not with PdfFileMerger.

I tried merger = PdfFileMerger(strict = False)

and also merger.append(PdfFileReader(open(os.path.join(files_dir, f), "rb"), strict = False)), but it gives the same problem as before:

  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 405, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "/usr/lib/python2.7/site-packages/PyPDF2/pdf.py", line 386, in _sweepIndirectReferences
    data[key] = value
  File "/usr/lib/python2.7/site-packages/PyPDF2/generic.py", line 487, in __setitem__
    if not isinstance(key, PdfObject):
RuntimeError: maximum recursion depth exceeded in __instancecheck__

I also set

    def __init__(self, stream, strict=False, warndest = None, overwriteWarnings = True):

at line 891 of pdf.py, and that doesn't work either. Any solution?

Jan 29 '15 15:01 cryptid11

Try merging your PDFs by using the 'append' and 'merge' functionality of PyPDF2 instead.

I faced the same issue and following approach worked for me -

from PyPDF2 import PdfFileMerger

merger = PdfFileMerger()

input1 = open("file1.pdf", "rb")
input2 = open("file2.pdf", "rb")


# add the first 3 pages of first file to output
merger.append(fileobj = input1, pages = (0,3))

# insert the first page of second file into the output beginning after the second page
merger.merge(position = 2, fileobj = input2, pages = (0,1))

# Write to an output PDF document
output = open("document-output.pdf", "wb")
merger.write(output)

Remove the 'pages' argument in 'append' and 'merge' functions to merge files instead of specific pages.

Jul 19 '18 14:07 appurwar

I just started to experience this issue when calling PdfFileReader. I haven't changed anything in the code, maybe a windows update? None of the above suggestions to set strict = False seem to help. I have to go in and comment out the file work inside _showwarning and pass on the function to get anywhere.

The only difference between my case and the above is that it only happens after I've run the code once. I'm calling this from within ArcGIS (mapping software). I have to close the software and re-open it to get the first successful run. This seems to indicate that something is being held onto after the first run, but again, it just started happening. I realize this probably doesn't help you move towards a fix; I'm just reporting to up the user count for this.

Edit - "fix": Despite setting strict to False and overwriteWarnings to False, I'd still get the error. I found I can get around the error by resetting Python's built-in warnings to the original stderr.

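import sys
import warnings

# Clear any warning filters installed earlier in the session
warnings.resetwarnings()
# Point the stderr seen by the warnings module back at the interpreter's original
warnings.sys.stderr = sys.__stderr__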

Aug 09 '18 13:08 khibma
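
The comment above suggests that a warning hook installed during the first run is still in place on the second. One way to contain that, sketched here with the standard library only, is to save `warnings.showwarning` before the PDF work and put it back afterwards. The helper name `preserved_showwarning` is hypothetical, and whether this resolves the ArcGIS case is an assumption that has not been tested.

```python
import contextlib
import warnings


@contextlib.contextmanager
def preserved_showwarning():
    """Restore warnings.showwarning on exit, whatever the body set it to."""
    original = warnings.showwarning
    try:
        yield
    finally:
        warnings.showwarning = original
```

The PDF-reading code would then run inside `with preserved_showwarning():`, so any replacement hook is dropped again when the block ends.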

@appurwar the error is returned no matter whether append or merge is used. The problem here seems to be the format of the PDF that is being appended, so it's not PyPDF2's fault. A sensible workaround seems to be to reformat the PDF in some other way before passing it to PyPDF2 once this is detected.

strict=False doesn't fix this either, I came here after the error happened with strict=False on.

Nov 11 '19 14:11 reportgunner

I'm closing this issue now as it seems to be mostly about using strict=False which is the current default. Let me know if you still have this issue (with a full Traceback + example code ... and a PDF if possible)

Jun 26 '22 09:06 MartinThoma

Thanks @mstamy2. Passing strict=False when reading the PDF with PdfFileReader() works great, and the rewritten or merged file isn't harmed. However, if some earlier processing of the PDF file has affected its structure, the same error occurs even with strict=False. That is not a problem with this package.

Jul 06 '22 07:07 puri-gagan

Hello, I have this error with PdfFileReader() as well, and I'm already using strict=False. Any help?

Traceback (most recent call last):
  File "/New Volume/projects/Files/PDF/scrap.py", line 30, in <module>
    extracted_data=extracted_data+(pdfReader.getPage(z).extractText().splitlines())
  File "/.local/lib/python3.8/site-packages/PyPDF2/_page.py", line 1045, in extractText
    return self.extract_text(Tj_sep=Tj_sep, TJ_sep=TJ_sep)
  File "/.local/lib/python3.8/site-packages/PyPDF2/_page.py", line 968, in extract_text
    content = ContentStream(content, self.pdf)
  File "/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1088, in __init__
    self.__parseContentStream(stream)
  File "/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1119, in __parseContentStream
    operands.append(read_object(stream, None))
  File "/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1228, in read_object
    return readStringFromStream(stream)
  File "/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 382, in readStringFromStream
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

Aug 10 '22 12:08 Eslafif

Which version of PyPDF2 do you use?

Aug 10 '22 12:08 MartinThoma

Version 2.0.0

Aug 10 '22 12:08 Eslafif

@Eslafif, can you please upgrade to the latest version to confirm the error is still present? If so, can you provide the PDF file and specify on which page you are getting the issue?

Aug 10 '22 13:08 pubpub-zz

Updated, and the same error still exists.

Aug 10 '22 14:08 Eslafif

Updated, and the same error still exists.

And can you provide the PDF file and the page? Without them, no analysis can be done.

Aug 10 '22 14:08 pubpub-zz

RBL BANK.pdf

This is the page that gives the error.

Aug 10 '22 16:08 Eslafif

@Eslafif I've tried the following code with your file successfully:

import PyPDF2
p = PyPDF2.PdfReader("c:/RBL.BANK.pdf")
p.pages[0].extract_text()

Can you confirm that you are getting the same results?

Aug 10 '22 18:08 pubpub-zz

Tested, and it gives the same error.

Aug 10 '22 19:08 Eslafif

Can you share the output, please?

Aug 10 '22 19:08 pubpub-zz

Traceback (most recent call last):
  File "/media/New Volume/projects/bank statement/Banks statements/test.py", line 11, in <module>
    extracted_data=pdfReader.pages[17].extract_text()
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/_page.py", line 968, in extract_text
    content = ContentStream(content, self.pdf)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1088, in __init__
    self.__parseContentStream(stream)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1119, in __parseContentStream
    operands.append(read_object(stream, None))
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1228, in read_object
    return readStringFromStream(stream)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 382, in readStringFromStream
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

Aug 10 '22 19:08 Eslafif

You are not using my code with the file you've provided. Can you tell me what the result is with my program, please?

Aug 10 '22 19:08 pubpub-zz

Traceback (most recent call last):
  File "/media/New Volume/projects/bank statement/Banks statements/test.py", line 7, in <module>
    p.pages[17].extract_text()
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/_page.py", line 968, in extract_text
    content = ContentStream(content, self.pdf)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1088, in __init__
    self.__parseContentStream(stream)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1119, in __parseContentStream
    operands.append(read_object(stream, None))
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 1228, in read_object
    return readStringFromStream(stream)
  File "/home/.local/lib/python3.8/site-packages/PyPDF2/generic.py", line 382, in readStringFromStream
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

Aug 10 '22 19:08 Eslafif

This is with your code.

The difference is that the file is big, so I only attached the page with the problem.

Aug 10 '22 19:08 Eslafif

@Eslafif, when you extracted the page, the error in the PDF was fixed. Can you confirm this assumption by testing the code on your "small" file?

Meanwhile, looking at #454, I may have found a fix. As a patch, can you modify generic.py at line 495:

                if tok.isdigit():
                    # "The number ddd may consist of one, two, or three
                    # octal digits; high-order overflow shall be ignored.
                    # Three octal digits shall be used, with leading zeros
                    # as needed, if the next character of the string is also
                    # a digit." (PDF reference 7.3.4.2, p 16)
                    for _ in range(2):
                        ntok = stream.read(1)
                        if ntok.isdigit():
                            tok += ntok
                        else:
                            stream.seek(-1, 1)  # <--- to be added
                            break
                    tok = b_(chr(int(tok, base=8)))

I would like to confirm the fix before opening the PR.

Aug 11 '22 10:08 pubpub-zz
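
To see why the added seek matters, here is a simplified, self-contained sketch of octal-escape handling in a PDF literal string. It is not PyPDF2's `readStringFromStream`: the function `read_literal_string` and its `rewind` switch are hypothetical, nested parentheses and the other backslash escapes are ignored, and the end-of-input guard on the rewind is an addition of this sketch. With the rewind off, the byte read after a short escape such as `\0` is swallowed. If that byte is the closing parenthesis, the parser runs to the end of the input, which matches the error reported in this thread.

```python
import io


def read_literal_string(stream, rewind=True):
    """Read a PDF literal string body, starting just after the opening '('."""
    out = bytearray()
    while True:
        tok = stream.read(1)
        if not tok:
            raise EOFError("Stream has ended unexpectedly")
        if tok == b")":
            return bytes(out)
        if tok == b"\\":
            tok = stream.read(1)
            if tok.isdigit():
                # Up to two more octal digits may follow the first one.
                for _ in range(2):
                    ntok = stream.read(1)
                    if ntok.isdigit():
                        tok += ntok
                    else:
                        # Put the non-digit back so the main loop sees it.
                        if ntok and rewind:
                            stream.seek(-1, 1)
                        break
                # High-order overflow is ignored, hence the mask.
                tok = bytes([int(tok, 8) & 0xFF])
        out += tok
```

For example, `read_literal_string(io.BytesIO(b"a\\0)"))` returns the two bytes `a` and NUL, while the same call with `rewind=False` raises `EOFError` because the closing parenthesis has been consumed as part of the escape.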
