Invalid Elementary Object starting with b'C'

Open pitmanm4 opened this issue 1 year ago • 9 comments

Saving a bookmarked PDF fails with the error "Invalid Elementary Object starting with b'C'". The code takes a PDF and a list of bookmark labels and attempts to create a new PDF that includes those bookmarks. The error appears after the bookmarks have been added, while the PDF is being written to the output file.

Environment

Python 3.8.13 PyPDF2 2.10.9

Code + PDF

This is a minimal, complete example that shows the issue:

    for page, bookmark in bookmarks_list:
        try:
            doc_new.add_bookmark(bookmark, page)
        except Exception as e:
            logger.error(f'Failed to add bookmark {bookmark} for page {page + 1}.')
            logger.exception(e)
            raise e

    logger.info(f'Saving {save_path}.')

    try:
        with open(save_path, 'wb') as f:
            doc_new.write(f)
    except PdfReadError as e:
        if not strict:
            logger.error(f'Failed to save {save_path}.')
            logger.exception(e)
            raise e

Traceback

    doc_new.write(f)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/_writer.py", line 838, in write
    self.write_stream(stream)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/_writer.py", line 811, in write_stream
    self._sweep_indirect_references(self._root)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/_writer.py", line 960, in _sweep_indirect_references
    data = self._resolve_indirect_object(data)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/_writer.py", line 1005, in _resolve_indirect_object
    real_obj = data.pdf.get_object(data)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/_reader.py", line 1179, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/generic/_data_structures.py", line 835, in read_object
    return ArrayObject.read_from_stream(stream, pdf, forced_encoding)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/generic/_data_structures.py", line 119, in read_from_stream
    arr.append(read_object(stream, pdf, forced_encoding))
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/generic/_data_structures.py", line 862, in read_object
    raise PdfReadError(
PyPDF2.errors.PdfReadError: Invalid Elementary Object starting with b'C'

TODO

pitmanm4 avatar Sep 21 '22 19:09 pitmanm4

The error is reported in the reader part. Can you provide the PDF you are "bookmarking"?

pubpub-zz avatar Sep 21 '22 19:09 pubpub-zz

The error is reported in the reader part. Can you provide the PDF you are "bookmarking"?

@pubpub-zz Hello, I work with pitmanm4. We can't provide the PDF due to confidentiality.

james811223ad avatar Sep 21 '22 19:09 james811223ad

To get some data, can you modify _data_structures.py at about line 862:

    else:
        # temporary for debug
        stream.read(-20)
        xtract = stream.read(80)
        raise PdfReadError(
            f"Invalid Elementary Object starting with {tok} @{stream.tell()} : {xtract.__repr__()}"  # type: ignore
        )

and provide the output? This should preserve confidentiality.

pubpub-zz avatar Sep 21 '22 19:09 pubpub-zz

@pubpub-zz I reproduced the error with the suggested code change. Here is the entire error message; hopefully it provides enough insight.

PdfReadError                              Traceback (most recent call last)
Input In [10], in add_bookmarks_to_pdf(pdf_path, bookmarks, rotations, save_path, strict)
     71     with open(save_path, 'wb') as f:
---> 72         doc_new.write(f)
     74 except PdfReadError as e:

File ~/SageMaker/PyPDF2/PyPDF2/_writer.py:841, in PdfWriter.write(self, stream)
    839     my_file = True
--> 841 self.write_stream(stream)
    843 if self.with_as_usage:

File ~/SageMaker/PyPDF2/PyPDF2/_writer.py:814, in PdfWriter.write_stream(self, stream)
    806 # PDF objects sometimes have circular references to their /Page objects
    807 # inside their object tree (for example, annotations).  Those will be
    808 # indirect references to objects that we've recreated in this PDF.  To
   (...)
    812 # trees to reference the correct new object location, rather than
    813 # copying in a new copy of the page object.
--> 814 self._sweep_indirect_references(self._root)
    816 object_positions = self._write_header(stream)

File ~/SageMaker/PyPDF2/PyPDF2/_writer.py:963, in PdfWriter._sweep_indirect_references(self, root)
    962 elif isinstance(data, IndirectObject):
--> 963     data = self._resolve_indirect_object(data)
    965     if str(data) not in discovered:

File ~/SageMaker/PyPDF2/PyPDF2/_writer.py:1008, in PdfWriter._resolve_indirect_object(self, data)
   1007 # Get real object indirect object
-> 1008 real_obj = data.pdf.get_object(data)
   1010 if real_obj is None:

File ~/SageMaker/PyPDF2/PyPDF2/_reader.py:1179, in PdfReader.get_object(self, indirect_reference)
   1178     assert generation == indirect_reference.generation
-> 1179 retval = read_object(self.stream, self)  # type: ignore
   1181 # override encryption is used for the /Encrypt dictionary

File ~/SageMaker/PyPDF2/PyPDF2/generic/_data_structures.py:835, in read_object(stream, pdf, forced_encoding)
    834 elif tok == b"[":
--> 835     return ArrayObject.read_from_stream(stream, pdf, forced_encoding)
    836 elif tok == b"t" or tok == b"f":

File ~/SageMaker/PyPDF2/PyPDF2/generic/_data_structures.py:119, in ArrayObject.read_from_stream(stream, pdf, forced_encoding)
    118     # read and append obj
--> 119     arr.append(read_object(stream, pdf, forced_encoding))
    120 return arr

File ~/SageMaker/PyPDF2/PyPDF2/generic/_data_structures.py:865, in read_object(stream, pdf, forced_encoding)
    864 xtract = stream.read(80)
--> 865 raise PdfReadError(
    866     f"Invalid Elementary Object starting with {tok} @{stream.tell()} : {xtract.__repr__()}"  # type: ignore
    867 )

PdfReadError: Invalid Elementary Object starting with b'C' @215640173 : b''

During handling of the above exception, another exception occurred:

PdfReadError                              Traceback (most recent call last)
Input In [11], in <cell line: 1>()
----> 1 add_bookmarks_to_pdf('0000384654.pdf', [], [], 'output.pdf', strict=True)

Input In [10], in add_bookmarks_to_pdf(pdf_path, bookmarks, rotations, save_path, strict)
     80         raise e
     82     logger.warning(f'Trying {fun_mes} with strict=False.')
---> 84     add_bookmarks_to_pdf(pdf_path, bookmarks, rotations, save_path, strict=False)
     86 except Exception as e:
     87     logger.error(f'Failed to save {save_path}.')

Input In [10], in add_bookmarks_to_pdf(pdf_path, bookmarks, rotations, save_path, strict)
     76     logger.error(f'Failed to save {save_path}.')
     78     logger.exception(e)
---> 80     raise e
     82 logger.warning(f'Trying {fun_mes} with strict=False.')
     84 add_bookmarks_to_pdf(pdf_path, bookmarks, rotations, save_path, strict=False)

Input In [10], in add_bookmarks_to_pdf(pdf_path, bookmarks, rotations, save_path, strict)
     70 try:
     71     with open(save_path, 'wb') as f:
---> 72         doc_new.write(f)
     74 except PdfReadError as e:
     75     if not strict:

File ~/SageMaker/PyPDF2/PyPDF2/_writer.py:841, in PdfWriter.write(self, stream)
    838     stream = FileIO(stream, "wb")
    839     my_file = True
--> 841 self.write_stream(stream)
    843 if self.with_as_usage:
    844     stream.close()

File ~/SageMaker/PyPDF2/PyPDF2/_writer.py:814, in PdfWriter.write_stream(self, stream)
    804     self._root = self._add_object(self._root_object)
    806 # PDF objects sometimes have circular references to their /Page objects
    807 # inside their object tree (for example, annotations).  Those will be
    808 # indirect references to objects that we've recreated in this PDF.  To
   (...)
    812 # trees to reference the correct new object location, rather than
    813 # copying in a new copy of the page object.
--> 814 self._sweep_indirect_references(self._root)
    816 object_positions = self._write_header(stream)
    817 xref_location = self._write_xref_table(stream, object_positions)

File ~/SageMaker/PyPDF2/PyPDF2/_writer.py:963, in PdfWriter._sweep_indirect_references(self, root)
    954         stack.append(
    955             (
    956                 value,
   (...)
    960             )
    961         )
    962 elif isinstance(data, IndirectObject):
--> 963     data = self._resolve_indirect_object(data)
    965     if str(data) not in discovered:
    966         discovered.append(str(data))

File ~/SageMaker/PyPDF2/PyPDF2/_writer.py:1008, in PdfWriter._resolve_indirect_object(self, data)
   1005     raise ValueError(f"I/O operation on closed file: {data.pdf.stream.name}")
   1007 # Get real object indirect object
-> 1008 real_obj = data.pdf.get_object(data)
   1010 if real_obj is None:
   1011     logger_warning(
   1012         f"Unable to resolve [{data.__class__.__name__}: {data}], "
   1013         "returning NullObject instead",
   1014         __name__,
   1015     )

File ~/SageMaker/PyPDF2/PyPDF2/_reader.py:1179, in PdfReader.get_object(self, indirect_reference)
   1177 if self.strict:
   1178     assert generation == indirect_reference.generation
-> 1179 retval = read_object(self.stream, self)  # type: ignore
   1181 # override encryption is used for the /Encrypt dictionary
   1182 if not self._override_encryption and self._encryption is not None:
   1183     # if we don't have the encryption key:

File ~/SageMaker/PyPDF2/PyPDF2/generic/_data_structures.py:835, in read_object(stream, pdf, forced_encoding)
    833         return read_hex_string_from_stream(stream, forced_encoding)
    834 elif tok == b"[":
--> 835     return ArrayObject.read_from_stream(stream, pdf, forced_encoding)
    836 elif tok == b"t" or tok == b"f":
    837     return BooleanObject.read_from_stream(stream)

File ~/SageMaker/PyPDF2/PyPDF2/generic/_data_structures.py:119, in ArrayObject.read_from_stream(stream, pdf, forced_encoding)
    117     stream.seek(-1, 1)
    118     # read and append obj
--> 119     arr.append(read_object(stream, pdf, forced_encoding))
    120 return arr

File ~/SageMaker/PyPDF2/PyPDF2/generic/_data_structures.py:865, in read_object(stream, pdf, forced_encoding)
    863 stream.read(-20)                          
    864 xtract = stream.read(80)
--> 865 raise PdfReadError(
    866     f"Invalid Elementary Object starting with {tok} @{stream.tell()} : {xtract.__repr__()}"  # type: ignore
    867 )

PdfReadError: Invalid Elementary Object starting with b'C' @215640173 : b''

pitmanm4 avatar Sep 21 '22 20:09 pitmanm4

Oops! My fault 🙄😳 (it is hard to provide code without being able to test it). I made a mistake in the code; it should use seek instead of read:

    else:
        # temporary for debug
        stream.seek(-20, 1)
        xtract = stream.read(80)
        raise PdfReadError(
            f"Invalid Elementary Object starting with {tok} @{stream.tell()} : {xtract.__repr__()}"  # type: ignore
        )

I'm only looking at the exception report.
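
The difference between the two debug snippets can be reproduced with a small in-memory stream. This is only a sketch: it uses `io.BytesIO` with made-up data as a stand-in for the real PDF stream, and other stream types may treat a negative read size differently. With `BytesIO`, a negative size reads to the end of the stream, which is consistent with the empty `b''` extract in the first report.

```python
import io

# 100 bytes of made-up data standing in for the PDF file.
stream = io.BytesIO(b"0123456789" * 10)

# Pretend the parser failed at offset 50.
stream.seek(50)

# First variant: a negative size makes BytesIO read everything up to EOF,
# so the cursor ends at the end of the stream and the next read is empty.
stream.read(-20)
buggy_extract = stream.read(80)

# Second variant: move back 20 bytes relative to the current position
# (whence=1), then read up to 80 bytes of context around the failure.
stream.seek(50)
stream.seek(-20, 1)
fixed_extract = stream.read(80)

print(buggy_extract)
print(len(fixed_extract), fixed_extract[:10])
```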

pubpub-zz avatar Sep 21 '22 20:09 pubpub-zz

@pubpub-zz Newest error message

PdfReadError: Invalid Elementary Object starting with b'C' @20493876 : b'ation /PANTONE 7455 C /DeviceRGB <<\n/FunctionType 2\n/Range [ 0 1 0 1 0 1 ]\n/C0 ['

pitmanm4 avatar Sep 21 '22 21:09 pitmanm4

You can see in the output that there is a /PANTONE, which is interpreted as a Name (starting with /) that ends at the following space. 7455 is then read as the next value, and the C after it is expected to start a new object, but it is not a character that can begin any object the parser knows how to interpret.
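
To illustrate the parsing problem, here is a rough tokenizer sketch. It is not PyPDF2 code: `rough_tokens` and `looks_like_object` are hypothetical helpers that only approximate the PDF rule that a name ends at whitespace or a delimiter. The input is modeled on the extract above, and the leading `[` is an assumption based on the traceback, which fails inside `ArrayObject.read_from_stream`.

```python
import re

# A name starts with "/" and runs until whitespace or a delimiter.
# Anything else is split into "<<", ">>", brackets, or a regular run.
TOKEN = re.compile(rb"/[^\s()<>\[\]{}/%]*|<<|>>|[\[\]]|[^\s()<>\[\]{}/%]+")

KEYWORDS = {b"true", b"false", b"null"}


def rough_tokens(fragment: bytes) -> list:
    """Split a PDF fragment into approximate tokens (illustration only)."""
    return TOKEN.findall(fragment)


def looks_like_object(token: bytes) -> bool:
    """True if the token could plausibly begin a PDF object."""
    return token[:1] in b"/[]<>(+-.0123456789" or token in KEYWORDS


tokens = rough_tokens(b"[/Separation /PANTONE 7455 C /DeviceRGB <<")
bad = [t for t in tokens if not looks_like_object(t)]
print(tokens)
print(bad)

# With the spaces written as #20, the whole colorant name is one token.
escaped = rough_tokens(b"[/Separation /PANTONE#207455#20C /DeviceRGB <<")
print(escaped)
```

The unescaped input splits into `/PANTONE`, `7455` and `C`, and only the stray `C` fails the check, which matches the byte reported in the error message.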

In such a case, I do not see how to cope with the file. If you can use a hexadecimal editor such as bless, you should be able to locate the /PANTONE and replace the / with _
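
For many files, a manual hex edit could in principle be scripted. The sketch below is a variant of the idea above and not a tested fix: instead of replacing the slash, it keeps it and replaces the spaces inside the name with underscores, so the name becomes a single token. The replacement has the same length as the original, so byte offsets in the file do not shift. The pattern `BAD_NAME` and the helper `patch_names` are assumptions based only on the one extract shown in this thread; other files may need a different pattern, and renaming a colorant may affect how the color is handled downstream.

```python
import re

# Assumption: the malformed names look like "/PANTONE <digits> <letters>",
# as in the extract above, and are followed by whitespace or a delimiter.
BAD_NAME = re.compile(rb"/PANTONE \d+ [A-Z]{1,2}(?=[\s/\[\]<>()])")


def patch_names(data: bytes) -> bytes:
    """Replace spaces inside matching names with underscores.

    Each replacement is exactly as long as the text it replaces, so
    offsets stored elsewhere in the file remain valid.
    """
    return BAD_NAME.sub(lambda m: m.group(0).replace(b" ", b"_"), data)


sample = b"[/Separation /PANTONE 7455 C /DeviceRGB <<"
patched = patch_names(sample)
print(patched)
print(len(sample) == len(patched))
```

For a real file, the bytes could be read with `pathlib.Path.read_bytes()` and the result written to a copy, never over the original, before trying the bookmark step again.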

pubpub-zz avatar Sep 21 '22 21:09 pubpub-zz

@james811223ad, should I consider my proposal a fix?

pubpub-zz avatar Sep 22 '22 11:09 pubpub-zz

@james811223ad, should I consider my proposal a fix?

I would say no. The reason is that we're processing lots of files, and we can't possibly do this every time this error comes up. We do appreciate you looking into it. @pubpub-zz

james811223ad avatar Sep 23 '22 02:09 james811223ad

I propose to close this issue. Feel free to ask to reopen it if you have more data.

pubpub-zz avatar Sep 27 '22 20:09 pubpub-zz