pypdf
Invalid Elementary Object starting with b'C'
I'm trying to save a bookmarked PDF, but the save fails with the error "Invalid Elementary Object starting with b'C'". The code takes a PDF and a list of bookmark labels and attempts to create a new PDF that includes those bookmarks. The error appears after the bookmarks have been added, when the writer attempts to save the file.
Environment
Python 3.8.13 PyPDF2 2.10.9
Code + PDF
This is a minimal, complete example that shows the issue:
for page, bookmark in bookmarks_list:
    try:
        doc_new.add_bookmark(bookmark, page)
    except Exception as e:
        logger.error(f'Failed to add bookmark {bookmark} for page {page + 1}.')
        logger.exception(e)
        raise e
logger.info(f'Saving {save_path}.')
try:
    with open(save_path, 'wb') as f:
        doc_new.write(f)
except PdfReadError as e:
    if not strict:
        logger.error(f'Failed to save {save_path}.')
        logger.exception(e)
        raise e
Traceback
    doc_new.write(f)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/_writer.py", line 838, in write
    self.write_stream(stream)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/_writer.py", line 811, in write_stream
    self._sweep_indirect_references(self._root)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/_writer.py", line 960, in _sweep_indirect_references
    data = self._resolve_indirect_object(data)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/_writer.py", line 1005, in _resolve_indirect_object
    real_obj = data.pdf.get_object(data)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/_reader.py", line 1179, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/generic/_data_structures.py", line 835, in read_object
    return ArrayObject.read_from_stream(stream, pdf, forced_encoding)
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/generic/_data_structures.py", line 119, in read_from_stream
    arr.append(read_object(stream, pdf, forced_encoding))
  File "/usr/local/lib/python3.8/site-packages/PyPDF2/generic/_data_structures.py", line 862, in read_object
    raise PdfReadError(
PyPDF2.errors.PdfReadError: Invalid Elementary Object starting with b'C'
The error is raised in the reader part: can you provide the PDF you are "bookmarking"?
@pubpub-zz Hello, I work with pitmanm4. We can't provide the PDF due to confidentiality.
In order to get some data, can you modify _data_structures.py at about line 862:
    else:
        # temporary for debug
        stream.read(-20)
        xtract = stream.read(80)
        raise PdfReadError(
            f"Invalid Elementary Object starting with {tok} @{stream.tell()} : {xtract.__repr__()}"  # type: ignore
        )
and provide the output? This should respect the confidentiality.
@pubpub-zz I reproduced the error with the suggested code change. Here is the entire error message; hopefully it provides enough insight.
PdfReadError Traceback (most recent call last)
Input In [10], in add_bookmarks_to_pdf(pdf_path, bookmarks, rotations, save_path, strict)
71 with open(save_path, 'wb') as f:
---> 72 doc_new.write(f)
74 except PdfReadError as e:
File ~/SageMaker/PyPDF2/PyPDF2/_writer.py:841, in PdfWriter.write(self, stream)
839 my_file = True
--> 841 self.write_stream(stream)
843 if self.with_as_usage:
File ~/SageMaker/PyPDF2/PyPDF2/_writer.py:814, in PdfWriter.write_stream(self, stream)
806 # PDF objects sometimes have circular references to their /Page objects
807 # inside their object tree (for example, annotations). Those will be
808 # indirect references to objects that we've recreated in this PDF. To
(...)
812 # trees to reference the correct new object location, rather than
813 # copying in a new copy of the page object.
--> 814 self._sweep_indirect_references(self._root)
816 object_positions = self._write_header(stream)
File ~/SageMaker/PyPDF2/PyPDF2/_writer.py:963, in PdfWriter._sweep_indirect_references(self, root)
962 elif isinstance(data, IndirectObject):
--> 963 data = self._resolve_indirect_object(data)
965 if str(data) not in discovered:
File ~/SageMaker/PyPDF2/PyPDF2/_writer.py:1008, in PdfWriter._resolve_indirect_object(self, data)
1007 # Get real object indirect object
-> 1008 real_obj = data.pdf.get_object(data)
1010 if real_obj is None:
File ~/SageMaker/PyPDF2/PyPDF2/_reader.py:1179, in PdfReader.get_object(self, indirect_reference)
1178 assert generation == indirect_reference.generation
-> 1179 retval = read_object(self.stream, self) # type: ignore
1181 # override encryption is used for the /Encrypt dictionary
File ~/SageMaker/PyPDF2/PyPDF2/generic/_data_structures.py:835, in read_object(stream, pdf, forced_encoding)
834 elif tok == b"[":
--> 835 return ArrayObject.read_from_stream(stream, pdf, forced_encoding)
836 elif tok == b"t" or tok == b"f":
File ~/SageMaker/PyPDF2/PyPDF2/generic/_data_structures.py:119, in ArrayObject.read_from_stream(stream, pdf, forced_encoding)
118 # read and append obj
--> 119 arr.append(read_object(stream, pdf, forced_encoding))
120 return arr
File ~/SageMaker/PyPDF2/PyPDF2/generic/_data_structures.py:865, in read_object(stream, pdf, forced_encoding)
864 xtract = stream.read(80)
--> 865 raise PdfReadError(
866 f"Invalid Elementary Object starting with {tok} @{stream.tell()} : {xtract.__repr__()}" # type: ignore
867 )
PdfReadError: Invalid Elementary Object starting with b'C' @215640173 : b''
During handling of the above exception, another exception occurred:
PdfReadError Traceback (most recent call last)
Input In [11], in <cell line: 1>()
----> 1 add_bookmarks_to_pdf('0000384654.pdf', [], [], 'output.pdf', strict=True)
Input In [10], in add_bookmarks_to_pdf(pdf_path, bookmarks, rotations, save_path, strict)
80 raise e
82 logger.warning(f'Trying {fun_mes} with strict=False.')
---> 84 add_bookmarks_to_pdf(pdf_path, bookmarks, rotations, save_path, strict=False)
86 except Exception as e:
87 logger.error(f'Failed to save {save_path}.')
Input In [10], in add_bookmarks_to_pdf(pdf_path, bookmarks, rotations, save_path, strict)
76 logger.error(f'Failed to save {save_path}.')
78 logger.exception(e)
---> 80 raise e
82 logger.warning(f'Trying {fun_mes} with strict=False.')
84 add_bookmarks_to_pdf(pdf_path, bookmarks, rotations, save_path, strict=False)
Input In [10], in add_bookmarks_to_pdf(pdf_path, bookmarks, rotations, save_path, strict)
70 try:
71 with open(save_path, 'wb') as f:
---> 72 doc_new.write(f)
74 except PdfReadError as e:
75 if not strict:
File ~/SageMaker/PyPDF2/PyPDF2/_writer.py:841, in PdfWriter.write(self, stream)
838 stream = FileIO(stream, "wb")
839 my_file = True
--> 841 self.write_stream(stream)
843 if self.with_as_usage:
844 stream.close()
File ~/SageMaker/PyPDF2/PyPDF2/_writer.py:814, in PdfWriter.write_stream(self, stream)
804 self._root = self._add_object(self._root_object)
806 # PDF objects sometimes have circular references to their /Page objects
807 # inside their object tree (for example, annotations). Those will be
808 # indirect references to objects that we've recreated in this PDF. To
(...)
812 # trees to reference the correct new object location, rather than
813 # copying in a new copy of the page object.
--> 814 self._sweep_indirect_references(self._root)
816 object_positions = self._write_header(stream)
817 xref_location = self._write_xref_table(stream, object_positions)
File ~/SageMaker/PyPDF2/PyPDF2/_writer.py:963, in PdfWriter._sweep_indirect_references(self, root)
954 stack.append(
955 (
956 value,
(...)
960 )
961 )
962 elif isinstance(data, IndirectObject):
--> 963 data = self._resolve_indirect_object(data)
965 if str(data) not in discovered:
966 discovered.append(str(data))
File ~/SageMaker/PyPDF2/PyPDF2/_writer.py:1008, in PdfWriter._resolve_indirect_object(self, data)
1005 raise ValueError(f"I/O operation on closed file: {data.pdf.stream.name}")
1007 # Get real object indirect object
-> 1008 real_obj = data.pdf.get_object(data)
1010 if real_obj is None:
1011 logger_warning(
1012 f"Unable to resolve [{data.__class__.__name__}: {data}], "
1013 "returning NullObject instead",
1014 __name__,
1015 )
File ~/SageMaker/PyPDF2/PyPDF2/_reader.py:1179, in PdfReader.get_object(self, indirect_reference)
1177 if self.strict:
1178 assert generation == indirect_reference.generation
-> 1179 retval = read_object(self.stream, self) # type: ignore
1181 # override encryption is used for the /Encrypt dictionary
1182 if not self._override_encryption and self._encryption is not None:
1183 # if we don't have the encryption key:
File ~/SageMaker/PyPDF2/PyPDF2/generic/_data_structures.py:835, in read_object(stream, pdf, forced_encoding)
833 return read_hex_string_from_stream(stream, forced_encoding)
834 elif tok == b"[":
--> 835 return ArrayObject.read_from_stream(stream, pdf, forced_encoding)
836 elif tok == b"t" or tok == b"f":
837 return BooleanObject.read_from_stream(stream)
File ~/SageMaker/PyPDF2/PyPDF2/generic/_data_structures.py:119, in ArrayObject.read_from_stream(stream, pdf, forced_encoding)
117 stream.seek(-1, 1)
118 # read and append obj
--> 119 arr.append(read_object(stream, pdf, forced_encoding))
120 return arr
File ~/SageMaker/PyPDF2/PyPDF2/generic/_data_structures.py:865, in read_object(stream, pdf, forced_encoding)
863 stream.read(-20)
864 xtract = stream.read(80)
--> 865 raise PdfReadError(
866 f"Invalid Elementary Object starting with {tok} @{stream.tell()} : {xtract.__repr__()}" # type: ignore
867 )
PdfReadError: Invalid Elementary Object starting with b'C' @215640173 : b''
Oops!!! My fault 🙄😳 (it is hard to provide code without being able to test it). I made a mistake in the code; it should be:
    else:
        # temporary for debug
        stream.seek(-20, 1)
        xtract = stream.read(80)
        raise PdfReadError(
            f"Invalid Elementary Object starting with {tok} @{stream.tell()} : {xtract.__repr__()}"  # type: ignore
        )
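For context, here is a minimal sketch of why the first debug snippet printed b'' and an offset at the end of the file. It assumes the reader's stream behaves like an in-memory io.BytesIO, where a negative size passed to read() means "read to EOF" rather than "step back"; the sample data and variable names are illustrative only.

```python
import io

data = b"0123456789" * 10          # 100 bytes of sample data
stream = io.BytesIO(data)

# First attempt: read(-20) does not move backwards. On a BytesIO a negative
# size reads everything up to EOF, so the following read(80) has nothing left.
stream.seek(50)
stream.read(-20)
after_read = stream.read(80)       # empty, and the position is now at EOF
eof_position = stream.tell()

# Corrected attempt: seek(-20, 1) moves 20 bytes back from the current
# position (whence=1 is io.SEEK_CUR), so read(80) returns surrounding context.
stream.seek(50)
stream.seek(-20, io.SEEK_CUR)
context = stream.read(80)          # bytes from offset 30 up to EOF
```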
I'm only looking at the exception report.
@pubpub-zz Here is the newest error message:
PdfReadError: Invalid Elementary Object starting with b'C' @20493876 : b'ation /PANTONE 7455 C /DeviceRGB <<\n/FunctionType 2\n/Range [ 0 1 0 1 0 1 ]\n/C0 ['
You can see in the output that there is a /PANTONE, which is interpreted as a Name (starting with /). A name ends at the first whitespace, so the space stops the interpretation of the name. 7455 is then read as the next object, and after that the C starts with neither / nor any other character that can be interpreted as the beginning of an object.
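To illustrate the tokenization described here, below is a small standalone sketch (not PyPDF2's actual parser) of how a PDF name token is delimited. The helper read_name and the sample byte strings are hypothetical. In the PDF syntax, a space inside a name has to be written as the escape #20, which is what a well-formed producer would have emitted for this spot color.

```python
import re

# Whitespace and delimiter characters that terminate a PDF name token.
NAME_END = re.compile(rb"[\x00\t\n\x0c\r ()<>\[\]{}/%]")
# "#xx" is a two-digit hexadecimal escape inside a name, e.g. "#20" = space.
HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def read_name(data: bytes, pos: int = 0):
    """Return (decoded_name, next_pos) for the name token starting at pos."""
    if data[pos:pos + 1] != b"/":
        raise ValueError("a name must start with '/'")
    match = NAME_END.search(data, pos + 1)
    end = match.start() if match else len(data)
    raw = data[pos + 1:end]
    decoded = HEX_ESCAPE.sub(lambda h: bytes([int(h.group(1), 16)]), raw)
    return decoded, end


broken = b"/PANTONE 7455 C /DeviceRGB"       # as found in the problem file
escaped = b"/PANTONE#207455#20C /DeviceRGB"  # spaces escaped as #20

broken_name, broken_end = read_name(broken)
escaped_name, escaped_end = read_name(escaped)
```

With the unescaped form, the name stops at PANTONE and the parser is left with 7455 and then C as separate tokens; with the escaped form, the whole spot color name is read as one object.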
In such a case, I do not see how to cope with such a file. If you can use a hexadecimal editor such as bless, you should be able to locate the /PANTONE and replace the / with _.
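If the manual edit has to be repeated, it could in principle be scripted. The sketch below is a variant of the hex-editor idea, not something PyPDF2 provides: instead of touching the leading /, it replaces the spaces inside the name with underscores, which keeps the file length unchanged so that byte offsets in the cross-reference table stay valid. The pattern BROKEN_NAME and the function patch_pantone_names are hypothetical, and the pattern assumes the broken names always look like "/PANTONE", a number, and a short uppercase suffix. It does not distinguish names from identical byte sequences inside strings or stream data, so it should only be applied to copies of the files.

```python
import re

# Assumed shape of the broken names, e.g. b"/PANTONE 7455 C": single spaces
# between the parts, followed by whitespace or a PDF delimiter.
BROKEN_NAME = re.compile(rb"/PANTONE \d+ [A-Z]{1,3}(?=[\s/\[\]<>()])")


def patch_pantone_names(data: bytes) -> bytes:
    """Replace spaces inside matched names with underscores (same length)."""
    return BROKEN_NAME.sub(lambda m: m.group(0).replace(b" ", b"_"), data)


sample = b"[/Separation /PANTONE 7455 C /DeviceRGB <<\n/FunctionType 2"
patched = patch_pantone_names(sample)
```

The patched bytes could then be passed to the reader through an in-memory stream or written to a new file before adding the bookmarks.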
@james811223ad, should I consider my proposal a fix?
I would say no. The reason is that we're processing lots of files, and we can't possibly do this every time this error comes up. We do appreciate you looking into it. @pubpub-zz
I propose to close this issue. Feel free to ask to reopen it if you have more data.