fix: refuse to execute circular references to streams
Pull request
- When invoking Form XObjects, do not simply duplicate the interpreter but create a sub-interpreter that keeps track of the object IDs of the calling content stream.
- When executing content streams, do not execute content streams from any parent interpreter.
Fixes: #1113
How Has This Been Tested?
With the world renouned "evil xobjects" PDF, and a bestiary of other evil PDFs
Checklist
- [x] I have read CONTRIBUTING.md.
- [x] I have added a concise human-readable description of the change to CHANGELOG.md.
- [x] I have tested that this fix is effective or that this feature works.
- [x] I have added docstrings to newly created methods and classes.
- [x] I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.
When can we get this released?
This is a bug which results in an infinite loop, which may completely break systems if there is a malformed pdf.
@pietermarsman
Can you please look about merging in a patch to fix this bug into master soon.
If a system encounters a pdf with circular streams then it results in infinite loop, causing any system that uses pdfminer to be very fragile.
A single PDF with a circular reference will cause the pdfminer code to hang indefinitely. This is a DOS attack vulnerability.
Question about streams without object IDs
I noticed that in the execute() method (lines 1261-1263), streams without object IDs are being skipped:
if stream.objid is None:
# Inline streams without object IDs can't be tracked for circular refs
continue
This means inline streams or streams that haven't been assigned an object ID won't be executed at all. While these streams can't participate in circular reference detection, they should still be processed normally.
Questions:
- Was this intentional?
- Have you tested this with PDFs containing inline content streams to ensure they're still processed correctly?
Potential fix: The code should only skip streams that are actually circular references, not all streams without object IDs:
for obj in streams:
stream = stream_value(obj)
if stream.objid is not None:
if stream.objid in self.parent_stream_ids:
log.warning(
"Refusing to execute circular reference to content stream %d",
stream.objid,
)
continue # Skip ONLY circular references
self.stream_ids.add(stream.objid)
valid_streams.append(stream) # Always add non-circular streams
Let me know your thoughts!
Hi! Deleted my previous comment because now I see what's going on here.
- Was this intentional?
No, but it's not actually a problem, because inline content streams aren't a thing in PDF.
pdfminer.six does have PDFStream objects which correspond to inline images, because this is a convenient way to represent inline and non-inline (i.e. XObject) images with the same code. But these aren't content streams, just image data, so they will never contain PDF operators that can be executed.
So in fact if execute is called on an inline image it shouldn't just be ignored, it should actually be an error of some sort.
- Have you tested this with PDFs containing inline content streams to ensure they're still processed correctly?
That, I haven't done! I think there are some inline image related samples in pdfminer.six, so just in case, we should test them with image extraction to make sure we get the expected results.
- Have you tested this with PDFs containing inline content streams to ensure they're still processed correctly?
That, I haven't done! I think there are some inline image related samples in pdfminer.six, so just in case, we should test them with image extraction to make sure we get the expected results.
Ok, confirmed now that pdf2txt.py --output-dir images samples/contrib/issue-1008-inline-ascii85.pdf gives the same number of output image files with the same size.
I've added a warning to the effect above, that if something tries to execute a stream that isn't an indirect object, this is probably a bug or a corrupt file, but perhaps you might even want to make it an assert.