pdfminer.six icon indicating copy to clipboard operation
pdfminer.six copied to clipboard

fix: refuse to execute circular references to streams

Open dhdaines opened this issue 6 months ago • 2 comments

Pull request

  1. When invoking Form XObjects, do not simply duplicate the interpreter but create a sub-interpreter that keeps track of the object IDs of the calling content stream.
  2. When executing content streams, do not execute content streams from any parent interpreter.

Fixes: #1113

How Has This Been Tested?

With the world renouned "evil xobjects" PDF, and a bestiary of other evil PDFs

Checklist

  • [x] I have read CONTRIBUTING.md.
  • [x] I have added a concise human-readable description of the change to CHANGELOG.md.
  • [x] I have tested that this fix is effective or that this feature works.
  • [x] I have added docstrings to newly created methods and classes.
  • [x] I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

dhdaines avatar Jun 25 '25 15:06 dhdaines

When can we get this released?

This is a bug which results in an infinite loop, which may completely break systems if there is a malformed pdf.

cmyers009 avatar Jul 23 '25 19:07 cmyers009

@pietermarsman

Can you please look about merging in a patch to fix this bug into master soon.

If a system encounters a pdf with circular streams then it results in infinite loop, causing any system that uses pdfminer to be very fragile.

A single PDF with a circular reference will cause the pdfminer code to hang indefinitely. This is a DOS attack vulnerability.

cmyers009 avatar Nov 10 '25 15:11 cmyers009

Question about streams without object IDs

I noticed that in the execute() method (lines 1261-1263), streams without object IDs are being skipped:

if stream.objid is None:
    # Inline streams without object IDs can't be tracked for circular refs
    continue

This means inline streams or streams that haven't been assigned an object ID won't be executed at all. While these streams can't participate in circular reference detection, they should still be processed normally.

Questions:

  1. Was this intentional?
  2. Have you tested this with PDFs containing inline content streams to ensure they're still processed correctly?

Potential fix: The code should only skip streams that are actually circular references, not all streams without object IDs:

for obj in streams:
    stream = stream_value(obj)
    if stream.objid is not None:
        if stream.objid in self.parent_stream_ids:
            log.warning(
                "Refusing to execute circular reference to content stream %d",
                stream.objid,
            )
            continue  # Skip ONLY circular references
        self.stream_ids.add(stream.objid)
    valid_streams.append(stream)  # Always add non-circular streams

Let me know your thoughts!

pietermarsman avatar Dec 14 '25 15:12 pietermarsman

Hi! Deleted my previous comment because now I see what's going on here.

  1. Was this intentional?

No, but it's not actually a problem, because inline content streams aren't a thing in PDF.

pdfminer.six does have PDFStream objects which correspond to inline images, because this is a convenient way to represent inline and non-inline (i.e. XObject) images with the same code. But these aren't content streams, just image data, so they will never contain PDF operators that can be executed.

So in fact if execute is called on an inline image it shouldn't just be ignored, it should actually be an error of some sort.

  1. Have you tested this with PDFs containing inline content streams to ensure they're still processed correctly?

That, I haven't done! I think there are some inline image related samples in pdfminer.six, so just in case, we should test them with image extraction to make sure we get the expected results.

dhdaines avatar Dec 15 '25 22:12 dhdaines

  1. Have you tested this with PDFs containing inline content streams to ensure they're still processed correctly?

That, I haven't done! I think there are some inline image related samples in pdfminer.six, so just in case, we should test them with image extraction to make sure we get the expected results.

Ok, confirmed now that pdf2txt.py --output-dir images samples/contrib/issue-1008-inline-ascii85.pdf gives the same number of output image files with the same size.

I've added a warning to the effect above, that if something tries to execute a stream that isn't an indirect object, this is probably a bug or a corrupt file, but perhaps you might even want to make it an assert.

dhdaines avatar Dec 15 '25 22:12 dhdaines