
Unambiguous representation and processing of inline images is impossible pre-PDF 2.0

Open dhdaines opened this issue 8 months ago • 7 comments

When inline image data is not encoded with an ASCII filter, or encoded with ASCII85Decode, it will frequently contain the two-byte sequence EI. This makes it difficult for a conforming reader to reliably determine the extent of an inline image stream, particularly since there are some imprecisions in the standard.

Since ID and EI are operators and therefore subject to the lexical conventions of name objects, they must be surrounded by whitespace or delimiter characters. But non-ASCII-encoded image data can obviously also contain whitespace and delimiter characters, and all of the delimiter characters are valid in ASCII85 encoding. This is partially addressed by the standard in section 8.9.7 paragraph 4, for the starting ID operator: ID must be followed by one and only one whitespace character, unless the final filter (that is, the first name in the Filter list...) is ASCII85Decode or ASCIIHexDecode, since these filters ignore whitespace.

However, for the ending EI operator, there is a major ambiguity in section 8.9.7. In the case of ASCIIHexDecode there is no problem since EI is not a valid hex sequence. For ASCII85Decode there is also no ambiguity in PDF 2.0 since, even though the two-byte sequence EI frequently occurs in ASCII85 encoding, the ~> terminator is now required, so in practice the end of an inline image can be found with the regular expression ~>\s*EI (note that in real world implementations, a fallback to \<EI is necessary in case ~> is absent).
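To illustrate, a minimal sketch of the regex-based end-of-image scan described above, with the real-world fallback for a missing ~> terminator (the function name and fallback pattern are hypothetical, not from the standard):

```python
import re

# End of an ASCII85-encoded inline image: the "~>" EOD marker,
# optionally followed by whitespace, then the EI operator.
END_A85 = re.compile(rb"~>\s*EI")
# Heuristic fallback when "~>" is absent: whitespace-delimited EI.
FALLBACK = re.compile(rb"\sEI\b")

def find_inline_image_end(data: bytes, start: int) -> int:
    """Return the offset just past EI, or -1 if not found."""
    m = END_A85.search(data, start)
    if m is None:
        m = FALLBACK.search(data, start)  # heuristic, may false-match
    return m.end() if m else -1
```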

But in the case of unencoded image data:

  • the data can (and will) contain the two-byte sequence EI, possibly preceded or followed by one or more whitespace or delimiter characters which are part of the image data, but..
  • section 8.9.7 paragraph 6 implies that the ending EI is delimited by whitespace, and...
  • PDF syntax conventions (section 7.2.3, section 7.8.3) imply that the EI operator is delimited by whitespace or delimiter characters

This means that it is impossible to unambiguously represent inline image data without ASCII encoding it, and impossible to process non-ASCII encoded inline images in content streams without resorting to heuristics and possible (though potentially improbable) data loss.

The problem is solved by the Length / L key in PDF 2.0 (and it seems that the whitespace before EI becomes optional, according to section 8.9.7 paragraph 8?), but the standard should include a recommended method for handling inline images without Length, including the notes above about ASCIIHexDecode and ASCII85Decode as well as guidance for implementations that create inline images that may be interpreted by readers conforming to previous revisions of PDF. I would humbly suggest that this guidance should be to always ASCII-encode inline images (and to always terminate ASCII85 data with ~>).
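The suggested guidance could be sketched as follows (a hypothetical helper, assuming Python's `base64.a85encode`, whose output PDF's ASCII85Decode accepts when the ~> terminator is appended):

```python
import base64

def ascii85_inline_image_data(raw: bytes) -> bytes:
    """Always ASCII85-encode inline image data and terminate it with
    "~>", so the end of the image can be found unambiguously even by
    readers conforming to revisions before PDF 2.0."""
    return base64.a85encode(raw) + b"~>"
```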

dhdaines avatar Apr 28 '25 21:04 dhdaines

Note also that if the guidance is to ASCII-encode inline images, then this comment in section 7.4.1 should also be revised:

ASCII filters serve no useful purpose in a PDF file that is encrypted; see 7.6, "Encryption".

dhdaines avatar Apr 29 '25 11:04 dhdaines

Definitely a known issue, and "EI" is not uncommon in Flate and LZW encoded inline images. I'm pretty confident the original design is high on everyone's list of "ideas that should not have been had".

The guidance for implementers is that the "L" key is now required, so there's no need for ASCII encoding. Without the "L" key the only solution is to decode the image, see if you have enough data, and if not read until the next EI and try again.
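The decode-and-retry approach could look something like this sketch for a FlateDecode inline image (hypothetical function; `expected_bytes` would come from /W, /H, /BPC and the colour space):

```python
import re
import zlib

def scan_flate_inline_image(stream: bytes, data_start: int,
                            expected_bytes: int) -> int:
    """Try each candidate EI in turn: if the bytes between ID and
    that EI don't decompress to enough data, the EI was part of the
    compressed stream, so keep scanning. Returns the offset just
    past the real EI, or -1."""
    for m in re.finditer(rb"\sEI\b", stream[data_start:]):
        candidate = stream[data_start:data_start + m.start()]
        try:
            decoded = zlib.decompress(candidate)
        except zlib.error:
            continue  # truncated stream: this EI was image data
        if len(decoded) >= expected_bytes:
            return data_start + m.end()
    return -1
```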

EDIT: I want to note you can add an L key to PDF 1.7 files, without breaking anything, and that I suspect any guidance would be applied to PDF 2.0 rather than being "backported" to the older specs anyway.

faceless2 avatar Apr 29 '25 12:04 faceless2

There are two separate issues:

a) Cannot find the end of an inline image reliably without knowing how to decompress and decode it and doing so; and

b) Cannot find the end of an inline image reliably at all.

The first obviously exists, but what is an example of the second? Have never seen one.

Plainly it would be nice to allow a PDF processor to process content streams even if it doesn't support all compression methods which might occur in an inline image. And that's what /L does.

But where's the actual ambiguity? You just don't stop reading on EI if you haven't stream-decompressed and read enough image data yet, given the image parameters. For things like JPEG, you can find the end of image by well-known means without needing to do JPEG decompression.

johnwhitington avatar May 26 '25 12:05 johnwhitington

Indeed, you're absolutely right - finding the end of an inline image pre-PDF 2.0 isn't so much ambiguous as it is exceedingly complex, for the surprisingly common case where you just want to extract text from a PDF and don't care about images at all (this is, obviously, the use case I personally care about!).

But where's the actual ambiguity? You just don't stop reading on EI if you haven't stream-decompressed and read enough image data yet, given the image parameters. For things like JPEG, you can find the end of image by well-known means without needing to do JPEG decompression.

Yes, but this should probably be mentioned in the standard:

  • In the case of uncompressed images it's obviously very simple, you can infer /L based on /W, /H, and /BPC.
  • For /DCTDecode and /JPXDecode, well... the PDF standard already defers to the JPEG standards, where presumably these "well-known means" are documented (for everything else, there's ~~MasterCard~~ StackOverflow)
  • If there's Flate or LZW compression then basically you have to catch a decompression error and try again at the next instance of EI.
  • For /CCITTFaxDecode ... probably the same thing, try it and catch errors? For JBIG2Decode? Presumably that too.
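The first bullet is simple enough to sketch directly (hypothetical helper, assuming each row of unencoded inline image data is padded to a whole byte, and that the number of colour components is known from the colour space):

```python
def uncompressed_inline_length(w: int, h: int, bpc: int,
                               ncomp: int) -> int:
    """Infer /L for unencoded inline image data from /W, /H, /BPC
    and the number of colour components; each row is padded to a
    byte boundary."""
    row_bytes = (w * ncomp * bpc + 7) // 8
    return row_bytes * h
```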

Anyway! Feel free to close the issue but it would be helpful to include something like the "algorithm" above, perhaps in a NOTE 3 in section 8.9.7.

dhdaines avatar May 26 '25 17:05 dhdaines

For Flate, CCITT, LZW you can feed your zlib or other library the data one byte at a time. PDF doesn't recommend inline images be larger than 4k, so this is generally fast enough. When the codec says it's finished, you are at the end of the image.
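For Flate specifically, that byte-at-a-time approach could be sketched like this (hypothetical function; it relies on a zlib stream being self-terminating, so `decompressobj.eof` flips exactly when the compressed data ends):

```python
import zlib

def flate_inline_end(stream: bytes, data_start: int) -> int:
    """Feed the codec one byte at a time; when it reports eof, we
    are at the end of the Flate-compressed image data. Returns the
    offset just past that data, or -1 if the stream never ends."""
    d = zlib.decompressobj()
    pos = data_start
    while pos < len(stream) and not d.eof:
        d.decompress(stream[pos:pos + 1])
        pos += 1
    return pos if d.eof else -1
```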

On a practical note for your use-case, I believe qpdf can turn all inline images in a PDF into ordinary external images. So you could pre-process with that, then you are guaranteed not to have any inline images to deal with.

johnwhitington avatar May 26 '25 18:05 johnwhitington

On a practical note for your use-case, I believe qpdf can turn all inline images in a PDF into ordinary external images. So you could pre-process with that, then you are guaranteed not to have any inline images to deal with.

Thanks! That seems like a useful kind of "repair" to do.

dhdaines avatar May 26 '25 19:05 dhdaines

The consensus to date has been to not add any wording related to repairing or recovering files, but I will ask the PDF TWG if they wish to add any informative statement for this situation.

petervwyatt avatar Jun 05 '25 05:06 petervwyatt