
UTF-16LE strings are supported by the vast majority of PDF processors and should be permitted

Open petervwyatt opened this issue 3 years ago • 2 comments

Based on recent SafeDocs research, it is apparent that the vast majority of PDF processors silently support UTF-16LE encoded strings as well as UTF-16BE, resulting in significantly different visual appearances across processors for things like outlines and OCG layer names. Over 18 PDF processors were tested, and only 2 (Slim PDF Reader and iText) "correctly" processed UTF-16LE strings as PDFDocEncoding. Real PDFs in the wild with UTF-16LE strings also exist, such as mentioned here. Also, no validators seem to detect (reject or warn about) PDFs with UTF-16LE strings.

Thus the PDF spec should be updated to acknowledge this reality by permitting UTF-16LE.

I have created a PDF test file here that uses non-trivial UTF-16LE strings: https://github.com/pdf-association/safedocs/tree/main/Miscellaneous%20Targeted%20Test%20PDFs#utf16le-testpdf
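
To make the difference concrete, here is a minimal Python sketch of the two behaviours described above. It is not taken from any of the tested processors: the function name `decode_text_string` and the `lenient` flag are hypothetical, and Latin-1 is used only as a rough stand-in for PDFDocEncoding (the two agree for the bytes 0xFF and 0xFE, but differ elsewhere, so a real implementation would need a proper PDFDocEncoding table).

```python
import codecs


def decode_text_string(raw: bytes, lenient: bool = False) -> str:
    """Decode the bytes of a PDF text string (hypothetical helper).

    lenient=False: a string without a recognised BOM is treated as
    PDFDocEncoding, which is what the issue calls "correct" behaviour.
    lenient=True: additionally accept an FF FE byte order mark as
    UTF-16LE, as most tested processors apparently do.
    """
    if raw.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return raw[2:].decode("utf-16-be")
    if raw.startswith(codecs.BOM_UTF8):       # EF BB BF (PDF 2.0 text strings)
        return raw[3:].decode("utf-8")
    if lenient and raw.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return raw[2:].decode("utf-16-le")
    # ASSUMPTION: Latin-1 approximates PDFDocEncoding for this sketch only.
    return raw.decode("latin-1")


# A layer name stored as UTF-16LE with a byte order mark.
sample = codecs.BOM_UTF16_LE + "Layer".encode("utf-16-le")

print(repr(decode_text_string(sample, lenient=True)))
print(repr(decode_text_string(sample, lenient=False)))
```

With `lenient=True` the name comes out as `Layer`. With `lenient=False` the same bytes come out as `ÿþ` followed by each letter interleaved with NUL characters, which is the kind of visual difference seen in outlines and OCG layer names.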

petervwyatt, Mar 19 '22 05:03

I'm not really sure this is a good idea.

Making LE support official would require all PDF processors (not only those that apparently implement the current state of the specification correctly) to check whether they correctly support LE in all contexts, since you surely have not tested every context in which a string might be relevant in all those PDF processors.

I'd rather consider your observation a warning: if all those PDF processors "silently support UTF-16LE", who knows which other text encodings they also incorrectly "support"...

mkl-public, Mar 20 '22 14:03

Agree with @mkl-public - don't touch this!

lrosenthol, Jun 30 '22 19:06