Fixes/write strings correctly

Open synercoder opened this issue 1 year ago • 0 comments

While working on strings and encodings in my own PDF library, I remembered my previous PR here. The current way Unicode and BigEndianUnicode strings are written is incorrect.

My PR has 2 changes:

1) Unicode encoding in PDF is done with UTF-16BE. Technically PDF does not support UTF-16LE. During experiments with my own library, Adobe Acrobat did understand PDFs created with little-endian strings, but according to the PDF 1.7 spec only UTF-16BE is supported.

Thus I have changed the default string encoding when writing non-ASCII strings to BigEndianUnicode.
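A minimal Python sketch of that encoding choice, for illustration only. It is not the library's code, and `encode_text` is a hypothetical helper name. It keeps pure-ASCII strings as single bytes and encodes everything else as UTF-16BE. The byte order mark that PDF text strings carry in front of UTF-16BE data is left out to keep the byte dumps short.

```python
def encode_text(text: str) -> bytes:
    """Hypothetical helper: ASCII stays one byte per character,
    anything else is encoded as big-endian UTF-16."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return text.encode("utf-16-be")


ascii_bytes = encode_text("abc")
unicode_bytes = encode_text("é")

print(ascii_bytes.hex(" ").upper())    # 61 62 63
print(unicode_bytes.hex(" ").upper())  # 00 E9
# For comparison, little-endian swaps the two bytes:
print("é".encode("utf-16-le").hex(" ").upper())  # E9 00
```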

2) When writing string literals, the individual bytes for parentheses need to be escaped. The current approach of replacing "(" with "\\(" and ")" with "\\)" works for an ASCII string, but breaks down with Unicode.

UTF-16 uses at least 2 bytes for every character. Thus the string ( ) would first become \( \), which converted to UTF-16BE becomes 00 5C 00 28 00 20 00 5C 00 29 (hex).

When reading this string back, the parenthesis bytes (28 and 29) are preceded by a 00 byte rather than by the escape byte 5C, so the reader treats them as unescaped parentheses and weird results show up.
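The problem can be reproduced in a few lines of Python. This is a sketch of the escape-then-encode order described above, not the library's actual code, and `escape_then_encode` is a hypothetical name.

```python
def escape_then_encode(text: str) -> bytes:
    """Escape parentheses at the character level, then encode.
    This mirrors the order that breaks for UTF-16BE strings."""
    escaped = text.replace("(", "\\(").replace(")", "\\)")
    return escaped.encode("utf-16-be")


broken = escape_then_encode("( )")
print(broken.hex(" ").upper())  # 00 5C 00 28 00 20 00 5C 00 29

# The byte directly before the '(' byte (0x28) is 0x00, not the escape byte 0x5C.
before_open_paren = broken[broken.index(0x28) - 1]
print(hex(before_open_paren))  # 0x0
```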

My second change first encodes the string to bytes and then, when writing the bytes, checks for parenthesis bytes and escapes those. This means the above string is written as 00 5C 28 00 20 00 5C 29, which PDF readers (including Wisp) will read correctly.
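The encode-then-escape order can be sketched in Python as follows. This is an illustration under assumptions, not the code from the PR: `encode_then_escape` and `unescape` are hypothetical names, and `unescape` is a deliberately simplified reader that only drops the backslash in front of an escaped byte. It ignores the other escape sequences that PDF literal strings allow.

```python
BACKSLASH = 0x5C
PARENS = (0x28, 0x29)  # '(' and ')'


def encode_then_escape(text: str) -> bytes:
    """Encode to UTF-16BE first, then escape parenthesis bytes."""
    out = bytearray()
    for byte in text.encode("utf-16-be"):
        if byte in PARENS:
            out.append(BACKSLASH)
        out.append(byte)
    return bytes(out)


def unescape(data: bytes) -> bytes:
    """Simplified reader: drop each backslash and keep the byte after it."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == BACKSLASH and i + 1 < len(data):
            i += 1  # skip the escape byte itself
        out.append(data[i])
        i += 1
    return bytes(out)


written = encode_then_escape("( )")
print(written.hex(" ").upper())  # 00 5C 28 00 20 00 5C 29

round_trip = unescape(written).decode("utf-16-be")
print(round_trip)  # ( )
```

Only parentheses are escaped here, matching the description above. A backslash byte (5C) inside the encoded data would need the same byte-level treatment for the round trip to hold in general.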

Aug 16 '24 17:08 synercoder