
fix: allow UTF-16 surrogates to be passed through

Open dbushong opened this issue 3 months ago • 4 comments

UserStrings can be UTF-16-LE encoded values that contain "odd" (unpaired) surrogate code points. Per the Wikipedia page on UTF-16:

The official Unicode standard says that no UTF forms, including UTF-16, can encode the surrogate code points. Since these will never be assigned a character, there should be no reason to encode them. However, Windows allows unpaired surrogates in filenames and other places, which generally means they have to be supported by software in spite of their exclusion from the Unicode standard.

With this change we at least get them back as valid Python Unicode characters, rather than omitting the string.

dbushong avatar Sep 12 '25 20:09 dbushong

Can you share a file that contains such a string? I want to add a test case.

malwarefrank avatar Sep 14 '25 02:09 malwarefrank

I do not think surrogatepass is the right choice for everyone. What about replace or backslashreplace? I would rather give people the option to choose an error handler. What would you think of changing your PR to add an errors="strict" argument to UserString.__init__() and to UserStringHeap.get(), and propagating it to the line that you changed instead of hardcoding surrogatepass? This would preserve the current behavior and still give you the choice of using surrogatepass in your application.
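
For reference, here is a minimal sketch of how the error handlers mentioned above behave when decoding a lone surrogate. The helper name decode_utf16le and the sample bytes are assumptions made for illustration; this is not dnfile code, only the standard bytes.decode() call with different errors values.

```python
from typing import Optional

# Illustrative input: a lone high surrogate (U+D8B3) followed by an
# ordinary BMP character (U+4F8B), both encoded as UTF-16-LE.
RAW = b"\xb3\xd8\x8b\x4f"


def decode_utf16le(data: bytes, errors: str = "strict") -> Optional[str]:
    """Hypothetical helper: decode UTF-16-LE, returning None on failure."""
    try:
        return data.decode("utf-16-le", errors=errors)
    except UnicodeDecodeError:
        # With "strict", an unpaired surrogate makes the decode fail.
        return None


HANDLERS = ("strict", "replace", "backslashreplace", "surrogatepass")
results = {name: decode_utf16le(RAW, name) for name in HANDLERS}

for name, value in results.items():
    # ascii() keeps the output printable on any terminal encoding.
    print(f"{name:>16}: {ascii(value)}")
```

Only surrogatepass keeps the original code point; replace substitutes U+FFFD, and backslashreplace keeps the raw bytes as escaped text.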

malwarefrank avatar Sep 14 '25 03:09 malwarefrank

OK, I added an error_handler: str = "strict" parameter to the UserString constructor as well as to UserStringHeap#get.
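
Roughly, the shape of that change could look like the sketch below. The classes UserStringSketch and UserStringHeapSketch are hypothetical stand-ins, not the actual dnfile classes, and the dict-backed heap is a simplification assumed only to keep the example self-contained. Only the error_handler parameter name and its "strict" default come from this thread.

```python
from typing import Dict, Optional


class UserStringSketch:
    """Hypothetical stand-in for a user string; not the dnfile class."""

    def __init__(self, raw: bytes, error_handler: str = "strict") -> None:
        self.raw = raw
        self.value: Optional[str]
        try:
            # The caller-supplied handler replaces a hardcoded choice.
            self.value = raw.decode("utf-16-le", errors=error_handler)
        except UnicodeDecodeError:
            # Default ("strict") behavior: the string is left out.
            self.value = None


class UserStringHeapSketch:
    """Hypothetical heap that maps offsets to raw UTF-16-LE bytes."""

    def __init__(self, items: Dict[int, bytes]) -> None:
        self._items = items

    def get(
        self, offset: int, error_handler: str = "strict"
    ) -> Optional[UserStringSketch]:
        raw = self._items.get(offset)
        if raw is None:
            return None
        # Propagate the handler down to the decode call.
        return UserStringSketch(raw, error_handler=error_handler)


# Usage: a lone high surrogate (U+D8B3) followed by U+4F8B.
heap = UserStringHeapSketch({1: b"\xb3\xd8\x8b\x4f"})
default_item = heap.get(1)
passthrough_item = heap.get(1, error_handler="surrogatepass")
```

With the default handler the value stays None, so existing callers see no change; passing "surrogatepass" opts in to keeping the surrogate.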

I'm trying to add test coverage to match, but I'm unsure how the exe fixtures are generated.

dbushong avatar Sep 15 '25 17:09 dbushong

Here's an example of a unicode string with unpaired surrogates: "\ud8b3例"

Here's a round-tripping example:

>>> "\ud8b3例"
'\ud8b3例'
>>> "\ud8b3例".encode('utf-16-le')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\ud8b3' in position 0: surrogates not allowed
encoding with 'utf-16-le' codec failed
>>> "\ud8b3例".encode('utf-16-le', errors='surrogatepass')
b'\xb3\xd8\xb5\xf9'
>>> "\ud8b3例".encode('utf-16-le', errors='surrogatepass').decode('utf-16-le')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dbushong/.local/share/uv/python/cpython-3.12.9-macos-aarch64-none/lib/python3.12/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 0-1: illegal UTF-16 surrogate
decoding with 'utf-16-le' codec failed
>>> "\ud8b3例".encode('utf-16-le', errors='surrogatepass').decode('utf-16-le', errors='surrogatepass')
'\ud8b3例'

dbushong avatar Sep 15 '25 17:09 dbushong