
rotunicode should also provide transformations beyond the BMP

Open Boldewyn opened this issue 10 years ago • 3 comments

Many Unicode errors in applications stem from software assuming that Unicode ends at U+FFFF (see, e.g., MySQL's misnamed utf8 charset).

It would be great for testing if rotunicode could provide an option to switch to astral Unicode characters, i.e. those beyond U+FFFF.
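To illustrate why these characters trip up software, here is a minimal sketch comparing a BMP character with an astral one. The helper name `utf16_code_units` is made up for this example and is not part of rotunicode. An astral character needs four bytes in UTF-8 (more than the three bytes per character that MySQL's utf8 charset can store) and a surrogate pair in UTF-16.

```python
# One character inside the BMP and one beyond it.
bmp_char = "\u00e9"         # U+00E9 LATIN SMALL LETTER E WITH ACUTE
astral_char = "\U0001D400"  # U+1D400 MATHEMATICAL BOLD CAPITAL A


def utf16_code_units(text):
    """Return how many 16-bit code units `text` occupies in UTF-16.

    Hypothetical helper for illustration only.
    """
    return len(text.encode("utf-16-le")) // 2


# UTF-8: the BMP character fits in 2 bytes, the astral one needs 4.
utf8_len_bmp = len(bmp_char.encode("utf-8"))
utf8_len_astral = len(astral_char.encode("utf-8"))

# UTF-16: the astral character is stored as a surrogate pair (2 code units).
utf16_units_bmp = utf16_code_units(bmp_char)
utf16_units_astral = utf16_code_units(astral_char)
```

Code that truncates at three UTF-8 bytes, or that treats one UTF-16 code unit as one character, will mishandle `astral_char` while still passing tests that only use BMP text.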

A set that (almost) fits like a glove is found in the "Mathematical Alphanumeric Symbols" block:

https://codepoints.net/U+1D400..U+1D433,U+1D7CE..U+1D7D7

I'd love to provide a pull request for it, but I am unsure how to add this to the existing code: an extra parameter to rotunicode.RotUnicode.encode()? A new encoder rotunicode.RotUnicodeAstral()?
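As a rough sketch of what such a mapping could look like, independent of how it would be wired into rotunicode: the linked ranges cover the bold capitals (U+1D400..U+1D419), bold small letters (U+1D41A..U+1D433) and bold digits (U+1D7CE..U+1D7D7), so a plain translation table is enough. The names `build_astral_table`, `to_astral` and `from_astral` are hypothetical and do not exist in rotunicode.

```python
import string


def build_astral_table():
    """Map ASCII letters and digits to Mathematical Bold code points.

    Hypothetical helper; the offsets follow the ranges linked above.
    """
    table = {}
    for offset, ch in enumerate(string.ascii_uppercase):
        table[ord(ch)] = 0x1D400 + offset  # bold A..Z
    for offset, ch in enumerate(string.ascii_lowercase):
        table[ord(ch)] = 0x1D41A + offset  # bold a..z
    for offset, ch in enumerate(string.digits):
        table[ord(ch)] = 0x1D7CE + offset  # bold 0..9
    return table


ASTRAL_TABLE = build_astral_table()
# Inverse table, so the transformation can be undone.
ASCII_TABLE = {astral: ascii_ for ascii_, astral in ASTRAL_TABLE.items()}


def to_astral(text):
    """Replace ASCII letters and digits; leave everything else untouched."""
    return text.translate(ASTRAL_TABLE)


def from_astral(text):
    """Undo to_astral()."""
    return text.translate(ASCII_TABLE)
```

Punctuation and whitespace pass through unchanged in this sketch, so `from_astral(to_astral(s))` returns `s` for any ASCII input.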

Boldewyn avatar Oct 07 '15 13:10 Boldewyn

@Boldewyn the goal of rotunicode is to make it easy to catch problems with non-ASCII characters. The current rot_unicode_alphabet was chosen to resemble the corresponding ASCII characters for better readability. The set that you suggest stays true to that goal, covers the problem you raise about software assuming Unicode ends at U+FFFF, and has better readability. I'm thinking it makes sense to replace the current rot_unicode_alphabet with it.

Your other suggestion, i.e. an option to switch to a custom alphabet set, makes sense too. Building that ability will allow others to revert to the old behavior if they don't like the new set.

I'll try to implement these and submit pull requests.

kunalparmar avatar Oct 10 '15 21:10 kunalparmar

I like this idea, but I'm not sure we should replace the current alphabet.

On one hand, this will allow users to catch a wider class of unicode problems, which is a great thing. I fully support at least adding an ability to use a non-BMP alphabet.

On the other hand, this change may cause our users' previously passing tests to start failing if their code has problems with non-BMP characters. Being completely unfamiliar with this aspect of unicode until @Boldewyn opened this ticket, I don't know what percentage of Python programs will have problems with this, or in what ways such problems might manifest. It could be that errors are rare and obvious, or they may be common and puzzling.

Furthermore, since rotunicode is a dev/testing dependency and not a production dependency, it is unlikely that our users are pinning to a specific version (we don't pin any versions of our dev dependencies: https://github.com/box/rotunicode/blob/master/requirements-dev.txt). If this were a production dependency or users were otherwise pinning versions, we could just do a major version bump, add a release note about the potential breakage, and users would be able to handle any problems if/when they upgrade. But since our users probably don't pin to a rotunicode version, their builds may start to fail at any time, and they won't be able to bisect the problem.

Because of this, I think I would lean away from changing the default alphabet. But I still support adding support for a non-BMP alphabet.

If we still go ahead with changing the default alphabet, I think we should do a major version bump.

jmoldow avatar Oct 13 '15 15:10 jmoldow

Yes, that basically sums up my uncertainty about how to implement this. On top of that, I think it is a nice feature that the current alphabet adds some "optical feedback" in the form of diacritics and combining marks.

The astral characters that I suggested look quite close to the real ASCII ones that they replace. They are great (or, as I laid out, even better) for detecting problems with Unicode, but they lack a bit of the fun of it (using .encode() to do visual mangling of ASCII characters).

Boldewyn avatar Oct 13 '15 19:10 Boldewyn