Use colander.Length() to validate emoji grapheme clusters

Open dwt opened this issue 6 years ago • 8 comments

Not sure this is the right place to start, but since Python itself does not have much support for this part of Unicode, maybe it is?

Our use case is that we want to have an initials field where people can enter up to 2 characters to be rendered on their user icon.

Of course emojis are a great choice for this, but they frequently fail the length test, as a single emoji can be composed of many code points. E.g. "🤔 🙈 me así, se 😌 ds 💕👭👙 hello 👩🏾‍🎓 emoji hello 👨‍👩‍👦‍👦 how are 😊 you today🙅🏽🙅🏽"

The problem with colander.Length() is that it is naive, in the sense that it only counts code points, while we would like it to count grapheme clusters, to find how many characters would be rendered from that string.
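To make the mismatch concrete, here is a small sketch (Python 3, standard library only) that prints what `len()` reports for a few strings that each render as one visible character. The sample names are only labels chosen for this illustration.

```python
# Each sample renders as a single visible character, yet len() reports more,
# because a Python 3 str is a sequence of code points.
samples = {
    # a single code point
    "thinking face": "\U0001F914",
    # woman + skin tone modifier + zero width joiner + graduation cap
    "woman student": "\U0001F469\U0001F3FE\u200D\U0001F393",
    # man, woman, boy, boy, glued together by three zero width joiners
    "family": "\U0001F468\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466",
    # plain "e" followed by a combining acute accent
    "accented e": "e\u0301",
}

code_points = {name: len(text) for name, text in samples.items()}

for name, count in code_points.items():
    print(f"{name}: {count} code point(s)")
```

A length check with a maximum of 2 that relies on `len()` would therefore reject the "woman student" and "family" emoji, even though each is displayed as one character.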

Does that make sense? Do you have a suggestion for how to handle that better?

dwt avatar Feb 18 '19 09:02 dwt

Have you tried a custom validator?

stevepiercy avatar Feb 18 '19 16:02 stevepiercy

I did, but the problem is that it is quite hard to work with grapheme clusters in Python (especially in Python 2), which is why I would very much like it if the validation library knew what they are and could handle them.

I found a Python 3 library, grapheme (that I can't use), which seems like it could help, but I still think it would be really nice if I were able to express the fact that I would like an input to have just a certain number of visible characters.

Ideally that would also take care (and allow) stuff like this:

Ȳ̶̧̙̺̪͕̰̬̹̫̟̫̥̺̓́̍͜͜͠e̸͇̽̊̇͆̐ä̸̛̠́̽̑̃̃̃̈́̐̏͘̕͜͠h̷̨̡̛̦̲̯̰̪̜̭͎̠̹̏̈́̌̉̽͌̌͜ ̷̨͈͚̬̮͈̦́͒̍͂͘ͅẗ̶̨̮̩̭̘͕̤͈̰̣͔̝͝h̶̭̹̘̰͚̬͖̗͐i̵̮͍͓̰̣̱͎̤͕̽̀s̸̜͐̽͛̅̀͑̎̅̕͠͝͠͠ ̵̢̧̘͇̱͇̠̝͚͔̱̙͔̀̀̀͗ͅi̶͉̱̐̿͗͂͋s̶̮̫̝͇͓̤̲̼̮̟̝̫̳̫̿̀́̍͂̋͌̽̂͊̈́͛̚͠ ̷̧̘̘̙̳̬̻̱͑̄̇̊̒͌͒t̴͕͙̜͕̦͚̥͉̳̿̿͑̓̈́͐͘h̸͓̬̱̙̎͊͛ͅę̶̧̢̼͇͈͖̘̼̜̠͊̍̊́̕ͅͅ ̶̧͓͖̥̗̝̤̜̣̣̘̓̍́̌̉̉̔̂̈́̽̓͗̀̕̚s̴̨͕̳͕̟͇̬̳͚͔̻̦̺̟͌̓ͅţ̵̡̧͍̙̺̳̪͇̟̝̫͚̺́ụ̵̹͔̝̩͊͌͐f̷̗̦̗̟͇̃̓̉f̵̨̡̨̪̗̯̩̞͇̞̞̫͔̏̈̏̈́́̑͗̃͋͘̕͘͠ͅ!̵̛̻͕̓̀̀͛͂̃̈͘

.

dwt avatar Feb 19 '19 07:02 dwt

@dwt, thanks for pointing this out! Also, you might be interested in Python bug 30717 which I think is related to this…

jenstroeger avatar Feb 19 '19 13:02 jenstroeger

That is indeed interesting, but the bug report looks dead, and I think it will take a long time for Python to rework its string handling around proper Unicode grapheme support. People are still burnt by the unicode -> str transition.

dwt avatar Feb 19 '19 14:02 dwt

@dwt this is not something that we are likely to pick up and work on... especially for Python 2.7.

If you are still deploying on Python 2.7 I would recommend you look at porting, and then you may take advantage of the grapheme library.

Unless you can point me at documentation in the standard library that we can take advantage of, colander.Length() will continue to stay naive and a custom validator using an external third-party library will be your best bet.
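A minimal sketch of what such a custom validator could look like, assuming colander's usual validator protocol (a callable that takes `node` and `value` and raises `colander.Invalid`). The names `GraphemeLength` and `approx_grapheme_count` are hypothetical and not part of colander. The default counter is a rough standard-library approximation, not the full Unicode segmentation algorithm: it ignores flag sequences, Hangul jamo and other rules, so a third-party counter can be passed in through the `count` parameter instead.

```python
import unicodedata

try:
    from colander import Invalid
except ImportError:
    class Invalid(Exception):
        """Stand-in so this sketch also runs without colander installed."""

        def __init__(self, node, msg=None):
            super().__init__(msg)
            self.node = node
            self.msg = msg

ZWJ = "\u200d"  # zero width joiner, glues emoji into one sequence


def _extends_previous(ch):
    """True if ch normally attaches to the character before it."""
    return (
        unicodedata.category(ch) in ("Mn", "Me", "Mc")  # combining marks
        or 0x1F3FB <= ord(ch) <= 0x1F3FF                # emoji skin tones
        or ch == ZWJ
    )


def approx_grapheme_count(text):
    """Approximate the number of visible characters (assumption: simplified
    rules only, not a full implementation of the Unicode standard)."""
    count = 0
    prev = None
    for ch in text:
        # Start a new cluster unless ch attaches to the previous character
        # or the previous character was a joiner.
        if prev is None or not (_extends_previous(ch) or prev == ZWJ):
            count += 1
        prev = ch
    return count


class GraphemeLength:
    """Hypothetical validator: like a length check, but on visible characters."""

    def __init__(self, min=None, max=None, count=approx_grapheme_count):
        self.min = min
        self.max = max
        self.count = count  # swap in a library-backed counter here

    def __call__(self, node, value):
        n = self.count(value)
        if self.min is not None and n < self.min:
            raise Invalid(node, f"Shorter than minimum length {self.min}")
        if self.max is not None and n > self.max:
            raise Invalid(node, f"Longer than maximum length {self.max}")
```

It would be attached like any other validator, for example `colander.SchemaNode(colander.String(), validator=GraphemeLength(max=2))`. On Python 3, passing `count=grapheme.length` from the grapheme library mentioned above should give results that follow the Unicode rules more closely than the default approximation.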

digitalresistor avatar Feb 20 '19 00:02 digitalresistor

While doing a quick search I also found https://pypi.org/project/uniseg/ which is Python 2.7 compatible.

digitalresistor avatar Feb 20 '19 00:02 digitalresistor

Yes, but unfortunately that library only supports an old version of the Unicode spec. :-(

Ah well, I'd be perfectly happy if you had grapheme counting support on Python 3, as indeed I'll be switching soon.

I still think that in this world a validation library for strings should know about the concept of grapheme clusters to allow people to enter one smiley in a field that requires a one to two character input.

dwt avatar Feb 20 '19 06:02 dwt

@dwt I will be happy to review patches/PRs. I am not saying we won't ship or provide that, but it is not a use case that the core developers have and is not something that is easy to add because of the standard around it.

digitalresistor avatar Feb 20 '19 17:02 digitalresistor