core
core copied to clipboard
Invalid or incomplete documentation in many String functions (characters vs UTF-16 code units)
Documentation of these functions says that they work with characters, but actually they work with UTF-16 code units:
-
String.dropLeft
,String.dropRight
,String.left
,String.right
These functions work with characters, but this is not documented:
-
String.foldl
,String.foldr
,String.uncons
These functions work with UTF-16 code units, but this is not documented:
-
String.length
(#1047),String.slice
These functions work with characters and this is documented (no problem here, except that it's inconsistent for some functions to work with characters while others work with UTF-16 code units):
-
String.all
,String.any
,String.filter
,String.map
,String.toList
Related: #977
Is there any change to the output, that a function using UTF-16 code units would produce and a function using characters, which would do essentially the same thing?
Especially with String.length, is the length not the number of characters?
@CSDUMMI String.length
is not the number of characters in all cases.
Only Unicode characters from "Basic Multilingual Plane" (Plane 0) are encoded as single UTF-16 code unit. All other Unicode characters are encoded as two UTF-16 code units, so if string contains any of these characters then number of UTF-16 code units is not the same as number of characters.
For example many emojis are from Plane 1, and so e.g. String.length "👩"
returns 2 even through that is a single character.
More information about Unicode Planes: https://en.wikipedia.org/wiki/Plane_(Unicode)
Then I think you should open a PR with more or less this information added to the docs. This might be very relevant to some apps. Could a text editor, that counts the number of characters in a file not use this function?
@CSDUMMI Elm only accepts PRs from core contributors, so it would be useless for me to make a PR. It would never be accepted.
@malaire There are 96 contributors to this library. I would like Elm to have 96 core contributors but this is not the case. Some contributors PRs have been merged just a few commits ago. It's particularly true for typo fixes, but not only, look at at closed and merged pull requests.
That said, there is no guarantee, it can take time, and it's true that PRs are more often used as inspiration for the actual fixes than merged as is. That does not mean that there are not useful. For bug fixes, a complete and rigorous analysis might be more useful than the actual PR though.
I just got an idea of how to fix this issue. Someone could create a package, called for example InsaneString
, which mirrors String
exacly and documents properly just how insane this mix of Unicode Characters and Unicode CodePoints is.
Then if Elm project values outside contributions, documentation could be just copied from InsaneString
to String
. And if Elm project spits on outside contributions as I've read, then documentation would still be available for those who want it.
I'll consider creating this myself in time, but for now I have other projects going on, for example SafeInt
which fixes Elm Int
.
Fixing this would definitely require serious breaking changes, so it can only be done in a MAJOR release. I understand that the current behavior is confusing, but I also want to be careful not to have frequent MAJOR changes since that has cascading effects across the whole package ecosystem.
This issue was opened after 0.19.1 came out, and I plan revisit these problems is when a 0.19.2 or 0.20.0 is slated to release. A patch could potentially be made to document this better though. Is there a PR for that?