julia
julia copied to clipboard
Ensure non-ASCII decimal digits are also `isdigit`
I noticed that
julia> '4' |> isdigit
false
even though
julia> '4'
'4': Unicode U+FF14 (category Nd: Number, decimal digit)
The docstring doesn't say that this only checks ASCII, and the unicode category (see here) is an exact match for the definition (ten glyphs making up the numbers zero through nine, in various scripts). So while some things will still slip through (e.g. 四, which is considered a letter by Unicode), this implementation is a tiny bit more correct.
However, it's not just roses and sunshine; there is a downside, in that this won't vectorize anymore due to the ccall behind category_code. We could bring the double table lookup into pure Julia, but I'm not sure whether that would restore performance here. Maybe it could? As is, this PR comes with a 100x regression in a microbenchmark.
v1.11-alpha2:
julia> @benchmark count(isdigit, v) setup=(v=rand(Char, 10_000))
BenchmarkTools.Trial: 10000 samples with 202 evaluations.
Range (min … max): 392.871 ns … 509.455 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 400.743 ns ┊ GC (median): 0.00%
Time (mean ± σ): 405.332 ns ± 12.212 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▅█▆▇▆▄▂ ▁▁ ▁▃▂▂▂ ▂
▆▂▅▅██████████▇▆▅▄▄▄▄▂▄▄▄▅▅▅▆▆▇██▇▇▆███████▇▇▇▆▇▆▆▇▆▆▆▇█████▇ █
393 ns Histogram: log(frequency) by time 446 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
This PR:
julia> @benchmark count(isdigit, v) setup=(v=rand(Char, 10_000))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 40.870 μs … 483.089 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 42.200 μs ┊ GC (median): 0.00%
Time (mean ± σ): 45.848 μs ± 19.561 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▆▃▃▂ ▁
████████▆▆▄▄▆▄▅▅▅▄▅▅▇▄▄▄▄▄▁▁▁▃▁▃▁▁▁▁▁▁▁▁▃▇██▇▅▆▃▆█▄▄▄▃▁▁▁▁▁█ █
40.9 μs Histogram: log(frequency) by time 129 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
In a single-invocation benchmark, we "merely" have a 2-3x/10ns regression:
v1.11-alpha2:
julia> @benchmark isdigit(v) setup=(v=rand(Char)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 10.000 ns … 40.000 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 20.000 ns ┊ GC (median): 0.00%
Time (mean ± σ): 17.548 ns ± 4.491 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█
▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▂
10 ns Histogram: frequency by time 20 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
This PR:
julia> @benchmark isdigit(v) setup=(v=rand(Char)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 20.000 ns … 520.000 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 30.000 ns ┊ GC (median): 0.00%
Time (mean ± σ): 32.170 ns ± 31.893 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▇ █ ▂ ▂ ▂ ▁
█▁▁█▁▁█▁▁▇▁▁▁▅▁▁▄▁▁▁▁▁▁▁▁▁▆▁▁█▁▁▁█▁▁█▁▁▅▁▁▁▄▁▁▃▁▁▄▁▁▁▃▁▁▇▁▁▅ █
20 ns Histogram: log(frequency) by time 200 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
That being said, we already check category_code pretty much everywhere else, and this microbenchmark is extremely unlikely to be representative of an actual workload.
The current (ASCII only) definition of isdigit is used a few places in Base and stdlibs. If this PR should land, then the rest of Julia Base should be updated. For example, Base.TOML.isvalid_barekey_char assumes that isidigt means ASCII.
Also, there are probably other uses where the new implementation would still work, but is unnecessary.
That may be true, but on the other hand, these uses relying on ASCII-only are all undocumented assumptions :/ Worse, isdigit refers to isletter, which does specify that it's about the Unicode category. It's very reasonable to infer that this also is true for isdigit.
The failures in the Unicode stdlib are real - I only tested with test/unicode/utf8.jl from Base (I don't know why these differ). In particular, the failures are with these characters, which are a mix of decimal digits and non-digit numerics:
julia> for c in ['٣', '٥', '٨', '¹', 'ⅳ']
show(stdout, MIME"text/plain"(), c)
println()
end
'٣': Unicode U+0663 (category Nd: Number, decimal digit)
'٥': Unicode U+0665 (category Nd: Number, decimal digit)
'٨': Unicode U+0668 (category Nd: Number, decimal digit)
'¹': Unicode U+00B9 (category No: Number, other)
'ⅳ': Unicode U+2173 (category Nl: Number, letter)
Doubly worse, the test failing here is in a @testset specifically handling utf8proc predicates.. The issue referenced there also mentions isdigit, but that doesn't seem to have actually happened, judging by the PR that added the utf8proc queries.
Would adding a predicate isasciidigit be an option? It's really weird to me to have lots of utf8proc-based predicates and then a single one that's ASCII-only, while not mentioning that fact in the docstring.
Looking through all those old PRs/issue, maybe @stevengj has some additional context/guidance here :)
The docs for isdigit states it only includes characters 0-9.
Tests whether a character is a decimal digit (0-9).
julia> c = '4'
'4': Unicode U+FF14 (category Nd: Number, decimal digit)
julia> c in '0':'9'
false
So I don't see an issue with the current behavior.
There are more than just the ASCII decimal digits in Unicode, hence the example above with a 4 from a different script ;) Both that 4 and the ASCII 4 are decimal digits, and would thus fit the docstring.
See e.g. here for a list.
Yes, the docstring should be updated to be explicit about only including ASCII decimal digit then, but I have code that would break if this function returns true for other chars out of the range '0':'9'
If I remember, this was following the C isdigit function. I think the rationale was that isdigit checks are mostly use for parsing tasks that expect ASCII numbers.
I would favor @Seelengrab's proposed behaviour, for the following reasons:
- It mirrors other functions like
isletter,islowercase,isuppercase,isspace,ispunct, andisprint, which work using Unicode categories. Havingisdigitonly apply for ASCII is a strange and unexpected exception to the rule. The existing behaviour would be much clearer under a name likeis_ascii_digit. - The documentation of
isdigitsays that it checks if a character "a decimal digit (0-9)". To me that clearly means whether it represents one of the numbers 0-9. If it was supposed to mean whether the digit was in'0':'9', then surely the documentation would have said "whether a character is a decimal digit (in the range'0':'9')." - Whether using the ASCII or Unicode definition is more or less useful in parsing tasks is not an important criterion IMO. Just because 80% of users only care about ASCII for specific use cases doesn't mean we don't need to also consider the 20% of the cases.
The TOML parser is bugged with this:
julia> TOML.parse("4= \"value\"")
ERROR: TOML Parser error:
none:1:1 error: invalid bare key character: '4'
4= "value"
^
Stacktrace:
...
julia> @eval Base.Unicode begin
isdigit(c::AbstractChar) = category_code(c) == UTF8PROC_CATEGORY_ND
end;
julia> TOML.parse("4= \"value\"")
ERROR: StringIndexError: invalid index [3], valid nearby indices [1]=>'4', [4]=>'='
Stacktrace:
[1] string_index_err(s::String, i::Int64)
@ Base ./strings/string.jl:12
[2] SubString{String}(s::String, i::Int64, j::Int64)
@ Base ./strings/substring.jl:35
[3] SubString
@ ./strings/substring.jl:49 [inlined]
[4] SubString
@ ./strings/substring.jl:52 [inlined]
[5] take_substring
@ ./toml_parser.jl:440 [inlined]
[6] _parse_key(l::Base.TOML.Parser)
@ Base.TOML ./toml_parser.jl:623
this microbenchmark is extremely unlikely to be representative of an actual workload.
Checking for digits feels quite common inside hot parsing loops. Also, if existing uses assumed that isdigit was only true for ASCII this could potentially cause segfaults/UB since a string consisting of only digits can no longer be assumed to have a certain size (which is probably the cause of the TOML failure).
Instead of slowing down all uses of this in the ecosystem for the reason "this implementation is a tiny bit more correct" could a new function be added (and referred to from isdigit in a note)? It just doesn't seem worth modifying this pragmatically.
The TOML parser is bugged with this:
ERROR: StringIndexError: invalid index [3], valid nearby indices [1]=>'4', [4]=>'='
a string consisting of only digits can no longer be assumed to have a certain size (which is probably the cause of the TOML failure).
IMO that is more an issue with assuming l.prevpos-1 is always a valid index into the String:
take_substring(l::Parser) = SubString(l.str, l.marker:(l.prevpos-1))
which isn't the case in Unicode. IMO this should use prevind instead, but that's a bit off topic here. To be fair, I don't know my way around the TOML parser so I can't tell whether that use is ordinarily ok or not :shrug:
There's this bit about isvalid_barekey_char:
https://github.com/JuliaLang/julia/blob/50063122b44df1ba80e83fb5abee3f05074f0223/base/toml_parser.jl#L589-L592
but that seems inconsistent with the spec:
Bare keys may contain any letter-like or number-like Unicode character from any Unicode script, as well as ASCII digits, dashes and underscores.
I'm happy to add tests & fixes to Base here, but the fact that only the Unicode-stdlib tests failed suggest to me that this is either not well tested, or not that big of a change after all.
Checking for digits feels quite common inside hot parsing loops Instead of slowing down all uses of this in the ecosystem for the reason "this implementation is a tiny bit more correct"
Having written some of those hot parsing loops myself, it's quite common to replace the category code-based functions from Base with custom ones specialized for ASCII already - what's one more? IMO, utility functionality provided by Base, working on types defined in Base, should handle the canonical interpretation of those types, and not a special case. That is, the utility functions in this case should follow what Unicode considers a decimal digit.
could a new function be added (and referred to from isdigit in a note)?
That's also an option, but then we have the odd situation that only isdigit is the odd one out with having to use a different function to check what Unicode considers a digit, when everything else already does that by default. String is a UTF-8 string after all, not an ASCIIString, and Char is a Unicode codepoint, not an ASCII character.
To be clear, the usual arithmetic people sometimes do (e.g. '0' + 1 == '1') should still work with everything Unicode considers a decimal digit, according to the category definition linked above. So if code breaks as a result of this, it either assumed it was working on ASCII already (which should already be guarded behind an isascii check or other guarantee that the input is in fact ASCII), or it wasn't correctly handling Unicode string indexing :thinking:
The TOML spec is here https://toml.io/en/v1.0.0#keys
Bare keys may only contain ASCII letters, ASCII digits, underscores, and dashes (A-Za-z0-9_-). Note that bare keys are allowed to be composed of only ASCII digits, e.g. 1234, but are always interpreted as strings.
So if code breaks as a result of this, it either assumed it was working on ASCII already (which should already be guarded behind an isascii check or other guarantee that the input is in fact ASCII), or it wasn't correctly handling Unicode string indexing
As I wrote and demonstrated, there is code that assumes that isdigit does what it used to do. Making isdigit slower and changing it in such a way that it can introduce bugs by making old code have to handle new data is unlikely to be beneficial from a pragmatic point of view.
The TOML spec is here https://toml.io/en/v1.0.0#keys
Thanks for pointing that out! Seems like the version I linked is the in-development version, so a future version of TOML is likely going to support Unicode there.
As I wrote and demonstrated, there is code that assumes that isdigit does what it used to do. Making isdigit slower and changing it in such a way that it can introduce bugs by making old code have to handle new data is unlikely to be beneficial from a pragmatic point of view.
I understand that, but this is a largely undocumented assumption. @jakobnissen also found an issue with the current implementation, which doesn't handle malformed data properly:
julia> isdigit(reinterpret(Char, 0x30000001))
true
which this PR does handle, returning false. Naively, that would be much more catastrophic for the TOML parser (or any other parser, really) than just checking isascii (which, again, I'm happy to insert).
Clearly, changing isdigit to return true for non-ASCII digits would be breaking and can't happen in 1.x. In the short term we should simply document the current behavior more clearly (#54492).
The question is whether we should export a new function that returns true for any character in Unicode category Nd ("Number, decimal digit"). What applications would need this functionality?
It seems more useful, and more general, to simply export some variant of Base.Unicode.category_code, maybe returning an enum, so that users can test for whatever category they want.
whether we should export a new function that returns true for any character in Unicode category Nd ("Number, decimal digit").
It seems we already have it: isnumeric
isnumeric(c::AbstractChar) -> Bool
Tests whether a character is numeric. A character is classified as numeric if it belongs to the Unicode general category Number, i.e. a character whose category code begins with 'N'.
Note that this broad category includes characters such as ¾ and ௰. Use isdigit to check whether a character a decimal digit between 0 and 9.
https://github.com/JuliaLang/julia/blob/2877cbcc72496e5daab2b14adfb0987674d144b3/stdlib/Unicode/test/runtests.jl#L132-L142
It seems we already have it: isnumeric
No, that returns true for any category that's numeric, this Pr is specifically about numeric decimal digits. E.g. ⅳ is isnumeric, but is not a decimal digit. Its unicode category is Nl (Numeric, letter) not Nd (Numeric, decimal digit):
julia> 'ⅳ'
'ⅳ': Unicode U+2173 (category Nl: Number, letter)
julia> 'ⅳ' |> isnumeric
true
julia> 'ⅳ' |> Base.Unicode.category_code
10
julia> Base.Unicode.UTF8PROC_CATEGORY_ND
9
This PR is about UTF8PROC_CATEGORY_ND specifically, and would return false for 'ⅳ'.
This would just be all kinds of breaking. Just skim across all the usages of isdigit throughout the ecosystem — so many packages do things that are effectively isdigit(c) && c - '0' or isdigit(c) && parse(Int, c). Documented or not, this is the behavior and folks rely on it.