julia
                                
                                 julia copied to clipboard
                                
                                    julia copied to clipboard
                            
                            
                            
                        `isuppercase`/`islowercase` fail on invalid characters
MWE:
julia> isuppercase('\xf0\x8e\x80\x80')
ERROR: Base.InvalidCharError{Char}('\xf0\x8e\x80\x80')
Stacktrace:
 [1] throw_invalid_char(c::Char)
   @ Base ./char.jl:86
 [2] UInt32
   @ ./char.jl:133 [inlined]
 [3] isuppercase(c::Char)
   @ Base.Unicode ./strings/unicode.jl:403
 [4] top-level scope
   @ REPL[12]:1
julia> Base.ismalformed('\xf0\x8e\x80\x80')
false
Either this is a requirement, or we can safely return false here, as is done for malformed characters. Does utf8proc handle invalid/malformed chars on its own? The docs aren't clear about this.
I think we should clearly be returning false here, similar to malformed characters.
Malformed chars can never get passed to utf8proc in the first place — if there is no way to convert them to a UInt32 codepoint, you can't pass them to the utf8proc API.
On invalid codepoints, utf8proc_isupper(codepoint) should return false.
Isn't this a bug in ismalformed?   If it can't be converted to a codepoint, isn't it malformed?
Or should we have another predicate in this case, where it's failing because it is an overlong encoding (Base.is_overlong_enc is returning true in UInt32(c))?
Maybe https://github.com/JuliaLang/julia/blob/dbf0bab59ddc28f1c240fa618bf0e23194954bbe/base/strings/unicode.jl#L414-L415
should just be calling isvalid(c) instead of ismalformed(c)?
Or better yet just (ismalformed(c) | isoverlong(c)) since utf8proc checks for the other cases.
Or better yet, shouldn't we have a predicate
hascodepoint(c::AbstractChar) = !(ismalformed(c) | isoverlong(c))
to check whether one can call codepoint(c) (== UInt32(c))?
As discussed in #54393, the conclusion is that codepoint(c) should succeed whenever !ismalformed(c), including for overlong encodings.
Isn't this a denial-of-service possibility, and thus a security bug? And should be labelled as such, and maybe add to 1.11.1 milestone? Will 1.11.0 be released first, or it skipped for some reason?
At least for Strings, you can and do use uppercase to compare case-insensitively, but I confirmed it also fails (there's probably a better option that needs also be considered).
I suppose fixing this solves Strings indirectly, but does it make sense to fix Strings only for now, somehow, but not for Char?