julia icon indicating copy to clipboard operation
julia copied to clipboard

`isuppercase`/`islowercase` fail on invalid characters

Open Seelengrab opened this issue 1 year ago • 3 comments

MWE:

julia> isuppercase('\xf0\x8e\x80\x80')
ERROR: Base.InvalidCharError{Char}('\xf0\x8e\x80\x80')
Stacktrace:
 [1] throw_invalid_char(c::Char)
   @ Base ./char.jl:86
 [2] UInt32
   @ ./char.jl:133 [inlined]
 [3] isuppercase(c::Char)
   @ Base.Unicode ./strings/unicode.jl:403
 [4] top-level scope
   @ REPL[12]:1

julia> Base.ismalformed('\xf0\x8e\x80\x80')
false

Either this is a requirement, or we can safely return false here, as is done for malformed characters. Does utf8proc handle invalid/malformed chars on its own? The docs aren't clear about this.

Seelengrab avatar May 03 '24 09:05 Seelengrab

I think we should clearly be returning false here, similar to malformed characters.

Malformed chars can never get passed to utf8proc in the first place — if there is no way to convert them to a UInt32 codepoint, you can't pass them to the utf8proc API.

On invalid codepoints, utf8proc_isupper(codepoint) should return false.

stevengj avatar May 07 '24 17:05 stevengj

Isn't this a bug in ismalformed? If it can't be converted to a codepoint, isn't it malformed?

Or should we have another predicate in this case, where it's failing because it is an overlong encoding (Base.is_overlong_enc is returning true in UInt32(c))?

stevengj avatar May 07 '24 17:05 stevengj

Maybe https://github.com/JuliaLang/julia/blob/dbf0bab59ddc28f1c240fa618bf0e23194954bbe/base/strings/unicode.jl#L414-L415

should just be calling isvalid(c) instead of ismalformed(c)?

Or better yet just (ismalformed(c) | isoverlong(c)) since utf8proc checks for the other cases.

Or better yet, shouldn't we have a predicate

hascodepoint(c::AbstractChar) = !(ismalformed(c) | isoverlong(c))

to check whether one can call codepoint(c) (== UInt32(c))?

stevengj avatar May 07 '24 17:05 stevengj

As discussed in #54393, the conclusion is that codepoint(c) should succeed whenever !ismalformed(c), including for overlong encodings.

stevengj avatar Jul 17 '24 15:07 stevengj

Isn't this a denial-of-service possibility, and thus a security bug? And should be labelled as such, and maybe add to 1.11.1 milestone? Will 1.11.0 be released first, or it skipped for some reason?

At least for Strings, you can and do use uppercase to compare case-insensitively, but I confirmed it also fails (there's probably a better option that needs also be considered).

I suppose fixing this solves Strings indirectly, but does it make sense to fix Strings only for now, somehow, but not for Char?

PallHaraldsson avatar Oct 07 '24 23:10 PallHaraldsson