lean4
lean4 copied to clipboard
fix: validate UTF-8 at C++ -> Lean boundary
Continuation of #3958. To ensure that lean code is able to uphold the invariant that Strings are valid UTF-8 (which is assumed by the lean model), we have to make sure that no lean objects are created with invalid UTF-8. #3958 covers the case of lean code creating strings via fromUTF8Unchecked, but there are still many cases where C++ code constructs strings from a const char * or std::string with unclear UTF-8 status.
To address this and minimize accidental missed validation, the (lean_)mk_string function is modified to validate UTF-8. The original function is renamed to mk_string_unchecked, with several other variants depending on whether we know the string is UTF-8 or ASCII and whether we have the length and/or utf8 char count on hand. I reviewed every function which leads to mk_string or its variants in the C code, and used the appropriate validation function, defaulting to mk_string if the provenance is unclear.
This PR adds no new error handling paths, meaning that incorrect UTF-8 will still produce incorrect results in e.g. IO functions, they are just not causing unsound behavior anymore. A subsequent PR will handle adding better error reporting for bad UTF-8.
!bench
(disabled auto-merge to run a benchmark first)
Mathlib CI status (docs):
- ❗ Std/Mathlib CI will not be attempted unless your PR branches off the
nightly-with-mathlibbranch. Trygit rebase 62cdb51ed5b9d8487877d5a4adbcd4659d81fc6a --onto 291bb84c972dae470e8f5602729e9d5a5e9433a2. (2024-04-20 20:29:09) - ❗ Std/Mathlib CI will not be attempted unless your PR branches off the
nightly-with-mathlibbranch. Trygit rebase fb135b8cfe83c8d9286a9d0198114b7cb306a67a --onto 62cdb51ed5b9d8487877d5a4adbcd4659d81fc6a. (2024-04-23 21:32:17)
Here are the benchmark results for commit 88cad9a33495173f56f41b79db791d4bea1f8470. There were significant changes against commit 62cdb51ed5b9d8487877d5a4adbcd4659d81fc6a:
Benchmark Metric Change
================================================
+ stdlib tactic execution -1.9% (-302.2 σ)
The affected tests are regression tests from #301, #1707, #1690, all of which appear to have been strange behaviors of lean caused by invalid UTF-8. The new behavior is for the C code to throw an unspecific error before even calling lean code; this is inevitable because the lean parser/elaborator assumes it can store the file in a String so we can't really use most of lean's error handling.
I pushed a new version of the PR. Summary of changes (w.r.t master, not ver.1):
lean_mk_string_unchecked(const char * s, size_t sz, size_t utf8_chars)was previouslylean_mk_string_core- Use this when we have both the length and utf8_length values available and also know the input is valid UTF-8.
lean_mk_string_from_bytes_unchecked(const char * s, size_t sz)waslean_mk_string_from_bytes- Use this when we have the length and know the input is valid UTF-8, but need to compute the utf8_chars.
lean_mk_string_from_bytes(const char * s, size_t sz)does validation now, and if validation fails it converts invalid UTF-8 sequences toU+FFFDinstead of just returning the empty string.- Use this when we have a (char*, len) pair but do not know it encodes valid UTF-8
lean_mk_string(const char * s)is a wrapper aroundlean_mk_string_from_bytes- Use this when we have char*, not necessarily UTF-8 and need
strlento get the length
- Use this when we have char*, not necessarily UTF-8 and need
lean_string_from_utf8_uncheckedwaslean_string_from_utf8- This is the implementation of
String.fromUTF8, you should probably not call it directly. I added_uncheckedto the name because the C version doesn't validate, even though the lean version takes a validity proof witness as an argument.
- This is the implementation of
lean_mk_ascii_string_unchecked(char const * s)waslean_mk_ascii_string- Use this when you know the string is valid ASCII but need
strlento get the length.
- Use this when you know the string is valid ASCII but need
lean::mk_ascii_string_unchecked(std::string const & s)waslean::mk_ascii_string- Use this when you know the string is valid ASCII and have a
std::string(which counts the length).
- Use this when you know the string is valid ASCII and have a
Other (non-API) changes:
- The code generator produces calls to
lean_mk_string_uncheckednow instead oflean_mk_string_from_bytes, since we don't want to validate string literals, plus we can calculate the utf8_chars in advance. bool validate_utf8(uint8_t const * str, size_t size, size_t & pos, size_t & i)also returns the byte position of the first bad character, which can be used for better error reporting in addition to its use here as a fast path for the replacement character algorithm.- I removed the special error handling for invalid UTF-8 in lean files. These cases, in addition to the IO uses of
mk_string, are handled by the replacement algorithm, meaning that lean will complain in the bad UTF-8 tests with a "more normal" error saying thatU+FFFDis not a token; if you stick it inside a comment or guillemets it will be accepted like any other unicode character. This is not necessarily a good choice, but at least it's a less damaging default behavior than erasing the whole string. You can however do separate validation, by callingvalidate_utf8andlean_mk_string_uncheckedinstead oflean_mk_string.
Sounds like a nice and thorough cleanup, thanks a lot!
I'd lean towards rejecting invalid utf8 in lean files (not replacing, not using the empty string). Who knows what kind of surprises will occur when people accidentally have invalid utf8 and lean does something, but not what they expect to.
I'm happy as well modulo below -
I'd lean towards rejecting invalid utf8 in lean files (not replacing, not using the empty string)
Agreed, and more generally I/O functions should at least by default throw an error when decoding fails. I'm pretty sure this is the default in all the big stdlibs out there as well, no?
I agree, but can we agree that this is an improvement on the status quo and do that in a follow up PR? Handling UTF-8 errors properly means having new kinds of error reporting functions we don't have currently, as well as touching a bunch more code. The main purpose of this PR is to avoid undefined behavior in the present version caused by invalid lean objects being created, not to add new error reporting mechanisms.
Sure, I just want to make sure it is documented and discussed that this is a partial solution in that way. That is very relevant context for reviewing.
Sounds like as soon as the PR description reflects the current state I can press the button?
PR description updated
Linux-debug fails with
1658/1862 Test #1661: leanruntest_subset.lean ...................................***Failed 1.07 sec
../../common.sh: line 42: 48489 Aborted (core dumped) LEAN_BACKTRACE=0 "$@" 2>&1
48490 Done | perl -pe 's/(\?(\w|_\w+))\.[0-9]+/\1/g' > "$f.produced.out"
Unexpected return code 134 executing 'lean -Dlinter.all=false subset.lean'; expected 0. Output:
LEAN ASSERTION VIOLATION
File: /home/runner/work/lean4/lean4/build/stage1/include/lean/lean.h
Line: 473
lean_is_string(o)
(C)ontinue, (A)bort/exit, (S)top/trap
@digama0 Do you have time to look into the assertion failure?
Sorry for the delay, the assertion failure should be fixed now.