lean4 fix: validate UTF-8 at C++ -> Lean boundary

trafficstars

Continuation of #3958. To ensure that lean code is able to uphold the invariant that Strings are valid UTF-8 (which is assumed by the lean model), we have to make sure that no lean objects are created with invalid UTF-8. #3958 covers the case of lean code creating strings via fromUTF8Unchecked, but there are still many cases where C++ code constructs strings from a const char * or std::string with unclear UTF-8 status.

To address this and minimize accidental missed validation, the (lean_)mk_string function is modified to validate UTF-8. The original function is renamed to mk_string_unchecked, with several other variants depending on whether we know the string is UTF-8 or ASCII and whether we have the length and/or utf8 char count on hand. I reviewed every function which leads to mk_string or its variants in the C code, and used the appropriate validation function, defaulting to mk_string if the provenance is unclear.

This PR adds no new error handling paths, meaning that incorrect UTF-8 will still produce incorrect results in e.g. IO functions, they are just not causing unsound behavior anymore. A subsequent PR will handle adding better error reporting for bad UTF-8.

Apr 20 '24 20:04 digama0

!bench

Apr 20 '24 20:04 digama0

(disabled auto-merge to run a benchmark first)

Apr 20 '24 20:04 digama0

Mathlib CI status (docs):

❗ Std/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase 62cdb51ed5b9d8487877d5a4adbcd4659d81fc6a --onto 291bb84c972dae470e8f5602729e9d5a5e9433a2. (2024-04-20 20:29:09)
❗ Std/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase fb135b8cfe83c8d9286a9d0198114b7cb306a67a --onto 62cdb51ed5b9d8487877d5a4adbcd4659d81fc6a. (2024-04-23 21:32:17)

Apr 20 '24 20:04 leanprover-community-mathlib4-bot

Here are the benchmark results for commit 88cad9a33495173f56f41b79db791d4bea1f8470. There were significant changes against commit 62cdb51ed5b9d8487877d5a4adbcd4659d81fc6a:

  Benchmark   Metric             Change
  ================================================
+ stdlib      tactic execution    -1.9% (-302.2 σ)

Apr 20 '24 20:04 leanprover-bot

The affected tests are regression tests from #301, #1707, #1690, all of which appear to have been strange behaviors of lean caused by invalid UTF-8. The new behavior is for the C code to throw an unspecific error before even calling lean code; this is inevitable because the lean parser/elaborator assumes it can store the file in a String so we can't really use most of lean's error handling.

Apr 20 '24 21:04 digama0

I pushed a new version of the PR. Summary of changes (w.r.t master, not ver.1):

lean_mk_string_unchecked(const char * s, size_t sz, size_t utf8_chars) was previously lean_mk_string_core
- Use this when we have both the length and utf8_length values available and also know the input is valid UTF-8.
lean_mk_string_from_bytes_unchecked(const char * s, size_t sz) was lean_mk_string_from_bytes
- Use this when we have the length and know the input is valid UTF-8, but need to compute the utf8_chars.
lean_mk_string_from_bytes(const char * s, size_t sz) does validation now, and if validation fails it converts invalid UTF-8 sequences to U+FFFD instead of just returning the empty string.
- Use this when we have a (char*, len) pair but do not know it encodes valid UTF-8
lean_mk_string(const char * s) is a wrapper around lean_mk_string_from_bytes
- Use this when we have char*, not necessarily UTF-8 and need strlen to get the length
lean_string_from_utf8_unchecked was lean_string_from_utf8
- This is the implementation of String.fromUTF8, you should probably not call it directly. I added _unchecked to the name because the C version doesn't validate, even though the lean version takes a validity proof witness as an argument.
lean_mk_ascii_string_unchecked(char const * s) was lean_mk_ascii_string
- Use this when you know the string is valid ASCII but need strlen to get the length.
lean::mk_ascii_string_unchecked(std::string const & s) was lean::mk_ascii_string
- Use this when you know the string is valid ASCII and have a std::string (which counts the length).

Other (non-API) changes:

The code generator produces calls to lean_mk_string_unchecked now instead of lean_mk_string_from_bytes, since we don't want to validate string literals, plus we can calculate the utf8_chars in advance.
bool validate_utf8(uint8_t const * str, size_t size, size_t & pos, size_t & i) also returns the byte position of the first bad character, which can be used for better error reporting in addition to its use here as a fast path for the replacement character algorithm.
I removed the special error handling for invalid UTF-8 in lean files. These cases, in addition to the IO uses of mk_string, are handled by the replacement algorithm, meaning that lean will complain in the bad UTF-8 tests with a "more normal" error saying that U+FFFD is not a token; if you stick it inside a comment or guillemets it will be accepted like any other unicode character. This is not necessarily a good choice, but at least it's a less damaging default behavior than erasing the whole string. You can however do separate validation, by calling validate_utf8 and lean_mk_string_unchecked instead of lean_mk_string.

Apr 23 '24 21:04 digama0

Sounds like a nice and thorough cleanup, thanks a lot!

I'd lean towards rejecting invalid utf8 in lean files (not replacing, not using the empty string). Who knows what kind of surprises will occur when people accidentally have invalid utf8 and lean does something, but not what they expect to.

Apr 24 '24 07:04 nomeata

I'm happy as well modulo below -

I'd lean towards rejecting invalid utf8 in lean files (not replacing, not using the empty string)

Agreed, and more generally I/O functions should at least by default throw an error when decoding fails. I'm pretty sure this is the default in all the big stdlibs out there as well, no?

Apr 24 '24 08:04 Kha

I agree, but can we agree that this is an improvement on the status quo and do that in a follow up PR? Handling UTF-8 errors properly means having new kinds of error reporting functions we don't have currently, as well as touching a bunch more code. The main purpose of this PR is to avoid undefined behavior in the present version caused by invalid lean objects being created, not to add new error reporting mechanisms.

Apr 24 '24 17:04 digama0

Sure, I just want to make sure it is documented and discussed that this is a partial solution in that way. That is very relevant context for reviewing.

Apr 25 '24 07:04 Kha

Sounds like as soon as the PR description reflects the current state I can press the button?

Apr 25 '24 08:04 nomeata

PR description updated

Apr 25 '24 16:04 digama0

Linux-debug fails with

1658/1862 Test #1661: leanruntest_subset.lean ...................................***Failed    1.07 sec
../../common.sh: line 42: 48489 Aborted                 (core dumped) LEAN_BACKTRACE=0 "$@" 2>&1
     48490 Done                    | perl -pe 's/(\?(\w|_\w+))\.[0-9]+/\1/g' > "$f.produced.out"
Unexpected return code 134 executing 'lean -Dlinter.all=false subset.lean'; expected 0. Output:
LEAN ASSERTION VIOLATION
File: /home/runner/work/lean4/lean4/build/stage1/include/lean/lean.h
Line: 473
lean_is_string(o)
(C)ontinue, (A)bort/exit, (S)top/trap

Apr 25 '24 18:04 nomeata

@digama0 Do you have time to look into the assertion failure?

May 21 '24 09:05 Kha

Sorry for the delay, the assertion failure should be fixed now.

Jun 18 '24 21:06 digama0

lean4 lean4 copied to clipboard

fix: validate UTF-8 at C++ -> Lean boundary

lean4
lean4 copied to clipboard