`Feature`: implemented all `utf8` functions in stdlib
All functions are based on and fully comply with the Lua 5.3 manual (https://www.lua.org/manual/5.3/).
Features
- Implemented functions: `utf8.char`, `utf8.charpattern` (a string), `utf8.codes`, `utf8.codepoint`, `utf8.len`, `utf8.offset`.
- Added tests in `utf8.lua` for all new functions.
Note that Lua 5.4's `utf8` module behaves differently from Lua 5.3's. In Lua 5.3, all of the functions accept over-long UTF-8 sequences (code points above `10FFFF`, with encodings longer than 4 bytes), while Lua 5.4 disallows them by default and adds a `lax` parameter to conditionally allow them. The main issue is that Rust's `String`/`str`/`char` types only support fully valid UTF-8, so a lax mode would need a custom decoder.
For more info, see the Lua 5.4 docs on the utf8 module.
The main question is whether piccolo should even support the lax mode, since if we don't, it simplifies the implementation a lot; most parts could be implemented with byte slices and `utf8_chunks()`/`Utf8Chunks`.
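For example, here's a small sketch (with a hypothetical `describe` function) of how `<[u8]>::utf8_chunks()` (stabilized in Rust 1.79) walks a byte slice as alternating valid-`&str` chunks and runs of invalid bytes:

```rust
// Walk a byte slice as maximal valid UTF-8 chunks separated by
// runs of invalid bytes.
fn describe(bytes: &[u8]) {
    for chunk in bytes.utf8_chunks() {
        // The valid part is an ordinary &str.
        for c in chunk.valid().chars() {
            println!("scalar U+{:04X}", c as u32);
        }
        // The invalid part is empty except at a decoding error.
        if !chunk.invalid().is_empty() {
            println!("{} invalid byte(s)", chunk.invalid().len());
        }
    }
}
```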
For the `utf8.codes` iterator, the implementation can be simplified a bit by using an iterator function rather than interior mutability.
Here's a description of the iterator/generic for loop approach, from the Lua 5.3 docs (but it's roughly the same in 5.4).
A `for` statement like

```
for var_1, ···, var_n in explist do block end
```

is equivalent to the code:

```
do
  local f, s, var = explist
  while true do
    local var_1, ···, var_n = f(s, var)
    if var_1 == nil then break end
    var = var_1
    block
  end
end
```
The standard implementation of `utf8.codes` returns a step function, the string, and an initial index of 0; the for loop then calls the step function with the string and the previous index as arguments. This avoids the interior mutability / sequence machinery and needs only a single callback function.
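As a plain-Rust analogy (hypothetical, not piccolo's actual callback API), the whole protocol fits in one stateless function; positions are 1-based and the control value starts at 0, as in Lua:

```rust
// Step function analogy for utf8.codes: given the subject string and
// the 1-based byte position of the previously yielded code point
// (0 before the first call), return the position and value of the
// next code point, or None when the string is exhausted.
fn codes_step(s: &str, prev: usize) -> Option<(usize, char)> {
    // 0-based byte offset of the next code point to decode.
    let next = if prev == 0 {
        0
    } else {
        (prev - 1) + s[prev - 1..].chars().next()?.len_utf8()
    };
    s.get(next..)?.chars().next().map(|c| (next + 1, c))
}

fn main() {
    // Drive it the way a generic for loop would: feed the first
    // returned value back in as the control variable.
    let (s, mut ctrl) = ("héllo", 0);
    while let Some((p, c)) = codes_step(s, ctrl) {
        println!("{} U+{:04X}", p, c as u32);
        ctrl = p;
    }
}
```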
I've simplified `utf8.codes`, but it still has some bugs, which I'll fix as soon as possible. Regarding lax mode, I think it's better not to add it. However, we should mention somewhere (for example, in COMPATIBILITY.md, under a general heading for `utf8` rather than under each specific function) that it's not supported. I'll simplify the other functions as much as I can.
Thank you! I agree with not supporting lax mode, since it makes everything simpler.
If it helps with the indexing math, I'm pretty sure `adjust_index` has to differ depending on whether it's the starting or the ending index of a range, due to differences in the wrapping behavior.
```rust
// Convert a 1-based (possibly negative) Lua start index into a
// 0-based byte offset. Negative indices count back from the end of
// the string; out-of-range negative values clamp to 0.
fn convert_index(i: i64, len: usize) -> Option<usize> {
    let val = match i {
        0 => 0,
        v @ 1.. => v - 1,
        v @ ..=-1 => (len as i64 + v).max(0),
    };
    usize::try_from(val).ok()
}

// Convert a 1-based (possibly negative) Lua end index into a 0-based
// *exclusive* byte offset, hence the extra `+ 1` in the negative case.
fn convert_index_end(i: i64, len: usize) -> Option<usize> {
    let val = match i {
        v @ 0.. => v,
        v @ ..=-1 => (len as i64 + v + 1).max(0),
    };
    usize::try_from(val).ok()
}

// Byte-slice equivalent of Lua's string.sub(s, i, j).
fn sub(string: &[u8], i: i64, j: i64) -> &[u8] {
    let len = string.len();
    let i = convert_index(i, len).unwrap_or(usize::MAX);
    let j = convert_index_end(j, len).unwrap_or(usize::MAX).min(len);
    // An inverted or out-of-range pair selects the empty string, as in Lua.
    if i > j { &[] } else { &string[i..j] }
}
```
IIRC the above code worked when I tested it, and it should simplify `utf8.len` and `utf8.codepoint` (call `sub(i.unwrap_or(1), j.unwrap_or(len))` for `len`, and roughly `sub(i.unwrap_or(1), j.unwrap_or(i))` for `codepoint`).
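For a quick sanity check, here's a hypothetical usage of `sub` (not from the PR) mirroring Lua's `string.sub` semantics:

```rust
fn main() {
    let s = b"hello";
    assert_eq!(sub(s, 1, -1), b"hello"); // whole string
    assert_eq!(sub(s, 2, 3), b"el");     // like ("hello"):sub(2, 3)
    assert_eq!(sub(s, -3, -2), b"ll");   // negative indices count from the end
    assert_eq!(sub(s, 4, 2), b"");       // inverted range -> empty
}
```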
Here are a few other ideas for simplifying things:
- `len` could use `std::str::from_utf8`, and if the input is invalid UTF-8, you can get the index of the first invalid byte from `std::str::Utf8Error`. When it's valid, you can count the characters with `str.chars().count()`. It's simpler and should optimize to basically the same inner loop as the current version.
- `codepoint` should be able to use `from_utf8` as well; if there's an error, return the index, but otherwise you can do `stack.extend(str.chars().map(|c| Value::Integer(c as i64)))`.
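A minimal stand-alone sketch of both ideas (`utf8_len`/`utf8_codepoints` are hypothetical names, and the real versions would push onto piccolo's stack rather than return a `Vec`):

```rust
// Ok: the number of code points; Err: the byte index of the first
// invalid byte, which is what utf8.len reports on bad input.
fn utf8_len(bytes: &[u8]) -> Result<usize, usize> {
    match std::str::from_utf8(bytes) {
        Ok(s) => Ok(s.chars().count()),
        // valid_up_to() is the length of the valid prefix, i.e. the
        // offset of the first invalid byte.
        Err(e) => Err(e.valid_up_to()),
    }
}

// Same shape for codepoint: yield each scalar value as an integer.
fn utf8_codepoints(bytes: &[u8]) -> Result<Vec<i64>, usize> {
    std::str::from_utf8(bytes)
        .map(|s| s.chars().map(|c| c as i64).collect())
        .map_err(|e| e.valid_up_to())
}
```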
Also, the `s.to_str()` call in the `codes` callback can be removed, since the chunks logic already handles invalid UTF-8 correctly.
I've simplified `utf8.codepoint`, and I think we should also simplify `utf8.offset`, since the current version is pretty convoluted. What do you think?
For `len` and `codepoint`, `from_utf8` should be applied after taking the subslice, so that indices landing in the middle of a code point produce an error instead of a panic.
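For illustration (a hypothetical snippet), slicing the raw bytes first and validating afterwards turns a mid-code-point index into a recoverable error, whereas slicing the `&str` directly would panic:

```rust
fn main() {
    let s = "héllo"; // 'é' occupies bytes 1..3
    let bytes = s.as_bytes();
    // Subslice first, validate second: a clean, catchable error.
    assert!(std::str::from_utf8(&bytes[2..]).is_err());
    // Slicing the &str at the same index would panic instead:
    // let _ = &s[2..]; // byte index 2 is not a char boundary
}
```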
I'm not really sure there's a clean way to improve `utf8.offset`. There are three cases:
- if `n == 0`, find the start of the character containing the given index; I don't know of a clean way to do this with the Rust stdlib
- if `n > 0`, find the start of the `n`th character; this could be done with `str.char_indices().nth(n - 1)`
- if `n < 0`, find the start of the `n`th character from the end; this should be possible with `str.char_indices().rev().nth(-n - 1)` (sign flipped, since `n` is negative)

The issue is that this requires checking that the entire string is valid UTF-8, which would be slow with large strings.
Given that the docs say

> This function assumes that `s` is a valid UTF-8 string.
I think the current approach of scanning for the continuation bytes is fine.
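For reference, the scan itself is tiny (a sketch with a hypothetical name; the PR's version may differ): a byte `b` is a UTF-8 continuation byte iff `(b & 0xC0) == 0x80`, so the `n == 0` case just walks backwards until that stops holding.

```rust
// Find the start of the code point containing byte offset `i`.
// Assumes (per the Lua docs) that `bytes` is valid UTF-8, so the
// loop always stops at a lead byte or at the start of the string.
fn char_start(bytes: &[u8], mut i: usize) -> usize {
    while i > 0 && (bytes[i] & 0xC0) == 0x80 {
        i -= 1;
    }
    i
}
```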
Hi! I've implemented all of your recommendations. Can we proceed with merging?