piccolo icon indicating copy to clipboard operation
piccolo copied to clipboard

`Feature`: implemented all `utf8` functions in stdlib

Open reloginn opened this issue 7 months ago • 7 comments

All functions are based on and fully comply with the Lua 5.3 manual (https://www.lua.org/manual/5.3/).

Features

  • Implemented functions utf8.char, utf8.charpattern (string), utf8.codes, utf8.codepoint, utf8.len, utf8.offset.
  • Added tests in utf8.lua for all new functions.

reloginn avatar May 02 '25 19:05 reloginn

Note that Lua 5.4 has different behavior from Lua 5.3 for the utf8 module. In Lua 5.3, all of the functions can work with over-long utf8 code points (code points above 10FFFF, with encodings longer than 4 bytes), while Lua 5.4 disallows them by default and adds a lax parameter to conditionally allow them. The main issue is that Rust's String/str/char types only support fully valid UTF-8, so the lax mode has to be custom.

For more info, see the Lua 5.4 docs on the utf8 module.

The main question is whether piccolo should even support the lax mode, since if we don't, it simplifies the implementation a lot; most parts could be implemented with byte slices and utf_chunks()/Utf8Chunks.

Aeledfyr avatar May 25 '25 10:05 Aeledfyr

For the utf8.codes iterator, the implementation can be simplified a bit by using an iterator function rather than interior mutability

Here's a description of the iterator/generic for loop approach, from the Lua 5.3 docs (but it's roughly the same in 5.4).

A for statement like

for var_1, ···, var_n in explist do block end

is equivalent to the code:

do
 local f, s, var = explist
 while true do
   local var_1, ···, var_n = f(s, var)
   if var_1 == nil then break end
   var = var_1
   block
 end
end

The standard implementation of utf8.codes returns a step function, the string, and an initial index of 0; then the for loop with call the step function with the string and the index as arguments. This lets you avoid the interior mutability / sequence stuff and instead have a single callback function.

Aeledfyr avatar May 25 '25 10:05 Aeledfyr

I've simplified utf8.codes, but it still has some bugs, which I'll fix as soon as possible. Regarding lax mode, I think it's better not to add it. However, we should mention somewhere (for example, in COMPATIBILITY.md, under a general heading for utf8, rather than for each specific function) that it's not supported. I'll simplify the other functions as much as I can.

reloginn avatar Jun 02 '25 18:06 reloginn

Thank you! I agree with not supporting lax mode, since it makes everything simpler.

If it helps with the indexing math, I'm pretty sure adjust_index has to be different depending on whether it's a starting index or an ending index of a range, due to differences in the wrapping behavior.

fn convert_index(i: i64, len: usize) -> Option<usize> {
    let val = match i {
        0 => 0,
        v @ 1.. => v - 1,
        v @ ..=-1 => (len as i64 + v).max(0),
    };
    usize::try_from(val).ok()
}

fn convert_index_end(i: i64, len: usize) -> Option<usize> {
    let val = match i {
        v @ 0.. => v,
        v @ ..=-1 => (len as i64 + v + 1).max(0),
    };
    usize::try_from(val).ok()
}

fn sub(string: &[u8], i: i64, j: i64) -> &[u8] {
    let len = string.len();
    let i = convert_index(i, len).unwrap_or(usize::MAX);
    let j = convert_index_end(j, len).unwrap_or(usize::MAX).min(len);
    let slice = if i > j { &[] } else { &string[i..j] };
    slice
}

IIRC the above code worked when I tested it, and should simplify utf8.len and utf8.codepoint (call sub(i.unwrap_or(1), j.unwrap_or(len)) for len and roughly sub(i.unwrap_or(1), j.unwrap_or(i)) for codepoint).

Here are a few other ideas for ways of simplifying it:

  • len could use std::str::from_utf8, and if it's invalid utf8, you can get the index of the invalid byte from std::str::Utf8Error. When it's valid, you can count the characters with str.chars().count(). It's simpler and should optimize to basically the same inner loop as the current version.
  • codepoint should also be able to use from_utf8 as well; if there's an error, return the index, but otherwise you can do stack.extend(str.chars().map(|c| Value::Integer(c as i64))

Also, the s.to_str() call in the codes callback can be removed, since the chunks part already correctly handles invalid utf8.

Aeledfyr avatar Jun 02 '25 23:06 Aeledfyr

I've simplified utf8.codepoint and I think we need to simplify utf8.offset because it doesn't look good. What do you think about it?

reloginn avatar Jun 03 '25 17:06 reloginn

For len and codepoint, the from_utf8 should be done after the subslice to have it error instead of panicking when the indices are in the middle of code points.

I'm not really sure if there's a clean way to improve utf8.offset. There are three cases there:

  • if n = 0, find the start of the character containing the index; I don't know of a clean way to do this with the rust stdlib
  • if n > 0, find the start of the nth char; this could be done with str.char_indices().nth(n - 1)
  • if n < 0, find the start of the nth char from the end; this should be possible with str.char_indices().rev().nth(n - 1)? The issue is that this requires checking that the entire string is valid utf8, which would be slow with large strings.

Given that the docs say

This function assumes that s is a valid UTF-8 string.

I think the current approach of scanning for the continuation bytes is fine.

Aeledfyr avatar Jun 03 '25 17:06 Aeledfyr

Hi! I have implemented all of your recommendations. May we proceed with the integration?

reloginn avatar Aug 10 '25 15:08 reloginn