Smaller lengths array?

Open Andersama opened this issue 2 years ago • 1 comments

Found this code referenced inside imgui, so far as I can tell I'm not sure why the lengths array needs to contain 32 results.

The reason being is that the bits that control the length of a utf8 sequence are the leading 1s up at the front of a byte with a terminating 0 (assuming there was software out there dealing with utf8 in a bitstream).

0 -> ascii -> 1 byte 10 -> continuation -> 1 byte 110 -> 2 byte sequence 1110 -> 3 byte sequence 11110 -> 4 byte sequence

111110 -> presumably a 5 byte sequence 1111110 -> presumably a 6 byte sequence 11111110 -> presumably a 7 byte sequence 11111111 -> presumably an 8 byte sequence

However, currently utf8 only deals with at worst 4 byte code points so while the pattern could continue at the moment there aren't 5 byte sequences. Which means you could just drop the terminating 0 for the 4 byte sequence and work from there.

Now I'm not sure about the rest of the code dealing with errors and masks and shifting around...but presumably...an equivalent function exists, but with a smaller table.

    static const char lengths[] = {
1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 4
    };
    
    static const int masks[]  = {0x00, 0x7f, 0x1f, 0x0f, 0x07};
    static const uint32_t mins[] = {4194304, 0, 128, 2048, 65536};
    static const int shiftc[] = {0, 18, 12, 6, 0};
    static const int shifte[] = {0, 6, 4, 2, 0};

    unsigned char *s = buf;
    int len = lengths[s[0] >> 4]; // here we just grab the upper nibble (the 4 bits which determine the length)
    // I kept the 0s which appear to contribute to determining the error.
    // in theory the code past this point works in a similar fashion except for the error handling of the erroneous sequence of 11111

For the remaining decoding section here:

    *c  = (uint32_t)(s[0] & masks[len]) << 18;
    *c |= (uint32_t)(s[1] & 0x3f) << 12;
    *c |= (uint32_t)(s[2] & 0x3f) <<  6;
    *c |= (uint32_t)(s[3] & 0x3f) <<  0;
    *c >>= shiftc[len];

I haven't exactly tested this...but it strikes me that the masking operation doesn't need a table.

    *c  = (uint32_t)((s[0] << len) >> len) << 18;
    // alternatively
    *c  = (uint32_t)(s[0] & (0xff >> len)) << 18;

Similar concepts could apply for the shiftc and shifte tables as these are multiples of 6 and 2.

   //*c >>= shiftc[len];
   *c >>= 24 - 6*len;

   //*e >>= shifte[len];
   *e >>= 8 - 2*len;

I'm guessing you've probably written code like this already, but I was curious.

Sep 21 '23 17:09 Andersama

The extra steps are for error checking, so that invalidity is preserved all the way into the error accumulator. The tests are thorough and trivial to run: Either "make check" or just compile and run "test/tests.c" with your favorite C compiler. At minimum any changes must pass the tests, otherwise they've omitted a critical check. So I encourage you try out your changes in the tests!

The 5-bit, 32-element length table is because the 4-byte sequence prefix is actually 5 bits: 11110. If that last bit is not zero, it's invalid, which is captured by being "zero length." If only 4 bits are examined, that won't be checked.

Without masking, a bit that's supposed to be zero may shift to a position that turns an invalid input (e.g. an overlong encoding) into something that looks valid. Though perhaps some masking redundant, as it would be caught by other checks anyway.

Your suggestion for replacing shiftc with some arithmetic appears to valid. It just trips the error checks differently. Though it introduces a multiplication, and the current implementation has none. Perhaps that's worth doing to remove shiftc.

Your suggestion for replacing shifte works only if it's also masked, as otherwise it shifts out error checks:

e >>= (8 - 2len) & 7;

Perhaps that's also worth some arithmetic to drop accessing shifte. (Of course, that 2*len doesn't count as multiplication.)

Sep 23 '23 21:09 skeeto