fury [Rust] Support converting UTF-16 encoded strings to UTF-8 strings

Is your feature request related to a problem? Please describe.

Currently, Fury xlang serialization uses UTF-8 for string encoding, which performs poorly in many languages.

We introduced UTF-16 in https://fury.apache.org/docs/specification/fury_xlang_serialization_spec#string . However, Rust's native string type doesn't support UTF-16; it is UTF-8 encoded.

We should support transcoding UTF-16 encoded strings to UTF-8 strings in Fury Rust deserialization.

Describe the solution you'd like

Implement UTF-16 to UTF-8 conversion in Fury Rust. The implementation should use SIMD for speed.

Additional context

#1413

Apr 19 '24 04:04 chaokunyang

Hi, I've created a basic demo. Since Rust strings are UTF-8 encoded, we can use String's API directly to convert UTF-16 encoded data into a string. However, this method doesn't use SIMD, so I'm wondering what else needs to be done on top of it.

fn main() {
    // "hello世界" as UTF-16 little-endian bytes
    let bytes: [u8; 14] = [
        0b01101000, // 'h'
        0b00000000,
        0b01100101, // 'e'
        0b00000000,
        0b01101100, // 'l'
        0b00000000,
        0b01101100, // 'l'
        0b00000000,
        0b01101111, // 'o'
        0b00000000,
        0b00010110, // '世' in UTF-16 little-endian
        0b01001110,
        0b01001100, // '界' in UTF-16 little-endian
        0b01110101,
    ];
    // Combine each byte pair into one little-endian UTF-16 code unit
    let utf16_vec: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|chunk| u16::from_le_bytes([chunk[0], chunk[1]]))
        .collect();
    let utf8_string = String::from_utf16(&utf16_vec).expect("Invalid UTF-16 sequence");
    println!("{}", utf8_string);
}

I'm not familiar with high-performance computing. The only option I've found is the std::simd library, but it is a nightly-only experimental API.
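One option that stays on stable Rust is a word-at-a-time (SWAR) ASCII fast path. The sketch below is not real SIMD and is not taken from Fury: it packs four code units into a u64 to test them with one mask, and falls back to std::char::decode_utf16 for everything else. The names all_ascii4 and utf16_to_string_fast are hypothetical.

```rust
use std::char::{decode_utf16, DecodeUtf16Error};

/// Returns true when all four code units are ASCII (< 0x80).
/// Hypothetical helper for this sketch; `chunk` must hold at least 4 units.
#[inline]
fn all_ascii4(chunk: &[u16]) -> bool {
    // Pack four u16 code units into one u64 so a single mask tests them all.
    let packed = (chunk[0] as u64)
        | ((chunk[1] as u64) << 16)
        | ((chunk[2] as u64) << 32)
        | ((chunk[3] as u64) << 48);
    (packed & 0xFF80_FF80_FF80_FF80) == 0
}

/// Converts UTF-16 code units to a String, skipping full decoding for ASCII runs.
pub fn utf16_to_string_fast(units: &[u16]) -> Result<String, DecodeUtf16Error> {
    let mut out = String::with_capacity(units.len());
    let mut i = 0;
    while i < units.len() {
        if i + 4 <= units.len() && all_ascii4(&units[i..i + 4]) {
            // Fast path: four ASCII code units become four UTF-8 bytes.
            for &u in &units[i..i + 4] {
                out.push(u as u8 as char);
            }
            i += 4;
        } else {
            // Slow path: decode exactly one scalar value (one or two code units).
            let c = decode_utf16(units[i..].iter().copied())
                .next()
                .expect("slice is non-empty")?;
            out.push(c);
            i += c.len_utf16();
        }
    }
    Ok(out)
}
```

A real SIMD version would narrow 8 or 16 code units per instruction, but the control flow (check a block, bulk-copy if ASCII, otherwise decode one character) would likely look similar.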

Jul 06 '24 03:07 urlyy

SIMD can be left for a later PR. This PR can implement the basic function only. The SIMD for this method should not be difficult. We may implement SIMD in Fury directly instead of depending on a library. This way, we can minimize dependencies.

Jul 06 '24 04:07 chaokunyang

So for the current step, my task is to implement a function like fn utf16_to_string(utf16_data: &[u8], is_little_endian: bool) -> Result<String, Error> instead of using String's API, even though String is part of the std library? Got it.
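A minimal sketch of such a function, assuming the signature above. The Utf16Error type is hypothetical (Fury's actual error type may differ), and the encoder is hand-written rather than delegating to String::from_utf16.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; Fury's real error type may differ.
#[derive(Debug, PartialEq)]
pub enum Utf16Error {
    /// The input has an odd number of bytes.
    OddLength,
    /// The surrogate at this code-unit index has no valid partner.
    UnpairedSurrogate(usize),
}

impl fmt::Display for Utf16Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Utf16Error::OddLength => write!(f, "UTF-16 input has an odd byte length"),
            Utf16Error::UnpairedSurrogate(i) => write!(f, "unpaired surrogate at code unit {}", i),
        }
    }
}

impl std::error::Error for Utf16Error {}

pub fn utf16_to_string(utf16_data: &[u8], is_little_endian: bool) -> Result<String, Utf16Error> {
    if utf16_data.len() % 2 != 0 {
        return Err(Utf16Error::OddLength);
    }
    // Read the i-th code unit in the requested byte order.
    let unit_at = |i: usize| -> u32 {
        let pair = [utf16_data[2 * i], utf16_data[2 * i + 1]];
        let unit = if is_little_endian { u16::from_le_bytes(pair) } else { u16::from_be_bytes(pair) };
        unit as u32
    };
    let n = utf16_data.len() / 2;
    // Each UTF-16 code unit expands to at most 3 UTF-8 bytes.
    let mut out: Vec<u8> = Vec::with_capacity(n * 3);
    let mut i = 0;
    while i < n {
        let wc = unit_at(i);
        let cp = match wc {
            0xD800..=0xDBFF => {
                // High surrogate: the next unit must be a low surrogate.
                if i + 1 >= n || !(0xDC00..=0xDFFF).contains(&unit_at(i + 1)) {
                    return Err(Utf16Error::UnpairedSurrogate(i));
                }
                i += 1;
                (((wc - 0xD800) << 10) | (unit_at(i) - 0xDC00)) + 0x10000
            }
            0xDC00..=0xDFFF => return Err(Utf16Error::UnpairedSurrogate(i)),
            _ => wc,
        };
        if cp < 0x80 {
            out.push(cp as u8);
        } else if cp < 0x800 {
            out.push((0xC0 | (cp >> 6)) as u8);
            out.push((0x80 | (cp & 0x3F)) as u8);
        } else if cp < 0x10000 {
            out.push((0xE0 | (cp >> 12)) as u8);
            out.push((0x80 | ((cp >> 6) & 0x3F)) as u8);
            out.push((0x80 | (cp & 0x3F)) as u8);
        } else {
            out.push((0xF0 | (cp >> 18)) as u8);
            out.push((0x80 | ((cp >> 12) & 0x3F)) as u8);
            out.push((0x80 | ((cp >> 6) & 0x3F)) as u8);
            out.push((0x80 | (cp & 0x3F)) as u8);
        }
        i += 1;
    }
    // from_utf8 re-validates the bytes produced by the encoder above.
    Ok(String::from_utf8(out).expect("encoder produced invalid UTF-8"))
}
```

The final String::from_utf8 call adds a validation pass over the output. It is kept here as a safety net for the sketch.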

Jul 06 '24 05:07 urlyy

I just wrote a demo of this in C++, handling only big-endian UTF-16. One question: has the byte order for UTF-16 encoding and decoding in xlang been unified in Fury? I'll create a Rust version soon.

#include <codecvt>
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

std::vector<uint8_t> utf16_to_utf8(const std::vector<uint16_t> &utf16)
{
    std::vector<uint8_t> utf8;
    for (size_t i = 0; i < utf16.size(); ++i)
    {
        uint16_t wc = utf16[i];
        if (wc < 0x80)
        {
            // 1-byte UTF-8
            utf8.push_back(static_cast<uint8_t>(wc));
        }
        else if (wc < 0x800)
        {
            // 2-byte UTF-8
            // 110????? 10??????
            // uses the low 11 bits of wc
            uint8_t second = static_cast<uint8_t>((wc & 0b111111) | 0b10000000);
            uint8_t first = static_cast<uint8_t>(((wc >> 6) & 0b11111) | 0b11000000);
            utf8.push_back(first);
            utf8.push_back(second);
        }
        else if (wc >= 0xD800 && wc <= 0xDBFF)
        {
            // High surrogate: must be followed by a low surrogate (4-byte UTF-8)
            if (i + 1 >= utf16.size() || utf16[i + 1] < 0xDC00 || utf16[i + 1] > 0xDFFF)
            {
                throw std::runtime_error("Invalid UTF-16 string: unpaired high surrogate");
            }
            uint16_t wc2 = utf16[++i];
            // surrogate pair to Unicode code point
            uint32_t code_point = (((wc - 0xD800u) << 10) | (wc2 - 0xDC00u)) + 0x10000;
            // 11110??? 10?????? 10?????? 10??????
            // uses the low 21 bits of code_point
            uint8_t fourth = static_cast<uint8_t>((code_point & 0b111111) | 0b10000000);
            uint8_t third = static_cast<uint8_t>(((code_point >> 6) & 0b111111) | 0b10000000);
            uint8_t second = static_cast<uint8_t>(((code_point >> 12) & 0b111111) | 0b10000000);
            uint8_t first = static_cast<uint8_t>(((code_point >> 18) & 0b111) | 0b11110000);
            utf8.push_back(first);
            utf8.push_back(second);
            utf8.push_back(third);
            utf8.push_back(fourth);
        }
        else if (wc >= 0xDC00 && wc <= 0xDFFF)
        {
            // A low surrogate without a preceding high surrogate is invalid
            throw std::runtime_error("Invalid UTF-16 string: unpaired low surrogate");
        }
        else
        {
            // 3-byte UTF-8
            // 1110???? 10?????? 10??????
            // uses all 16 bits of wc
            uint8_t third = static_cast<uint8_t>((wc & 0b111111) | 0b10000000);
            uint8_t second = static_cast<uint8_t>(((wc >> 6) & 0b111111) | 0b10000000);
            uint8_t first = static_cast<uint8_t>((wc >> 12) | 0b11100000);
            utf8.push_back(first);
            utf8.push_back(second);
            utf8.push_back(third);
        }
    }
    return utf8;
}

int main()
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    std::u16string utf16_s = convert.from_bytes("Hé€lo, 世界!😀");
    std::vector<uint16_t> utf16;
    std::cout << "=====init utf16:" << std::endl;
    for (uint16_t c : utf16_s)
    {
        printf("0x%04x,", c);
        utf16.push_back(c);
    }
    std::cout << "\n";
    //   ====================================
    std::vector<uint8_t> utf8 = utf16_to_utf8(utf16);
    //   ====================================
    std::cout << "=====utf8:" << std::endl;
    for (uint8_t byte : utf8)
    {
        printf("0x%02x,", byte);
    }
    std::cout << std::endl;
    // final UTF-8 string
    std::cout << "final string: " << std::string(utf8.begin(), utf8.end());
    return 0;
}

It prints the following:

=====init utf16:
0x0048,0x00e9,0x20ac,0x006c,0x006f,0x002c,0x0020,0x4e16,0x754c,0x0021,0xd83d,0xde00,
=====utf8:
0x48,0xc3,0xa9,0xe2,0x82,0xac,0x6c,0x6f,0x2c,0x20,0xe4,0xb8,0x96,0xe7,0x95,0x8c,0x21,0xf0,0x9f,0x98,0x80,
final string: Hé€lo, 世界!😀

Jul 10 '24 08:07 urlyy

The byte order is currently little endian, but we plan to add big-endian support later to enable zero-copy string encoding. So maybe we can leave an option for it in the current implementation.
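One way such an option could look is sketched below. The ByteOrder enum and decode_utf16_bytes function are hypothetical names, not part of Fury's API, and the sketch uses std::char::decode_utf16 for the decoding itself.

```rust
use std::char::{decode_utf16, DecodeUtf16Error};

/// Hypothetical option type for this sketch; not part of Fury's API.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ByteOrder {
    Little,
    Big,
}

/// Decodes UTF-16 bytes in the given byte order into a String.
/// Note: chunks_exact(2) silently ignores a trailing odd byte; a real
/// implementation would probably want to report that as an error.
pub fn decode_utf16_bytes(bytes: &[u8], order: ByteOrder) -> Result<String, DecodeUtf16Error> {
    let units = bytes.chunks_exact(2).map(|pair| {
        let pair = [pair[0], pair[1]];
        match order {
            ByteOrder::Little => u16::from_le_bytes(pair),
            ByteOrder::Big => u16::from_be_bytes(pair),
        }
    });
    // Collecting Result<char, _> items stops at the first invalid surrogate.
    decode_utf16(units).collect()
}
```

Defaulting callers to ByteOrder::Little would match the current behavior, with ByteOrder::Big available once big-endian support is added.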

Jul 10 '24 09:07 chaokunyang

Hi, I'd like to continue to implement the simd approach.

Jul 23 '24 16:07 urlyy