[Rust] Support converting UTF-16 encoded strings to UTF-8 strings
Is your feature request related to a problem? Please describe.
Currently Fury xlang serialization uses UTF-8 for string encoding, which is not performance-efficient in many languages.
We introduced UTF-16 in https://fury.apache.org/docs/specification/fury_xlang_serialization_spec#string. But Rust's native string doesn't support UTF-16; it is UTF-8 encoded.
We should support transcoding UTF-16 encoded strings to UTF-8 strings in Fury Rust deserialization.
Describe the solution you'd like
Implement UTF-16 to UTF-8 conversion in Fury Rust. The implementation should use SIMD for better performance.
Additional context
#1413
Hi, I've created a basic demo. Since Rust's string encoding is UTF-8, we can directly use the String's API to convert UTF-16 encoded data into a string. However, this method doesn't utilize SIMD. So, I'm wondering what else needs to be done on top of this.
let bytes = [
    0b01101000, // 'h'
    0b00000000,
    0b01100101, // 'e'
    0b00000000,
    0b01101100, // 'l'
    0b00000000,
    0b01101100, // 'l'
    0b00000000,
    0b01101111, // 'o'
    0b00000000,
    0b00010110, // '世' in UTF-16 little-endian
    0b01001110,
    0b01001100, // '界' in UTF-16 little-endian
    0b01110101,
];
let utf16_vec: Vec<u16> = bytes
    .chunks_exact(2)
    .map(|chunk| u16::from_le_bytes([chunk[0], chunk[1]]))
    .collect();
let utf8_string = String::from_utf16(&utf16_vec).expect("Invalid UTF-16 sequence");
println!("{}", utf8_string);
I'm not familiar with high-performance computing, and I've only found the std::simd library; however, it is a nightly-only experimental API.
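For reference, a minimal sketch of an ASCII pre-check with that nightly API might look like this (hypothetical code; the std::simd item paths are unstable and have moved between nightly versions):

#![feature(portable_simd)] // nightly-only, as noted above
use std::simd::{u16x8, cmp::SimdPartialOrd};

// Test eight UTF-16 code units for ASCII with a single vector comparison.
fn chunk_is_ascii(chunk: [u16; 8]) -> bool {
    u16x8::from_array(chunk).simd_lt(u16x8::splat(0x80)).all()
}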
SIMD can be left to a later PR; this PR can implement the basic function only. The SIMD for this method should not be difficult. We may implement SIMD in Fury directly instead of depending on a library; in this way, we can minimize dependencies.
So for the current step, my task is to implement a function like fn utf16_to_string(utf16_data: &[u8], is_little_endian: bool) -> Result<String, Error> instead of using String's API, even though String is part of the std library? Got it.
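For what it's worth, a minimal sketch of that signature on stable Rust could look like the following (the error type here is a placeholder String, not Fury's actual error type):

// Hypothetical shape of the proposed API; error type is a placeholder.
fn utf16_to_string(utf16_data: &[u8], is_little_endian: bool) -> Result<String, String> {
    if utf16_data.len() % 2 != 0 {
        return Err("UTF-16 byte length must be even".to_string());
    }
    // Assemble u16 code units with the requested byte order.
    let units = utf16_data.chunks_exact(2).map(|c| {
        if is_little_endian {
            u16::from_le_bytes([c[0], c[1]])
        } else {
            u16::from_be_bytes([c[0], c[1]])
        }
    });
    // char::decode_utf16 validates surrogate pairs for us.
    char::decode_utf16(units)
        .collect::<Result<String, _>>()
        .map_err(|e| format!("invalid UTF-16: {}", e))
}

Since char::decode_utf16 already handles surrogate-pair validation, a scalar baseline doesn't need hand-rolled branching; manual encoding only becomes interesting once a SIMD fast path is layered on top.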
I just wrote a demo of this in C++, handling only big-endian UTF-16. I have a question about whether the byte order of UTF-16 encoding and decoding in xlang has been unified in Fury. I'll create a Rust version soon.
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>
#include <codecvt>
#include <locale>
std::vector<uint8_t> utf16_to_utf8(const std::vector<uint16_t> &utf16)
{
    std::vector<uint8_t> utf8;
    for (size_t i = 0; i < utf16.size(); ++i)
    {
        uint16_t wc = utf16[i];
        if (wc < 0x80)
        {
            // 1-byte UTF-8: 0???????
            utf8.push_back(static_cast<uint8_t>(wc));
        }
        else if (wc < 0x800)
        {
            // 2-byte UTF-8: 110????? 10??????
            // needs the low 11 bits of wc
            uint8_t second = static_cast<uint8_t>((wc & 0b111111) | 0b10000000);
            uint8_t first = static_cast<uint8_t>(((wc >> 6) & 0b11111) | 0b11000000);
            utf8.push_back(first);
            utf8.push_back(second);
        }
        else if (wc >= 0xD800 && wc <= 0xDBFF)
        {
            // Surrogate pair (4-byte UTF-8): needs one extra code unit
            if (i + 1 >= utf16.size())
            {
                throw std::runtime_error("Invalid UTF-16 string: truncated surrogate pair");
            }
            uint16_t wc2 = utf16[++i];
            if (wc2 < 0xDC00 || wc2 > 0xDFFF)
            {
                throw std::runtime_error("Invalid UTF-16 string: unpaired high surrogate");
            }
            // combine the surrogate pair into a Unicode code point
            uint32_t code_point = (((wc - 0xD800) << 10) | (wc2 - 0xDC00)) + 0x10000;
            // 11110??? 10?????? 10?????? 10??????
            // needs the low 21 bits of code_point
            uint8_t fourth = static_cast<uint8_t>((code_point & 0b111111) | 0b10000000);
            uint8_t third = static_cast<uint8_t>(((code_point >> 6) & 0b111111) | 0b10000000);
            uint8_t second = static_cast<uint8_t>(((code_point >> 12) & 0b111111) | 0b10000000);
            uint8_t first = static_cast<uint8_t>(((code_point >> 18) & 0b111) | 0b11110000);
            utf8.push_back(first);
            utf8.push_back(second);
            utf8.push_back(third);
            utf8.push_back(fourth);
        }
        else if (wc >= 0xDC00 && wc <= 0xDFFF)
        {
            // a low surrogate with no preceding high surrogate is invalid
            throw std::runtime_error("Invalid UTF-16 string: unpaired low surrogate");
        }
        else
        {
            // 3-byte UTF-8: 1110???? 10?????? 10??????
            // needs all 16 bits of wc
            uint8_t third = static_cast<uint8_t>((wc & 0b111111) | 0b10000000);
            uint8_t second = static_cast<uint8_t>(((wc >> 6) & 0b111111) | 0b10000000);
            uint8_t first = static_cast<uint8_t>((wc >> 12) | 0b11100000);
            utf8.push_back(first);
            utf8.push_back(second);
            utf8.push_back(third);
        }
    }
    return utf8;
}
int main()
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    std::u16string utf16_s = convert.from_bytes("Hé€lo, 世界!😀");
    std::vector<uint16_t> utf16;
    std::cout << "=====init utf16:" << std::endl;
    for (uint16_t c : utf16_s)
    {
        printf("0x%04x,", c);
        utf16.push_back(c);
    }
    std::cout << "\n";
    // ====================================
    std::vector<uint8_t> utf8 = utf16_to_utf8(utf16);
    // ====================================
    std::cout << "=====utf8:" << std::endl;
    for (uint8_t byte : utf8)
    {
        printf("0x%02x,", byte);
    }
    std::cout << std::endl;
    // final UTF-8 string
    std::cout << "final string: " << std::string(utf8.begin(), utf8.end());
    return 0;
}
It prints as follows:
=====init utf16:
0x0048,0x00e9,0x20ac,0x006c,0x006f,0x002c,0x0020,0x4e16,0x754c,0x0021,0xd83d,0xde00,
=====utf8:
0x48,0xc3,0xa9,0xe2,0x82,0xac,0x6c,0x6f,0x2c,0x20,0xe4,0xb8,0x96,0xe7,0x95,0x8c,0x21,0xf0,0x9f,0x98,0x80,
final string: Hé€lo, 世界!😀
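For comparison, the same per-code-point encoding branches translate to Rust roughly as follows (a sketch only; push_utf8 is a hypothetical helper, not Fury's actual code):

// Encode one Unicode code point into a UTF-8 byte buffer, mirroring the
// branch-per-range logic of the C++ demo above. (Hypothetical helper.)
fn push_utf8(out: &mut Vec<u8>, cp: u32) {
    if cp < 0x80 {
        // 1-byte: 0???????
        out.push(cp as u8);
    } else if cp < 0x800 {
        // 2-byte: 110????? 10??????
        out.push(0b1100_0000 | (cp >> 6) as u8);
        out.push(0b1000_0000 | (cp & 0b11_1111) as u8);
    } else if cp < 0x1_0000 {
        // 3-byte: 1110???? 10?????? 10??????
        out.push(0b1110_0000 | (cp >> 12) as u8);
        out.push(0b1000_0000 | ((cp >> 6) & 0b11_1111) as u8);
        out.push(0b1000_0000 | (cp & 0b11_1111) as u8);
    } else {
        // 4-byte: 11110??? 10?????? 10?????? 10??????
        out.push(0b1111_0000 | (cp >> 18) as u8);
        out.push(0b1000_0000 | ((cp >> 12) & 0b11_1111) as u8);
        out.push(0b1000_0000 | ((cp >> 6) & 0b11_1111) as u8);
        out.push(0b1000_0000 | (cp & 0b11_1111) as u8);
    }
}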
The byte order is little-endian currently, but we plan to add big-endian support later to enable zero-copy string encoding. So maybe we can leave an option in the current implementation.
Hi, I'd like to continue by implementing the SIMD approach.
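One dependency-free direction on stable Rust (an assumption on my part, not a settled design) is a SWAR fast path: pack four u16 code units into one u64 and test them for ASCII in a single operation, falling back to the scalar path when the check fails:

// SWAR ASCII pre-check: any code unit >= 0x80 sets a bit under this mask.
fn all_ascii(units: &[u16]) -> bool {
    const MASK: u64 = 0xFF80_FF80_FF80_FF80;
    let mut chunks = units.chunks_exact(4);
    for c in chunks.by_ref() {
        let packed = (c[0] as u64)
            | (c[1] as u64) << 16
            | (c[2] as u64) << 32
            | (c[3] as u64) << 48;
        if packed & MASK != 0 {
            return false;
        }
    }
    // Handle the trailing 0-3 code units scalar-wise.
    chunks.remainder().iter().all(|&u| u < 0x80)
}

When the check passes, each code unit narrows directly to one output byte, which makes the common ASCII case close to a straight copy.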