Native text encoding conversions
Crystal currently relies on iconv or GNU libiconv for conversions between text encodings. This has a few problems:
- iconv does not guarantee support for any encoding at all, yet it doesn't provide a standard way to query or enumerate this information. (The nonstandard `iconvlist` or `libiconvlist` is present in BSD libc and GNU libiconv respectively.) For all we know, an iconv implementation that supports neither UTF-8 nor UTF-16 is still POSIX-compliant. The same goes for the `invalid: :skip` option.
- The standard library already has separate APIs to deal with UTF-16, and technically UTF-32 too if we consider `Char` to be equivalent to `Int32`, yet they are not integrated into the usual transcoding APIs like `String#encode` and `IO#set_encoding`. In particular, it makes sense that these encodings should remain supported in those places, even when `-Dwithout_iconv` is defined.
- Some system iconv implementations are known to be buggy, such as the macOS one and the Android one (Bionic libc, API level 28+).
- GNU libiconv being licensed under LGPLv2.1 complicates certain deployment scenarios.
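To illustrate the second point, here is a minimal sketch of the two separate paths as they exist today. It only uses existing standard library methods (`String#to_utf16`, `String.from_utf16`, `String#encode`, `String.new(bytes, encoding)`); the assumption is that the system iconv accepts the name `"UTF-16LE"`, which, as noted above, is not guaranteed.

```crystal
str = "ab😂c"

# Dedicated UTF-16 API: operates on UInt16 code units, implemented in Crystal.
units = str.to_utf16
units                    # => Slice[97_u16, 98_u16, 55357_u16, 56834_u16, 99_u16]
String.from_utf16(units) # => "ab😂c"

# General transcoding API: operates on bytes and is backed by iconv,
# so it is unavailable with -Dwithout_iconv and depends on the system's
# iconv recognizing "UTF-16LE".
bytes = str.encode("UTF-16LE")
roundtrip = String.new(bytes, "UTF-16LE")
```

Both paths handle the same encoding, but only the first one keeps working when iconv is missing or broken.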
The essence of, for example, UTF-16 to UTF-8 conversion can be implemented on top of iconv's function signature as:
```crystal
def iconv_utf16_to_utf8(in_buffer : UInt8**, in_buffer_left : Int32*, out_buffer : UInt8**, out_buffer_left : Int32*)
  utf16_slice = in_buffer.value.to_slice(in_buffer_left.value).unsafe_slice_of(UInt16)
  String.each_utf16_char(utf16_slice) do |ch|
    in_bytesize = ch.ord >= 0x10000 ? 4 : 2
    ch_bytesize = ch.bytesize
    break unless out_buffer_left.value >= ch_bytesize
    ch.each_byte do |b|
      out_buffer.value.value = b
      out_buffer.value += 1
    end
    in_buffer.value += in_bytesize
    in_buffer_left.value -= in_bytesize
    out_buffer_left.value -= ch_bytesize
  end
end

str = Bytes[0x61, 0x00, 0x62, 0x00, 0x3D, 0xD8, 0x02, 0xDE, 0x63, 0x00]
bytes = uninitialized UInt8[32]
in_buffer = str.to_unsafe
in_buffer_left = str.bytesize
out_buffer = bytes.to_unsafe
out_buffer_left = bytes.size
iconv_utf16_to_utf8(pointerof(in_buffer), pointerof(in_buffer_left), pointerof(out_buffer), pointerof(out_buffer_left))
String.new(bytes.to_slice[0, bytes.size - out_buffer_left]) # => "ab😂c"
```
Going in the opposite direction would need something like #13639 to be equally concise, but the point is that we could indeed achieve this without using iconv at all. If both the source and destination encodings are one of UTF-8, UTF-16, UTF-32, or maybe ASCII, then we could use our own native transcoders instead of iconv; or if we are ambitious enough, we could port the entire set of ICU character set mapping tables in an automated manner, and remove our dependency on iconv.
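For reference, a rough sketch of the opposite direction (UTF-8 to UTF-16LE) without iconv might look like the following. It does not rely on #13639; instead it spells out the surrogate-pair arithmetic by hand, which is why it is less concise. The name `utf8_to_utf16le` is hypothetical and not part of the standard library, and this version returns a fresh buffer rather than following iconv's in/out pointer signature.

```crystal
# Hypothetical helper (not in the standard library): encodes a String,
# which Crystal stores as UTF-8, into UTF-16LE bytes without iconv.
def utf8_to_utf16le(str : String) : Bytes
  io = IO::Memory.new
  # each_char yields U+FFFD for invalid UTF-8 byte sequences.
  str.each_char do |ch|
    code = ch.ord
    if code < 0x10000
      # BMP code points map to a single 16-bit code unit.
      io.write_bytes(code.to_u16, IO::ByteFormat::LittleEndian)
    else
      # Supplementary code points are split into a surrogate pair.
      code -= 0x10000
      io.write_bytes((0xD800 + (code >> 10)).to_u16, IO::ByteFormat::LittleEndian)
      io.write_bytes((0xDC00 + (code & 0x3FF)).to_u16, IO::ByteFormat::LittleEndian)
    end
  end
  io.to_slice
end

utf8_to_utf16le("ab😂c") # => Bytes[0x61, 0x00, 0x62, 0x00, 0x3D, 0xD8, 0x02, 0xDE, 0x63, 0x00]
```

The result is the same byte sequence used as input in the UTF-16 to UTF-8 example, so the two sketches are inverses of each other for valid input.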
A pure Crystal implementation would be lovely. For the sake of argument, are there alternatives to libiconv?
- ICU4C's `ucnv_*` API, main problem is either the source or the destination has to be UTF-16
- encoding_c, bindings for the encoding_rs Rust library implementing the W3C Encoding Standard (by the way this is a good baseline of what a standard library should probably provide if we end up not having the same coverage as GNU libiconv)
- Most other alternatives are C++-only, a comparison is available at https://thephd.dev/the-c-c++-rust-string-text-encoding-api-landscape
Thank you 🙇
The W3C Encoding Standard already sets the bar quite high, but seems to support a good list of general encodings :+1:
There's a part 2 to the comparison article that focuses on C and presents ztd.cuneicode. I'm not saying we should use it, but it sounds like a solid reference, and both articles are a treasure trove of information.