crystal Native text encoding conversions


Crystal currently relies on iconv or GNU libiconv for conversions between text encodings. This has a few problems:

iconv does not guarantee support for any encoding at all, and it provides no standard way to query or enumerate the encodings it does support. (The nonstandard iconvlist and libiconvlist are present in BSD libc and GNU libiconv respectively.) For all we know, an iconv implementation that supports neither UTF-8 nor UTF-16 is still POSIX-compliant. The same goes for the invalid: :skip option.
The standard library already has separate APIs to deal with UTF-16, and technically UTF-32 too if we consider Char to be equivalent to Int32, yet they are not integrated into the usual transcoding APIs like String#encode and IO#set_encoding. In particular, it makes sense that these encodings should remain supported in those places, even when -Dwithout_iconv is defined.
Some system iconv implementations are known to be buggy, such as the macOS one and the Android one (Bionic libc, API level 28+).
GNU libiconv being licensed under LGPLv2.1 complicates certain deployment scenarios.
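
To illustrate the second point, here is a minimal sketch of the split between the dedicated UTF-16 API and the generic transcoding API. It only uses String#to_utf16, String.from_utf16, and String#encode; the sample string is arbitrary, and the comment about -Dwithout_iconv restates the issue text rather than documenting exact failure behavior.

```crystal
str = "ab😂c"

# Dedicated UTF-16 API: does not go through iconv, yields 16-bit code units.
utf16 = str.to_utf16                    # Slice(UInt16)
utf16.size                              # => 5 (😂 needs a surrogate pair)
roundtrip = String.from_utf16(utf16)    # => "ab😂c"

# "UTF-32" via Char: every Char holds exactly one code point.
codepoints = str.chars.map(&.ord)       # Array(Int32)
codepoints.size                         # => 4

# Generic transcoding API: this path goes through iconv, so per the issue
# it is not usable when compiling with -Dwithout_iconv.
utf16le = str.encode("UTF-16LE")        # Bytes
utf16le.size                            # => 10
```

Both routes produce the same UTF-16 data, but only the second one is reachable from String#encode and IO#set_encoding.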

The essence of, for example, UTF-16 to UTF-8 conversion can be implemented on top of iconv's function signature as:

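# Transcodes as many whole characters as fit into the output buffer, advancing
# both pointers and decrementing both remaining byte counts, like iconv(3).
def iconv_utf16_to_utf8(in_buffer : UInt8**, in_buffer_left : Int32*, out_buffer : UInt8**, out_buffer_left : Int32*)
  # Reinterpret the remaining input bytes as native-endian UTF-16 code units.
  utf16_slice = in_buffer.value.to_slice(in_buffer_left.value).unsafe_slice_of(UInt16)
  String.each_utf16_char(utf16_slice) do |ch|
    # Characters outside the BMP occupy a surrogate pair (4 bytes) in UTF-16.
    in_bytesize = ch.ord >= 0x10000 ? 4 : 2
    ch_bytesize = ch.bytesize
    # Stop once the next character no longer fits in the output buffer.
    break unless out_buffer_left.value >= ch_bytesize

    # Write the UTF-8 encoding of the character.
    ch.each_byte do |b|
      out_buffer.value.value = b
      out_buffer.value += 1
    end

    in_buffer.value += in_bytesize
    in_buffer_left.value -= in_bytesize
    out_buffer_left.value -= ch_bytesize
  end
end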

str = Bytes[0x61, 0x00, 0x62, 0x00, 0x3D, 0xD8, 0x02, 0xDE, 0x63, 0x00]
bytes = uninitialized UInt8[32]

in_buffer = str.to_unsafe
in_buffer_left = str.bytesize
out_buffer = bytes.to_unsafe
out_buffer_left = bytes.size
iconv_utf16_to_utf8(pointerof(in_buffer), pointerof(in_buffer_left), pointerof(out_buffer), pointerof(out_buffer_left))

String.new(bytes.to_slice[0, bytes.size - out_buffer_left]) # => "ab😂c"

Going in the opposite direction would need something like #13639 to be equally concise, but the point is that we could indeed achieve this without using iconv at all. If the source and destination encodings are each one of UTF-8, UTF-16, UTF-32, or maybe ASCII, then we could use our own native transcoders instead of iconv. If we are ambitious enough, we could also port the entire set of ICU character set mapping tables in an automated manner and remove our dependency on iconv altogether.
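
The dispatch rule described above could be sketched roughly as follows. NATIVE_ENCODINGS and native_transcodable? are hypothetical names invented for illustration, not existing standard library API, and the exact list of encoding labels is an assumption.

```crystal
# Hypothetical sketch: neither the constant nor the method exists in the
# standard library. The set of labels is an assumption for illustration.
NATIVE_ENCODINGS = %w(ASCII UTF-8 UTF-16LE UTF-16BE UTF-32LE UTF-32BE)

# Returns true when both encodings could be handled by native transcoders,
# in which case iconv would not need to be consulted at all.
def native_transcodable?(from : String, to : String) : Bool
  NATIVE_ENCODINGS.includes?(from.upcase) && NATIVE_ENCODINGS.includes?(to.upcase)
end

native_transcodable?("UTF-16LE", "utf-8") # => true
native_transcodable?("UTF-8", "GB2312")   # => false
```

Any pair for which this returns false would still fall back to iconv, or to ported mapping tables if those ever exist.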

Sep 14 '24 12:09 HertzDevil

A pure Crystal implementation would be lovely. For the sake of argument, are there alternatives to libiconv?

Sep 14 '24 13:09 ysbaddaden

ICU4C's ucnv_* API; the main problem is that either the source or the destination has to be UTF-16
encoding_c, bindings for the encoding_rs Rust library implementing the W3C Encoding Standard (by the way, this is a good baseline of what a standard library should probably provide if we end up not having the same coverage as GNU libiconv)
Most other alternatives are C++-only; a comparison is available at https://thephd.dev/the-c-c++-rust-string-text-encoding-api-landscape

Sep 14 '24 13:09 HertzDevil

Thank you 🙇

Sep 14 '24 14:09 ysbaddaden

The W3C Encoding Standard already sets the bar quite high, but seems to support a good list of general encodings 👍

There's a part 2 to the comparison article that focuses on C and presents ztd.cuneicode. I'm not saying we should use it, but it sounds like a solid reference, and both articles are a treasure trove of information.

Sep 16 '24 16:09 ysbaddaden
