AtomVM icon indicating copy to clipboard operation
AtomVM copied to clipboard

Encode atoms using UTF-8

Open fadushin opened this issue 2 years ago • 1 comments

As of OTP-26, atoms are encoded using UTF-8 encoding tags, when using the term_to_binary/1,2 Nif.

This PR adopts the same behavior for AtomVM. In addition, this PR adds support for term_to_binary/2, with the {minor_version, 1} option, which will encode atoms using latin1 encoding, if they do not contain any UTF-8 extended characters.

This PR address issue #1004.

These changes are made under both the "Apache 2.0" and the "GNU Lesser General Public License 2.1 or later" license terms (dual license).

SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later

fadushin avatar Jan 01 '24 16:01 fadushin

FWIW I've been sending Erlang terms from current AtomVM master using this branch to OTP 26.x applications without any issue when it comes to atom encoding.

arpunk avatar Feb 07 '24 10:02 arpunk

I think we need to evolve this PR to something a bit different:

About serialization:

  1. we must stop using ATOM_EXT (our implementation wasn't even correct) and replace it with SMALL_ATOM_UTF8_EXT
  2. there is no reason to support latin1 when doing serialization, we never supported it for real (if you try that right now garbage is produced). So I don't think it really makes sense to accept serialization options right now.
  3. We don't need to support ATOM_UTF8_EXT in serialization codepath, since we don't support atoms longer than 255 bytes.

About deserialization:

  1. I'm not sure if exists any practical scenario in which ATOM_UTF8_EXT is used for encoding an atom shorter than 256 bytes, longer atoms cannot be supported from us.
  2. ATOM_EXT must be kept in deserialization codepath, but it must encode received buffer to real utf-8. (I can take care of this)
  3. SMALL_ATOM_UTF8_EXT must verify that received buffer is valid utf8. (I can take care of this)

bettio avatar Feb 24 '24 11:02 bettio

About serialization:

  1. we must stop using ATOM_EXT (our implementation wasn't even correct) and replace it with SMALL_ATOM_UTF8_EXT

I believe that is what this PR does, in essence (unless you specify latin1 encoding).

  1. there is no reason to support latin1 when doing serialization, we never supported it for real (if you try that right now garbage is produced). So I don't think it really makes sense to accept serialization options right now.

Interesting. Even if the atom is, for example, ASCII? I mean, it does work in that case.

  1. We don't need to support ATOM_UTF8_EXT in serialization codepath, since we don't support atoms longer than 255 bytes.

I agree.

About deserialization:

  1. I'm not sure if exists any practical scenario in which ATOM_UTF8_EXT is used for encoding an atom shorter than 256 bytes, longer atoms cannot be supported from us.

Agreed.

  1. ATOM_EXT must be kept in deserialization codepath, but it must encode received buffer to real utf-8. (I can take care of this)

Yes, and has not been removed in this PR -- it was always there:

        case ATOM_EXT: {
            uint16_t atom_len = READ_16_UNALIGNED(external_term_buf + 1);

            int global_atom_id = globalcontext_insert_atom_maybe_copy(glb, (AtomString) (external_term_buf + 2), copy);

            *eterm_size = 3 + atom_len;
            return term_from_atom_index(global_atom_id);
        }
  1. SMALL_ATOM_UTF8_EXT must verify that received buffer is valid utf8. (I can take care of this)

Do you mean in parse_external_term?

fadushin avatar Feb 24 '24 18:02 fadushin