Encode atoms using UTF-8
As of OTP-26, atoms are encoded using UTF-8 encoding tags, when using the term_to_binary/1,2 Nif.
This PR adopts the same behavior for AtomVM. In addition, this PR adds support for term_to_binary/2, with the {minor_version, 1} option, which will encode atoms using latin1 encoding, if they do not contain any UTF-8 extended characters.
This PR address issue #1004.
These changes are made under both the "Apache 2.0" and the "GNU Lesser General Public License 2.1 or later" license terms (dual license).
SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later
FWIW I've been sending Erlang terms from current AtomVM master using this branch to OTP 26.x applications without any issue when it comes to atom encoding.
I think we need to evolve this PR to something a bit different:
About serialization:
- we must stop using
ATOM_EXT(our implementation wasn't even correct) and replace it withSMALL_ATOM_UTF8_EXT - there is no reason to support
latin1when doing serialization, we never supported it for real (if you try that right now garbage is produced). So I don't think it really makes sense to accept serialization options right now. - We don't need to support
ATOM_UTF8_EXTin serialization codepath, since we don't support atoms longer than 255 bytes.
About deserialization:
- I'm not sure if exists any practical scenario in which
ATOM_UTF8_EXTis used for encoding an atom shorter than 256 bytes, longer atoms cannot be supported from us. -
ATOM_EXTmust be kept in deserialization codepath, but it must encode received buffer to real utf-8. (I can take care of this) -
SMALL_ATOM_UTF8_EXTmust verify that received buffer is valid utf8. (I can take care of this)
About serialization:
- we must stop using
ATOM_EXT(our implementation wasn't even correct) and replace it withSMALL_ATOM_UTF8_EXT
I believe that is what this PR does, in essence (unless you specify latin1 encoding).
- there is no reason to support
latin1when doing serialization, we never supported it for real (if you try that right now garbage is produced). So I don't think it really makes sense to accept serialization options right now.
Interesting. Even if the atom is, for example, ASCII? I mean, it does work in that case.
- We don't need to support
ATOM_UTF8_EXTin serialization codepath, since we don't support atoms longer than 255 bytes.
I agree.
About deserialization:
- I'm not sure if exists any practical scenario in which
ATOM_UTF8_EXTis used for encoding an atom shorter than 256 bytes, longer atoms cannot be supported from us.
Agreed.
ATOM_EXTmust be kept in deserialization codepath, but it must encode received buffer to real utf-8. (I can take care of this)
Yes, and has not been removed in this PR -- it was always there:
case ATOM_EXT: {
uint16_t atom_len = READ_16_UNALIGNED(external_term_buf + 1);
int global_atom_id = globalcontext_insert_atom_maybe_copy(glb, (AtomString) (external_term_buf + 2), copy);
*eterm_size = 3 + atom_len;
return term_from_atom_index(global_atom_id);
}
SMALL_ATOM_UTF8_EXTmust verify that received buffer is valid utf8. (I can take care of this)
Do you mean in parse_external_term?