sext icon indicating copy to clipboard operation
sext copied to clipboard

Unicode atoms are not supported

Open seriyps opened this issue 1 year ago • 1 comments

I don't think it is a particularly popular feature, but would be nice to have it just for completeness:

> kdb_sext:encode('привет').
** exception error: bad argument
     in function  list_to_binary/1
        called as list_to_binary([1087,1088,1080,1074,1077,1090])
        *** argument 1: not an iolist term
     in call from mnesia_eleveldb_sext:encode_atom/1 (/Users/***/src/mnesia_eleveldb_sext.erl, line 374)

(note, this is our internal fork of sext, differences are minimal)

I believe smth like this should solve the problem? Can open a PR if you don't see any obvious problem with this.

diff --git a/src/mnesia_eleveldb_sext.erl b/src/mnesia_eleveldb_sext.erl
index 0a7409b..337e794 100644
--- a/src/mnesia_eleveldb_sext.erl
+++ b/src/mnesia_eleveldb_sext.erl
@@ -371,7 +371,7 @@ encode_ref(Name, Rest) ->
     <<?reference, NameEnc/binary, RestEnc/binary>>.
 
 encode_atom(A) ->
-    Bin = list_to_binary(atom_to_list(A)),
+    Bin = atom_to_binary(A, utf8),
     Enc = encode_bin_elems(Bin),
     <<?atom, Enc/binary>>.
 
@@ -760,7 +760,7 @@ partial_decode(Other) ->
 
 decode_atom(B) ->
     {Bin, Rest} = decode_binary(B),
-    {list_to_atom(binary_to_list(Bin)), Rest}.
+    {binary_to_atom(Bin, utf8), Rest}.
 
 decode_tuple(Sz, Elems) ->
     decode_tuple(Sz,Elems,[]).

At least it works for decoding/encoding for me with this patch. Didn't check if ordering is still correct, guess it should (utf-8):

> kdb_sext:decode(kdb_sext:encode('привет')).
'привет'
> kdb_sext:decode(kdb_sext:encode(hello)).
hello

seriyps avatar Oct 16 '24 11:10 seriyps

Given the core mission of sext (preserving sort order), the question is whether there is any one way to do it re. Unicode (I guess the general answer to that question for Unicode is 'no').

From the Erlang docs:

Note that UTF-8 is not compatible with bytewise representation for code points between 128 and 255, so a ISO-Latin-1 bytewise representation is not generally compatible with UTF-8.

Perhaps an option would need to be introduced ...

uwiger avatar Oct 16 '24 12:10 uwiger