[Go] Implement meta string encoding algorithm for golang

Open chaokunyang opened this issue 10 months ago • 8 comments

Is your feature request related to a problem? Please describe.

We've implemented the meta string encoding algorithm described in https://fury.apache.org/docs/specification/fury_xlang_serialization_spec#meta-string for Java in #1514. It's time to implement it in Go.

Describe the solution you'd like

The Java implementation in #1514 can be taken as a reference. Note, however, that in Go the meta string encoding algorithm is used to encode field names only, so the special characters can't be . or $, and the implementation will therefore be simpler.

Additional context

#1413

chaokunyang avatar Apr 19 '24 04:04 chaokunyang

Could you assign it to me? This is my first attempt at contributing to open source, and I'm very interested in this task. Thanks.

qingoba avatar Apr 20 '24 04:04 qingoba

Great, thanks for your willingness to contribute to Fury.

chaokunyang avatar Apr 20 '24 04:04 chaokunyang

In function public MetaString encode(String input, Encoding encoding) in file MetaStringEncoder.java, there is a section of code:

default:
  byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
  return new MetaString(
      input, Encoding.UTF_8, specialChar1, specialChar2, bytes, bytes.length * 8, 0);

Why is numBits 0 rather than bytes.length * 8? And why is numChars bytes.length * 8 rather than bytes.length?

qingoba avatar Apr 22 '24 07:04 qingoba

Hmm, this is a bug. UTF-8 is rarely used in meta strings because most chars are actually ASCII, so this path isn't covered by the Fury serialization tests. We need to fix it and add some unit tests.

Thanks for pointing out this bug @qingoba

chaokunyang avatar Apr 22 '24 13:04 chaokunyang

I have a new idea: we can add a bit to the encoded meta string to indicate whether to strip the last char when the encoding is not UTF-8. This way, we don't have to store num bits and num chars in MetaString.

chaokunyang avatar Apr 23 '24 02:04 chaokunyang

Exactly. Because 5 + 5 > 8, the padding in the last byte can hold at most one empty character. Suppose we use empty (0 or 1) to mark whether the last char is empty; then the actual number of characters is equal to len(bytes) * 8 / 5 - empty, using integer division.
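
To make the arithmetic above concrete, here is a minimal Go sketch. It assumes characters are packed back to back at 5 bits each with no header bit, exactly as in the formula above; the function names packedLen, emptyFlag and charCount are hypothetical and not part of Fury.

```go
package main

import "fmt"

// packedLen returns how many bytes are needed to hold n characters at
// bitsPerChar bits each, assuming no header bit (an assumption of this sketch).
func packedLen(n, bitsPerChar int) int {
	return (n*bitsPerChar + 7) / 8
}

// emptyFlag returns 1 when the padding in the last byte is wide enough to
// look like one extra (empty) character slot, and 0 otherwise.
func emptyFlag(n, bitsPerChar int) int {
	padding := packedLen(n, bitsPerChar)*8 - n*bitsPerChar
	if padding >= bitsPerChar {
		return 1
	}
	return 0
}

// charCount recovers the character count from the byte length and the
// empty marker: numBytes * 8 / bitsPerChar - empty (integer division).
func charCount(numBytes, bitsPerChar, empty int) int {
	return numBytes*8/bitsPerChar - empty
}

func main() {
	for n := 1; n <= 8; n++ {
		b := packedLen(n, 5)
		e := emptyFlag(n, 5)
		fmt.Printf("chars=%d bytes=%d empty=%d recovered=%d\n",
			n, b, e, charCount(b, 5, e))
	}
}
```

Padding is always less than 8 bits, so it can contain at most one whole 5-bit slot, which is why a single marker bit is enough.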

qingoba avatar Apr 24 '24 05:04 qingoba

This way, the Decoder does not need to accept a numBits argument.

qingoba avatar Apr 24 '24 06:04 qingoba

> I have a new idea: we can add a bit to the encoded meta string to indicate whether to strip the last char when the encoding is not UTF-8. This way, we don't have to store num bits and num chars in MetaString.

Hi @qingoba, I added the strip-last-char flag to the spec in #1565. I believe this will make the implementation simpler.
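
For anyone following along, a rough Go sketch of how a strip-last-char flag could work with a 5-bit encoding is shown below. It is an illustration, not the Fury implementation: it assumes the flag is the most significant bit of the first byte, and the alphabet (a-z mapped to 0..25, _ mapped to 26) and the names encode, decode, charToValue and valueToChar are hypothetical. Check the spec for the real layout and character tables.

```go
package main

import (
	"errors"
	"fmt"
)

const bitsPerChar = 5

// charToValue maps a character to a 5-bit value. The alphabet used here
// (a-z -> 0..25, '_' -> 26) is an assumption made for this sketch.
func charToValue(c byte) (byte, error) {
	switch {
	case c >= 'a' && c <= 'z':
		return c - 'a', nil
	case c == '_':
		return 26, nil
	}
	return 0, errors.New("unsupported character")
}

// valueToChar is the inverse of charToValue.
func valueToChar(v byte) (byte, error) {
	switch {
	case v < 26:
		return 'a' + v, nil
	case v == 26:
		return '_', nil
	}
	return 0, errors.New("invalid 5-bit value")
}

// encode writes a 1-bit strip-last-char flag followed by the characters,
// 5 bits each, most significant bit first.
func encode(s string) ([]byte, error) {
	totalBits := 1 + len(s)*bitsPerChar
	out := make([]byte, (totalBits+7)/8)
	pos := 1 // bit 0 is reserved for the flag
	for i := 0; i < len(s); i++ {
		v, err := charToValue(s[i])
		if err != nil {
			return nil, err
		}
		for b := bitsPerChar - 1; b >= 0; b-- {
			if (v>>uint(b))&1 == 1 {
				out[pos/8] |= 1 << uint(7-pos%8)
			}
			pos++
		}
	}
	// If the padding could be mistaken for one more character, set the flag.
	if len(out)*8-totalBits >= bitsPerChar {
		out[0] |= 0x80
	}
	return out, nil
}

// decode needs only the bytes: the character count comes from the byte
// length and the flag, so no numBits argument is required.
func decode(data []byte) (string, error) {
	if len(data) == 0 {
		return "", nil
	}
	n := (len(data)*8 - 1) / bitsPerChar
	if data[0]&0x80 != 0 {
		n-- // the last slot is padding, strip it
	}
	buf := make([]byte, 0, n)
	pos := 1
	for i := 0; i < n; i++ {
		var v byte
		for b := 0; b < bitsPerChar; b++ {
			v = v<<1 | (data[pos/8]>>uint(7-pos%8))&1
			pos++
		}
		c, err := valueToChar(v)
		if err != nil {
			return "", err
		}
		buf = append(buf, c)
	}
	return string(buf), nil
}

func main() {
	for _, s := range []string{"id", "name", "user_name"} {
		enc, _ := encode(s)
		dec, _ := decode(enc)
		fmt.Printf("%q -> %d bytes -> %q\n", s, len(enc), dec)
	}
}
```

Because the flag takes one bit, the count in this layout becomes (len(bytes) * 8 - 1) / 5 minus the flag, rather than len(bytes) * 8 / 5 - empty.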

chaokunyang avatar Apr 24 '24 11:04 chaokunyang