avro-schema

Support UTF-8 in record names, field names and enums

Open Totktonada opened this issue 6 years ago • 0 comments

  1. Should we check UTF-8 validity?
    • I think yes, because it seems there is no way to ban certain symbols in an encoding-unaware way.
    • But once we have checked that a name is valid UTF-8, we can still use the built-in regexps, which avoids rewriting much of the internals.
  2. Should we check for certain symbols, such as a period or a zero byte?
    • A period at least; see, for example, fullname in frontend.lua.
  3. How should this feature be combined with the utf8_enums flag?
    • I think we should just keep this flag and prefer the new behaviour when both flags are provided. That said, removing it is unlikely to hurt anyone.
  4. Use tarantool facilities for identifiers?
    • No-cost way: don't use tarantool identifiers and don't perform any validity check.
    • Use tarantool identifiers. This seems to be the better way. There are two possible approaches (both require the new utf8 module):
      • Add the forbidden symbols to identifier_check* and expose identifier.c to Lua (as part of the utf8 module).
      • Expose identifier.c to Lua (as part of the utf8 module) and traverse the identifier with utf8.next to look for forbidden symbols.
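
To make points 1, 2 and 4 concrete, here is a minimal sketch of the check order being discussed: first verify that the bytes are valid UTF-8, then walk the decoded symbols looking for forbidden ones. It is written in Python purely for illustration and is not the avro-schema or tarantool code; the function name `check_utf8_name` and the exact forbidden set (zero byte and period) are assumptions taken from the questions above.

```python
# Hypothetical forbidden set: the issue names the period and the zero byte
# as candidates; the final list is still an open question.
FORBIDDEN = {"\x00", "."}


def check_utf8_name(raw: bytes):
    """Return (ok, reason) for a candidate record/field/enum name."""
    if not raw:
        return False, "empty name"
    try:
        # Strict decoding rejects malformed sequences, overlong forms
        # and encoded surrogates.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False, "invalid UTF-8"
    # Traverse symbol by symbol, similar in spirit to iterating with
    # utf8.next on the Lua side.
    for ch in text:
        if ch in FORBIDDEN:
            return False, "forbidden symbol U+%04X" % ord(ch)
    return True, None
```

Because the forbidden-symbol pass runs only on input that has already been validated, a byte such as 0x2E can never be mistaken for part of a multi-byte sequence.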

Blocked by: https://github.com/tarantool/tarantool/issues/3405

The feature should be enabled only under a flag, to stay compatible with the Avro spec by default.
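
A rough sketch of what gating behind a flag could look like, again in Python for illustration only. The default path applies the name rule from the Avro spec (`[A-Za-z_][A-Za-z0-9_]*`); the relaxed path is taken only when the flag is set. The flag name `utf8_names` and the function `is_valid_name` are hypothetical, and the relaxed rule (non-empty, no period, no zero byte) is an assumption based on the questions above.

```python
import re

# Name rule from the Avro specification.
SPEC_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_name(name: str, utf8_names: bool = False) -> bool:
    """Validate an already decoded name; utf8_names is a hypothetical flag."""
    if SPEC_NAME.fullmatch(name):
        return True  # spec-compliant names are always accepted
    if not utf8_names:
        return False  # default: strict spec behaviour
    # Relaxed mode (assumed rule): any non-empty name without
    # a period or a zero byte.
    return name != "" and "." not in name and "\x00" not in name
```

With the flag off, schemas that were rejected before are still rejected, so existing users see no change in behaviour.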

Totktonada, May 16 '18 10:05