Change internal representation to UTF-8

Open foxik opened this issue 10 years ago • 2 comments

Currently, we use UCS-2 as the internal encoding, which prevents us from using Unicode characters outside the BMP.

We should change the internal representation; the current plan is to use UTF-8:

  • we will use char and string datatypes
  • input and output will be in UTF-8 (as it is today)
  • tokenizer will work on input UTF-8 string and the created tokens will be pointers to the original text
  • lexicon will contain words in UTF-8 (and transitively language models and morphology will use UTF-8)
  • error model will be in UTF-8, i.e. it will contain variable-length strings instead of tuples or triples of Unicode characters
  • SimWordsFinder::Find will have to interpret the UTF-8 encoding and handle the fact that one Unicode character can be represented as multiple code units. The input word could be converted to UTF-32, but I do not think that will be needed, because both the lexicon and the error model will be in UTF-8
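A minimal sketch of what "tokens as pointers to the original text" could look like, assuming C++17 and `std::string_view`. The names `tokenize`, `token_offset` and `count_code_points` are hypothetical and are not part of Korektor; the whitespace splitting only stands in for the real tokenizer rules.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical tokenizer: splits on spaces only. Every token is a view into
// `text`, so no bytes are copied and `text` must outlive the tokens.
std::vector<std::string_view> tokenize(std::string_view text) {
  std::vector<std::string_view> tokens;
  std::size_t i = 0;
  while (i < text.size()) {
    while (i < text.size() && text[i] == ' ') i++;  // skip separators
    std::size_t start = i;
    while (i < text.size() && text[i] != ' ') i++;  // scan one token
    if (i > start) tokens.push_back(text.substr(start, i - start));
  }
  return tokens;
}

// Byte offset of a token inside the text it was cut from.
std::size_t token_offset(std::string_view text, std::string_view token) {
  assert(token.data() >= text.data() &&
         token.data() + token.size() <= text.data() + text.size());
  return static_cast<std::size_t>(token.data() - text.data());
}

// Number of Unicode characters in well-formed UTF-8: every byte that is not
// a continuation byte (10xxxxxx) starts a new character.
std::size_t count_code_points(std::string_view s) {
  std::size_t n = 0;
  for (unsigned char b : s)
    if ((b & 0xC0) != 0x80) n++;
  return n;
}
```

Here a token's byte length and its character count differ as soon as it contains non-ASCII text, which is the distinction SimWordsFinder::Find would have to keep in mind.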

The alternative to UTF-8 is to use UTF-32, but

  • using UTF-8 is a standard solution; it is used in Python/Perl (Python, for example, used UTF-16/UTF-32 at some point in the past)
  • the UTF-8 representation is much more compact
  • even though UTF-8 does not allow constant-time random access, we only access word characters sequentially in Korektor; moreover, we can always perform UTF-8 <-> UTF-32 conversion
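For illustration, the UTF-8 <-> UTF-32 conversion can be written by hand in a few lines. This is only a sketch under the assumption that the input is well-formed: it does not reject overlong forms, surrogates or truncated sequences, and the function names `utf8_to_utf32` and `utf32_to_utf8` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Decode well-formed UTF-8 into UTF-32 (no validation, see the note above).
std::u32string utf8_to_utf32(std::string_view in) {
  std::u32string out;
  for (std::size_t i = 0; i < in.size();) {
    unsigned char b0 = static_cast<unsigned char>(in[i]);
    // Sequence length is determined by the lead byte.
    std::size_t len = b0 < 0x80 ? 1 : b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : 4;
    // Payload bits of the lead byte: 7, 5, 4 or 3 bits.
    char32_t cp = len == 1 ? b0 : b0 & (0xFF >> (len + 1));
    // Each continuation byte contributes 6 more bits.
    for (std::size_t k = 1; k < len && i + k < in.size(); k++)
      cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
    out.push_back(cp);
    i += len;
  }
  return out;
}

// Encode UTF-32 code points as UTF-8.
std::string utf32_to_utf8(std::u32string_view in) {
  std::string out;
  for (char32_t cp : in) {
    assert(cp <= 0x10FFFF);  // outside the Unicode range otherwise
    if (cp < 0x80) {
      out += static_cast<char>(cp);
    } else if (cp < 0x800) {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

Both directions are a single sequential pass, so converting one word at a time where fixed-width access is convenient stays cheap.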

foxik avatar Apr 29 '15 11:04 foxik

UTF-8 is a good choice for the internal representation. Nevertheless, the modules responsible for finding similar words should, in my opinion, use UTF-32 internally:

  • Lexicon - the current TRIE-based implementation requires characters of fixed length
  • SimWordsFinder - this class uses direct character access by index
  • ErrorModel - should use the same encoding as the SimWordsFinder (since error model is queried by SimWordsFinder)

Pros of using UTF-32 internally in the above classes

  • faster code
  • simpler code
  • fewer code changes required

Cons of using UTF-32 internally in the above classes

  • higher memory consumption (only Lexicon matters, error models are small in comparison)

I think that the pros far outweigh the cons.

michalisek avatar May 20 '15 09:05 michalisek

From my point of view:

  • I am not sure the code will be faster with UTF-32, as the keys of the ErrorModel will be larger (4x for ASCII, ~3x for Czech)

  • The UTF-8 will require more complicated code, but

    • Lexicon will be unaffected (it will store bytes of UTF-8 encoding without understanding), except for GetSimilarWords_impl
    • ErrorModel will be unaffected (it will store bytes of UTF-8 encoding without understanding)
    • SimWordsFinder (which only handles casing) accesses the characters sequentially, so it will be simple to modify

    The most complicated method will be Lexicon::GetSimilarWords_impl, because it will have to deal with the following:

    • when adding/replacing a character, it may have to add multiple bytes from the Lexicon trie
    • when deleting/replacing a character from the input string, it may have to remove multiple bytes (from the end of the string)
  • the language models will eventually be in UTF-8 (either when we use a library like kenlm, or when we rewrite them to use hashes)

  • eventually I want to rewrite the Lexicon structure (it currently takes more time to find the suggestions than to query the language models), and UTF-8 will be much better suited to the new representation I have in mind
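To make the "remove possibly multiple bytes from the end of the string" case concrete, here is a rough sketch of dropping the last Unicode character from a UTF-8 buffer. It assumes well-formed input, and `pop_back_utf8` is a hypothetical helper, not an existing Korektor function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Remove the last Unicode character from a well-formed UTF-8 string by
// dropping trailing continuation bytes (10xxxxxx) and then the byte that
// starts the character. Returns the number of bytes removed (0 if empty).
std::size_t pop_back_utf8(std::string& s) {
  std::size_t removed = 0;
  while (!s.empty()) {
    unsigned char b = static_cast<unsigned char>(s.back());
    s.pop_back();
    removed++;
    if ((b & 0xC0) != 0x80) break;  // reached an ASCII byte or a lead byte
  }
  assert(removed <= 4);  // holds for well-formed UTF-8
  return removed;
}
```

The returned byte count is what a search over the trie would need to remember in order to undo the step later, for example by re-appending the same number of bytes when backtracking.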

foxik avatar May 20 '15 10:05 foxik