Change internal representation to UTF-8
Currently, we are using UCS-2 as the internal encoding, which prevents us from using Unicode characters outside of the BMP.
We should change the internal representation; the current plan is to use UTF-8:
- we will use the `char` and `string` data types - input and output will be in UTF-8 (as it is today)
- the tokenizer will work on the input UTF-8 string and the created tokens will be pointers into the original text
- the lexicon will contain words in UTF-8 (and transitively the language models and morphology will use UTF-8)
- the error model will be in UTF-8, i.e., it will contain variable-length strings instead of tuples or triples of Unicode characters
- the `SimWordsFinder::Find` method will have to interpret the UTF-8 encoding and understand that one Unicode character can be represented by multiple code units. Maybe the input word will be converted to UTF-32, but I do not think so, because both the lexicon and the error model will be in UTF-8
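For illustration, here is a minimal sketch of the sequential decoding that `SimWordsFinder::Find` would have to perform; `decode_utf8` is a hypothetical helper (not existing Korektor code) and it assumes well-formed input:

```cpp
#include <string>

// Decode the Unicode code point starting at byte `pos` of a UTF-8 string
// and advance `pos` past it. Assumes well-formed input (no validation).
char32_t decode_utf8(const std::string& s, size_t& pos) {
  unsigned char b0 = s[pos++];
  if (b0 < 0x80) return b0;                         // 1-byte sequence (ASCII)
  int extra = b0 >= 0xF0 ? 3 : b0 >= 0xE0 ? 2 : 1;  // number of continuation bytes
  char32_t cp = b0 & (0x3F >> extra);               // payload bits of the lead byte
  while (extra-- > 0)                               // each continuation byte adds 6 bits
    cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
  return cp;
}
```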
The alternative to UTF-8 is to use UTF-32, but:
- using UTF-8 is a standard solution: it is used in Python/Perl (and, for example, Python used UTF-16/UTF-32 internally at some point in the past)
- the UTF-8 representation is much more compact
- even though UTF-8 does not allow constant-time random access, we only access word characters sequentially in Korektor; moreover, we can always perform a UTF-8 <-> UTF-32 conversion
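As a sketch of how straightforward that conversion is, here is the UTF-32 -> UTF-8 direction (the opposite direction is just the decoding loop shown earlier); `utf32_to_utf8` is a hypothetical helper, not existing Korektor code:

```cpp
#include <string>

// Encode a UTF-32 string back to UTF-8. Together with a decoder this
// gives the lossless UTF-8 <-> UTF-32 round trip mentioned above.
std::string utf32_to_utf8(const std::u32string& s) {
  std::string out;
  for (char32_t cp : s) {
    if (cp < 0x80) {                  // 1 byte: 0xxxxxxx
      out += char(cp);
    } else if (cp < 0x800) {          // 2 bytes: 110xxxxx 10xxxxxx
      out += char(0xC0 | (cp >> 6));
      out += char(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {        // 3 bytes
      out += char(0xE0 | (cp >> 12));
      out += char(0x80 | ((cp >> 6) & 0x3F));
      out += char(0x80 | (cp & 0x3F));
    } else {                          // 4 bytes (outside the BMP)
      out += char(0xF0 | (cp >> 18));
      out += char(0x80 | ((cp >> 12) & 0x3F));
      out += char(0x80 | ((cp >> 6) & 0x3F));
      out += char(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```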
UTF-8 is a good choice for the internal representation; nevertheless, the modules responsible for finding similar words should in my opinion use UTF-32 internally:
- `Lexicon` - the current implementation, based on a trie, requires fixed-length characters
- `SimWordsFinder` - this class accesses characters directly by index
- `ErrorModel` - should use the same encoding as `SimWordsFinder` (since the error model is queried by `SimWordsFinder`)
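To show why fixed-length characters matter for the trie, here is a hypothetical node layout (a sketch only, not the actual `Lexicon` implementation): with UTF-32 edges, one edge always corresponds to exactly one character, whereas with UTF-8 one character could span several one-byte edges:

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical trie node: with char32_t edges, "one edge = one character"
// holds by construction, which keeps edit-distance traversal simple.
struct TrieNode {
  bool is_word = false;
  std::map<char32_t, std::unique_ptr<TrieNode>> children;
};

void insert(TrieNode& root, const std::u32string& word) {
  TrieNode* node = &root;
  for (char32_t c : word) {
    auto& child = node->children[c];
    if (!child) child = std::make_unique<TrieNode>();
    node = child.get();
  }
  node->is_word = true;  // mark the end of a lexicon word
}
```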
Pros of using UTF-32 internally in the above classes:
- faster code
- simpler code
- fewer code changes required
Cons of using UTF-32 internally in the above classes:
- higher memory consumption (only the `Lexicon` matters; the error models are small in comparison)
I think that the pros far outweigh the cons.
From my point of view:
- I am not sure the code will be faster with UTF-32, as the keys of the `ErrorModel` will be larger (4x for ASCII, ~3x for Czech - e.g. "příliš" takes 9 bytes in UTF-8 but 24 bytes in UTF-32)
- UTF-8 will require more complicated code, but:
  - `Lexicon` will be unaffected (it will store the bytes of the UTF-8 encoding without interpreting them), except for `GetSimilarWords_impl`
  - `ErrorModel` will be unaffected (it will store the bytes of the UTF-8 encoding without interpreting them)
  - `SimWordsFinder` (which only handles casing) accesses the characters sequentially, so it will be simple to modify
  The most complicated method will be `Lexicon::GetSimilarWords_impl`, because it will have to deal with the following (see the sketch after this list):
  - when adding/replacing a character, it has to append possibly multiple bytes taken from the `Lexicon` trie
  - when deleting/replacing a character of the input string, it has to remove possibly multiple bytes (from the end of the string)
- the language models will eventually be in UTF-8 (either when we use a library like kenlm, or when we rewrite them to use hashes)
- eventually I want to rewrite the `Lexicon` structure (it currently takes more time to find the suggestions than to query the language models), and UTF-8 will be much better suited for the new representation I have in mind
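To illustrate the deletion case of `GetSimilarWords_impl` mentioned above, here is a minimal sketch of removing a whole character (not just a byte) from the end of a UTF-8 string; `pop_last_utf8_char` is a hypothetical helper, not existing Korektor code:

```cpp
#include <string>

// Remove the last *character* of a UTF-8 string: pop the final byte
// together with any continuation bytes (10xxxxxx) that precede it.
void pop_last_utf8_char(std::string& s) {
  if (s.empty()) return;
  size_t i = s.size() - 1;
  while (i > 0 && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
    --i;         // skip continuation bytes of the last character
  s.erase(i);    // erase the lead byte plus its continuation bytes
}
```

The addition case is symmetric: instead of appending a single trie edge, the code appends all bytes of the character stored in the trie.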