picrin
RFC: Flexible String Representation
Related to #211.
The Python community published a proposal, PEP 0393 "Flexible String Representation", and implemented it in Python 3.3. picrin doesn't use wchar_t at the moment, but we might benefit from a similar representation. What do you think?
Reference: https://www.python.org/dev/peps/pep-0393/
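For readers who have not seen the PEP, the core idea is to store each string with the narrowest fixed-width code unit that can hold its largest code point: 1 byte if every code point is below U+0100, 2 bytes if below U+10000, and 4 bytes otherwise. Below is a minimal sketch of that selection rule in C. The names `flex_kind`, `flex_kind_for`, and `flex_storage_bytes` are hypothetical and exist neither in picrin nor in CPython.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration only; not picrin or CPython API. */
enum flex_kind { FLEX_1BYTE = 1, FLEX_2BYTE = 2, FLEX_4BYTE = 4 };

/* Pick the narrowest code unit width that can hold every code point. */
static enum flex_kind flex_kind_for(const uint32_t *cps, size_t n)
{
  uint32_t max = 0;
  for (size_t i = 0; i < n; ++i) {
    if (cps[i] > max) max = cps[i];
  }
  if (max < 0x100) return FLEX_1BYTE;    /* Latin-1 range */
  if (max < 0x10000) return FLEX_2BYTE;  /* Basic Multilingual Plane */
  return FLEX_4BYTE;                     /* anything beyond the BMP */
}

/* Bytes needed for the character data under that width. */
static size_t flex_storage_bytes(const uint32_t *cps, size_t n)
{
  return n * (size_t)flex_kind_for(cps, n);
}
```

Under this rule a pure ASCII string costs one byte per character, while a single character outside the BMP makes the whole string use four bytes per character.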
wasabiz and I once discussed what picrin's internal representation of Unicode should be. UTF-8 has some advantages, and UTF-16 (or UTF-32) has several drawbacks:
- As picrin is a lightweight implementation, memory-hungry UTF-16 (or UTF-32) is not suitable.
- `wchar_t` is not guaranteed to be 32 bits wide (it is 16 bits on Windows, for example), so you cannot portably use `wchar_t` to represent UTF-32.
- Because picrin uses ropes to hold strings, UTF-8's O(n) indexing problem does not matter so much.
So I think keeping char and using UTF-8 will do. What do you think, @wasabiz?
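To make the O(n) point concrete: in UTF-8 the k-th character has no fixed byte offset, so finding it means scanning from the start and skipping continuation bytes. Here is a rough sketch, assuming the input is already valid UTF-8. The function names are hypothetical, not picrin's.

```c
#include <stddef.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int utf8_is_cont(unsigned char b)
{
  return (b & 0xC0) == 0x80;
}

/* Number of code points in a valid UTF-8 buffer: O(nbytes). */
static size_t utf8_length(const char *s, size_t nbytes)
{
  size_t count = 0;
  for (size_t i = 0; i < nbytes; ++i) {
    if (!utf8_is_cont((unsigned char)s[i])) ++count;
  }
  return count;
}

/* Byte offset of the k-th code point (0-based): O(nbytes).
   Returns nbytes when k is out of range. */
static size_t utf8_offset(const char *s, size_t nbytes, size_t k)
{
  for (size_t i = 0; i < nbytes; ++i) {
    if (!utf8_is_cont((unsigned char)s[i])) {
      if (k == 0) return i;
      --k;
    }
  }
  return nbytes;
}
```

With a rope, the scan would only need to cover one chunk at a time, provided each chunk also records how many characters it holds. That bookkeeping is an assumption of mine, not something picrin's rope is confirmed to do.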
@KeenS Thank you. I agree with your opinion.
Some additions from my perspective:
- Caching UTF-32 sequences may boost performance of character-oriented operations, at the cost of memory efficiency as @KeenS pointed out.
- Strictly speaking, `wchar_t` is not for storing Unicode characters, but wide characters. The portable usage of `wchar_t` is treating it as an opaque type and not relying on its bit pattern.
  - Yes, the big three platforms all treat it as a UTF-{16,32} code unit, but e.g. NetBSD doesn't.
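To illustrate the caching idea from the first point, here is a rough sketch that keeps UTF-8 as the canonical storage and builds a `uint32_t` array lazily on the first character-oriented access. It uses `uint32_t` rather than `wchar_t` so it does not rely on `wchar_t`'s width. `struct cached_str` and all function names are hypothetical, the input is assumed to be valid UTF-8, and the sketch depends on `malloc`.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout for illustration; not picrin's actual string type. */
struct cached_str {
  const char *utf8;  /* canonical storage */
  size_t nbytes;
  uint32_t *utf32;   /* NULL until the cache is first needed */
  size_t nchars;     /* meaningful only once utf32 is filled */
};

/* Decode one code point from valid UTF-8; returns the bytes consumed. */
static size_t utf8_decode1(const unsigned char *s, uint32_t *cp)
{
  if (s[0] < 0x80) { *cp = s[0]; return 1; }
  if (s[0] < 0xE0) {
    *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
    return 2;
  }
  if (s[0] < 0xF0) {
    *cp = ((uint32_t)(s[0] & 0x0F) << 12) | ((uint32_t)(s[1] & 0x3F) << 6)
        | (uint32_t)(s[2] & 0x3F);
    return 3;
  }
  *cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12)
      | ((uint32_t)(s[2] & 0x3F) << 6) | (uint32_t)(s[3] & 0x3F);
  return 4;
}

/* Fill the cache once. Returns 0 on success, -1 if allocation fails. */
static int cached_str_fill(struct cached_str *str)
{
  if (str->utf32 != NULL) return 0;
  /* Worst case is one code point per byte; allocate at least one slot. */
  uint32_t *buf = malloc((str->nbytes ? str->nbytes : 1) * sizeof *buf);
  if (buf == NULL) return -1;
  size_t i = 0, n = 0;
  while (i < str->nbytes) {
    i += utf8_decode1((const unsigned char *)str->utf8 + i, &buf[n]);
    ++n;
  }
  str->utf32 = buf;
  str->nchars = n;
  return 0;
}

/* k-th code point, O(1) after the first call.
   Returns UINT32_MAX on allocation failure or out-of-range k. */
static uint32_t cached_str_ref(struct cached_str *str, size_t k)
{
  if (cached_str_fill(str) != 0 || k >= str->nchars) return UINT32_MAX;
  return str->utf32[k];
}
```

The memory cost is visible here: the cache allocates four bytes per input byte in the worst case, on top of the UTF-8 data it mirrors.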
@omasanori @KeenS
Sounds nice. I don't mind changing the internal string representation as long as it still compiles in a freestanding environment. A UTF-32 sequence is good for programs doing heavy string modifications, but I think such cases are no more than 1% of the total. Providing a separate structure for them is reasonable.
@omasanori @KeenS
Even if it breaks the no-libc rule, it'll probably be OK if we can switch it off with macros.
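Something like the following is what I have in mind for the macro switch. It is only a sketch: `EXAMPLE_USE_UTF32_CACHE` and `struct example_str` are made-up names, not existing picrin configuration or types. With the macro left at 0, the code includes only `<stddef.h>`, which is available in freestanding environments.

```c
#include <stddef.h>  /* available even in freestanding environments */

/* Hypothetical switch; defaults to off so the no-libc build is unaffected. */
#ifndef EXAMPLE_USE_UTF32_CACHE
# define EXAMPLE_USE_UTF32_CACHE 0
#endif

#if EXAMPLE_USE_UTF32_CACHE
# include <stdint.h>
# include <stdlib.h>  /* hosted-only dependency, pulled in on request */
#endif

/* Hypothetical string header; the cache field exists only when enabled. */
struct example_str {
  const char *utf8;
  size_t nbytes;
#if EXAMPLE_USE_UTF32_CACHE
  uint32_t *utf32;
#endif
};

static void example_str_init(struct example_str *s, const char *utf8,
                             size_t nbytes)
{
  s->utf8 = utf8;
  s->nbytes = nbytes;
#if EXAMPLE_USE_UTF32_CACHE
  s->utf32 = NULL;  /* filled lazily elsewhere */
#endif
}

static void example_str_release(struct example_str *s)
{
#if EXAMPLE_USE_UTF32_CACHE
  free(s->utf32);
  s->utf32 = NULL;
#else
  (void)s;  /* nothing to release without the cache */
#endif
}

/* Lets callers check at run time which variant was compiled in. */
static int example_has_utf32_cache(void)
{
  return EXAMPLE_USE_UTF32_CACHE;
}
```

Building with `-DEXAMPLE_USE_UTF32_CACHE=1` would opt in to the libc-dependent variant without touching the default configuration.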