picrin RFC: Flexible String Representation

Related to #211.

The Python community published PEP 0393, "Flexible String Representation", and implemented it in Python 3.3. picrin doesn't use wchar_t now, but we might benefit from a similar representation. What do you think?

Reference: https://www.python.org/dev/peps/pep-0393/
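
For readers who have not read the PEP: its core idea is to store each string in the narrowest fixed-width unit (1, 2, or 4 bytes per character) that fits the string's largest code point. A minimal C sketch of that selection step follows. `flex_unit_width` is a hypothetical name for illustration, not something that exists in picrin or CPython, and `assert` is used only to document the precondition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes per character needed by a PEP 393-style
 * fixed-width buffer that must hold all of the given code points. */
unsigned flex_unit_width(const uint32_t *cps, size_t n)
{
    uint32_t max = 0;
    size_t i;

    assert(cps != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (cps[i] > max)
            max = cps[i];
    }
    if (max < 0x100)
        return 1;   /* fits in the Latin-1 range */
    if (max < 0x10000)
        return 2;   /* fits in the Basic Multilingual Plane */
    return 4;       /* needs a full 32-bit unit */
}
```

A string containing a single character outside the BMP is stored with 4 bytes per character throughout, which is the memory trade-off to weigh against plain UTF-8.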

Jul 23 '15 14:07 omasanori

wasabiz and I once discussed what picrin's internal representation of Unicode should be. UTF-8 has some advantages, and UTF-16 (or UTF-32) has several drawbacks.

- picrin is a lightweight implementation, so memory-hungry UTF-16 (or UTF-32) is not suitable.
- wchar_t is not guaranteed to be 32 bits wide (it is 16 bits on Windows, for example), so you cannot rely on it to hold UTF-32 code units.
- Because picrin uses ropes to hold strings, UTF-8's O(n) character indexing does not matter so much.
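
To make the O(n) point concrete, here is a minimal sketch of character counting and character indexing over a UTF-8 buffer. It assumes well-formed input, and the names `utf8_count` and `utf8_offset` are hypothetical, not picrin functions. Both walk the bytes linearly, which is the cost in question; with a rope, the scan is at least bounded by the size of one chunk.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in well-formed UTF-8: every byte that is not a
 * continuation byte (10xxxxxx) starts a new code point. */
size_t utf8_count(const char *s, size_t nbytes)
{
    size_t i, n = 0;

    assert(s != NULL || nbytes == 0);
    for (i = 0; i < nbytes; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Byte offset of the k-th code point (0-based), found by a linear scan.
 * Returns nbytes when k is out of range. */
size_t utf8_offset(const char *s, size_t nbytes, size_t k)
{
    size_t i;

    assert(s != NULL || nbytes == 0);
    for (i = 0; i < nbytes; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            if (k == 0)
                return i;
            k--;
        }
    }
    return nbytes;
}
```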

So I think keeping char and using UTF-8 will do. What do you think, @wasabiz?

Jul 23 '15 14:07 KeenS

@KeenS Thank you. I agree with your opinion.

Some additions from my perspective:

- Caching UTF-32 sequences may boost the performance of character-oriented operations, at the cost of memory efficiency, as @KeenS pointed out.
- Strictly speaking, wchar_t is for storing wide characters, not Unicode characters. The portable way to use wchar_t is to treat it as an opaque type and not rely on its bit pattern.
  - Yes, the big three platforms all treat it as a UTF-{16,32} code unit, but NetBSD, for example, doesn't.
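
As an illustration of the caching idea, the sketch below decodes a UTF-8 buffer into an array of `uint32_t` code points, so that later character lookups become plain array indexing. It uses `uint32_t` rather than wchar_t for the reason given above. The function names are hypothetical, the input is assumed to be well-formed UTF-8, and no validation is performed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point starting at s; store it in *out and return the
 * number of bytes consumed. Assumes well-formed UTF-8. */
size_t utf8_decode_one(const unsigned char *s, uint32_t *out)
{
    assert(s != NULL && out != NULL);
    if (s[0] < 0x80) {
        *out = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *out = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *out = ((uint32_t)(s[0] & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6)
             | (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    *out = ((uint32_t)(s[0] & 0x07) << 18)
         | ((uint32_t)(s[1] & 0x3F) << 12)
         | ((uint32_t)(s[2] & 0x3F) << 6)
         | (uint32_t)(s[3] & 0x3F);
    return 4;
}

/* Fill cache[] with the code points of s and return how many were
 * written. The caller must provide room for at least nbytes entries. */
size_t utf32_cache_fill(const char *s, size_t nbytes, uint32_t *cache)
{
    size_t i = 0, n = 0;

    assert(cache != NULL);
    while (i < nbytes)
        i += utf8_decode_one((const unsigned char *)s + i, &cache[n++]);
    return n;
}
```

The cache costs 4 bytes per character on top of the UTF-8 bytes, which is the memory trade-off mentioned above.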

Jul 23 '15 15:07 omasanori

@omasanori @KeenS

Sounds nice. I don't mind changing the internal string representation as long as it still compiles in a freestanding environment. A UTF-32 sequence is good for programs that do heavy string modification, but I think such cases are no more than 1% of the total. Providing a separate structure for them is reasonable.

Jul 24 '15 07:07 nyuichi

@omasanori @KeenS

Even if it breaks the no-libc rule, it will probably be OK as long as we can switch it off with macros.
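
One way such a switch could look is sketched below. The macro `STR_USE_UTF32_CACHE` and the struct `str_chunk` are hypothetical names for illustration, not existing picrin identifiers. The sketch only relies on `<stddef.h>` and `<stdint.h>`, both of which are available in a freestanding environment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build switch, off by default so the default build
 * carries no extra fields or dependencies. */
#ifndef STR_USE_UTF32_CACHE
# define STR_USE_UTF32_CACHE 0
#endif

struct str_chunk {
    const char *utf8;   /* UTF-8 bytes, always present */
    size_t nbytes;      /* length of utf8 in bytes */
#if STR_USE_UTF32_CACHE
    uint32_t *utf32;    /* optional decoded cache, NULL until first use */
    size_t nchars;      /* number of code points in utf32 */
#endif
};
```

Any code that allocates or fills the cache would sit behind the same `#if`, so disabling the macro removes the feature entirely.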

Jul 24 '15 07:07 nyuichi
