Unicode support
Which Unicode encoding should we choose for chars and strings?
UTF-8
Pros: Backward-compatible with ASCII; no need to support both "narrow" and "wide" strings
Cons: Chars have variable width, which makes len, sizeof and indexing ambiguous; poor support on Windows
UTF-16
Pros: Native for Windows; fixed char width for characters in the Basic Multilingual Plane
Cons: Unnatural for Linux; incompatible with ASCII; characters outside the BMP require surrogate pairs, so the width is not truly fixed
UTF-32
Pros: Native for Linux; fixed char width; covers all of Unicode
Cons: Unnatural for Windows; incompatible with ASCII
We now have rudimentary support for UTF-8, as recent terminals and C runtime libraries on Windows 10 and Linux support the C.UTF-8 or similar locale strings. The string length returned by len() is in bytes, not in characters. Go does the same, though it is inconvenient.
fn main() {
    s := "Привет" + ',' + " мир!"
    printf("Строка: " + s + ", длина: " + repr(len(s)) + '\n')
}
...:~/umka-lang/umka_linux$ ./umka -locale C.UTF-8 ../test.um
Строка: Привет, мир!, длина: 21
On Windows, this feature is available under MSVC, but not under MinGW (older runtime?). It seems that the MSVC runtime is also buggy: scanf() fails to read non-ASCII UTF-8.
On Linux everything works as expected.
@marekmaskarinec Please notice the API change: umkaInit() now requires locale, which can be NULL.
Need to consider creating a module like utf8 in Go: https://pkg.go.dev/unicode/utf8
@marekmaskarinec Do Umka's printf() and scanf() work correctly with non-ASCII UTF-8 strings on Void Linux? Everything is fine on Ubuntu 20, but not on Windows 10.
This program:
fn main() {
s := ""
scanf("%s", &s)
printf("%s\n", repr([]char(s)))
printf("%s\n", s)
}
Produces this (input included):
🬀🬾
{ 0xFFFFFFF0 0xFFFFFF9F 0xFFFFFFAC 0xFFFFFF80 0xFFFFFFF0 0xFFFFFF9F 0xFFFFFFAC 0xFFFFFFBE 0x00 }
🬀🬾
I did not touch the locale.
@marekmaskarinec And what if you set -locale C.UTF-8?
It doesn't seem to work.
[ tests ]$ umka -locale C.UTF-8 test.um
Error test.um (1, 1): Cannot set locale
I think the characters I used for testing may not be valid UTF-8. Should I test with UTF-8 characters?
Here is a test with some Czech characters, which are UTF-8.
řášďéě
{ 0xFFFFFFC5 0xFFFFFF99 0xFFFFFFC3 0xFFFFFFA1 0xFFFFFFC5 0xFFFFFFA1 0xFFFFFFC4 0xFFFFFF8F 0xFFFFFFC3 0xFFFFFFA9 0xFFFFFFC4 0xFFFFFF9B 0x00 }
řášďéě
@marekmaskarinec Thank you. I doubt there are any characters in Unicode that cannot be encoded in UTF-8. And what does the Linux shell command locale -a print on your machine?
[ ~ ]$ locale -a
C
POSIX
en_GB.utf8
en_US.utf8
@marekmaskarinec When running utf8test.um on my Windows machine, I get
bytes: 9
characters: 4
▀: U+2580
€: U+20ac
$: U+24
¢: U+a2
whereas, according to expected.log, it should be
bytes: 6
characters: 2
▀: U+2580
€: U+20ac
I'm not sure that expected.log is correct.
Another problem is that when I print the output to the console rather than a file, the characters are interpreted as Windows-1251 instead of UTF-8:
bytes: 9
characters: 4
тЦА: U+2580
тВм: U+20ac
$: U+24
┬в: U+a2
But as I said in another place, this is probably a problem with the MinGW C runtime.
I'm not sure that expected.log is correct.
Yes. I added some additional characters, so expected.log is incorrect.
@marekmaskarinec I have tested utf8.um on a Cyrillic string. The behavior seems to be incorrect:
string: ▀€$¢
bytes: 9
characters: 4
▀: U+2580
€: U+20ac
$: U+24
¢: U+a2
string: Привет, мир!
bytes: 21
characters: 12
ҟ: U+49f
?: U+4c0
Ҹ: U+4b8
Ҳ: U+4b2
ҵ: U+4b5
?: U+4c2
,: U+2c
: U+20
Ҽ: U+4bc
Ҹ: U+4b8
?: U+4c0
!: U+21
A third-party UTF-8 encoder gives the following representation for "Привет, мир!":
\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\x2C\x20\xD0\xBC\xD0\xB8\xD1\x80\x21
@marekmaskarinec Two other things to consider:
r^ < 0x7f etc. Shouldn't it be r^ <= 0x7f?
1 << 8. Shouldn't it be 1 << 7?
I fixed those things, but with no effect. As far as I know, the problem is in getNextRune. Encoding works as intended.
Update: the problem might be with characters that have significant bits set to 1 in the first byte.
Update 2: it turns out it was a problem with the mask. I fixed it, and now all but two characters decode correctly.
@marekmaskarinec Are you going to commit the changes? Or do you hope to first figure out what has happened with the two remaining characters?
The changes are currently in my fork, in the branch utf8. I tried with one of the non-working letters, CYRILLIC CAPITAL LETTER ER. It generates 0x440, but the correct code point is 0x420. What I found out is that the byte I was getting was 0xd1, but it's supposed to be 0xd0.
I'm for UTF-8, to be honest; either that or UTF-32. But given the poor support for UTF-32, I'd choose UTF-8, since UTF-16 can't represent all characters in 2 bytes anyway. The nature of UTF-8 makes it opt-in: you can keep a plain ASCII string, but if you want to add a foreign character, it uses the 8th bit, which lets it avoid conflicting with ASCII.
@ishdx2 Yes, this is what I chose myself, but I had hoped for better support of UTF-8 by the C runtimes and consoles across platforms. On Linux the support is very good; on Windows it is not. MinGW has no UTF-8 locales at all, while MSVC supports them in printf(), but not in scanf(). This is weird.
I'm afraid you have to use the UTF-16 WinAPI functions.
UTF-8 is now supported by the utf8.um standard library module.
For Windows-specific console I/O problems, see #354.