
Unicode support

Open vtereshkov opened this issue 4 years ago • 20 comments

What Unicode to choose for chars and strings?

UTF-8
Pros: Backward-compatible with ASCII; no need to support both "narrow" and "wide" strings.
Cons: Chars have variable width, so len, sizeof and indexing are ambiguous. Poor support on Windows.

UTF-16
Pros: Native for Windows. Fixed char width within the Basic Multilingual Plane.
Cons: Unnatural for Linux. Incompatible with ASCII. Not all Unicode chars can be represented in a single 16-bit unit.

UTF-32
Pros: Native for Linux. Fixed char width. Complete Unicode supported.
Cons: Unnatural for Windows. Incompatible with ASCII.
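
To make the size trade-off concrete, here is a small Python sketch (Python is used only for illustration, it is not part of Umka). The helper name encoded_sizes is hypothetical. It reports how many bytes the same text takes in each of the three encodings, without a byte order mark.

```python
# Compare how many bytes the same text occupies in each encoding.
# encoded_sizes is a hypothetical helper name used only for this illustration.

def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (little-endian, no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# ASCII text, Cyrillic text, and one character outside the Basic Multilingual Plane.
samples = ["Hello", "Привет", "\U0001FB00"]

for text in samples:
    print(repr(text), encoded_sizes(text))
```

Note that the last sample needs four bytes in UTF-16 as well (a surrogate pair), so only UTF-32 has a truly fixed width per code point.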

vtereshkov avatar Jul 18 '21 00:07 vtereshkov

We now have rudimentary support for UTF-8, since recent terminals and C runtime libraries on Windows 10 and Linux accept C.UTF-8 or a similar locale string. The string length returned by len() is in bytes, not in characters. Go does the same, though it is inconvenient.

fn main() {
    s := "Привет" + ',' + " мир!"
    printf("Строка: " + s + ", длина: " + repr(len(s)) + '\n')
}

...:~/umka-lang/umka_linux$ ./umka -locale C.UTF-8 ../test.um
Строка: Привет, мир!, длина: 21 
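
For comparison, the difference between the byte length and the number of code points for the same string can be checked with a short Python sketch (illustration only; the variable names are arbitrary):

```python
# Byte length versus code point count for the string used in the Umka test.
s = "Привет" + "," + " мир!"

byte_len = len(s.encode("utf-8"))  # what a byte-oriented len() reports
char_len = len(s)                  # number of Unicode code points

print(byte_len, char_len)
```

Each Cyrillic letter takes two bytes in UTF-8, while the comma, the space and the exclamation mark take one byte each.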

On Windows, this feature is available under MSVC, but not under MinGW (older runtime?). The MSVC runtime also seems buggy: scanf() fails to read non-ASCII UTF-8.

On Linux everything works as expected.

vtereshkov avatar Jul 18 '21 21:07 vtereshkov

@marekmaskarinec Please note the API change: umkaInit() now requires a locale argument, which can be NULL.

vtereshkov avatar Jul 19 '21 00:07 vtereshkov

We should consider creating a module like Go's utf8: https://pkg.go.dev/unicode/utf8

vtereshkov avatar Jul 19 '21 09:07 vtereshkov

@marekmaskarinec Do Umka's printf() and scanf() work correctly with non-ASCII UTF-8 strings on Void Linux? Everything is fine on Ubuntu 20, but not on Windows 10.

vtereshkov avatar Jul 19 '21 12:07 vtereshkov

This program:

fn main() {
    s := ""
    scanf("%s", &s)
    printf("%s\n", repr([]char(s)))
    printf("%s\n", s)
}

Produces this (input included):

🬀🬾
{ 0xFFFFFFF0 0xFFFFFF9F 0xFFFFFFAC 0xFFFFFF80 0xFFFFFFF0 0xFFFFFF9F 0xFFFFFFAC 0xFFFFFFBE 0x00 } 
🬀🬾

I did not touch the locale.
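
The 0xFFFFFF.. prefixes in the dump look like what happens when a signed 8-bit char is widened to 32 bits before printing. That is an assumption about how repr() formats chars, not something stated in this thread. Under that assumption, keeping only the low byte of each value recovers the raw bytes, and a Python sketch can check whether they form valid UTF-8:

```python
# Values copied from the dump above, including the terminating zero.
printed = [0xFFFFFFF0, 0xFFFFFF9F, 0xFFFFFFAC, 0xFFFFFF80,
           0xFFFFFFF0, 0xFFFFFF9F, 0xFFFFFFAC, 0xFFFFFFBE, 0x00]

# Assumption: each value is a sign-extended byte, so the low 8 bits are the data.
raw = bytes(v & 0xFF for v in printed)

# Drop the trailing NUL and decode; this raises UnicodeDecodeError on invalid UTF-8.
text = raw.rstrip(b"\x00").decode("utf-8")
code_points = [f"U+{ord(c):04X}" for c in text]

print(raw.hex(" "), code_points)
```

The bytes decode cleanly to two four-byte characters, so the string was read and stored as valid UTF-8.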

marekmaskarinec avatar Jul 23 '21 07:07 marekmaskarinec

@marekmaskarinec And what if you set -locale C.UTF-8?

vtereshkov avatar Jul 23 '21 09:07 vtereshkov

It doesn't seem to work.

[ tests ]$ umka -locale C.UTF-8 test.um
Error test.um (1, 1): Cannot set locale

I think the characters I used to test aren't UTF-8. Should I test with UTF-8 characters?

marekmaskarinec avatar Jul 23 '21 09:07 marekmaskarinec

Here is a test with some Czech characters, which are UTF-8.

řášďéě
{ 0xFFFFFFC5 0xFFFFFF99 0xFFFFFFC3 0xFFFFFFA1 0xFFFFFFC5 0xFFFFFFA1 0xFFFFFFC4 0xFFFFFF8F 0xFFFFFFC3 0xFFFFFFA9 0xFFFFFFC4 0xFFFFFF9B 0x00 } 
řášďéě

marekmaskarinec avatar Jul 23 '21 09:07 marekmaskarinec

@marekmaskarinec Thank you. I doubt there are any characters in Unicode that cannot be encoded in UTF-8. And what does the Linux shell command locale -a print on your machine?

vtereshkov avatar Jul 23 '21 13:07 vtereshkov

[ ~ ]$ locale -a
C
POSIX
en_GB.utf8
en_US.utf8

marekmaskarinec avatar Jul 24 '21 06:07 marekmaskarinec

@marekmaskarinec When running utf8test.um on my Windows machine, I get

bytes: 9
characters: 4
▀: U+2580
€: U+20ac
$: U+24
¢: U+a2

whereas, according to expected.log, it should be

bytes: 6
characters: 2
▀: U+2580
€: U+20ac

I'm not sure that expected.log is correct.

Another problem is that when I print the output to the console rather than to a file, the characters are interpreted in a legacy Cyrillic code page (the output matches the OEM code page CP866 rather than Windows-1251) instead of UTF-8:

bytes: 9
characters: 4
тЦА: U+2580
тВм: U+20ac
$: U+24
┬в: U+a2

But as I said in another place, this is probably a problem with the MinGW C runtime.
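
The garbled console output can be reproduced by decoding the UTF-8 bytes with a single-byte Cyrillic code page. The Python sketch below uses a hypothetical helper, misread. It suggests that the output above corresponds to CP866 (the OEM console code page), while Windows-1251 would give different garbage. Which code page the console actually used is an inference from the output, not something confirmed in this thread.

```python
# misread is a hypothetical helper: encode text as UTF-8, then decode the
# bytes as if they were written in a single-byte legacy code page.
def misread(text, codepage):
    return text.encode("utf-8").decode(codepage, errors="replace")

for ch in ["▀", "€", "¢"]:
    print(ch, "cp866:", misread(ch, "cp866"), "cp1251:", misread(ch, "cp1251"))
```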

vtereshkov avatar Sep 04 '21 23:09 vtereshkov

I'm not sure that expected.log is correct.

Yes. I added some additional characters, so expected.log is out of date.

marekmaskarinec avatar Sep 05 '21 09:09 marekmaskarinec

@marekmaskarinec I have tested utf8.um on a Cyrillic string. The behavior seems to be incorrect:

string: ▀€$¢
bytes: 9
characters: 4
▀: U+2580
€: U+20ac
$: U+24
¢: U+a2

string: Привет, мир!
bytes: 21
characters: 12
ҟ: U+49f
?: U+4c0
Ҹ: U+4b8
Ҳ: U+4b2
ҵ: U+4b5
?: U+4c2
,: U+2c
 : U+20
Ҽ: U+4bc
Ҹ: U+4b8
?: U+4c0
!: U+21

A third-party UTF-8 encoder gives the following representation for "Привет, мир!":

\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\x2C\x20\xD0\xBC\xD0\xB8\xD1\x80\x21
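
Every wrong Cyrillic code point in the log is exactly 0x80 above the correct one (U+49f instead of U+41f, U+4c0 instead of U+440, and so on). That pattern would be consistent with a decoder that ORs in continuation bytes without clearing their 10xxxxxx marker bits. This is only a guess; the actual utf8.um code is not shown here. The Python sketch below uses a hypothetical decode function with a cont_mask parameter to compare both behaviours on the bytes above.

```python
# Hypothetical decoder used only to test the guess; it assumes well-formed
# input and performs no validation.
def decode(data, cont_mask):
    """Decode UTF-8 bytes to code points, masking continuation bytes with cont_mask."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:             # 0xxxxxxx: single byte
            cp, n = b, 1
        elif b >> 5 == 0b110:     # 110xxxxx: two bytes
            cp, n = b & 0x1F, 2
        elif b >> 4 == 0b1110:    # 1110xxxx: three bytes
            cp, n = b & 0x0F, 3
        else:                     # 11110xxx: four bytes
            cp, n = b & 0x07, 4
        for k in range(1, n):
            cp = (cp << 6) | (data[i + k] & cont_mask)
        out.append(cp)
        i += n
    return out

data = "Привет, мир!".encode("utf-8")

correct = decode(data, 0x3F)  # keep only the six payload bits
suspect = decode(data, 0xFF)  # marker bit 0x80 leaks into the result

print([hex(cp) for cp in correct])
print([hex(cp) for cp in suspect])
```

With the 0xFF mask the sketch reproduces the code points from the log. The string ▀€$¢ happens to decode correctly either way, because in those characters the leaked bit coincides with a bit that is already set.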

vtereshkov avatar Sep 05 '21 21:09 vtereshkov

@marekmaskarinec Two other things to consider:

  • r^ < 0x7f etc. Shouldn't it be r^ <= 0x7f?
  • 1 << 8. Shouldn't it be 1 << 7?
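
Assuming those expressions select the one-byte case, the inclusive bounds matter: 0x7f still fits in one byte, and 1 << 7 (0x80) is the first value that needs two. The Python sketch below is a hypothetical encode_rune, not the utf8.um code. It uses the inclusive limits 0x7f, 0x7ff and 0xffff and can be checked against Python's built-in encoder at each boundary.

```python
# Hypothetical encoder illustrating the inclusive range limits of UTF-8.
# It does not reject surrogates or values above 0x10FFFF.
def encode_rune(cp):
    if cp <= 0x7F:        # up to 7 bits: one byte
        return bytes([cp])
    if cp <= 0x7FF:       # up to 11 bits: two bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # up to 16 bits: three bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),          # up to 21 bits: four bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Check both sides of every boundary against the built-in encoder.
for cp in [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000]:
    print(hex(cp), encode_rune(cp).hex(" "), encode_rune(cp) == chr(cp).encode("utf-8"))
```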

vtereshkov avatar Sep 05 '21 23:09 vtereshkov

I fixed those things, but with no effect. As far as I know, the problem is in getNextRune. Encoding works as intended.

Update: the problem might be with characters that have significant bits set to 1 in the first byte.

Update 2: it turns out it was a problem with the mask. I fixed it, and now all but two characters decode correctly.

marekmaskarinec avatar Sep 06 '21 10:09 marekmaskarinec

@marekmaskarinec Are you going to commit the changes? Or do you hope to first figure out what has happened with the two remaining characters?

vtereshkov avatar Sep 09 '21 23:09 vtereshkov

The changes are currently in my fork, in the utf8 branch. I tried one of the letters that doesn't work, CYRILLIC CAPITAL LETTER ER. It generates 0x440, but the correct code point is 0x420. I found that the byte I was getting was d1, but it is supposed to be d0.
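
For reference, the two-byte encodings of the two code points involved can be worked out by hand. The Python sketch below uses a hypothetical helper, two_byte_encoding, and only shows the arithmetic; it does not claim to locate the bug. The lead byte carries the bits above the low six, so U+0420 yields d0 and U+0440 (CYRILLIC SMALL LETTER ER) yields d1.

```python
# Hypothetical helper: split a code point in the range 0x80..0x7FF into
# its UTF-8 lead byte (110xxxxx) and continuation byte (10xxxxxx).
def two_byte_encoding(cp):
    lead = 0xC0 | (cp >> 6)
    cont = 0x80 | (cp & 0x3F)
    return lead, cont

for name, cp in [("CYRILLIC CAPITAL LETTER ER", 0x420),
                 ("CYRILLIC SMALL LETTER ER", 0x440)]:
    lead, cont = two_byte_encoding(cp)
    print(f"{name}: U+{cp:04X} -> {lead:02x} {cont:02x}")
```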

marekmaskarinec avatar Sep 10 '21 12:09 marekmaskarinec

I'm for UTF-8, to be honest; either that or UTF-32. But given the poor support for UTF-32, I'd choose UTF-8, as UTF-16 can't represent all characters in 2 bytes anyway. The nature of UTF-8 makes it opt-in: you start with an ASCII string, and if you want, you add a non-ASCII character, which uses the 8th bit and therefore does not conflict with ASCII.

ske2004 avatar Oct 16 '21 20:10 ske2004

@ishdx2 Yes, this is what I chose myself, but I hoped for better support of UTF-8 by the C runtime and consoles across platforms. On Linux the support is very good; on Windows it is not. MinGW does not have UTF-8 locales at all, while MSVC supports them in printf(), but not in scanf(). This is weird.

vtereshkov avatar Oct 16 '21 21:10 vtereshkov

I'm afraid you have to use the UTF-16 WinAPI functions.

ske2004 avatar Sep 12 '22 05:09 ske2004

UTF-8 is now supported by the utf8.um standard library module.

For Windows-specific console I/O problems, see #354.

vtereshkov avatar Feb 24 '24 13:02 vtereshkov