CC-Tweaked
Prototype: unicode string support
This aims to address issue #860 about reading and writing unicode characters in terminals.
~~This pull request mostly adapts the first route in the discussion ("separate versions of methods for unicode").~~ (Edit: no longer valid since the commit on 12 July.)
Additions
- `utflib` api

  This api provides a `UTFString` "class" that wraps a utf8-encoded byte string and acts like a normal string. The functions provided in the standard `string` library, except `string.dump`, `string.pack`, `string.packsize` and `string.unpack`, are also provided on `UTFString`. Users can use this to adapt unicode strings into their old systems painlessly. If users want to get a Latin-1 string from a `UTFString`, they can use `UTFString:toLatin()`. Otherwise `tostring` will return the backing byte string.

  Besides `UTFString`, the module also exports the following functions:
  - `fromLatin(str)`: treat the string as fully Latin-1 and convert it to utf8. This is provided because `UTFString(str)` treats the string as already utf8-encoded, and only treats invalid byte subsequences as Latin-1 and converts them.
  - `isUTFString(v)`: return true if `v` is a `UTFString`.
  - `wrapStr(str)`: wrap a lua string so that normal strings can be compared with unicode strings.
  - `isStringWrapper(v)`: return true if `v` is a string wrapper from `wrapStr(str)`.
- `shell.unicode`, `edit.unicode` and `lua.unicode` settings

  These new settings allow the `shell`, `edit` and `lua` programs to receive and print unicode strings. They do not affect other programs, in particular user-defined programs.
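
If the API lands as described, usage might look something like this. This is a sketch based only on the names listed above; the module path in `require` and the exact semantics are assumptions, not part of this PR's documented surface:

```lua
local utflib = require "cc.utflib"  -- hypothetical module path

-- Wrap an already utf8-encoded byte string.
local s = utflib.UTFString("héllo")

-- Treat a byte string as pure Latin-1 and convert it to utf8.
local t = utflib.fromLatin("caf\xE9")

-- UTFString mirrors the standard string library, so existing code
-- should mostly keep working (presumably per-codepoint, not per-byte).
print(s:upper())
print(s:sub(1, 2))

-- Get a Latin-1 string back, or the backing utf8 bytes.
local latin = s:toLatin()
local bytes = tostring(s)

-- Helpers for interop with plain Lua strings.
assert(utflib.isUTFString(s))
local w = utflib.wrapStr("hello")
assert(utflib.isStringWrapper(w))
```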
Changes
- Terminal font texture now supports unicode characters by dynamically baking them. Currently it simply uses the GNU Unifont provided in vanilla Minecraft (also known as Legacy Unicode, `uniform`, or the "Force Unicode Font" option).
  - Monitors in TBO drawing mode currently do not support this, as it involves changing shader code.
- `TermMethods.write` and `TermMethods.blit` functions

  They now accept `UTFString` properly, and no longer write "table: 0x??????" on screen.
  - The current code checks for `UTFString` via a duck test, which does not look very nice, and seems to have a strong performance impact on the `edit` program when unicode text is present.
- `char` and `paste` events

  They now also send a utf8-encoded string as the second parameter.
- `read` function

  It now accepts `_bReadUnicode` as the 5th argument, indicating whether it should return a `UTFString` when `true`, or a normal string otherwise.
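
From a program's point of view, the changes above might combine roughly like this (hypothetical usage pieced together from the descriptions in this PR; `utflib` is assumed to be in scope):

```lua
-- term.write now accepts a UTFString directly instead of printing
-- "table: 0x??????".
term.write(utflib.UTFString("こんにちは"))

-- char/paste events now carry a utf8-encoded string as a second
-- parameter alongside the original one.
local _, ch, utf8Ch = os.pullEvent("char")

-- read() takes _bReadUnicode as a 5th argument to return a UTFString
-- instead of a plain string.
local line = read(nil, nil, nil, nil, true)
```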
Roadmap
- [x] Drafting apis
- [ ] Confirming api designs
- [ ] Writing test script <-- send help - I'm not familiar with designing test cases
Edit
2023-07-12: merged the separate unicode modules/functions back into their normal variants to reduce code duplication hell.
tbh i do not believe that this has any chance of being merged into CC:T. best of luck, but i feel like a lot of features like this have been rejected before because they would conflict with the mod's "feel"
I really hope that this can find some kind of compromise, because having Unicode support for a terminal kinda is something you'd expect. And it would make stuff so much easier.
Personally fully in support of such an addition. This is obviously an enormous change, but I believe the current status quo of only having latin + a few extra chars creates a pointless barrier of entry for people from different cultures and is not the way to go forward, even if this takes a while to polish out. I think the potential to slightly "break the feel", as dev1955 pointed out, is worth it to allow people unfamiliar with english to use the mod.
I don't think the ROM even supports multiple languages, so it sounds kinda pointless making it more accessible for newbies. Also, you still have to use Latin for Lua.
I'm not against this change as I use older versions anyway, but this still feels too drastic.
> I dont think the rom even supports multiple languages so it sounds kinda pointless in making it more accessible for newbies Also you still have to use Latin for Lua
I guess I worded this a bit poorly. I meant not that newbies would be able to set the ROM to their language or code in it, but that non-technical users could potentially interact with CC programs others made in their native tongue, e.g. a shop or a dashboard.
O_o That's a lot of new functions to add and import...
Thank you for looking into this. I realise this is a bit of a pain, but I think it probably makes sense to do this work in two stages/two PRs:
- Internal changes (rendering, `Terminal`, anything which isn't exposed to user-code): This is quite a complex change in itself, but I think what needs doing is (relatively) well understood. A couple of quick comments on what's here already:
  - We need to decide on what unicode features to support. I think my feeling is we should support double-wide/full-width forms, but not RTL or multi-codepoint graphemes (so anything involving ZWJ). That's consistent with what Minecraft does (and supporting RTL in a terminal seems a little terrifying).
  - It might be worth looking at alternative fonts to Minecraft's - as you mentioned in #860, the glyph sizes are different, so things may look a little weird. We might be able to do something using a mixture of X11's font (they have a 6x9 variant), Monocraft and one of Bitfont's, with Unifont as a fallback.

    It is going to be hard to maintain consistency with CC's existing font (after all, there are very few CJK characters which are recognisable at a 6x9 resolution!), but we can hopefully reduce the dissonance.
- User-facing changes (so `TermMethods`/`window`, then expanding out from there): This is going to be much harder to get right, and I suspect will take several iterations.

  I think the current implementation confirms my worst fears from https://github.com/cc-tweaked/CC-Tweaked/issues/860#issuecomment-881637896, in that you end up duplicating a tonne of code.

  It might be worthwhile putting together another version of the Lua-side changes based on a cut-down version of Jack's approach (https://github.com/cc-tweaked/CC-Tweaked/issues/860#issuecomment-881670038). If we convert `UTFString` to a Java-side userdata type, it should be possible to handle this type inside the Java methods too.

  I'm not entirely sure this is the right option either - there are still some questions about how we want to handle receiving strings (both in return values, and from events - I'm not convinced duplicating events is very nice). Like I said, this is going to require some iterations to get right :).
Quick response to point 1:

> We need to decide on what unicode features to support. I think my feeling is we should support double-wide/full-width forms, but not RTL or multi-codepoint graphemes (so anything involving ZWJ). That's consistent with what Minecraft does (and supporting RTL in a terminal seems a little terrifying).

Agree that we should support double-wide forms. Multi-codepoint graphemes screw things up so much, especially when trimming strings, so we could forget about them for now.
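
For what it's worth, double-wide support usually boils down to a wcwidth-style lookup plus width-aware trimming. A minimal sketch in plain Lua 5.3, not code from this PR - the ranges are an abbreviated subset of the UAX #11 East Asian Wide/Fullwidth blocks, and a real implementation would need the full tables:

```lua
-- Abbreviated wide-character ranges (subset of UAX #11 "W"/"F").
local wideRanges = {
  { 0x1100, 0x115F },  -- Hangul Jamo
  { 0x2E80, 0x303E },  -- CJK Radicals .. CJK Symbols and Punctuation
  { 0x3041, 0x33FF },  -- Hiragana .. CJK Compatibility
  { 0x4E00, 0x9FFF },  -- CJK Unified Ideographs
  { 0xAC00, 0xD7A3 },  -- Hangul Syllables
  { 0xF900, 0xFAFF },  -- CJK Compatibility Ideographs
  { 0xFF00, 0xFF60 },  -- Fullwidth Forms
}

-- Width of a single codepoint in terminal cells (1 or 2).
local function charWidth(cp)
  for _, r in ipairs(wideRanges) do
    if cp >= r[1] and cp <= r[2] then return 2 end
  end
  return 1
end

-- Trim a utf8 string to at most `cells` terminal cells, never
-- splitting a codepoint and never exceeding the cell budget.
local function trimToWidth(s, cells)
  local used, lastByte = 0, 0
  for pos, cp in utf8.codes(s) do
    local w = charWidth(cp)
    if used + w > cells then break end
    used = used + w
    lastByte = (utf8.offset(s, 2, pos) or #s + 1) - 1
  end
  return s:sub(1, lastByte)
end
```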
As for the cloning hell in point 2, it exists because we cannot distinguish when we expect a Latin-1 string and when we expect a utf8 string. I believe using a `UTFString` userdata could help eliminate most, if not all, of the duplication. However, it also needs a revamp of CC:T so that it supports userdata; for example, `CobaltLuaMachine:toObject` completely ignores userdata types.
Still, it should be better than making something like this:

```lua
term.setUtf(true)
term.write(...)
term.setUtf(false)
```

which could ruin subsequent calls if someone forgets to restore the state.
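
Even if such a toggle existed, every caller would need an error guard to restore the flag, which illustrates the problem (`term.setUtf` here is the hypothetical API from the snippet above, not a real function):

```lua
-- Every unicode write would need this kind of wrapper, because an
-- error between the two setUtf calls would leave the flag stuck on.
local function writeUtf(...)
  term.setUtf(true)
  local ok, err = pcall(term.write, ...)
  term.setUtf(false)  -- restore state even if the write errored
  if not ok then error(err, 0) end
end
```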
For the duplicated event issue, would it be better if we sent two params for `char` and `paste`: one normal string and one `UTFString`? The ambiguity around codepoints 128-255 - whether they should be sent Latin-1 or UTF-8 encoded - concerns me, and sending both of them reassures me.
> I'm not entirely sure this is the right option either - there's still some questions about how we want to handle receiving strings (both in return values, and from events - I'm not convinced duplicating events is very nice).
FWIW, my approach for events in my test was to use the same event name, but adding a second parameter with the UTFString variant. This allows programs that are Unicode-aware to grab the Unicode version without needing to duplicate events.
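
In other words, the event loop stays on the same event name, something like this (the handler names are placeholders, not real functions):

```lua
while true do
  -- Same "char" event; the UTFString variant rides along as an extra
  -- parameter, so unicode-unaware programs are unaffected.
  local _, ch, uch = os.pullEvent("char")
  if uch then
    handleUnicodeInput(uch)  -- hypothetical unicode-aware handler
  else
    handleInput(ch)          -- hypothetical legacy handler
  end
end
```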
I didn't do a good job at describing every change I made in my test back then, but you could take a look at the ROM patches for inspiration. I still think this is the most elegant solution, even if it ends up adding a bunch of extra Unicode options to functions.
Also, I don't really like the idea of making people interact with the UTF-8 representations of strings directly. It's really easy to slip up and end up putting it into a normal string function, which would destroy the codepoints and/or not function correctly. It's good that there's a usermode library to help, but IMO we shouldn't be exposing the raw encoding data to users unless they specifically ask for it (e.g. an encode method/serialization).
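
As a concrete example of that slip-up, in plain Lua 5.3 byte-indexed `string.sub` happily splits a codepoint, while the standard `utf8` library gives safe boundaries:

```lua
local s = "héllo"  -- "é" is two bytes in utf8

-- Naive byte slicing lands mid-codepoint: this yields "h" plus a
-- dangling lead byte rather than "hé".
print(s:sub(1, 2))

-- Codepoint-aware slicing via utf8.offset:
local lastByte = utf8.offset(s, 3) - 1  -- last byte of the 2nd char
print(s:sub(1, lastByte))  -- "hé"
```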
How's this going? It would be a genuinely great feature. I'm following the progress.