Better Unicode versions support

Open jerch opened this issue 5 years ago • 9 comments

Tracking issue for better Unicode version support.

Things to do:

  • create a build tool based on https://github.com/fluffos/fluffos/tree/master/src/thirdparty/widecharwidth to extract needed Unicode information for a specific version
  • automate version addon creation as far as possible with the build tool
  • option to treat ambiguous-width characters as double width

better Unicode support in general:

  • bidi and grapheme support
  • long run: conceptual work towards a better unicode handling in terminals via terminal-wg
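
As an illustration of the ambiguous-width option listed above, here is a minimal sketch. The option name `ambiguousIsWide`, the function `cellWidth`, and the tiny hand-picked range tables are assumptions made for this example only; a real table would be generated from the Unicode data files by the build tool.

```typescript
// Illustrative subset of East Asian Width data, not a complete table.
type CodeRange = [number, number]; // inclusive [first, last]

// Codepoints with East_Asian_Width "W" (wide).
const WIDE: CodeRange[] = [
  [0x1100, 0x115f], // Hangul Jamo initial consonants
  [0x4e00, 0x9fff], // CJK Unified Ideographs
];

// Codepoints with East_Asian_Width "A" (ambiguous).
const AMBIGUOUS: CodeRange[] = [
  [0x00a1, 0x00a1], // INVERTED EXCLAMATION MARK
  [0x2500, 0x254b], // box drawing
];

// Hypothetical option bag; xterm.js does not necessarily expose this name.
interface WidthOptions {
  ambiguousIsWide: boolean;
}

function inRanges(cp: number, ranges: CodeRange[]): boolean {
  return ranges.some(([first, last]) => cp >= first && cp <= last);
}

// Returns the number of cells for a printable codepoint.
// Zero-width handling is left out to keep the sketch short.
function cellWidth(cp: number, opts: WidthOptions): 1 | 2 {
  if (inRanges(cp, WIDE)) return 2;
  if (inRanges(cp, AMBIGUOUS)) return opts.ambiguousIsWide ? 2 : 1;
  return 1;
}
```

With `ambiguousIsWide: true`, a box drawing character such as U+2500 takes two cells, which is what applications running in a CJK locale often assume.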

jerch avatar Jan 08 '20 11:01 jerch

Just a reminder that we also need an option to treat ambiguous-width characters as double width.

thefallentree avatar Jan 10 '20 23:01 thefallentree

@thefallentree Woops, that was the initial reason to create the issue, thx for pointing out (moved it up).

jerch avatar Jan 10 '20 23:01 jerch

Does this issue address overlapping emoji problems in xterm terminals like the one in the following screenshot taken from VSCode integrated terminal? [image]

  • OS: Arch Linux 5.11.2-arch1-1
  • VSCode: 1.53.2
  • Emoji font: Noto Color Emoji (noto-fonts-emoji package from AUR)

gustavopch avatar Mar 03 '21 18:03 gustavopch

@gustavopch Yeah, ideally most of these issues would be handled by that. Still, there are corner cases where a proper result cannot be determined: if the app side and the terminal host system have different ideas about the width of a certain codepoint, we have to prioritize one side. Here we would have to go with the app side by default, so we don't get out of sync with the app's line length assumptions.

This is mostly related to different Unicode versions being used on either side, and currently there is simply no way to harmonize the app and terminal side in this regard. Not sure if we will ever see a good solution to that (it would need a really advanced Unicode terminal interface). Also note that most local terminals do not suffer as much as xterm.js does (due to its decoupled terminal view <--> PTY host nature).

jerch avatar Mar 05 '21 14:03 jerch

I've finally done the legwork of making a widechar_wcwidth() JS version, directly generated from the newly released Unicode 14

see https://github.com/fluffos/fluffos/blob/master/src/www/widechar_width.js

and see my xterm.js Unicode 14 provider

https://github.com/fluffos/fluffos/blob/master/src/www/xterm-addon-unicode14.js
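
For readers unfamiliar with what such a provider looks like, here is a rough sketch. It is not the code behind the links above. The interface `UnicodeVersionProviderLike` is a local stand-in that is assumed to mirror the shape of xterm.js' `IUnicodeVersionProvider` (a `version` string plus a `wcwidth` method); the table values, the class name, and the version label are made up for illustration.

```typescript
// Local stand-in so the sketch compiles without xterm installed.
// Assumption: this matches the provider shape xterm.js expects.
interface UnicodeVersionProviderLike {
  readonly version: string;
  wcwidth(codepoint: number): 0 | 1 | 2;
}

// [first, last, width], sorted and non-overlapping. Illustrative subset only.
type WidthRange = [number, number, 0 | 2];

const TABLE: WidthRange[] = [
  [0x0300, 0x036f, 0], // combining diacritical marks
  [0x1100, 0x115f, 2], // Hangul Jamo initial consonants
  [0x4e00, 0x9fff, 2], // CJK Unified Ideographs
  [0x1f600, 0x1f64f, 2], // emoticons
];

// Binary search over the sorted ranges; anything not listed is width 1.
function lookupWidth(cp: number): 0 | 1 | 2 {
  let lo = 0;
  let hi = TABLE.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [first, last, width] = TABLE[mid];
    if (cp < first) hi = mid - 1;
    else if (cp > last) lo = mid + 1;
    else return width;
  }
  return 1;
}

class SketchUnicodeProvider implements UnicodeVersionProviderLike {
  public readonly version = '14-sketch'; // hypothetical label

  public wcwidth(codepoint: number): 0 | 1 | 2 {
    // C0 and C1 control characters occupy no cell.
    if (codepoint < 0x20 || (codepoint >= 0x7f && codepoint < 0xa0)) return 0;
    return lookupWidth(codepoint);
  }
}
```

In a real terminal, a provider like this would presumably be registered the same way the existing Unicode 11 addon does it, via `terminal.unicode.register(...)` followed by setting `terminal.unicode.activeVersion`.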

thefallentree avatar Oct 22 '21 22:10 thefallentree

@thefallentree :+1: Wow, that's nice. Care to set up a build script we can use in the xterm.js repo that grabs the width definitions from the official Unicode resources?

For v13 and v14 support I think we should get the grapheme clustering done in xterm.js. For that we'd need an extension of the Unicode provider to also provide the segmentation tables and logic. So if you want to work on that, feel free to create a PR.

jerch avatar Oct 25 '21 13:10 jerch

> @thefallentree 👍 Wow, that's nice. Care to set up a build script we can use in the xterm.js repo that grabs the width definitions from the official Unicode resources?
>
> For v13 and v14 support I think we should get the grapheme clustering done in xterm.js. For that we'd need an extension of the Unicode provider to also provide the segmentation tables and logic. So if you want to work on that, feel free to create a PR.

The generation script is in https://github.com/fluffos/widecharwidth/blob/master/generate.py . It's probably overkill to run this at each build; I think we just need to check in the generated file. Will you accept a PR adding an official Unicode 14 addon? Does it have to be in TypeScript?

I'm not sure the UAX #29 segmentation rules are really what we want here: do we want to restrict each grapheme cluster to a single cell (one or two wide)? If we do that, it also requires changes to wcwidth(): instead of returning the width of a single codepoint, it must now return the width of a grapheme cluster, and I have not seen an implementation in any language that does this.

I propose a simpler solution. From what I understand of https://unicode.org/reports/tr29/#GB9 , just special-casing ZWJ (U+200D ZERO WIDTH JOINER) would solve 90% of our current problems, and we could wait for feedback on whether implementing full UAX #29 is useful or not.
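
To make the ZWJ proposal concrete, here is a minimal sketch of what such special-casing could look like. The function names and the stand-in `baseWidth` are hypothetical, and the emoji range used is a rough approximation rather than real Unicode data. Input is a list of codepoints to keep the example independent of string decoding.

```typescript
const ZWJ = 0x200d; // ZERO WIDTH JOINER

// Stand-in per-codepoint width; covers only what the example needs.
function baseWidth(cp: number): 0 | 1 | 2 {
  if (cp === ZWJ) return 0;
  if (cp >= 0xfe00 && cp <= 0xfe0f) return 0; // variation selectors
  if (cp >= 0x1f300 && cp <= 0x1faff) return 2; // rough emoji range (assumption)
  return 1;
}

// Sums widths, but gives no extra cells to a codepoint that directly
// follows a ZWJ, so a joined emoji sequence keeps the width of its first part.
function widthWithZwjRule(codepoints: number[]): number {
  let total = 0;
  let joinedToPrevious = false;
  for (const cp of codepoints) {
    if (cp === ZWJ) {
      joinedToPrevious = true;
      continue;
    }
    if (!joinedToPrevious) total += baseWidth(cp);
    joinedToPrevious = false;
  }
  return total;
}

// MAN, ZWJ, WOMAN, ZWJ, GIRL: the "family" emoji sequence.
const family = [0x1f468, ZWJ, 0x1f469, ZWJ, 0x1f467];
const familyWidth = widthWithZwjRule(family); // 2 cells instead of 6
```

Note that this rule alone does not cover sequences without a ZWJ, such as flags built from two regional indicators, which is part of the argument for full grapheme segmentation.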

thefallentree avatar Oct 25 '21 16:10 thefallentree

> The generation script is in https://github.com/fluffos/widecharwidth/blob/master/generate.py . It's probably overkill to run this at each build; I think we just need to check in the generated file.

Oh sweet, thx. Yes, it should not run on every CI build. Instead we have a fixtures/ and a bin/ folder for stuff that might run occasionally, or for setup/cleanup tasks. Imho we could put the generator scripts there and permanently copy the output over to the source destination. I would still want the generator in the repo, as it is always a nightmare to fetch things from (prolly gone) remote sites if something needs to be fixed.
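
As a rough idea of what an in-repo generator could do, the sketch below parses text in the `EastAsianWidth.txt` format (`code;property # comment` or `first..last;property # comment`) and merges adjacent wide/fullwidth entries into ranges. It is not the linked generate.py; the function and type names are hypothetical, and fetching the file and emitting source code are left out.

```typescript
interface WidthEntry {
  first: number;
  last: number;
  property: string; // e.g. "W", "F", "A", "Na", "N", "H"
}

// Parses the data-file text into entries, skipping comments and blank lines.
function parseEastAsianWidth(text: string): WidthEntry[] {
  const entries: WidthEntry[] = [];
  for (const rawLine of text.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // drop trailing comment
    if (line === '') continue;
    const [codes, property] = line.split(';').map(part => part.trim());
    if (!codes || !property) continue;
    const [firstHex, lastHex] = codes.split('..');
    const first = parseInt(firstHex, 16);
    const last = lastHex === undefined ? first : parseInt(lastHex, 16);
    entries.push({ first, last, property });
  }
  return entries;
}

// Keeps only W and F entries and merges directly adjacent ones.
// Assumes the entries are sorted by codepoint, as they are in the data file.
function wideRanges(entries: WidthEntry[]): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  for (const entry of entries) {
    if (entry.property !== 'W' && entry.property !== 'F') continue;
    const previous = ranges[ranges.length - 1];
    if (previous !== undefined && previous[1] + 1 === entry.first) {
      previous[1] = entry.last; // extend the previous range
    } else {
      ranges.push([entry.first, entry.last]);
    }
  }
  return ranges;
}
```

The resulting ranges could then be written out once as a checked-in table, so normal builds never need network access.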

> Will you accept a PR adding an official Unicode 14 addon? Does it have to be in TypeScript?

Yes ofc, and yes, for the base repo it should be in TypeScript. (I think it is not that hard to convert, as it is only about adding already known type declarations.)

> I'm not sure the UAX #29 segmentation rules are really what we want here: do we want to restrict each grapheme cluster to a single cell (one or two wide)? If we do that, it also requires changes to wcwidth(): instead of returning the width of a single codepoint, it must now return the width of a grapheme cluster, and I have not seen an implementation in any language that does this.

We def. want to have proper grapheme segmentation (for clusters), as some terminal emulators (TEs) have already started to support them. The idea is to map most pictographic clusters (flags and emojis) to 2 cells (wide), and older ones like combining chars to either one or two cells (depending on the "base" codepoint). Foreign/ancient scripting systems with graphemes will still not be solved by that (as they can have multiples of quarter/half/full widths), but there are ideas to overcome that later on with an explicit sequence requesting the actual width for a sequence of codepoints from the terminal. But that's still in the future. You can read about these ideas in https://github.com/contour-terminal/contour/issues/404.
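
The mapping idea described above could be sketched roughly as follows, assuming segmentation has already produced a cluster as a list of codepoints. The names `clusterWidth` and `standInWcwidth` are hypothetical, and the "pictographic" test is a coarse range check rather than the real emoji property data.

```typescript
type CellWidth = 0 | 1 | 2;

// Regional indicator symbols, which pair up to form flags.
function isRegionalIndicator(cp: number): boolean {
  return cp >= 0x1f1e6 && cp <= 0x1f1ff;
}

// Coarse approximation of "pictographic" (assumption, not real property data).
function looksPictographic(cp: number): boolean {
  return cp >= 0x1f300 && cp <= 0x1faff;
}

// Stand-in for a per-codepoint wcwidth; only covers the examples below.
function standInWcwidth(cp: number): CellWidth {
  if (cp >= 0x0300 && cp <= 0x036f) return 0; // combining diacritical marks
  if (cp >= 0x4e00 && cp <= 0x9fff) return 2; // CJK Unified Ideographs
  return 1;
}

// Flags and emoji clusters take 2 cells; every other cluster takes
// the width of its base (first) codepoint.
function clusterWidth(
  cluster: number[],
  wcwidth: (cp: number) => CellWidth
): CellWidth {
  if (cluster.length === 0) return 0;
  if (cluster.some(cp => isRegionalIndicator(cp) || looksPictographic(cp))) {
    return 2;
  }
  return wcwidth(cluster[0]);
}

const flagWidth = clusterWidth([0x1f1e9, 0x1f1ea], standInWcwidth); // two regional indicators
const accentedWidth = clusterWidth([0x65, 0x0301], standInWcwidth); // "e" + combining acute
```

Under this policy the width lookup works per cluster rather than per codepoint, which is the change to wcwidth() raised in the quoted question.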

> I propose a simpler solution. From what I understand of https://unicode.org/reports/tr29/#GB9 , just special-casing ZWJ (U+200D ZERO WIDTH JOINER) would solve 90% of our current problems, and we could wait for feedback on whether implementing full UAX #29 is useful or not.

Imho special-casing things here does not help anyone and just creates more friction. Back in 2018 I already started a PR for it (#1478) for ~v11~ v10, which could be used to get some ideas, but would need serious rework.

jerch avatar Oct 25 '21 16:10 jerch

If someone wants to implement extended grapheme clusters, you might find my unicode-properties library useful. In a single efficient lookup it can return both character widths (single, double, or ambiguous) and character classes (for determining grapheme cluster boundaries). This is used by DomTerm - see second screenshot on this page. It would not be difficult to extend unicode-properties to report additional character attributes.
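
One way a single lookup can carry both pieces of information is to pack them into one integer. The sketch below only illustrates that general idea; the constants, bit layout, and function names are invented here and are not the actual API or encoding of unicode-properties.

```typescript
// Hypothetical width classes (stored in the low 2 bits).
const WIDTH_NARROW = 0;
const WIDTH_WIDE = 1;
const WIDTH_AMBIGUOUS = 2;

// Hypothetical grapheme break classes (stored in the bits above).
const GB_OTHER = 0;
const GB_ZWJ = 1;
const GB_REGIONAL_INDICATOR = 2;

function pack(width: number, breakClass: number): number {
  return (breakClass << 2) | width;
}

function widthOf(packed: number): number {
  return packed & 3;
}

function breakClassOf(packed: number): number {
  return packed >> 2;
}

// One lookup returns both properties. A real implementation would use a
// generated table or trie instead of this handful of hard-coded cases.
function properties(cp: number): number {
  if (cp === 0x200d) return pack(WIDTH_NARROW, GB_ZWJ);
  if (cp >= 0x1f1e6 && cp <= 0x1f1ff) return pack(WIDTH_NARROW, GB_REGIONAL_INDICATOR);
  if (cp >= 0x4e00 && cp <= 0x9fff) return pack(WIDTH_WIDE, GB_OTHER);
  if (cp === 0x00a1) return pack(WIDTH_AMBIGUOUS, GB_OTHER);
  return pack(WIDTH_NARROW, GB_OTHER);
}
```

A caller would decode the result with `widthOf` and `breakClassOf`, so width calculation and cluster boundary detection share one table access per codepoint.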

Once you determine cluster boundaries, it should not be so difficult to draw them if using either a dom-based or canvas-based renderer, since you can delegate to the browser's text rendering.

I do think ultimately terminals need to support variable-width fonts, but this is a different and complicated discussion. (I've been pondering this for a while.)

PerBothner avatar Feb 11 '22 17:02 PerBothner