kitty cursor location seems incorrect following zero-width-joined emoji

cursor location seems incorrect following zero-width-joined emoji

Open dankamongmen opened this issue 2 years ago • 12 comments

Describe the bug Emoji joined via a ZWJ display properly, but the cursor is moved too far. An example:

[schwarzgerat](0) $ printf '\e[6n\U0001f9d1\u200d\U0001f33e\e[6n'
🧑‍🌾  [schwarzgerat](0) $ 1R5R

as you can see, the cursor report indicates that we have moved four -- which is accurate, as we have indeed moved four cursor positions forward. we only ought have moved two.

FWIW, it's nice that kitty actually implements this; it's ahead of most terminals in this regard.

To Reproduce Steps to reproduce the behavior:

Get a cursor report
Print an emoji created via ZWJ (there might be some that work; the example above does not)
Get a cursor report
Look and see that spaces have been printed where no spaces were desired

Screenshots 2021-07-07-102248_781x659_scrot

Environment details

kitty 0.21.2 (e07ba2c53d) created by Kovid Goyal
Linux schwarzgerat 5.12.14nlb #1 SMP Sun Jul 4 17:05:42 EDT 2021 x86_64
Debian GNU/Linux 11 \n \l
Running under:X11
Loaded config files:
  /etc/xdg/kitty/kitty.conf
  /home/dank/.config/kitty/kitty.conf

Config options different from defaults:
background_opacity    0.7
enable_audio_bell     False
font_family           Hack
initial_window_height (72, 'cells')
initial_window_width  (132, 'cells')
scrollback_lines      20000
update_check_interval 0.0

This debug output has been copied to the clipboard

Additional context Happens the same way.

Jul 07 '21 14:07 dankamongmen

Yeah it's on my TODO list to add the ~1400 such sequences from https://unicode.org/emoji/charts/emoji-zwj-sequences.html to gen-wcwidth.py, generate some kind of fast lookup data structure for them and then use them in kitty's wcswidth() and text processing functions, however its a fairly large task with very minimal payoff since that still wont actually make them use able until some terminal programs add support for them in their wcswidth() implementations. In fact the current cursor movement in kitty actually matches most shells and editors expectations, changing that is likely to actually break more things in the short term.

If you wish to add support for them PRs are welcome.

More general and complete grapheme clustering: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

Jul 07 '21 14:07 kovidgoyal

yep, my assessment is similar. i'm looking at trying to implement unicode faithfully in this regard, at which point i'll start seeing which terminals match it. wcswidth() definitely breaks down; i think a big piece of the puzzle here would be a self-contained low-level, fast, minimal-memory EGC-to-width map. if kitty already has a fairly advanced one, how difficult would it be to extract and package up as an .so (by difficult i don't so much mean technical aspects, but more would you be interested in moving that external to kitty, and who would be maintaining it, etc)?

Jul 07 '21 14:07 dankamongmen

Yes, that's another item for my endless TODO list. It's never been worth the trouble for me to actually do the work in the face of higher priority items, but if you wish to collaborate on it, I am happy to be a co-maintainer.

I would be willing to donate kitty's gen-wcwidth.py to such a project, and to change kitty to depend on the .so produced by such a project. The caveat of course is that gen-wcwidth.py is used for a lot more than just wcswidth(), so the project would either have to become a bit of a grab-bag of functions useful for terminal text processing, or else it would become standalone and we would need to keep kitty and the projects wcswidth() functions in sync by hand.

Jul 07 '21 15:07 kovidgoyal

Sorry, I don't know the technical details, but I have an issue. When I type (copy paste) the emoji with ZWJ sequence, it does not print as a single emoji. technologist emoji Does my issue related to this issue? Thanks.

Step to reproduce:

go to https://emojipedia.org/technologist/ and copy (to the clipboard) the emoji
edit text file in terminal using file editor (nvim and go to insert mode)
paste to the terminal from clipboard (Ctrl+Shift+V)
the emoji shown as the image above

Debug Details

kitty 0.24.2 created by Kovid Goyal
Linux ROG 5.16.10-arch1-1 #1 SMP PREEMPT Wed, 16 Feb 2022 19:35:18 +0000 x86_64
Arch Linux 5.16.10-arch1-1 (/dev/tty)

Running under: X11
Frozen: False
Paths:
  kitty: /usr/bin/kitty
  base dir: /usr/lib/kitty
  extensions dir: /usr/lib/kitty/kitty
  system shell: /usr/bin/zsh
Loaded config files:
  /home/latipun/.config/kitty/kitty.conf

Config options different from defaults:
background_opacity            0.75
cursor_shape                  3
font_size                     13.0
scrollback_pager_history_size 19922944
window_padding_width          FloatEdges(left=7.0, top=7.0, right=7.0, bottom=7.0)
Changed shortcuts:
	ctrl+shift+delete → combine : clear_terminal active : send_text normal \x0c
Colors:
	active_border_color           #c9cbff   
	active_tab_background         #575268   
	active_tab_foreground         #f5c2e7   
	background                    #1e1e2e   
	bell_border_color             #fae3b0   
	color0                        #6e6c7e   
	color1                        #f28fad   
	color10                       #abe9b3   
	color11                       #fae3b0   
	color12                       #96cdfb   
	color13                       #f5c2e7   
	color14                       #89dceb   
	color15                       #d9e0ee   
	color2                        #abe9b3   
	color3                        #fae3b0   
	color4                        #96cdfb   
	color5                        #f5c2e7   
	color6                        #89dceb   
	color7                        #d9e0ee   
	color8                        #988ba2   
	color9                        #f28fad   
	cursor                        #f5e0dc   
	cursor_text_color             #1e1e2e   
	foreground                    #d9e0ee   
	inactive_border_color         #575268   
	inactive_tab_background       #1e1e2e   
	inactive_tab_foreground       #d9e0ee   
	mark1_background              #96cdfb   
	mark1_foreground              #1e1e2e   
	mark2_background              #f5c2e7   
	mark2_foreground              #1e1e2e   
	mark3_background              #b5e8e0   
	mark3_foreground              #1e1e2e   
	selection_background          #575268   
	selection_foreground          #d9e0ee   
	tab_bar_background            #161320   
	url_color                     #f5e0dc   

Environment variable names seen by the kitty process:
	ALTUSERXSESSION
	BROWSER
	CARGO_HOME
	DBUS_SESSION_BUS_ADDRESS
	DESKTOP_SESSION
	DISPLAY
	EDITOR
	ERRFILE
	FNM_DIR
	FORCE_COLOR
	GDMSESSION
	GH_TOKEN
	GOPATH
	GPG_TTY
	GTK2_RC_FILES
	GTK_MODULES
	GTK_RC_FILES
	HOME
	HOMEBREW_GITHUB_API_TOKEN
	LANG
	LC_COLLATE
	LC_CTYPE
	LC_MEASUREMENT
	LC_NUMERIC
	LC_TELEPHONE
	LESS
	LOGNAME
	MAIL
	MANPAGER
	MOTD_SHOWN
	NNN_ARCHIVE
	NNN_COLORS
	NNN_FCOLORS
	NNN_FIFO
	NNN_OPTS
	NPM_TOKEN
	PATH
	PWD
	QT_QPA_PLATFORMTHEME
	RUSTUP_HOME
	SHELL
	SHLVL
	USER
	USERXSESSION
	USERXSESSIONRC
	VISUAL
	VSCE_PAT
	XAUTHORITY
	XDG_CACHE_HOME
	XDG_CONFIG_HOME
	XDG_DATA_HOME
	XDG_GREETER_DATA_DIR
	XDG_RUNTIME_DIR
	XDG_SEAT
	XDG_SEAT_PATH
	XDG_SESSION_CLASS
	XDG_SESSION_DESKTOP
	XDG_SESSION_ID
	XDG_SESSION_PATH
	XDG_SESSION_TYPE
	XDG_STATE_HOME
	XDG_VTNR

Feb 25 '22 17:02 latipun7

This seems related to my issue, so I'm posting here. I just switched to gnome-terminal which has inconsistent rendering of emojis. At first I was pleased to see that kitty renders them properly, but then I noticed that the cursor positioning is problematic. As it stands, this is actually less usable than gnome-terminal.

https://user-images.githubusercontent.com/482367/158066201-131e8cb2-d788-44f9-bfa8-64d39ce5a190.mp4

Mar 13 '22 15:03 0xallie

That will be because whatever terminal program you are running is using a different width calculation from what kitty uses, which is based on the unicde standard. And unless your emoji is using zwj it is completely unrelated so post elsewhere.

Mar 13 '22 16:03 kovidgoyal

Sorry, I don't know the technical details, but I have an issue. When I type (copy paste) the emoji with ZWJ sequence, it does not print as a single emoji. Does my issue related to this issue? Thanks.

@latipun7: No, this is caused by zsh not supporting ZWJ. If you run cat and paste it you'll see that it works fine (except for the cursor problem this issue is about).

Mar 13 '22 17:03 trygveaa

@kovidgoyal @dankamongmen Just found this bug after working on a unicode library designed to handle this sort of thing. It winds up being fairly easy, and AFAICT my solution should be decently efficient (binary encoding of a codepoint-based Trie). The data for just emoji sequences (both zwj and otherwise) ends up at ~48kb, with a decent amount of extra data that kitty probably wouldn't need.

If there's interest, I'd be happy to provide suggestions for how it could be implemented in C

Jun 13 '22 15:06 sharnoff

On Mon, Jun 13, 2022 at 08:56:55AM -0700, Max Sharnoff wrote:

@kovidgoyal @dankamongmen Just found this bug after working on a unicode library designed to handle this sort of thing. It winds up being fairly easy, and AFAICT my solution should be decently efficient (binary encoding of a codepoint-based Trie). The data for just emoji sequences (both zwj and otherwise) ends up at ~48kb, with a decent amount of extra data that kitty probably wouldn't need.

If there's interest, I'd be happy to provide suggestions for how it could be implemented in C

Sure, I am always happy to discuss design ideas. By this sort of thing do you mean wcswidth() in general or looking up emoji combining sequences in particular. For zwj+emoji support in kitty one needs basically:

Adding zwj+emoji support to wcswidth_step() which basically tells you how the width of a string changes when you add a codepoint to it.
Changing kitty's cell data structure to support infinite length codepoint strings. This will likely be a auxilliary hash mapping shorts to heap allocated codepoint arrays. The shorts will be stored per cell. This will reduce cell memory usage by 4 bytes at the cost of making looking up the text in a cell more expensive (which is fortunately not a frequent operation).

Jun 13 '22 16:06 kovidgoyal

do you mean wcswidth() in general or looking up emoji combining sequences in particular

I was looking just at emoji sequences -- given that they're all suppposed to have the same width (in practice, font combinations can mess this up IIRC - e.g., defaulting to text presentation when the Emoji spec says otherwise, which is often a width of 1 column).

For adding support:

I was separating into grapheme clusters first, but the operation is mostly the same. If I were implementing this for kitty, I'd use the same sort of trie of codepoints, where each node stores (a) whether the sequence up to that point is one of the zwj sequences, or (b) how wide it would be otherwise. Changing wcswidth_step() gets a little tricky, because maybe it's implicitly expected not to decrease the width. Another tricky spot is handling cases like "this is most of a long zwj sequence, but it's missing the end so it's actually mutiple separate emoji now" -- there's a bit of extra work if those need to be separated into cells in post (and possibly re-parsed for new zwj sequence starters - I'm not sure whether any cases of this are possible).

I don't already know anything about kitty's cell datastructure, but I'd just add: if you're only using the listed zwj sequences, the longest sequence in the set is currently only 10 codepoints -- it doesn't need to be unbounded. But there are a small enough number of them (currently 1349) that packing them into existing empy space in the datatypes may be possible.

Jun 13 '22 21:06 sharnoff

On Mon, Jun 13, 2022 at 02:27:19PM -0700, Max Sharnoff wrote:

do you mean wcswidth() in general or looking up emoji combining sequences in particular

I was looking just at emoji sequences -- given that they're all suppposed to have the same width (in practice, font combinations can mess this up IIRC - e.g., defaulting to text presentation when the Emoji spec says otherwise, which is often a width of 1 column).

In a terminal context width calculations must be font independent.

For adding support:

I was separating into grapheme clusters first, but the operation is mostly the same. If I were implementing this for kitty, I'd use the same sort of trie of codepoints, where each node stores (a) whether the sequence up to that point is one of the zwj sequences, or (b) how wide it would be otherwise. Changing wcswidth_step() gets a little tricky, because maybe it's implicitly expected not to decrease the width. Another tricky spot is handling cases like "this is most of a long zwj sequence, but it's missing the end so it's actually mutiple separate emoji now" -- there's a bit of extra work if those need to be separated into cells in post (and possibly re-parsed for new zwj sequence starters - I'm not sure whether any cases of this are possible).

wcswidth_step() must be able to reduce width to support VS16 which converts emoji to text presentation. So that's not an issue.

I don't already know anything about kitty's cell datastructure, but I'd just add: if you're only using the listed zwj sequences, the longest sequence in the set is currently only 10 codepoints -- it doesn't need to be unbounded. But there are a small enough number of them (currently 1349) that packing them into existing empy space in the datatypes may be possible.

There is no empty space in the struct at the moment, however given the small number of such sequences one could encode them into the existing codepoint to mark mapping (kitty maps all combining unicode marks to numbers since there are only about 2000 odd of them they fit in a short. One could easily add another 2000 marks to this mapping. However, the question is how future proof this is and whether its not better to just allow infinite length sequences now.

Jun 14 '22 02:06 kovidgoyal

Sure, I am always happy to discuss design ideas. By this sort of thing do you mean wcswidth() in general or looking up emoji combining sequences in particular. For zwj+emoji support in kitty one needs basically: 1) Adding zwj+emoji support to wcswidth_step() which basically tells you how the width of a string changes when you add a codepoint to it. 2) Changing kitty's cell data structure to support infinite length codepoint strings. This will likely be a auxilliary hash mapping shorts to heap allocated codepoint arrays. The shorts will be stored per cell. This will reduce cell memory usage by 4 bytes at the cost of making looking up the text in a cell more expensive (which is fortunately not a frequent operation).

fwiw @sharnoff notcurses would need pretty much the exact same thing as @kovidgoyal mentions for his wcswidth_step(), except it would be in utf8_egc_len(). notcurses already handles arbitrarily long EGCs via spillover to the egcpool structure, of which there is one per ncplane.

Jun 16 '22 11:06 dankamongmen

kitty kitty copied to clipboard

cursor location seems incorrect following zero-width-joined emoji

kitty
kitty copied to clipboard