kitty
cursor location seems incorrect following zero-width-joined emoji
Describe the bug Emoji joined via a ZWJ display properly, but the cursor is moved too far. An example:
[schwarzgerat](0) $ printf '\e[6n\U0001f9d1\u200d\U0001f33e\e[6n'
🧑🌾 [schwarzgerat](0) $ 1R5R
As you can see, the cursor report indicates that we have moved four columns, which is accurate: we have indeed moved four cursor positions forward. We ought to have moved only two.
FWIW, it's nice that kitty actually implements this; it's ahead of most terminals in this regard.
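For anyone wondering where the four comes from, here is a minimal sketch (not kitty's code) that sums per-codepoint widths the way a wcwidth()-style loop would. The helper name naive_width is made up for illustration, and it leans on Python's unicodedata tables rather than kitty's own.

```python
import unicodedata

def naive_width(text: str) -> int:
    """Sum per-codepoint widths, ignoring any joining between codepoints."""
    total = 0
    for ch in text:
        # Combining marks and format characters (such as ZWJ) occupy no cell.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # East Asian Wide/Fullwidth codepoints occupy two cells, others one.
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

# Same sequence as the printf example: person + ZWJ + sheaf of rice.
farmer = "\U0001f9d1\u200d\U0001f33e"
print(len(farmer), naive_width(farmer))
```

The two wide emoji each count for two columns and the ZWJ for none, giving four, whereas the joined glyph is drawn in two.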
To Reproduce Steps to reproduce the behavior:
- Get a cursor report
- Print an emoji created via ZWJ (there might be some that work; the example above does not)
- Get a cursor report
- Look and see that spaces have been printed where no spaces were desired
Screenshots
Environment details
kitty 0.21.2 (e07ba2c53d) created by Kovid Goyal
Linux schwarzgerat 5.12.14nlb #1 SMP Sun Jul 4 17:05:42 EDT 2021 x86_64
Debian GNU/Linux 11 \n \l
Running under:X11
Loaded config files:
/etc/xdg/kitty/kitty.conf
/home/dank/.config/kitty/kitty.conf
Config options different from defaults:
background_opacity 0.7
enable_audio_bell False
font_family Hack
initial_window_height (72, 'cells')
initial_window_width (132, 'cells')
scrollback_lines 20000
update_check_interval 0.0
This debug output has been copied to the clipboard
Additional context Happens the same way.
Yeah, it's on my TODO list to add the ~1400 such sequences from https://unicode.org/emoji/charts/emoji-zwj-sequences.html to gen-wcwidth.py, generate some kind of fast lookup data structure for them, and then use them in kitty's wcswidth() and text processing functions. However, it's a fairly large task with very minimal payoff, since that still won't actually make them usable until terminal programs add support for them in their wcswidth() implementations. In fact, the current cursor movement in kitty actually matches what most shells and editors expect; changing that is likely to break more things in the short term.
If you wish to add support for them PRs are welcome.
More general and complete grapheme clustering: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
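As a rough idea of what the first step could look like, here is a sketch of reading the sequences into a set of codepoint tuples. It assumes the semicolon-separated data-file form of the list (codepoints ; type ; description # comment); the sample lines are a simplified, illustrative excerpt, and parse_zwj_sequences is a hypothetical name, not something in gen-wcwidth.py.

```python
# Illustrative excerpt only: simplified lines in the data-file style.
SAMPLE = """\
# codepoints ; type ; description
1F9D1 200D 1F33E ; RGI_Emoji_ZWJ_Sequence ; farmer
1F468 200D 1F4BB ; RGI_Emoji_ZWJ_Sequence ; man technologist
"""

def parse_zwj_sequences(text):
    """Return the set of codepoint tuples listed in the given data text."""
    sequences = set()
    for line in text.splitlines():
        # Drop trailing comments and skip blank or comment-only lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # The first field holds space-separated hexadecimal codepoints.
        first_field = line.split(";")[0]
        sequences.add(tuple(int(cp, 16) for cp in first_field.split()))
    return sequences

seqs = parse_zwj_sequences(SAMPLE)
print(len(seqs))
```

A generator script would then turn this set into whatever lookup structure ends up being used.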
yep, my assessment is similar. i'm looking at trying to implement unicode faithfully in this regard, at which point i'll start seeing which terminals match it. wcswidth() definitely breaks down; i think a big piece of the puzzle here would be a self-contained low-level, fast, minimal-memory EGC-to-width map. if kitty already has a fairly advanced one, how difficult would it be to extract and package up as an .so (by difficult i don't so much mean technical aspects, but more would you be interested in moving that external to kitty, and who would be maintaining it, etc)?
Yes, that's another item for my endless TODO list. It's never been worth the trouble for me to actually do the work in the face of higher priority items, but if you wish to collaborate on it, I am happy to be a co-maintainer.
I would be willing to donate kitty's gen-wcwidth.py to such a project, and to change kitty to depend on the .so produced by such a project. The caveat of course is that gen-wcwidth.py is used for a lot more than just wcswidth(), so the project would either have to become a bit of a grab-bag of functions useful for terminal text processing, or else it would become standalone and we would need to keep kitty's and the project's wcswidth() functions in sync by hand.
Sorry, I don't know the technical details, but I have an issue.
When I type (copy and paste) an emoji with a ZWJ sequence, it does not print as a single emoji.
Is my issue related to this issue? Thanks.
Steps to reproduce:
- go to https://emojipedia.org/technologist/ and copy the emoji to the clipboard
- edit a text file in the terminal using an editor (nvim, in insert mode)
- paste into the terminal from the clipboard (Ctrl+Shift+V)
- the emoji is shown as in the image above
Debug Details
kitty 0.24.2 created by Kovid Goyal
Linux ROG 5.16.10-arch1-1 #1 SMP PREEMPT Wed, 16 Feb 2022 19:35:18 +0000 x86_64
Arch Linux 5.16.10-arch1-1 (/dev/tty)
Running under: X11
Frozen: False
Paths:
kitty: /usr/bin/kitty
base dir: /usr/lib/kitty
extensions dir: /usr/lib/kitty/kitty
system shell: /usr/bin/zsh
Loaded config files:
/home/latipun/.config/kitty/kitty.conf
Config options different from defaults:
background_opacity 0.75
cursor_shape 3
font_size 13.0
scrollback_pager_history_size 19922944
window_padding_width FloatEdges(left=7.0, top=7.0, right=7.0, bottom=7.0)
Changed shortcuts:
ctrl+shift+delete → combine : clear_terminal active : send_text normal \x0c
Colors:
active_border_color #c9cbff
active_tab_background #575268
active_tab_foreground #f5c2e7
background #1e1e2e
bell_border_color #fae3b0
color0 #6e6c7e
color1 #f28fad
color10 #abe9b3
color11 #fae3b0
color12 #96cdfb
color13 #f5c2e7
color14 #89dceb
color15 #d9e0ee
color2 #abe9b3
color3 #fae3b0
color4 #96cdfb
color5 #f5c2e7
color6 #89dceb
color7 #d9e0ee
color8 #988ba2
color9 #f28fad
cursor #f5e0dc
cursor_text_color #1e1e2e
foreground #d9e0ee
inactive_border_color #575268
inactive_tab_background #1e1e2e
inactive_tab_foreground #d9e0ee
mark1_background #96cdfb
mark1_foreground #1e1e2e
mark2_background #f5c2e7
mark2_foreground #1e1e2e
mark3_background #b5e8e0
mark3_foreground #1e1e2e
selection_background #575268
selection_foreground #d9e0ee
tab_bar_background #161320
url_color #f5e0dc
Environment variable names seen by the kitty process:
ALTUSERXSESSION
BROWSER
CARGO_HOME
DBUS_SESSION_BUS_ADDRESS
DESKTOP_SESSION
DISPLAY
EDITOR
ERRFILE
FNM_DIR
FORCE_COLOR
GDMSESSION
GH_TOKEN
GOPATH
GPG_TTY
GTK2_RC_FILES
GTK_MODULES
GTK_RC_FILES
HOME
HOMEBREW_GITHUB_API_TOKEN
LANG
LC_COLLATE
LC_CTYPE
LC_MEASUREMENT
LC_NUMERIC
LC_TELEPHONE
LESS
LOGNAME
MAIL
MANPAGER
MOTD_SHOWN
NNN_ARCHIVE
NNN_COLORS
NNN_FCOLORS
NNN_FIFO
NNN_OPTS
NPM_TOKEN
PATH
PWD
QT_QPA_PLATFORMTHEME
RUSTUP_HOME
SHELL
SHLVL
USER
USERXSESSION
USERXSESSIONRC
VISUAL
VSCE_PAT
XAUTHORITY
XDG_CACHE_HOME
XDG_CONFIG_HOME
XDG_DATA_HOME
XDG_GREETER_DATA_DIR
XDG_RUNTIME_DIR
XDG_SEAT
XDG_SEAT_PATH
XDG_SESSION_CLASS
XDG_SESSION_DESKTOP
XDG_SESSION_ID
XDG_SESSION_PATH
XDG_SESSION_TYPE
XDG_STATE_HOME
XDG_VTNR
This seems related to my issue, so I'm posting here. I just switched to gnome-terminal which has inconsistent rendering of emojis. At first I was pleased to see that kitty renders them properly, but then I noticed that the cursor positioning is problematic. As it stands, this is actually less usable than gnome-terminal.
https://user-images.githubusercontent.com/482367/158066201-131e8cb2-d788-44f9-bfa8-64d39ce5a190.mp4
That will be because whatever terminal program you are running is using a different width calculation from what kitty uses, which is based on the Unicode standard. And unless your emoji is using zwj, it is completely unrelated, so post elsewhere.
@latipun7: No, this is caused by zsh not supporting ZWJ. If you run cat and paste it you'll see that it works fine (except for the cursor problem this issue is about).
@kovidgoyal @dankamongmen Just found this bug after working on a unicode library designed to handle this sort of thing. It winds up being fairly easy, and AFAICT my solution should be decently efficient (binary encoding of a codepoint-based Trie). The data for just emoji sequences (both zwj and otherwise) ends up at ~48kb, with a decent amount of extra data that kitty probably wouldn't need.
If there's interest, I'd be happy to provide suggestions for how it could be implemented in C
On Mon, Jun 13, 2022 at 08:56:55AM -0700, Max Sharnoff wrote:
@kovidgoyal @dankamongmen Just found this bug after working on a unicode library designed to handle this sort of thing. It winds up being fairly easy, and AFAICT my solution should be decently efficient (binary encoding of a codepoint-based Trie). The data for just emoji sequences (both zwj and otherwise) ends up at ~48kb, with a decent amount of extra data that kitty probably wouldn't need.
If there's interest, I'd be happy to provide suggestions for how it could be implemented in C
Sure, I am always happy to discuss design ideas. By "this sort of thing", do you mean wcswidth() in general, or looking up emoji combining sequences in particular? For zwj+emoji support in kitty one basically needs:
1) Adding zwj+emoji support to wcswidth_step(), which basically tells you how the width of a string changes when you add a codepoint to it.
2) Changing kitty's cell data structure to support infinite length codepoint strings. This will likely be an auxiliary hash mapping shorts to heap-allocated codepoint arrays. The shorts will be stored per cell. This will reduce cell memory usage by 4 bytes at the cost of making looking up the text in a cell more expensive (which is fortunately not a frequent operation).
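To make the second item concrete, here is a small sketch of such an auxiliary table, written in Python rather than C. The class name SequenceTable and the 16-bit index limit are assumptions for illustration; kitty's actual cell layout is not shown here.

```python
class SequenceTable:
    """Interns codepoint sequences; a cell would store only the small index."""

    MAX_INDEX = 0xFFFF  # assumption: the index must fit in an unsigned short

    def __init__(self):
        # Index 0 is reserved for "no extra codepoints in this cell".
        self._by_index = [()]
        self._by_seq = {(): 0}

    def intern(self, codepoints):
        """Return the index for this sequence, adding it if it is new."""
        key = tuple(codepoints)
        idx = self._by_seq.get(key)
        if idx is None:
            if len(self._by_index) > self.MAX_INDEX:
                raise OverflowError("sequence table is full")
            idx = len(self._by_index)
            self._by_index.append(key)
            self._by_seq[key] = idx
        return idx

    def lookup(self, idx):
        """Return the codepoint tuple stored under idx."""
        return self._by_index[idx]

table = SequenceTable()
first = table.intern([0x200D, 0x1F33E])
again = table.intern([0x200D, 0x1F33E])
print(first, again, table.lookup(first))
```

Identical sequences share one entry, so the per-cell cost stays fixed no matter how long the sequence is; the price is the extra indirection on lookup.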
do you mean wcswidth() in general or looking up emoji combining sequences in particular
I was looking just at emoji sequences -- given that they're all supposed to have the same width (in practice, font combinations can mess this up IIRC - e.g., defaulting to text presentation when the Emoji spec says otherwise, which is often a width of 1 column).
For adding support:
I was separating into grapheme clusters first, but the operation is mostly the same. If I were implementing this for kitty, I'd use the same sort of trie of codepoints, where each node stores (a) whether the sequence up to that point is one of the zwj sequences, or (b) how wide it would be otherwise. Changing wcswidth_step() gets a little tricky, because maybe it's implicitly expected not to decrease the width. Another tricky spot is handling cases like "this is most of a long zwj sequence, but it's missing the end so it's actually multiple separate emoji now" -- there's a bit of extra work if those need to be separated into cells in post (and possibly re-parsed for new zwj sequence starters - I'm not sure whether any cases of this are possible).
I don't already know anything about kitty's cell data structure, but I'd just add: if you're only using the listed zwj sequences, the longest sequence in the set is currently only 10 codepoints -- it doesn't need to be unbounded. But there are a small enough number of them (currently 1349) that packing them into existing empty space in the datatypes may be possible.
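A rough sketch of the trie idea, in Python for brevity. This is not the library's binary encoding: the trie is a plain nested dict, the two-entry sequence list is a tiny illustrative sample, every listed sequence is assumed to be two columns wide, and all function names are made up.

```python
import unicodedata

# Tiny illustrative sample; a real table would hold every listed sequence.
ZWJ_SEQUENCES = [
    (0x1F9D1, 0x200D, 0x1F33E),                   # farmer
    (0x1F468, 0x200D, 0x1F469, 0x200D, 0x1F467),  # family: man, woman, girl
]

def build_trie(sequences):
    """Build a nested-dict trie; the key "end" marks a complete sequence."""
    root = {}
    for seq in sequences:
        node = root
        for cp in seq:
            node = node.setdefault(cp, {})
        node["end"] = True
    return root

def codepoint_width(cp):
    """Fallback width of a single codepoint outside any known sequence."""
    ch = chr(cp)
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def width(codepoints, trie):
    """Total width, taking the longest known ZWJ sequence at each position."""
    total, i = 0, 0
    while i < len(codepoints):
        node, match_len = trie, 0
        for j in range(i, len(codepoints)):
            node = node.get(codepoints[j])
            if node is None:
                break
            if node.get("end"):
                match_len = j - i + 1
        if match_len:
            total += 2       # assumption: a complete sequence fills two cells
            i += match_len
        else:
            # Incomplete or unknown: count this codepoint alone, then retry
            # from the next position so a later sequence start is still found.
            total += codepoint_width(codepoints[i])
            i += 1
    return total

trie = build_trie(ZWJ_SEQUENCES)
print(width([0x1F9D1, 0x200D, 0x1F33E], trie))
print(width([0x1F468, 0x200D, 0x1F469, 0x200D], trie))
```

The second call is the "most of a long sequence" case: nothing complete matches, so each emoji is counted separately and scanning restarts at the next codepoint.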
On Mon, Jun 13, 2022 at 02:27:19PM -0700, Max Sharnoff wrote:
do you mean wcswidth() in general or looking up emoji combining sequences in particular
I was looking just at emoji sequences -- given that they're all supposed to have the same width (in practice, font combinations can mess this up IIRC - e.g., defaulting to text presentation when the Emoji spec says otherwise, which is often a width of 1 column).
In a terminal context width calculations must be font independent.
For adding support:
I was separating into grapheme clusters first, but the operation is mostly the same. If I were implementing this for kitty, I'd use the same sort of trie of codepoints, where each node stores (a) whether the sequence up to that point is one of the zwj sequences, or (b) how wide it would be otherwise. Changing wcswidth_step() gets a little tricky, because maybe it's implicitly expected not to decrease the width. Another tricky spot is handling cases like "this is most of a long zwj sequence, but it's missing the end so it's actually multiple separate emoji now" -- there's a bit of extra work if those need to be separated into cells in post (and possibly re-parsed for new zwj sequence starters - I'm not sure whether any cases of this are possible).
wcswidth_step() must already be able to reduce width in order to support VS15, which converts emoji to text presentation. So that's not an issue.
I don't already know anything about kitty's cell data structure, but I'd just add: if you're only using the listed zwj sequences, the longest sequence in the set is currently only 10 codepoints -- it doesn't need to be unbounded. But there are a small enough number of them (currently 1349) that packing them into existing empty space in the datatypes may be possible.
There is no empty space in the struct at the moment. However, given the small number of such sequences, one could encode them into the existing codepoint-to-mark mapping (kitty maps all combining Unicode marks to numbers; since there are only about 2000 of them, they fit in a short). One could easily add another 2000 marks to this mapping. However, the question is how future-proof this is, and whether it's not better to just allow infinite length sequences now.
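To illustrate the point about a step function being allowed to return a negative change, here is a toy sketch. It assumes U+FE0E requests text presentation and U+FE0F requests emoji presentation; the two lookup sets are hypothetical miniatures, and width_step is not kitty's wcswidth_step().

```python
TEXT_VS, EMOJI_VS = 0xFE0E, 0xFE0F

# Hypothetical miniature tables; a real implementation would generate these.
EMOJI_PRESENTATION = {0x1F33E, 0x1F9D1}      # drawn wide by default
TEXT_PRESENTATION_EMOJI = {0x2764, 0x263A}   # drawn narrow by default

class WidthState:
    """Remembers the previous base codepoint and the width it was given."""
    def __init__(self):
        self.prev = None
        self.prev_width = 0

def width_step(state, cp):
    """Return the change in total width caused by appending cp."""
    if cp == TEXT_VS:
        if state.prev in EMOJI_PRESENTATION and state.prev_width == 2:
            state.prev_width = 1
            return -1        # the preceding wide emoji becomes narrow
        return 0
    if cp == EMOJI_VS:
        if state.prev in TEXT_PRESENTATION_EMOJI and state.prev_width == 1:
            state.prev_width = 2
            return 1         # the preceding narrow symbol becomes wide
        return 0
    w = 2 if cp in EMOJI_PRESENTATION else 1
    state.prev, state.prev_width = cp, w
    return w

state = WidthState()
steps = [width_step(state, cp) for cp in (0x1F33E, TEXT_VS)]
print(steps, sum(steps))
```

Since callers already have to cope with a negative step here, a ZWJ that merges two emoji into one cell pair could be reported the same way.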
fwiw @sharnoff notcurses would need pretty much the exact same thing as @kovidgoyal mentions for his wcswidth_step(), except it would be in utf8_egc_len(). notcurses already handles arbitrarily long EGCs via spillover to the egcpool structure, of which there is one per ncplane.