lsp-mode Incorrect handling of multibyte UTF-16 encodings

Incorrect handling of multibyte UTF-16 encodings

Open Vtec234 opened this issue 3 years ago • 21 comments

Describe the bug It seems the handling of multibyte UTF-16 encodings is incorrect in lsp-mode.

To Reproduce Create a file with just these contents: "🍋" - a single lemon emoji. Place the cursor after the lemon and type a character ("l", say, for lemon). Most emojis including this one are represented by two UTF-16 bytes, so since LSP specifies offsets as in a UTF-16 string representation, this is at column 2.

But lsp-mode sends column 1:

{"jsonrpc":"2.0","method":"textDocument/didChange","params":{"textDocument":{"uri":"file:///home/w/utf16.lean","version":1},"contentChanges":[{"range":{"start":{"line":0,"character":1},"end":{"line":0,"character":1}},"rangeLength":0,"text":"l"}]}}

Expected behavior Compare with e.g. the VSCode sample client, which sends 2 as it should:

{"jsonrpc":"2.0","method":"textDocument/didChange","params":{"textDocument":{"uri":"file:///home/w/utf16.lean","version":56},"contentChanges":[{"range":{"start":{"line":0,"character":2},"end":{"line":0,"character":2}},"rangeLength":0,"text":"l"}]}}

Which Language Server did you use Custom one added via the tutorial. lsp-mode version 7.0.1.

OS Linux

Aug 14 '20 21:08 Vtec234

We are hitting this in rust-analyzer: https://github.com/rust-analyzer/rust-analyzer/issues/4263#issuecomment-762800167.

Out of curiosity, how hard would it be for Emacs to report offsets in ut8 coordinate space (I believe Emacs uses that internally)? I'd be willing to implement a protocol extension to use utf-8 offsets throughout.

I could, but I would be reluctant to implement unicode codepoints coordinates: utt8 skips coordinates transcoding altogether, while unicode would just replace "map utf8 to utf16" with "map utf8 to unicode", which imo isn't meaningfully different.

Jan 19 '21 12:01 matklad

I don't think it will be hard to implement. We havennt fixed that issue due to the fact that it has a relatively low impact and it will affect performance.

Jan 19 '21 12:01 yyoncho

Currently the problem for me is, that this makes rust-analyzer panic :D

Jan 19 '21 16:01 bkchr

So, the priority of the issue now is bigger.

Jan 19 '21 16:01 yyoncho

@yyoncho could you maybe give me some pointer into the code? So, I could try to fix this issue?

Jan 26 '21 13:01 bkchr

@bkchr you have to fix both functions lsp--position-to-point and lsp--point-to-position. Also, AFAIK eglot has that function implemented.

Jan 26 '21 13:01 yyoncho

As a side note, clangd supports UTF-8 directly, if a special option is specified.

Jan 26 '21 13:01 nbfalcon

@nbfalcon thanks, I didn't realize that the extension already exists. I actually would prefer to fix this on the server-side than, to create social pressure to actually officially adopt that into the protocol.

EDIT: https://github.com/rust-analyzer/rust-analyzer/issues/7453

Jan 26 '21 14:01 matklad

Here is the relevant source. @matklad I agree that this would be great, since many servers/editors/libraries assume UTF-8 (well, everything aside from VSCode and JavaScript). There is an upstream bug about this already: https://github.com/microsoft/language-server-protocol/issues/376.

Jan 26 '21 14:01 nbfalcon

@bkchr

@yyoncho could you maybe give me some pointer into the code? So, I could try to fix this issue?

Did you start working on that?

Jan 30 '21 08:01 yyoncho

Jan 30 '21 08:01 bkchr

Ok, I will take a look.

Jan 30 '21 08:01 yyoncho

@bkchr - pushed a proposed fix at - https://github.com/yyoncho/lsp-mode/commit/aac9b6a10611ce9fcb4c9fe272f2835f0b6bb210

Jan 31 '21 08:01 yyoncho

Nice ty

Jan 31 '21 08:01 bkchr

As a side note, clangd supports UTF-8 directly, if a special option is specified.

Now that we have integration tests for clangd-9 on linux, this could be added as a regression test that adds offsetEncoding: ["utf-8"] in clientCapabilities during initialization

Feb 06 '21 00:02 petr-tik

Hi, what's the status of this? Just had rust-analyzer crash due to adding emoji in my code (on lsp-mode from a couple weeks ago).

Oct 08 '21 16:10 itamarst

On rust-analyzer's side, we fully implemented clangd's extension for UTF-8 offsets, so, on Emacs side you an either:

do utf-16 translation, which needs some code and CPU time
expose underlying byte offsets at zero cost (I believe Emacs internally uses something sufficiently close to utf-8 for this to work)

Oct 08 '21 16:10 matklad

@itamarst the fix proposed at https://github.com/yyoncho/lsp-mode/commit/aac9b6a10611ce9fcb4c9fe272f2835f0b6bb210 should work(after rebasing).

Oct 09 '21 07:10 yyoncho

The patch @yyoncho links there seems to work for me with both clangd and rust-analyzer.

Oct 13 '21 20:10 acowley

I have to benchmark it and if it is slow at least allow using utf8 when server supports it

Oct 13 '21 20:10 yyoncho

Hi, I'm running into this issue right now, and I'm wondering what is currently blocking the patch from being merged into master?

Jul 06 '22 03:07 wagk

@yyoncho Sorry for the ping, I've been working on the elixir language server, and just found this issue.

If the reason it hasn't been merged is solely due to performance, then I have a suggestion. You don't need to encode every character that you visit, you merely need to encode non-ascii characters (those that are >= 128) as you find them. Given the fact that the vast, vast majority of source code is ascii text, this shouldn't be a problem perf-wise.

If you haven't merged the fix for some other reason, please ping me, this is holding me up and causes the elixir-ls to produce incorrect results under lsp-mode. Utf8 on the server isn't really a fix, since older clients, windows and the lsp spec require utf-16.

Also, thanks for lsp-mode! It's great.

Nov 16 '22 04:11 scohen

the lsp spec require utf-16.

These days, LSP spec allows utf8 support.

It also formally requires utf16 support, but that seems like a bad design decision in the original protocol, so I personally would be quite happy with server/clients supporting only utf8 and pushing everyone else to do so =P

Nov 16 '22 08:11 matklad

Allowing utf8 and requiring utf16 aren't mutually exclusive 😉 Agreed about utf16 being a bad decision, but that's what the spec says compliant clients need to support, and you should. I don't see a reason to hold off other than performance, and I'll be glad to work with anyone to make the perf impact tiny. Indeed, if you don't reallocate the entire buffer, and only reallocate single multibyte characters when you encounter them, there will likely be no performance impact at all for most documents.
the fix is almost there, we can get it over the line

Nov 16 '22 15:11 scohen

personally would be quite happy with server/clients supporting only utf8 and pushing everyone else to do so

To be quite clear, such servers wouldn't be compliant with the lsp spec. Maybe a future version would remove utf16 support, but for now, servers are stuck handling both.

"This is the default and must always be supported by servers". It's wildly annoying, but we (server contributors) have to support it. Can you please make our job a little easier?

Nov 16 '22 15:11 scohen

@yyoncho can we have a resolution for this? This bug has been open for two years, and you have a fix for it that needs a tiny optimization. I’m willing to help get you there, but I don’t know elisp very well. I’d be willing to help via zoom chat, pop in to irc, submit pseudocode, anything.

Failing that, you should mark this as wontfix, and throw an exception when multibyte characters are encountered in a file when the server is in utf-16 mode, as well as clearly indicating that lsp-mode doesn’t support utf-16 ranges.

The current implementation is broken and causes emacs to produce invalid code. This is not a tenable situation.

Nov 18 '22 23:11 scohen

I just ran into this issue today, it still crashes rust-analyzer.

Can we get at least one of these fixes applied?

correct offsets: https://github.com/yyoncho/lsp-mode/commit/aac9b6a10611ce9fcb4c9fe272f2835f0b6bb210
declaring UTF-8 lsp mode: https://github.com/emacs-lsp/lsp-mode/issues/3344#issuecomment-1320677622

Feb 03 '23 04:02 teor2345

A workaround exists now with https://github.com/emacs-lsp/lsp-mode/pull/3958

For spec compliance, utf-16 offset support still has to be implemented (this issue is not fixed)
If emacs internally uses a utf-8 buffer, utf-8 offset support would have the best performance with language servers that also use utf-8 buffers

Feb 21 '23 10:02 flying-sheep

lsp-mode lsp-mode copied to clipboard

Incorrect handling of multibyte UTF-16 encodings

lsp-mode
lsp-mode copied to clipboard