UTF-16 column offsets
While working on https://github.com/sublimelsp/LSP/pull/932, I learned that pyls does not handle UTF-16. It assumes that (row, col) pairs are in terms of UTF-32 points.
For the row it doesn't matter whether we agree on UTF-8, UTF-16 or UTF-32. For the col it does matter. Consider this line in a python file:
x="😋"
The closing " has a column offset of 5, not 4:
x = " 😋 "
0 1 2 3 45 6 (UTF-16)
If we add some characters to the string, say:
x="😋x"
then an incremental text sync notification will notify pyls that an x has been added at offset 5. This greatly confuses and breaks pyls.
The relevant code for handling incremental text sync looks to be here: https://github.com/palantir/python-language-server/blob/21833eafd4e254a79d35ca539967f3a010e8c2a1/pyls/workspace.py#L150-L191
Looks like
new.write(line[:start_col])
and
new.write(line[end_col:])
need more attention for UTF-16 to work.
Thanks for noticing this problem. You are welcome to submit a pull request to fix this.
I've checked out this repository, but there are more problems on the horizon. Everywhere where pyls takes a position or a range in the request method, we should do the UTF-16 conversion.
From what I understand all "hooks" expect the position and range to be in UTF-32. So there needs to be a translation layer somewhere. I don't have enough of an overview of this codebase to make that decision.
I might be able to hack in a fix for the incremental text sync, but larger changes are needed.
For reference, clangd handles it like this:
https://github.com/llvm/llvm-project/blob/b632bd88a633c84eb2ce8f999119bc4e6c1ee98c/clang-tools-extra/clangd/SourceCode.cpp#L51-L82
By the way, if all source code is ASCII (most common case), then the problem won't show itself.
Correction: if the source code contains characters from the basic multilingual plane, then UTF-16 can encode that in one code point, so those characters will work by accident too.