Better handle multibyte character locations
The locations returned by Prism are all byte locations. However, editors expect different values depending on the selected encoding.
For UTF-8, we need to count characters, so emoji and other 3/4 byte characters have a length of 1. For UTF-16 clients like VS Code, we need to count code units, so 4 byte characters such as emoji count as length 2.
Currently, we're using Prism locations directly everywhere, which is incorrect for documents containing multibyte characters. We need some sort of abstraction that will return the right column positions taking the encoding into account.
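To make the difference concrete, here is how a single multibyte character measures in each of the three coordinate systems (plain Ruby, purely for illustration):

```ruby
s = "ほ" # a 3-byte UTF-8 character inside the Basic Multilingual Plane
s.bytesize                                # => 3 (what byte offsets count)
s.length                                  # => 1 UTF-8 character
s.encode(Encoding::UTF_16LE).bytesize / 2 # => 1 UTF-16 code unit

e = "👍" # a 4-byte UTF-8 character outside the Basic Multilingual Plane
e.bytesize                                # => 4
e.length                                  # => 1 UTF-8 character
e.encode(Encoding::UTF_16LE).bytesize / 2 # => 2 UTF-16 code units (surrogate pair)
```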
Talk to Kevin about whether we can solve this in Prism.
Apparently, Prism locations already expose character offsets and columns. They are start_character_offset and start_character_column instead of start_offset and start_column (with a similar pattern for the end positions).
This might be significantly easier to fix.
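If so, the difference should be directly observable (a quick sketch against the Prism API; the offsets in the comments are what I'd expect for this snippet):

```ruby
require "prism"

result = Prism.parse("ほげ = 1; FOO")
location = result.value.statements.body.last.location # the FOO constant

location.start_offset           # => 12, byte offset (ほ and げ are 3 bytes each)
location.start_character_offset # => 8, character offset
location.start_character_column # => 8, character column within the line
```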
Paired with Kevin and we proposed a new code units API in Prism. That's going to make solving this issue across the entire LSP much easier.
We can make the encoding selected by the editor available everywhere inside the RubyLsp namespace and then always pass that to the code unit column methods, so that we get the positions expected by UTF-16 clients such as VS Code.
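As a rough sketch of that idea (the `RubyLsp.position_encoding` accessor below is hypothetical, just to illustrate threading the negotiated encoding into Prism's code unit methods):

```ruby
require "prism"

# Hypothetical holder for the encoding negotiated with the editor
module RubyLsp
  class << self
    attr_accessor :position_encoding
  end
end

RubyLsp.position_encoding = Encoding::UTF_16LE # e.g. what VS Code expects

# Convert a Prism location into an LSP-style range using code unit columns
def range_from_location(location)
  encoding = RubyLsp.position_encoding

  {
    start: { line: location.start_line - 1, character: location.start_code_units_column(encoding) },
    end: { line: location.end_line - 1, character: location.end_code_units_column(encoding) },
  }
end

node = Prism.parse("ほげ = 1").value.statements.body.first
p range_from_location(node.location)
# => { start: { line: 0, character: 0 }, end: { line: 0, character: 6 } }
```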
This issue is being marked as stale because there was no activity in the last 2 months
Fixed by #2619.
Definition, hover, signature help, completion and workspace symbol should now properly work with multibyte characters thanks to the fixes to the indexer.
We also fixed the locations returned by semantic highlighting in a previous PR.
For now, I'm going to close this issue as complete, but if anyone notices any other location issues for multibyte characters, please report them.
@vinistock Hi, thank you for reviewing and merging the Pull Request the other day.
Unfortunately, it seems that it's still not functioning correctly. I should have added tests covering the actual use cases. For example, the following test case fails when added to definition_expectation_test.rb:
```ruby
def test_multibyte_character_precision
  source = <<~RUBY
    module Fほげ
      module Bar
        class Baz
        end
      end
    end

    Fほげ::Bar::Baz
  RUBY

  with_server(source, stub_no_typechecker: true) do |server, uri|
    # Request the definition of Fほげ in the reference on the last line
    server.process_message(
      id: 1,
      method: "textDocument/definition",
      params: { textDocument: { uri: uri }, position: { line: 7, character: 0 } },
    )

    range = server.pop_response.response[0].attributes[:targetRange].attributes
    range_hash = { start: range[:start].to_hash, end: range[:end].to_hash }

    assert_equal({ start: { line: 0, character: 0 }, end: { line: 5, character: 3 } }, range_hash)
  end
end
```
After a brief investigation, it seems that RubyDocument#locate also needs to be addressed. The previous update accounted for multibyte characters when indexing, but I think RubyDocument#locate still relies on Prism's ParseResult byte offsets (I'm not too sure).
In the example test case above, when the location is (8-0)-(8-17), loc.start_offset becomes 61, but it should actually be 57. This can be resolved by using loc.start_code_units_offset. https://github.com/Shopify/ruby-lsp/blob/0a710db07ad6c518a9b39204f2629457e64c2bad/lib/ruby_lsp/ruby_document.rb#L64
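A rough sketch of what that adjustment could look like (a simplified stand-in for the locate traversal, not the actual implementation; it only illustrates comparing against code unit offsets instead of byte offsets):

```ruby
require "prism"

# Find the innermost node whose location covers the given code unit offset
def locate(node, code_units_offset, encoding: Encoding::UTF_16LE)
  queue = node.child_nodes.compact
  closest = node

  until queue.empty?
    candidate = queue.shift
    loc = candidate.location

    # Compare code unit offsets rather than the byte-based loc.start_offset /
    # loc.end_offset, so multibyte characters don't shift the lookup
    if code_units_offset >= loc.start_code_units_offset(encoding) &&
        code_units_offset < loc.end_code_units_offset(encoding)
      closest = candidate
      queue.concat(candidate.child_nodes.compact)
    end
  end

  closest
end

result = Prism.parse("Fほげ = 1\nFほげ")
# Code unit offset 8 is the F at the start of the second line
p locate(result.value, 8).class # => Prism::ConstantReadNode
```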
@NotFounds great catch, I believe you are correct in your assessment. We need to start using the start_character_offset and end_character_offset APIs for Prism::Location when locating nodes, so that we take the multibyte characters into account.
Would you like to put a PR up for that?
Sure thing! I will create a Pull Request right away!