Better handle multibyte character locations
The locations returned by Prism are all byte locations. However, editors expect different values depending on the selected encoding.
For UTF-8, we need to count characters, so emoji and other 3/4 byte characters have a length of 1. For UTF-16 clients like VS Code, we need to count code units, so 4 byte characters such as emoji count as length 2.
Currently, we're using Prism locations directly everywhere, which is incorrect for documents containing multibyte characters. We need some sort of abstraction that will return the right column positions taking the encoding into account.
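To make the difference concrete, here is how a single multibyte character measures in each of the three coordinate systems (plain Ruby, purely for illustration):

```ruby
s = "ほ" # a 3-byte UTF-8 character inside the Basic Multilingual Plane
s.bytesize                                # => 3 (what byte offsets count)
s.length                                  # => 1 UTF-8 character
s.encode(Encoding::UTF_16LE).bytesize / 2 # => 1 UTF-16 code unit

e = "👍" # a 4-byte UTF-8 character outside the Basic Multilingual Plane
e.bytesize                                # => 4
e.length                                  # => 1 UTF-8 character
e.encode(Encoding::UTF_16LE).bytesize / 2 # => 2 UTF-16 code units (surrogate pair)
```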
Talk to Kevin about whether we can solve this in Prism.
Apparently, Prism locations already expose character offsets and columns. They are start_character_offset and start_character_column instead of start_offset and start_column (with a similar pattern for the end positions).
This might be significantly easier to fix.
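If so, the difference should be directly observable (a quick sketch against the Prism API; the offsets in the comments are what I'd expect for this snippet):

```ruby
require "prism"

result = Prism.parse("ほげ = 1; FOO")
location = result.value.statements.body.last.location # the FOO constant

location.start_offset           # => 12, byte offset (ほ and げ are 3 bytes each)
location.start_character_offset # => 8, character offset
location.start_character_column # => 8, character column within the line
```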
Paired with Kevin and we proposed a new code units API in Prism. That's going to make solving this issue across the entire LSP much easier.
We can make the encoding selected by the editor available everywhere inside the RubyLsp namespace and then always pass that to the code unit column methods, so that we get the positions expected by UTF-16 clients such as VS Code.
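As a rough sketch of that idea (the `RubyLsp.position_encoding` accessor below is hypothetical, just to illustrate threading the negotiated encoding into Prism's code unit methods):

```ruby
require "prism"

# Hypothetical holder for the encoding negotiated with the editor
module RubyLsp
  class << self
    attr_accessor :position_encoding
  end
end

RubyLsp.position_encoding = Encoding::UTF_16LE # e.g. what VS Code expects

# Convert a Prism location into an LSP-style range using code unit columns
def range_from_location(location)
  encoding = RubyLsp.position_encoding

  {
    start: { line: location.start_line - 1, character: location.start_code_units_column(encoding) },
    end: { line: location.end_line - 1, character: location.end_code_units_column(encoding) },
  }
end

node = Prism.parse("ほげ = 1").value.statements.body.first
p range_from_location(node.location)
# => { start: { line: 0, character: 0 }, end: { line: 0, character: 6 } }
```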
This issue is being marked as stale because there was no activity in the last 2 months
Fixed by #2619.
Definition, hover, signature help, completion and workspace symbol should now properly work with multibyte characters thanks to the fixes to the indexer.
We also fixed the locations returned by semantic highlighting in a previous PR.
For now, I'm going to close this issue as complete, but if anyone notices any other location issues for multibyte characters, please report them.
@vinistock Hi, thank you for reviewing and merging the Pull Request the other day.
Unfortunately, it seems that it's still not functioning correctly. I should have added tests covering the actual use cases. For example, the following test case fails when added to definition_expectation_test.rb:
```ruby
def test_multibyte_character_precision
  source = <<~RUBY
    module Fほげ
      module Bar
        class Baz
        end
      end
    end

    Fほげ::Bar::Baz
  RUBY

  with_server(source, stub_no_typechecker: true) do |server, uri|
    # Request the definition of Fほげ in the reference on the last line
    server.process_message(
      id: 1,
      method: "textDocument/definition",
      params: { textDocument: { uri: uri }, position: { line: 7, character: 0 } },
    )

    range = server.pop_response.response[0].attributes[:targetRange].attributes
    range_hash = { start: range[:start].to_hash, end: range[:end].to_hash }

    assert_equal({ start: { line: 0, character: 0 }, end: { line: 5, character: 3 } }, range_hash)
  end
end
```
After a brief investigation, it seems that RubyDocument#locate also needs to be addressed. The previous update accounted for multibyte characters when indexing, but I think RubyDocument#locate still relies on Prism's ParseResult byte offsets (I'm not too sure).
In the example test case above, when the location is (8-0)-(8-17), loc.start_offset becomes 61, but it should actually be 57. This can be resolved by using loc.start_code_units_offset. https://github.com/Shopify/ruby-lsp/blob/0a710db07ad6c518a9b39204f2629457e64c2bad/lib/ruby_lsp/ruby_document.rb#L64
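A rough sketch of what that adjustment could look like (a simplified stand-in for the locate traversal, not the actual implementation; it only illustrates comparing against code unit offsets instead of byte offsets):

```ruby
require "prism"

# Find the innermost node whose location covers the given code unit offset
def locate(node, code_units_offset, encoding: Encoding::UTF_16LE)
  queue = node.child_nodes.compact
  closest = node

  until queue.empty?
    candidate = queue.shift
    loc = candidate.location

    # Compare code unit offsets rather than the byte-based loc.start_offset /
    # loc.end_offset, so multibyte characters don't shift the lookup
    if code_units_offset >= loc.start_code_units_offset(encoding) &&
        code_units_offset < loc.end_code_units_offset(encoding)
      closest = candidate
      queue.concat(candidate.child_nodes.compact)
    end
  end

  closest
end

result = Prism.parse("Fほげ = 1\nFほげ")
# Code unit offset 8 is the F at the start of the second line
p locate(result.value, 8).class # => Prism::ConstantReadNode
```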
@NotFounds great catch, I believe you are correct in your assessment. We need to start using the start_character_offset and end_character_offset APIs for Prism::Location when locating nodes, so that we take the multibyte characters into account.
Would you like to put a PR up for that?
Sure thing! I will create a Pull Request right away!