Fix the handling of multibyte characters
## Motivation
refs: https://github.com/Shopify/ruby-lsp/pull/2021#issuecomment-2110394331
When a Ruby file contains multibyte characters (such as Japanese, Chinese, or emoji), the go to definition and hover features do not work correctly, because the document reference logic does not properly handle multibyte characters when calculating offsets.
This commit fixes the issue by:
- Modifying document referencing to properly handle multibyte characters when mapping between positions and offsets
- Adding test cases to verify that go to definition works with multibyte characters
## Implementation
- Modified `RubyLsp::Document` and `RubyLsp::Requests::Request` to calculate locations considering multibyte characters. This change utilizes the API implemented in Prism by https://github.com/ruby/prism/pull/2406.
- By passing the encoding when creating the `Index`, the index is built taking multibyte characters into account.
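As a minimal illustration of the Prism API in question (a sketch assuming the `prism` gem, which ships with Ruby 3.3+): `Location#start_column` counts bytes, while `Location#start_code_units_column` counts code units in the negotiated encoding.

```ruby
require "prism"

# "テスト" is 3 characters but 9 UTF-8 bytes, so byte columns and
# UTF-16 code-unit columns diverge for anything after it on the line.
number = Prism.parse("テスト = 1").value.statements.body.first.value

number.location.start_column                                 # => 12 (bytes)
number.location.start_code_units_column(Encoding::UTF_16LE)  # => 6 (UTF-16 code units)
```

LSP clients default to UTF-16 positions, which is why the code-unit column is the one the server needs to report.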
## Automated Tests
I added some tests for the definition request and `RubyLsp::RubyDocument`.
## Manual Tests
The definition jump for the following code snippet from https://github.com/shopify/ruby-lsp/issues/1347 works in this branch, but does not function correctly in the main branch.
```ruby
class Test
  TEST = 'test'

  def method1
    # テスト
    pp TEST
  end
end
```
cc @kddnewton in case you have any views on the Prism usage here.
It appears that from be648a6 onwards, it no longer works correctly in VS Code. I'll try to fix it. I would appreciate any advice you can provide.
@vinistock Thanks for your comment!
> I think I'd rather favour consistency of using the code unit APIs everywhere for now and if we find performance issues we can try to address those.
It sounds good to me. I'll reset or revert the commits related to byte offsets (be648a6..ff4d1960). Additionally, I'll look into refactoring the code based on this review comment (https://github.com/Shopify/ruby-lsp/pull/2051#discussion_r1601997894).
This pull request is being marked as stale because there was no activity in the last 2 months
@vinistock I would appreciate it if you could kindly provide your comments on this PR. Should we wait for the implementation of ResourceUri?
@NotFounds Hi! I'm sorry, I got caught up with other priorities and dropped the ball on this one. Let's move forward with this PR before the resource URI changes since this fixes actual issues and the resource URI approach is mostly a correctness refactor.
Can you please re-open this PR (or a new one, whatever is easier) and I'll help push it over the finish line?
Let's take the approach of saving the code units with the new Prism API in the index. So essentially, we need to:
- Configure the index with the encoding that was negotiated between the editor and server. After setting the global state encoding here, you want to grab the `@index.configuration` object and set the encoding there, so that we can check which encoding is being used during indexing
- You then need to pass the encoding (or maybe the entire config object?) to the declaration listener, where we will use the encoding to invoke the Prism location API `location.start_code_units_column(encoding)` to get the proper locations for multibyte characters
- Finally, we should add a few tests to ensure that we don't accidentally regress. One test per entity type should be okay. These would be:
@vinistock Thank you for your response. I'm glad we can move forward with the implementation of this feature! Since the changes seem significant, I'll create a new Pull Request.
I have a few questions:
- In the declaration listener, should I retrieve the encoding using something like `@index.configuration.encoding`? I planned to include the encoding in the `Configuration` in step 1.
- Is considering multibyte code units during indexing primarily for performance reasons? (Just curious)
> In the declaration listener, should I retrieve the encoding using something like `@index.configuration.encoding`? I planned to include the encoding in the `Configuration` in step 1.
Yes, that's correct. Let's pass the entire configuration object to the declaration listener. That way, if more configurations are added, it will already be able to access all of them.
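The flow being agreed on here could be sketched roughly as follows. This is an illustrative sketch only: `IndexConfiguration`, `DeclarationListener`, and `column_for` are hypothetical stand-ins, not the actual ruby-lsp API.

```ruby
require "prism"

# Hypothetical stand-in for the index configuration object described above.
IndexConfiguration = Struct.new(:encoding)

class DeclarationListener
  def initialize(configuration)
    # Receive the whole configuration object so that any future
    # settings are also available to the listener.
    @configuration = configuration
  end

  # Compute a column the way the negotiated encoding expects,
  # using Prism's code-units location API.
  def column_for(location)
    location.start_code_units_column(@configuration.encoding)
  end
end

config = IndexConfiguration.new(Encoding::UTF_16LE) # negotiated with the editor
listener = DeclarationListener.new(config)

value = Prism.parse("テスト = 1").value.statements.body.first.value
listener.column_for(value.location) # => 6
```

Passing the whole configuration object (rather than just the encoding) matches the suggestion above: the listener then automatically sees any settings added later.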
> Is considering multibyte code units during indexing primarily for performance reasons? (Just curious)
It's unfortunately a trade-off between performance and correctness. If you have even a single multibyte character in a document, the entire source must be treated as multibyte, and computing the locations is significantly more expensive. If you don't have any multibyte characters, then Prism uses an ASCII-only optimization which is much, much faster.
However, from a correctness standpoint, we need to be able to index declarations made using multibyte characters, especially for languages that need these characters like Japanese.
Also, having the correct code unit locations means that all features that depend on index entries will just work (hover, definition, completion, signature help, workspace symbol).
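To make the trade-off concrete, here is a small comparison (a sketch assuming the `prism` gem): for ASCII-only sources the byte and code-unit columns agree, so the fast path loses nothing, while a single astral-plane character such as an emoji (4 UTF-8 bytes, but 2 UTF-16 code units as a surrogate pair) makes them diverge.

```ruby
require "prism"

# In an ASCII-only line, byte and UTF-16 code-unit columns agree,
# which is why an ASCII-only fast path is safe there.
ascii = Prism.parse("FOO = 1").value.statements.body.first.value.location
ascii.start_column                                  # => 6
ascii.start_code_units_column(Encoding::UTF_16LE)   # => 6

# "😀" is 4 UTF-8 bytes but only 2 UTF-16 code units (a surrogate
# pair), so the columns of the trailing FOO constant diverge.
emoji = Prism.parse('x = "😀"; FOO').value.statements.body.last.location
emoji.start_column                                  # => 12
emoji.start_code_units_column(Encoding::UTF_16LE)   # => 10
```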