What would it take to support UTF16 encoding?

Open kentookura opened this issue 1 year ago • 4 comments

@TOTBWF

As reported by @Trebor-Huang in https://github.com/Trebor-Huang/vscode-forester/issues/11#issuecomment-2585703866, it seems that the VSCode LSP client refuses to work with UTF-8-encoded positions. I was wondering if you could say a few words about the trouble you mention here?

Thanks!

kentookura avatar Jan 12 '25 12:01 kentookura

@kentookura To support UTF-16 efficiently, we need to avoid recalculating the byte offset of each UTF-16 unit whenever a file changes. The options are (1) inefficient recalculation anyway (oops) or (2) some smart data structure that maintains the mapping.

favonia avatar Jan 12 '25 13:01 favonia

For ASCII printable characters, I believe byte offsets and UTF-16 units coincide.

favonia avatar Jan 12 '25 13:01 favonia
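
To make option (1) concrete, here is a minimal sketch of the naive recalculation, assuming positions are converted one line at a time. The function names are hypothetical and are not part of asai. Each call re-encodes the prefix of the line, so the cost is linear in the offset.

```python
def utf8_to_utf16_offset(line: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset within `line` to a UTF-16 code-unit offset.

    Raises UnicodeDecodeError if `byte_offset` falls inside a multi-byte
    character.
    """
    prefix = line.encode("utf-8")[:byte_offset].decode("utf-8")
    # Every UTF-16 code unit is two bytes wide.
    return len(prefix.encode("utf-16-le")) // 2


def utf16_to_utf8_offset(line: str, unit_offset: int) -> int:
    """Convert a UTF-16 code-unit offset within `line` to a UTF-8 byte offset.

    Raises UnicodeDecodeError if `unit_offset` splits a surrogate pair.
    """
    prefix = line.encode("utf-16-le")[: 2 * unit_offset].decode("utf-16-le")
    return len(prefix.encode("utf-8"))


# For printable ASCII the two kinds of offset coincide.
ascii_line = "let x = 1"
same = all(utf8_to_utf16_offset(ascii_line, i) == i for i in range(len(ascii_line) + 1))

# 'a' is 1 byte / 1 unit, 'λ' is 2 bytes / 1 unit, '😀' is 4 bytes / 2 units.
mixed_line = "aλ😀b"
before_b = utf8_to_utf16_offset(mixed_line, 7)
```

With `mixed_line`, the byte offset 7 (just before `b`) corresponds to UTF-16 offset 4, while on `ascii_line` every offset maps to itself.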

Ugh, VSCode being a bad citizen yet again...

As for the data structure, I think some sort of rope segmented at points where UTF-16 and UTF-8 offsets disagree ought to work. Nodes further up the tree could then store both the UTF-8 and UTF-16 range that the subtree covers, so we'd get efficient queries in both directions.

TOTBWF avatar Jan 12 '25 17:01 TOTBWF
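
A rough sketch of that idea, under several assumptions: the text is split into runs of characters with the same UTF-8 width (so the ratio between the two offsets is constant inside a leaf), the tree is built once and never edited or rebalanced, and all names (`Leaf`, `Node`, `build`, and so on) are hypothetical rather than taken from asai. A real rope would also need insertion, deletion, and the text itself.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Union


def utf8_width(ch: str) -> int:
    """Number of UTF-8 bytes used by a single character."""
    return len(ch.encode("utf-8"))


@dataclass(frozen=True)
class Leaf:
    width: int  # UTF-8 bytes per character in this run (1 to 4)
    count: int  # number of characters in the run

    @property
    def units_per_char(self) -> int:
        # Only 4-byte UTF-8 characters need a surrogate pair in UTF-16.
        return 2 if self.width == 4 else 1

    @property
    def utf8_len(self) -> int:
        return self.width * self.count

    @property
    def utf16_len(self) -> int:
        return self.units_per_char * self.count


@dataclass(frozen=True)
class Node:
    left: "Tree"
    right: "Tree"
    utf8_len: int   # total UTF-8 bytes covered by this subtree
    utf16_len: int  # total UTF-16 code units covered by this subtree


Tree = Union[Leaf, Node]


def build(text: str) -> Tree:
    """Build a balanced tree over runs of equal UTF-8 width."""
    leaves = [Leaf(w, sum(1 for _ in run)) for w, run in groupby(text, key=utf8_width)]
    if not leaves:
        return Leaf(1, 0)

    def merge(items: list) -> Tree:
        if len(items) == 1:
            return items[0]
        mid = len(items) // 2
        left, right = merge(items[:mid]), merge(items[mid:])
        return Node(left, right, left.utf8_len + right.utf8_len,
                    left.utf16_len + right.utf16_len)

    return merge(leaves)


def utf16_to_utf8(tree: Tree, units: int) -> int:
    """Map a UTF-16 code-unit offset to a UTF-8 byte offset."""
    if isinstance(tree, Leaf):
        return (units // tree.units_per_char) * tree.width
    if units <= tree.left.utf16_len:
        return utf16_to_utf8(tree.left, units)
    return tree.left.utf8_len + utf16_to_utf8(tree.right, units - tree.left.utf16_len)


def utf8_to_utf16(tree: Tree, nbytes: int) -> int:
    """Map a UTF-8 byte offset to a UTF-16 code-unit offset."""
    if isinstance(tree, Leaf):
        return (nbytes // tree.width) * tree.units_per_char
    if nbytes <= tree.left.utf8_len:
        return utf8_to_utf16(tree.left, nbytes)
    return tree.left.utf16_len + utf8_to_utf16(tree.right, nbytes - tree.left.utf8_len)


tree = build("aλ😀b")
```

Because every node caches both lengths, a query in either direction descends one path from the root and does constant-time arithmetic in the leaf. Here `utf16_to_utf8(tree, 4)` and `utf8_to_utf16(tree, 7)` both locate the position just before `b`.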

I don't necessarily think it's a case of VSCode being a bad citizen. Recent versions of the LSP specification allow encodings other than UTF-16, but they still require that UTF-16 be supported.

liamoc avatar Oct 07 '25 13:10 liamoc
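
For illustration, a server-side negotiation along those lines might look like the sketch below. It assumes the LSP 3.17 field names as I understand them: the client lists its encodings under `general.positionEncodings` in the `initialize` request, and an absent list means UTF-16 only. The function name `choose_position_encoding` is hypothetical and this is not code from asai or Forester.

```python
def choose_position_encoding(client_capabilities: dict) -> str:
    """Pick the position encoding a server would report back to the client.

    Prefers UTF-8 when the client offers it, and otherwise falls back to
    UTF-16, which every client is required to support.
    """
    general = client_capabilities.get("general") or {}
    # Assumption: a missing or empty list means the client only speaks UTF-16.
    offered = general.get("positionEncodings") or ["utf-16"]
    for preferred in ("utf-8", "utf-16"):
        if preferred in offered:
            return preferred
    return "utf-16"


legacy_client = {}
utf16_only_client = {"general": {"positionEncodings": ["utf-16"]}}
flexible_client = {"general": {"positionEncodings": ["utf-8", "utf-16"]}}
```

A server written this way still has to implement UTF-16 positions for clients like `utf16_only_client`, which is the situation this issue describes.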