Parser Add option to use utf-16 positions instead of utf-8

We use this parser for a custom linter that we ship through a VSCode extension (rustpython_parser::ast::Suite::parse_without_path). The VSCode extension api expects byte positions to the text encoded in utf-16, but the parser returns the byte positions to text encoded in utf-8. Can we add an option to use utf-16 encoded byte positions?

I'd be happy to implement this feature if you're comfortable with adding it to the parser. Do you have suggestions on how the API should look like?

Dec 09 '24 13:12 nilskch

@youknowone, what do you think?

Dec 10 '24 08:12 nilskch

I'd like to understand the problem better. Is utf-16 position calculation a performance bottleneck of the project? If not, you can simply get the line number and transform utf-8 position to utf-16 position using the source code and library like https://docs.rs/utf16string/latest/utf16string/ Transforming source code to utf-16 is not avoidable for this case. Then either way transforming utf-8 to utf-16 is not avoidable and will cost cpu power. Then doing it outside of parser can be simple, as long as it fits your requirements.

Dec 11 '24 08:12 youknowone

The linter that we built parses the code on every key stroke, so performance is quite important to us. The utf-16 conversion is not the most critical performance bottleneck, but it would still be a nice win for us to get rid of it. We currently iterate over the python code once and create a lookup table that maps utf-8 positions to utf-16 positions (without libraries). It would be nice for us if the AST would provide the utf-16 positions directly. I thought we could maybe parameterize the text_len method to return utf-8 or utf-16 positions:

https://github.com/RustPython/Parser/blob/5e9d9853e751240b28138efd165e793a7ee5dad4/vendored/src/text_size/traits.rs#L31-L37

That way we would not need to create that lookup table in the linter.

Dec 11 '24 10:12 nilskch

I just saw this comment https://github.com/RustPython/Parser/issues/125#issuecomment-2534782891.

I thought this is the parser that ruff uses? The README is misleading if this is not true. Would you recommend us to also move to the Ruff parser?

Dec 11 '24 10:12 nilskch

Rust ecosystem usually don't parametrize len() itself for utf16. Rather than that, adding text_len_utf16 method is making sense. Though still I have no idea how it can improve performance. Since the parser internal is utf8, there must transformation somewhere. If it can be cached inside of the parser, it still can be cached outside of the parser.

That would be better to be changed. I forgot the version number, but Ruff is how using its own parser. If performance is critical, that has better performance by their benchmark.

Dec 11 '24 11:12 youknowone