Support SourceSections with code point based positions

Open fcurts opened this issue 6 years ago • 5 comments

Parsers (e.g., ANTLR 4.7+) and editors increasingly measure source positions (start index, end index, column, length) in Unicode code points rather than Java chars. It is important for Truffle to accommodate this trend and support code point based positions for creating and querying SourceSections. The implementation of these operations (including SourceSection.getCharacters()) needs to be efficient; in particular, iterating over (parts of) the Source on each invocation is not an option.

Instead of having separate APIs for char and code point based positions, another option might be to choose between them when building a Source and have generic SourceSection.getIndex() etc. methods that honor this choice. This is also how ANTLR works; for example, Token.getStartIndex() returns a char or code point based index depending on which ANTLR input stream is used. However, the char based ANTLRInputStream has been deprecated, leaving no choice but to eventually migrate to CodePointCharStream. Another parser we use only supports code points to begin with.
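To make the mismatch concrete, here is a minimal sketch that uses only standard `java.lang.String` methods. The class and helper names are hypothetical and not part of Truffle or ANTLR. Both conversions scan the string linearly, which is exactly the per-invocation cost that would have to be avoided.

```java
final class CodePointPositions {
    // Converts a code point based index into the equivalent char (UTF-16 code unit) index.
    // Linear in codePointIndex: the JDK walks the string from offset 0.
    static int toCharIndex(String text, int codePointIndex) {
        return text.offsetByCodePoints(0, codePointIndex);
    }

    // Converts a char based index into the equivalent code point index.
    // Linear in charIndex for the same reason.
    static int toCodePointIndex(String text, int charIndex) {
        return text.codePointCount(0, charIndex);
    }

    public static void main(String[] args) {
        // U+1D465 (MATHEMATICAL ITALIC SMALL X) occupies two chars in UTF-16.
        String source = "\uD835\uDC65 = 1";

        System.out.println("chars:       " + source.length());
        System.out.println("code points: " + source.codePointCount(0, source.length()));

        // A parser that counts code points reports '=' at index 2,
        // while a char based API expects index 3 for the same character.
        System.out.println("char index of code point 2: " + toCharIndex(source, 2));
        System.out.println("code point index of char 3: " + toCodePointIndex(source, 3));
    }
}
```

Passing the parser's code point index straight to a char based API would therefore select the wrong characters for any source that contains a character outside the Basic Multilingual Plane.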

fcurts avatar Jan 10 '19 03:01 fcurts

I think this is a good improvement. However, we likely cannot reuse the existing API, as that would break many existing uses of it. We probably need to track both char indices and code point indices at the same time in order to keep both kinds of access efficient.

chumer avatar Jan 10 '19 15:01 chumer

You could re-purpose the existing API to be the generic one, returning char or code point based source positions depending on how the underlying Source is configured. As long as the documentation makes this clear, it seems feasible to argue that "char" (as in getCharIndex(), etc.) can either mean Java or Unicode char (i.e., code point) depending on context.

Alternatively, you could deprecate the existing API and only support it for char based Sources, whereas the new generic API would work for both char and code point based Sources.

The main concern I have with supporting both types of source positions for the same Source is that it may not be possible to do this in a way that's both time and space efficient. Perhaps a lazy approach would work.
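One way such a lazy approach could look is sketched below. This is only an illustration under my own assumptions, not a proposal for the actual Truffle implementation: the class `LazyCodePointIndex`, its methods, and the stride of 64 are all hypothetical. The table is built on first use, stores the char offset of every 64th code point, and is skipped entirely for sources without surrogate pairs.

```java
final class LazyCodePointIndex {
    // Code points between two checkpoints; an arbitrary choice for this sketch.
    private static final int STRIDE = 64;

    private final String text;
    private int[] checkpoints; // char offset of every STRIDE-th code point; null until first use
    private boolean bmpOnly;   // true if the text has no surrogate pairs, so both indices coincide

    LazyCodePointIndex(String text) {
        this.text = text;
    }

    // Builds the checkpoint table with a single pass over the text, on first use only.
    // Not thread-safe; a real implementation would need safe publication.
    private void ensureBuilt() {
        if (checkpoints != null) {
            return;
        }
        int codePoints = text.codePointCount(0, text.length());
        bmpOnly = codePoints == text.length();
        if (bmpOnly) {
            checkpoints = new int[0]; // no table needed
            return;
        }
        int[] table = new int[codePoints / STRIDE + 1];
        int charOffset = 0;
        for (int cp = 0; ; cp++) {
            if (cp % STRIDE == 0) {
                table[cp / STRIDE] = charOffset;
            }
            if (charOffset >= text.length()) {
                break; // the end position is recorded too, so exclusive end indices work
            }
            charOffset += Character.charCount(text.codePointAt(charOffset));
        }
        checkpoints = table;
    }

    // Maps a code point index to a char index, scanning fewer than STRIDE code points.
    int toCharIndex(int codePointIndex) {
        ensureBuilt();
        if (bmpOnly) {
            return codePointIndex;
        }
        int start = checkpoints[codePointIndex / STRIDE];
        return text.offsetByCodePoints(start, codePointIndex % STRIDE);
    }

    // Returns the characters of a section given in code point units.
    CharSequence characters(int codePointStart, int codePointLength) {
        int from = toCharIndex(codePointStart);
        int to = toCharIndex(codePointStart + codePointLength);
        return text.subSequence(from, to);
    }
}
```

The extra space is one int per 64 code points, and nothing at all for sources that stay within the Basic Multilingual Plane. Each lookup scans a bounded number of code points rather than the whole Source. A smaller stride trades more memory for shorter scans.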

fcurts avatar Jan 10 '19 19:01 fcurts

Yes, we would need to deprecate the current API in order to solve this with a generic API. We should consider that.

chumer avatar Jan 11 '19 12:01 chumer

Tracking internally as Issue GR-20799.

boris-spas avatar Jan 22 '20 13:01 boris-spas

Still hoping to see this fixed so that we can stop using the deprecated ANTLRInputStream class and can correctly handle Unicode characters in source files.

As far as I know, several of your own Truffle languages also use ANTLR. Don't they have the same problem?

fcurts avatar Dec 13 '21 08:12 fcurts