JuliaSyntax.jl Generalizing SourceFile - adding line/column offsets etc

https://github.com/JuliaDebug/Cthulhu.jl/pull/345 uses JuliaSyntax to represent source code with expression boundaries mapped to character positions within the source text. In that application, CodeTracking is used to extract source-text for specific methods which may be partway into a file.

Question: for "snippets" like these, should the starting line number lineno be added as a new field to SourceFile? Or do you need it to represent "a whole file"? If so, then adding it as a field seems incorrect, since of course a whole file will start at line 1. The other issue is whether functions like source_line should return the offset line or the line index; currently the two are the same. It's for reasons like these that I didn't want to add it without thinking about the consequences.

Feb 10 '23 13:02 timholy

The main purpose of SourceFile is to map between a lines-and-columns based view of the text and byte offsets. So it would probably be fine to have a line offset and allow it to represent part of a file.

Related, we might need SourceFile to represent part of a file to support incremental reparsing with language server. I haven't tried to get that working yet but it might tie in here at some point.

Feb 11 '23 04:02 c42f

Couple of points from the LS side of things:

we sometimes pass code snippets with an associated line/column offset around in the language server. So, at some level we also sometimes have a need for a type that contains code + offset info for lines/columns, but we would need the ability to offset both the line and the column, not just the line.
it seems to me that a type that includes this kind of offset info shouldn't be called SourceFile simply because it isn't representing an entire file. Maybe SourceSnippet might be better? Or like Roslyn SourceText?
we do have a use-case for the line/column to index conversions in the LS, but the current SourceFile wouldn't work for us. We need uris instead of paths, we need additional versioning info in the file and we will (hopefully soon) have entirely different structures for notebooks. So I'm wondering whether a structure like the following would allow more code-reuse:

struct Position
  line::Int
  column::Int
end

struct SourceSnippet
  code::String
  first_position::Position
  line_starts::Vector{Int}
end

# LS would actually not use this
struct SourceFile
  code::SourceSnippet
  filename::Union{Nothing,String}
end

# LS would use this instead
struct TextDocument
  code::SourceSnippet
  uri::URI
  version::Int
end

struct NotebookCell
  code::SourceSnippet
  other fields...
end

I hope this doesn't derail the original issue too much, I guess my main point is that just adding a first line and nothing about column seems weird to me :)

Feb 13 '23 20:02 davidanthoff

Thanks David this is very useful context. I haven't been able to look at LS integration at all yet or what would be required for incremental parsing.

it seems to me that a type that includes this kind of offset info shouldn't be called SourceFile simply because it isn't representing an entire file

Agreed. SourceText seems good to me. It's shorter than SourceSnippet and I already use the term text in many places.

For now I'm happy with merging #191 to get what Tim needs, but we can go with something more complete in the future.

Related to source abstractions, the internal JuliaSyntax.ParseStream doesn't deal natively with SourceFile because it's trying not to require that the code be copied into a String before we can do anything with it (the C code might pass us a plain old buffer, for example). But maybe this is a bit of a pointless optimization and we could streamline the internals too. (Maybe also improve the Core._parse hook API in Base to make extra copying less usual.)

Feb 14 '23 03:02 c42f

we can go with something more complete in the future

Let's keep this issue open to discuss that

Feb 14 '23 03:02 c42f