Generalizing SourceFile - adding line/column offsets etc
https://github.com/JuliaDebug/Cthulhu.jl/pull/345 uses JuliaSyntax to represent source code with expression boundaries mapped to character positions within the source text. In that application, CodeTracking is used to extract the source text for specific methods, which may start partway into a file.
Question: for "snippets" like these, should the starting line number lineno be added as a new field to SourceFile? Or do you need it to represent "a whole file"? If so, then adding it as a field seems incorrect, since of course a whole file will start at line 1. The other issue is whether functions like source_line should return the offset line or the line index; currently the two are the same. It's for reasons like these that I didn't want to add it without thinking about the consequences.
The main purpose of SourceFile is to map between a lines-and-columns based view of the text and byte offsets. So it would probably be fine to have a line offset and allow it to represent part of a file.
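To make the "line offset" idea concrete, here is a minimal sketch of a line-start table with a starting line number. The names `SnippetLines` and `byte_to_line_col` are hypothetical and are not part of JuliaSyntax; columns are counted in bytes, which is an assumption of this sketch.

```julia
# Hypothetical illustration, not the JuliaSyntax API.
struct SnippetLines
    text::String
    first_line::Int           # line number of the snippet's first line in its file
    line_starts::Vector{Int}  # byte index at which each snippet line begins
end

function SnippetLines(text::AbstractString; first_line::Integer=1)
    starts = [1]
    for i in eachindex(text)
        # '\n' is a single byte, so the next line begins at i + 1
        if text[i] == '\n'
            push!(starts, i + 1)
        end
    end
    return SnippetLines(String(text), first_line, starts)
end

# Map a valid byte index in the snippet to (file line, byte column).
function byte_to_line_col(s::SnippetLines, byte::Integer)
    idx = searchsortedlast(s.line_starts, byte)  # snippet-relative line index
    line = idx + s.first_line - 1                # offset line within the file
    col = byte - s.line_starts[idx] + 1          # 1-based column, in bytes
    return (line, col)
end

s = SnippetLines("function f(x)\n    x + 1\nend\n"; first_line=10)
byte_to_line_col(s, 19)  # the `x` on the snippet's second line
```

The distinction between `idx` (the line index within the snippet) and `line` (the offset line in the file) is exactly the one raised above for functions like source_line: with `first_line == 1` the two coincide, otherwise a choice has to be made about which one to return.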
Relatedly, we might need SourceFile to represent part of a file to support incremental reparsing with a language server. I haven't tried to get that working yet, but it might tie in here at some point.
Couple of points from the LS side of things:
- we sometimes pass code snippets with an associated line/column offset around in the language server. So, at some level we also sometimes have a need for a type that contains code + offset info for lines/columns, but we would need the ability to offset both the line and the column, not just the line.
- it seems to me that a type that includes this kind of offset info shouldn't be called SourceFile simply because it isn't representing an entire file. Maybe SourceSnippet might be better? Or like Roslyn SourceText?
- we do have a use-case for the line/column to index conversions in the LS, but the current SourceFile wouldn't work for us. We need uris instead of paths, we need additional versioning info in the file and we will (hopefully soon) have entirely different structures for notebooks. So I'm wondering whether a structure like the following would allow more code-reuse:
struct Position
    line::Int
    column::Int
end

struct SourceSnippet
    code::String
    first_position::Position
    line_starts::Vector{Int}
end

# LS would actually not use this
struct SourceFile
    code::SourceSnippet
    filename::Union{Nothing,String}
end

# LS would use this instead
struct TextDocument
    code::SourceSnippet
    uri::URI
    version::Int
end

struct NotebookCell
    code::SourceSnippet
    # other fields...
end
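As a rough sketch of how the line/column to index conversion could work with such a structure, the following converts a document-level `Position` into a byte index within the snippet. `Position` and `SourceSnippet` follow the proposal above; the convenience constructor and the `byte_index` function are hypothetical names added for illustration, and columns are assumed to be counted in bytes (a language server may use a different column encoding).

```julia
struct Position
    line::Int
    column::Int
end

struct SourceSnippet
    code::String
    first_position::Position   # document position of the snippet's first byte
    line_starts::Vector{Int}   # byte index at which each snippet line begins
end

# Hypothetical convenience constructor that computes `line_starts`.
function SourceSnippet(code::AbstractString, first_position::Position=Position(1, 1))
    starts = [1]
    for i in eachindex(code)
        if code[i] == '\n'
            push!(starts, i + 1)
        end
    end
    return SourceSnippet(String(code), first_position, starts)
end

# Hypothetical helper: document position -> byte index into `s.code`.
function byte_index(s::SourceSnippet, pos::Position)
    idx = pos.line - s.first_position.line + 1   # snippet-relative line index
    1 <= idx <= length(s.line_starts) || throw(BoundsError(s.line_starts, idx))
    col = pos.column
    if idx == 1
        # Only the snippet's first line is shifted by the column offset;
        # later lines start at column 1 of the document.
        col -= s.first_position.column - 1
    end
    return s.line_starts[idx] + col - 1
end

snippet = SourceSnippet("x + 1\ny", Position(3, 9))
byte_index(snippet, Position(4, 1))  # index of `y` in snippet.code
```

Note that the column offset only matters on the first line of the snippet, which is why a line offset and a column offset behave differently even though they are stored together in `first_position`.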
I hope this doesn't derail the original issue too much, I guess my main point is that just adding a first line and nothing about column seems weird to me :)
Thanks David, this is very useful context. I haven't been able to look at LS integration at all yet, or at what would be required for incremental parsing.
> it seems to me that a type that includes this kind of offset info shouldn't be called SourceFile simply because it isn't representing an entire file
Agreed. SourceText seems good to me. It's shorter than SourceSnippet and I already use the term text in many places.
For now I'm happy with merging #191 to get what Tim needs, but we can go with something more complete in the future.
Related to source abstractions, the internal JuliaSyntax.ParseStream doesn't deal natively with SourceFile because it's trying not to require that the code be copied into a String before we can do anything with it (the C code might pass us a plain old buffer, for example). But maybe this is a bit of a pointless optimization and we could streamline the internals too. (Maybe also improve the Core._parse hook API in Base to make extra copying less usual.)
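To illustrate the "no copy into a String" point, here is a small sketch of a scanner written against `AbstractVector{UInt8}` rather than `String`. The function name `scan_line_starts` is hypothetical and this is not how ParseStream is implemented; it only shows that the same code can run over a `String`'s code units or over a plain byte buffer.

```julia
# Hypothetical helper: find line starts in any byte buffer.
function scan_line_starts(buf::AbstractVector{UInt8})
    starts = Int[firstindex(buf)]
    for i in eachindex(buf)
        if buf[i] == UInt8('\n')
            push!(starts, i + 1)
        end
    end
    return starts
end

# `codeunits` wraps the String's bytes without copying them.
scan_line_starts(codeunits("a\nbc\n"))

# A plain byte vector (for example one received from other code) works too.
scan_line_starts(UInt8[0x61, 0x0a, 0x62])
```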
> we can go with something more complete in the future
Let's keep this issue open to discuss that