When will regex assertions like `$` be supported?
The language I am implementing really needs this functionality, as I am using it for file/script endings.
/\s*$/
I'd like to support them, but I'm not sure exactly when it will happen. The current workaround is to use an external scanner to match that token. For your use case, do you want `$` to match the end of a line or the end of the file?
I use it to match the end of the file.
@maxbrunsfeld I guess I have to use an external scanner anyway, as my language uses an indented syntax similar to Python. Is there some documentation on how to hook up the `scanner.cc` file?
I guess I have to use an external scanner anyway as my language uses an indented syntax similar to python.
Yeah, you're right. Unfortunately, I haven't documented the external scanner API yet 😞 . The best thing to do right now is to base it on existing ones like tree-sitter-python.
The basic idea is:
- You have an `externals` array on your grammar, like this.
- In your external scanner, you have an `enum` whose elements match that `externals` array, like this.
- You define the 5 external scanner hooks as plain C functions, like this.
The behaviors of the 5 hooks are:
- `create` - if you need to carry state in your external scanner, allocate your state object here and return a pointer to it.
- `destroy` - if you allocated state in `create`, release the memory here.
- `scan` - this function receives two parameters: a `lexer`, and an array of booleans called `valid_lookaheads`, which tells you which tokens are currently expected in the grammar. You must return a `bool` indicating whether or not you found a token. If you do find a token, you have to assign to the `result_symbol` field on the lexer. To advance to the next character, use the `lexer->advance` function.
- `serialize` - you must be able to represent the scanner's state in a reasonably small buffer of bytes. This function writes to a buffer of bytes, and returns the number of bytes written.
- `deserialize` - this function receives the previously-serialized bytes and must restore the scanner's state. `deserialize` also gets called with an empty buffer in order to clear the scanner's state, resetting it back to the initial state.
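As a rough illustration of the four state-related hooks (`create`, `destroy`, `serialize`, `deserialize`), here is a minimal sketch for an indentation-based language. The `Scanner` struct, the `scanner_*` names, the `scanner_push` helper, and the 64-level limit are all assumptions made for this example; in a real scanner the hooks carry a grammar-specific prefix, so check an existing scanner such as tree-sitter-python for the exact names. The sketch also assumes the caller's buffer is at least 128 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical scanner state: a stack of indentation widths. */
enum { MAX_DEPTH = 64 };

typedef struct {
  uint8_t depth;               /* number of stack entries in use */
  uint16_t indents[MAX_DEPTH]; /* indentation width of each open block */
} Scanner;

/* create: allocate the (zeroed) state object and return a pointer to it. */
void *scanner_create(void) {
  return calloc(1, sizeof(Scanner));
}

/* destroy: release the memory allocated in create. */
void scanner_destroy(void *payload) {
  free(payload);
}

/* Helper for this example only (not one of the hooks): push a level. */
void scanner_push(Scanner *scanner, uint16_t width) {
  assert(scanner->depth < MAX_DEPTH);
  scanner->indents[scanner->depth++] = width;
}

/* serialize: copy the used part of the stack into the buffer and return
 * the number of bytes written. Assumes the buffer holds at least
 * MAX_DEPTH * sizeof(uint16_t) bytes. */
unsigned scanner_serialize(void *payload, char *buffer) {
  Scanner *scanner = (Scanner *)payload;
  unsigned size = (unsigned)(scanner->depth * sizeof(uint16_t));
  memcpy(buffer, scanner->indents, size);
  return size;
}

/* deserialize: restore the state from previously serialized bytes.
 * An empty buffer resets the scanner to its initial state. */
void scanner_deserialize(void *payload, const char *buffer, unsigned length) {
  Scanner *scanner = (Scanner *)payload;
  scanner->depth = 0;
  if (length == 0) return;
  assert(length % sizeof(uint16_t) == 0 && length <= sizeof(scanner->indents));
  memcpy(scanner->indents, buffer, length);
  scanner->depth = (uint8_t)(length / sizeof(uint16_t));
}
```

Only the used part of the stack is written out, which keeps the serialized form small; a round trip through `scanner_serialize` and `scanner_deserialize` should reproduce the same stack.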
You also need to add the scanner file to the list of source files in `binding.gyp`.
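For the end-of-file case that started this thread, the `scan` hook might look roughly like the sketch below. To keep it self-contained, it uses a stand-in `Lexer` struct rather than tree-sitter's real lexer type; it models only `result_symbol` and `advance` from the description above, plus a `lookahead` character that is assumed to be `0` at the end of input. The `NEWLINE` and `END_OF_FILE` tokens, `scanner_scan`, and `lexer_from_string` are hypothetical names for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Must list the tokens in the same order as the grammar's `externals`
 * array. Both tokens here are hypothetical. */
enum TokenType { NEWLINE, END_OF_FILE };

/* Stand-in for the lexer, NOT tree-sitter's real type. */
typedef struct Lexer Lexer;
struct Lexer {
  int32_t lookahead;        /* current character; 0 assumed to mean end of input */
  uint16_t result_symbol;   /* set by scan when a token is found */
  void (*advance)(Lexer *); /* move to the next character */
  const char *input;        /* stand-in only: the remaining input */
};

static void lexer_advance(Lexer *lexer) {
  if (*lexer->input != '\0') lexer->input++;
  lexer->lookahead = (unsigned char)*lexer->input;
}

/* Build a stand-in lexer over a NUL-terminated string. */
Lexer lexer_from_string(const char *text) {
  assert(text != NULL);
  Lexer lexer = { (unsigned char)text[0], 0, lexer_advance, text };
  return lexer;
}

static bool is_space(int32_t c) {
  return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* scan: emulate /\s*$/ with `$` meaning end of file. Returns true and
 * sets result_symbol only if nothing but whitespace remains. */
bool scanner_scan(Lexer *lexer, const bool *valid_lookaheads) {
  /* Only try when the grammar currently expects this token. */
  if (!valid_lookaheads[END_OF_FILE]) return false;
  while (is_space(lexer->lookahead)) lexer->advance(lexer);
  if (lexer->lookahead != 0) return false;
  lexer->result_symbol = END_OF_FILE;
  return true;
}
```

Checking `valid_lookaheads` first keeps the scanner from producing the token in places where the grammar does not expect it, which relates to the question of whether the scanner returns something unexpected.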
@maxbrunsfeld Thanks for the detailed explanation! For now I mainly copied the scanner used for python and came across a few problems. Consider this input:
-1
1_1
results in the following output:
(package [0, 0] - [2, 0]
  (expr_stmt [0, 0] - [0, 2]
    (unary_operation [0, 0] - [0, 2]
      (integer [0, 1] - [0, 2])))
  (ERROR [0, 2] - [1, 0])
  (expr_stmt [1, 0] - [1, 3]
    (integer [1, 0] - [1, 3]))
  (expr_stmt [1, 3] - [1, 3]))
./file.few 1 ms ERROR [0, 2] - [1, 0]
In general, after every newline token, the parser appears to throw an error. Also, the parser always ends with an `expr_stmt` token that it takes from nowhere (literally, nowhere). Do you have an idea what might cause this behavior?
All one liners seem to work just fine. However, some input results in the parser entering some kind of invalid state. It just never finishes. Here is an example:
struct Test
.var: Int = 1
How does tree-sitter handle mutual left-recursion? The way I implemented the grammar for tree-sitter, it has quite a bit of it. Might this be a reason for the strange behavior?
You can find it here: https://github.com/few-lang/tree-sitter-few
How does tree-sitter handle mutual left-recursion?
Tree-sitter generates LR parsers, so left-recursion doesn't present any problem.
In general after every newline token, the parser appears to throw an error.
Is your external scanner returning something that's unexpected by the grammar?
All one liners seem to work just fine. However, some input results in the parser entering some kind of invalid state. It just never finishes.
Hmm, you might be hitting this issue: https://github.com/tree-sitter/tree-sitter/issues/98. Do you use the empty string, or regexes that can match zero characters in your grammar?
Hi @maxbrunsfeld, would it be a similar case for the assertion of position at the start of a line, `^`?
example:
/^.{5}F/