When will regex assertions like `$` be supported?

Open jonhue opened this issue 6 years ago • 8 comments

The language I am implementing really needs this functionality, as I am using it for file/script endings.

/\s*$/

jonhue avatar Jan 08 '19 20:01 jonhue

I'd like to support them, but I'm not sure exactly when it will happen. The current workaround is to use an external scanner to match that token. For your use case, do you want $ to match the end of a line or the end of the file?

maxbrunsfeld avatar Jan 08 '19 21:01 maxbrunsfeld

I use it to match the end of the file.

jonhue avatar Jan 08 '19 21:01 jonhue

@maxbrunsfeld I guess I have to use an external scanner anyway, as my language uses an indented syntax similar to Python. Is there any documentation on how to hook up the scanner.cc file?

jonhue avatar Jan 08 '19 21:01 jonhue

> I guess I have to use an external scanner anyway, as my language uses an indented syntax similar to Python.

Yeah, you're right. Unfortunately, I haven't documented the external scanner API yet 😞. The best thing to do right now is to base your scanner on an existing one, such as the one in tree-sitter-python.

The basic idea is:

  1. You have an externals Array on your grammar, like this.
  2. In your external scanner, you have an enum whose elements match that externals array, like this.
  3. You define the 5 external scanner hooks as plain C functions, like this.
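As a rough illustration of steps 1 and 2, here is a minimal sketch of the enum side. It assumes a hypothetical grammar whose externals array lists `$._newline`, `$._indent`, `$._dedent` and `$._end_of_file` in that order; these token names and the `token_is_valid` helper are made up for this example and are not taken from tree-sitter-python.

```c
#include <stdbool.h>

// Hypothetical token names. The order must mirror the grammar's externals array:
//   externals: $ => [$._newline, $._indent, $._dedent, $._end_of_file]
enum TokenType {
  NEWLINE,
  INDENT,
  DEDENT,
  END_OF_FILE,
  TOKEN_TYPE_COUNT  // not a token, just the number of entries above
};

// Hypothetical helper: true when the parser currently expects `type`.
// The bool array passed to the scan hook is indexed by the enum values.
static bool token_is_valid(const bool *valid_lookaheads, enum TokenType type) {
  return valid_lookaheads[type];
}
```

Because the array is indexed by position, reordering the externals array without reordering the enum would make the scanner report the wrong tokens.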

The behaviors of the 5 hooks are:

  1. create - if you need to carry state in your external scanner, allocate your state object here and return a pointer to it.
  2. destroy - if you allocated state in create, release the memory here.
  3. scan - this function receives two parameters: a lexer, and an array of booleans called valid_lookaheads, which tells you which tokens are currently expected in the grammar. You must return a bool indicating whether or not you found a token. If you do find a token, you have to assign to the result_symbol field on the lexer. To advance to the next character, use the lexer->advance function.
  4. serialize - you must be able to represent the scanner's state in a reasonably small buffer of bytes. This function writes to a buffer of bytes, and returns the number of bytes written.
  5. deserialize - this function receives the previously-serialized bytes and must restore the scanner's state. deserialize also gets called with an empty buffer in order to clear the scanner's state, resetting it back to the initial state.
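The five hooks can be sketched roughly as follows for a scanner that matches the end of the file after optional trailing whitespace, which is the `/\s*$/` use case from the top of this thread. This is only a sketch under several assumptions: `mylang` is a placeholder language name, `Lexer` is a simplified stand-in for tree-sitter's `TSLexer` (a real scanner would include `tree_sitter/parser.h` instead), the end of input is assumed to show up as a lookahead of 0, and the `eof_emitted` flag is an invented piece of state used to show serialization. In practice each hook also receives the state pointer returned by create.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

// Simplified stand-in for TSLexer, with only the members used below.
typedef struct Lexer Lexer;
struct Lexer {
  int32_t lookahead;       // current character; assumed to be 0 at end of input
  uint16_t result_symbol;  // set by scan when a token is found
  void (*advance)(Lexer *lexer, bool skip);
};

enum TokenType { END_OF_FILE };  // must match the grammar's externals array

typedef struct {
  bool eof_emitted;  // hypothetical state: avoid matching the empty EOF token twice
} Scanner;

// create: allocate the zero-initialized state object.
void *tree_sitter_mylang_external_scanner_create(void) {
  return calloc(1, sizeof(Scanner));
}

// destroy: release what create allocated.
void tree_sitter_mylang_external_scanner_destroy(void *payload) {
  free(payload);
}

// serialize: write the state into the buffer and return the byte count.
unsigned tree_sitter_mylang_external_scanner_serialize(void *payload, char *buffer) {
  Scanner *scanner = (Scanner *)payload;
  buffer[0] = (char)scanner->eof_emitted;
  return 1;
}

// deserialize: restore the state; an empty buffer resets it to the initial state.
void tree_sitter_mylang_external_scanner_deserialize(void *payload, const char *buffer,
                                                     unsigned length) {
  Scanner *scanner = (Scanner *)payload;
  scanner->eof_emitted = length > 0 && buffer[0] != 0;
}

// scan: skip trailing whitespace, then succeed only at the end of input.
bool tree_sitter_mylang_external_scanner_scan(void *payload, Lexer *lexer,
                                              const bool *valid_lookaheads) {
  Scanner *scanner = (Scanner *)payload;
  if (!valid_lookaheads[END_OF_FILE] || scanner->eof_emitted) return false;
  while (lexer->lookahead == ' ' || lexer->lookahead == '\t' ||
         lexer->lookahead == '\r' || lexer->lookahead == '\n') {
    lexer->advance(lexer, true);
  }
  if (lexer->lookahead != 0) return false;
  scanner->eof_emitted = true;
  lexer->result_symbol = END_OF_FILE;
  return true;
}
```

The flag is there because a token that can match zero characters could otherwise be returned again and again at the same position.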

maxbrunsfeld avatar Jan 08 '19 21:01 maxbrunsfeld

You also need to add the scanner file to the `sources` list in binding.gyp.

Aerijo avatar Jan 08 '19 22:01 Aerijo

@maxbrunsfeld Thanks for the detailed explanation! For now, I mostly copied the scanner used for Python and ran into a few problems. Consider this input:

-1
1_1

It results in the following output:

(package [0, 0] - [2, 0]
  (expr_stmt [0, 0] - [0, 2]
    (unary_operation [0, 0] - [0, 2]
      (integer [0, 1] - [0, 2])))
  (ERROR [0, 2] - [1, 0])
  (expr_stmt [1, 0] - [1, 3]
    (integer [1, 0] - [1, 3]))
  (expr_stmt [1, 3] - [1, 3]))
./file.few      1 ms    ERROR [0, 2] - [1, 0]

In general, the parser appears to report an error after every newline token. The parser also always ends with an expr_stmt node that it takes from nowhere (literally, nowhere). Do you have any idea what might cause this behavior?

All one-liners seem to work just fine. However, some input results in the parser entering some kind of invalid state. It just never finishes. Here is an example:

struct Test
  .var: Int = 1

How does tree-sitter handle mutual left-recursion? My grammar, as I implemented it for tree-sitter, has quite a bit of it. Might this be a reason for the strange behavior?

You can find the grammar here: https://github.com/few-lang/tree-sitter-few

jonhue avatar Jan 10 '19 19:01 jonhue

> How does tree-sitter handle mutual left-recursion?

Tree-sitter generates LR parsers, so left-recursion doesn't present any problem.

> In general after every newline token, the parser appears to throw an error.

Is your external scanner returning something that's unexpected by the grammar?

> All one liners seem to work just fine. However, some input results in the parser entering some kind of invalid state. It just never finishes.

Hmm, you might be hitting this issue: https://github.com/tree-sitter/tree-sitter/issues/98. Do you use the empty string, or regexes that can match zero characters in your grammar?

maxbrunsfeld avatar Jan 10 '19 20:01 maxbrunsfeld

Hi @maxbrunsfeld, would the same apply to the start-of-line assertion `^`? For example: /^.{5}F/

szamarri avatar Jan 17 '19 17:01 szamarri