tree-sitter-julia

bug: REPL prompt parsed incorrectly

Open • halleysfifthinc opened this issue 1 year ago • 4 comments

Did you check existing issues?

  • [X] I have read all the tree-sitter docs if it relates to using the parser
  • [X] I have searched the existing issues of tree-sitter-PARSER_NAME

Tree-Sitter CLI Version, if relevant (output of tree-sitter --version)

No response

Describe the bug

REPL prompts (e.g. in julia-repl code blocks in docstrings) are not recognized as a thing™ and are parsed using the normal rules, so julia> is read as the identifier julia followed by the > operator.

Steps To Reproduce/Bad Parse Tree

julia> a = [1,2]

gives

(source_file [0, 0] - [1, 0]
  (assignment [0, 0] - [0, 16]
    (binary_expression [0, 0] - [0, 8]
      (identifier [0, 0] - [0, 5])
      (operator [0, 5] - [0, 6])
      (identifier [0, 7] - [0, 8]))
    (operator [0, 9] - [0, 10])
    (vector_expression [0, 11] - [0, 16]
      (integer_literal [0, 12] - [0, 13])
      (integer_literal [0, 14] - [0, 15]))))

The parse tree is even more wrong with a preceding empty REPL prompt:

julia> 
julia> a = [1,2]

gives

(source_file [0, 0] - [2, 0]
  (assignment [0, 0] - [1, 16]
    (binary_expression [0, 0] - [1, 8]
      (binary_expression [0, 0] - [1, 5]
        (identifier [0, 0] - [0, 5])
        (operator [0, 5] - [0, 6])
        (identifier [1, 0] - [1, 5]))
      (operator [1, 5] - [1, 6])
      (identifier [1, 7] - [1, 8]))
    (operator [1, 9] - [1, 10])
    (vector_expression [1, 11] - [1, 16]
      (integer_literal [1, 12] - [1, 13])
      (integer_literal [1, 14] - [1, 15]))))

(The Julia code equivalent to that parse tree is julia > julia > a = [1,2], which is incoherent syntax and throws an error.)

Expected Behavior/Parse Tree

I would expect something like

(source_file 
  (repl_prompt
    (assignment
      (operator)
      (vector_expression
        (integer_literal)
        (integer_literal)))))

and

(source_file 
  (repl_prompt)
  (repl_prompt
    (assignment
      (operator)
      (vector_expression
        (integer_literal)
        (integer_literal)))))

Repro

No response

halleysfifthinc • May 01 '24 22:05

The repl prompt isn't parsed because it's not part of the language at all.

> e.g. in julia-repl codeblocks in docstrings

What editor/platform are you using that uses tree-sitter for julia-repl code blocks?

savq • May 02 '24 02:05

To be clear, the problem here is that there's no way of knowing what's code, i.e. input, and what's not code, i.e. the prompt and the output.

A possible solution would be a separate grammar that parses the prompts and treats everything between a prompt and the next newline as a Julia code injection. Lines without a prompt are assumed to be output. The limitation of this approach is that it could not parse multi-line inputs.
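The separate-grammar idea above can be approximated outside tree-sitter to see what it would and would not capture. The following is a minimal Python sketch, not a tree-sitter grammar: the function name classify and the exact prompt pattern (the literal julia> at the start of a line, optionally followed by one space and the input) are assumptions made for illustration.

```python
import re

# Assumption: a prompt is the literal "julia>" at the start of a line,
# optionally followed by one space and the input text.
PROMPT = re.compile(r"^julia>(?: (.*))?$")


def classify(transcript):
    """Tag each line as ("input", code) or ("output", text).

    Only the text after a prompt counts as Julia code (the part a real
    grammar would hand to the Julia parser as an injection); every line
    without a prompt is treated as output.
    """
    result = []
    for line in transcript.splitlines():
        match = PROMPT.match(line)
        if match:
            # group(1) is None for a bare "julia>" with nothing after it.
            result.append(("input", match.group(1) or ""))
        else:
            result.append(("output", line))
    return result
```

Under these assumptions, a multi-line input such as a function definition has only its first line tagged as input; the indented continuation lines and the final end fall through to output, which is the limitation described above.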

savq • May 02 '24 03:05

> What editor/platform are you using that uses tree-sitter for julia-repl code blocks?

Neovim; I added injection queries to markdown to highlight julia-repl, jldoctest, and Documenter blocks (e.g. @example, etc).

> The repl prompt isn't parsed because it's not part of the language at all.

Not sure I agree with that; it depends on whether you look at it technically or functionally. The REPL itself correctly handles pasted REPL code by stripping the prompts. So ideally, such snippets would be parsed/highlighted correctly by a Julia parser (though maybe not this one).

I played around a bit yesterday trying some simple rules (e.g. require that repl prompts occur at the beginning of a line using token.immediate, etc), but I haven't worked with tree-sitter grammars before so I didn't make much progress.

I hadn't considered that REPL output would need to be left unparsed (or only minimally parsed), which definitely requires a separate julia-repl grammar. How difficult would it be to adapt the existing rules here for a new grammar that could handle multi-line inputs? (Given that this julia grammar already correctly handles multi-line statements/blocks, etc.)
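One heuristic for multi-line inputs, sketched here in Python rather than as a grammar, is to treat lines that directly follow a prompt line and are indented by the prompt's width (seven spaces) as continuations of the same input. That indentation convention, the function name group_inputs, and the block representation are assumptions for illustration; output that happens to start with seven spaces right after an input would be misclassified.

```python
PROMPT = "julia> "
# Assumption: continuation lines of a multi-line input are indented
# by the width of the prompt (seven spaces).
CONTINUATION = " " * len(PROMPT)


def group_inputs(transcript):
    """Group a transcript into ("input", code) and ("output", text) blocks."""
    blocks = []        # each entry is [kind, list_of_lines]
    in_input = False   # True while the previous line belonged to an input
    for line in transcript.splitlines():
        if line.startswith(PROMPT) or line == PROMPT.rstrip():
            # A new prompt always starts a new input block.
            blocks.append(["input", [line[len(PROMPT):]]])
            in_input = True
        elif in_input and line.startswith(CONTINUATION):
            # Indented line right after input: continue the same input.
            blocks[-1][1].append(line[len(CONTINUATION):])
        else:
            # Anything else is output; merge consecutive output lines.
            if not blocks or blocks[-1][0] != "output":
                blocks.append(["output", []])
            blocks[-1][1].append(line)
            in_input = False
    return [(kind, "\n".join(lines)) for kind, lines in blocks]
```

Each input block could then be handed to the existing Julia grammar as a whole, so multi-line statements would be parsed by the rules that already handle them.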

halleysfifthinc • May 02 '24 18:05

> How difficult would it be to adapt the existing rules here for a new grammar that could handle multi-line inputs?

No idea.

I know some repos have multiple grammars to handle multi-language documents, like tree-sitter/tree-sitter-typescript. That might be a good first place to look.

savq • May 02 '24 20:05