elixir icon indicating copy to clipboard operation
elixir copied to clipboard

Relative tokenizer

Open lukaszsamson opened this issue 9 months ago • 3 comments

This is an attempt at making the tokenizer produce relative tokens. My intention is to enable building an incremental parsers better suited to LSP use case. With relative tokenization and parsing it may be possible to build a parser that does not need to rebuild the entire AST on each document edit.

Design choices:

  • In relative mode token line and column represents a difference from the last token position
  • In case of interpolated binaries, charlists and sigils the parts are relative to the position of the token itself
  • Inside interpolation the positions are relative to the begin of interpolation #{

The current state:

  • the relative mode produces valid relative tokens that after conversion to absolute via :elixir_tokenizer.to_absolute_tokens are identical to the ones produced by absolute mode. I verified this over elixir source as well as a number of other projects.
  • all tests pass
  • parser and errors tests return the same tokens and errors/warnings

Examples:

iex(2)> :elixir_tokenizer.tokenize(~c'fun(x + 1)', 1, 1, [mode: :absolute]) |> elem(4) |> Enum.reverse
[
  {:paren_identifier, {1, 1, ~c"fun"}, :fun},
  {:"(", {1, 4, nil}},
  {:identifier, {1, 5, ~c"x"}, :x},
  {:dual_op, {1, 7, nil}, :+},
  {:int, {1, 9, 1}, ~c"1"},
  {:")", {1, 10, nil}}
]
iex(3)> :elixir_tokenizer.tokenize(~c'fun(x + 1)', 1, 1, [mode: :relative]) |> elem(4) |> Enum.reverse
[
  {:paren_identifier, {0, 0, ~c"fun"}, :fun},
  {:"(", {0, 3, nil}},
  {:identifier, {0, 1, ~c"x"}, :x},
  {:dual_op, {0, 2, nil}, :+},
  {:int, {0, 2, 1}, ~c"1"},
  {:")", {0, 1, nil}}
]
iex(7)> :elixir_tokenizer.tokenize(~c'"\#{fun(x + 1)}" <> ""', 1, 1, [mode: :absolute]) |> elem(4) |> Enum.reverse
[
  {:bin_string, {1, 1, nil},
   [
     {{1, 2, nil}, {1, 14, nil},
      [
        {:paren_identifier, {1, 4, ~c"fun"}, :fun},
        {:"(", {1, 7, nil}},
        {:identifier, {1, 8, ~c"x"}, :x},
        {:dual_op, {1, 10, nil}, :+},
        {:int, {1, 12, 1}, ~c"1"},
        {:")", {1, 13, nil}}
      ]}
   ]},
  {:concat_op, {1, 17, nil}, :<>},
  {:bin_string, {1, 20, nil}, [""]}
]
iex(8)> :elixir_tokenizer.tokenize(~c'"\#{fun(x + 1)}" <> ""', 1, 1, [mode: :relative]) |> elem(4) |> Enum.reverse
[
  {:bin_string, {0, 0, nil},
   [
     {{0, 1, nil}, {0, 13, nil},
      [
        {:paren_identifier, {0, 3, ~c"fun"}, :fun},
        {:"(", {0, 3, nil}},
        {:identifier, {0, 1, ~c"x"}, :x},
        {:dual_op, {0, 2, nil}, :+},
        {:int, {0, 2, 1}, ~c"1"},
        {:")", {0, 1, nil}}
      ]}
   ]},
  {:concat_op, {0, 16, nil}, :<>},
  {:bin_string, {0, 3, nil}, [""]}
]

lukaszsamson avatar Feb 16 '25 14:02 lukaszsamson

I have been thinking about this. Couldn't this be implemented by doing a later pass on the tokens or the AST that computes the difference and relative positions? Or perhaps we include more information on the metadata so it can be done by a later pass?

josevalim avatar Feb 20 '25 05:02 josevalim

I'm not fully convinced this is the right step on the road to incremental parsing. My intention was to explore a direction and see where we can go from there. Metadata in the current form is the problem. What I would like to see is separation of AST and position/range metadata so were possible to keep AST nodes in one place and have metadata stored elsewhere. A weak map would be nice but I'm not aware of one for OTP.

lukaszsamson avatar Feb 25 '25 09:02 lukaszsamson

@lukaszsamson something like this for the tracking? https://github.com/jonatanklosko/elixir_ast_ranges

I honestly don't think a weak map would be that hard to implement in C as you could send messages when a reference/resource is GCed, so I think we could explore solutions alongside that.

josevalim avatar Feb 25 '25 09:02 josevalim

Closing this for now, we can revisit if that's indeed the way we want to move forward in the future.

josevalim avatar Aug 30 '25 08:08 josevalim