Multi-language parsing with ERB does not respect ranges

Open stackmystack opened this issue 1 year ago • 3 comments

Description

I have been trying to implement a multi-language parser using:

  1. Embedded Templates
  2. HTML
  3. Ruby

… as described in the documentation (which has a typo, see this PR).

But it's failing to produce the correct HTML tree.

The input example is:

<% if t %><span> hello </span><% end %>

[EOF]

Note the newline at the end of the input.

The ERB tree produces two content nodes, covering the <span>… part and the trailing \n respectively. Printing the text in each byte range confirms this.

However, once I switch to the HTML parser, I get the final text node pointing to <% end %>\n part of the source.
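
A minimal sketch of how this mismatch could be checked mechanically, without depending on tree-sitter itself. `ByteRange`, `span_within_ranges`, `range_of` and `sample_node_is_inside` are hypothetical names introduced here for illustration; `ByteRange` only mimics the byte fields of `TSRange`. The included ranges are the two content spans described above, and the node span passed in is whatever text the HTML tree reports.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the byte fields of TSRange (end is exclusive). */
typedef struct {
  uint32_t start_byte;
  uint32_t end_byte;
} ByteRange;

/* True if [start, end) lies entirely inside one of the included ranges. */
static bool span_within_ranges(uint32_t start, uint32_t end,
                               const ByteRange *ranges, size_t count) {
  for (size_t i = 0; i < count; i++) {
    if (start >= ranges[i].start_byte && end <= ranges[i].end_byte) {
      return true;
    }
  }
  return false;
}

/* Byte range of the first occurrence of `needle` in `text`. */
static ByteRange range_of(const char *text, const char *needle) {
  const char *hit = strstr(text, needle);
  assert(hit != NULL);
  ByteRange r = {(uint32_t)(hit - text),
                 (uint32_t)(hit - text + strlen(needle))};
  return r;
}

/* The input from this report, including the trailing newline. */
static const char *const SAMPLE = "<% if t %><span> hello </span><% end %>\n";

/* Checks whether a node covering `node_text` stays inside the two
   content ranges that the ERB tree reports for SAMPLE. */
static bool sample_node_is_inside(const char *node_text) {
  ByteRange included[2];
  included[0] = range_of(SAMPLE, "<span> hello </span>");
  included[1] = range_of(SAMPLE, "\n");
  ByteRange node = range_of(SAMPLE, node_text);
  return span_within_ranges(node.start_byte, node.end_byte, included, 2);
}
```

With this helper, a node covering ` hello ` or the trailing newline is inside an included range, while a node covering `<% end %>\n` is not, which is the situation reported here.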

I am using tree-sitter 0.20.8, and all parsers were regenerated using the tools of that tag.

Reference Program

Here's the program I used to produce the bug (adapted from the doc example):

#include <stdio.h>
#include <string.h>
#include <tree_sitter/api.h>
#include <fcntl.h>
#include <unistd.h>

// These functions are each implemented in their own repo.
const TSLanguage *tree_sitter_embedded_template(void);
const TSLanguage *tree_sitter_html(void);
const TSLanguage *tree_sitter_ruby(void);

void print_node(TSNode node, const char* text);

int main(int argc, const char **argv) {
  // The template source is passed as the first command-line argument.
  if (argc < 2) {
    fprintf(stderr, "usage: %s <erb-source>\n", argv[0]);
    return 1;
  }
  const char *text = argv[1];
  unsigned len = strlen(text);

  // Parse the entire text as ERB.
  TSParser *parser = ts_parser_new();
  ts_parser_set_language(parser, tree_sitter_embedded_template());
  TSTree *erb_tree = ts_parser_parse_string(parser, NULL, text, len);
  TSNode erb_root_node = ts_tree_root_node(erb_tree);

  // In the ERB syntax tree, find the ranges of the `content` nodes,
  // which represent the underlying HTML, and the `code` nodes, which
  // represent the interpolated Ruby.
  TSRange html_ranges[10];
  TSRange ruby_ranges[10];
  unsigned html_range_count = 0;
  unsigned ruby_range_count = 0;
  unsigned child_count = ts_node_child_count(erb_root_node);

  printf("ERB details:\n");
  for (unsigned i = 0; i < child_count; i++) {
    TSNode node = ts_node_child(erb_root_node, i);
    print_node(node, text);
    if (strcmp(ts_node_type(node), "content") == 0) {
      html_ranges[html_range_count++] = (TSRange) {
        ts_node_start_point(node),
        ts_node_end_point(node),
        ts_node_start_byte(node),
        ts_node_end_byte(node),
      };
    } else {
      TSNode code_node = ts_node_named_child(node, 0);
      ruby_ranges[ruby_range_count++] = (TSRange) {
        ts_node_start_point(code_node),
        ts_node_end_point(code_node),
        ts_node_start_byte(code_node),
        ts_node_end_byte(code_node),
      };
    }
  }

  // Use the HTML ranges to parse the HTML.
  //   parser = ts_parser_new();
  ts_parser_set_language(parser, tree_sitter_html());
  bool status = ts_parser_set_included_ranges(parser, html_ranges, html_range_count);
  printf("\nset_included_ranges: %d\n", status);

  // Counts and byte offsets are unsigned, so print them with %u.
  printf("\nHTML ranges count: %u\n", html_range_count);
  const TSRange* ranges = ts_parser_included_ranges(parser, &html_range_count);
  for(unsigned i = 0 ; i < html_range_count; i++){
    printf(">> %u-%u\n", ranges[i].start_byte, ranges[i].end_byte);
  }

  TSTree *html_tree = ts_parser_parse_string(parser, NULL, text, len);
  TSNode html_root_node = ts_tree_root_node(html_tree);

//   printf("\nHTML ranges count: %d\n", html_range_count);
//   ranges = ts_parser_included_ranges(parser, &html_range_count);
//   for(unsigned i = 0 ; i < html_range_count; i++){
//     printf(">> %d-%d\n", ranges[i].start_byte, ranges[i].end_byte);
//   }

//   int fd = open("ts.gv", O_WRONLY | O_CREAT | O_TRUNC, 0644);
//   ts_tree_print_dot_graph(html_tree, fd);
//   close(fd);
//   system("open ts.gv");

  child_count = ts_node_child_count(html_root_node);
  printf("\nHTML details:\n");
  for (unsigned i = 0; i < child_count; i++) {
    TSNode node = ts_node_child(html_root_node, i);
    print_node(node, text);
  }

  // Use the Ruby ranges to parse the Ruby.
  ts_parser_set_language(parser, tree_sitter_ruby());
  ts_parser_set_included_ranges(parser, ruby_ranges, ruby_range_count);
  TSTree *ruby_tree = ts_parser_parse_string(parser, NULL, text, len);
  TSNode ruby_root_node = ts_tree_root_node(ruby_tree);

  // Print all three trees.
  char *erb_sexp = ts_node_string(erb_root_node);
  char *html_sexp = ts_node_string(html_root_node);
  char *ruby_sexp = ts_node_string(ruby_root_node);
  printf("\n");
  printf("ERB: %s\n", erb_sexp);
  printf("HTML: %s\n", html_sexp);
  printf("Ruby: %s\n", ruby_sexp);
  return 0;
}

void print_node(TSNode node, const char* text) {
    const char* type = ts_node_type(node);
    uint32_t start = ts_node_start_byte(node);
    uint32_t end = ts_node_end_byte(node);
    TSPoint spoint = ts_node_start_point(node);
    TSPoint epoint = ts_node_end_point(node);

    // Offsets and points are uint32_t (%u); the %.*s precision must be an int.
    printf("%s (%.*s) {%u-%u | %u.%u-%u.%u}\n", type, (int)(end - start),
           &text[start], start, end, spoint.row, spoint.column, epoint.row,
           epoint.column);
}
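
The `set_included_ranges` status printed by the program reflects a documented precondition of `ts_parser_set_included_ranges`: the ranges must be ordered from earliest to latest and must not overlap, otherwise the call returns false and the ranges are not applied. A minimal sketch of that precondition, as a way to rule it out before suspecting the parser. `ByteRange` and `ranges_are_ordered` are hypothetical names for illustration, not part of the tree-sitter API, and the check is an approximation of the documented rule rather than the library's own code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the byte fields of TSRange (end is exclusive). */
typedef struct {
  uint32_t start_byte;
  uint32_t end_byte;
} ByteRange;

/* True if the ranges are sorted by position, do not overlap,
   and none of them ends before it starts. */
static bool ranges_are_ordered(const ByteRange *ranges, size_t count) {
  uint32_t previous_end = 0;
  for (size_t i = 0; i < count; i++) {
    if (ranges[i].start_byte < previous_end) {
      return false; /* out of order, or overlaps the previous range */
    }
    if (ranges[i].end_byte < ranges[i].start_byte) {
      return false; /* inverted range */
    }
    previous_end = ranges[i].end_byte;
  }
  return true;
}
```

Ranges collected by walking the ERB root's children in order, as in the program above, should pass this check, so a true status there does not by itself explain the misplaced text node.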

PS: I stumbled upon this issue, which has been open since 2019. Is this bug report related?

More Examples

A newline inside the ERB directive works fine:

<% if t %><span> hello </span>
<% end %>[EOF]

<% if t %>
<span> hello </span><% end %>[EOF]

<% if t %>
<span> hello </span>
<% end %>[EOF]

This can happen anywhere, as in this example:

<% [1].each do |i| %>
<span> text </span>
<% if t %> <div> divvy's </div>
<% end %>
<% end %>[EOF]

So I get text nodes pointing to <% if t %> and the first <% end %>.
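
To compare against what the HTML tree reports for inputs like these, the expected content ranges can be approximated with a naive scanner that treats everything outside `<% ... %>` as content. This is only a rough stand-in for the embedded-template grammar, not the grammar itself: it ignores escapes such as `<%%` and any other special cases. `ByteRange` and `content_ranges` are hypothetical names introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the byte fields of TSRange (end is exclusive). */
typedef struct {
  uint32_t start_byte;
  uint32_t end_byte;
} ByteRange;

/* Writes up to `cap` content ranges of `text` into `out` and returns how
   many were written. Content is every non-empty gap between directives;
   ranges that do not fit in `out` are silently dropped. */
static size_t content_ranges(const char *text, ByteRange *out, size_t cap) {
  size_t count = 0;
  const char *cursor = text;
  const char *end = text + strlen(text);

  while (cursor < end) {
    const char *open = strstr(cursor, "<%");
    const char *stop = open ? open : end;

    /* Record the gap before the next directive, if there is one. */
    if (stop > cursor && count < cap) {
      out[count].start_byte = (uint32_t)(cursor - text);
      out[count].end_byte = (uint32_t)(stop - text);
      count++;
    }
    if (!open) {
      break; /* no more directives */
    }

    const char *close = strstr(open + 2, "%>");
    if (!close) {
      break; /* unterminated directive: stop scanning */
    }
    cursor = close + 2;
  }
  return count;
}
```

Any text node in the HTML tree whose byte span falls outside the ranges returned here would be one of the nodes pointing at a directive.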

stackmystack • Nov 06 '23 16:11

@maxbrunsfeld @ahlinc @amaanq do you have any idea what's happening? Maybe you could point me in some direction to better understand the issue? Maybe I could even help fix this?

stackmystack • Nov 14 '23 08:11

Is this issue still present? Trying it in Neovim (because I'm lazy) seems to work fine while #327 is still broken.

ObserverOfTime • Apr 12 '24 17:04

However, once I switch to the HTML parser, I get the final text node pointing to <% end %>\n part of the source.

I don't understand what this means. Could you report what your expected syntax tree is, and what the actual tree is that you're getting, for the HTML?

maxbrunsfeld • Apr 12 '24 22:04

This has been closed since a request for information has not been answered for 30 days. It can be reopened when the requested information is provided.

github-actions[bot] • May 13 '24 01:05