Description

I have been trying to implement a multi-language parser using:

… as described in the documentation (which has a typo; see this PR).

But it's failing to produce the correct HTML tree.

The input example is:

<% if t %><span> hello </span><% end %>

[EOF]

Note the newline at the end.

The ERB tree produces 2 content nodes pointing to the <span>… part and the \n respectively. Printing the content of the byte range validates it.

However, once I switch to the HTML parser, the final text node points to the `<% end %>\n` part of the source, even though `<% end %>` is not part of any `content` range.
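To make the expectation concrete, here is a minimal sketch of the text the HTML parser should see if it honoured the two `content` ranges. It does not call tree-sitter at all: `ByteRange` and `copy_included` are hypothetical helpers written only for illustration. The offsets 10-30 and 39-40 used in the note after the block are my own count for the input above, assuming the `content` nodes cover exactly `<span> hello </span>` and the trailing newline.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical stand-in for the byte offsets carried by a TSRange.
typedef struct {
  unsigned start_byte;
  unsigned end_byte;
} ByteRange;

// Concatenates the bytes of `text` covered by `ranges` into `out` and
// NUL-terminates the result. Returns the number of bytes copied, or 0 if
// `out` is too small (in which case `out` is left as an empty string).
size_t copy_included(const char *text, const ByteRange *ranges, size_t count,
                     char *out, size_t out_size) {
  if (out_size == 0) return 0;
  out[0] = '\0';
  size_t used = 0;
  for (size_t i = 0; i < count; i++) {
    size_t len = ranges[i].end_byte - ranges[i].start_byte;
    if (used + len + 1 > out_size) {
      out[0] = '\0';
      return 0;
    }
    memcpy(out + used, text + ranges[i].start_byte, len);
    used += len;
  }
  out[used] = '\0';
  return used;
}
```

Called with the ranges `{10, 30}` and `{39, 40}` on the input above, this yields `<span> hello </span>` followed by the newline, with no trace of `<% end %>`. That is the only text I would expect HTML nodes to point into.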

I am using tree-sitter 0.20.8, and all parsers were regenerated using the tools of that tag.

Reference Program

Here's the program I used to produce the bug (adapted from the doc example):

#include <stdio.h>   // printf
#include <stdlib.h>  // system (used in the commented-out debug code)
#include <string.h>
#include <tree_sitter/api.h>
#include <fcntl.h>
#include <unistd.h>

// These functions are each implemented in their own repo.
const TSLanguage *tree_sitter_embedded_template();
const TSLanguage *tree_sitter_html();
const TSLanguage *tree_sitter_ruby();

void print_node(TSNode node, const char* text);

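int main(int argc, const char **argv) {
  // The source to parse is passed as the first command-line argument.
  if (argc < 2) {
    fprintf(stderr, "usage: %s <erb-source>\n", argv[0]);
    return 1;
  }
  const char *text = argv[1];
  unsigned len = strlen(text);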

  // Parse the entire text as ERB.
  TSParser *parser = ts_parser_new();
  ts_parser_set_language(parser, tree_sitter_embedded_template());
  TSTree *erb_tree = ts_parser_parse_string(parser, NULL, text, len);
  TSNode erb_root_node = ts_tree_root_node(erb_tree);

  // In the ERB syntax tree, find the ranges of the `content` nodes,
  // which represent the underlying HTML, and the `code` nodes, which
  // represent the interpolated Ruby.
  TSRange html_ranges[10];
  TSRange ruby_ranges[10];
  unsigned html_range_count = 0;
  unsigned ruby_range_count = 0;
  unsigned child_count = ts_node_child_count(erb_root_node);

  printf("ERB details:\n");
  for (unsigned i = 0; i < child_count; i++) {
    TSNode node = ts_node_child(erb_root_node, i);
    print_node(node, text);
    if (strcmp(ts_node_type(node), "content") == 0) {
      html_ranges[html_range_count++] = (TSRange) {
        ts_node_start_point(node),
        ts_node_end_point(node),
        ts_node_start_byte(node),
        ts_node_end_byte(node),
      };
    } else {
      TSNode code_node = ts_node_named_child(node, 0);
      ruby_ranges[ruby_range_count++] = (TSRange) {
        ts_node_start_point(code_node),
        ts_node_end_point(code_node),
        ts_node_start_byte(code_node),
        ts_node_end_byte(code_node),
      };
    }
  }

  // Use the HTML ranges to parse the HTML.
  //   parser = ts_parser_new();
  ts_parser_set_language(parser, tree_sitter_html());
  bool status = ts_parser_set_included_ranges(parser, html_ranges, html_range_count);
  printf("\nset_included_ranges: %d\n", status);

  printf("\nHTML ranges count: %u\n", html_range_count);
  const TSRange* ranges = ts_parser_included_ranges(parser, &html_range_count);
  for(unsigned i = 0 ; i < html_range_count; i++){
    printf(">> %u-%u\n", ranges[i].start_byte, ranges[i].end_byte);
  }

  TSTree *html_tree = ts_parser_parse_string(parser, NULL, text, len);
  TSNode html_root_node = ts_tree_root_node(html_tree);

//   printf("\nHTML ranges count: %d\n", html_range_count);
//   ranges = ts_parser_included_ranges(parser, &html_range_count);
//   for(unsigned i = 0 ; i < html_range_count; i++){
//     printf(">> %d-%d\n", ranges[i].start_byte, ranges[i].end_byte);
//   }

//   int fd = open("ts.gv", O_WRONLY | O_CREAT | O_TRUNC, 0644);
//   ts_tree_print_dot_graph(html_tree, fd);
//   close(fd);
//   system("open ts.gv");

  child_count = ts_node_child_count(html_root_node);
  printf("\nHTML details:\n");
  for (unsigned i = 0; i < child_count; i++) {
    TSNode node = ts_node_child(html_root_node, i);
    print_node(node, text);
  }

  // Use the Ruby ranges to parse the Ruby.
  ts_parser_set_language(parser, tree_sitter_ruby());
  ts_parser_set_included_ranges(parser, ruby_ranges, ruby_range_count);
  TSTree *ruby_tree = ts_parser_parse_string(parser, NULL, text, len);
  TSNode ruby_root_node = ts_tree_root_node(ruby_tree);

  // Print all three trees.
  char *erb_sexp = ts_node_string(erb_root_node);
  char *html_sexp = ts_node_string(html_root_node);
  char *ruby_sexp = ts_node_string(ruby_root_node);
  printf("\n");
  printf("ERB: %s\n", erb_sexp);
  printf("HTML: %s\n", html_sexp);
  printf("Ruby: %s\n", ruby_sexp);
  return 0;
}

void print_node(TSNode node, const char* text) {
    const char* type = ts_node_type(node);
    uint32_t start = ts_node_start_byte(node);
    uint32_t end = ts_node_end_byte(node);
    TSPoint spoint = ts_node_start_point(node);
    TSPoint epoint = ts_node_end_point(node);

    // `%.*s` expects an int precision; the offsets and points are unsigned.
    printf("%s (%.*s) {%u-%u | %u.%u-%u.%u}\n", type, (int)(end - start), &text[start], start, end, spoint.row, spoint.column, epoint.row, epoint.column);
}

PS: I stumbled upon this issue, which has been open since 2019. Is this bug report related?

More Examples

A newline inside the ERB `directive` works fine:

<% if t %><span> hello </span>
<% end %>[EOF]

<% if t %>
<span> hello </span><% end %>[EOF]

<% if t %>
<span> hello </span>
<% end %>[EOF]

This can happen anywhere, as in this example:

<% [1].each do |i| %>
<span> text </span>
<% if t %> <div> divvy's </div>
<% end %>
<% end %>[EOF]

So I get text nodes pointing to <% if t %> and the first <% end %>.
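A quick way to flag such nodes is to test whether a node's byte span lies inside one of the included ranges. The following is only a sketch under my own assumptions: `ByteRange` and `span_in_ranges` are hypothetical helpers rather than tree-sitter API, and the check is meant for leaf nodes such as `text`, since a parent element may legitimately span several ranges.

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical stand-in for the byte offsets carried by a TSRange.
typedef struct {
  unsigned start_byte;
  unsigned end_byte;
} ByteRange;

// Returns true if the half-open span [start, end) lies entirely inside a
// single included range, false otherwise.
bool span_in_ranges(unsigned start, unsigned end,
                    const ByteRange *ranges, size_t count) {
  for (size_t i = 0; i < count; i++) {
    if (start >= ranges[i].start_byte && end <= ranges[i].end_byte) {
      return true;
    }
  }
  return false;
}
```

For the first example in this report, with ranges `{10, 30}` and `{39, 40}` (my own count of the offsets), a text node covering bytes 30-40 (`<% end %>` plus the newline) is rejected, while one covering only 39-40 passes.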

Nov 06 '23 16:11 stackmystack

@maxbrunsfeld @ahlinc @amaanq do you have any idea what's happening? Could you point me in some direction so I can better understand the issue? Maybe I can even help fix it.

Nov 14 '23 08:11 stackmystack

Is this issue still present? Trying it in Neovim (because I'm lazy) seems to work fine while #327 is still broken.

Apr 12 '24 17:04 ObserverOfTime

> However, once I switch to the HTML parser, I get the final text node pointing to <% end %>\n part of the source.

I don't understand what this means. Could you report what your expected syntax tree is, and what the actual tree is that you're getting, for the HTML?

Apr 12 '24 22:04 maxbrunsfeld

This has been closed since a request for information has not been answered for 30 days. It can be reopened when the requested information is provided.

May 13 '24 01:05 github-actions[bot]

Multi-language parsing with ERB does not respect ranges
