syntect icon indicating copy to clipboard operation
syntect copied to clipboard

Embed rewriting not perfect

Open keith-hall opened this issue 6 years ago • 8 comments

After the latest fixes, I tried updating the submodules, and can see we still have some syntax test failures, this time relating to embed...escape:

FAILED testdata/Packages\Java\syntax_test_java.java: 817
FAILED testdata/Packages\Markdown\syntax_test_markdown.md: 61

We get the same failures in Sublime Text, when rewriting the embed...escape rules to with_prototype like syntect's yaml_load is doing, so the good news is that syntects parser is working correctly:

JavaDoc.sublime-syntax:

--- Shipped Packages/Java/JavaDoc.sublime-syntax    2018-10-11 19:11:54
+++ Packages/Java/JavaDoc.sublime-syntax    2018-10-30 11:28:20
@@ -20,11 +20,12 @@
     - meta_include_prototype: false
     - match: /\*\*
       scope: comment.block.documentation.javadoc punctuation.definition.comment.begin.javadoc
-      embed: contents
-      embed_scope: comment.block.documentation.javadoc text.html.javadoc
-      escape: \*/
-      escape_captures:
-        0: comment.block.documentation.javadoc punctuation.definition.comment.end.javadoc
+      push:
+        - [{ meta_include_prototype: false }, { meta_content_scope: 'comment.block.documentation.javadoc text.html.javadoc' }, { match: '\*/', scope: comment.block.documentation.javadoc punctuation.definition.comment.end.javadoc, pop: true }]
+        - contents
+      with_prototype:
+        - match: (?=\*/)
+          pop: true

Markdown.sublime-syntax:

--- Shipped Packages/Markdown/Markdown.sublime-syntax   2018-10-11 19:11:54
+++ Packages/Markdown/Markdown.sublime-syntax   2018-10-30 09:44:46
@@ -1106,12 +1106,12 @@ contexts:
         0: meta.code-fence.definition.begin.html-php.markdown-gfm
         2: punctuation.definition.raw.code-fence.begin.markdown
         5: constant.other.language-name.markdown
-      embed: scope:embedding.php
-      embed_scope: markup.raw.code-fence.html-php.markdown-gfm
-      escape: '{{code_fence_escape}}'
-      escape_captures:
-        0: meta.code-fence.definition.end.html-php.markdown-gfm
-        1: punctuation.definition.raw.code-fence.end.markdown
+      push:
+        - [{ meta_include_prototype: false }, { meta_content_scope: 'markup.raw.code-fence.html-php.markdown-gfm' }, { match: '{{code_fence_escape}}', captures: { 0: meta.code-fence.definition.end.html-php.markdown-gfm, 1: punctuation.definition.raw.code-fence.end.markdown }, pop: true }]
+        - scope:embedding.php
+      with_prototype:
+        - match: (?={{code_fence_escape}})
+          pop: true
     - match: |-
          (?x)
           {{fenced_code_block_start}}

For Java(Doc), the workaround suggested at https://github.com/trishume/syntect/issues/160#issuecomment-388378965 works.

For Markdown, the following syntax test is enough to see the problem:

| SYNTAX TEST "Packages/Markdown/Markdown.sublime-syntax"
```html+php
<div></div>
|^^^ entity.name.tag.block.any.html
<?php
|^^^^ punctuation.section.embedded.begin.php
var_dump(expression);
| ^^^^^^ support.function.var.php
```
|^^ punctuation.definition.raw.code-fence.end.markdown

Everything except the last line's assertions pass. Adding ?> before the closing backticks/code fence allows the backticks to be correctly recognized as the closing code fence instead of PHP interpolated string punctuation, but we will need a solution so that the way syntect handles embed actions will work 100% correctly in all situations.

Altering the PHP Source syntax definition can help to work around it, which (although obviously not a solution,) suggests that, for some reason, the with_prototype (rewritten from the embed in the Markdown syntax definition) is not being considered at this point (in both Sublime and syntect):

--- Shipped Packages/PHP/PHP Source.sublime-syntax  2018-10-11 19:11:54
+++ Packages/PHP/PHP Source.sublime-syntax  2018-10-30 13:36:58
@@ -109,6 +109,8 @@
         - include: statements
 
   expressions:
+    - match: (?=```)
+      pop: true
     - include: comments
     - match: (?i)\b((?:require|include)(?:_once)?)\b\s*
       captures:

keith-hall avatar Oct 30 '18 10:10 keith-hall

I guess the fundamental flaw with rewriting embed actions to the "equivalent" with_prototype construct is that embed is guaranteed to escape at the escape pattern, but with with_prototype, a match pattern could consume the characters first.

Simple example embed_vs_with_prototype.sublime-syntax:

%YAML 1.2
---
# See http://www.sublimetext.com/docs/3/syntax.html
scope: source.embed-vs-with_prototype
contexts:
  main:
    - match: embed
      scope: keyword.begin
      embed: contents
      escape: escape
      escape_captures:
        0: keyword.end
    - match: with_prototype
      scope: keyword.begin
      push: 
        - [{ meta_include_prototype: false }, { match: escape, scope: keyword.end, pop: true }]
        - contents
      with_prototype:
        - match: (?=escape)
          pop: true
  contents:
    - match: \w+
      scope: string

Test text:

embedtesting123escape
with_prototypetesting123escape

image

So unfortunately, I believe we will need to include native support for embed/escape in the parser.

keith-hall avatar Oct 31 '18 08:10 keith-hall

Makes sense! So escape needs to have precedence over other matches.

robinst avatar Oct 31 '18 08:10 robinst

Just a note for when we work on this, there is an open question for SublimeHQ to answer regarding whether the prototype context should be ignored in contexts embedded from the prototype. Currently (as at build 3180) Sublime Text's behavior is to apply the prototype.

keith-hall avatar Nov 22 '18 08:11 keith-hall

@keith-hall So let me get this straight, in this example:

embedtesting123escape

embed matches the embed, that's easy. But then the next match would be that \w+ matches all of testing123escape with scope string. But the embed escape also matches escape.

So in that case, the escape match takes precedence and the match for string is cut short.

I'm wondering how that's implemented (and how we would implement it). Does something like this make sense?:

  1. If we are currently in an embed, try matching the escape first. If that matches:
    • if the match is at our current position, pop out of the embed
    • else continue matching other contexts but limit it to text that comes before the match (basically truncate our input).
  2. Advance position to the next match and repeat

What happens if there are multiple active embeds that has matching escapes, which one wins?

robinst avatar Mar 28 '19 07:03 robinst

yes, that makes sense, and sounds like the way forward. For multiple embeds, I believe the same ordering as with_prototype is used - the embedded contexts are prioritized by the order in which were pushed onto the stack, an earlier embed's escape wins even if a later escape would match at an earlier index in the string, if that makes sense. It may be better illustrated with an example, I will try to come up with one when I get a chance :)

keith-hall avatar Mar 28 '19 10:03 keith-hall

I tried this example:

%YAML 1.2
---
# See http://www.sublimetext.com/docs/3/syntax.html
scope: source.nested-embed
contexts:
  main:
    - match: \(
      scope: paren.begin
      embed: a
      escape: \)
      escape_captures:
        0: paren.end
  a:
    - match: \w+
      scope: string.a
    - match: \<
      scope: bracket.begin
      embed: b
      escape: \>
      escape_captures:
        0: bracket.end
  b:
    - match: \w+
      scope: string.b

With the input ( < > ), the > is bracket.end and ) is paren.end.

With ( < ) >, ) is paren.end and ) doesn't have a scope. So yeah, the earlier embed's escape wins and pops out all the way. (I also did an example where I used the same pattern for both escapes, and it popped out all the way as well.)

robinst avatar Mar 29 '19 05:03 robinst

nice :+1: one other thing I wanted to check was how escapes are handled if one escape pattern matches inside another one... say, escape: me and escape: test-me with text test-me - I imagine the first embed should be popped as it should win with me. Sorry for not doing the due diligence myself or laying out the yaml for a proper test case, I'm on mobile atm. EDIT: actually, this is probably covered by your using the same pattern for both escapes.

keith-hall avatar Mar 29 '19 05:03 keith-hall

Just tested, me wins (if it's the earlier embed).

robinst avatar Mar 29 '19 05:03 robinst