vscode-textmate icon indicating copy to clipboard operation
vscode-textmate copied to clipboard

Option to match end before any patterns

Open KapitanOczywisty opened this issue 3 years ago • 8 comments

There is persistent problem with injected languages where unclosed parentheses or quotes eat whole document.

I'm proposing to add option e.g. applyEndPatternFirst. This would check such end pattern at every line before anything else. I've looked at code a bit and it seems possible, but I'm not familiar enough with engine to correctly implement this (at least without trashing performance).

Details:

  • On end match - everything before end would be extracted and checked again with sub-patterns and merged with end's tokens, while rest of line would be checked normally. Without match - business as usual.
  • When multiple end patterns with this option will be present, they should be stored and checked from first to last, at every line until any of them matches. There shouldn't be too many of them, since this should be used mainly with injected languages.
  • Successful match will terminate all descendant rules, after matching "before end" part of line.
  • end with option should also match in line where before was found.

Why not use while? While while (sic) is great addition, it can only exclude whole line. Proposed option would allow stopping in mid-line.

KapitanOczywisty avatar Sep 02 '20 16:09 KapitanOczywisty

@KapitanOczywisty I think this makes sense, but vscode-textmate has as a goal to remain compliant with TM grammars. IMHO this proposal would make it that vscode-textmate diverges from TM and might lead to the creation of grammars that are not portable. I wonder what the exact problem you are running into is and how has it been addressed in grammars such as html, where the closing of </script> tag can occur in a JS comment, statement, etc.

alexdima avatar Sep 02 '20 20:09 alexdima

@alexdima If properly used, this option will only prevent some nasty injection bugs. No support for this option shouldn't cause grammar to break (well... not more than already is). As always there might be someone who uses this in highly improper way, but even now there are slight differences in implementations, e.g. php with sql injections breaks differently in atom and vscode. What is worth to mention, we're importing php grammar from atom, which don't have while support. I suspect that some of vscode's grammars are not "portable" already, at least to the atom.

Fwiw there is nothing stopping other implementations to add this feature.

About problem: There are many injection bugs atm in different languages, e.g. countless issues about embedded SQL in PHP.

Even mentioned </script> is not addressed: image

What sometimes makes funny disagreements between LSP and grammar: image

We can fix some of these issues making many duplicated rules, but there are issues which could be fixed only with such feature. Also atom's language-php in vscode uses different grammar for sql, what makes fixes even more fun.

KapitanOczywisty avatar Sep 02 '20 22:09 KapitanOczywisty

@KapitanOczywisty Grammars (outside of injection) already do this by default, the end pattern is applied first by default and it is possible to configure in the grammar that the end pattern is applied last:

https://github.com/microsoft/vscode-textmate/blob/ebcfe99ce2d262165c53e44db71575f37750c070/src/rule.ts#L535-L555

Otherwise, if this is about injection, then is it that both the injection and the rule matches? But it looks to me like we already handle that case with some scoring:

https://github.com/microsoft/vscode-textmate/blob/ebcfe99ce2d262165c53e44db71575f37750c070/src/grammar.ts#L769-L777

More likely, what I believe is the problem is that a certain rule misses an injection for the end embedded language pattern or that certain rules are written in a way where they cannot be injected (greedily eating up text and not giving a chance to the embedded language end regex to match). But IMHO this cannot be tackled with some new flag, this is a technical flaw of TM.

alexdima avatar Sep 04 '20 09:09 alexdima

@alexdima this is true as long as every rule is properly closed, problems starts when we have parentheses and quotes.

Take this example (not the best practice, but easy to encounter in the wild):

<?php
$query = "SELECT * FROM `table`
WHERE `id` IN ('". implode("','", $ids) ."')";

implode is eaten as part of sql image

In vscode parentheses in sql seems to not be tokenized so removing quotes fixes issue image

But in atom sql tokenizes parentheses and we have this beauty image

There were attempts to fix this, but this made even bigger mess.

Flag will not fix everything, but will allow stopping sql tokenization at IN ('". TM is flawed, but can still be a bit better :wink:

Alternatively If there would be any way to alter tokens returned from TM, without separate tokenization in extension, I could write semanticTokenProvider to inject SQL tokens afterwards. Is there some callback like that already?

KapitanOczywisty avatar Sep 04 '20 10:09 KapitanOczywisty

AFAIK, the "TM way" to fix this is to have a language called "sqlindoublequotes". Then a grammar is written from scratch or adapted such that this language never eats by accident ". The outer language, like php in this case would then inject the " end rule in all the rules of the "sqlindoublequotes".

I still don't understand your proposal though, since at a point the tokenizer reaches the position at '". If the sql language has a regex like /'[^']+'/, then that will match and consume the ". The end rule injected by php has no chance to intercept anything because the " is entirely skipped. The solution is to not have a regex like /'[^']+'/ in SQL, IMHO either by having a begin/end rule for strings where each character is looked at individually to which php can inject, either by having this language aware of its embedder that does not eagerly consume ".

alexdima avatar Sep 04 '20 10:09 alexdima

@alexdima In my proposal end with flag is checked at every line before "normal" rules. On positive match string is trimmed to the found position for second pass with "normal" rules, and then whole block is closed. Descendant rules cannot eat " because it isn't passed to them anymore.

They tried to fix that way (sqlindoublequotes), but this failed as you can see. PHP has 4 variants of writing strings:

<?php
// double quoted
$query = "SELECT * FROM `{$table["name"]}`";
// single quoted (no interpolation)
$query = 'SELECT * FROM `'. $table["name"] .'`';
// heredoc
$query = <<<SQL
SELECT * FROM `{$table["name"]}`
SQL;
// nowdoc (no interpolation)
$query = <<<'SQL'
SELECT * FROM `table`
SQL;

Writing all variants is quite an overhead. It's possible, but also time consuming. I mean I'll do it if there is no other way, but I'd rather not. And there are similar issues in other languages.

Edit: In general any option for second pass would be a huge change, even if this would be callback in JS.

KapitanOczywisty avatar Sep 04 '20 11:09 KapitanOczywisty

In my proposal end with flag is checked at every line before "normal" rules. On positive match string is trimmed to the found position for second pass with "normal" rules, and then whole block is closed.

That makes sense. Then I am sorry, I misunderstood your proposal. The description applyEndPatternFirst made me think this is about the order of the regexes or their priority. This is something completely new to TM, exactly like we do in Monarch.

alexdima avatar Sep 04 '20 12:09 alexdima

Any progress on this matter? I'm having similar problems embedding markdown in my language. It works fine if the end pattern is on a separate line from the "markdown content" but as soon as it's on the same line the markdown eats it and takes over the entire document.

var a = md"""
This is some md that works
"""

var b = md"""
This doesn't work"""

# This comment is now highlighted as if it was a markdown header

HugoGranstrom avatar Jun 04 '21 09:06 HugoGranstrom