highlight.js Discuss: Sequences

Is your request related to a specific problem you're having?

I don't have time to go back and cite all the examples, but see a lot of existing grammars and see our lengthily discussions on sequencing in the LaTex thread. The problem is when want to match a specific sequence of modes (which all may be complex in and of themselves). A made up BNF like example:

<list>           ::= *special <term> <term> <opt-whitespace> <list>

But now imagine that instead of just <tag> that these might be more HTML like constructs, with attributes, strings, special characters, etc... so 4 modes start to appear:

key tag (<list ...>)
assignment (::=)
modifiers (*special)
then multiple rule tags

The solution you'd prefer / feature you'd like to see added...

I proposing adding a sequence to sit beside contains. While contains is orderless, sequence would be sequential. I'm simplifying the rules below so you can see the bigger picture, but each rule could possibly have it's own begin, end, contains, etc... For the first pass I might restrict some things like starts, endsParent, endsWithParent just to shrink the problem space... and then find out later if those things are truly needed... but otherwise these would be full modes in their own right and have the same output/processing behavior as other modes when they are active.

// rule
{
    begin: /<.*>/,
    sequence: [
        { match: /<.*>/ },
        { match: /::=/ },
        { match: /\*special/, optional: true },
        { match: /<.*>/, multiple: true }
    ],
    end: mode.MATCH_NOTHING_RE,
    illegal: /\S/,
}

begin could be optional and if so it could borrow the begin from the first match in the sequence. The first match would obviously then need to be mandatory (not optional).
end by default would attempt to "immediate terminate" the sequence (as it does with contains) - though it's worth discussing if this is the correct default for sequences and how we might change this without being inconsistent
- this behavior could be changed just by setting and end rule or using MATCH_NOTHING_RE if there is truly no end rule
if one wants to mandate a full sequence illegal could be used (such as using illegal above to flag non-spaces). This would cause an illegal error to be thrown for incomplete sequences.

worth discussing if this is the correct default for sequences

It feels like specifying end above is a bit annoying... but if we did not then any space we encountered between the tags we care about would cause the sequence to terminate. This type of scenario (spaces or non-content in between things care about) is common enough that we should try to find a nice way to handle it without needing tons of additional modes for whitespace. Having separate regex for whitespace is already really bugging me with our new multi-match support.

How would it work in practice

Once a mode with sequence was entered the parser would go into a sequential mode loop:

loop
- look for current item in sequence (or end or illegal)
- if illegal found, raise; if end found, end mode
- when item found, start that mode
  - if item singular, increment position in sequence
- when item not found and optional (or multiple and already matched once), increment position
- if no more items, end mode

This is overly simplified of course... since if you had 2 optionals for example the parser would be using a multi-regex that was scanning for the next 3 items in the sequence (plus begin and illegal)... since the 3rd rule would be elgible to match at any position due to the first 2 rules being optionals.

This adds complexity but I'm not sure a (one item after another, no repeats, all mandatory) solves a lot of real problems... I remember LaTex definitely had optionals and such things.

Any alternative solutions you considered...

More sugar on top of our existing starts stuff, but I find it all quite convoluted to think and reason about... and there is very hard to understand behavior with endsParent and starts. Since changing starts would likely break a bunch of existing grammars I think we need some real new behavior rather than just sugar.

Sugar also becomes incredibly hard to debug since the end user sees only the sugar and doesn't understand the potentially incredibly complex rules being generated behind the scenes. So far I've tried to keep our sugar minimal and doing simple things.

There are other ways of writing it syntactically, such as reusing contains but have a flag to say it's a sequence. I think I like this els off the top of my head though.

contains: [ ... ],
sequential: true,

Just to name one example.

Additional context...

Note that no where are we talking about branching or back-tracking. This is not being discussed. Sequences either complete (after matching every item), terminate early with an incomplete sequence (the end matcher is triggered) or raise an error (they hit an illegal). Illegal is how you would specify the "the full sequence is required".

Once we find the start of a sequence, we are committed to that sequence. This is not tackling problems like "It might be sequence X or sequence Y" - which would require backtracking. For some grammars these situations might be handler with creative use of optionals and multiples.

Also, in simple cases a strong begin regex with a look-ahead could help making sure the right sequence was selected.

Apr 17 '21 12:04 joshgoebel

And now I'm wondering if all this couldn't be done with a small amount of glue and tiny change to ResumableMultiRegex to allow the ranges of regex eligible for matching at any given time to be scoped. We currently already allows this for (x...) (to support skipping matches) but not (a...b). IE the sequence below would be:

  sequence: [
        { match: /<.*>/ }, // mm.scope(0,0)
        { match: /::=/ }, // mm.scope(1,1)
        { match: /\*special/, optional: true }, // mm.scope(2,3) - can match rule 2 OR rule 3
        { match: /<.*>/, multiple: true } // mm.scope(3,3)

IE, the exact same behavior as optional but just with the choices scoped. Every iteration of the loop when inside a sequence mode would merely need to adjust the regex scope before executing a match.

End and illegal fall "outside" the scope, which is a small complexity of course.

Apr 17 '21 13:04 joshgoebel

@schtandard How much would this help with LaTex? Would this solve all the problems we were having there?

Apr 18 '21 15:04 joshgoebel

I think there is a potential problem with optionals where end is holding the mode open. At first I thought that optionals could just all be include in the multi-regex scope, but then I realized that's only true if you want the "first optional you can find"... lets take:

sequence:
{
  { match: /aaa/ },
  { match: /ccc/, optional: true },
  { match: /\w+/ },
  // assume end is holding the mode open
}

How does this match:

"aaabccc"

Is it:

["aaa", "ccc"] (the first two succeed, the last fails)
["aaa", "bccc"] (the first and last rule succeed, the middle rule is skipped)

In my first pass the latter is the result because I went with a "find the first in scope rule that matches" (much like contains works). So we see that ccc is optional so we start a search for EITHER \w+ or ccc, and we find \w+ first...

The alternative would be to start a separate search for ccc and then once that fails start the search for \w+, that would result in the first solution given above.

I'm not convinced the current behavior is correct. I'm pretty sure I'd rather not build/support both approaches if it can be avoided.

Apr 19 '21 17:04 joshgoebel

I'm thinking it should be separate - ie if you want "could be multiple things" you use contains and if you want STRICT sequential, you use sequence, and we do not muddle the two. So I'm thinking the correct behavior would be:

["aaa", "ccc"]

ccc while optional is still part of the sequence so if it's possible to find it in order, we will.

Apr 19 '21 17:04 joshgoebel

@schtandard How much would this help with LaTex? Would this solve all the problems we were having there?

A lot and yes! I believe this would make everything I wanted to do (and am still planning to get around to one day) much easier both to write and to comprehend afterwards.

I can't fully follow the multi-regex stuff and am guessing that this is the reason for my confusion on the following comments:

Note that no where are we talking about branching or back-tracking.

ccc while optional is still part of the sequence so if it's possible to find it in order, we will.

How can you decide if ccc does appear without searching the whole input before deciding to move on to the next mode (i.e. back-tracking)? Or do you just mean that you do not implement any back-tracking in addition to what the regex-engine already does?

Another thought: What happens for the sequence

{
  sequence: [
    { match: /a/ },
    { match: /b/, optional: true },
    { match: /c/ }
  ],
  end: /d/
}

with the inputs acbd... or acdb...? My intuition from the sequence rule would be that we want to match c in both cases and b in neither, but that could not be done by just looking for the b "in order", could it?

Apr 19 '21 21:04 schtandard

I can't fully follow the multi-regex stuff and am guessing that this is the reason for my confusion on the following comments:

It's just a lot of glue/magic around contains_regexes.join("|")... so that our parser can scan the input a single time for ALL contains in the current mode, know which regex matched, resume matching at a given point while ignoring some regexes (if a callback asks the match to be ignored, etc). Without this we'd have to scan the entire text for every rule sequentially, which is much slower.

How can you decide if ccc does appear without searching the whole input before deciding to move on to the next mode (i.e. back-tracking)? Or do you just mean that you do not implement any back-tracking in addition to what the regex-engine already does?

Right. Searching ahead for a token doesn't require any backtracking at the parser level. We just match /ccc/ and if it returns null, then we know there are no matches, so we can terminate or move on to the next thing to match. When I say no back-tracking I mean that once we start a sequence we're committed... we can't get to the 5th item and go "crap this isn't what we wanted, rewind the whole stack, go back 1000 characters and try to match something else entirely different"... if a sequence "fails" it either consumes all the input or terminates immediately (depending on end). This is the same behavior as contains.

My intuition from the sequence rule would be that we want to match c in both cases and b in neither

Why? Ah I see... it gets more confusing the more specific the matchers...

but that could not be done by just looking for the b "in order", could it?

Correct, this could not be achieved. hmmmm

Apr 19 '21 21:04 joshgoebel

What happens for the sequence

Can you come up with a more real life example of this same problem say perhaps from LaTex domain...? That might be helpful to aid in the discussion.

Apr 19 '21 21:04 joshgoebel

acdb...

This would of course terminate when it hit the d, since it matches end. end is constantly scanned for, just like with contains. So in this case the b is entirely out of scope either way. IE:

The a would be the only part of the sequence matched. It finds d while searching for b.

Apr 19 '21 22:04 joshgoebel

I think I'm entirely reversing my earlier thought now, LOL. Forgetting the exactly content... if we scan for items 2 and 3 concurrently (because 2 is optional) and we FIND 3 before coming across an instance of 2... certainly that means that 2 was skipped and the sequence continues with 3?

But what about?

{
  sequence: [
    { match: /a/ },
    { match: /b/, optional: true },
    { match: /c/, optional: true }
    { match: /d/, optional: true }
    { match: /e/ },
  ],
  end: /f/
}

Does the same hold true? First found wins? So adcbef would match:

a, d and e, ending on f
the cb content is ignored

Apr 19 '21 22:04 joshgoebel

I don’t know, but if the value of match is always a regex, why do you have to be bothered on adding an API to describe optional, case-insensitive and such?

sequence: [
    { match: /(a)(b?)(c?)(d?)(e)(f)$/ }
]

Apr 19 '21 22:04 taufik-nurrohman

but if the value of match is always a regex,

It's not just regex (they just make the examples easier to read), these can be full modes with 100s of submodes, etc... If it was just regex then someone would use the new multi-match support we just added... this is needed for much more complex rulesets.

sequence: [ PREAMBLE, HEADERS, YAML_FRONT_MATTER, DOCUMENT_BODY ]

Apr 19 '21 23:04 joshgoebel

Can you come up with a more real life example of this same problem say perhaps from LaTex domain...? That might be helpful to aid in the discussion.

All optional arguments in LaTeX work like this, for example

\section[short title]{long title} This is some text [containing brackets].

Here [short title] is optional and in

\section{title} This is some text [containing brackets].

the fact that \section is not directly (excepting most whitespace) followed by [ signals that the optional argument is not present. In particular, matching [containing brackets] as belonging to \section would be very wrong.

certainly that means that 2 was skipped and the sequence continues with 3?

I agree.

But what about?

{
  sequence: [
    { match: /a/ },
    { match: /b/, optional: true },
    { match: /c/, optional: true }
    { match: /d/, optional: true }
    { match: /e/ },
  ],
  end: /f/
}

Does the same hold true? First found wins? So adcbef would match:

* `a`, `d` and `e`, ending on `f`

* the `cb` content is ignored

Yes. At least from my standpoint (i.e. what do I need for LaTeX), that's exactly what it should mean (the cb would simply be out of place there).

Apr 20 '21 18:04 schtandard

highlight.js highlight.js copied to clipboard

Discuss: Sequences

highlight.js
highlight.js copied to clipboard