grmtools
grmtools copied to clipboard
Permit stack operations on start conditions
In #318, start state logic was added for start states defined by name. In the POSIX lex standard, start states can be used by numeric id
Q: Should this include support for expanding the target start state logic to support increment and decrementing the current start state, as well as setting to an explicit target?
Regarding your Q, I'd perhaps just go with the former for now, I'm not quite sure if just incrementing/decrementing a counter for the start state will suffice, I think that a stack (possibly with a counter to reduce depth?) might be necessary if there is a need to enter the same state multiple times, and only exiting the state after the same number of state exits.
Once you do that over a pair of states, a simple counter wouldn't suffice but a stack of them would. Having said that, all the cases in which I've wanted this manipulation of start state logic via operators, I've since realized I would also need custom lex action code above and beyond operators for target state manipulation.
I'm curious though, if you have any cases where operators for manipulating target states are sufficient?
The classic one is nested comments
Ahh yes, I had forgot about that I had been thinking about rusts raw strings like r##"r#""""#"## where the number of # characters on the outside affects the inner match, it was these that i've had trouble doing without actions as well as some form of start state manipulation.
Misunderstanding of the spec - can't see anything in the spec to permit numeric start conditions.
Remaining: stack of start conditions (push + pop operations)
I'm considering a +/- as prefixes before the start condition name in the rule action, e.g.
%x bracket brace
%%
<*>{ <+brace>'OPEN BRACE'
<brace>} <-brace>'CLOSE BRACE'
<*>[ <+bracket>'OPEN BRACKET'
<bracket>] <-bracket>'CLOSE BRACKET'
@ratmice any concerns in this?
Sounds good, and it seems like the obvious choice i would go with as well.
The only thing I wonder about is how to handle states that apply to multiple conditions, if there is a need for a variable indicating the current state. Something to the effect of:
<StateA, StateB>] <-$CurrentState>'End State A or B'
Technically, the pop will always pop the top of the stack - so could just be <-> - thought it made sense to be consistent with push, but possibly consider everything until the close > as unprocessed text (effectively a comment).
I'd say if the user needed to use current state (e.g. in the rule name), they would be better placed to create multiple rules, rather than increasing the complexity, with state capturing variables.
Yeah, I was just considering that possibility as well, and it seems better than what I initially suggested.
But I'm definitely curious what @ltratt thinks, there is a certain 'may appear to be doing something that it is not' quality to <-brace>. Which I suppose could be a concern.
But leaving it as effectively a comment, seems nice since it would allow that <-end_current_state> is much nicer than just <-> .
I don't know this feature well (at all...) but I guess that "+X" can (in the stack of these things) only be followed by "-X" at the same level. In other words, a valid sequence is +X+Y+Z-Z-Y-X. In that sense the "+" and "-" are effectively comments? Are they part of lex's normal spec?
Looks like (another) flex extension (like exclusive start conditions) [c.f. http://dinosaur.compilertools.net/flex/manpage.html]:
void yy_push_state(int new_state)
pushes the current start condition onto the top of
the start condition stack and switches to new_state
as though you had used BEGIN new_state (recall that
start condition names are also integers).
void yy_pop_state()
pops the top of the stack and switches to it via
BEGIN.
So, you push a state - start conditions are represented as named constants, so via: yy_push_state({state name})
Then, you pop a state using yy_pop_state() which pops whatever is on the top off, and makes it the current state.
Thus, there is a required parameter for push which is the state to push (and hence change to), but pop takes no parameter as the new state is the popped one, but I felt that it feels less weird to permit an unused parameter to pop, so it's symmetrical with push
I'm starting to wonder if we need to do for lrlex what we did for lrpar, that is use LexKind to say "here's the format of the input file" so that we can differentiate "lex" vs. "flex" (vs. "whatever") inputs? https://softdevteam.github.io/grmtools/master/book/yacccompatibility.html
How was it implemented in lrpar - guards / switches on features, or separated implementations? I'd be prepared to do some initial work on adding a LexKind for feature segregation.
Guards/switches on features. That may or may not have been the best decision!
So, I believe this one can finally be closed. As the implementation has been in for a while, and the errors are now all wired up to the builders.
I'm still confused about how to use start states in lrlex, as compared to lex/flex. What if I don't want to push or pop but just want to go to a state. The grmtools book doesn't even have documentation about lrlex start states except saying that they are a feature. I had to dig and find one example directory (the one with the nested comments) to even see examples of the syntax and guess what the semantics are.
I must admit that I haven't used start states, so it's difficult for me to write "deep" documentation on them. I keep hoping that someone else will chip in with it!
I'll give a go at a rough draft here, though admittedly i'm pretty terrible at writing, and working off of memory .
When we introduce states to lex the first thing that needs to be explained is the INITIAL state, previously we weren't specifying any states at all. Implicitly lex rules without a state specified are given the INITIAL state, as their entrance state and their exit state.
<EntranceState> RULE <[OP]ExitState> note: [OP] here not being part of the syntax.
There is one other operation besides push & pop, which is ReplaceStack, which might be what you want if you just want to enter a state.
- The
Pushstarts with+for[OP]as in<+ExitState> Popis-as in<-ExitState>Replaceis just<ExitState>
When lex now looks for which rules apply to an input character, first it looks at the top of the start conditions stack, and finds rules which apply to the Entrance State. Rules can have multiple comma separated Entrance states. So an entrance condition of <INITIAL, Foo> RULE can be entered from either the INITIAL state, or the Foo state.
When when the rule is then applied it performs an operation on the stack corresponding to the ExitState operation.
Thus if your stack looks like [Initial, Foo], and you match a rule <Initial, Foo> .* <+Bar> the new stack will invoke the push operator, subsequently producing a stack of [INITIAL, Foo, Bar] upon further input, matching rules which have an entrance condition of Bar would be applied. The .* rule would no longer match with the stack as it is, because it doesn't have the Bar state as an entrance state.
If the op was instead a <-Foo> op, the Foo state would be popped and upon exit we would be back in the INITIAL state where our .* rule would again be matchable.
So, Start conditions allow you to declare your lex rules in a way in which there is more than a single flat list of rules which can be applied to all input. Now as rules are matched, the sets of applicable rules can change as we modify the stack.
Edit (adding stuff about comments example):
If we consider a flat c-style comment rule:
/\*.*\*/ "Comment" given input text with a nested comment such as: /*first /* second */ */. The rule will match up to the first */.
At which point it has already consumed the second /* opening the second comment.
With start conditions we can have multiple COMMENT states on the stack, so when the second COMMENT is entered
we know we need to exit the COMMENT state by twice popping off the stack.
%x COMMENT
%%
. "TEXT"
<COMMENT,INITIAL>/\* <+COMMENT>;
<COMMENT>. ;
<COMMENT>\n ;
<COMMENT>\*/ <-COMMENT>;
Let me know if there is anything which can be clarified or is needlessly redundant. I need to do some double checking about default/implicit things work exactly as I described but should be roughly the behavior (I'm confused a bit about the default when no operation is given if it is really REPLACE or if there is an implicit NOP in the code which isn't considered as a member of the struct StartStateOperation. Also I should probably look at lex docs for entrance/exit terminology used there.
Thanks @ratmice ! @FranklinChen does that help -- are there things you would still need to know to use this feature?
I'm still confused about how to return a token when states are involved, because the example above does not illustrate that situation. For example, what if I want to enter a special mode after finding an asterisk, but want to return an "ASTERISK" token? E.g. does something like this work:
<INITIAL> \* <ASTERISK_MODE> "ASTERISK"
<INITIAL> \* <ASTERISK_MODE> "ASTERISK"
Yes, you can return a token exactly like that. Thanks for pointing out this omission both in what I wrote, and the existing example! There is an test case in the testsuite which shows it it though. https://github.com/softdevteam/grmtools/blob/294c466c60f86c948a6e83f968c53f5af9787e59/lrlex/src/lib/parser.rs#L1399
Edit:
I've been struggling to come up with a realistic example to use for this, if anyone has any ideas. Perhaps adding doc comments to the comment example?
The other example I keep thinking of is rust 'raw string literals',
but I believe those also require some amount of stack introspection to know the number of times the # state in r##"#"## has been entered which is not something we currently support or is easily added.
So if anyone does have thoughts on examples, or more use cases I'd appreciate hearing about them!
I finally figured out what I did wrong. There can't be spaces between the start state and the following material!
<INITIAL> \* <ASTERISK_MODE> "ASTERISK"
did not work but
<INITIAL>\* <ASTERISK_MODE>"ASTERISK"
did!! I missed this.
@FranklinChen Thanks for reporting back!
@ratmice I wonder if that's intentional on our part or not? I don't know the bison syntax for start states well enough to know.
@ltratt I believe it is intentional, sorry I didn't catch it earlier! Flex doesn't accept spaces between the start state, and the regular expression. It does accept leading initial spaces, but that doesn't do what one might think. It copies such lines verbatim into the generated sources.
Any such input (beginning with a <blank> or within "%{" and "%}" delimiter lines) appearing at the beginning of the Rules section before any rules are specified shall be written to lex.yy.c
The Posix lex spec doesn't say anything that would allow a space between the start state and the regular expression.
lex source files are a table in which the left column contains regular expressions
ERE action
The extended regular expression (ERE) portion of a row shall be separated from action by one or more
characters. A regular expression containing characters shall be recognized under one of the following conditions:
- The entire expression appears within double-quotes.
- The
characters appear within double-quotes or square brackets. - Each
is preceded by a character. <state>r, <state1,state2,...>r The regular expression r shall be matched only when the program is in one of the start conditions indicated by state, state1, and so on; see Actions in lex. (As an exception to the typographical conventions of the rest of this volume of POSIX.1-2017, in this case
does not represent a metavariable, but the literal angle-bracket characters surrounding a symbol.) The start condition shall be recognized as such only at the beginning of a regular expression.
So I think because it can't be at the beginning of a regex normally, it can't be at the beginning of a regex including the optional start state.
It does accept leading initial spaces, but that doesn't do what one might think. It copies such lines verbatim into the generated sources.
I didn't expect that! Do we think that's an actually useful feature? I am inclined to think at the moment it's not because we'd have to think very carefully about what guarantees we'd make to the user. Perhaps the best thing is to explicitly error with whitespace at the start of the line and say "Not yet supported"?
I have only ever seen it used for embedding comments (I put some stuff in the relevant bug report). I had suggested one option there is that we ignore these lines entirely ( which is the easiest thing to do), we could expand it then to emit into the generated sources, or document it in the differences to lex? One thing we could consider making it a pedantic warning which could be toggled on/off, buecause I do agree it is suprising, but unfortunately I also think it is useful.
I'm not a big fan of ignoring such lines, because I guess that some people expect this to interact with lex code? Assuming that's correct, my instinct is to make it a hard warning for now, which then gives us however much time we want to work out if we want to do anything more clever than that.
@ltratt I realized that part of that first quoted text was being interpreted as markdown, reproduced it again below because it was omitting crucial <blank> so it didn't make much sense in context, anyhow working on a patch to produce an error:
Any such input (beginning with a <blank> or within "%{" and "%}" delimiter lines) appearing at the beginning of the Rules section before any rules are specified shall be written to lex.yy.c
I'm not a big fan of ignoring such lines, because I guess that some people expect this to interact with lex code?
Found another section relating to the above Rationale section says:
The reason for the undefined condition associated with text beginning with a
or within "%{" and "%}" delimiter lines appearing in the Rules section is historical practice. Both the BSD and System V lex copy the indented (or enclosed) input in the Rules section (except at the beginning) to unreachable areas of the yylex() function (the code is written directly after a break statement). In some cases, the System V lex generates an error message or a syntax error, depending on the form of indented input.
So it seems that both System-V and BSD lex copy these into unreachable areas of the source text, and thus is no longer expected to interact with the generated lex code.