congo-parser-generator Implement repetition cardinality assertions (Java-only currently); ...

…add recovery-point action code blocks; add cardinality and CICS examples; update bootstrap jar.

I did this as a pull request at this point to give @revusky and @vsajip a chance to comment/criticize/suggest before I merged it. The goal was originally to allow a notation and implementation for set-based choices, primarily power sets of choices. Once I got into it I realized that 1) I needed a bit more than that to solve the problem, and 2) it was not much harder to do something even more generally useful than my immediate problem. What it ended up as is the syntactic sugar (or maybe a whole cake) comprising what I decided to call "repetition cardinality assertions". In a nutshell they are special-purpose assertions (operating both at lookahead time and parse time) that ensure that the cardinality of a given cut in an expansion sequence is within specified bounds at the point no later than that when closest enclosing loop expansion exits. Both the need for this and a complete description and examples is included in the new "cardinality" example. An additional example is provided as the new "cics" example, which reflects the problem I set out to solve.

The result of all this is a capability for CongoCC that as far as I can tell is unique in the world of LL parser generators and, maybe, parser generators in general. I hope you agree.

I chose the "&" delimiter for no particular reason except that it was relatively available in the context and seemed to convey the function (with a stretch). After I implemented it, I realized that "%" might have been better, but I left it for your opinions. Personally, I don't think it matters and, if there is a better character to use, I'd be happy to change it.

The bulk of the changes are to the lookahead and parser_productions templates with a scattering of changes in other places. I've tested it very thoroughly, but if you play with it and find something let me know. I plan to try and make the necessary parallel changes to Python and CSharp after I have finished the CICS project and uncover any oversights in the design.

Apr 02 '25 23:04 adMartem

There's a lot of changes here - to help me understand what this change is for, perhaps you could post on the forum or a blog post somewhere as to what problem this is solving, and how these changes address that? Thanks.

Apr 03 '25 05:04 vsajip

Sure, I'll do that. In the meantime, below is the markdown I referred to in the initial comment that, I think, is maybe what you are looking for. In a nutshell, it is virtually impossible to use an LL parser to parse languages (most often command languages like CICS) due to the requirement to allow the power set of alternatives. While it can be done, it usually exponentially expands the number of rules and is, therefore, not done. It also is present in a few places in almost every language (for example, COBOL, SQL, Java, etc). In those cases it is usually handled by code actions rather than syntax.

This change allows a simple, easy way to specify a cut point in a sequence subordinate to a ZeroOrMore or OneOrMore with an ENSURE[ASSERT] that counts the number of times it is passed and prevents going past an upper bound (1 by default). When the loop exits, all counts within it are checked for a lower bound and it fails lookahead and/or causes an assertion error at parse time if any do not meet their respective minimum (0 by default).

Repetition Cardinality Assertions

1. Introduction

The repetition cardinality assertion is a mechanism provided by the CongoCC parser generator. It enables the constraining of the number of times a specific expansion may occur within a repeating grammar element. Thus, it constrains the behavior of OneOrMore (+) and ZeroOrMore (*) loops dynamically by allowing explicit occurrence limits to be optionally specified at points in the grammatical scope of the loop.

2. Why Cardinality Constraints?

Often programming languages impose restrictions on the number of syntactic elements of a certain type that may be used within a source construct. These restrictions can go beyond the simple "no more than one instance in any order" to cases requiring "exactly one of each ...", "no more than n ..." or "between n and m ..." occurrences. This introduces constraints that are usually difficult or tedious to enforce during top-down parsing. In particular, the solution to this often requires obscure semantic actions or overly complex grammatical expression.

A notable example of a language with a compelling need for this capability in a top-down parser is IBM's Customer Information Control System (CICS) embedded command language. CICS is an extensive language comprising hundreds of commands. With a single seemingly innocuous syntactic rule stating that "optional items may occur in any order," CICS seems not to be parsable by any of the common parser generators. Use of a typical recursive descent parser generator would would be expected to require tedious semantic actions and predicates within the grammar specification to enforce this kind of syntax restriction during parsing.

The following describes a feature of the CongoCC grammar supporting a type of constraining predicate: a repetition cardinality assertion, or RCA, notated generally as &m:n&, where m is the lower bound of cardinality and n is the upper bound of cardinality of the following sequence in the context of the nearest enclosing syntactic repetition. This sequence must be within a ZeroOrMore or OneOrMore expansion. The notation may also be abbreviated as &n (equivalent to &n:n&, meaning "exactly n"), &n:& (equivalent to &n:∞&), or simply &, which defaults to &0:1& (meaning "no more than one").

3. CongoCC Formal Syntax

In the CongoCC parser generator the syntax used to express repetition cardinality consists of an assertion with a semantic predicate specifying a special notation as follows (and itself using this notation):

( &"ASSERT" | &"ENSURE" )+ "{" Cardinality "}"
|
Cardinality

The syntax for repetition cardinality (Cardinality) can be expressed as:

"&" [ Min [ ":" [ Max ] ] "&" ]

where:

Min is the minimum required occurrences (optional; defaults to 0 if omitted).
Max is the maximum allowed occurrences (optional; defaults to Min if the Min alone is present or the largest integer value otherwise).
A lone "&" is interpreted as &0:1&, meaning the expansion may appear at most once.
If no constraint is specified, it is implicitly equivalent to &0:∞& (no upper limit).

If an ASSERT is used, the assertion will only be checked during parsing, and will not affect any lookahead. To enforce the assertion during lookahead in addition to parsing, ENSURE may be specified, or it may be followed by the # suffix consistent with other semantic assertions.

For convenience and readablity the assertion may also be abbreviated using the cardinality syntax alone at the point that the corresponding ASSERT and/or ENSURE may be used. When this is the case, it also implies the application of the constraint in both lookahead and parsing.

4. Formal Semantics

The repetition constraints are always applied during parsing of the affected repetition, and may be applied during lookahead, thus ensuring that parsing follows a consistent path.
Upon entering a repetition (...)* or (...)+ the parser will initialize a cardinality tally if the loop contains an RCA. Each unique RCA within the loop will have its own tally.
At the point in either lookahead or parsing that the parser encounters the RCA, it will attempt to increment the tally by one and the lookahead or parsing will continue if it is successful. If the tally has already reached the maximum indicated in the associated RCA, however, parsing or lookahead will fail. This process is repeated for each RCA encountered until the loop controlling them is exited.
Following exit of the loop, the parser will check each RCA's tally against the minimum cardinality of the RCA and, if met, will continue as is normally the case. If the minimum constraint is not met, the parsing or lookahead will fail at that point.

Examples of expansion behavior:

( &A | &B | &C )* allows any subset of sequences of A, B, and C, including the empty set. Specifically, {} (empty sequence), A, B, C, AB, AC, BA, CA, BC, CB, ABC, CBA, BAC, CAB, BCA, ACB.
- The parser will enter the loop containing either an A, B, or C after initializing three distinct tallies to the value 0.
- Each iteration of the loop that is attempted will increment the corresponding tally by 1, provided it is not already 1, and will parse the choice normally. If the tally cannot be incremented (i.e., it has already been incremented in this loop instance), parsing will fail with a parse exception (or the lookahead will deem the loop completed).
- During lookahead when no choice is available within the loop, either because the input does not match any choice or the input matches a choice but that choice's tally is 1, the loop exits.
- The parser does not need to check the minimum cardinality because it is 0 and permits omission of any choices.
- ( &1:2& A | &1& B )* ensures A appears at least once but no more than twice, and B appears exactly once. Possible sequences include {}, AB, BA, AAB, ABA, BAA.
  - The parser will enter the loop containing A or B after initializing both tallies to 0.
  - If an A is the next input token, the parser checks the first tally and, if it is not already 2, increments it. If it is already 2, the loop is terminated with an exception (or lookahead will deem the loop completed).
  - If a B is encountered, the parser checks the second tally and, if it is not already 1, increments it. If it is already 1, the loop is terminated with an exception (or lookahead will deem the loop completed).
  - Upon loop completion the parser will check the first tally and, if it is not at least 1, will indicate a parse exception (or if in lookahead, will deem the loop's lookahead scan to have failed). If the first tally meets the minimum check, the parser will then check the second tally in the same manner.

5. Examples of Usage

Basic Cases

( &Foo | &Bar )* → Equivalent to ( &0:1& Foo | Bar& )* → Equivalent to ( ENSURE ASSERT {&0:1&} Foo | Bar ENSURE ASSERT {&0:1&} =>|| )* .

This is an example of the simplest use of repetition cardinality to recognize the set of all combinations of at most one Foo and one Bar. Note that the placement of the repetition cardinality assertion usually does not affect the syntax recognized by the grammar, since recognition of the entire sequence is based on the truth of all the predicates and assertions in the sequence, but the position could affect semantic actions applied before the acceptance or rejection of the sequence by the RCA.

( &1& Foo | Bar )* → Equivalent to ( &1:1& Foo | Bar )*.

Constrained Elements Within Repetitions

( &1:2& A | B )* → A sequence of A and B containing at least one, but not more than two As, and any number of Bs, including none, or an empty sequence.
( &0:4& A | &B )* → A sequence of A and B containing no more than four As and one B, or an empty sequence.
( &2& Foo )* → A sequence of Foo appearing exactly twice, or nothing.
( & Bar )* → Bar or nothing. Equivalent to [Bar].

These constraints refine the parsing process while maintaining the expressiveness of choice-based repetitions.

Constrained Repetitions

( ( A | B ) &5&)+ → A sequence of As and Bs of length 5.
( &5:&( A | B ) )* → An optional sequence of 5 or more As and/or Bs.

6. Summary

Repetition cardinality provides precise control over element occurrences within repetition constructs and repetitions themselves in CongoCC. By applying identical lookahead and parsing constraints, it ensures predictable parsing behavior for cardinality-constrained syntax without semantic rules or confusingly complex syntax in grammar definitions.

Apr 05 '25 20:04 adMartem

And this is an example of a CongoCC rule using this feature from the cics example and a sample of its input. The RR diagram for this command can be found here: https://www.ibm.com/docs/en/cics-ts/6.x?topic=summary-read [Note: this is a first cut at this command, and as I reread the RR diagram and IBM's text I see that the following is probably not quite correct, but I think it is still illustrative of the general idea.]

Read :
    <READ>
    (
        &1&<FILE> "(" FileName ")"
        |
        &(
            <UNCOMMITTED> |
            <CONSISTENT> |
            <REPEATABLE> |
            <UPDATE> [ <_TOKEN> "(" DataArea ")" ]
        )
        |
        &1&(
            <INTO> "(" DataArea ")" |
            <SET> "(" PtrRef ")"
        )
        |
        &1&<RIDFLD> "(" DataArea ")"
        |
        &<KEYLENGTH> "(" DataValue ")" [ <GENERIC> ]
        |
        &(
            <SYSID> "(" SystemName ")" <LENGTH> "(" DataArea ")" |
            <LENGTH> "(" DataArea ")"
        )
        |
        &(
            <DEBKEY> |
            <DEBREC> |
            <RBA> |
            <RRN> |
            <XRBA>
        )
        |
        &( <EQUAL> | <GTEQ> )
        |
        &<NOSUSPEND>
        |
        (
            & @noHandle :=? <NOHANDLE>
            |
            & <RESP> "(" Name #ResponseName(1) ")"
            |
            & <RESP2> "(" Name #Response2Name(1) ")"
        ) #CommonOptions
    )+ =>||
;

Here is a typical input for this command:

READ 
DATASET (LIT-CARDXREFNAME-ACCT-PATH) 
RIDFLD (WS-CARD-RID-ACCT-ID-X) 
KEYLENGTH (LENGTH OF WS-CARD-RID-ACCT-ID-X) 
INTO (CARD-XREF-RECORD) 
LENGTH (LENGTH OF CARD-XREF-RECORD) 
RESP (WS-RESP-CD) 
RESP2 (WS-REAS-CD)

There are around 400 different commands in the CICS embedded language with varying degrees of complexity, so you see the problem.

Apr 05 '25 20:04 adMartem

Thanks, that all seems very comprehensive! I'll give that a go :slightly_smiling_face:

Apr 05 '25 21:04 vsajip

I'm sorry to be so slow to give any feedback. I'll try to have a look tomorrow. (Mañana.)

Apr 05 '25 21:04 revusky

No problem. Hasta mañana.

Apr 05 '25 22:04 adMartem

Hi Jon, I think I've updated this to reflect all the recent changes. If you get a chance to look at it, I would really like to merge it. My stuff makes extensive use of it, and I've done some other stuff on top of it, so I'd like to stop the dangerous rebasing/merging it (I almost missed a case today where VSC tried to incorrectly accept a change). If you do have time, please look particularly at how I pass the additional cardinality arguments in the lookahead and production templates. They are designed to be deeply available and passed through to called macros even when they are not needed by the calling macro, but I'm not sure I completely grokked, for this use case, the "best practice" for Freemarker, and especially with your intention in the congo templates. This pass-thru implementation is really what I'm not sure is optimally coded, as I want the actual argument to be transparently passed down to called macros as arguments.

This took quite a while to get right, and I still have some improvements and further thoughts about the implementation, but I would like to get this merged and then consider Python and CS (mostly the two templates) so they don't drift out of sync. It provides a unique and useful capability to Congo, I think.

-john

Jun 11 '25 20:06 adMartem

I think the failing tests at this point are the same as the ones in the main branch.

Jun 13 '25 15:06 adMartem

I think the failing tests at this point are the same as the ones in the main branch.

It's just the non-Java generation tests that are failing. I think the failures are somehow spurious. (Vinay?)

Jun 13 '25 16:06 revusky

I'm not sure they're spurious, but I'll take a look next week.

Jun 13 '25 22:06 vsajip

I'll just wait until the main branch passes all the tests before trying to sync this PR again. Then I'll probably squish it some to eliminate all the merges of main and reconciliations.

Jun 18 '25 15:06 adMartem

I'll just wait until the main branch passes all the tests before trying to sync this PR again. Then I'll probably squish it some to eliminate all the merges of main and reconciliations.

Actually, I'm not sure there is any need for you to hold back. Maybe it's just better to commit and let somebody (maybe VS) sort out the problem with the tests.

Jun 18 '25 17:06 revusky