re2c Generate `goto end_of_block` after user-defined actions.

Hi, I wanted to write a parser for the HTTP protocol using re2c. I wrote it in the following way:

/*!re2c
    * { printf("Error!\n"); return; }
    "GET" { printf("GET\n"); }
    "POST" { printf("POST\n"); }
*/

/*!re2c
    * { printf("Error!\n"); return; }
    " " {}
*/

char* start = YYCURSOR;
/*!re2c
    * { printf("Error!\n"); return; }
    [^ ]+ { printf("URL %s\n", substr(start, YYCURSOR)); }
*/

I assumed that once a rule in one re2c block matched and its user code ran, execution would continue from the end of that block. The code would then check the request type first, consume the space, read the URL, and so on. Unfortunately, re2c does not generate anything like "goto end_of_block;" after the inserted user code, which leads to undefined behavior. After I modified my code to jump explicitly to the next block, it started working properly. Please either fix this or state in the documentation that re2c works this way (and please add an example).
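
To make the fall-through concrete, here is a minimal hand-written sketch of the label layout, not actual re2c output. It assumes a simplified block with two one-character rules ("G" and "P") plus a default rule. The names lex_block, explicit_jump, log, and the yy_* labels are all made up for illustration.

```c
#include <string.h>

/* Hand-written imitation of the label layout for a block with the rules
 * "G", "P" and a default rule. Not generated by re2c.
 * `log` records which actions ran (caller provides at least 4 bytes);
 * `explicit_jump` switches the workaround on or off.
 * Returns the number of characters consumed, or -1 on error. */
static int lex_block(const char *input, int explicit_jump, char *log)
{
    const char *cursor = input;     /* plays the role of YYCURSOR */

    switch (*cursor) {
    case 'G': goto yy_get;
    case 'P': goto yy_post;
    default:  goto yy_error;
    }
yy_error:
    ++cursor;
    { strcat(log, "E"); return -1; }    /* action ends with return */
yy_get:
    ++cursor;
    { strcat(log, "G"); if (explicit_jump) goto block_end; }
    /* without the jump, control falls into the code of the next label */
yy_post:
    ++cursor;
    { strcat(log, "P"); if (explicit_jump) goto block_end; }
block_end:
    return (int)(cursor - input);
}
```

With the input "G /" and explicit_jump set to 0, both actions run (log becomes "GP") and two characters are consumed. With explicit_jump set to 1, only the "G" action runs and one character is consumed.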

Jun 23 '16 07:06 sirzooro

The goto end_of_block feature you suggested seems to be backwards compatible and very simple to implement. However, since re2c knows nothing about the action code, it would append this goto end_of_block after every action, which means dead code in most cases. The compiler will probably throw that away, but I'd better test this with various compilers.

The manual has to be updated no matter what (its core was written long ago). There is an example with multiple re2c blocks at http://re2c.org/examples/example_04.html (it doesn't emphasize your issue, though).

Jun 23 '16 08:06 skvadrik

By the way, I'm currently working on adding submatch extraction, which will allow you to rewrite your code like this (@p in the action code denotes a pointer to the input that corresponds to the position of @p in the regexp):

/*!re2c
    * { printf("Error!\n"); return; }
    @a ("GET" | "POST") @b " " @c [^ ]+ {
        printf("%.*s: %.*s\n", @b - @a, @a, YYCURSOR - @c, @c);
        return;
    }
*/

So that you won't need multiple blocks for simple things.

Jun 23 '16 08:06 skvadrik

Thanks, I will try this new option when the new re2c version is released.

BTW, this issue also affects code that uses only a single block; it may not work properly either. In fact, I had the problem in my first re2c block.

Jun 23 '16 15:06 sirzooro

Bug #151 is related (add goto condition after user-defined action when => condition is used).

Jun 28 '16 21:06 skvadrik

After experimenting with the idea I realized that fixing this bug would break backwards compatibility: when jumping to the end of the block, it is necessary to roll back the input position to where it was before matching (since the match failed, no characters should be consumed). That may require YYMARKER (or YYBACKUP/YYRESTORE with the generic API) in cases where they currently aren't needed. This is demonstrated by the following example:

/*!re2c
    "aa" { /* ... */ }
*/

Currently re2c generates the following:

        YYCTYPE yych;
        if ((YYLIMIT - YYCURSOR) < 2) YYFILL(2);
        yych = *YYCURSOR;
        switch (yych) {
        case 'a':       goto yy3;
        default:        goto yy2;
        }
yy2:
yy3:
        yych = *++YYCURSOR;
        switch (yych) {
        case 'a':       goto yy4;
        default:        goto yy2;
        }
yy4:
        ++YYCURSOR;
        { /* ... */ }

After the fix, it would generate something like this:

        YYCTYPE yych;
        if ((YYLIMIT - YYCURSOR) < 2) YYFILL(2);
        yych = *(YYMARKER = YYCURSOR);
        switch (yych) {
        case 'a':       goto yy3;
        default:        goto yy2;
        }
yy2:
        goto yyend1;
yy3:
        yych = *++YYCURSOR;
        switch (yych) {
        case 'a':       goto yy5;
        default:        goto yy4;
        }
yy4:
        YYCURSOR = YYMARKER;
        goto yy2;
yy5:
        ++YYCURSOR;
        { /* ... */ }
yyend1:

Currently the program doesn't need YYMARKER, but after the fix it will (and it will do some additional work on backup/restore).

Adding a goto to the end of the block without restoring the input position seems useless, as it would replace one unexpected behavior with another, more subtle one (it is easier to notice crashes and infinite loops than an incorrect number of characters being consumed).
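
The difference in consumed characters can be checked with a small hand-written adaptation of the two listings above. This is a sketch under stated assumptions, not re2c output: YYFILL is omitted, YYCURSOR and YYMARKER become local variables, and the function names are made up. The first variant is the current code with only a bare jump to the end of the block added at yy2.

```c
#include <stddef.h>

/* Both functions try to match "aa" and return how far the cursor moved. */

/* Variant A: jump to the end of the block without backup/restore. */
static ptrdiff_t match_aa_no_restore(const char *input)
{
    const char *cursor = input;     /* stands in for YYCURSOR */
    char yych;

    yych = *cursor;
    switch (yych) {
    case 'a': goto yy3;
    default:  goto yy2;
    }
yy2:
    goto yyend;
yy3:
    yych = *++cursor;
    switch (yych) {
    case 'a': goto yy4;
    default:  goto yy2;             /* the first 'a' stays consumed */
    }
yy4:
    ++cursor;
    { /* action */ }
yyend:
    return cursor - input;
}

/* Variant B: back up the position first and restore it on failure. */
static ptrdiff_t match_aa_with_restore(const char *input)
{
    const char *cursor = input;     /* stands in for YYCURSOR */
    const char *marker;             /* stands in for YYMARKER */
    char yych;

    yych = *(marker = cursor);
    switch (yych) {
    case 'a': goto yy3;
    default:  goto yy2;
    }
yy2:
    goto yyend;
yy3:
    yych = *++cursor;
    switch (yych) {
    case 'a': goto yy5;
    default:  goto yy4;
    }
yy4:
    cursor = marker;                /* roll back the failed match */
    goto yy2;
yy5:
    ++cursor;
    { /* action */ }
yyend:
    return cursor - input;
}
```

On the input "ab" the first variant reports one consumed character although nothing matched, while the second reports zero. On "aa" both report two.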

Sep 07 '20 22:09 skvadrik