ragel
ragel copied to clipboard
Ragel v7 scanner does not set/reset ts/te after the loop ending in a full token
This is a regression from Ragel 6.10, which set ts
and te
pointers as described in the manual, which says this:
This variable may be used in action code for retrieving the text of the current match. Ragel ensures that in between tokens and outside of the longest-match machines that this pointer is set to null. In between calls to the execute code the user must check if ts is set and if so, ensure that the data it points to is preserved ahead of the next buffer block.
Here's the program to show this behavior:
#include <cstdio>
#include <cstdlib>
#include <string>
%%{
machine part_token;
main := |*
' '+;
',';
'abcd' => { printf("token: %s\n", std::string(ts, te).c_str()); };
*|;
}%%
%% write data;
int main(int argc, char *argv[])
{
if(argc != 2) {
printf("Usage: scan token\n");
return EXIT_FAILURE;
}
std::string in(argv[1]);
int cs = 0, act = 0;
const char *p = in.c_str();
const char *pe = in.c_str() + in.length();
const char *ts = nullptr, *te = nullptr;
const char *eof = pe;
%% write init;
%% write exec;
printf("done\n");
if(cs == part_token_error)
printf("Error near %zd\n", p-in.data());
else if(ts)
printf("offsets: ts %zd te: %zd pe: %zd\n", ts-in.data(), te-in.data(), pe-in.data());
else
printf("ts is null\n");
return EXIT_SUCCESS;
}
Compiled on CentOS 8 Stream with Ragel 7.0.0.12, which is available as a package, and with Ragel 6.10, built on the same machine from the source at http://www.colm.net/files/ragel/ragel-6.10.tar.gz.
ragel -o scan.cpp scan.rl
g++ -std=c++17 -o scan scan.cpp
Both scanners ran this way:
./scan abcd
./scan abcd,ab
Ragel 6.10 produces this output:
./scan abcd
token: abcd
done
ts is null
./scan abcd,ab
token: abcd
done
offsets: ts 5 te: 5 pe: 7
Ragel 7 produces this output:
./scan abcd
token: abcd
done
offsets: ts 4 te: 4 pe: 4
./scan abcd,ab
token: abcd
done
offsets: ts 5 te: 5 pe: 7
ts
is no longer reset to a null pointer after the execution loop if the input ends in a full token.
Hi @gh-andre thanks for reporting this.
This is a consequence of
- Ragel using from-state actions to set
ts = p
on the start of a token (always the case). - In ragel 7 from-state actions getting executed on EOF. This was done to better align with leaving-state actions getting executed on EOF. It helps with zero-length matches.
I'm not sure what to do about this at the moment, but thinking about it.
Note that this is only the case after EOF. It means you can check that EOF has not been reached yet. It's still not ideal though. It should stay at zero.
Proposed solution in df0a63a7e779fa8e25c92099e65e9fd10a053241.
@adrian-thurston Thank you for looking into this. I worked it around in my code by checking whether ts
and pe
are the same, in addition to checking whether ts
is a null pointer or not. I'm not familiar with Ragel source and cannot comment on the proposed fix, but will be able to test it once I'm back in the office next week, if you don't mind the delay.
@adrian-thurston Sorry for the delay - colm
was conflicting with the one installed via dnf
and I needed how to build both from the commit.
I tested various scenarios and the new Ragel behaves the same as v6.10. I also tested the scenario I forgot to mention, when the first token is incomplete (i.e. ./scan abc
) and verified that ts
is set at the start of input, while te
is not set and in this example remains nullptr
, while pe
in each case is advanced past the end of the last character (i.e. points to the null character).
Seems to be working well for the scenarios I can think of. Thanks again for looking into this.
@gh-andre great, thanks for trying it out!