sedlex
Interactive mode
Hi,
I'm running into an issue migrating an existing app from ocamllex/ocamlyacc to sedlex/menhir. The problem concerns only sedlex, as I've successfully tested the same setup with ocamllex/menhir.
The application has an interactive mode, which works this way:
parser.mly:
%token <string> NUMBER
%token SEQSEQ EOF
%start interactive
%type <int> interactive
%%
exprs:
| NUMBER { int_of_string $1 }
interactive:
| exprs SEQSEQ { $1 }
| EOF { raise End_of_file }
lexer.ml:
open Parser
let rec tokenizer lexbuf =
match%sedlex lexbuf with
| '\n' -> tokenizer lexbuf
| Plus('0'..'9') -> NUMBER (Sedlexing.Utf8.lexeme lexbuf)
| ';', ';' -> SEQSEQ
| eof -> EOF
| _ -> failwith "Parse error!"
main.ml:
let interactive =
MenhirLib.Convert.Simplified.traditional2revised Parser.interactive
let get_token lexbuf =
let tokenizer () =
(Lexer.tokenizer lexbuf, Lexing.dummy_pos, Lexing.dummy_pos)
in
interactive tokenizer
let () =
  (* sedlex version; the ocamllex version uses Lexing.from_channel instead *)
  let lexbuf = Sedlexing.Utf8.from_channel stdin in
let rec f () =
Printf.printf "Got number from stdin: %d\n%!" (get_token lexbuf);
f ()
in
f ()
When compiled using ocamllex and run in a terminal console, I can do:
% ./test
<type: 123;;\n>
Got number from stdin: 123
<type: 456;;\n>
Got number from stdin: 456
However, nothing happens when I do the same with the sedlex lexer. It does work when I pipe the input instead:
% echo "123;;\n456;;\n" | ./test_sedlex
Got number from stdin: 123
Got number from stdin: 456
Fatal error: exception End_of_file
I'd imagine this is most likely due to a difference in how ASCII vs. UTF-8 characters are treated. Do y'all think there is a way to mitigate this?
Thanks!
Seconded. I expected using a generator to address that problem, to no avail; match%sedlex only starts lexing the input once EOF is found. That also makes parsing from a pipe, for instance, quite problematic.
I'm not sure what the issue is, but we can try looking at it. Sedlex lacks many of the features ocamllex has. In the near term I was emphasizing mostly work on the regular expression language, submatch capture, etc., but I suspect this is a simple buffering issue and might not be hard to fix. Do you want to have an initial look at it yourself? I might not be able to get at it for some days.
I am toying with making the channel non-canonical and using a generator. If that works it may not be a sedlex issue after all. I'll keep y'all posted.
So it seems this is indeed about when sedlex triggers the parsing of its internal buffer. I’m investigating.
Hello! Any update on this? I've revived the sedlex branch in liquidsoap and would love to merge it; this issue is blocking that for now. If you think there will be a fix, I'm happy to merge it and wait for the fix in a future release.
@toots Maybe you would like to try to isolate what the problem is? If you have a case which reproduces it, that makes the situation much easier.
@pmetzger I believe the example in the description is pretty minimal and self-explanatory. I created a gist for it with a Makefile in case you need a quick way to build it: https://gist.github.com/toots/c0a082da9e92ebdae0025da49bfb52d8
Reproduction steps:
% git clone https://gist.github.com/toots/c0a082da9e92ebdae0025da49bfb52d8
% cd c0a082da9e92ebdae0025da49bfb52d8
% make
% echo "123;;\n456;;\n" | ./test
Got number from stdin: 123
Got number from stdin: 456
Fatal error: exception End_of_file
% ./test
<type: 123;;\n>
(...nothing...)
Ok, I found the source of my issue. It seems that chunk_size here: https://github.com/ocaml-community/sedlex/blob/master/src/lib/sedlexing.ml#L54 is so large that the lexer gets stuck during the first refill.
Any idea why such a high value? A UTF-8 character is at most 4 bytes, so 512 seems like a lot to me.
Almost certainly it's a question of performance. Otherwise you're going to do an OS i/o call on every character. That's fine if you're reading things interactively, but it will reduce performance dramatically if you're (say) lexing a large corpus of source code. Indeed, arguably the number in non-interactive use should be larger, not smaller.
One possible solution is to adjust the behavior of the lexer in the face of a read that's smaller than expected. The Unix read(2) system call will return on an interactive device at the end of line, even if the supplied read count is much larger than the number of bytes actually available, precisely in order to allow interactive applications.
Perhaps the solution here is to use a different refill function when creating your lexbuf?
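One possible shape for that, as a stopgap (a sketch only, untested against this gist; it assumes the `Sedlexing.create : (Uchar.t array -> int -> int -> int) -> lexbuf` signature, and it is ASCII-only since it maps each input byte directly to a `Uchar.t`): a custom refill that hands sedlex one character per call, so nothing waits for a 512-byte chunk to fill:

```ocaml
(* Sketch: an interactive-friendly lexbuf built on a custom refill.
   The refill returns as soon as it has a single character (0 means
   EOF), so sedlex can start matching without waiting for a full
   internal chunk. ASCII-only: each byte becomes one Uchar.t. *)
let interactive_lexbuf (ic : in_channel) : Sedlexing.lexbuf =
  Sedlexing.create (fun buf pos _len ->
      match input_char ic with
      | c -> buf.(pos) <- Uchar.of_char c; 1
      | exception End_of_file -> 0)
```

In main.ml this would replace the lexbuf creation: `let lexbuf = interactive_lexbuf stdin in`. The cost is one `input_char` per character, which is fine interactively but slow on a large corpus.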
Thanks, that's very informative. Looking at the code, though, it seems sedlex is not yet taking full advantage of that for input channels: it is, in fact, calling input_char repeatedly, once per character. I'll look at changing this behavior to use input, whose semantics actually match the refill semantics described in the API.