sedlex
Interactive mode
Hi,
I'm running into an issue migrating an existing app from ocamllex/ocamlyacc to sedlex/menhir. The problem concerns only sedlex, as I've successfully tested the same setup with ocamllex/menhir.
The application has an interactive mode, which works this way:
parser.mly:
%token <string> NUMBER
%token SEQSEQ EOF
%start interactive
%type <int> interactive
%%
exprs:
| NUMBER { int_of_string $1 }
interactive:
| exprs SEQSEQ { $1 }
| EOF { raise End_of_file }
lexer.ml:
open Parser
let rec tokenizer lexbuf =
match%sedlex lexbuf with
| '\n' -> tokenizer lexbuf
| Plus('0'..'9') -> NUMBER (Sedlexing.Utf8.lexeme lexbuf)
| ';', ';' -> SEQSEQ
| eof -> EOF
| _ -> failwith "Parse error!"
main.ml:
let interactive =
MenhirLib.Convert.Simplified.traditional2revised Parser.interactive
let get_token lexbuf =
let tokenizer () =
(Lexer.tokenizer lexbuf, Lexing.dummy_pos, Lexing.dummy_pos)
in
interactive tokenizer
let () =
  (* sedlex version; the ocamllex version uses Lexing.from_channel instead *)
  let lexbuf = Sedlexing.Utf8.from_channel stdin in
let rec f () =
Printf.printf "Got number from stdin: %d\n%!" (get_token lexbuf);
f ()
in
f ()
When compiled using ocamllex and run in a terminal console, I can do:
% ./test
<type: 123;;\n>
Got number from stdin: 123
<type: 456;;\n>
Got number from stdin: 456
However, nothing happens when I do the same with the sedlex lexer. It does work when I pipe the input instead:
% echo "123;;\n456;;\n" | ./test_sedlex
Got number from stdin: 123
Got number from stdin: 456
Fatal error: exception End_of_file
I'd imagine this is most likely due to a difference in how ASCII vs. UTF-8 characters are treated. Do y'all think there is a way to mitigate this?
Thanks!
Seconded. I expected using a generator to address that problem, to no avail; match%sedlex only starts lexing the input once EOF is found. That also makes parsing from a pipe, for instance, quite problematic.
I'm not sure what the issue is, but we can try looking at it. Sedlex lacks many of the features ocamllex has. In the near term I was emphasizing mostly work on the regular expression language, submatch capture, etc., but I suspect this is a simple buffering issue and might not be hard to fix. Do you want to have an initial look at it yourself? I might not be able to get at it for some days.
I am toying with making the channel non-canonical and using a generator. If that works it may not be a sedlex issue after all. I'll keep y'all posted.
So it seems this is indeed about when sedlex triggers the parsing of its internal buffer. I’m investigating.
Hello! Any update on this? I've revived the sedlex branch in liquidsoap and would love to merge it; this issue is blocking that for now. If you think there will be a fix, I'm happy to merge it and wait for the fix in a future release.
@toots Maybe you would like to try to isolate what the problem is? If you have a case which reproduces it, that makes the situation much easier.
@pmetzger I believe the example in the description is pretty minimal and self-explanatory. I created a gist for it with a Makefile in case you need a quick way to build it: https://gist.github.com/toots/c0a082da9e92ebdae0025da49bfb52d8
Reproduction steps:
% git clone https://gist.github.com/toots/c0a082da9e92ebdae0025da49bfb52d8
% cd c0a082da9e92ebdae0025da49bfb52d8
% make
% echo "123;;\n456;;\n" | ./test
Got number from stdin: 123
Got number from stdin: 456
Fatal error: exception End_of_file
% ./test
<type: 123;;\n>
(...nothing...)
Ok, I found the source of my issue. It seems that chunk_size here: https://github.com/ocaml-community/sedlex/blob/master/src/lib/sedlexing.ml#L54 is so large that the lexer gets stuck during the first refill.
Any idea why such a high value? A UTF-8 character is at most 4 bytes, so 512 seems like a lot to me.
Almost certainly it's a question of performance. Otherwise you're going to do an OS i/o call on every character. That's fine if you're reading things interactively, but it will reduce performance dramatically if you're (say) lexing a large corpus of source code. Indeed, arguably the number in non-interactive use should be larger, not smaller.
One possible solution is to adjust the behavior of the lexer in the face of a read that's smaller than expected. The Unix read(2) system call will return on an interactive device at the end of line, even if the supplied read count is much larger than the number of bytes actually available, precisely in order to allow interactive applications.
Perhaps the solution here is to use a different refill function when creating your lexbuf?
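One possible shape for that, as a stopgap (a sketch only, untested against this gist; it assumes the `Sedlexing.create : (Uchar.t array -> int -> int -> int) -> lexbuf` signature, and it is ASCII-only since it maps each input byte directly to a `Uchar.t`): a custom refill that hands sedlex one character per call, so nothing waits for a 512-byte chunk to fill:

```ocaml
(* Sketch: an interactive-friendly lexbuf built on a custom refill.
   The refill returns as soon as it has a single character (0 means
   EOF), so sedlex can start matching without waiting for a full
   internal chunk. ASCII-only: each byte becomes one Uchar.t. *)
let interactive_lexbuf (ic : in_channel) : Sedlexing.lexbuf =
  Sedlexing.create (fun buf pos _len ->
      match input_char ic with
      | c -> buf.(pos) <- Uchar.of_char c; 1
      | exception End_of_file -> 0)
```

In main.ml this would replace the lexbuf creation: `let lexbuf = interactive_lexbuf stdin in`. The cost is one `input_char` per character, which is fine interactively but slow on a large corpus.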
Thanks, that's very informative. Looking at the code, though, it seems sedlex is not yet taking full advantage of that for input channels: it is, in fact, calling input_char repeatedly, once per character. I'll look at changing this behavior to use input, whose semantics actually match the refill semantics described in the API.