sedlex Support named capture groups

This PR tries to support named capture groups. Closes #5, closes #102.

Usages

In the example

(Plus '0'..'9') as digits

Variable digits is defined as a (pos, len) pair, which can be passed as arguments to Sedlexing.sub_lexeme to retrieve the group captured by digits.

The as syntax is only allowed in match%sedlex ... with ... constructs. You cannot provide an alias in let ... = [%sedlex.regexp? ...]. And it is not allowed inside constructors (i.e. Star, Plus, Rep, Opt, Compl, Sub, Intersect, Chars).

Plus ("ab" as a)  (* this is invalid *)

Implementation

We are trying to restore the NFA node transitions path with the information of the DFA state transitions path.

Explanation

And finally,

let sub (a, b) = Sedlexing.Latin1.sub_lexeme buf a b in
match%sedlex buf with
  | (sign as s), prefix_2, (num_2 as n) ->
      Printf.printf "Bin %s%s\n" (sub s) (sub n);
      token buf
  ...

is translated to

let sub (a, b) = Sedlexing.Latin1.sub_lexeme buf a b in
match __sedlex_result with
  | 0 ->
      let __sedlex_aliases_starts, __sedlex_aliases_stops =
        __sedlex_trace_0 buf __sedlex_path
      in
      let n = (__sedlex_aliases_starts.(0), __sedlex_aliases_stops.(0) - __sedlex_aliases_starts.(0))
      and s = (__sedlex_aliases_starts.(1), __sedlex_aliases_stops.(1) - __sedlex_aliases_starts.(1)) in
      Printf.printf "Bin %s%s\n" (sub s) (sub n);
      token buf
  ...

Note

When there are multiple available paths, any of them might be chosen. As in

(Plus "ab" as a), (Plus "ab" as b)

"abababab" => a:"ababab", b:"ab"

One of the possible combinations of a and b will be returned.

Others

There are some tests (test/number_lexer.ml, test/misc.ml) currently. More to be added.

Sep 19 '22 12:09 dyzsr

Can anybody help add reviewers? (Also approve running workflows)

Sep 20 '22 15:09 dyzsr

PTAL @alainfrisch @Drup @pmetzger

Sep 21 '22 13:09 dyzsr

Could you give more details about the current implementation ? Why is it done this way ? Can't we update positions as we traverse the automation, instead of doing a second pass based on some trace ? Why is the trace processed in reverse order ?

Sep 21 '22 20:09 hhugo

There is a lot of room for optimization.

Offset tracking could be shared between bindings.

For example, in(Plus 'a' as a), (Plus 'b' as b), end of binding a equals start of binding b.

Some offset can be computer statically. In the regexp above, start of a is 0, end of b is the end of the lexeme.

Have you considered such optimizations ?

Sep 21 '22 20:09 hhugo

let rec token buf =
  match%sedlex buf with
    | (Plus "a" as a), (Plus "b" as b), "c" ->
        ignore a;
        ignore b;
        Printf.printf "Op %s - %s\n"
          (Sedlexing.Latin1.sub_lexeme buf (fst a) (snd a))
          (Sedlexing.Latin1.sub_lexeme buf (fst b) (snd b));
        token buf
    | eof -> print_endline "EOF"
    | _ -> failwith "Unexpected character"

let () =
  let lexbuf = Sedlexing.Latin1.from_string "aaaabbbc" in
  token lexbuf

Returns an unexpected result:

Op aaaab - bb                       
EOF

Sep 22 '22 09:09 hhugo

I think I would be easier if your trace was maintaining pairs of pos, instead of pos + len.

Sep 22 '22 11:09 hhugo

I think I would be easier if your trace was maintaining pairs of pos, instead of pos + len.

Thanks for your suggestions! Let me do some fixes.

Sep 22 '22 11:09 dyzsr

I can't really comment on the implementation but, for the API, is there a benefit in exporting offset/length vs simply binding the resulting match string?

The bindings can be either string, Uchar.t array or utf8/utf16 strings, so I think it's better to export just offset/length and let the user decide the sub_lexeme function.

Sep 22 '22 12:09 dyzsr

I can't really comment on the implementation but, for the API, is there a benefit in exporting offset/length vs simply binding the resulting match string?

The bindings can be either string, Uchar.t array or utf8/utf16 strings, so I think it's better to export just offset/length and let the user decide the sub_lexeme function.

Makes sense, ty!

Sep 22 '22 14:09 toots

I would be useful to write test (as expect test (ppx_expect) maybe). The examples are currently not testing enough.

Sep 22 '22 15:09 hhugo

I would be useful to write test (as expect test (ppx_expect) maybe). The examples are currently not testing enough.

Sure, I'll refactor the test later.

Op aaaab - bb                       
EOF

Give me some time and I'm working on this.

Sep 22 '22 15:09 dyzsr

Op aaaab - bb                       
EOF

@hhugo This one is fixed. You may check the expect test result.

I will update the implementation details in next few days. (It's kind of difficult to explain it clearly).

Sep 22 '22 16:09 dyzsr

Another one

let print_alias buf name (x,len) =
  Printf.printf "%s: %S (%d, %d)\n" name (Sedlexing.Latin1.sub_lexeme buf  x len) x len
let rec token buf =
  match%sedlex buf with
  | (((Star "a" as a), (Plus "b" as b)), "c") | (Plus "a" as b), (Plus "b" as a), ("d") ->
    print_alias buf "a" a;
    print_alias buf "b" b;
        token buf
    | eof -> print_endline "EOF"
    | _ -> failwith "Unexpected character"

let () =
  let lexbuf = Sedlexing.Latin1.from_string "aaaabbbbd" in
  token lexbuf

wrongly returns

a: "aaaa" (0, 4)                    
b: "bbbb" (4, 4)
EOF

Sep 22 '22 20:09 hhugo

I suspect you will need to track more than a pair of pos for some alias.

Sep 22 '22 20:09 hhugo

With the information of matched lexeme and DFA path, we have to restore the NFA path. Tracing back the NFA path, we know from the NFA node about when an alias starts and stops.

For example,

| (((Star "a" as a), (Plus "b" as b)), "c") | (Plus "a" as b), (Plus "b" as a), ("d") ->

A possible (simplified) NFA would be

0 ->(eps) 1
0 ->(eps) 5
1 ->[a] 1
1 ->(eps) 2
2 ->[b] 3
3 ->(eps) 2
3 ->(eps) 4
4 ->[c] 10
5 ->[a] 6
6 ->(eps) 5
6 ->(eps) 7
7 ->[b] 8
8 ->(eps) 7
8 ->(eps) 9
9 ->[d] 10

where 0 is the initial node and 10 is the final node. A converted DFA could be

1: {0, 1, 2, 5}
2: {1, 2, 5, 6, 7}
3: {2, 3, 4, 7, 8, 9}
4: {10}

1 ->[a] 2
2 ->[a] 2
2 ->[b] 3
3 ->[b] 3
3 ->[c, d] 4

where 1 is the initial state and 4 is the final state.

let lexbuf = Sedlexing.Latin1.from_string "aaaabbbbd" in

With input "aaaabbbbd", we got a DFA path [4; 3; 3; 3; 3; 2; 2; 2; 2; 1]. The final NDA node is unique so we know it's 10. Starting from state 4 and node 10, we now trace back the path. The previous NFA node can be decided by (current_state, current_node, previous_state, char_set), therefore a transition case is in the form of (current_state, current_node, previous_state, char_set) => previous_node. What Sedlex.compile_traces returns is a list of such transition cases.

The transition cases for DFA state 4 and NFA node 10 can be

(4, 10, 3, 'c') => 4
(4, 10, 3, 'd') => 9

Note that previous_node can travel to current_node by first going through a real edge (node.trans), and then several epsilon edges (node.eps). They are not always adjacent.

And since the starting and stoping of an alias can only happen when travel through epsilon edges (see Sedlex.alias), we just generate actions when the transition go through any epsilon edges (see Sedlex.trans_case).

The last DFA state we trace back will always be the initial state (which is 1), however the NFA node we are currently in may not be 0. It can be in 1, 2, or 5, and will travel back to 0 by only epsilon edges. Therefore we generate actions for the initial state (see Sedlex.final_case).

Sep 23 '22 15:09 dyzsr

Could you give more details about the current implementation ? Why is it done this way ? Can't we update positions as we traverse the automation, instead of doing a second pass based on some trace ? Why is the trace processed in reverse order ?

The reason why making a second pass in the reverse order is that, with a valid given DFA path, you can always find an NFA path from the final node back to the initial node. But from a front order, there are some paths that cannot lead to the final node within the required steps.

And in the second pass, we already know which branch we're in, so nodes and aliases from other branches can be ignored. If do it in one pass, I'm afraid we have to refactor a lot to record the states and actions.

Sep 23 '22 15:09 dyzsr

I think you should look at minimizing the quantity of code generated for trace fonctions. My last example generate a trace function with 80 cases.

Have you though about optimizing the number positions tracked ? You could have a single array of positions (instead of one for starts and one for ends) and have alias reference a pair of position. position could then more easily be shared between aliases.

Sep 23 '22 20:09 hhugo

This PR starts to look good, thanks for all the changes.

I think it would be nice to commit the post-processed version of your test (like you did before) with the corresponding dune stanza so that the file is kept up to date automatically. I think this really helps when reviewing changes.
More comments in the code would be welcome. Maybe inline some of the explanation made in this thread inside the code.
You haven't reacted to one of my comment regarding optimization.

There is a lot of room for optimization. Offset tracking could be shared between bindings. For example, in(Plus 'a' as a), (Plus 'b' as b), end of binding a equals start of binding b. Some offset can be computer statically. In the regexp above, start of a is 0, end of b is the end of the lexeme. Have you considered such optimizations ?

ocamllex has some optimization:

rule token = parse
  | ("a"+ as a) ("b" + as b) "d" { a,b }

only stores 2 integers

rule token = parse
  | (['-' '+'] as sign) (['0'-'9']+ as a) { a,b }

Doesn't store anything.

The change on .ocamlformat needs to be removed before the merge.

Sep 26 '22 08:09 hhugo

You should also update the README.md with some documentation

Sep 26 '22 08:09 hhugo

@hhugo Thanks for your considerate suggestions! Let me do it one by one and start from merging the alias offsets.

ocamllex has some optimization:
rule token = parse
  | ("a"+ as a) ("b" + as b) "d" { a,b }
only stores 2 integers
rule token = parse
  | (['-' '+'] as sign) (['0'-'9']+ as a) { a,b }
Doesn't store anything.

I'm not sure if I'm able to support optimizations in such static cases. I'll think about it.

Another thing yet to do is to stop tracking the path if the matching branches have no alias at all.

Sep 26 '22 14:09 dyzsr

FYI, if one really wants to support capture groups correctly: https://re2c.org/#papers has several must-read papers.

Sep 29 '22 13:09 pmetzger

@pmetzger Thanks for the link!

Sep 29 '22 14:09 dyzsr

I think it would be nice to commit the post-processed version of your test (like you did before) with the corresponding dune stanza so that the file is kept up to date automatically. I think this really helps when reviewing changes.

I've added this in https://github.com/hhugo/sedlex/commit/0284bfd961d3c2c4bcda759fb0e03f68fb7e84ee

Oct 04 '22 09:10 hhugo

Hey there! Do we want to try & come to a consensus on wether this could be merged?

Oct 22 '22 14:10 toots

@toots I think there's at least one code review comment above that hasn't yet been addressed? Generally I think we should merge though?

Oct 24 '22 14:10 pmetzger

I currently can't close/resolve threads. There are still a bunch unaddressed ones.

Oct 24 '22 17:10 hhugo