alex icon indicating copy to clipboard operation
alex copied to clipboard

Feature: allow the user to capture groups on regex

Open german1608 opened this issue 6 years ago • 4 comments
trafficstars

Currently, when alex executes a token action, the whole match is captured. For example:

%wrapper "basic" -- the behaviour is similar on the other wrappers
tokens :-
  \"a\"  { \s -> Token a }


{
data Token = Token String
}
-- After reading, alex would return [Token "\"a\""] for input = "\"a\""

Would be amazing that alex match syntax allows to capture groups, like other regexes:

%wrapper "basic" -- the behaviour is similar on the other wrappers
tokens :-
  \"(a)\"  { \s -> Token s }
-- After reading, alex would return [Token "a"] for the same input

Alex could also handle several groups in the same token action, so it could extract the groups as a list of strings or something alike.

german1608 avatar Oct 09 '19 17:10 german1608

Yes, I would really like to have this functionality. As I recall it was non-trivial to implement it (especially if we want to do it without a performance overhead if you don't use the functionality), but it would be very useful to have it.

simonmar avatar Oct 11 '19 10:10 simonmar

We could try to have separate wrappers for that, something like:

%wrapper "basic-group-captor"
%wrapper "posn-group-captor"
%wrapper "monad-group-captor"
%wrapper "monadUserState-group-captor"

Or adding another directive, like:

%capturegroups

I think the last approach is better than the first one.

german1608 avatar Oct 11 '19 14:10 german1608

@german1608: is the automata-theoretic implementation of groups worked out some where? I found only the question: https://stackoverflow.com/questions/28941425/capture-groups-using-dfa-based-linear-time-regular-expressions-possible , and the answer pointed out that groups introduce nondeterminism, e.g. matching

a(b)|(a)b

against ab would give you non-deterministically a or b. This is maybe a can of worms we do not want to open.

Without drilling up the automata generation of Alex, one could imagine the following light-weight approach:

%wrapper "basic-regex-groups" 
tokens :-
  \"(a)\"  { \ r s -> Token (extractGroup r s) }
-- After reading, alex would return [Token "a"] for the same input

What Alex would do here, is:

  • Ignore the groups for the sake of DFA generation and token recognition.
  • Provide you with the regex r that accepted the token s, such that you can extract the group content from s yourself. (A user-defined extractGroup could utilize a third-party library after converting r into the format required by that library.)
  • The regex r would be provided as element of a type RExp defined in the basic-regex-groups wrapper, basically as an abstract syntax tree, see https://github.com/simonmar/alex/blob/6c4db72c3bb3419c84740e2e9ae112ed0abd572e/src/AbsSyn.hs#L198-L205.

This might not be as comfortable as it could be, but a rather lightweight and generic extension of Alex. In terms of efficiency, there is of course some duplication of work, but just extracting groups in a regexp that is known to match should be cheaper that figuring out which token to extract from the input in the first place.

andreasabel avatar Jan 31 '20 17:01 andreasabel

@german1608: is the automata-theoretic implementation of groups worked out some where?

There is TDFAs (Tagged Deterministic Finite Automata), e.g. https://github.com/haskell-hvr/regex-tdfa .

andreasabel avatar Apr 14 '23 19:04 andreasabel