
Store named capture groups in field table

Open darrylabbate opened this issue 2 years ago • 3 comments

Numbered groups are already stored in the field table; named capture groups still need to be implemented.

Also, audit the behavior of $abc. Currently, abc would be treated as an expression (variable). To dereference the field table with a named group, you'd need to use a string literal (e.g. $'abc').

If the field table were named/aliased (like arg), you could cleanly dereference using match.group or match[n].

darrylabbate • Dec 04 '22 21:12

Also, audit the behavior of $abc. Currently, abc would be treated as an expression (variable). To dereference the field table with a named group, you'd need to use a string literal (e.g. $'abc').

This would be a breaking change, but logically it makes sense for $foo to correspond to the capture group foo

darrylabbate • Jul 15 '23 04:07

The named capture groups can be extracted from a compiled pattern (pcre2_code *) via pcre2_pattern_info().

  • PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the "name table" (PCRE2_SPTR)
  • PCRE2_INFO_NAMECOUNT returns the number of named capture groups (uint32_t)
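  • PCRE2_INFO_NAMEENTRYSIZE returns the size of each entry in the name table (uint32_t). In the 8-bit library this is the length of the longest capture group name + 3: 2 bytes for the group number plus 1 byte for the terminating null.
    • The first 2 bytes of each entry hold the corresponding capture group number (big endian)
    • The name follows as a null-terminated string; shorter names leave unused padding bytes at the end of the entry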

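Example pattern (compiled with PCRE2_EXTENDED, so whitespace is ignored) and the corresponding name table layout. Entries are 8 bytes each and sorted alphabetically by name; ?? marks undefined padding bytes: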

  (?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) )
  00 01 d  a  t  e  00 ??
  00 05 d  a  y  00 ?? ??
  00 04 m  o  n  t  h  00
  00 02 y  e  a  r  00 ??
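
A minimal sketch of walking a table with this layout, assuming the 8-bit library. To stay self-contained it reads a hand-copied byte array instead of calling PCRE2; in practice the pointer, count, and entry size would come from pcre2_pattern_info() with the three PCRE2_INFO_NAME* requests listed above. The helper names (entry_number, entry_name, dump_name_table) and example_table are hypothetical, not part of PCRE2 or riff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Group number stored in the first 2 bytes (big endian) of entry i. */
static uint32_t entry_number(const uint8_t *table, uint32_t entry_size,
                             uint32_t i)
{
    assert(entry_size >= 3); /* 2 number bytes + at least the NUL */
    const uint8_t *entry = table + (size_t)i * entry_size;
    return ((uint32_t)entry[0] << 8) | entry[1];
}

/* NUL-terminated group name stored right after the number in entry i. */
static const char *entry_name(const uint8_t *table, uint32_t entry_size,
                              uint32_t i)
{
    assert(entry_size >= 3);
    return (const char *)(table + (size_t)i * entry_size + 2);
}

/* Print one "number name" line per table entry. */
static void dump_name_table(const uint8_t *table, uint32_t count,
                            uint32_t entry_size)
{
    for (uint32_t i = 0; i < count; ++i)
        printf("%u %s\n", (unsigned)entry_number(table, entry_size, i),
               entry_name(table, entry_size, i));
}

/* Hand-copied from the layout above; the ?? padding bytes are set to 0. */
static const uint8_t example_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00,
};
```

Calling dump_name_table(example_table, 4, 8) lists date, day, month, and year with numbers 1, 5, 4, and 2. Number 3 never appears because it belongs to the unnamed (\d\d) group nested inside year.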

Obvious approach:

  • Collect the capture group names upon pattern compilation
  • Extract captured substrings from the match data via pcre2_substring_copy_byname() upon pattern matching

Should look closely at the PCRE2 spec for duplicated group names before doing any optimizations with the number <-> name mapping.

darrylabbate • Dec 03 '23 21:12

Should look closely at the PCRE2 spec for duplicated group names before doing any optimizations with the number <-> name mapping.


In an attempt to reduce confusion, PCRE2 does not allow the same group number to be associated with more than one name. [...] However, there is still scope for confusion. Consider this pattern:

(?|(?<AA>aa)|(bb))

Although the second group number 1 is not explicitly named, the name AA is still an alias for any group 1. Whether the pattern matches "aa" or "bb", a reference by name to group AA yields the matched string.

(source)


I.e., a number -> name mapping should be safe if needed, even with PCRE2_DUPNAMES. A name -> number mapping isn't safe, since a name can correspond to multiple numbered groups.
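
To make the asymmetry concrete, here is a small sketch that builds a number -> name array from a name table and counts how many numbers share a given name. It assumes the 8-bit entry layout described earlier in the thread. dup_table is a hand-written stand-in for what a duplicate-name pattern such as (?<n>a)|(?<n>b) compiled with PCRE2_DUPNAMES would be expected to produce; it is not actual PCRE2 output. MAX_GROUPS and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_GROUPS 16 /* hypothetical cap on group numbers for this sketch */

/* Hand-written table: two entries of 4 bytes, both named "n". */
static const uint8_t dup_table[2 * 4] = {
    0x00, 0x01, 'n', 0x00,
    0x00, 0x02, 'n', 0x00,
};

/*
 * Fill names[number] with the name of each named group (NULL otherwise).
 * Each number has at most one name, so nothing is ever overwritten.
 * Returns 0 on success, -1 if a group number does not fit in the array.
 */
static int build_number_to_name(const uint8_t *table, uint32_t count,
                                uint32_t entry_size,
                                const char *names[MAX_GROUPS])
{
    assert(entry_size >= 3);
    for (size_t n = 0; n < MAX_GROUPS; ++n)
        names[n] = NULL;
    for (uint32_t i = 0; i < count; ++i) {
        const uint8_t *entry = table + (size_t)i * entry_size;
        uint32_t number = ((uint32_t)entry[0] << 8) | entry[1];
        if (number >= MAX_GROUPS)
            return -1;
        names[number] = (const char *)(entry + 2);
    }
    return 0;
}

/* Count the group numbers that carry `name`; > 1 means it is ambiguous. */
static uint32_t count_numbers_for_name(const uint8_t *table, uint32_t count,
                                       uint32_t entry_size, const char *name)
{
    assert(entry_size >= 3);
    uint32_t hits = 0;
    for (uint32_t i = 0; i < count; ++i) {
        const char *entry = (const char *)(table + (size_t)i * entry_size + 2);
        if (strcmp(entry, name) == 0)
            ++hits;
    }
    return hits;
}
```

With dup_table, names[1] and names[2] both end up pointing at "n", which is unambiguous in the number -> name direction, while count_numbers_for_name reports 2 for "n", so a single name -> number slot would have to drop one of the groups.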

darrylabbate • Aug 06 '24 08:08