jq
jq copied to clipboard
bugs in integration of oniguruma
[EDIT: erroneous remarks about the "m" FLAG have been corrected.]
There are several outstanding bug reports regarding jq's support for regular expression functionality (e.g. #2562), but it's beginning to dawn on me that there might be some significant issues with the integration of the Oniguruma library, so I thought it could be helpful to have a central "hub" for identifying these issues so that hopefully they can be more efficiently resolved.
(1) Both the manual and builtin.c envision FLAGS (as in test(_; FLAGS)
) as being any of the characters in [gilmnpsx], but in practice, "m", "p", and "s" do not work as advertised in the jq manual:
According to the manual:
* `m` - Multi line mode (`.` will match newlines)
* `p` - Both s and m modes are enabled
* `s` - Single line mode (`^` -> `\A`, `$` -> `\Z`)
But in practice:
* `m` - . matches \n; ^ matches at start or after \n; $ matches at end or before \n
* `p` - only causes `.` to match `\n`
* `s` - does not seem to have any effect at all
Here are some examples:
$ ./jq -nM '"a\nb" | test("^b"; "p")'
false
$ ./jq -nM '"a\nb" | test("^b"; "s")'
false
(2) There seems to be a general problem in the support for "Extended Groups" (EGs) of the form (?-OFF)
or (?-OFF:regex)
.
Consider for example:
$ jaq -n '"TEst" | test("(?-i:te)st"; "i") '
false
$ gojq -n '"TEst" | test("(?-i:te)st"; "i") '
false
$ jq -n '"TEst" | test("(?-i:te)st"; "i") '
jq: error (at <unknown>): Regex failure: invalid group name <>
(I have not uncovered problems with EGs of the form (?ON)
and (?ON-OFF)
,
where ON
is understood to be a string of letters chosen from [imsxWDSPy].)
It's hard to imagine that this is an Oniguruma issue (I have checked at https://github.com/kkos/oniguruma/issues?q=is%3Aissue+extended).
As noted in #2562, the "m" flag does have an effect. The tests above seem to use \b
for backspace - I would expect \n
for newline. Here's the description I drew up over there, and following that the tests I used to verify it:
- Default behaviour
^
matches only at start of string,.
does not match newliness
flag- No effect
m
flag- Causes
.
to match newlines (?s)
option- Causes
.
to match newlines (?m)
option- Causes
^
to match after newlines
def test($options; $flags):
def matches($regex):
(if $options == "" then "" else "(?\($options))" end) as $optstring|
[match("\($optstring)\($regex)"; $flags)]
;
"Options: \($options), Flags: \($flags)",
"Does . match newlines? \("\n"|matches(".")|length==1)",
"Does ^ match at line start? \("\na\nb"|matches("^b")|length==1)",
""
;
test(""; "g"),
test("s"; "g"),
test("m"; "g"),
test("sm"; "g"),
test(""; "sg"),
test("s"; "sg"),
test("m"; "sg"),
test("sm"; "sg"),
test(""; "mg"),
test("s"; "mg"),
test("m"; "mg"),
test("sm"; "mg"),
test(""; "smg"),
test("s"; "smg"),
test("m"; "smg"),
test("sm"; "smg")
As I understand it, the reason "s" and "p" do not change the behaviour of ^
and $
is that the Perl mode already turns on this behaviour, that Oniguruma calls "ONIG_OPTION_SINGLELINE" and I shall call "anchors-don't-match-at-newlines". If you read the description in the jq documentation for "s" you will see that it is accurate but surprising. It causes ^
to be interpreted the same as \A
, which is already true.
@annettejanewilson thanks for that.
jq -nr 'def test($options; $flags): [28/1842]
def matches($regex):
(if $options == "" then "" else "(?\($options))" end) as $optstring|
[match("\($optstring)\($regex)"; $flags)]
;
try
if ("\n"|matches(".")|length==1) and ("\na\nb"|matches("^b")|length==1)
then "Options \"\($options)\" and flags \"\($flags)\" allow . to match newline and ^ to match line start (as opposed to string start)"
else "FAIL: options \"\($options)\" and flags \"\($flags)\""
end
catch "ONIGURUMA EXCEPTION: \(.)";
test(""; ""),
test(""; "g"),
test("s"; "g"),
test("m"; "g"),
test("p"; "g"),
test("sm"; "g"),
test(""; "sg"),
test("s"; "sg"),
test("m"; "sg"),
test("p"; "sg"),
test("sm"; "sg"),
test(""; "mg"),
test("s"; "mg"),
test("m"; "mg"),
test("p"; "mg"),
test("sm"; "mg"),
test(""; "smg"),
test("s"; "smg"),
test("m"; "smg"),
test("p"; "smg"),
test("sm"; "smg"),
test("s"; ""),
test("m"; ""),
test("p"; ""),
test("sm"; "")
produces
FAIL: options "" and flags ""
FAIL: options "" and flags "g"
FAIL: options "s" and flags "g"
FAIL: options "m" and flags "g"
ONIGURUMA EXCEPTION: Regex failure: undefined group option
Options "sm" and flags "g" allow . to match newline and ^ to match line start (as opposed to string start)
FAIL: options "" and flags "sg"
FAIL: options "s" and flags "sg"
FAIL: options "m" and flags "sg"
ONIGURUMA EXCEPTION: Regex failure: undefined group option
Options "sm" and flags "sg" allow . to match newline and ^ to match line start (as opposed to string start)
FAIL: options "" and flags "mg"
FAIL: options "s" and flags "mg"
Options "m" and flags "mg" allow . to match newline and ^ to match line start (as opposed to string start)
ONIGURUMA EXCEPTION: Regex failure: undefined group option
Options "sm" and flags "mg" allow . to match newline and ^ to match line start (as opposed to string start)
FAIL: options "" and flags "smg"
FAIL: options "s" and flags "smg"
Options "m" and flags "smg" allow . to match newline and ^ to match line start (as opposed to string start)
ONIGURUMA EXCEPTION: Regex failure: undefined group option
Options "sm" and flags "smg" allow . to match newline and ^ to match line start (as opposed to string start)
FAIL: options "s" and flags ""
FAIL: options "m" and flags ""
ONIGURUMA EXCEPTION: Regex failure: undefined group option
Options "sm" and flags "" allow . to match newline and ^ to match line start (as opposed to string start)
The option "p"
doesn't work even though "sm"
does, and that's because the test programs above are prepending the options to the regex while Oniguruma doesn't know "p"
. The "p"
is supposed to be in the second argument to match
, but there it definitely doesn't work either, nor does "sm"
.
So the mode
argument to match
is a bit busted. And Oniguruma does give us a workaround.
Also, fmatch()
has a number of leaks.
Near as I can tell onig_new()
doesn't honor ONIG_OPTION_MULTILINE
and ONIG_OPTION_SINGLELINE
being set in its option
argument.
For Oniguruma to understand the "p"
option we need to define USE_POSIXLINE_OPTION
:
8262 #ifdef USE_POSIXLINE_OPTION
8263 case 'p':
8264 OPTION_NEGATE(option, ONIG_OPTION_MULTILINE|ONIG_OPTION_SINGLELINE, neg);
8265 break;
8266 #endif
(in src/regparse.c
).
So I think in fact we're looking at an Oniguruma bug. I've single stepped through it a bit and I can't quite find the bug yet.
@nicowilliams wrote:
Oniguruma doesn't know "p". For Oniguruma to understand the "p" option we need to define USE_POSIXLINE_OPTION.
As I mentioned, using e.g. jq 1.6, the "p" FLAG is recognized, but not as advertised:
"p" does cause .
to match \n
$ jq -n '"a\nb" | test("a.b")'
false
$ jq -n '"a\nb" | test("a.b"; "p")'
true
"p" does not cause ^ to behave differently
$ jq -n '"a\nb" | test("^b")'
false
$ jq -n '"a\nb" | test("^b"; "p")'
false
For contrast:
$ jaq -n '"a\nb" | test("^b")'
false
$ jaq -n '"a\nb" | test("^b"; "p")'
true
@pkoppstein well, I was trying jq from master
:), and the thing about "p"
and Oniguruma is about regexes that start with ?(p)
. The flag "p"
in your example is handled by jq adding the relevant values to the option
bitmask, and Oniguruma seems to be ignoring that.
@nicowilliams - Yes, we have to be careful about distinguishing between FLAGS and the options allowed in "extended groups". Regarding the latter, lower-case "p" is not one of the options.
The Oniguruma documentation gives these syntax summaries for extended groups:
(?imxWDSPy-imxWDSP:subexp)
(?imxWDSPy-imxWDSP)
Regarding the latter, lower-case "p" is not one of the options.
But you can enable it to be one of them by adding -DUSE_POSIXLINE_OPTION
to the build's C flags.
@nicowilliams wrote:
you can enable it to be one of them by adding -DUSE_POSIXLINE_OPTION to the build's C flags.
I would have expected to see something about that on the Oniguruma configuration page: https://github.com/kkos/oniguruma/blob/master/doc/SYNTAX.md
Can you see it mentioned there?
I would have expected to see something about that on the Oniguruma configuration page: https://github.com/kkos/oniguruma/blob/master/doc/SYNTAX.md
Can you see it mentioned there?
No, I found it by code inspection. Clearly it's not meant to be supported. Or maybe it is but they failed to document it. IMO the should just add it unconditionally, but anyways.
(1) Re FLAGS (second argument to test/1):
It may be helpful to remember that:
(a) The jq documentation is muddled in that the descriptions of the "m" and "s" have been incorrectly swapped (that is, "m" should (in accordance with universal agreement) refer to ^ and $; and "s" to .)
(b) jq's "m" FLAG behaves in accordance with the (correct) description of the "p" FLAG (i.e. "s" + "m")
(c) jq's "p" FLAG behaves in accordance with the description: . matches \n
Thus the missing functionality is multiline-only behavior.
(2) Re the "p" Extended Group option (as in (?p:subexp)
)
Since the Oniguruma documentation is consistent (to the point of being emphatic) that p
is NOT an allowed Extended Group letter, I think it would be unwise to make it so.
Could the "p" FLAG be implemented in terms of (?ms)
?
@nicowilliams @itchyny @annettejanewilson - What are your thoughts about dealing with this for jq 1.7, assuming no fixes will be available for this release? It seems to me that the main options involve some combination of these elements:
- Add info about Extended Groups (
(?options:regex)
, etc); - Describe the actual behavior of FLAGS (second arg of
test
), with brief indications of those FLAGS whose behavior will (or are expected to) change; - Include details about both the "intended" (i.e. ideal or standard) behaviors of FLAGS and their actual behaviors.
@pkoppstein and maybe
- fix this code so that the intended behaviors of flags actually get implemented.
Does (1) wed us to Oniguruma? Probably not. We could parse those from any regex if we had to. So I'm certainly happy with (1), but I'd like to do the others too.
@nicowilliams asked:
Does (1) wed us to Oniguruma?
No, at least in the sense that the following engines (*) all understand(?i:a)
:
- PCRE2
- PRCRE
- Go
- Java 8
- .NET (C#)
- Rust
Please note that your point (4) is essentially my point (0). That is, there seems to be a high likelihood that not all the major problems will be resolved in the near term (for whatever reason), so it makes sense to prepare for that eventuality.
How would it be if I reworked what I have to make it clear which sentences are purely descriptive of "buggy" behavior that will hopefully change?
Another thought is that it should be possible to rejigger the FLAGS and use the Extended Group functionality to
overcome most of the major issues. This would definitely be hokey - e.g., mapping the user's "p" FLAG to "m", and the user's "m" flag to (?ms:_)
along the lines that @annettejanewilson illustrated. Possible, but maybe not so easy.
What do you think?
(*) Footnote: https://regex101.com/
(a) The jq documentation is muddled in that the descriptions of the "m" and "s" have been incorrectly swapped (that is, "m" should (in accordance with universal agreement) refer to ^ and $; and "s" to .)
I'm uncertain this is accurate. My current understanding is this:
- Most regex libraries/tools (e.g. regex101, Python, Perl, Rust) use "m"/"multi-line" mode to mean that
^
and$
match at every newline (as opposed to only at the start and end of the string), and use "s"/"single-line" mode to mean that.
matches\n
(when normally it wouldn't). - Some don't. For example, Ruby uses "m" to mean that
.
matches\n
. - In particular, Oniguruma uses ONIG_OPTION_MULTILINE to mean that
.
matches\n
and ONIG_OPTION_SINGLELINE to mean that^
and$
DO NOT match at newlines. Note that this is not a simple inversion. ONIG_OPTION_SINGLELINE is the negation of Perl's multi-line mode. - The jq documentation accurately states that the
m
flag causes.
to match newlines. - The jq documentation accurately, if tersely and misleadingly, states that the
s
flag causes^
and$
to behave the same as\A
and\Z
and only match at the beginning and end of the string. This is misleading because they already have that behaviour - the flag has no effect. - The jq documentation accurately states that the
p
flag has the effects of both them
ands
flags. As thes
flag has no observable effect, this means it has the same effect as them
flag. - The jq documentation notes that "certain" flags may be specified within the regex, but does not elaborate. This obscures the surprising fact that
(?m)
and(?s)
inside the regex mean different things.(?s)
means what them
flag does, while(?m)
means the opposite of what thes
flag does.
If my understanding here is correct, I don't think it's accurate to say the descriptions of "m" and "s" have been incorrectly swapped in the jq documentation. The documentation is misleading but it's not incorrect with respect to jq's behaviour.
nicowilliams, earlier in the conversation:
Near as I can tell onig_new() doesn't honor ONIG_OPTION_MULTILINE and ONIG_OPTION_SINGLELINE being set in its option argument.
This surprises me, and I wonder whether this is borne of confusion over the meaning and polarity of Oniguruma's options. When I first looked into this I got confused myself. At present I don't believe I've seen evidence that Oniguruma has a bug, but I am definitely feeling like I need to be more thorough in my testing, because it's clear there are areas I've come to different conclusions, and I want to be more certain I'm not working on faulty assumptions.
What are your thoughts about dealing with this for jq 1.7, assuming no fixes will be available for this release?
I don't have recommendations right now, I'm too uncertain until I've had time to research further. I'm a bit wary of inadvertently making the docs worse by compounding misunderstandings.
@annettejanewilson - Yes, "universal" was a bit of an overstatement, but in my defense it agrees not only with the explicit terminology used at https://regex101.com, but also with the tacit agreement of all the regex engines listed there (PCRE2, PCRE, ECMAScript, Python, Golang, Java8, .NET, and Rust), in that when given the FLAGS "m" and "s", they behave in accordance with the description.
At about the same time as you were writing your most recent post, I was revising the text in manual.yml and would ask that you review it again, with a view to:
(1) retaining the distinction between:
a) FLAGS as they appear in the second arg of jq's test
, or as they
appear on the RHS of the delimiters of regexs (e.g. /regex/m
); and
b) single-letter options as used in "Extended Groups" (e.g. (?i:regex)
).
(2) for the terms "multi-line" and "single line", and the FLAGS "m" and "s", using the conventions of https://regex101.com, which agree with your point (1):
-
the "multi-line" FLAG is "m" - "^ and $ match start/end of line" e.g. "a\nb"~/^b/m
-
the "single-line" FLAG is "s" - "dot matches newline" e.g. "a\nb"~/a.b/s
(3) avoiding any reference to the words used in Oniguruma's defined symbols.
In other words, I would like to focus on what jqMaster DOES and what we anticipate jq should do in the future if there is a discrepancy.
Thanks.
p.s. p.s. I thought it would be useful if we could agree on a table showing the various FLAG values, and the behavior that actually goes with them:
-
In the header, the "m" and "s" refer to the FLAG as defined above;
-
In the body, "yes" means the behavior of the engine is in accordance with the descriptions above.
FLAG FLAG Ext Gp option Ext Gp option Ext Gp option
m s i m s
----------- ------------- -------------- ------------- -------------
example "a\nb"~/^b/m "a\nb"~/a.b/s "A"~(?i:a) "a\nb"~(?m:^b) "a\nb"~(?s:a.b)
PCRE2 yes yes yes yes yes
PCRE yes yes yes yes yes
ECMAScript yes yes error error error
Python yes yes error error error
Golang yes yes yes yes yes
Java8 yes yes yes yes yes
.NET yes yes yes yes yes
Rust yes yes yes yes yes
jaq true true true true true
jq false! false! true true true
I just wanted to explicitly add there is currently no way to change the behavior of ^/$ even if the documentation is also confusing and/or wrong, and that i would think is definitely a bug somewhere.
1504:0:karl@Amalthea:~$ ~/bin/jq --version
jq-1.7
1505:0:karl@Amalthea:~$ echo -e "a\nb\nc" | ~/bin/jq -sR 'test("^b"; "")'
false
1506:0:karl@Amalthea:~$ echo -e "a\nb\nc" | ~/bin/jq -sR 'test("^b"; "m")'
false
1507:0:karl@Amalthea:~$ echo -e "a\nb\nc" | ~/bin/jq -sR 'test("^b"; "s")'
false
1508:0:karl@Amalthea:~$ echo -e "a\nb\nc" | ~/bin/jq -sR 'test("^b"; "p")'
false
1509:0:karl@Amalthea:~$ echo -e "a\nb\nc" | ~/bin/jq -sR 'test("^a"; "")'
true
1510:0:karl@Amalthea:~$ echo -e "a\nb\nc" | ~/bin/jq -sR 'test("^a"; "m")'
true
1511:0:karl@Amalthea:~$ echo -e "a\nb\nc" | ~/bin/jq -sR 'test("^a"; "s")'
true
1512:0:karl@Amalthea:~$ echo -e "a\nb\nc" | ~/bin/jq -sR 'test("^a"; "p")'
true