jq icon indicating copy to clipboard operation
jq copied to clipboard

capture returns inconsistent results for optional named group

Open elmaimbo opened this issue 1 year ago • 2 comments

Describe the bug I've just upgraded from JQ version 1.6 to 1.7, and noticed that the capture function is returning an empty string instead of null for an optional named group that doesn't match, but only when no other part of the regex matches.

For example, if your regex contains (?<x>a)?, then if the overall regular expression matches, the output from capture will contain a field x which should either have the value "a" if the capturing group is present, or null if it isn't. The problem is that there are some cases where x has the value "" (i.e. empty string).

To Reproduce Running the following code shows the issue: jq -cn '"a","b","c" | capture("(?<x>a)?b?")'

The third line that is output from the command above is wrong because there is no valid case where x can have the value "".

Expected behavior A capturing group that is followed by "?" should have the value null if the capturing group isn't present.

i.e. The expected result from running the example code above should be:

{"x":"a"}
{"x":null}
{"x":null}

However the output produced by JQ versions 1.7 and 1.7.1 are:

{"x":"a"}
{"x":null}
{"x":""}

i.e. The third line produced an x field with the value "" instead of null.

(FYI JQ 1.6 produces {} as the third line of output, which is also arguably wrong, but IMHO is better than what JQ 1.7 produces.)

Environment (please complete the following information):

  • OS and Version: Linux Ubuntu 23.10
  • jq version 1.7

elmaimbo avatar Apr 15 '24 09:04 elmaimbo

The change from empty object to object with empty string seem to be have happen in https://github.com/jqlang/jq/commit/01dfd8b86d1248e80d868523c63c87356bca9c6f

The comment in the code says it's about "Zero-width match" but that dosen't seem to be true in this case? the group did not match at all or?

I wonder if the code here https://github.com/jqlang/jq/blob/master/src/builtin.c#L931 should do the same region->beg[i] == -1 check as the code below? is that how oniguruma signals that a group did not match?

wader avatar Apr 15 '24 10:04 wader

With:

diff --git a/src/builtin.c b/src/builtin.c
index ebc1863..0fbdfdf 100644
--- a/src/builtin.c
+++ b/src/builtin.c
@@ -930,8 +930,13 @@ static jv f_match(jq_state *jq, jv input, jv regex, jv modifiers, jv testmode) {
         jv captures = jv_array();
         for (int i = 1; i < region->num_regs; ++i) {
           jv cap = jv_object();
-          cap = jv_object_set(cap, jv_string("offset"), jv_number(idx));
-          cap = jv_object_set(cap, jv_string("string"), jv_string(""));
+          if (region->beg[i] == -1) {
+            cap = jv_object_set(jv_object(), jv_string("offset"), jv_number(-1));
+            cap = jv_object_set(cap, jv_string("string"), jv_null());
+          } else {
+            cap = jv_object_set(cap, jv_string("offset"), jv_number(idx));
+            cap = jv_object_set(cap, jv_string("string"), jv_string(""));
+          }
           cap = jv_object_set(cap, jv_string("length"), jv_number(0));
           cap = jv_object_set(cap, jv_string("name"), jv_null());
           captures = jv_array_append(captures, cap);

I get:

$ ./jq -cn '"a","b","c" | capture("(?<x>a)?b?")'
{"x":"a"}
{"x":null}
{"x":null}

$ ./jq -cn '"a","b","c" | match("(?<x>a)?b?")'
{"offset":0,"length":1,"string":"a","captures":[{"offset":0,"length":1,"string":"a","name":"x"}]}
{"offset":0,"length":1,"string":"b","captures":[{"offset":-1,"string":null,"length":0,"name":"x"}]}
{"offset":0,"length":0,"string":"","captures":[{"offset":-1,"string":null,"length":0,"name":"x"}]}

# with group that allows empty match

$ ./jq -cn '"a","b","c" | capture("(?<x>a?)?b?")'
{"x":"a"}
{"x":""}
{"x":""}

$ ./jq -cn '"a","b","c" | match("(?<x>a?)?b?")'
{"offset":0,"length":1,"string":"a","captures":[{"offset":0,"length":1,"string":"a","name":"x"}]}
{"offset":0,"length":1,"string":"b","captures":[{"offset":0,"string":"","length":0,"name":"x"}]}
{"offset":0,"length":0,"string":"","captures":[{"offset":0,"string":"","length":0,"name":"x"}]}

But this changes how "abc" | [match("( )*"; "g")] behaves in this test https://github.com/jqlang/jq/blob/master/tests/onig.test#L1-L4 new result is [{"offset":0,"length":0,"string":"","captures":[{"offset":-1,"string":null,"length":0,"name":null}]},{"offset":1,"length":0,"string":"","captures":[{"offset":-1,"string":null,"length":0,"name":null}]},{"offset":2,"length":0,"string":"","captures":[{"offset":-1,"string":null,"length":0,"name":null}]},{"offset":3,"length":0,"string":"","captures":[{"offset":-1,"string":null,"length":0,"name":null}]}] but i'm not sure i follow how the current expected result is correct? why would the capture group that didn't match have an offset?

wader avatar Apr 15 '24 11:04 wader

@wader Your patch looks correct to me. Could you create a PR?

itchyny avatar Jan 26 '25 05:01 itchyny

@itchyny sure! had forgot about this, here is a PR https://github.com/jqlang/jq/pull/3238

wader avatar Jan 26 '25 15:01 wader