zed icon indicating copy to clipboard operation
zed copied to clipboard

collect() and union() skipping null and error("missing") values

Open philrz opened this issue 2 years ago • 3 comments

Repro is with Zed commit 4dc6236.

While helping a community Slack user I bumped into the following surprising result:

$ zq -version
Version: v1.14.0-2-g4dc62369

$ echo '1 2 null 4' | zq -z 'collect(this)' -
[1,2,4]

I would have expected the null value to still be in the array created.

Presumably due to the same effect, here collect() is not even returning an array.

$ echo 'null null' | zq -z 'collect(this)' -
null

union() has similar behaviors.

$ echo '1 2 null 4' | zq -z 'union(this)' -
|[1,2,4]|

$ echo 'null null' | zq -z 'union(this)' -
null

If we decide the current behavior is intentional and something we intend to keep, I'd suggest we start mentioning it in the docs with an explanation.

philrz avatar Feb 16 '24 21:02 philrz

#5041 was opened to change the behavior so collect() would preserve these values, but then we discussed it as a team and have decided to hold off on merging anything yet. Some things that were touched on in the discussion:

  1. There was consensus that if we change this for collect() we should do the same for union() so they're consistent.

  2. One reason for hesitation on merging #5041 is that this change is significant enough that it might break some programs, e.g., those of a particular community zync user that has some particularly large/complex Zed that relies heavily on collect() at times. If Zed were more mature it would probably be appropriate to do a careful/slow transition, e.g., maintain the current behavior with some kind of flag that becomes a new default, then surface a warning to users whose programs are relying on that behavior because we intend to swap the default in a future version, since this would give them time to understand and react to the change. Instead we discussed helping them to audit their programs for possible exposure and also giving a heads-up to the wider community via Slack before switching the behavior. But then we identified other behaviors with aggregate functions we might want to consider, so we concluded that we should gather up our thoughts on all those changes and then figure out a transition plan.

  3. As an example of "other behaviors", @mccanne cited avg() as an example, since here null values are currently treated as "not present", whereas if other aggregate functions like collect() and union() are changed to start preserving the null values, users might now expect a different behavior from avg() too, e.g., treating them as "present" and implying some meaningful value such as 0, which would change the result.

$ zq -version
Version: v1.14.0-2-g4dc62369

$ echo '1 3 5 7' | zq -z 'avg(this)' -
4.

$ echo '1 null 3 5 7' | zq -z 'avg(this)' -
4.

  1. @mattnibs pointed out that if we did toggle the behavior, it would be easy for users to get back the current behavior using the where clause, such as can be seen with his branch from #5041:
$ zq -version
Version: v1.14.0-3-ga33df054

$ echo '1 2 null 4' | zq -z 'collect(this)' -
[1,2,null,4]([(int64,null)])

$ echo '1 2 null 4' | zq -z 'collect(this) where this != null' -
[1,2,4]

philrz avatar Feb 20 '24 23:02 philrz

In the time since this issue was opened, the language has been evolving into SuperSQL (currently at commit 4084e01) and so mimicking SQL becomes our typical starting position when picking behaviors. In this regard, it does seem like preserving the null values would indeed still be the way to go.

For instance, we don't yet have support for the SQL aggregate function by the name ARRAY_AGG(), but this is what provides the functional equivalent to what we've traditionally done with collect(). As we can see using DuckDB to run through the same examples shown above, the null values are preserved.

$ duckdb --version
v1.1.3 19864453f7

$ cat data.json 
{"foo":1}
{"foo":2}
{"foo":null}
{"foo":4}

$ duckdb -c 'select array_agg(foo) from data.json'
┌─────────────────┐
│ array_agg(foo)  │
│     int64[]     │
├─────────────────┤
│ [1, 2, NULL, 4] │
└─────────────────┘

$ cat nulls.json 
{"foo":null}
{"foo":null}

$ duckdb -c 'select array_agg(foo) from nulls.json'
┌────────────────┐
│ array_agg(foo) │
│     json[]     │
├────────────────┤
│ [NULL, NULL]   │
└────────────────┘

As for the example with avg() noted above, it does seem to skip them there, i.e., the calculation is still based on only the actual numeric values shown.

$ cat numbers.json 
{"foo": 1}
{"foo": null}
{"foo": 3}
{"foo": 5}
{"foo": 7}

$ duckdb -c 'select avg(foo) from numbers.json '
┌──────────┐
│ avg(foo) │
│  double  │
├──────────┤
│      4.0 │
└──────────┘

As for the concerns about whether this change could break users' programs, we're already making a lot of other significant changes as part of the transition to SQL, so the direction things are headed we're likely going to just have to publish some docs to help guide users through modifying their programs, so this would likely just be covered as part of that.

philrz avatar Dec 26 '24 21:12 philrz

A community user noticed in #5759 that error("missing") values are being similarly skipped, e.g.,

$ super -version
Version: v1.18.0-366-g7f5ca96c0

$ echo '1 2 error("missing") 4' | zq -z 'collect(this)' -
[1,2,4]

$ echo '1 2 error("missing") 4' | zq -z 'union(this)' -
|[1,2,4]|

In a group discussion, there was some debate if this was maybe the correct behavior, but there was ultimately consensus that the error("missing") should indeed be included, and if the user explicitly wanted it dropped, they could use the quiet function if desired. Modifying the example above to play around with how a user would attempt that:

$ echo '1 2 error("missing") 4' | zq -z 'collect(quiet(this))' -
[1,2,error("quiet"),4]

Upon seeing that, the group consensus was that the error("quiet") should definitely not be present here, since for instance the docs page for quiet says:

Quiet errors are ignored by operators aggregate, cut, and yield.

In conclusion, since this has such a similar flavor to what we saw with null I'm grafting on to this existing issue, but if needs to be addressed separately I can break it out to its own issue.

philrz avatar Apr 04 '25 01:04 philrz