collect() and union() skipping null and error("missing") values
Repro is with Zed commit 4dc6236.
While helping a community Slack user I bumped into the following surprising result:
$ zq -version
Version: v1.14.0-2-g4dc62369
$ echo '1 2 null 4' | zq -z 'collect(this)' -
[1,2,4]
I would have expected the null value to still be in the array created.
Presumably due to the same effect, here collect() is not even returning an array.
$ echo 'null null' | zq -z 'collect(this)' -
null
union() has similar behaviors.
$ echo '1 2 null 4' | zq -z 'union(this)' -
|[1,2,4]|
$ echo 'null null' | zq -z 'union(this)' -
null
If we decide the current behavior is intentional and something we intend to keep, I'd suggest we start mentioning it in the docs with an explanation.
#5041 was opened to change the behavior so collect() would preserve these values, but then we discussed it as a team and have decided to hold off on merging anything yet. Some things that were touched on in the discussion:
-
There was consensus that if we change this for
collect()we should do the same forunion()so they're consistent. -
One reason for hesitation on merging #5041 is that this change is significant enough that it might break some programs, e.g., those of a particular community zync user that has some particularly large/complex Zed that relies heavily on
collect()at times. If Zed were more mature it would probably be appropriate to do a careful/slow transition, e.g., maintain the current behavior with some kind of flag that becomes a new default, then surface a warning to users whose programs are relying on that behavior because we intend to swap the default in a future version, since this would give them time to understand and react to the change. Instead we discussed helping them to audit their programs for possible exposure and also giving a heads-up to the wider community via Slack before switching the behavior. But then we identified other behaviors with aggregate functions we might want to consider, so we concluded that we should gather up our thoughts on all those changes and then figure out a transition plan. -
As an example of "other behaviors", @mccanne cited
avg()as an example, since herenullvalues are currently treated as "not present", whereas if other aggregate functions likecollect()andunion()are changed to start preserving thenullvalues, users might now expect a different behavior fromavg()too, e.g., treating them as "present" and implying some meaningful value such as0, which would change the result.
$ zq -version
Version: v1.14.0-2-g4dc62369
$ echo '1 3 5 7' | zq -z 'avg(this)' -
4.
$ echo '1 null 3 5 7' | zq -z 'avg(this)' -
4.
- @mattnibs pointed out that if we did toggle the behavior, it would be easy for users to get back the current behavior using the
whereclause, such as can be seen with his branch from #5041:
$ zq -version
Version: v1.14.0-3-ga33df054
$ echo '1 2 null 4' | zq -z 'collect(this)' -
[1,2,null,4]([(int64,null)])
$ echo '1 2 null 4' | zq -z 'collect(this) where this != null' -
[1,2,4]
In the time since this issue was opened, the language has been evolving into SuperSQL (currently at commit 4084e01) and so mimicking SQL becomes our typical starting position when picking behaviors. In this regard, it does seem like preserving the null values would indeed still be the way to go.
For instance, we don't yet have support for the SQL aggregate function by the name ARRAY_AGG(), but this is what provides the functional equivalent to what we've traditionally done with collect(). As we can see using DuckDB to run through the same examples shown above, the null values are preserved.
$ duckdb --version
v1.1.3 19864453f7
$ cat data.json
{"foo":1}
{"foo":2}
{"foo":null}
{"foo":4}
$ duckdb -c 'select array_agg(foo) from data.json'
┌─────────────────┐
│ array_agg(foo) │
│ int64[] │
├─────────────────┤
│ [1, 2, NULL, 4] │
└─────────────────┘
$ cat nulls.json
{"foo":null}
{"foo":null}
$ duckdb -c 'select array_agg(foo) from nulls.json'
┌────────────────┐
│ array_agg(foo) │
│ json[] │
├────────────────┤
│ [NULL, NULL] │
└────────────────┘
As for the example with avg() noted above, it does seem to skip them there, i.e., the calculation is still based on only the actual numeric values shown.
$ cat numbers.json
{"foo": 1}
{"foo": null}
{"foo": 3}
{"foo": 5}
{"foo": 7}
$ duckdb -c 'select avg(foo) from numbers.json '
┌──────────┐
│ avg(foo) │
│ double │
├──────────┤
│ 4.0 │
└──────────┘
As for the concerns about whether this change could break users' programs, we're already making a lot of other significant changes as part of the transition to SQL, so the direction things are headed we're likely going to just have to publish some docs to help guide users through modifying their programs, so this would likely just be covered as part of that.
A community user noticed in #5759 that error("missing") values are being similarly skipped, e.g.,
$ super -version
Version: v1.18.0-366-g7f5ca96c0
$ echo '1 2 error("missing") 4' | zq -z 'collect(this)' -
[1,2,4]
$ echo '1 2 error("missing") 4' | zq -z 'union(this)' -
|[1,2,4]|
In a group discussion, there was some debate if this was maybe the correct behavior, but there was ultimately consensus that the error("missing") should indeed be included, and if the user explicitly wanted it dropped, they could use the quiet function if desired. Modifying the example above to play around with how a user would attempt that:
$ echo '1 2 error("missing") 4' | zq -z 'collect(quiet(this))' -
[1,2,error("quiet"),4]
Upon seeing that, the group consensus was that the error("quiet") should definitely not be present here, since for instance the docs page for quiet says:
Quiet errors are ignored by operators
aggregate,cut, andyield.
In conclusion, since this has such a similar flavor to what we saw with null I'm grafting on to this existing issue, but if needs to be addressed separately I can break it out to its own issue.