Explicitly confirming the absence of results
Repro is with Zed commit d64a090.
We started a debate years ago about this difference between SQL and Zed behaviors. Start with this data.csv test data.
id,key,val
1,max,hi
2,zap,my
3,patty,name
4,zap,is
5,max,phil
6,gris,these
7,carrot,are
8,thomas,a
9,max,bunch
10,max,of
11,zap,lines
In DuckDB, we'll search & count to determine the presence and absence of a particular field value.
$ duckdb
v0.8.1 6536a77232
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D create table pets as select * from 'data.csv';
D select count(*) from pets where key='max';
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 4 │
└──────────────┘
D select count(*) from pets where key='foo';
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 0 │
└──────────────┘
Compare this with Zed behavior.
$ zq -version
Version: v1.8.1-63-gd64a0909
$ zq 'grep("max") | count()' data.csv
4(uint64)
$ zq 'grep("foo") | count()' data.csv
[no output]
$ echo $?
0
The last time this was debated, there were differing opinions about if Zed should adapt to SQL's behavior of having count() always return a scalar (so, 0 in this case) vs. the way it's currently silent and leaving that for the user to interpret.
Having left the topic alone for some time while the project has evolved, here's my updated thinking on the topic. I sense we may not be able to definitively settle on one true behavior. Instead, like we've done in other areas, perhaps we can employ Zed's first-class errors to make it possible to test for this condition.
One thought I had was that if pipeline elements are able to detect the condition that their upstream dataflow has never produced results and has effectively shut down (is that called EOF in the code?), the pipeline element (count() in this case) could raise error(missing), so the user would have the option to invoke quiet() if they wanted to restore the prior behavior of seeing pure silence when there's no results.
Another approach might be to let pipeline elements behave as they currently do but have an operator with the specific role of detecting this condition & raising such an error. Since we already have a pass operator, perhaps this could be a new option on pass. For illustrative purposes I'll call it -mustflow. So if the user wanted the SQL-like behavior it might look like:
$ zq 'grep("foo") | count() | pass -mustflow | yield is_error(this) ? 0 : this' simple-comma.csv
0
I discussed this one with @mccanne and @nwt where we paged in more of the debate from the last time this topic came up. When initially confronted with a simple count() or sum() we've initially all found ourselves tempted to match the SQL behavior and return a 0 rather than silence. However, we paused when considering what should happen when a by clause is added. In this case instead of a 0 SQL returns "a table with no rows`", e.g.:
D select sum(id) from pets where key='foo' group by key;
┌─────────┐
│ sum(id) │
│ int128 │
├─────────┤
│ 0 rows │
└─────────┘
Zed doesn't have the concept of "tables" so if we were trying to match SQL it's not clear what we'd do here.
In the end we came back to agreeing that Zed's current behavior of returning "nothing" in these situations seems defensibly reasonable. And for the example I cited where a user wants the pipeline to produce some kind of explicit output/error when a branch of a pipeline produces no data, @mccanne agreed that something like the proposed pass -mustflow should indeed do the trick and would not be difficult to implement.
Another user independently asked about this functionality in a community Slack thread. in their own words:
greetings! question for y'all - if search doesn't find anything, what is it returning?
❯ echo '{a:null}' | zq 'search not_found' - ❯ echo '{a:null}' | zq 'search not_found | yield this' - ❯ echo '{a:null}' | zq 'search not_found | yield this=="hey" ? "found" : "not found" | yield this' -Is there a way to construct a conditional or continue a query if search comes up with nothing?
@nwt replied with an interim approach that uses existing building blocks which he admitted was "pretty convoluted".
$ echo '{a:"search_for_me"}' | zq '
fork (
=> search search_for_me | yield {k:0,v:"found"}
=> yield {k:1,v:"not found"}
) |
sort k |
head 1 |
yield v
' -
"found"
$ echo '{a:null}' | zq '
fork (
=> search search_for_me | yield {k:0,v:"found"}
=> yield {k:1,v:"not found"}
) |
sort k |
head 1 |
yield v
' -
"not found"
fwiw, without having the benefit of time and past discussions, I'd vote that search <sexpr> should return something simple, like false or an empty array [].
For those that may be watching this issue, in addition to the approach shown above that can be made to work for any situation where you want to see certain output instead of what would otherwise be silence, for the specific case of wanting to do this with count(), there's a more direct way. Starting from the count() example shown above that showed no output:
$ super -version
Version: v1.18.0-47-g940ac2fa
$ zq 'grep("foo") | count()' data.csv
[no output]
Moving the grep() (which was an implied where operator here) to a where clause in the aggregation gives the explicit 0 result in this case.
$ zq 'count() where grep("foo")' data.csv
0(uint64)
An example of this has been added to the count() docs.
I came across a variation of this tonight:
❯ super -z -c '[] | over this | where a==1 | yield f"<{this}>"'
❯
I just feel like this is broken and/or wrong. To have where with no matches produce a kind of "nothing" thing that's not an empty array or null or ... anything, feels weird (and in my particular case where I ran into this again, inconvenient).
Maybe, to pivot from the earlier idea for is_error(this), is_empty(this) ... or somesuch? I'd probably still vote for null though.
At first, I thought this seemed like a decent workaround, but then I went back to the empty array
❯ super -z -c '[{a:3},{a:2}] | over this | a==1 ? this : null | collect(this)'
null
❯ super -z -c '[{a:1},{a:2}] | over this | a==1 ? this : null | collect(this)'
[{a:1}]
❯ super -z -c '[] | over this | a==1 ? this : null'
❯
To put myself in y'all's shoes, I understand that if null takes the place of something, and [null] is an array with null in it, and [] is an array with literally nothing in it, then [] | over this can only return ... this magical nothing ... it's a bit awkward.
But as a user who just would like to get on with what he was doing ... I'd love some way out of the philosophical dilemma 😆
Thanks @chrismo. Speaking as someone that bumped into this myself (hence why the issue existed 😄), it's helpful to have an external user's perspective.
First I'll revisit some things shown earlier in this issue, since I now have a bit more clarity. Starting from the COUNT in SQL that produces an explicit 0, whereas SuperDB shows nothing:
$ duckdb -c "SELECT COUNT(*) from 'data.csv' WHERE key='foo';"
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 0 │
└──────────────┘
$ super -version
Version: v1.18.0-366-g7f5ca96c0
$ super -c "SELECT COUNT(*) from 'data.csv' WHERE key='foo';"
[no output]
I previously was confused because DuckDB seemed to show "nothing" for aSUM when there was similarly no matching key for the WHERE clause:
$ duckdb -c "SELECT SUM(id) from 'data.csv' WHERE key='foo';"
┌─────────┐
│ sum(id) │
│ int128 │
├─────────┤
│ │
└─────────┘
$ super -c "SELECT SUM(id) from 'data.csv' WHERE key='foo';"
[no output]
But it turns out the result is a NULL, but here we're just seeing a UI thing: For some reason they choose to show nothing as a default, but the -nullvalue option is available to change this.
$ duckdb -nullvalue NULL -c "SELECT SUM(id) from 'data.csv' WHERE key='foo';"
┌─────────┐
│ sum(id) │
│ int128 │
├─────────┤
│ NULL │
└─────────┘
Postgres is much the same, where \pset null null can be used to put something instead of the nothing.
postgres=# SELECT SUM(id) from data WHERE key='foo';
sum
-----
(1 row)
postgres=# \pset null null
Null display is "null".
postgres=# SELECT SUM(id) from data WHERE key='foo';
sum
------
null
(1 row)
In conclusion, because one of the goals of SuperSQL is to have behavior similar to Postgres, we'll definitely plan to mirror what's shown above (the 0 in the COUNT example, NULL in the SUM example) for SQL queries, though we may not get to it until after the initial GA release of SuperDB. So if you're game start some queries with SQL in these situations, you should be able to get the null you seek and do more stuff with it via pipes if that's your preference.
While that covers behaviors for SQL-based queries, we're not yet certain if we'd mirror the same behaviors in queries expressed purely via pipes (e.g., like your original examples), since that's where we've got some flexibility to do things differently than SQL if we think there's room for improvement. Your vote for null above is duly noted though.
But as a user who just would like to get on with what he was doing ... I'd love some way out of the philosophical dilemma 😆
Ha! So - I found myself in a "purer" querying context on something I was working on last night - and I needed to add a where to the first part of a fairly involved 3 line bit of spq, and didn't even think about how great it was that if nothing matched, the whole command would just fall through without any problems.
I reviewed the code right before I checked it in and this whole conversation came back suddenly, and I think I literally slapped my forehead!
I think it speaks to how much super is bringing together the classic Database Query with follow-on processing usually done in The Programming Language ... and how much there can be a clash of expectations from my personal experience: when I'm more interested in the former, a no-match where being the end of the line is "normal"; when I'm more interested in the latter, I expect to be able to conditionally work with the output and that's the "normal" expectation then.
Anyway - nothing very material here, just wanted to share that. I'm sure there's an elegant way through this, maybe where stays where and other criteria can be used when I need to continue on?
@chrismo (and whoever else might be watching this issue): Working with some of the recently-added enhancements such as the initial subquery support in #6070 made me think of this issue, as it does indeed open up another building block to achieve what was originally sought here.
Using recent super commit cc6b1e6, let's start once again with contrasting examples where our search via grep gives a number or "nothing".
$ super -version
Version: cc6b1e6c2
$ super -c "from data.csv | grep('max') | count()"
4::uint64
$ super -c "from data.csv | grep('foo') | count()"
[no output]
Once we put the whole of this operation in a subquery, now we get a "something" in the form of a null in the latter case.
$ super -c "values {result: (from data.csv | grep('max') | count())}"
{result:4::uint64}
$ super -c "values {result: (from data.csv | grep('foo') | count())}"
{result:null}
This, in turn, means we can add conditional logic to turn that null into something else when we want, e.g.,
$ super -c "
values {result: (from data.csv | grep ('max') | count())}
| values (result is null) ? 'not found' : result"
4::uint64
$ super -c "
values {result: (from data.csv | grep ('foo') | count())}
| values (result is null) ? 'not found' : result"
"not found"
I'm not sure if this is enough to be deemed success or if one of the more direct approaches described above might still be justified, but I'll be discussing it with the team. In the meantime I figured I'd write it up here in case anyone watching the issue has a reaction to share.
Ah, very interesting! Thx for staying diligent and connecting the dots here. :)