zed Changing "case" match causes join failure

Repro is with Zed commit 6ff0586.

In a recent Slack thread, a community user was getting help debugging their join use case. We were building off this working example of a "self-join".

$ cat all.json 
{"attr": {"addrrss": "1.2.3.4"}, "sessionid": 1, "thing": "evilthing", "s": "start"}
{"attr": {"addrrss": "1.2.3.4"}, "sessionid": 1, "thing": "evilthing", "s": "end", "message": "A match!"}

$ cat self-join.zed 
switch (
  case thing=="evilthing" and s=="start" => Fieldx:=ip(attr.addrrss), sessionstartid:=sessionid
  case thing=="evilthing" and s=="end" => Fieldx:=ip(attr.addrrss), sessionendid:=sessionid
) | left join on sessionstartid=sessionendid did_it_match:=message

$ zq -version
Version: v1.11.1-21-g6ff05865

$ zq -I self-join.zed all.json 
{attr:{addrrss:"1.2.3.4"},sessionid:1,thing:"evilthing",s:"start",Fieldx:1.2.3.4,sessionstartid:1,did_it_match:"A match!"}

The user had some difficulty applying this to their specific data, then circled back to report where they struggled. In their own words:

Omg I found the problem - I was doing a regular search as part of the switch conditions - so something like case "searchterm" and attr=="val"

We can repro their experience by dropping the leading thing== portions in each case, which results in the join no longer producing output.

$ cat case-oops.zed 
switch (
  case "evilthing" and s=="start" => Fieldx:=ip(attr.addrrss), sessionstartid:=sessionid
  case "evilthing" and s=="end" => Fieldx:=ip(attr.addrrss), sessionendid:=sessionid
) | left join on sessionstartid=sessionendid did_it_match:=message

$ zq -I case-oops.zed all.json 
[no output]

What I find interesting is that if I run the case matches from each variant a la carte they produce the same values.

$ zq 'thing=="evilthing" and s=="start"' all.json 
{attr:{addrrss:"1.2.3.4"},sessionid:1,thing:"evilthing",s:"start"}

$ zq 'thing=="evilthing" and s=="end"' all.json 
{attr:{addrrss:"1.2.3.4"},sessionid:1,thing:"evilthing",s:"end",message:"A match!"}

$ zq '"evilthing" and s=="start"' all.json 
{attr:{addrrss:"1.2.3.4"},sessionid:1,thing:"evilthing",s:"start"}

$ zq '"evilthing" and s=="end"' all.json 
{attr:{addrrss:"1.2.3.4"},sessionid:1,thing:"evilthing",s:"end",message:"A match!"}

The canonical Zed tells us a subtle difference: One becomes a where and one becomes a search.

$ zq -C 'thing=="evilthing" and s=="start"'
where thing=="evilthing" and s=="start"

$ zq -C '"evilthing" and s=="start"'
search "evilthing" and s=="start"

My main concern is that the silence from the case-oops.zed example doesn't give the user much to go off. If we need to maintain the behaviors shown here, I'm hoping there's room for a syntax error or runtime error/warning to help the user confirm if they made a mistake.

Dec 04 '23 19:12 philrz

Yeah feels like you case shouldn't allow and expression "evilthing" and s=="start"

Dec 05 '23 21:12 mattnibs

@mccanne pointed out that the runtime error does get surfaced, so if one puts a yield at the correct spot it can be seen.

$ zq -I case-oops-error.zed all.json 
{endbranch:error({message:"not type bool",on:"evilthing"})}
{endbranch:error({message:"not type bool",on:"evilthing"})}
{startbranch:error({message:"not type bool",on:"evilthing"})}
{startbranch:error({message:"not type bool",on:"evilthing"})}

There's surely something to be learned/improved here, but it still seems TBD precisely what/how. A list of ones that come to mind:

Adding the proposed debug/inspect operator proposed in #4487.
As pointed out above by @mattnibs and also mentioned to me by @nwt, as long as case requires a boolean expression, we could catch mistakes like the one above as syntax errors at compile time.
...but even if we did that, @nwt indicated there would still be variations that couldn't be detected at compile time and so there still could be runtime errors.
Knowing there could always be some runtime errors, plus seeing how these top-level error values effectively got swallowed up by the downstream join, had me wondering if there could be other ways to surface these so they're harder to miss, e.g., similar to how stderr and stdout at the Unix shell go to different places unless I do things like 2>&1.
Maybe case could start to take search expressions? Not sure of the downsides to that, though.
I can see the docs could stand to be more explicit about what precisely defines a "boolean expression" and such text could be linked to from places like the switch doc.

Dec 06 '23 19:12 philrz

Ok actually the runtime error is surfaced in the switch statement, it just gets suppressed in the join because it doesn't pass the equi join test with its counter part. Perhaps if join encounters an error value it should emit it?

Dec 11 '23 19:12 mattnibs

We took a shot at this in #4931 but then realized it was not the right approach. Will revisit at a later time.

Jan 04 '24 19:01 philrz