zed icon indicating copy to clipboard operation
zed copied to clipboard

Order of join result differs with vector runtime

Open philrz opened this issue 10 months ago • 1 comments

Repro is with super commit 7a5d7b95.

As pointed out in https://github.com/brimdata/super/issues/5802#issuecomment-2797782585, the result of the Left Join example from the join tutorial differs when run in vector runtime.

$ super -version
Version: v1.18.0-403-g7a5d7b95c

$ super -c "
file fruit.json
| left join (
  file people.json
) on flavor=likes eater:=name,age"
{name:"figs",color:"brown",flavor:"plain",eater:"jessie",age:30}
{name:"avocado",color:"green",flavor:"savory"}
{name:"banana",color:"yellow",flavor:"sweet",eater:"quinn",age:14}
{name:"strawberry",color:"red",flavor:"sweet",eater:"quinn",age:14}
{name:"dates",color:"brown",flavor:"sweet",note:"in season",eater:"quinn",age:14}
{name:"apple",color:"red",flavor:"tart",eater:"morgan",age:61}
{name:"apple",color:"red",flavor:"tart",eater:"chris",age:47}

$ SUPER_VAM=1 super -c "
file fruit.json
| left join (
  file people.json
) on flavor=likes eater:=name,age"
{name:"apple",color:"red",flavor:"tart",eater:"morgan",age:61}
{name:"apple",color:"red",flavor:"tart",eater:"chris",age:47}
{name:"banana",color:"yellow",flavor:"sweet",eater:"quinn",age:14}
{name:"avocado",color:"green",flavor:"savory"}
{name:"strawberry",color:"red",flavor:"sweet",eater:"quinn",age:14}
{name:"dates",color:"brown",flavor:"sweet",note:"in season",eater:"quinn",age:14}
{name:"figs",color:"brown",flavor:"plain",eater:"jessie",age:30}

philrz avatar Apr 13 '25 18:04 philrz

After discussing with the team, I now understand that it was incorrect for me to frame this as a bug. It's actually an expected consequence of an enhancement and hence something I can work on mentioning in the docs to the degree that users need to be aware.

To summarize, in the past our join implementation relied on the left/right inputs being sorted. At one time users were required to ensure this themselves by either being confident their inputs were already ordered correctly or adding explicit sort operator(s) upstream of the left/right inputs. Later we enhanced our optimizer so that the necessary sorting would be done automatically and that relieved the users from having to worry about it. However, the hash join implementation added in #5659 no longer relies on left/right inputs being sorted, and having our optimizer still trigger the sorting of inputs automatically just to ensure the output order was exactly the same as it had been with the non-hash join would have a performance cost. Since it's hard to know for sure if any users were relying on our join's output order remaining exactly the same, the consensus is to stick with the current behavior and hence lead with the best performance. If enough users speak up and say that the change was disruptive and they have a strong desire to have the old behavior back even if it impacts performance, we could consider adding some backward-compatible mode for their benefit at that time.

For now, the exercise therefore morphs into one of documentation. There's already spots in the user-facing docs where we set expectations that the relative order of query results may not follow the order of the input data (specifically when the pipeline has been split with an operator like fork or switch, or when an aggregation is performed) and so join would become another item added to this list. We've also recognized the need to write some kind of guide for users migrating their queries from the Zed era to the SuperDB era, and this would surely be worth mentioning in that doc as well.

philrz avatar Apr 16 '25 19:04 philrz

The changes in #6052 have simplified this a bit since now the vectorized join is used in all circumstances, so the need for an explicit sort to guarantee stable output is now consistent in all join contexts. Furthermore, as we've continued to adopt more SQL behaviors, users will be less likely to make assumptions about default sort order. I'll hold onto this issue until the draft of the new "book" docs has some coverage on the topic.

philrz avatar Jul 23 '25 16:07 philrz

The current coverage of join in the book docs draft says plainly:

The output order of the joined values is undefined.

This seems to set expectations appropriately. I've also included this topic in a list of things to touch on in an eventual Zed/SuperSQL transition guide.

Closing this issue.

philrz avatar Oct 05 '25 17:10 philrz