feat: clarify the expected behavior, and rationale, of the post join filter
The post join filter has very little explanation. It can also be confusing because, from a purely logical perspective, it is possible to see the post join filter as redundant. This PR attempts to clarify the description of the post join filter.
A join relation with Join Expression X and Post-Join Filter Y is equivalent to a join relation with Join Expression X AND Y
Is this strictly true? As in a consumer must resolve both expressions on the same inputs? If so, I think it'd be nice to add a comment in the .proto file to the effect of "post_join_filter should be resolved in conjunction (AND) with expression."
It's difficult to follow the threads in this discussion.
One can think of a a join with a post join filter as a composite operation that is a join followed by a filter relation. It is entirely valid translation to take a post join filter out of the join and put in a filter relation directly afterwards and vice versa.
The post join filter does not logically interact with the join type at all. The composite exists because many systems have it and it can be a beneficial physical pattern. The reason it has to be stated separately from the join predicate is to have covering behavior of all possible filter conditions. I always have to remind myself of which conditions can and cannot be moved into a join evaluation clause.
I'm supportive of clarifying the text if people are unclear as to what post join filter means.
One can think of a a join with a post join filter as a composite operation that is a join followed by a filter relation. It is entirely valid translation to take a post join filter out of the join and put in a filter relation directly afterwards and vice versa.
@jacques-n
This is not the conclusion we came to. I believe the content of the PR is still accurate with the threads, so you can just review the content and ignore the discussion.
For example:
SELECT * FROM a LEFT OUTER JOIN b ON a.id = b.id AND b.other_field IS NOT NULL
As it stands this filter will emit one row for each row in a. If the filter is moved into the WHERE clause then it will emit fewer rows (less or equal to the number of rows emitted by an inner join).
From your GPT link this matches:
Conditions on the non-preserved side that would otherwise eliminate rows that should remain when there's no match
There is a thread that discusses the description that would go in the website: discussion on website description
I think further discussion on this can be deferred until agreement on post_join_filter semantics is finalized.
Then, there's a thread discussing the comment in the .proto file for post_join_filter: discussion on comment in spec
This discussion assumes that Weston's assertion in the description is the correct semantics of post_join_filter. As Weston points out, that description is in contradiction to Jacques's comment.
This is not the conclusion we came to
In this PR, we never collectively discussed what it should be versus what it is. The description says:
The post join filter has very little explanation... from a purely logical perspective, it is possible to see the post join filter as redundant.
And I asked:
Is this strictly true? As in a consumer must resolve both expressions on the same inputs? If so, ...
One thing that was referenced in slack is the substrait FAQ: "The post-join filter on the various Join relations is not always equivalent to an explicit Filter relation AFTER the Join." This FAQ then references velox hash-join implementation, which says: "Filter is optional. If specified it can be any expression over the results of the join."
It occurs to me that the FAQ says "post-join filter... is not always equivalent to an explicit Filter relation AFTER the join," yet the referenced velox documentation says "If specified, it can be any expression over the results of the join." These seem directly contradictory to me, since "the results of of the join" sounds quite a bit like "AFTER the join".