feat: add anchor for non-ordinal plan reference
This adds an "anchor" to PlanRel that can be referenced from a ReferenceRel. This anchor and reference relationship provides a non-ordinal method for identifying and accessing a "subtree" or sub-graph. To maintain consistency with type_anchor and function_anchor, the new subtree_anchor attribute is uint32.
This PR leaves in the original subtree_ordinal attribute since it seems a (mildly) more performant method for referencing a subtree, but also since it is still relevant in the typical case. The new anchor attribute is an improvement for cases where multiple plans are merged and at least one already contains a ReferenceRel.
It is expected that only one of subtree_ordinal or subtree_reference will be used, however I don't see a good reason to enforce the use of only one, so I did not group the attributes in a oneof constraint.
If either of the above points ((1) keeping subtree_ordinal and (2) not enforcing a oneof constraint) are changed, then this PR would likely become a breaking change.
Also, this PR addresses #725 , but uses the name subtree_reference and subtree_anchor (instead of plan_anchor) to be consistent with the pre-existing subtree_ordinal attribute. I don't know that relations in a substrait plan are strictly trees, so I think this could be a good time to change "subtree" to "planrel" or "subgraph" or something similar.
BREAKING CHANGE: The field subtree_ordinal is made optional with explicit presence due to being moved into a new oneof field, "ref_type".
This PR leaves in the original
subtree_ordinalattribute since it seems a (mildly) more performant method for referencing a subtree
This is technically true, but on the other hand we are already doing the same thing with function extensions when the number of functions used in a typical plan most likely will be a lot bigger than the number of subtrees. I don't think it makes sense to keep the existing behavior only due to performance considerations.
It is expected that only one of
subtree_ordinalorsubtree_referencewill be used, however I don't see a good reason to enforce the use of only one, so I did not group the attributes in aoneofconstraint.
I think leaving both options especially w/o oneof will lead to a lot more confusion and possible divergence in implementations. My choice would be to deprecate subtree_ordinal and eventually get rid of it, but oneof is also a viable option.
If either of the above points ((1) keeping
subtree_ordinaland (2) not enforcing aoneofconstraint) are changed, then this PR would likely become a breaking change.
Pretty sure putting an existing field in a oneof is a backwards-compatible change.
I should have time to finish this up in the next few days
rebased on main
hm, according to a stackoverflow post, moving an optional field into a oneof is not breaking. This PR essentially makes subtree_ordinal an optional field instead of an implicit field by moving it into a oneof, but that seems backwards-compatible. However, the check for breaking changes still flags it so I decided to add a BREAKING CHANGES: footer to the PR description.
By making the anchor on the plan object does that mean you can no longer use a ReferenceRel to point to a subtree of the existing plan? For instance, TPC-H 15 processes the same subtree twice and often ends up with different results (when using float). By using a reference you could process the subtree once and point to it avoiding differences.
By making the anchor on the plan object...
I think you may be misunderstanding where the anchor is? I don't think of the PlanRel as the Plan object
...does that mean you can no longer use a ReferenceRel to point to a subtree of the existing plan?
The change is essentially to change a relative ID (ordinal) to an absolute ID (anchor). I think in either case you need to know what you're pointing to and you can point to a subtree of the existing plan.
@vbarua I'm not opposed to planrel_reference, but one slight caveat regarding naming: this field will most likely act as a reference to the list of relations in ExtendedExpression message as well (see #794). In my PR I decided to introduce new ExtendedExpressionRel message to hold Rels. If we go that route, calling the field planrel_reference doesn't make as much sense anymore. Alternatively we can reuse PlanRel message for ExtendedExpression relations, which is doable but slightly awkward, because relations in PlanRel for extended expressions will only ever be of type Rel, never RelRoot.
I think more context about the interaction with ExtendedExpressionRel would be helpful.
I commented on #794 (moved my comment from here).
@tokoko left a comment on your proposal in https://github.com/substrait-io/substrait/pull/794 with regards to including relations in ExtendedExpressions.
I accommodated feedback from @vbarua and rebased on main
@tokoko @vbarua as long as planrel_reference is a new field in ReferenceRel anyways, what if we actually nest it in a oneof, that way we can be explicit about other things the reference may point to (e.g. extendedexpr_reference) in a way that is a bit future proof (although a bit more verbose). Or, we can change it to be anchor_reference and have an enum that specifies what the expected type of message being pointed to is?