substrait icon indicating copy to clipboard operation
substrait copied to clipboard

Type Variation: clarify the spec around Funtion Behavior

Open yongchul opened this issue 8 months ago • 0 comments

Type Variation Issues around Function Behavior

Type variations are referring different physical representations (e.g., physical encoding) of a logical type. The concept has not been fully integrated in the type system thus leaving ambiguities in the implementation.

Function Behavior

Type variation as of today defines two function behaviors: INHERITS and SEPARATE.

  • With an INHERITS, a value of the specific variation can bind where its base type is accepted. The consumer is assumed to handle the variation correctly.
  • With a SEPARATE, a value of the specific variation can not bind where its base type is accepted but must be bound to specialized place holders (e.g., specialized function).

There are three problems here.

Type variation of the return type

If a function returns the same type as one of its arguments, what is the type variation of the return type? Will it reuse the variation of one of the parameters or clear out the variation (i.e., system preferred)?

What need to be done: extend the return type derivation to incorporate the type variations.

It could be similar as NULL propagation (i.e., a couple of predefined behavior like override to specific, inherit the first) or a full expressions including conditional statement (e.g., if variation x and variation y, then variation z).

Binding to specialized functions

With SEPARATE behavior, there is no way to define a function specialized to a specific variation because there is no way to express the type variation in the YAML definition as well as the unique key construction because they are defined in terms of the "base/parent" type of a type variation.

What need to be done: the syntax needs to be extended so that we can express the specialized functions.

In today's syntax, the type variation is referred by a number in square brackets where 0 is reserved for system preferred (e.g., struct[0]). Unfortunately, the numbers (not explicitly stated but many aggree that this is from the SimpleExtensionDeclaration.extension_type_variation) are not known in advance and leave the YAML definitions context-sensitive rather than context-free (e.g., the meaning of coolstuff(struct[1]) function changes per Substrait plan).

So the syntax should accept a type variation name with an optional qualified YAML reference (e.g., ext.myvariation1 where ext is declared as a dependent YAML defining myvariation1 type variation).

Toil of SEPARATE

INHERITS and SEPARATE are two extreme of the spectrum in terms of bindings. If we have a small specialized type (almost equivalent to the base type modulo a couple of different functional behaviors), the toil of going full SEPARATE or user-defined type is too high.

What need to be done: It would be great if we can reduce the toil.

Potential Solution: EXTENDS

One way is to introducing something in between INHERITS and SEPARATE, say EXTENDS, where we allow a small number of fucntions specialized for the variation so that it binds to those specialized if exists, this can greatly reduce the toil given above points (variation of return type and binding to specialized functions) addressed.

The semantic of EXTENDS is precisely between INHERITS and SEPARATE, or the same as dynamic dispatch: it binds to the base unless specialization is defined.

A caveat of this approach is that the binding resolution can be complex. If you have an EXTENDS variation, you always first tries to look up specialization first, then fallback to base type if binding fails. The problem arises when there are multiple arguments with type variations. In theory, you will have to try all possible combinations of the variations but we can prescribe simpler algorithm like a linear resolution (i.e., from left to right, try to bind specialized to base).

yongchul avatar Apr 23 '25 16:04 yongchul