Case Study to design logical TableFunction operator
I am interested in helping to design the spec for TableFunction in substrait and am hoping to track my own development in a way that provides a reference for discussion. As I mentioned in https://github.com/substrait-io/substrait/issues/745#issuecomment-2524165647, I will eventually try to have an implementation of 3 different types of table functions and explore what feels natural and what feels like an antipattern (if anyone already has strong opinions feel free to share, I have very little experience).
The 3 types of table functions:
- A
TableFunctionacting as a leaf operator, e.g.scan_arrow_ipcin duckdb. - A
TableFunctionacting as a transformation of an input table to an output table. For this, I want to implement something a bit unique: a function that maintains the cardinality of the table (output rows == input rows) but applies a function across the columns, thereby changing the schema (like a projection, e.g. [, , , ..., <datacolN>] -> [ , ]. - A
TableFunctionacting like a fused operator, e.g. "GroupJoin" as in Accelerating Queries with... Join by GroupJoin.
I'll add more information to this issue as I prototype, if anyone has recommendations on a different way to track or reference the work, let me know and I can adjust as we like.
I think about it more like two distinct behaviors as opposed to three.
A. Generator table function. Takes in only constant arguments and produces 0..N records. Operates as leaf in a tree. B. Set-based table function: takes in a set of records and adds one or more additional columns to each.
For your type 2, I think of that as a window function which possibly excludes certain (or all) input columns. I guess the one distinction is you want a window function that returns multiple output values...
Type A (your type 1) feels pretty simple and that we have most of the low-level concepts to build against. I could see:
- Add a new table function extension type.
- Declare input arguments for table function as a collection of constant arguments.
- Define the output to be a collection of output columns/fields rather than a single column
Type B (likely matches your type 3), we need to likely introduce a lateral operator or similar. Can you remind me how some tools like Calcite represent this? The lateral would accept a table function extension type that could be constants or field references. First use would probably be FLATTEN/UNNEST.
Hi, wondering if there is any update on the status of this? I'm interested in spending some time exploring solutions and wanted to make sure I have all available information.
I can update it with some information sometime today, but I never got around to prototyping any implementations.
The short update is that I had talked to the author of A Relational Matrix Algebra and its Implementation in a Column Store to get some understanding of what table functions that can accommodate relational matrix operations might look like. But, I dont remember them having any special layout that warranted any special consideration.
My intention was to be able to accommodate a special layout that I am using in research which is used often in bioinformatics research. In bioinformatics (and other matrix layouts) it's common to have a column represent a dimension of the data. A proper relational normalization might be to model the row and column dimension as relational attributes (basically a compressed sparse column (csc) or compressed sparse row (csr) format). Then, the typical matrix layout can be thought of as the projection from a group by operator. This was the 3rd type of table function I described that I would want to be able to express.
It seems datafusion's table functions return a table provider: user-defined table function.
Based on a table provider scan returning an execution plan, TableProvider::scan, this seems similar to the 2nd type of table function I imagined. I can imagine an implementation that references a separate relation tree to be run on the input operator and that produces the output.
Then, I think the mapping from my type 2 to jacques's type b is just a difference between what columns get emitted with the caveat of cardinality changes (in my type 2 and datafusion's table provider approach, cardinality of the input is independent of the output. I believe accommodating this with jacques's type would require duplication of input rows, potentially like the aggregation node does for many grouping sets)