substrait icon indicating copy to clipboard operation
substrait copied to clipboard

Case Study to design logical TableFunction operator

Open drin opened this issue 1 year ago • 5 comments

I am interested in helping to design the spec for TableFunction in substrait and am hoping to track my own development in a way that provides a reference for discussion. As I mentioned in https://github.com/substrait-io/substrait/issues/745#issuecomment-2524165647, I will eventually try to have an implementation of 3 different types of table functions and explore what feels natural and what feels like an antipattern (if anyone already has strong opinions feel free to share, I have very little experience).

The 3 types of table functions:

  1. A TableFunction acting as a leaf operator, e.g. scan_arrow_ipc in duckdb.
  2. A TableFunction acting as a transformation of an input table to an output table. For this, I want to implement something a bit unique: a function that maintains the cardinality of the table (output rows == input rows) but applies a function across the columns, thereby changing the schema (like a projection, e.g. [, , , ..., <datacolN>] -> [, ].
  3. A TableFunction acting like a fused operator, e.g. "GroupJoin" as in Accelerating Queries with... Join by GroupJoin.

I'll add more information to this issue as I prototype, if anyone has recommendations on a different way to track or reference the work, let me know and I can adjust as we like.

drin avatar Dec 06 '24 21:12 drin

I think about it more like two distinct behaviors as opposed to three.

A. Generator table function. Takes in only constant arguments and produces 0..N records. Operates as leaf in a tree. B. Set-based table function: takes in a set of records and adds one or more additional columns to each.

For your type 2, I think of that as a window function which possibly excludes certain (or all) input columns. I guess the one distinction is you want a window function that returns multiple output values...

Type A (your type 1) feels pretty simple and that we have most of the low-level concepts to build against. I could see:

  • Add a new table function extension type.
  • Declare input arguments for table function as a collection of constant arguments.
  • Define the output to be a collection of output columns/fields rather than a single column

Type B (likely matches your type 3), we need to likely introduce a lateral operator or similar. Can you remind me how some tools like Calcite represent this? The lateral would accept a table function extension type that could be constants or field references. First use would probably be FLATTEN/UNNEST.

jacques-n avatar Dec 10 '24 01:12 jacques-n

Hi, wondering if there is any update on the status of this? I'm interested in spending some time exploring solutions and wanted to make sure I have all available information.

benbellick avatar Oct 21 '25 13:10 benbellick

I can update it with some information sometime today, but I never got around to prototyping any implementations.

The short update is that I had talked to the author of A Relational Matrix Algebra and its Implementation in a Column Store to get some understanding of what table functions that can accommodate relational matrix operations might look like. But, I dont remember them having any special layout that warranted any special consideration.

My intention was to be able to accommodate a special layout that I am using in research which is used often in bioinformatics research. In bioinformatics (and other matrix layouts) it's common to have a column represent a dimension of the data. A proper relational normalization might be to model the row and column dimension as relational attributes (basically a compressed sparse column (csc) or compressed sparse row (csr) format). Then, the typical matrix layout can be thought of as the projection from a group by operator. This was the 3rd type of table function I described that I would want to be able to express.

drin avatar Oct 21 '25 14:10 drin

It seems datafusion's table functions return a table provider: user-defined table function.

Based on a table provider scan returning an execution plan, TableProvider::scan, this seems similar to the 2nd type of table function I imagined. I can imagine an implementation that references a separate relation tree to be run on the input operator and that produces the output.

drin avatar Oct 21 '25 14:10 drin

Then, I think the mapping from my type 2 to jacques's type b is just a difference between what columns get emitted with the caveat of cardinality changes (in my type 2 and datafusion's table provider approach, cardinality of the input is independent of the output. I believe accommodating this with jacques's type would require duplication of input rows, potentially like the aggregation node does for many grouping sets)

drin avatar Oct 21 '25 14:10 drin