
DataFusion does not validate that Substrait NamedScan schemas match registered tables

Open: vbarua opened this issue 1 year ago • 1 comment

Describe the bug

As written, the test assertion in https://github.com/apache/datafusion/blob/1fce2a98ef9c7f8dbd7f3dedcaf4aa069ab92154/datafusion/substrait/tests/cases/logical_plans.rs#L46-L50 should fail because DataFusion registers the data table with 5 fields [a, b, c, d, e] but the schema for the table in the Substrait plan only has a single field [D].

To Reproduce

No response

Expected behavior

DataFusion should reject Substrait plans in which a NamedScan schema does not match the schema of the corresponding registered table.

Additional context

Generally speaking, if the plan consumer (DataFusion) and the producer do not agree on column names and types, it is unlikely that execution will be meaningful.
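To make the expected check concrete, here is a minimal sketch of a strict name-and-type comparison. It does not use DataFusion or Substrait types: `Field` and `check_schemas_match` are hypothetical names, fields are modelled as plain `(name, type)` string pairs, and names are compared exactly (case-sensitively), which is an assumption rather than a statement about how DataFusion resolves names.

```rust
/// A field as a (name, type) pair. Hypothetical stand-in for a real schema field.
type Field<'a> = (&'a str, &'a str);

/// Returns Ok(()) when both schemas list the same fields in the same order,
/// otherwise an error describing the first difference found.
/// Assumption: names and types are compared as exact strings.
fn check_schemas_match(table: &[Field], plan: &[Field]) -> Result<(), String> {
    if table.len() != plan.len() {
        return Err(format!(
            "field count differs: table has {}, plan has {}",
            table.len(),
            plan.len()
        ));
    }
    for (i, (t, p)) in table.iter().zip(plan.iter()).enumerate() {
        // Compare name and type separately so either kind of mismatch is caught.
        if t.0 != p.0 || t.1 != p.1 {
            return Err(format!(
                "field {} differs: table has {:?}, plan has {:?}",
                i, t, p
            ));
        }
    }
    Ok(())
}
```

With a five-field table `[a, b, c, d, e]` and a plan schema containing only `[D]`, as in the linked test, this check returns an error because the field counts differ.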

vbarua, Aug 28 '24 22:08

I'm in the process of preparing a PR for this issue.

vbarua, Aug 28 '24 22:08

From conversations with @Blizzara, requiring the DataFusion and Substrait schemas to match exactly is stricter than necessary. In practice, if the Substrait schema is a subset of the DataFusion schema, the consumer can adapt the plan as it consumes it so that the scan matches the shape the Substrait plan expects.

For example, if DataFusion has a schema [a, b, c] for table t, and Substrait has a schema [b, c] for table t, then as DataFusion consumes the plan it may add a projection of fields [b, c] immediately after the read from table t to bring the scan in line with what the Substrait plan expects.
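A rough sketch of the relaxed rule, under the same simplifications: `Field` and `projection_for_subset` are hypothetical names, fields are `(name, type)` string pairs, and matching by exact name is an assumption. The function returns the table column indices that a projection placed after the read would select, or an error if the plan references a field the table does not have or expects a different type.

```rust
/// A field as a (name, type) pair. Hypothetical stand-in for a real schema field.
type Field<'a> = (&'a str, &'a str);

/// For each field in the plan schema, find the index of the matching table field.
/// The resulting indices describe a projection that reshapes the table scan
/// into the schema the plan expects.
/// Assumption: fields are matched by exact name, and types must be equal.
fn projection_for_subset(table: &[Field], plan: &[Field]) -> Result<Vec<usize>, String> {
    let mut indices = Vec::with_capacity(plan.len());
    for (name, ty) in plan {
        // Locate the plan field in the registered table by name.
        let pos = table
            .iter()
            .position(|(n, _)| n == name)
            .ok_or_else(|| format!("field {:?} is not in the registered table", name))?;
        // A name match with a different type is still a schema mismatch.
        if table[pos].1 != *ty {
            return Err(format!(
                "field {:?} has type {:?} in the table but {:?} in the plan",
                name, table[pos].1, ty
            ));
        }
        indices.push(pos);
    }
    Ok(indices)
}
```

For a table `[a, b, c]` and a plan schema `[b, c]`, the function yields indices `[1, 2]`, which corresponds to projecting `b` and `c` directly after the read from `t`.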

vbarua, Sep 04 '24 23:09