datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Substrait integration doesn't recognize typed functions

Open Blizzara opened this issue 9 months ago • 2 comments

Describe the bug

Since https://github.com/substrait-io/substrait/pull/537, the standard has declared that functions must be named with compound names consisting of both the function and its types. For example, for the function "not", the full name would now be "not:bool". However trying to use "not:bool" in Substrait input for DataFusion results in This feature is not implemented: Unsupported function name: \"not:bool\" as "not:bool" doesn't match into the list used by https://github.com/apache/datafusion/blob/f0e96c670108ba0ffdebb9dd9e764bba4d2dca8c/datafusion/substrait/src/logical_plan/consumer.rs#L125

To Reproduce

Using e.g. isthmus from Substrait-java, one can produce a simple substrait:

substrait-java % ./isthmus-cli/build/graal/isthmus "select not d from t" -c "create table t (d boolean)"  --outputformat=BINARY | base64 
ChsIARIXL2Z1bmN0aW9uc19ib29sZWFuLnlhbWwSDhoMCAEaCG5vdDpib29sGkcSRQo7OjkKBRIDCgEBEhoKGAoCCgASDQoBRBIICgQKAhABGAI6AwoBVBoUGhIaBAoCEAEiChoIEgYKAhIAIgASBkVYUFIkMA==

(Drop the outputformat=BINARY | base64 to see the plan as text)

Then using DataFusion to execute that plan:

#[tokio::test]
async fn test_not() -> datafusion::common::Result<()> {
    let protobuf = general_purpose::STANDARD.decode("ChsIARIXL2Z1bmN0aW9uc19ib29sZWFuLnlhbWwSDhoMCAEaCG5vdDpib29sGkcSRQo7OjkKBRIDCgEBEhoKGAoCCgASDQoBRBIICgQKAhABGAI6AwoBVBoUGhIaBAoCEAEiChoIEgYKAhIAIgASBkVYUFIkMA==").unwrap();
    let ctx = SessionContext::new();
    ctx.register_csv("T", "tests/testdata/data.csv", CsvReadOptions::new())
        .await?;
    let plan = deserialize_bytes(protobuf).await?;
    let _ = from_substrait_plan(&ctx, plan.as_ref()).await?;
    Ok(())
}

Fails with Error: NotImplemented("Unsupported function name: \"not:bool\"")

Expected behavior

not:bool should be found and the plan be created correctly

Additional context

No response

Blizzara avatar May 07 '24 16:05 Blizzara

Thanks for the report @Blizzara -- this would be a great thing to fix

alamb avatar May 08 '24 17:05 alamb

Cool, I'll take a look in the next days!

Blizzara avatar May 15 '24 17:05 Blizzara

Thanks for taking this @Blizzara! Just out of curiosity, are you using this datafusion-substrait somewhere and find this inconsistency?

waynexia avatar May 28 '24 15:05 waynexia

Yes! I'm working on using DataFusion to basically execute Spark dataframes through Spark -> Substrait -> DataFusion. The Spark -> Substrait part is a (currently closed-source, but I hope to open it too) fork of "substrait-spark" from https://github.com/apache/incubator-gluten/tree/v1.1.0/substrait/substrait-spark (forked as it's no longer included in gluten).

Blizzara avatar May 28 '24 15:05 Blizzara

Sounds great!! I'm very interested in this project, looking forward to your progress!

waynexia avatar May 28 '24 16:05 waynexia

Be wary, Gluten contains a copy of Substrait instead of depending on the main repo. As a result its Substrait is incompatible with the rest of the ecosystem. That works for them as they only use the Substrait internally but the other tools are handy sometimes (especially the Substrait Validator).

It may also interest you that a generic Spark to Substrait tool is out there: https://github.com/voltrondata/spark-substrait-gateway

It's mostly there for DuckDB but other backends require some tweaks (DataFusion is one of them). We have a tweak to remove the compound names for DataFusion but would love to see this issue addressed. If you have questions (especially from the Substrait side) feel free to reach out.

EpsilonPrime avatar May 28 '24 21:05 EpsilonPrime

Be wary, Gluten contains a copy of Substrait instead of depending on the main repo. As a result its Substrait is incompatible with the rest of the ecosystem. That works for them as they only use the Substrait internally but the other tools are handy sometimes (especially the Substrait Validator).

Yup, the substrait-spark submodule was not using Gluten's Substrait but the vanilla one. (The whole thing wasn't really used by Gluten and it's no longer part of the repo, thus the fork).

It may also interest you that a generic Spark to Substrait tool is out there: https://github.com/voltrondata/spark-substrait-gateway

It does, thanks! Not directly useful as we need a solution in Java, but it's always good to have more references to look at.

Blizzara avatar May 29 '24 08:05 Blizzara