datafusion
datafusion copied to clipboard
Substrait integration doesn't recognize typed functions
Describe the bug
Since https://github.com/substrait-io/substrait/pull/537, the standard has declared that functions must be named with compound names consisting of both the function and its types. For example, for the function "not", the full name would now be "not:bool". However trying to use "not:bool" in Substrait input for DataFusion results in
This feature is not implemented: Unsupported function name: \"not:bool\"
as "not:bool" doesn't match into the list used by https://github.com/apache/datafusion/blob/f0e96c670108ba0ffdebb9dd9e764bba4d2dca8c/datafusion/substrait/src/logical_plan/consumer.rs#L125
To Reproduce
Using e.g. isthmus from Substrait-java, one can produce a simple substrait:
substrait-java % ./isthmus-cli/build/graal/isthmus "select not d from t" -c "create table t (d boolean)" --outputformat=BINARY | base64
ChsIARIXL2Z1bmN0aW9uc19ib29sZWFuLnlhbWwSDhoMCAEaCG5vdDpib29sGkcSRQo7OjkKBRIDCgEBEhoKGAoCCgASDQoBRBIICgQKAhABGAI6AwoBVBoUGhIaBAoCEAEiChoIEgYKAhIAIgASBkVYUFIkMA==
(Drop the outputformat=BINARY | base64
to see the plan as text)
Then using DataFusion to execute that plan:
#[tokio::test]
async fn test_not() -> datafusion::common::Result<()> {
let protobuf = general_purpose::STANDARD.decode("ChsIARIXL2Z1bmN0aW9uc19ib29sZWFuLnlhbWwSDhoMCAEaCG5vdDpib29sGkcSRQo7OjkKBRIDCgEBEhoKGAoCCgASDQoBRBIICgQKAhABGAI6AwoBVBoUGhIaBAoCEAEiChoIEgYKAhIAIgASBkVYUFIkMA==").unwrap();
let ctx = SessionContext::new();
ctx.register_csv("T", "tests/testdata/data.csv", CsvReadOptions::new())
.await?;
let plan = deserialize_bytes(protobuf).await?;
let _ = from_substrait_plan(&ctx, plan.as_ref()).await?;
Ok(())
}
Fails with Error: NotImplemented("Unsupported function name: \"not:bool\"")
Expected behavior
not:bool
should be found and the plan be created correctly
Additional context
No response
Thanks for the report @Blizzara -- this would be a great thing to fix
Cool, I'll take a look in the next days!
Thanks for taking this @Blizzara! Just out of curiosity, are you using this datafusion-substrait somewhere and find this inconsistency?
Yes! I'm working on using DataFusion to basically execute Spark dataframes through Spark -> Substrait -> DataFusion. The Spark -> Substrait part is a (currently closed-source, but I hope to open it too) fork of "substrait-spark" from https://github.com/apache/incubator-gluten/tree/v1.1.0/substrait/substrait-spark (forked as it's no longer included in gluten).
Sounds great!! I'm very interested in this project, looking forward to your progress!
Be wary, Gluten contains a copy of Substrait instead of depending on the main repo. As a result its Substrait is incompatible with the rest of the ecosystem. That works for them as they only use the Substrait internally but the other tools are handy sometimes (especially the Substrait Validator).
It may also interest you that a generic Spark to Substrait tool is out there: https://github.com/voltrondata/spark-substrait-gateway
It's mostly there for DuckDB but other backends require some tweaks (DataFusion is one of them). We have a tweak to remove the compound names for DataFusion but would love to see this issue addressed. If you have questions (especially from the Substrait side) feel free to reach out.
Be wary, Gluten contains a copy of Substrait instead of depending on the main repo. As a result its Substrait is incompatible with the rest of the ecosystem. That works for them as they only use the Substrait internally but the other tools are handy sometimes (especially the Substrait Validator).
Yup, the substrait-spark submodule was not using Gluten's Substrait but the vanilla one. (The whole thing wasn't really used by Gluten and it's no longer part of the repo, thus the fork).
It may also interest you that a generic Spark to Substrait tool is out there: https://github.com/voltrondata/spark-substrait-gateway
It does, thanks! Not directly useful as we need a solution in Java, but it's always good to have more references to look at.