
Projects from Substrait do not include input fields as output fields

Open EpsilonPrime opened this issue 6 months ago • 2 comments

Describe the bug

According to the Substrait specification, a project relation emits all of the input fields followed by the list of new expressions. DataFusion only emits the new expressions.

To Reproduce

Pass a Substrait plan such as the following to DataFusion. (A literal can be used instead of a window function, but this is what I had handy.)

{
  "extensionUris": [
    {
      "extensionUriAnchor": 1,
      "uri": "/functions_arithmetic.yaml"
    }
  ],
  "extensions": [
    {
      "extensionFunction": {
        "extensionUriReference": 1,
        "functionAnchor": 1,
        "name": "row_number"
      }
    }
  ],
  "relations": [
    {
      "root": {
        "input": {
          "project": {
            "common": {
              "direct": {}
            },
            "input": {
              "read": {
                "common": {
                  "direct": {}
                },
                "baseSchema": {
                  "names": [
                    "user_id",
                    "name",
                    "paid_for_service"
                  ],
                  "struct": {
                    "types": [
                      {
                        "string": {
                          "nullability": "NULLABILITY_REQUIRED"
                        }
                      },
                      {
                        "string": {
                          "nullability": "NULLABILITY_REQUIRED"
                        }
                      },
                      {
                        "bool": {
                          "nullability": "NULLABILITY_REQUIRED"
                        }
                      }
                    ],
                    "nullability": "NULLABILITY_REQUIRED"
                  }
                },
                "namedTable": {
                  "names": [
                    "users"
                  ]
                }
              }
            },
            "expressions": [
              {
                "windowFunction": {
                  "functionReference": 1,
                  "sorts": [
                    {
                      "expr": {
                        "selection": {
                          "directReference": {
                            "structField": {
                              "field": 1
                            }
                          },
                          "rootReference": {}
                        }
                      },
                      "direction": "SORT_DIRECTION_ASC_NULLS_FIRST"
                    }
                  ],
                  "upperBound": {
                    "unbounded": {}
                  },
                  "lowerBound": {
                    "unbounded": {}
                  },
                  "outputType": {
                    "i64": {
                      "nullability": "NULLABILITY_REQUIRED"
                    }
                  },
                  "invocation": 3
                }
              }
            ]
          }
        },
        "names": [
          "user_id",
          "name",
          "paid_for_service",
          "row_number"
        ]
      }
    }
  ],
  "version": {
    "minorNumber": 52,
    "producer": "spark-substrait-gateway"
  }
}

Expected behavior

The result of the plan above should be 4 columns, matching the 4 names provided in the root relation (the 3 input fields plus row_number). The current behavior is that DataFusion returns just one column (row_number) for the project.
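
To make the expected field ordering concrete, here is a minimal sketch of the output rule as I read it from the spec: the input fields come first, then the new expressions, and an optional emit output mapping selects from that combined list. This is a toy model in plain Python, not DataFusion's or Substrait's API; the function name `project_output_fields` and the `emit_mapping` parameter are made up for illustration.

```python
def project_output_fields(input_fields, expression_names, emit_mapping=None):
    """Toy model of a Substrait project relation's output field list.

    Hypothetical helper for illustration only. With direct emission
    ("common": {"direct": {}}), the output is every input field followed
    by every new expression. With an emit output mapping, the mapping
    indexes into that combined list.
    """
    combined = list(input_fields) + list(expression_names)
    if emit_mapping is None:
        return combined
    return [combined[i] for i in emit_mapping]


# Field names taken from the plan above.
input_fields = ["user_id", "name", "paid_for_service"]
expression_names = ["row_number"]

# Direct emission, as in the plan above: 3 input fields + 1 expression.
direct = project_output_fields(input_fields, expression_names)

# Getting only the new column would need an explicit mapping to index 3.
only_new = project_output_fields(input_fields, expression_names, emit_mapping=[3])

print(direct)
print(only_new)
```

Under this reading, the single-column result DataFusion currently produces corresponds to the `only_new` case, which the plan above does not request.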

Additional context

No response

EpsilonPrime, Aug 28 '24 00:08