spark
spark copied to clipboard
[SPARK-49676][SS][PYTHON] Add Support for Chaining of Operators in transformWithStateInPandas API
What changes were proposed in this pull request?
This PR adds support to define event time column in the output dataset of TransformWithStateInPandas
operator. The new event time column will be used to evaluate watermark expressions in downstream operators.
Why are the changes needed?
This change is to couple with the scala implementation of chaining of operators. PR in Scala: https://github.com/apache/spark/pull/45376
Does this PR introduce any user-facing change?
Yes. User can now specify a event time column as:
df.groupBy("id")
.transformWithStateInPandas(
statefulProcessor=stateful_processor,
outputStructType=output_schema,
outputMode="Update",
timeMode=timeMode,
eventTimeColumnName="outputTimestamp"
)
How was this patch tested?
Integration tests.
Was this patch authored or co-authored using generative AI tooling?
No.