spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-49676][SS][PYTHON] Add Support for Chaining of Operators in transformWithStateInPandas API

Open jingz-db opened this issue 5 months ago • 0 comments

What changes were proposed in this pull request?

This PR adds support to define event time column in the output dataset of TransformWithStateInPandas operator. The new event time column will be used to evaluate watermark expressions in downstream operators.

Why are the changes needed?

This change is to couple with the scala implementation of chaining of operators. PR in Scala: https://github.com/apache/spark/pull/45376

Does this PR introduce any user-facing change?

Yes. User can now specify a event time column as:

df.groupBy("id")
  .transformWithStateInPandas(
      statefulProcessor=stateful_processor,
      outputStructType=output_schema,
      outputMode="Update",
      timeMode=timeMode,
      eventTimeColumnName="outputTimestamp"
  )

How was this patch tested?

Integration tests.

Was this patch authored or co-authored using generative AI tooling?

No.

jingz-db avatar Sep 16 '24 20:09 jingz-db