[SPARK-46798][Structured Streaming] Kafka custom partition location assignment (rack awareness)
This is a rework of https://github.com/apache/spark/pull/44954.
What changes were proposed in this pull request?
Add support for custom partition location assignment for Kafka sources in Structured Streaming.
Why are the changes needed?
Please see the design doc for greater detail and further discussion.
SPARK-15406 added Kafka consumer support to Spark Structured Streaming, but custom partition location assignment was not part of that work. As it exists today, the Structured Streaming Kafka consumer evenly allocates Kafka topic partitions to executors without regard to Kafka broker rack information or executor location. This behavior can drive large cross-AZ networking costs in large deployments.
The design doc for SPARK-15406 described the ability to assign Kafka partitions to particular executors (a feature which would enable rack awareness), but it seems that feature was never implemented.
For DStreams users, there is already a way to assign Kafka partitions to Spark executors in a custom fashion via LocationStrategies.PreferFixed, so this sort of functionality has a precedent.
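For reference, the existing DStreams mechanism looks roughly like this. This is a sketch, not code from this PR: the topic name "events", the executor hostnames, and the surrounding streamingContext, topics, and kafkaParams values are all assumed to be set up elsewhere.

```scala
import java.util.{HashMap => JHashMap}

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// Pin each Kafka partition to a preferred executor host.
val hostMap = new JHashMap[TopicPartition, String]()
hostMap.put(new TopicPartition("events", 0), "exec-host-a")
hostMap.put(new TopicPartition("events", 1), "exec-host-b")

// PreferFixed asks the scheduler to place each partition's consumer
// on the mapped host when an executor is running there.
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  LocationStrategies.PreferFixed(hostMap),
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))
```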
Does this PR introduce any user-facing change?
An additional parameter will be accepted on the Kafka source provider. This parameter is provisionally named partitionlocationassigner. The parameter takes a class name which, when instantiated, supplies the Kafka source with user-provided Kafka partition location suggestions. The class should implement a new trait defined in this PR and described in the design document.
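To give a feel for the shape of such a contract, here is a minimal, self-contained sketch. Only the option name partitionlocationassigner comes from this PR; the trait name, method signature, and round-robin implementation below are illustrative guesses, not the actual trait defined here, and the TopicPartition case class is a stand-in for Kafka's own class.

```scala
// Stand-in for org.apache.kafka.common.TopicPartition, to keep the sketch self-contained.
final case class TopicPartition(topic: String, partition: Int)

// Hypothetical shape of the assigner trait: given the partitions to read and
// the executors known to the driver, suggest preferred executor locations.
trait PartitionLocationAssigner {
  def preferredLocations(
      partitions: Seq[TopicPartition],
      executors: Seq[String]): Map[TopicPartition, Seq[String]]
}

// Illustrative implementation: spread partitions round-robin over executors.
// A real rack-aware assigner would instead match broker rack metadata
// against executor locations.
class RoundRobinAssigner extends PartitionLocationAssigner {
  def preferredLocations(
      partitions: Seq[TopicPartition],
      executors: Seq[String]): Map[TopicPartition, Seq[String]] =
    partitions.zipWithIndex.map { case (tp, i) =>
      tp -> Seq(executors(i % executors.length))
    }.toMap
}
```

The class name would then presumably be passed to the source via the new option, along the lines of .option("partitionlocationassigner", classOf[RoundRobinAssigner].getName).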
How was this patch tested?
Unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.