[SPARK-47050][SQL] Collect and publish partition level metrics for V1
We currently capture per-task metrics that include the number of files, bytes, and rows written, along with the set of updated partitions. This change additionally captures metrics for each updated partition, reporting the partition sub-paths together with the number of files, bytes, and rows written to each partition by each task.
This is the V1 implementation, extracted from https://github.com/apache/spark/pull/45123
What changes were proposed in this pull request?
- Update the `WriteTaskStatsTracker` implementation to associate a partition with each file during writing, and to track the number of rows written to each file. The final stats now include a map of partitions to their associated partition stats (a minimal sketch follows this list).
- Update the `WriteJobStatsTracker` implementation to capture the partition sub-paths and to publish a new event to the listener bus. The processed stats aggregate the statistics for each partition.
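A minimal illustrative sketch of the task-side idea (simplified stand-ins for the actual `WriteTaskStatsTracker` hooks, not the PR's code): each file the task opens is associated with its partition sub-path, and files, bytes, and rows are accumulated per partition.

```scala
import scala.collection.mutable

final case class PartitionTaskStats(numFiles: Int, numBytes: Long, numRows: Long)

final class PartitionAwareStatsTracker {
  // file path -> partition sub-path (e.g. "ds=2024-01-01/country=US")
  private val fileToPartition = mutable.Map.empty[String, String]
  // partition sub-path -> running (files, bytes, rows) for this task
  private val partitionStats = mutable.Map.empty[String, PartitionTaskStats]

  // Called when the writer opens a new file under a partition directory.
  def newFile(filePath: String, partitionPath: String): Unit = {
    fileToPartition(filePath) = partitionPath
    val cur = partitionStats.getOrElse(partitionPath, PartitionTaskStats(0, 0L, 0L))
    partitionStats(partitionPath) = cur.copy(numFiles = cur.numFiles + 1)
  }

  // Called for every row written to a file.
  def newRow(filePath: String): Unit =
    fileToPartition.get(filePath).foreach { p =>
      val cur = partitionStats(p)
      partitionStats(p) = cur.copy(numRows = cur.numRows + 1)
    }

  // Called when a file is closed, with its final size on disk.
  def closeFile(filePath: String, numBytes: Long): Unit =
    fileToPartition.get(filePath).foreach { p =>
      val cur = partitionStats(p)
      partitionStats(p) = cur.copy(numBytes = cur.numBytes + numBytes)
    }

  // Final per-partition stats for this task, merged on the driver afterwards.
  def finalStats: Map[String, PartitionTaskStats] = partitionStats.toMap
}
```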
Why are the changes needed?
This deepens our understanding of written data by tracking each task's impact on our datasets at the partition level.
Does this PR introduce any user-facing change?
This makes partition-level write metrics accessible through a new event posted to the listener bus (see the consumer sketch below).
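A hedged sketch of how such an event could be consumed; the event class name and its fields below are illustrative assumptions, not the actual API added by this PR.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// Hypothetical event shape: partition sub-path -> (numFiles, numBytes, numRows).
case class PartitionMetricsEvent(metrics: Map[String, (Long, Long, Long)])
  extends SparkListenerEvent

// Custom events surface through SparkListener.onOtherEvent.
class PartitionMetricsListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: PartitionMetricsEvent =>
      e.metrics.foreach { case (partition, (files, bytes, rows)) =>
        println(s"partition=$partition files=$files bytes=$bytes rows=$rows")
      }
    case _ => // not a partition metrics event; ignore
  }
}

// Registered once per application, e.g.:
// spark.sparkContext.addSparkListener(new PartitionMetricsListener)
```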
How was this patch tested?
In addition to the new unit tests, this was run in a Kubernetes environment writing tables with differing partitioning strategies and validating the reported stats. Unit tests using both InsertIntoHadoopFsRelationCommand and InsertIntoHiveTable now verify partition stats when dynamic partitioning is enabled. We also verified that the aggregated partition metrics matched the existing metrics for number of files, bytes, and rows.
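For reference, a dynamically partitioned V1 write of the kind exercised by these tests looks roughly like this (the path and column names are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-stats-demo").getOrCreate()
import spark.implicits._

// Each distinct (ds, country) pair becomes a partition sub-path such as
// ds=2024-01-01/country=US, and should surface its own stats entry.
Seq(("2024-01-01", "US", 1), ("2024-01-01", "DE", 2), ("2024-01-02", "US", 3))
  .toDF("ds", "country", "value")
  .write
  .partitionBy("ds", "country") // dynamic partitioning
  .mode("overwrite")
  .parquet("/tmp/partition_stats_demo")
```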
Was this patch authored or co-authored using generative AI tooling?
No
cc @cloud-fan
Gently pinging @cloud-fan
is the end goal to automatically update table statistics?
We're looking to collect deeper insights into what jobs are doing, beyond the current read/write statistics such as bytes, num files, etc.
@cloud-fan Spark already collects the number of rows and bytes written, but only reports the total aggregate. If you're concerned about the overall size, the collected stats are bounded by the number of partitions rather than the number of files. The current V1 writers only know about the path they are writing to, which is why I wanted to augment `newFile` with additional information (see the sketch below).
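To illustrate the size argument, here is a sketch (a hypothetical helper, not the PR's code) of the driver-side merge: per-task maps are combined by partition sub-path, so the result is bounded by the number of partitions touched, not by the number of files written.

```scala
final case class PartitionStats(numFiles: Long, numBytes: Long, numRows: Long) {
  def merge(other: PartitionStats): PartitionStats =
    PartitionStats(numFiles + other.numFiles,
                   numBytes + other.numBytes,
                   numRows + other.numRows)
}

// Merge the per-partition maps reported by each task into one job-level map.
def aggregate(taskStats: Seq[Map[String, PartitionStats]]): Map[String, PartitionStats] =
  taskStats.flatten.groupMapReduce(_._1)(_._2)(_ merge _)
```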
@cloud-fan Did you still have concerns about collecting and reporting the stats per partition?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!