
[SPARK-47050][SQL] Collect and publish partition level metrics for V1

Open · snmvaughan opened this pull request 1 year ago • 6 comments

We currently capture per-task metrics that include the number of files, bytes, and rows written, along with the set of updated partitions. This change additionally captures metrics for each updated partition, reporting the partition sub-paths along with the number of files, bytes, and rows written per partition for each task.

This is the V1 implementation, extracted from https://github.com/apache/spark/pull/45123

What changes were proposed in this pull request?

  1. Update the WriteTaskStatsTracker implementation to associate a partition with each file during writing, and to track the number of rows written to each file. The final stats now include a map from partitions to their partition-level stats (a simplified sketch of the task-side tracking follows this list).
  2. Update the WriteJobStatsTracker implementation to capture the partition sub-paths and to publish a new event to the listener bus. The processed stats aggregate the statistics per partition.
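For illustration, here is a minimal, self-contained Scala sketch of the task-side tracking described in item 1. It mirrors the callback shape of Spark's WriteTaskStatsTracker (newPartition/newFile/newRow), but the class, method signatures, and field names below are hypothetical simplifications, not the PR's actual code:

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for the per-partition stats carried in the
// final task stats (names are illustrative, not from the PR).
case class PartitionStats(var numFiles: Int = 0, var numBytes: Long = 0L, var numRows: Long = 0L)

// Mirrors the callback shape of Spark's WriteTaskStatsTracker: the writer
// invokes newPartition/newFile/newRow as it emits data, and getFinalStats
// when the task finishes.
class PartitionAwareTaskStatsTracker {
  // Partition sub-path currently being written, if any.
  private var currentPartition: Option[String] = None
  // File path -> partition sub-path the file belongs to.
  private val fileToPartition = mutable.Map.empty[String, String]
  // Partition sub-path -> accumulated stats for this task.
  private val statsByPartition = mutable.Map.empty[String, PartitionStats]

  def newPartition(partitionSubPath: String): Unit = {
    currentPartition = Some(partitionSubPath)
    statsByPartition.getOrElseUpdate(partitionSubPath, PartitionStats())
  }

  def newFile(filePath: String): Unit =
    currentPartition.foreach { p =>
      fileToPartition(filePath) = p
      statsByPartition(p).numFiles += 1
    }

  def newRow(filePath: String): Unit =
    fileToPartition.get(filePath).foreach(p => statsByPartition(p).numRows += 1)

  def closeFile(filePath: String, fileBytes: Long): Unit =
    fileToPartition.get(filePath).foreach(p => statsByPartition(p).numBytes += fileBytes)

  // The final per-task stats now carry a map of partition sub-path -> stats.
  def getFinalStats: Map[String, PartitionStats] = statsByPartition.toMap
}
```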

Why are the changes needed?

This increases our understanding of written data by tracking each task's impact on the partitions of our datasets.

Does this PR introduce any user-facing change?

Yes. It makes partition-level write metrics accessible through a new event published to the listener bus.
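To illustrate how a consumer might observe the new event, here is a hedged sketch of a SparkListener. The actual event class name and fields are defined by the PR, so PartitionMetricsEvent below is a hypothetical stand-in:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// Hypothetical event shape; the real class name and fields come from the PR.
case class PartitionMetricsEvent(
    partitionSubPath: String, numFiles: Int, numBytes: Long, numRows: Long)
  extends SparkListenerEvent

class PartitionMetricsListener extends SparkListener {
  // Custom events arrive through onOtherEvent rather than a dedicated hook.
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: PartitionMetricsEvent =>
      println(s"partition=${e.partitionSubPath} files=${e.numFiles} " +
        s"bytes=${e.numBytes} rows=${e.numRows}")
    case _ => // not a partition metrics event; ignore
  }
}

// Registration on an active session:
//   spark.sparkContext.addSparkListener(new PartitionMetricsListener)
```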

How was this patch tested?

In addition to the new unit tests, this was run in a Kubernetes environment writing tables with differing partitioning strategies and validating the reported stats. Unit tests using both InsertIntoHadoopFsRelationCommand and InsertIntoHiveTable now verify partition stats when dynamic partitioning is enabled. We also verified that the aggregated partition metrics matched the existing metrics for number of files, bytes, and rows.
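For context, a minimal way to exercise the dynamic-partitioning write path locally looks like the following; the table and column names are illustrative, and the PR's actual verification lives in the Spark unit tests:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("partition-stats-demo")
  .getOrCreate()

// Each distinct value of `p` becomes a partition sub-path such as p=0, p=1, ...
// so the configured stats trackers observe several partitions in one write.
spark.range(100)
  .selectExpr("id AS value", "id % 4 AS p")
  .write
  .mode("overwrite")
  .partitionBy("p")
  .saveAsTable("t_partition_stats") // hypothetical table name
```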

Was this patch authored or co-authored using generative AI tooling?

No

snmvaughan · Apr 23 '24 16:04

cc @cloud-fan

snmvaughan · Apr 23 '24 16:04

Gently pinging @cloud-fan

dbtsai · Apr 26 '24 17:04

Is the end goal to automatically update table statistics?

cloud-fan · Apr 28 '24 09:04

We're looking to collect deeper insights into what jobs are doing, beyond the current read/write statistics such as bytes, num files, etc.

snmvaughan · Apr 30 '24 02:04

@cloud-fan Spark already collects information about the number of rows and bytes written, but only reports the total aggregate. If you're concerned about the overall size, the collected data is bounded by the number of partitions rather than being collected per file. The current V1 writers only know about the path they are writing to, which is why I wanted to augment newFile with additional information.
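To make the size argument concrete, here is an illustrative sketch (with hypothetical names) of why keying the stats by partition bounds the output: per-file measurements collapse into one entry per partition, so the reported map grows with the number of partitions, not the number of files.

```scala
// Hypothetical per-file measurement as the task writer would observe it.
case class FileStats(partitionSubPath: String, bytes: Long, rows: Long)

// Collapse per-file stats into per-partition totals: the result has one entry
// per partition sub-path, regardless of how many files each partition received.
def aggregateByPartition(files: Seq[FileStats]): Map[String, (Int, Long, Long)] =
  files.groupBy(_.partitionSubPath).map { case (p, fs) =>
    p -> (fs.size, fs.map(_.bytes).sum, fs.map(_.rows).sum)
  }
```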

snmvaughan · May 21 '24 20:05

@cloud-fan Did you still have concerns about collecting and reporting the stats per partition?

snmvaughan · Jun 06 '24 15:06

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] · Sep 15 '24 00:09