
[Spark][Version Checksum] Track histogram of file sizes in the checksum

Open dhruvarya-db opened this issue 11 months ago • 1 comment

Which Delta project/connector is this regarding?

  • [X] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Description

Follow up for https://github.com/delta-io/delta/pull/3899. Major changes:

  1. Tracks a histogram of file sizes in the version checksum. The histogram is later used for validation, and it is computed incrementally to avoid state reconstruction.
  2. Enables writing the version checksum by default, now that the version checksum can be computed incrementally without triggering a full state reconstruction (see https://github.com/delta-io/delta/pull/3899, https://github.com/delta-io/delta/pull/3895).
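
To illustrate the idea behind change 1, here is a minimal Python sketch of an incrementally maintained file-size histogram. It is not the Scala code from this PR: the class name, field names, bin boundaries, and the `apply_commit` and `histogram_from_sizes` helpers are all assumptions made for the example. The point it shows is that each commit only needs the sizes of the files it adds and removes, so the histogram can be carried forward from the previous version without listing every file in the table.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class FileSizeHistogram:
    """Illustrative histogram; bin i covers [boundaries[i], boundaries[i + 1])."""

    boundaries: List[int]  # sorted, starts at 0; the last bin is unbounded
    file_counts: List[int] = field(default_factory=list)
    total_bytes: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.boundaries or self.boundaries[0] != 0:
            raise ValueError("boundaries must start at 0")
        if self.boundaries != sorted(self.boundaries):
            raise ValueError("boundaries must be sorted")
        # Start with empty bins when no counts are supplied.
        if not self.file_counts:
            self.file_counts = [0] * len(self.boundaries)
        if not self.total_bytes:
            self.total_bytes = [0] * len(self.boundaries)

    def _bin_index(self, size: int) -> int:
        if size < 0:
            raise ValueError("file size must be non-negative")
        # Index of the last boundary that is <= size.
        return bisect_right(self.boundaries, size) - 1

    def add(self, size: int) -> None:
        i = self._bin_index(size)
        self.file_counts[i] += 1
        self.total_bytes[i] += size

    def remove(self, size: int) -> None:
        i = self._bin_index(size)
        if self.file_counts[i] == 0:
            raise ValueError("cannot remove a file from an empty bin")
        self.file_counts[i] -= 1
        self.total_bytes[i] -= size

    def apply_commit(self, added: Iterable[int], removed: Iterable[int]) -> None:
        """Update in place from one commit's added and removed file sizes."""
        for size in added:
            self.add(size)
        for size in removed:
            self.remove(size)


def histogram_from_sizes(boundaries, sizes):
    """Build a histogram from scratch (stands in for full state reconstruction)."""
    h = FileSizeHistogram(list(boundaries))
    h.apply_commit(sizes, [])
    return h


# Hypothetical bin boundaries: [0, 1 KiB), [1 KiB, 1 MiB), [1 MiB, inf).
BOUNDARIES = [0, 1024, 1024 * 1024]

hist = FileSizeHistogram(list(BOUNDARIES))
hist.apply_commit(added=[100, 2000, 5_000_000], removed=[])  # first commit
hist.apply_commit(added=[512], removed=[2000])               # second commit

# Validation: the incremental result should match a from-scratch recomputation.
expected = histogram_from_sizes(BOUNDARIES, [100, 5_000_000, 512])
assert hist == expected
```

The final assertion mirrors the validation use mentioned above: a histogram rebuilt from the full set of live files should equal the one carried forward commit by commit.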

How was this patch tested?

Added FileSizeHistogramSuite and updated some existing suites.

Does this PR introduce any user-facing changes?

No

dhruvarya-db, Nov 27 '24 20:11