[Spark][Version Checksum] Track histogram of file sizes in the checksum
Which Delta project/connector is this regarding?
- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)
Description
Follow-up to https://github.com/delta-io/delta/pull/3899. Major changes:
- Tracks a histogram of file sizes in the version checksum; the histogram is later used for validation. It is computed incrementally to avoid state reconstruction.
- Enables writing the version checksum by default, now that it can be computed incrementally without triggering a full state reconstruction (see https://github.com/delta-io/delta/pull/3899, https://github.com/delta-io/delta/pull/3895).
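
To illustrate the incremental approach described above, here is a minimal Python sketch. It is not the code in this PR: the class name, method names, and bin boundaries are all hypothetical and chosen only for illustration. The idea is that the histogram for version N+1 is derived from the histogram for version N plus the sizes of the files added and removed in the commit, so the full table state never has to be replayed.

```python
from bisect import bisect_right
from dataclasses import dataclass, field

# Hypothetical bin lower bounds in bytes; the real boundaries are not stated in this PR.
BOUNDARIES = [0, 8 * 1024, 1024 * 1024, 128 * 1024 * 1024]


@dataclass
class FileSizeHistogram:
    """Per-bin file counts and byte totals (illustrative, not Delta's class)."""
    boundaries: list
    file_counts: list = field(default_factory=list)
    total_bytes: list = field(default_factory=list)

    def __post_init__(self):
        # Start with empty bins when no counts are supplied.
        if not self.file_counts:
            self.file_counts = [0] * len(self.boundaries)
        if not self.total_bytes:
            self.total_bytes = [0] * len(self.boundaries)

    def _bin(self, size):
        if size < 0:
            raise ValueError("file size must be non-negative")
        # Index of the last boundary that is <= size.
        return bisect_right(self.boundaries, size) - 1

    def insert(self, size):
        i = self._bin(size)
        self.file_counts[i] += 1
        self.total_bytes[i] += size

    def remove(self, size):
        i = self._bin(size)
        self.file_counts[i] -= 1
        self.total_bytes[i] -= size


def apply_commit(prev, added_sizes, removed_sizes):
    """Derive the next version's histogram from the previous one and a commit's file sizes."""
    nxt = FileSizeHistogram(prev.boundaries, list(prev.file_counts), list(prev.total_bytes))
    for size in added_sizes:
        nxt.insert(size)
    for size in removed_sizes:
        nxt.remove(size)
    return nxt


def from_scratch(sizes):
    """Full recomputation over all live files, used here only to validate the incremental result."""
    hist = FileSizeHistogram(BOUNDARIES)
    for size in sizes:
        hist.insert(size)
    return hist
```

In this sketch, validation amounts to checking that the incrementally maintained histogram equals one recomputed from the full set of live files, for example `apply_commit(from_scratch([100, 10_000]), [2_000_000], [100]) == from_scratch([10_000, 2_000_000])`.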
How was this patch tested?
Added FileSizeHistogramSuite and updated some existing suites.
Does this PR introduce any user-facing changes?
No