GH-40361: [C++] Make flatbuffers serialization more deterministic
Rationale for this change
This is the start of a PR to address #40361, and in turn #40202, to make metadata in parquet files written by arrow to be identical irrespective of the platform configuration. This is limited, as platform-specific differences in R or Python versions or compression libraries could still result in differences.
What changes are included in this PR?
So far I have only made a partial change to part of the metadata serialization. I need to look at whether other calls to flatbuffers require similar treatment.
Are these changes tested?
Not yet, this is a draft PR
Are there any user-facing changes?
No
- GitHub Issue: #40361
:warning: GitHub issue #40361 has been automatically assigned in GitHub to PR creator.
:warning: GitHub issue #40361 has no components, please add labels for components.
:warning: GitHub issue #40361 has no components, please add labels for components.
Thanks @noamross!
@kou can you please approve all workflows to run?
Approved!
:warning: GitHub issue #40361 has no components, please add labels for components.
As discussed on the GH issue, it would be useful to add a test using output from your machine as a reference. (also, perhaps open a separate PR with just the test to validate that it would fail on CI)
Hi @noamross, would you be interested in adding the test described above to this PR? If not, I could take a shot at it.
@amoeba Sorry to open this and abandon it, it just became a very busy month and my C++ is pretty limited. I'd be happy if you would!
No problem at all @noamross, it's awesome you filed a PR for the issue and it's very much appreciated. I'll work on adding and testing the test.
Thanks! I expect the files attached to https://github.com/apache/arrow/issues/40202#issuecomment-1977273164 could serve for a test of whether a generated file is identical to a reference from a different system, but I note that they would change with the R version being used.
I put up a draft PR with a test to exercise this and hopefully watch it fail on CI at https://github.com/apache/arrow/pull/41018. It fails on my amd64 linux machine so I think it's close to what we need here. I'll watch CI on that PR and report back.
@github-actions crossbow submit -g cpp
Only contributors can submit requests to this bot. Please ask someone from the community for help with getting the first commit in.
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/8668333493
@github-actions crossbow submit -g cpp
Revision: 911794535bb22b6012a2de8d262c37384324017e
Submitted crossbow builds: ursacomputing/crossbow @ actions-e9bd83c5da
Things look good locally (testing on macOS-aarch64-clang and Linux-amd64-gcc) but I see an issue with my includes in a few of the above jobs. I'll fix those soon.
@github-actions crossbow submit -g cpp
Revision: c3af79dbaabf59cd61fee65cdd9e47a47a9bdd62
Submitted crossbow builds: ursacomputing/crossbow @ actions-08f54365f0
I've pushed up a fix to address the includes issue and the crossbow jobs looked clear (failures are unrelated and the test we expect to pass does indeed pass).
I've marked this as ready for review. @pitrou do you have time to review this? cc @kou
Thanks for taking a look @kou, I accepted all your changes. I feel pretty good about the state of this PR at this point and am not sure we need another review. Let me know what you think.
@github-actions crossbow submit -g cpp
I'll let CI and the crossbow jobs run then merge if all looks good.
Revision: 4a81743474b50fdfbb9df50724f8a921059f0252
Submitted crossbow builds: ursacomputing/crossbow @ actions-20acdf73a0
Thanks @pitrou for the review. I rebased from apache/main, accepted your three changes as-is, and added one extra change I caught while dealing with conflicts on rebase, 4bd0339e596d410bd5ad3a73f06d9294c7a87e7f. In https://github.com/apache/arrow/commit/4c8eff2ec8bae76364fcb28ffac574c1df262705 @kou tweaked the CMakeList and I think this PR should make the same change as was made there.
PS: After pushing I found the ipc/CMakeLists.txt needed formatting, so that was added here too.
Parquet tests were failing all over on CI and due to needing to bump submodules so I pushed 4ebe53bd21fd1f28ce6babebf6802488283c692c so we could see CI.
@github-actions crossbow submit -g cpp
Revision: 4ebe53bd21fd1f28ce6babebf6802488283c692c
Submitted crossbow builds: ursacomputing/crossbow @ actions-d1985bc136
Thanks all. The submodule change to help CI pass has been reverted in this PR. I'm merging this now.
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 74f7578f77adca6b0fd79f7d37e28941330221eb.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them.