arrow GH-40361: [C++] Make flatbuffers serialization more deterministic

Rationale for this change

This is the start of a PR to address #40361, and in turn #40202, to make metadata in parquet files written by arrow to be identical irrespective of the platform configuration. This is limited, as platform-specific differences in R or Python versions or compression libraries could still result in differences.

What changes are included in this PR?

So far I have only made a partial change to part of the metadata serialization. I need to look at whether other calls to flatbuffers require similar treatment.

Are these changes tested?

Not yet, this is a draft PR

Are there any user-facing changes?

No

GitHub Issue: #40361

Mar 07 '24 00:03 noamross

:warning: GitHub issue #40361 has been automatically assigned in GitHub to PR creator.

Mar 07 '24 00:03 github-actions[bot]

:warning: GitHub issue #40361 has no components, please add labels for components.

Mar 07 '24 00:03 github-actions[bot]

:warning: GitHub issue #40361 has no components, please add labels for components.

Mar 07 '24 00:03 github-actions[bot]

Thanks @noamross!

@kou can you please approve all workflows to run?

Mar 07 '24 03:03 amoeba

Approved!

Mar 07 '24 04:03 kou

:warning: GitHub issue #40361 has no components, please add labels for components.

Mar 07 '24 04:03 github-actions[bot]

As discussed on the GH issue, it would be useful to add a test using output from your machine as a reference. (also, perhaps open a separate PR with just the test to validate that it would fail on CI)

Mar 08 '24 14:03 pitrou

Hi @noamross, would you be interested in adding the test described above to this PR? If not, I could take a shot at it.

Apr 02 '24 23:04 amoeba

@amoeba Sorry to open this and abandon it, it just became a very busy month and my C++ is pretty limited. I'd be happy if you would!

Apr 03 '24 12:04 noamross

No problem at all @noamross, it's awesome you filed a PR for the issue and it's very much appreciated. I'll work on adding and testing the test.

Apr 03 '24 21:04 amoeba

Thanks! I expect the files attached to https://github.com/apache/arrow/issues/40202#issuecomment-1977273164 could serve for a test of whether a generated file is identical to a reference from a different system, but I note that they would change with the R version being used.

Apr 03 '24 21:04 noamross

I put up a draft PR with a test to exercise this and hopefully watch it fail on CI at https://github.com/apache/arrow/pull/41018. It fails on my amd64 linux machine so I think it's close to what we need here. I'll watch CI on that PR and report back.

Apr 04 '24 23:04 amoeba

@github-actions crossbow submit -g cpp

Apr 12 '24 21:04 amoeba

Only contributors can submit requests to this bot. Please ask someone from the community for help with getting the first commit in.
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/8668333493

Apr 12 '24 21:04 github-actions[bot]

@github-actions crossbow submit -g cpp

Apr 12 '24 22:04 amoeba

Revision: 911794535bb22b6012a2de8d262c37384324017e

Submitted crossbow builds: ursacomputing/crossbow @ actions-e9bd83c5da

Task	Status
test-alpine-linux-cpp
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp
test-debian-11-cpp-amd64
test-debian-11-cpp-i386
test-fedora-39-cpp
test-ubuntu-20.04-cpp
test-ubuntu-20.04-cpp-bundled
test-ubuntu-20.04-cpp-minimal-with-formats
test-ubuntu-20.04-cpp-thread-sanitizer
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-gcc-14

Apr 12 '24 22:04 github-actions[bot]

Things look good locally (testing on macOS-aarch64-clang and Linux-amd64-gcc) but I see an issue with my includes in a few of the above jobs. I'll fix those soon.

Apr 13 '24 01:04 amoeba

@github-actions crossbow submit -g cpp

May 01 '24 01:05 amoeba

Revision: c3af79dbaabf59cd61fee65cdd9e47a47a9bdd62

Submitted crossbow builds: ursacomputing/crossbow @ actions-08f54365f0

Task	Status
test-alpine-linux-cpp
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp
test-debian-11-cpp-amd64
test-debian-11-cpp-i386
test-fedora-39-cpp
test-ubuntu-20.04-cpp
test-ubuntu-20.04-cpp-bundled
test-ubuntu-20.04-cpp-minimal-with-formats
test-ubuntu-20.04-cpp-thread-sanitizer
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-gcc-14

May 01 '24 01:05 github-actions[bot]

I've pushed up a fix to address the includes issue and the crossbow jobs looked clear (failures are unrelated and the test we expect to pass does indeed pass).

I've marked this as ready for review. @pitrou do you have time to review this? cc @kou

May 01 '24 01:05 amoeba

Thanks for taking a look @kou, I accepted all your changes. I feel pretty good about the state of this PR at this point and am not sure we need another review. Let me know what you think.

May 01 '24 21:05 amoeba

@github-actions crossbow submit -g cpp

May 01 '24 21:05 amoeba

I'll let CI and the crossbow jobs run then merge if all looks good.

May 01 '24 21:05 amoeba

Revision: 4a81743474b50fdfbb9df50724f8a921059f0252

Submitted crossbow builds: ursacomputing/crossbow @ actions-20acdf73a0

Task	Status
test-alpine-linux-cpp
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp
test-debian-11-cpp-amd64
test-debian-11-cpp-i386
test-fedora-39-cpp
test-ubuntu-20.04-cpp
test-ubuntu-20.04-cpp-bundled
test-ubuntu-20.04-cpp-minimal-with-formats
test-ubuntu-20.04-cpp-thread-sanitizer
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-gcc-14

May 01 '24 21:05 github-actions[bot]

Thanks @pitrou for the review. I rebased from apache/main, accepted your three changes as-is, and added one extra change I caught while dealing with conflicts on rebase, 4bd0339e596d410bd5ad3a73f06d9294c7a87e7f. In https://github.com/apache/arrow/commit/4c8eff2ec8bae76364fcb28ffac574c1df262705 @kou tweaked the CMakeList and I think this PR should make the same change as was made there.

PS: After pushing I found the ipc/CMakeLists.txt needed formatting, so that was added here too.

May 16 '24 00:05 amoeba

Parquet tests were failing all over on CI and due to needing to bump submodules so I pushed 4ebe53bd21fd1f28ce6babebf6802488283c692c so we could see CI.

May 16 '24 03:05 amoeba

@github-actions crossbow submit -g cpp

May 16 '24 06:05 pitrou

Revision: 4ebe53bd21fd1f28ce6babebf6802488283c692c

Submitted crossbow builds: ursacomputing/crossbow @ actions-d1985bc136

Task	Status
test-alpine-linux-cpp
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp
test-debian-12-cpp-amd64
test-debian-12-cpp-i386
test-fedora-39-cpp
test-ubuntu-20.04-cpp
test-ubuntu-20.04-cpp-bundled
test-ubuntu-20.04-cpp-minimal-with-formats
test-ubuntu-20.04-cpp-thread-sanitizer
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-emscripten
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-gcc-14

May 16 '24 06:05 github-actions[bot]

Thanks all. The submodule change to help CI pass has been reverted in this PR. I'm merging this now.

May 16 '24 23:05 amoeba

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 74f7578f77adca6b0fd79f7d37e28941330221eb.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them.

May 17 '24 02:05 conbench-apache-arrow[bot]