arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-40361: [C++] Make flatbuffers serialization more deterministic

Open noamross opened this issue 1 year ago • 7 comments

Rationale for this change

This is the start of a PR to address #40361, and in turn #40202, to make metadata in parquet files written by arrow to be identical irrespective of the platform configuration. This is limited, as platform-specific differences in R or Python versions or compression libraries could still result in differences.

What changes are included in this PR?

So far I have only made a partial change to part of the metadata serialization. I need to look at whether other calls to flatbuffers require similar treatment.

Are these changes tested?

Not yet, this is a draft PR

Are there any user-facing changes?

No

  • GitHub Issue: #40361

noamross avatar Mar 07 '24 00:03 noamross

:warning: GitHub issue #40361 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Mar 07 '24 00:03 github-actions[bot]

:warning: GitHub issue #40361 has no components, please add labels for components.

github-actions[bot] avatar Mar 07 '24 00:03 github-actions[bot]

:warning: GitHub issue #40361 has no components, please add labels for components.

github-actions[bot] avatar Mar 07 '24 00:03 github-actions[bot]

Thanks @noamross!

@kou can you please approve all workflows to run?

amoeba avatar Mar 07 '24 03:03 amoeba

Approved!

kou avatar Mar 07 '24 04:03 kou

:warning: GitHub issue #40361 has no components, please add labels for components.

github-actions[bot] avatar Mar 07 '24 04:03 github-actions[bot]

As discussed on the GH issue, it would be useful to add a test using output from your machine as a reference. (also, perhaps open a separate PR with just the test to validate that it would fail on CI)

pitrou avatar Mar 08 '24 14:03 pitrou

Hi @noamross, would you be interested in adding the test described above to this PR? If not, I could take a shot at it.

amoeba avatar Apr 02 '24 23:04 amoeba

@amoeba Sorry to open this and abandon it, it just became a very busy month and my C++ is pretty limited. I'd be happy if you would!

noamross avatar Apr 03 '24 12:04 noamross

No problem at all @noamross, it's awesome you filed a PR for the issue and it's very much appreciated. I'll work on adding and testing the test.

amoeba avatar Apr 03 '24 21:04 amoeba

Thanks! I expect the files attached to https://github.com/apache/arrow/issues/40202#issuecomment-1977273164 could serve for a test of whether a generated file is identical to a reference from a different system, but I note that they would change with the R version being used.

noamross avatar Apr 03 '24 21:04 noamross

I put up a draft PR with a test to exercise this and hopefully watch it fail on CI at https://github.com/apache/arrow/pull/41018. It fails on my amd64 linux machine so I think it's close to what we need here. I'll watch CI on that PR and report back.

amoeba avatar Apr 04 '24 23:04 amoeba

@github-actions crossbow submit -g cpp

amoeba avatar Apr 12 '24 21:04 amoeba

Only contributors can submit requests to this bot. Please ask someone from the community for help with getting the first commit in.
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/8668333493

github-actions[bot] avatar Apr 12 '24 21:04 github-actions[bot]

@github-actions crossbow submit -g cpp

amoeba avatar Apr 12 '24 22:04 amoeba

Revision: 911794535bb22b6012a2de8d262c37384324017e

Submitted crossbow builds: ursacomputing/crossbow @ actions-e9bd83c5da

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

github-actions[bot] avatar Apr 12 '24 22:04 github-actions[bot]

Things look good locally (testing on macOS-aarch64-clang and Linux-amd64-gcc) but I see an issue with my includes in a few of the above jobs. I'll fix those soon.

amoeba avatar Apr 13 '24 01:04 amoeba

@github-actions crossbow submit -g cpp

amoeba avatar May 01 '24 01:05 amoeba

Revision: c3af79dbaabf59cd61fee65cdd9e47a47a9bdd62

Submitted crossbow builds: ursacomputing/crossbow @ actions-08f54365f0

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

github-actions[bot] avatar May 01 '24 01:05 github-actions[bot]

I've pushed up a fix to address the includes issue and the crossbow jobs looked clear (failures are unrelated and the test we expect to pass does indeed pass).

I've marked this as ready for review. @pitrou do you have time to review this? cc @kou

amoeba avatar May 01 '24 01:05 amoeba

Thanks for taking a look @kou, I accepted all your changes. I feel pretty good about the state of this PR at this point and am not sure we need another review. Let me know what you think.

amoeba avatar May 01 '24 21:05 amoeba

@github-actions crossbow submit -g cpp

amoeba avatar May 01 '24 21:05 amoeba

I'll let CI and the crossbow jobs run then merge if all looks good.

amoeba avatar May 01 '24 21:05 amoeba

Revision: 4a81743474b50fdfbb9df50724f8a921059f0252

Submitted crossbow builds: ursacomputing/crossbow @ actions-20acdf73a0

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

github-actions[bot] avatar May 01 '24 21:05 github-actions[bot]

Thanks @pitrou for the review. I rebased from apache/main, accepted your three changes as-is, and added one extra change I caught while dealing with conflicts on rebase, 4bd0339e596d410bd5ad3a73f06d9294c7a87e7f. In https://github.com/apache/arrow/commit/4c8eff2ec8bae76364fcb28ffac574c1df262705 @kou tweaked the CMakeList and I think this PR should make the same change as was made there.

PS: After pushing I found the ipc/CMakeLists.txt needed formatting, so that was added here too.

amoeba avatar May 16 '24 00:05 amoeba

Parquet tests were failing all over on CI and due to needing to bump submodules so I pushed 4ebe53bd21fd1f28ce6babebf6802488283c692c so we could see CI.

amoeba avatar May 16 '24 03:05 amoeba

@github-actions crossbow submit -g cpp

pitrou avatar May 16 '24 06:05 pitrou

Revision: 4ebe53bd21fd1f28ce6babebf6802488283c692c

Submitted crossbow builds: ursacomputing/crossbow @ actions-d1985bc136

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

github-actions[bot] avatar May 16 '24 06:05 github-actions[bot]

Thanks all. The submodule change to help CI pass has been reverted in this PR. I'm merging this now.

amoeba avatar May 16 '24 23:05 amoeba

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 74f7578f77adca6b0fd79f7d37e28941330221eb.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them.