arrow icon indicating copy to clipboard operation
arrow copied to clipboard

[C++] Make Flatbuffers serialization more deterministic

Open pitrou opened this issue 1 year ago • 7 comments

Describe the enhancement requested

In https://github.com/apache/arrow/issues/40202#issuecomment-1978589865 it was determined that Flatbuffers serialization of a Arrow schema did not always result in the same binary encoding.

A changeset in Flatbuffers led us to a likely explanation, as there's a place in our code where serialization of strings depends on argument evaluation order: https://github.com/apache/arrow/blob/3ba6d286caad328b8572a3b9228045da8c8d2043/cpp/src/arrow/ipc/metadata_internal.cc#L478-L481

Binary inspection of the data files provided in that issue seems to confirm that hypothesis.

This is obviously a very minor issue, but should also be easy to fix.

Component(s)

C++

pitrou avatar Mar 05 '24 12:03 pitrou

cc @felipecrv

pitrou avatar Mar 05 '24 12:03 pitrou

Any thoughts on how one could add a test when submitting a PR for this? I'm not much of a C++ programmer but I know enough to follow the change from flatbuffers and attempt to apply it. A test, however requires a way to trigger different evaluation order in the current code that would be corrected in the change.

noamross avatar Mar 06 '24 22:03 noamross

I think a regression test would be enough, so long as you can reproduce the non-determinism in that test before making the code change.

amoeba avatar Mar 06 '24 23:03 amoeba

What I mean is that I'm not sure how to reproduce the non-determinism in the original. In my understanding, the order is actually determined at compile-time and could differ across compilers or platforms.

noamross avatar Mar 06 '24 23:03 noamross

Right. There might be a better way than this but adding a test that exercises SchemaToFlatbuffer and asserts the resulting Flatbuffer schema is byte-identical to some value you get, that test should fail on CI one one or more platforms. You could put up a draft PR and let CI exercise it to save you work. @felipecrv will probably have a better idea though.

amoeba avatar Mar 06 '24 23:03 amoeba

Any thoughts on how one could add a test when submitting a PR for this? I'm not much of a C++ programmer but I know enough to follow the change from flatbuffers and attempt to apply it. A test, however requires a way to trigger different evaluation order in the current code that would be corrected in the change.

I wouldn't worry to much about adding a test. The non-determinism here doesn't even lead to bugs. But it creates non-determinism in output that can make testing and file comparisons harder. In a way, by fixing this you are already contributing to testability itself.

felipecrv avatar Mar 07 '24 23:03 felipecrv

I disagree and think we should try to add a test for this, if only to validate that we are actually fixing something.

pitrou avatar Mar 07 '24 23:03 pitrou

Issue resolved by pull request 40392 https://github.com/apache/arrow/pull/40392

amoeba avatar May 16 '24 23:05 amoeba

This has been merged. Thanks to those who reviewed and extra thanks to @noamross for contributing the initial PR.

amoeba avatar May 16 '24 23:05 amoeba