[C++] Make Flatbuffers serialization more deterministic
Describe the enhancement requested
In https://github.com/apache/arrow/issues/40202#issuecomment-1978589865 it was determined that Flatbuffers serialization of a Arrow schema did not always result in the same binary encoding.
A changeset in Flatbuffers led us to a likely explanation, as there's a place in our code where serialization of strings depends on argument evaluation order: https://github.com/apache/arrow/blob/3ba6d286caad328b8572a3b9228045da8c8d2043/cpp/src/arrow/ipc/metadata_internal.cc#L478-L481
Binary inspection of the data files provided in that issue seems to confirm that hypothesis.
This is obviously a very minor issue, but should also be easy to fix.
Component(s)
C++
cc @felipecrv
Any thoughts on how one could add a test when submitting a PR for this? I'm not much of a C++ programmer but I know enough to follow the change from flatbuffers and attempt to apply it. A test, however requires a way to trigger different evaluation order in the current code that would be corrected in the change.
I think a regression test would be enough, so long as you can reproduce the non-determinism in that test before making the code change.
What I mean is that I'm not sure how to reproduce the non-determinism in the original. In my understanding, the order is actually determined at compile-time and could differ across compilers or platforms.
Right. There might be a better way than this but adding a test that exercises SchemaToFlatbuffer and asserts the resulting Flatbuffer schema is byte-identical to some value you get, that test should fail on CI one one or more platforms. You could put up a draft PR and let CI exercise it to save you work. @felipecrv will probably have a better idea though.
Any thoughts on how one could add a test when submitting a PR for this? I'm not much of a C++ programmer but I know enough to follow the change from flatbuffers and attempt to apply it. A test, however requires a way to trigger different evaluation order in the current code that would be corrected in the change.
I wouldn't worry to much about adding a test. The non-determinism here doesn't even lead to bugs. But it creates non-determinism in output that can make testing and file comparisons harder. In a way, by fixing this you are already contributing to testability itself.
I disagree and think we should try to add a test for this, if only to validate that we are actually fixing something.
Issue resolved by pull request 40392 https://github.com/apache/arrow/pull/40392
This has been merged. Thanks to those who reviewed and extra thanks to @noamross for contributing the initial PR.