GH-32276: [C++][FlightRPC] Align RecordBatch buffers given to IPC
Rationale for this change
Data retrieved via IPC is expected to provide memory-aligned arrays, but data retrieved via C++ Flight client is mis-aligned. Datafusion (Rust), which requires proper alignment, cannot handle such data: #43552.
What changes are included in this PR?
This aligns RecordBatch array buffers decoded by IPC if mis-aligned according to the data type byte width.
Implementation mirrors that of align_buffers in arrow-rs (https://github.com/apache/arrow-rs/pull/4681).
Are these changes tested?
Configuration flag tested in unit test. Manually end-to-end tested that memory alignment fixes issue with reproduction code provided in #43552.
Are there any user-facing changes?
Memory alignment is checked and fixed by default. This is configurable via IpcReadOptions.ensure_memory_alignment.
- GitHub Issue: #32276
@pitrou do you think this fix is viable?
While attempting to write some unit tests I found there is util::EnsureAlignment: https://github.com/apache/arrow/blob/e62fbaafd129931b1c217fcaa1b4c254087ab289/cpp/src/arrow/util/align_util.cc#L169-L205
I will try to reuse that method rather than re-implementing it. There is also test infrastructure for misaligned array data.
Test arrow-ipc-read-write-test fails with a SIGSEGV in line 45 of https://github.com/apache/arrow/blob/bcb4653c6387a2b22df52a3bbc91317607abdccc/cpp/src/arrow/util/align_util.cc#L44-L52
https://github.com/apache/arrow/actions/runs/11462607112/job/31894398411?pr=44279#step:13:1548
That test complains a lot about /home/enrico/Work/git/arrow/cpp/src/arrow/status.cc:152: Invalid: RequiredValueAlignmentForBuffer called with invalid type id 29 due to RequiredValueAlignmentForBuffer being called for DICTIONARY data type in line 62: https://github.com/apache/arrow/blob/bcb4653c6387a2b22df52a3bbc91317607abdccc/cpp/src/arrow/util/align_util.cc#L56-L76
Looks like util::EnsureAlignment and util::CheckAlignment is not quite ready for this use case.
@westonpace @sanjibansg
Any pointers in which situations array.type can be null (causing GetDtypeForBuffers to segfault)?
Any rational on why RequiredValueAlignmentForBuffer does not support DICTIONARY data type?
https://github.com/apache/arrow/pull/44279#issuecomment-2429582435
Problem was that tests define an ExtensionType with storage type DICTIONARY, which caused an invalid cast, returning an empty pointer that was dereferenced. That has been fixes in a1ad4da.
This means that user code that defines ExtensionTypes with other storage types than DICTIONARY may see unexpected behaviour. What would be a good default behaviour in GetTypeForBuffer? Return Type::UINT8 to not realign those unknown data types? Or how can we tell from an unknown ExtensionTypes what to return by GetTypeForBuffers?
I'm not sure I follow. An extension type should be treated the same as its storage type. I think the problem with the current code is that it blindly casts the array type to the storage type, not accounting for the fact that the point of the storage type is to allow the extension type.
Any non-dictionary type (including extension types) therefore should be covered with array.type->storage_id().
Any dictionary extension types are expected to provide a DictionaryType via array.type->storage_type(), so they are covered with array.type->storage_type()->index_type()->id().
Than that fix should be sound and safe. https://github.com/apache/arrow/blob/7a010294bf0b2436f7302a847ddcdcc2bf912898/cpp/src/arrow/util/align_util.cc#L47-L59
@lidavidm are you happy with https://github.com/apache/arrow/pull/44279#issuecomment-2522603956?
Oops, that's right, you can't rebind a const reference...
@lidavidm reverted
Sorry, I think you might want to rebase again for the CI failures
@pitrou are we okay with the new IPC option?
@lidavidm this change uses util::EnsureAlignment, which errors for non-CPU buffers:
https://github.com/apache/arrow/pull/44279/files#diff-fbfb27689f69623e457521d861b3a39557fa99394b47c57e4143da5b1aa575bfR162
This will break existing code. It would require users to set
FlightCallOptions.read_options.ensure_memory_alignment = False
Should this re-alignment silently be skipped for non-CPU buffers?
This would be the best place to skip re-alignment, but we the batch at this point, not individual buffers: https://github.com/apache/arrow/pull/44279/files#diff-e992169684aea9845ac776ada4cbb2b5dc711b49e5a3fbc6046c92299e1aefceR642-R645
Can we make the option disabled by default?
Sure, but the Arrow spec enforces alignment for IPC:
Implementations are recommended to allocate memory on aligned addresses (multiple of 8- or 64-bytes) and
pad (overallocate) to a length that is a multiple of 8 or 64 bytes. When serializing Arrow data for
interprocess communication, these alignment and padding requirements are enforced.
https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding
As it states, that's a recommendation. (Serialized data is aligned, but just because the internal structure is aligned doesn't mean that it will remain aligned.)
I have changed the default to false and added this to the docs. Merged latest main branch and updated the PR description and title.
@lidavidm rebased to latest main, is this ready to go?
I'm waiting for a second pair of eyes since this adds a new IPC option (which doesn't seem to be forthcoming, unfortunately...)
How about sending an e-mail to
dev@before we merge this? We can wait a few days before we merge this.
Sure, hope we can get this in before the feature freeze next week.
By the way, do we want to make this option an enum to allow for more options in the future?
For example:
enum Alignment {
// "natural" or "native", not sure which one is better here
kNoAlignment, kNaturalAlignment
};
Alignment ensure_alignment = kNoAlignment;
and perhaps one day we want to add other alternatives:
enum Alignment {
kNoAlignment, kNaturalAlignment, k64ByteAlignment
};
(unfortunately C++ doesn't have rich enums like Rust has)
This is the [email protected] thread for this pull request: https://lists.apache.org/thread/sqg37ff3gh0f2voxyjvggtcjk50yvyqj
By the way, do we want to make this option an
enumto allow for more options in the future?
I'd suggest these constants:
kAnyAlignmentorkNoSpecificAlignmentkDataTypeSpecificAlignmentorkDataTypeAlignmentk64ByteAlignment
I have also added the 64-byte alignment, as it was easy to add. fda0101fd5f00e36e3e7572c6fcb468e4f66aebe
@pitrou I have addressed your comments
@pitrou all comments addressed
Thanks @EnricoMi . This looks good to me except for the nit above.
Thanks @EnricoMi!
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f45594d1f9f3dd1ed4f6b11a30d9685bf3a87dc7.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details. It also includes information about 15 possible false positives for unstable benchmarks that are known to sometimes produce them.