arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-32276: [C++][FlightRPC] Align RecordBatch buffers given to IPC

Open EnricoMi opened this issue 1 year ago • 3 comments

Rationale for this change

Data retrieved via IPC is expected to provide memory-aligned arrays, but data retrieved via C++ Flight client is mis-aligned. Datafusion (Rust), which requires proper alignment, cannot handle such data: #43552.

What changes are included in this PR?

This aligns RecordBatch array buffers decoded by IPC if mis-aligned according to the data type byte width. Implementation mirrors that of align_buffers in arrow-rs (https://github.com/apache/arrow-rs/pull/4681).

Are these changes tested?

Configuration flag tested in unit test. Manually end-to-end tested that memory alignment fixes issue with reproduction code provided in #43552.

Are there any user-facing changes?

Memory alignment is checked and fixed by default. This is configurable via IpcReadOptions.ensure_memory_alignment.

  • GitHub Issue: #32276

EnricoMi avatar Oct 01 '24 08:10 EnricoMi

@pitrou do you think this fix is viable?

EnricoMi avatar Oct 02 '24 05:10 EnricoMi

While attempting to write some unit tests I found there is util::EnsureAlignment: https://github.com/apache/arrow/blob/e62fbaafd129931b1c217fcaa1b4c254087ab289/cpp/src/arrow/util/align_util.cc#L169-L205

I will try to reuse that method rather than re-implementing it. There is also test infrastructure for misaligned array data.

EnricoMi avatar Oct 07 '24 12:10 EnricoMi

Test arrow-ipc-read-write-test fails with a SIGSEGV in line 45 of https://github.com/apache/arrow/blob/bcb4653c6387a2b22df52a3bbc91317607abdccc/cpp/src/arrow/util/align_util.cc#L44-L52 https://github.com/apache/arrow/actions/runs/11462607112/job/31894398411?pr=44279#step:13:1548

That test complains a lot about /home/enrico/Work/git/arrow/cpp/src/arrow/status.cc:152: Invalid: RequiredValueAlignmentForBuffer called with invalid type id 29 due to RequiredValueAlignmentForBuffer being called for DICTIONARY data type in line 62: https://github.com/apache/arrow/blob/bcb4653c6387a2b22df52a3bbc91317607abdccc/cpp/src/arrow/util/align_util.cc#L56-L76

Looks like util::EnsureAlignment and util::CheckAlignment is not quite ready for this use case.

EnricoMi avatar Oct 22 '24 15:10 EnricoMi

@westonpace @sanjibansg Any pointers in which situations array.type can be null (causing GetDtypeForBuffers to segfault)? Any rational on why RequiredValueAlignmentForBuffer does not support DICTIONARY data type? https://github.com/apache/arrow/pull/44279#issuecomment-2429582435

EnricoMi avatar Oct 26 '24 09:10 EnricoMi

Problem was that tests define an ExtensionType with storage type DICTIONARY, which caused an invalid cast, returning an empty pointer that was dereferenced. That has been fixes in a1ad4da.

This means that user code that defines ExtensionTypes with other storage types than DICTIONARY may see unexpected behaviour. What would be a good default behaviour in GetTypeForBuffer? Return Type::UINT8 to not realign those unknown data types? Or how can we tell from an unknown ExtensionTypes what to return by GetTypeForBuffers?

EnricoMi avatar Dec 03 '24 16:12 EnricoMi

I'm not sure I follow. An extension type should be treated the same as its storage type. I think the problem with the current code is that it blindly casts the array type to the storage type, not accounting for the fact that the point of the storage type is to allow the extension type.

lidavidm avatar Dec 06 '24 08:12 lidavidm

Any non-dictionary type (including extension types) therefore should be covered with array.type->storage_id(). Any dictionary extension types are expected to provide a DictionaryType via array.type->storage_type(), so they are covered with array.type->storage_type()->index_type()->id().

Than that fix should be sound and safe. https://github.com/apache/arrow/blob/7a010294bf0b2436f7302a847ddcdcc2bf912898/cpp/src/arrow/util/align_util.cc#L47-L59

EnricoMi avatar Dec 06 '24 09:12 EnricoMi

@lidavidm are you happy with https://github.com/apache/arrow/pull/44279#issuecomment-2522603956?

EnricoMi avatar Feb 11 '25 08:02 EnricoMi

Oops, that's right, you can't rebind a const reference...

lidavidm avatar Feb 14 '25 23:02 lidavidm

@lidavidm reverted

EnricoMi avatar Mar 07 '25 06:03 EnricoMi

Sorry, I think you might want to rebase again for the CI failures

lidavidm avatar Mar 07 '25 06:03 lidavidm

@pitrou are we okay with the new IPC option?

lidavidm avatar Mar 07 '25 06:03 lidavidm

@lidavidm this change uses util::EnsureAlignment, which errors for non-CPU buffers: https://github.com/apache/arrow/pull/44279/files#diff-fbfb27689f69623e457521d861b3a39557fa99394b47c57e4143da5b1aa575bfR162

This will break existing code. It would require users to set

FlightCallOptions.read_options.ensure_memory_alignment = False

Should this re-alignment silently be skipped for non-CPU buffers?

This would be the best place to skip re-alignment, but we the batch at this point, not individual buffers: https://github.com/apache/arrow/pull/44279/files#diff-e992169684aea9845ac776ada4cbb2b5dc711b49e5a3fbc6046c92299e1aefceR642-R645

EnricoMi avatar Mar 07 '25 06:03 EnricoMi

Can we make the option disabled by default?

lidavidm avatar Mar 07 '25 07:03 lidavidm

Sure, but the Arrow spec enforces alignment for IPC:

Implementations are recommended to allocate memory on aligned addresses (multiple of 8- or 64-bytes) and
pad (overallocate) to a length that is a multiple of 8 or 64 bytes. When serializing Arrow data for
interprocess communication, these alignment and padding requirements are enforced. 

https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding

EnricoMi avatar Mar 07 '25 07:03 EnricoMi

As it states, that's a recommendation. (Serialized data is aligned, but just because the internal structure is aligned doesn't mean that it will remain aligned.)

lidavidm avatar Mar 07 '25 07:03 lidavidm

I have changed the default to false and added this to the docs. Merged latest main branch and updated the PR description and title.

EnricoMi avatar Mar 07 '25 08:03 EnricoMi

@lidavidm rebased to latest main, is this ready to go?

EnricoMi avatar Mar 25 '25 20:03 EnricoMi

I'm waiting for a second pair of eyes since this adds a new IPC option (which doesn't seem to be forthcoming, unfortunately...)

lidavidm avatar Mar 27 '25 05:03 lidavidm

How about sending an e-mail to dev@ before we merge this? We can wait a few days before we merge this.

Sure, hope we can get this in before the feature freeze next week.

EnricoMi avatar Mar 27 '25 08:03 EnricoMi

By the way, do we want to make this option an enum to allow for more options in the future? For example:

enum Alignment {
  // "natural" or "native", not sure which one is better here
  kNoAlignment, kNaturalAlignment
};

Alignment ensure_alignment = kNoAlignment;

and perhaps one day we want to add other alternatives:

enum Alignment {
  kNoAlignment, kNaturalAlignment, k64ByteAlignment
};

(unfortunately C++ doesn't have rich enums like Rust has)

pitrou avatar Mar 27 '25 08:03 pitrou

This is the [email protected] thread for this pull request: https://lists.apache.org/thread/sqg37ff3gh0f2voxyjvggtcjk50yvyqj

EnricoMi avatar Mar 27 '25 09:03 EnricoMi

By the way, do we want to make this option an enum to allow for more options in the future?

I'd suggest these constants:

  • kAnyAlignment or kNoSpecificAlignment
  • kDataTypeSpecificAlignment or kDataTypeAlignment
  • k64ByteAlignment

EnricoMi avatar Mar 27 '25 10:03 EnricoMi

I have also added the 64-byte alignment, as it was easy to add. fda0101fd5f00e36e3e7572c6fcb468e4f66aebe

EnricoMi avatar Mar 27 '25 20:03 EnricoMi

@pitrou I have addressed your comments

EnricoMi avatar May 05 '25 04:05 EnricoMi

@pitrou all comments addressed

EnricoMi avatar May 08 '25 08:05 EnricoMi

Thanks @EnricoMi . This looks good to me except for the nit above.

pitrou avatar May 15 '25 15:05 pitrou

Thanks @EnricoMi!

lidavidm avatar May 19 '25 23:05 lidavidm

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f45594d1f9f3dd1ed4f6b11a30d9685bf3a87dc7.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 15 possible false positives for unstable benchmarks that are known to sometimes produce them.