GH-39858: [C++][Device] Add Copy/View slice functions to MemoryManager
### Rationale for this change
Currently, MemoryManager objects only define functionality to copy or view entire buffers. Occasionally there is a need to copy just a single value or slice from a buffer (see https://github.com/apache/arrow/pull/39770#discussion_r1470135438). It's overkill to do a series of whole-Buffer operations and manual slicing just to copy 4 or 8 bytes.
Instead, we can add new methods such as `CopyBufferSlice` to the MemoryManager interface.
### What changes are included in this PR?
Add the `MemoryManager::CopyBufferSlice` function, which first attempts to use memcpy for the specified slice. If that is not possible, it falls back to viewing or copying the entire buffer and then copying the slice (a hedged usage sketch follows below).
Update `ArrayImporter::ImportStringValuesBuffer` to use this function.
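As a rough illustration of the intended call pattern, here is a minimal sketch; the method name `CopyBufferSlice`, its signature, and the `LastOffset` helper are assumptions for illustration, not the merged API:

```cpp
#include <cstdint>
#include <memory>

#include "arrow/buffer.h"
#include "arrow/device.h"
#include "arrow/status.h"

// Hypothetical helper: copy the last 4-byte offset of a (possibly non-CPU)
// offsets buffer into a stack variable, letting the buffer's MemoryManager
// pick the cheapest strategy (a plain memcpy when the data is already on CPU).
arrow::Status LastOffset(const std::shared_ptr<arrow::Buffer>& offsets_buf,
                         int64_t num_offsets, int32_t* out) {
  return offsets_buf->memory_manager()->CopyBufferSlice(
      offsets_buf, /*offset=*/(num_offsets - 1) * sizeof(int32_t),
      /*length=*/sizeof(int32_t), /*out=*/out);
}
```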
### Are these changes tested?
`ArrayImporter::ImportStringValuesBuffer` is tested as part of `arrow-c-bridge-test`.
- GitHub Issue: #39858
Hi @zeroshade, I've had a go at adding these MemoryManager methods; just a few questions:
- I've added slice versions of all the copy/view methods in MemoryManager; I'm not sure if this is more than requested?
- Should the existing copy/view methods in CPUMemoryManager call these slice versions with `offset = 0` and `length = buf.size()` to reuse some code?
- I should probably add a test for all of these; just wondering where that should live?
Thanks!
When I wrote:

> Perhaps copying a slice of an `address()` should be a MemoryManager operation.
I was wishing for an API that didn't always have to allocate a unique_ptr or shared_ptr just to copy 4 or 8 bytes from a CPU or non-CPU Buffer to a regular CPU memory pointer. It would lower to a plain memcpy if both source and destination are CPU memory managers.
A combinatorial explosion of functions to support on every MemoryManager is not desirable:

- `CopyBufferSliceFrom`
- `CopyBufferSliceTo`
- `CopyNonOwnedSliceFrom`
- `CopyNonOwnedSliceTo`
- `ViewBufferSliceFrom`
- `ViewBufferSliceTo`
We just need copying, since it's meant to be used for small slices. Most operations should be based on whole buffers; processing item by item in a columnar/vectorized system is not the way to go.
The motivation for this API was loading a single offset (the last one) from a list array to determine the length of its inner values array.
@alanstoate what do you think of @felipecrv's feedback and refactoring a bit based on his suggestions?
@zeroshade @felipecrv thanks for the feedback and happy to refactor a bit
So if I'm understanding correctly, we want a `CopyBufferSlice` that uses the MemoryManager to copy from a buffer into a raw pointer?
That's my understanding. @felipecrv, does that sound about right to you?
Yes. You would give the length and a CPU memory pointer to the MemoryManager, and it would decide the best way to copy (always copy) the data.
It is a virtual function, so the default implementation can be the code that @zeroshade wrote to copy those 4 or 8 bytes, while the CPU memory manager can have a much simpler override using memcpy.
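A minimal sketch of that shape, assuming a `Status CopyBufferSliceFrom(buf, offset, length, out)` signature (the actual names and signatures in the PR may differ; the classes below stand in for the real Arrow ones):

```cpp
#include <cstring>
#include <memory>

#include "arrow/buffer.h"
#include "arrow/device.h"
#include "arrow/result.h"
#include "arrow/status.h"

class MemoryManager {
 public:
  virtual ~MemoryManager() = default;

  // Default implementation: materialize a CPU-accessible view (or copy) of
  // the whole buffer, then memcpy the requested slice into `out`.
  virtual arrow::Status CopyBufferSliceFrom(
      const std::shared_ptr<arrow::Buffer>& buf, int64_t offset,
      int64_t length, void* out) {
    ARROW_ASSIGN_OR_RAISE(
        auto cpu_buf,
        arrow::Buffer::ViewOrCopy(buf, arrow::default_cpu_memory_manager()));
    std::memcpy(out, cpu_buf->data() + offset, static_cast<size_t>(length));
    return arrow::Status::OK();
  }
};

class CPUMemoryManager : public MemoryManager {
 public:
  // CPU memory is directly addressable, so no intermediate buffer is needed.
  arrow::Status CopyBufferSliceFrom(const std::shared_ptr<arrow::Buffer>& buf,
                                    int64_t offset, int64_t length,
                                    void* out) override {
    std::memcpy(out, buf->data() + offset, static_cast<size_t>(length));
    return arrow::Status::OK();
  }
};
```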
@felipecrv I've had another go with just the `CopyBufferSliceFrom` function.
Not too sure about the interface, especially copying into a `void*`, but I'm not really sure how else to copy one or more `OffsetType` values (or another buffer item type).
I don't really understand what the purpose of this PR is. It claims that slicing a Buffer is overkill, but I'm a bit skeptical: are there really situations where the overhead would be unacceptable?
In particular, the `CopyBufferSlice` implementation always calls `ViewBufferTo` first, then falls back to `CopyBufferTo` (both of which instantiate a `std::shared_ptr<Buffer>`), and then always finishes with a memcpy. If the data length is small, this is still considerable overhead. If the data length is large, then we're doing a spurious copy compared to `ViewOrCopy`...
The goal is to have a definitive function that can copy bytes from any device buffer to the CPU by delegating to the MemoryManager, which should know the best way to do it. Having to write the code that @zeroshade wrote in bridge.cc to copy 4 or 8 bytes from a Buffer was very annoying. Without getting into the details of what is fast and what is slow, I think there should be a helper for doing this:
```cpp
if (device_type_ == DeviceAllocationType::kCPU) {
  auto offsets = data_->GetValues<OffsetType>(offsets_buffer_id);
  // Compute visible size of buffer
  int64_t buffer_size =
      (c_struct_->length > 0) ? byte_width * offsets[c_struct_->length] : 0;
  return ImportBuffer(buffer_id, buffer_size);
}
// we only need the value of the last offset so let's just copy that
// one value from device to host.
auto single_value_buf =
    SliceBuffer(data_->buffers[offsets_buffer_id],
                c_struct_->length * sizeof(OffsetType), sizeof(OffsetType));
ARROW_ASSIGN_OR_RAISE(
    auto cpubuf, Buffer::ViewOrCopy(single_value_buf, default_cpu_memory_manager()));
auto offsets = cpubuf->data_as<OffsetType>();
// Compute visible size of buffer
int64_t buffer_size = (c_struct_->length > 0) ? byte_width * offsets[0] : 0;
```
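For contrast, here is a hedged sketch of what the same logic might look like with the proposed helper; `CopyBufferSliceFrom` and its signature are assumed from the discussion above, not quoted from the final code:

```cpp
// Copy just the last offset from the (possibly device-resident) offsets
// buffer straight into a stack variable; the MemoryManager lowers this to a
// plain memcpy when the buffer already lives on the CPU.
OffsetType last_offset = 0;
if (c_struct_->length > 0) {
  const auto& offsets_buf = data_->buffers[offsets_buffer_id];
  RETURN_NOT_OK(offsets_buf->memory_manager()->CopyBufferSliceFrom(
      offsets_buf, c_struct_->length * sizeof(OffsetType), sizeof(OffsetType),
      &last_offset));
}
// Compute visible size of buffer
int64_t buffer_size = (c_struct_->length > 0) ? byte_width * last_offset : 0;
return ImportBuffer(buffer_id, buffer_size);
```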
@alanstoate don't forget to update the PR description to reflect the final state of the code.
@github-actions crossbow submit test-cuda*
Revision: 8293e79ca6410720672e74ba0da81333e880f358
Submitted crossbow builds: ursacomputing/crossbow @ actions-aae53c830d
| Task | Status |
|---|---|
| test-cuda-cpp | |
| test-cuda-python | |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit ecd769c3e9dade8dcb381fba7c2ac059d46a3a17.
There were 15 benchmark results indicating a performance regression:
- Commit run on `ec2-m5-4xlarge-us-east-2` at 2024-05-23 18:29:38Z
- and 13 more (see the report linked below)
The full Conbench report has more details. It also includes information about 202 possible false positives for unstable benchmarks that are known to sometimes produce them.