GH-39858: [C++][Device] Add Copy/View slice functions to MemoryManager
### Rationale for this change
Currently, MemoryManager objects only define functionality to copy or view entire buffers. Occasionally there is a need to copy just a single value or slice from a buffer (see https://github.com/apache/arrow/pull/39770#discussion_r1470135438). It's overkill to do a series of whole-Buffer operations and manual slicing just to copy 4 or 8 bytes.
Instead, we can add new methods such as `CopyBufferSlice` to the MemoryManager interface.
### What changes are included in this PR?
Add the `MemoryManager::CopyBufferSlice` function, which first attempts to use memcpy for the specified slice. If that is not possible, it falls back to viewing or copying the entire buffer and then copying the slice (a hedged usage sketch follows below).
Update `ArrayImporter::ImportStringValuesBuffer` to use this function.
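As a rough illustration of the intended call pattern, here is a minimal sketch; the method name `CopyBufferSlice`, its signature, and the `LastOffset` helper are assumptions for illustration, not the merged API:

```cpp
#include <cstdint>
#include <memory>

#include "arrow/buffer.h"
#include "arrow/device.h"
#include "arrow/status.h"

// Hypothetical helper: copy the last 4-byte offset of a (possibly non-CPU)
// offsets buffer into a stack variable, letting the buffer's MemoryManager
// pick the cheapest strategy (a plain memcpy when the data is already on CPU).
arrow::Status LastOffset(const std::shared_ptr<arrow::Buffer>& offsets_buf,
                         int64_t num_offsets, int32_t* out) {
  return offsets_buf->memory_manager()->CopyBufferSlice(
      offsets_buf, /*offset=*/(num_offsets - 1) * sizeof(int32_t),
      /*length=*/sizeof(int32_t), /*out=*/out);
}
```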
### Are these changes tested?
`ArrayImporter::ImportStringValuesBuffer` is tested as part of `arrow-c-bridge-test`.
- GitHub Issue: #39858
Hi @zeroshade, I've had a go at adding these MemoryManager methods; just a few questions:
- I've added slice versions of all the copy/view methods in MemoryManager; I'm not sure if this is more than requested?
- Should the existing copy/view methods in CPUMemoryManager call these slice versions with `offset = 0` and `length = buf.size()` to reuse some code?
- I should probably add a test for all of these; just wondering where that should live?
Thanks!
When I wrote:

> Perhaps copying a slice of an `address()` should be a MemoryManager operation.
I was wishing for an API that didn't always have to allocate a unique_ptr or shared_ptr just to copy 4 or 8 bytes from a CPU or non-CPU Buffer to a regular CPU memory pointer. It would lower to a plain memcpy if both source and destination are CPU memory managers.
A combinatorial explosion of functions to support on every MemoryManager is not desirable:

- `CopyBufferSliceFrom`
- `CopyBufferSliceTo`
- `CopyNonOwnedSliceFrom`
- `CopyNonOwnedSliceTo`
- `ViewBufferSliceFrom`
- `ViewBufferSliceTo`
We just need copying, since it's meant to be used for small slices. Most operations should be based on whole buffers; processing item by item in a columnar/vectorized system is not the way to go.
The motivation for this API was loading a single offset (the last one) from a list array to determine the length of its inner values array.
@alanstoate what do you think of @felipecrv's feedback and refactoring a bit based on his suggestions?
@zeroshade @felipecrv thanks for the feedback and happy to refactor a bit
So if I'm understanding correctly, we want a `CopyBufferSlice` that uses the MemoryManager to copy from a buffer into a raw pointer?
That's my understanding. @felipecrv, does that sound about right to you?
Yes. You would give the length and a CPU memory pointer to the MemoryManager, and it would decide the best way to copy (always copy) the data.
It is a virtual function, so the default implementation can be the code that @zeroshade wrote to copy those 4 or 8 bytes, while the CPU memory manager can have a much simpler override using memcpy.
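A minimal sketch of that shape, assuming a `Status CopyBufferSliceFrom(buf, offset, length, out)` signature (the actual names and signatures in the PR may differ; the classes below stand in for the real Arrow ones):

```cpp
#include <cstring>
#include <memory>

#include "arrow/buffer.h"
#include "arrow/device.h"
#include "arrow/result.h"
#include "arrow/status.h"

class MemoryManager {
 public:
  virtual ~MemoryManager() = default;

  // Default implementation: materialize a CPU-accessible view (or copy) of
  // the whole buffer, then memcpy the requested slice into `out`.
  virtual arrow::Status CopyBufferSliceFrom(
      const std::shared_ptr<arrow::Buffer>& buf, int64_t offset,
      int64_t length, void* out) {
    ARROW_ASSIGN_OR_RAISE(
        auto cpu_buf,
        arrow::Buffer::ViewOrCopy(buf, arrow::default_cpu_memory_manager()));
    std::memcpy(out, cpu_buf->data() + offset, static_cast<size_t>(length));
    return arrow::Status::OK();
  }
};

class CPUMemoryManager : public MemoryManager {
 public:
  // CPU memory is directly addressable, so no intermediate buffer is needed.
  arrow::Status CopyBufferSliceFrom(const std::shared_ptr<arrow::Buffer>& buf,
                                    int64_t offset, int64_t length,
                                    void* out) override {
    std::memcpy(out, buf->data() + offset, static_cast<size_t>(length));
    return arrow::Status::OK();
  }
};
```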
@felipecrv I've had another go with just the `CopyBufferSliceFrom` function.
Not too sure about the interface, especially copying into a `void*`, but I'm not really sure how else to copy one or more `OffsetType` values (or another buffer item type).
I don't really understand what the purpose of this PR is. It claims that slicing a Buffer is overkill, but I'm a bit skeptical: are there really situations where the overhead would be unacceptable?
In particular, the `CopyBufferSlice` implementation always calls `ViewBufferTo` first, then falls back to `CopyBufferTo` (both of which instantiate a `std::shared_ptr<Buffer>`), and then always finishes with a memcpy. If the data length is small, this is still considerable overhead. If the data length is large, then we're doing a spurious copy compared to `ViewOrCopy`...
The goal is to have a definitive function that can copy bytes from any device buffer to the CPU by delegating to the MemoryManager, which should know the best way to do it. Having to write the code that @zeroshade wrote in bridge.cc to copy 4 or 8 bytes from a Buffer was very annoying. Without getting into the details of what is fast and what is slow, I think there should be a helper for doing this:
```cpp
if (device_type_ == DeviceAllocationType::kCPU) {
  auto offsets = data_->GetValues<OffsetType>(offsets_buffer_id);
  // Compute visible size of buffer
  int64_t buffer_size =
      (c_struct_->length > 0) ? byte_width * offsets[c_struct_->length] : 0;
  return ImportBuffer(buffer_id, buffer_size);
}
// we only need the value of the last offset so let's just copy that
// one value from device to host.
auto single_value_buf =
    SliceBuffer(data_->buffers[offsets_buffer_id],
                c_struct_->length * sizeof(OffsetType), sizeof(OffsetType));
ARROW_ASSIGN_OR_RAISE(
    auto cpubuf, Buffer::ViewOrCopy(single_value_buf, default_cpu_memory_manager()));
auto offsets = cpubuf->data_as<OffsetType>();
// Compute visible size of buffer
int64_t buffer_size = (c_struct_->length > 0) ? byte_width * offsets[0] : 0;
```
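For contrast, here is a hedged sketch of what the same logic might look like with the proposed helper; `CopyBufferSliceFrom` and its signature are assumed from the discussion above, not quoted from the final code:

```cpp
// Copy just the last offset from the (possibly device-resident) offsets
// buffer straight into a stack variable; the MemoryManager lowers this to a
// plain memcpy when the buffer already lives on the CPU.
OffsetType last_offset = 0;
if (c_struct_->length > 0) {
  const auto& offsets_buf = data_->buffers[offsets_buffer_id];
  RETURN_NOT_OK(offsets_buf->memory_manager()->CopyBufferSliceFrom(
      offsets_buf, c_struct_->length * sizeof(OffsetType), sizeof(OffsetType),
      &last_offset));
}
// Compute visible size of buffer
int64_t buffer_size = (c_struct_->length > 0) ? byte_width * last_offset : 0;
return ImportBuffer(buffer_id, buffer_size);
```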
@alanstoate don't forget to update the PR description to reflect the final state of the code.
@github-actions crossbow submit test-cuda*
Revision: 8293e79ca6410720672e74ba0da81333e880f358
Submitted crossbow builds: ursacomputing/crossbow @ actions-aae53c830d
| Task | Status |
|---|---|
| test-cuda-cpp | |
| test-cuda-python | |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit ecd769c3e9dade8dcb381fba7c2ac059d46a3a17.
There were 15 benchmark results indicating a performance regression:
- Commit run on `ec2-m5-4xlarge-us-east-2` at 2024-05-23 18:29:38Z
- and 13 more (see the report linked below)
The full Conbench report has more details. It also includes information about 202 possible false positives for unstable benchmarks that are known to sometimes produce them.