
Proposal: Add general purpose host memory allocator reference to cuIO with a demo of pooled-pinned allocation.


Addresses https://github.com/rapidsai/cudf/issues/14314

This PR adds a new interface to cuIO that controls where host memory allocations come from. It adds two core functions:

void set_current_host_memory_resource(cudf::host_resource_ref mr);
cudf::host_resource_ref get_current_host_memory_resource();

cudf::host_resource_ref is a cuda::mr::resource_ref and is defined in cudf/utilities/resource_ref.hpp along with an async version.

using host_resource_ref       = cuda::mr::resource_ref<cuda::mr::host_accessible>;
using async_host_resource_ref = cuda::mr::async_resource_ref<cuda::mr::host_accessible>;
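For illustration, here is a minimal standalone sketch of the get/set pattern these two functions follow. It deliberately uses a plain abstract interface in place of cuda::mr::resource_ref, and all names here are stand-ins, not the actual cudf implementation:

```cpp
#include <cstddef>

// Stand-in for a host-accessible memory resource; the real code uses
// cuda::mr::resource_ref<cuda::mr::host_accessible>.
struct host_memory_resource {
  virtual void* allocate(std::size_t bytes)            = 0;
  virtual void deallocate(void* p, std::size_t bytes)  = 0;
  virtual ~host_memory_resource()                      = default;
};

// Default resource backed by plain new/delete, loosely analogous to
// rmm::mr::new_delete_resource (the "pageable" option).
struct new_delete_resource final : host_memory_resource {
  void* allocate(std::size_t bytes) override { return ::operator new(bytes); }
  void deallocate(void* p, std::size_t) override { ::operator delete(p); }
};

// Process-wide "current" resource slot, mirroring the shape of
// set_current_host_memory_resource / get_current_host_memory_resource.
inline host_memory_resource*& current_resource_slot() {
  static new_delete_resource default_mr;
  static host_memory_resource* current = &default_mr;
  return current;
}

host_memory_resource* get_current_host_memory_resource() {
  return current_resource_slot();
}

host_memory_resource* set_current_host_memory_resource(host_memory_resource* mr) {
  auto*& slot = current_resource_slot();
  auto* old   = slot;
  slot        = mr;
  return old;  // hand back the previous resource so callers can restore it
}
```

With this shape, any code path that needs host staging memory asks the current resource for its allocations, and a caller can swap in a pooled or pinned resource process-wide without threading a parameter through every function.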

cudf::io::hostdevice_vector is currently implemented in terms of a thrust::host_vector<> that explicitly uses an allocator called pinned_host_vector. I copied that and made a new class called rmm_host_vector, which takes any host_resource_ref. This probably makes pinned_host_vector obsolete.

Names and locations for all of these things are up for debate. This is just a demo.

Parquet benchmarks have a new commandline option which lets you toggle between 3 modes:

--cuio_host_mem pinned              (the default, an unpooled, pinned memory source)
--cuio_host_mem pinned_pool         (the pooled/pinned resource)
--cuio_host_mem pageable            (plain pageable memory, via rmm::mr::new_delete_resource)

The ultimate intent here is to reduce the CPU-side overhead of the setup code that runs before the decode kernels in the parquet reader. The wins are pretty significant for our faster kernels (that is, where we are less dominated by GPU time).

Pinned

| data_type |    io_type    | cardinality | run_length | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|---------------|-------------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
|  INTEGRAL | DEVICE_BUFFER |           0 |          1 |     25x | 20.443 ms | 0.45% | 20.438 ms | 0.45% |      26268890178 |         1.072 GiB |       498.123 MiB |
|  INTEGRAL | DEVICE_BUFFER |        1000 |          1 |     26x | 19.571 ms | 0.42% | 19.565 ms | 0.42% |      27440146729 |       756.210 MiB |       161.438 MiB |
|  INTEGRAL | DEVICE_BUFFER |           0 |         32 |     28x | 18.150 ms | 0.18% | 18.145 ms | 0.18% |      29587789525 |       602.424 MiB |        27.720 MiB |
|  INTEGRAL | DEVICE_BUFFER |        1000 |         32 |     29x | 17.306 ms | 0.37% | 17.300 ms | 0.37% |      31032523423 |       597.181 MiB |        14.403 MiB |

Pooled/pinned

| data_type |    io_type    | cardinality | run_length | Samples | CPU Time  | Noise | GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|---------------|-------------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
|  INTEGRAL | DEVICE_BUFFER |           0 |          1 |    117x | 17.258 ms | 0.50% | 17.254 ms | 0.50% |      31115706389 |         1.072 GiB |       498.123 MiB |
|  INTEGRAL | DEVICE_BUFFER |        1000 |          1 |     31x | 16.413 ms | 0.43% | 16.408 ms | 0.43% |      32719609450 |       756.210 MiB |       161.438 MiB |
|  INTEGRAL | DEVICE_BUFFER |           0 |         32 |    576x | 14.885 ms | 0.58% | 14.881 ms | 0.58% |      36077859564 |       602.519 MiB |        27.720 MiB |
|  INTEGRAL | DEVICE_BUFFER |        1000 |         32 |     36x | 14.069 ms | 0.48% | 14.065 ms | 0.48% |      38171646940 |       597.243 MiB |        14.403 MiB |

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [ ] New or existing tests cover these changes.
  • [ ] The documentation is up to date with these changes.

nvdbaranec avatar Feb 16 '24 21:02 nvdbaranec

Nice work @nvdbaranec !!

GregoryKimball avatar Feb 16 '24 23:02 GregoryKimball

I like this ability. My only question is whether we should follow the current pattern of an optional memory resource passed into functions, or add this as a set/get.

table_with_metadata read_parquet(
  parquet_reader_options const& options,
  rmm::cuda_stream_view stream        = cudf::get_default_stream(),
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

Maybe this becomes:

table_with_metadata read_parquet(
  parquet_reader_options const& options,
  rmm::cuda_stream_view stream        = cudf::get_default_stream(),
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource(),
  cudf::host_resource_ref* host_mr    = rmm::mr::pinned_memory_resource);

I don't know where all this applies, or how much trouble passing it through would be.

hyperbolic2346 avatar Feb 20 '24 19:02 hyperbolic2346

/merge

nvdbaranec avatar Mar 07 '24 22:03 nvdbaranec