
[FEA] Increase the default thread count for kvikIO file reads

GregoryKimball opened this issue 1 year ago · 4 comments

Is your feature request related to a problem? Please describe. For hot-cache files, we can increase the IO throughput with a multi-threaded kvikIO read.

Describe the solution you'd like

We should increase the number of threads used for kvikIO file reads in libcudf. You can see this effect when using kvikIO on a hot-cache file data source (a pageable host buffer pretending to be a file): with 8 threads we reach 80-90% utilization and about 50 GB/s of throughput. I suggest scaling up to 8 threads based on the size of the file and the default task size. For example, with a 4 MiB task size, we might use 1 thread for 0-8 MiB, two threads for 8-16 MiB, and so on up to 8 threads.
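A minimal sketch of that heuristic, assuming the 4 MiB default task size; the function name and constants are hypothetical, not existing libcudf code:

// Hypothetical helper: pick a kvikIO thread count from the read size.
// With a 4 MiB task size this gives 1 thread up to 8 MiB, 2 threads up to
// 16 MiB, and so on, capped at 8 threads.
#include <algorithm>
#include <cstddef>

unsigned int pick_nthreads(std::size_t read_size,
                           std::size_t task_size   = 4u << 20,  // 4 MiB
                           unsigned int max_threads = 8)
{
  // Ceiling-divide by two tasks' worth of data, with at least one thread.
  std::size_t const span    = 2 * task_size;  // 8 MiB per thread
  std::size_t const threads = std::max<std::size_t>(1, (read_size + span - 1) / span);
  return static_cast<unsigned int>(std::min<std::size_t>(threads, max_threads));
}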

We may need to add new plumbing to let us change the kvikIO thread count per read operation.
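Today kvikIO exposes a global thread-pool size rather than a per-read knob, so the plumbing might look roughly like the sketch below. The kvikio::defaults setter and FileHandle::read signature are assumptions about the current kvikIO C++ API, and pick_nthreads is the hypothetical helper above:

#include <kvikio/defaults.hpp>
#include <kvikio/file_handle.hpp>

// Sketch only: resize the global kvikIO thread pool before a read. A real
// per-read option would avoid mutating global state.
std::size_t read_with_scaled_threads(kvikio::FileHandle& file,
                                     void* dev_buf,
                                     std::size_t size,
                                     std::size_t file_offset)
{
  kvikio::defaults::thread_pool_nthreads_reset(pick_nthreads(size));
  return file.read(dev_buf, size, file_offset, 0 /* devPtr_offset */);
}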

[Figure: throughput and utilization vs. kvikIO thread count on a hot-cache file data source]

Additional context A multi-threaded copy would be great for single-threaded tools like cudf.pandas and could make IO operations 3-4x faster.

GregoryKimball avatar Aug 31 '24 21:08 GregoryKimball

I believe that a staged copy strategy with a pinned bounce buffer will be helpful from the standpoint of pinned memory management. We see a similar D2H pattern in shuffle, where today we have to hold on to pinned memory while we write to the file system, which is often CPU bound due to compression, if not disk bound.

I would hope a MT copy strategy could be general, so we can use it for both H2D and D2H, perhaps with independent pinned bounce buffers.
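For illustration, a rough sketch of what a multi-threaded H2D copy staged through a pinned bounce buffer could look like; the D2H direction would mirror it with its own bounce buffer. Names and sizing are illustrative, and a production version would reuse a fixed-size pinned pool and overlap the staging memcpy with the async DMA rather than allocate per call:

#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

void staged_copy_h2d(void* dst_dev, void const* src_pageable, std::size_t size,
                     unsigned int nthreads = 8)
{
  void* bounce = nullptr;
  cudaMallocHost(&bounce, size);  // pinned bounce buffer

  std::vector<cudaStream_t> streams(nthreads);
  for (auto& s : streams) cudaStreamCreate(&s);

  std::size_t const chunk = (size + nthreads - 1) / nthreads;
  std::vector<std::thread> workers;
  for (unsigned int i = 0; i < nthreads; ++i) {
    workers.emplace_back([=, &streams] {
      std::size_t const begin = static_cast<std::size_t>(i) * chunk;
      if (begin >= size) return;
      std::size_t const len = std::min(chunk, size - begin);
      // Stage: pageable -> pinned on the CPU, then pinned -> device via DMA.
      std::memcpy(static_cast<char*>(bounce) + begin,
                  static_cast<char const*>(src_pageable) + begin, len);
      cudaMemcpyAsync(static_cast<char*>(dst_dev) + begin,
                      static_cast<char*>(bounce) + begin, len,
                      cudaMemcpyHostToDevice, streams[i]);
      cudaStreamSynchronize(streams[i]);
    });
  }
  for (auto& w : workers) w.join();

  for (auto& s : streams) cudaStreamDestroy(s);
  cudaFreeHost(bounce);
}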

abellina avatar Sep 03 '24 15:09 abellina

Thank you @abellina for this feedback. @kingcrimsontianyu and I did some benchmarking and found good results from increasing the thread count.

./PARQUET_READER_NVBENCH  -d 0 -b 1 --timeout 1 -a cardinality=0 -a run_length=1
KVIKIO_NTHREADS=8 ./PARQUET_READER_NVBENCH  -d 0 -b 1 --timeout 1 -a cardinality=0 -a run_length=1

For this benchmark and the FILEPATH data source, going from 1 to 8 threads on x86-H100 reduced the time from 141 ms to 74 ms. On GH200 the difference was 100 ms to 67 ms.

[I am opening an issue in kvikIO about MT memcpy and will update here]

GregoryKimball avatar Sep 04 '24 16:09 GregoryKimball

@ayushdg noted that KVIKIO_NTHREADS could also impact performance on other file-like data sources such as network-attached storage, Lustre, Slurm environments, and others.

GregoryKimball avatar Sep 24 '24 20:09 GregoryKimball

In terms of MT D2H/H2D memory copy, are there any callable APIs for real-world applications like spark-rapids?

sperlingxx avatar Oct 11 '24 07:10 sperlingxx

Following up on this: based on internal testing, setting KVIKIO_NTHREADS=8 negatively impacts performance on high-performance network filesystems like Lustre.

ayushdg avatar Oct 21 '24 21:10 ayushdg