Improve parallel POSIX read performance
Not for review
Investigating https://github.com/rapidsai/kvikio/issues/629
Number of subtasks per task
KVIKIO_NUM_SUBTASKS_PER_TASK
(Outdated) Performance checkpoint 926fbb8
- branch-25.04, 72 threads, GH200: This figure shows the task profiles from 4 out of 72 threads. The first task takes 85 ms to complete. cuMemHostAlloc takes 1 ms per thread.
- This PR, 72 threads, GH200: Now the first task takes 79 ms. cuMemHostAlloc takes 27 ms per thread.
Conclusion: a small reduction in the latency spike, from 85 ms to 79 ms. It may be worth having a page-locked memory pool, or at least pre-allocating nthreads * task_size page-locked memory block.
It may be worth having a page-locked memory pool, or at least pre-allocating nthreads * task_size page-locked memory block.
How much memory would be required for this? Can we get away with 2*nthreads*subtask_size, with some form of double-buffering?
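To put rough numbers on that question, here is a back-of-the-envelope sketch comparing the two pre-allocation sizes. Only the 72 threads come from the benchmark above; the 4 MiB task size and 4 subtasks per task are illustrative assumptions, and `pinned_pool_bytes` is a hypothetical helper, not part of KvikIO.

```python
MiB = 1024 * 1024


def pinned_pool_bytes(nthreads, task_size, num_subtasks_per_task):
    """Return (full, double_buffered) page-locked memory requirements in bytes.

    full            = nthreads * task_size
    double_buffered = 2 * nthreads * subtask_size
    """
    # Ceiling division, so the last subtask of a task still fits in a buffer.
    subtask_size = -(-task_size // num_subtasks_per_task)
    full = nthreads * task_size
    double_buffered = 2 * nthreads * subtask_size
    return full, double_buffered


# Assumed values: 4 MiB tasks split into 4 subtasks; 72 threads as in the benchmark.
full, double_buffered = pinned_pool_bytes(
    nthreads=72, task_size=4 * MiB, num_subtasks_per_task=4
)
print(f"nthreads * task_size        = {full // MiB} MiB")
print(f"2 * nthreads * subtask_size = {double_buffered // MiB} MiB")
```

Under these assumptions the full-task block is 288 MiB and the double-buffered one is 144 MiB. Double-buffering only saves memory when a task is split into more than two subtasks; with exactly two the footprints are equal.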
The performance improvement is marginal and is outweighed by the increased implementation complexity. May revisit at a later time.