kvikio Improve parallel POSIX read performance

Not for review

Investigating https://github.com/rapidsai/kvikio/issues/629

Number of subtasks per task

KVIKIO_NUM_SUBTASKS_PER_TASK

Feb 22 '25 04:02 kingcrimsontianyu

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Feb 22 '25 04:02 copy-pr-bot[bot]

(Outdated) Performance checkpoint 926fbb8

branch-25.04, 72 threads, GH200: This figure shows the task profiles from 4 of the 72 threads. The first task takes 85 ms to complete. cuMemHostAlloc takes 1 ms per thread.
This PR, 72 threads, GH200: The first task now takes 79 ms. cuMemHostAlloc takes 27 ms per thread.

Conclusion: a small reduction in the latency spike, from 85 ms to 79 ms. It may be worth having a page-locked memory pool, or at least pre-allocating nthreads * task_size page-locked memory block.
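To make the pool idea concrete, here is a minimal sketch of a fixed-size buffer pool that is allocated once up front and reused across tasks. It is not code from this PR: the class name `BounceBufferPool` and its methods are hypothetical, and ordinary heap memory stands in for page-locked memory so that the sketch compiles without CUDA. A real version would presumably allocate the buffers with cuMemHostAlloc instead.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical pool of equally sized bounce buffers, allocated once in the
// constructor. std::vector<char> is a stand-in for page-locked memory
// (assumption: a real pool would call cuMemHostAlloc here instead).
class BounceBufferPool {
 public:
  BounceBufferPool(std::size_t num_buffers, std::size_t buffer_size)
    : storage_(num_buffers, std::vector<char>(buffer_size))
  {
    free_.reserve(num_buffers);
    for (auto& buf : storage_) { free_.push_back(buf.data()); }
  }

  // Blocks until a buffer is free, then hands it out.
  char* acquire()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !free_.empty(); });
    char* p = free_.back();
    free_.pop_back();
    return p;
  }

  // Returns a buffer to the pool and wakes one waiting thread.
  void release(char* p)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      free_.push_back(p);
    }
    cv_.notify_one();
  }

  // Number of buffers currently available.
  std::size_t available() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return free_.size();
  }

 private:
  std::vector<std::vector<char>> storage_;  // owns the memory
  std::vector<char*> free_;                 // buffers not currently in use
  mutable std::mutex mutex_;
  std::condition_variable cv_;
};
```

With such a pool, each worker thread would call `acquire()` before a read and `release()` afterwards, so the allocation cost is paid once at startup rather than inside the first task of every thread.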

Feb 22 '25 04:02 kingcrimsontianyu

It may be worth having a page-locked memory pool, or at least pre-allocating nthreads * task_size page-locked memory block.

How much memory would be required for this? Can we get away with 2*nthreads*subtask_size, with some form of double-buffering?
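As a rough illustration of the two footprints, the sketch below compares them for 72 threads. The 4 MiB task size and the 8 subtasks per task are assumed values chosen for the example, not numbers taken from this thread, and the helper function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t MiB = 1024 * 1024;

// One task-sized page-locked buffer per thread: nthreads * task_size.
constexpr std::size_t per_task_footprint(std::size_t nthreads, std::size_t task_size)
{
  return nthreads * task_size;
}

// Two subtask-sized buffers per thread (double-buffering):
// 2 * nthreads * subtask_size, where subtask_size = task_size / num_subtasks.
constexpr std::size_t double_buffer_footprint(std::size_t nthreads,
                                              std::size_t task_size,
                                              std::size_t num_subtasks_per_task)
{
  return 2 * nthreads * (task_size / num_subtasks_per_task);
}

// Assumed values: 72 threads, 4 MiB tasks, 8 subtasks per task.
static_assert(per_task_footprint(72, 4 * MiB) == 288 * MiB, "72 * 4 MiB");
static_assert(double_buffer_footprint(72, 4 * MiB, 8) == 72 * MiB, "2 * 72 * 512 KiB");
```

Under these assumptions the double-buffered layout needs a quarter of the memory. In general the ratio is 2 / num_subtasks_per_task, so double-buffering only saves memory when a task is split into more than two subtasks.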

Mar 14 '25 02:03 vuule

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Mar 19 '25 13:03 copy-pr-bot[bot]

The performance improvement is marginal and is outweighed by the increased implementation complexity. We may revisit this at a later time.

May 28 '25 04:05 kingcrimsontianyu