Improve parallel POSIX read performance

Open kingcrimsontianyu opened this issue 1 year ago • 4 comments

Not for review

Investigating https://github.com/rapidsai/kvikio/issues/629

Number of subtasks per task

KVIKIO_NUM_SUBTASKS_PER_TASK
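
As a rough illustration of what a subtasks-per-task setting could control, the sketch below splits one task's byte range into near-equal subtasks. This is an assumption about the intent, not the PR's actual code: `split_into_subtasks` is a hypothetical helper, and the fallback value of 4 is made up for the example.

```python
import os


def split_into_subtasks(offset, size, num_subtasks):
    """Split the byte range [offset, offset + size) into at most
    num_subtasks contiguous (offset, size) pieces of near-equal size."""
    base, remainder = divmod(size, num_subtasks)
    pieces = []
    current = offset
    for i in range(num_subtasks):
        # The first `remainder` pieces get one extra byte each.
        piece_size = base + (1 if i < remainder else 0)
        if piece_size == 0:
            break  # fewer bytes than subtasks: stop early
        pieces.append((current, piece_size))
        current += piece_size
    return pieces


# Hypothetical: read the setting from the environment, defaulting to 4.
num_subtasks = int(os.environ.get("KVIKIO_NUM_SUBTASKS_PER_TASK", "4"))

task_size = 4 * 1024 * 1024  # illustrative task size in bytes
subtasks = split_into_subtasks(0, task_size, num_subtasks)
```

Each piece could then be read into a smaller bounce buffer than a whole task would need.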

kingcrimsontianyu avatar Feb 22 '25 04:02 kingcrimsontianyu

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Feb 22 '25 04:02 copy-pr-bot[bot]

(Outdated) Performance checkpoint 926fbb8

  • branch-25.04, 72 threads, GH200: The figure shows the task profiles of 4 of the 72 threads. The first task takes 85 ms to complete. cuMemHostAlloc takes 1 ms per thread. [image]

  • This PR, 72 threads, GH200: The first task now takes 79 ms. cuMemHostAlloc takes 27 ms per thread. [image]

Conclusion: a small reduction in the latency spike, from 85 ms to 79 ms. It may be worth having a page-locked memory pool, or at least pre-allocating a page-locked memory block of size nthreads * task_size.
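
A minimal sketch of the pooling idea, written in Python for brevity. It is not kvikio code: `BufferPool` and `run_task` are hypothetical names, and `bytearray` stands in for page-locked memory, which a real implementation would obtain from cuMemHostAlloc. The point is only that the allocation cost is paid once up front rather than inside each thread's first task.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class BufferPool:
    """A fixed set of reusable bounce buffers, all allocated up front."""

    def __init__(self, num_buffers, buffer_size, allocate=bytearray):
        self.alloc_count = 0
        self._free = queue.Queue()
        for _ in range(num_buffers):
            # Stand-in for a page-locked allocation (assumption).
            self._free.put(allocate(buffer_size))
            self.alloc_count += 1

    def acquire(self):
        # Blocks until some buffer has been returned to the pool.
        return self._free.get()

    def release(self, buf):
        self._free.put(buf)

    def free_count(self):
        return self._free.qsize()


def run_task(pool, task_id):
    """Borrow a buffer, pretend to read into it, and give it back."""
    buf = pool.acquire()
    try:
        buf[0] = task_id % 256  # placeholder for the POSIX read
        return len(buf)
    finally:
        pool.release(buf)


nthreads = 4      # illustrative; the measurements above used 72
task_size = 1024  # illustrative buffer size in bytes

pool = BufferPool(nthreads, task_size)
with ThreadPoolExecutor(max_workers=nthreads) as executor:
    sizes = list(executor.map(lambda i: run_task(pool, i), range(32)))
```

With one buffer per worker thread, `acquire` never has to wait; a smaller pool would trade memory for occasional blocking.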

kingcrimsontianyu avatar Feb 22 '25 04:02 kingcrimsontianyu

It may be worth having a page-locked memory pool, or at least pre-allocating nthreads * task_size page-locked memory block.

How much memory would be required for this? Can we get away with 2*nthreads*subtask_size, using some form of double-buffering?
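
To put rough numbers on the two options, here is a small worked calculation. Only the thread count (72) comes from the measurements above; the 4 MiB task size and the 4 subtasks per task are assumed values chosen for illustration.

```python
MiB = 1024 * 1024


def pinned_bytes_full_task(nthreads, task_size):
    """One task-sized page-locked buffer per thread."""
    return nthreads * task_size


def pinned_bytes_double_buffered(nthreads, subtask_size):
    """Two subtask-sized page-locked buffers per thread."""
    return 2 * nthreads * subtask_size


nthreads = 72              # from the measurements in this thread
task_size = 4 * MiB        # assumed for illustration
num_subtasks_per_task = 4  # assumed for illustration
subtask_size = task_size // num_subtasks_per_task

full = pinned_bytes_full_task(nthreads, task_size)
double = pinned_bytes_double_buffered(nthreads, subtask_size)

print(f"nthreads * task_size        = {full // MiB} MiB")
print(f"2 * nthreads * subtask_size = {double // MiB} MiB")
```

Under these assumptions the double-buffered scheme needs less page-locked memory whenever a task is split into more than two subtasks.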

vuule avatar Mar 14 '25 02:03 vuule

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Mar 19 '25 13:03 copy-pr-bot[bot]

Performance improvement is marginal, outweighed by the increased implementation complexity. May revisit at a later time.

kingcrimsontianyu avatar May 28 '25 04:05 kingcrimsontianyu