
hydra: GPU visibility control revamp

Open yfguo opened this issue 3 years ago • 4 comments

Pull Request Description

  1. Add interfaces in MPL for querying the GPU device list and subdevice list. MPL returns an array of integers, each representing an individual GPU device or subdevice.
  2. Implement these interfaces for ZE, CUDA, and HIP.
  3. Add the hydra option -gpu-subdevs-per-proc to allow GPU visibility to be controlled at the subdevice level.
  4. Update hydra's round-robin GPU assignment algorithm for visibility control.
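
To make item 4 concrete, here is a minimal sketch of round-robin visibility assignment. It is not hydra's actual code: the function name `assign_round_robin` and its parameters are hypothetical, and it assumes the device list is already available as a flat list of integer IDs, as item 1 describes.

```python
def assign_round_robin(device_ids, num_procs, per_proc):
    """Return, for each local process rank, the device IDs it may see.

    Hypothetical helper for illustration only; hydra's real algorithm
    may differ (for example in how it handles oversubscription).
    """
    if not device_ids or per_proc <= 0:
        return [[] for _ in range(num_procs)]

    total = len(device_ids)
    # A process cannot see more distinct devices than exist.
    count = min(per_proc, total)

    assignments = []
    cursor = 0
    for _ in range(num_procs):
        # Take `count` consecutive devices, wrapping around the list.
        visible = [device_ids[(cursor + k) % total] for k in range(count)]
        assignments.append(visible)
        cursor = (cursor + count) % total
    return assignments


if __name__ == "__main__":
    # 4 GPUs, 4 local processes, 2 GPUs visible per process.
    for rank, gpus in enumerate(assign_round_robin([0, 1, 2, 3], 4, 2)):
        print(rank, gpus)
```

When there are more processes than device groups, the cursor wraps and later ranks share devices with earlier ones, which is the usual meaning of round-robin in this context.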

Author Checklist

  • [ ] Provide Description: Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [ ] Commits Follow Good Practice: Commits are self-contained and do not do two things at once. Commit message is of the form "module: short description". Commit message explains what's in the commit.
  • [ ] Passes All Tests: Whitespace checker. Warnings test. Additional tests via comments.
  • [ ] Contribution Agreement: For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

yfguo avatar Feb 25 '22 19:02 yfguo

(from Yanfei)

Overview

  • CPU Affinity
    • Processes bind to CPU core(s)
      • mpiexec -bind-to core:4 //bind each proc to 4 cores, rr
    • Different binding policies
      • mpiexec -bind-to gpu //bind each proc to cores closest to gpu
    • GPU locality focused policy
      • mpiexec -bind-to gpu:2
  • GPU Visibility
    • Which GPUs are visible to which processes
    • Does not imply CPU affinity by default
      • mpiexec -gpus-per-proc 2 //each proc sees 2 GPUs, rr
  • GPU sub-device extension
    • mpiexec -gpus-per-proc 1 -gpu-assign-subdevice // 1 tile visible to each proc
    • mpiexec -bind-to gpu-subdev{<id>|:n}
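
As a rough illustration of the sub-device extension above, the sketch below gives each process one tile and formats it in the `device.subdevice` style accepted by Level Zero's `ZE_AFFINITY_MASK` environment variable. Everything else is an assumption: the helper names are hypothetical, the `(device, subdevice)` pair representation is chosen for readability (the PR describes integer arrays from MPL), and whether hydra applies visibility through this variable is not stated in this thread.

```python
def list_subdevices(subdevs_per_device):
    """Flatten {device_id: subdevice_count} into (device, subdevice) pairs.

    Hypothetical representation used only for this sketch.
    """
    return [
        (dev, sub)
        for dev, count in sorted(subdevs_per_device.items())
        for sub in range(count)
    ]


def affinity_mask(pairs):
    """Format pairs as a comma-separated 'device.subdevice' mask string."""
    return ",".join(f"{dev}.{sub}" for dev, sub in pairs)


def one_tile_per_proc(subdevs_per_device, num_procs):
    """Assign one tile to each local rank, round-robin over all tiles."""
    tiles = list_subdevices(subdevs_per_device)
    if not tiles:
        return ["" for _ in range(num_procs)]
    return [affinity_mask([tiles[rank % len(tiles)]]) for rank in range(num_procs)]


if __name__ == "__main__":
    # Two GPUs with two tiles each, four local processes.
    for rank, mask in enumerate(one_tile_per_proc({0: 2, 1: 2}, 4)):
        print(f"rank {rank}: mask={mask}")
```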

hzhou avatar Jul 23 '22 15:07 hzhou

(from Yanfei) [three images]

hzhou avatar Jul 23 '22 15:07 hzhou

@yfguo Do all the examples work with this PR?

hzhou avatar Jul 23 '22 15:07 hzhou

test:mpich/ch4/ofi

yfguo avatar Jul 28 '22 18:07 yfguo

test:mpich/ch4/gpu

yfguo avatar Sep 23 '22 16:09 yfguo

test:mpich/ch4/gpu

yfguo avatar Sep 23 '22 16:09 yfguo

test:mpich/ch4/gpu

yfguo avatar Sep 23 '22 20:09 yfguo

test:mpich/ch4/most

yfguo avatar Sep 23 '22 21:09 yfguo

GPU failures seem unrelated.

yfguo avatar Sep 24 '22 04:09 yfguo