Adding the XCCL DPU team, and DPU daemon

Open janjust opened this issue 4 years ago • 4 comments

This PR adds the new DPU team as well as a contrib directory with the accompanying DPU daemon app.

This is a first but comprehensive attempt that successfully runs the PyTorch param-comms benchmark. Tested on 32 BlueField-enabled nodes.

There are several configuration options to keep in mind when running.

New configure option: --with-dpu=yes

Client/host side:
Two new flags and an additional dpu value for the TLS parameter:
-x TORCH_UCC_TLS=dpu
-x XCCL_TEAM_DPU_ENABLE=1
-x XCCL_TEAM_DPU_HOST_DPU_LIST=

The host_dpu_list file is a one-to-one host mapping file that the DPU team uses to identify the IP address of each host's DPU.
e.g.:
host1 dpu1
host2 dpu2
etc.
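
To make the mapping format concrete, here is a minimal Python sketch that reads a file in the `host dpu` layout shown above. It is only an illustration: the function name `parse_host_dpu_list` is hypothetical, the whitespace-separated two-column format is inferred from the example, and the actual parsing inside the DPU team may behave differently.

```python
def parse_host_dpu_list(text):
    """Parse 'host dpu' pairs, one pair per line, into a dict.

    Assumed format (inferred from the example in the PR description):
    two whitespace-separated fields per line, blank lines ignored.
    """
    mapping = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 'host dpu', got {raw!r}")
        host, dpu = fields
        if host in mapping:
            raise ValueError(f"line {lineno}: duplicate host {host!r}")
        mapping[host] = dpu
    # The mapping is described as one-to-one, so no DPU may serve two hosts.
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("a DPU is listed for more than one host")
    return mapping


if __name__ == "__main__":
    example = "host1 dpu1\nhost2 dpu2\n"
    for host, dpu in parse_host_dpu_list(example).items():
        print(f"{host} -> {dpu}")
```

Rejecting duplicates up front is a design choice of this sketch, so that a malformed file fails early rather than silently pairing two hosts with the same DPU.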

dpu side:
-x DPU_DATA_BUFFER_SIZE=$((16 * 1024 * 1024))
An environment variable that sets the buffer size available on the DPU.
If not provided, the default is 16 MB.
./dpu_server <threads (int)>
By default it uses a single thread.
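
The buffer-size setting can be illustrated with a short Python sketch of the lookup-with-default logic. This is not the daemon's code: the helper name `dpu_buffer_size` is hypothetical, and treating an empty value as "not provided" and rejecting non-positive sizes are assumptions of this sketch.

```python
import os

# 16 MB, the default stated in the PR description: 16 * 1024 * 1024 bytes.
DEFAULT_DPU_BUFFER_SIZE = 16 * 1024 * 1024


def dpu_buffer_size(env=None):
    """Return the DPU buffer size in bytes from DPU_DATA_BUFFER_SIZE.

    Falls back to the 16 MB default when the variable is unset or empty
    (the empty-string handling is an assumption of this sketch).
    """
    env = os.environ if env is None else env
    raw = env.get("DPU_DATA_BUFFER_SIZE")
    if not raw:
        return DEFAULT_DPU_BUFFER_SIZE
    size = int(raw)  # raises ValueError on non-numeric input
    if size <= 0:
        raise ValueError("DPU_DATA_BUFFER_SIZE must be a positive byte count")
    return size


if __name__ == "__main__":
    print(dpu_buffer_size({}))
    print(dpu_buffer_size({"DPU_DATA_BUFFER_SIZE": str(32 * 1024 * 1024)}))
```

Note that the shell expression `$((16 * 1024 * 1024))` is expanded before the daemon starts, so the variable holds a plain byte count.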

e.g.:
mpirun -np 4 --map-by ppr:1:node -x UCX_NET_DEVICES=mlx5_0:1 -x XCCL_TEST_TLS=ucx --bind-to none --report-bindings --tag-output -hostfile file.dpus -x LD_LIBRARY_PATH ./dpu_server 4
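
For scripted runs, the launch line above could be assembled from a few parameters. The Python sketch below does that; the helper `dpu_server_command` is hypothetical, every flag is copied from the example command, and it assumes one daemon process per DPU node, so `-np` equals the number of nodes in the hostfile.

```python
import shlex


def dpu_server_command(num_nodes, hostfile, threads=1, net_device="mlx5_0:1"):
    """Build the mpirun argv for the DPU daemon, mirroring the example above.

    Assumption: one daemon per node (ppr:1:node), so -np is the node count.
    threads defaults to 1, matching the daemon's stated single-thread default.
    """
    return [
        "mpirun", "-np", str(num_nodes),
        "--map-by", "ppr:1:node",
        "-x", f"UCX_NET_DEVICES={net_device}",
        "-x", "XCCL_TEST_TLS=ucx",
        "--bind-to", "none",
        "--report-bindings", "--tag-output",
        "-hostfile", hostfile,
        "-x", "LD_LIBRARY_PATH",
        "./dpu_server", str(threads),
    ]


if __name__ == "__main__":
    # Reproduces the 4-node, 4-thread example from the PR description.
    print(shlex.join(dpu_server_command(4, "file.dpus", threads=4)))
```

Returning a list rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting concerns.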

Signed-off-by: Tomislavj Janjusic [email protected]

Co-authored-by: Artem Polyakov [email protected] Sergey Lebedev [email protected]

janjust, Jan 06 '21 23:01

@manjugv @Sergei-Lebedev @vspetrov Hey guys - this is the PR that adds the DPU team, developed during the hackathon by @artpol84, @Sergei-Lebedev, and me.

It's the first attempt that successfully runs, but it obviously needs strong vetting. We did preliminary data checks with the xccl allreduce tests, which seem to pass, and it successfully runs the PyTorch param/comms bench.

janjust, Jan 06 '21 23:01

@janjust please change the commit message as follows:

Co-authored-by: Artem Polyakov <[email protected]>
Co-authored-by: Sergey Lebedev <[email protected]>

Per https://docs.github.com/en/free-pro-team@latest/github/committing-changes-to-your-project/creating-a-commit-with-multiple-authors

artpol84, Jan 07 '21 00:01

I tried it out of curiosity and it works as expected: https://github.com/artpol84/xccl/commit/91a6466ba984109480c51fc7125559fdcc0b97d6

artpol84, Jan 07 '21 00:01

> @janjust please change the commit message as follows:
>
> Co-authored-by: Artem Polyakov <[email protected]>
> Co-authored-by: Sergey Lebedev <[email protected]>
>
> Per https://docs.github.com/en/free-pro-team@latest/github/committing-changes-to-your-project/creating-a-commit-with-multiple-authors

done

janjust, Jan 07 '21 14:01