Ma, Guokai
@loadams I ran these two tests in my local environment, and they didn't take that long. Can you help run this workflow again to see whether the issue is reproducible? Thanks!
Hi @loadams, I tried running these UTs in my environment and didn't see this timeout. Since the CPU UTs are already covered by the `cpu-torch-latest` workflow, I removed the unit tests in this...
Hi @xuanhua, from this line it looks like the default launcher is used. Can you try the `impi` launcher with the following?
```
deepspeed --launcher impi --num_nodes=2 --hostfile=./hostfile_linux pipeline_model.py
```
Hi @xuanhua, this error indicates a connection timeout. Can you confirm whether you have set up passwordless SSH login? https://www.redhat.com/sysadmin/passwordless-ssh
2024:03:27-22:24:25:(86734) |CCL_ERROR| internal_kvs.cpp:529 kvs_init: connection time (131) >= limit (120)
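One way to check passwordless SSH for every node before launching is sketched below. It is only a minimal sketch, assuming a DeepSpeed-style hostfile with one `hostname slots=N` entry per line and `#` comments. The function names are hypothetical and not part of DeepSpeed. `BatchMode=yes` is a standard OpenSSH option that makes `ssh` fail instead of prompting for a password.

```python
import subprocess


def parse_hostfile(text):
    """Return hostnames from hostfile text (assumed format: "host slots=N" per line)."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if line:
            hosts.append(line.split()[0])  # first token is the hostname
    return hosts


def ssh_check_command(host):
    """Build an ssh command that fails instead of prompting for a password."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]


def check_passwordless(hosts):
    """Map each host to True if ssh succeeded without any prompt."""
    results = {}
    for host in hosts:
        proc = subprocess.run(ssh_check_command(host), capture_output=True)
        results[host] = proc.returncode == 0
    return results
```

For example, `check_passwordless(parse_hostfile(open("./hostfile_linux").read()))` run from the launching node would show which hosts still ask for a password or are unreachable.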
Hi @xuanhua, pipeline should work across multiple nodes. My understanding is that if you combine pipeline with ZeRO 1, you will have two-dimensional parallelism. In the first dimension weights and optimization...
@xuanhua I think this error occurs because the oneCCL binding for PyTorch does not support send/recv yet. I think there are two ways around this: 1. Switch to gloo backend by...
Hi @duli2012, thanks for adding this interface. I have always been worried that the accelerator interface may grow too big as we add more and more capabilities to it, this interface is...
Thanks @duli2012, my intuition is that ZeRO 1/2/3 should not be in the accelerator feature list. ZeRO stage code is shared between different accelerators and there is no interface specific to zero...
@jingxu10 I saw you are one of the authors of this file, can you help review this PR? Thanks!
@jingxu10 It looks like there was a random failure in retrieving the PyTorch repo. Let me merge again.