xuanhua

9 comments by xuanhua

`#error "You're trying to build PyTorch with a too old version of GCC. We need GCC 9 or later."` You might need a newer version of gcc.

> Hi @xuanhua From this line it looks like the default launcher is used. Can you try `impi` launcher with the following?
>
> ```
> deepspeed --launcher impi --num_nodes=2...
> ```

> Hi, @xuanhua This error indicates there is a connection timeout. Can you confirm whether you have set up passwordless SSH login? https://www.redhat.com/sysadmin/passwordless-ssh 2024:03:27-22:24:25:(86734) |CCL_ERROR| internal_kvs.cpp:529 kvs_init: connection time (131) >= limit...

Hi @delock, thank you so much for your patience. One more thing to double-check: DeepSpeed's pipeline parallelism can support model training across multiple nodes, right? And it could work...

Hi @delock, I used two Docker containers (built from the Dockerfile provided on DeepSpeed's master branch on GitHub), and they can now communicate with each other over the network. Their...

I have the same issue on my side. Is there any git revision that works? I mean, at least the demo should work.

@duli2012 Hi, I'm not sure whether this pull request meets the project's requirements. Do you have any suggestions on this PR? Looking forward to your reply :)

```text
Error 803: system has unsupported display driver / cuda driver combination
```

It looks like you have a version mismatch between your GPU driver and CUDA, maybe you...

@Armarella Glad to hear that it works on DeepSpeed v0.14.1; I will try this later. And with Docker containers, you could have a scalable training infrastructure :)