xuanhua comments

Results: 9 comments of xuanhua

[BUG] Trying to finetune mistral using deepspeed but running into an error: Error building extension 'cpu_adam'

`#error "You're trying to build PyTorch with a too old version of GCC. We need GCC 9 or later."` You might need a newer version of gcc.

[BUG] Failed for using cpu for pipeline based training across multiple machines (2 machines actually)

> Hi @xuanhua From this line it looks like the default launcher is used. Can you try `impi` launcher with the following? > > ``` > deepspeed --launcher impi --num_nodes=2...

[BUG] Failed for using cpu for pipeline based training across multiple machines (2 machines actually)

> Hi, @xuanhua This error indicates there is a connection timeout. Can you confirm whether you have set up passwordless SSH login? https://www.redhat.com/sysadmin/passwordless-ssh 2024:03:27-22:24:25:(86734) |CCL_ERROR| internal_kvs.cpp:529 kvs_init: connection time (131) >= limit...

[BUG] Failed for using cpu for pipeline based training across multiple machines (2 machines actually)

Hi @delock, thank you so much for your patience. One more thing to double-check: DeepSpeed's pipeline parallelism can support model training across multiple nodes, right? And it could work...

[BUG] Failed for using cpu for pipeline based training across multiple machines (2 machines actually)

Hi @delock, I used two Docker containers (built from the Dockerfile provided on DeepSpeed's master branch on GitHub), and they can now communicate with each other over the network. Their...

[BUG/Help] <Using today's latest model and code, the ptuning v2 model I ran returns empty predictions; how can I fix this?>

Same issue on my side. Is there any git revision that works? I mean, at least the demo should work.

Fix training of pipeline based peft's lora model

@duli2012 Hi, I'm not sure whether this pull request meets the project's requirements. Do you have any suggestions on this PR? Looking forward to your reply :)

Baichuan fine-tune.py bug

```text
Error 803: system has unsupported display driver / cuda driver combination
```
It looks like you have a version mismatch between your GPU driver and CUDA, maybe you...

[BUG] Failed for using cpu for pipeline based training across multiple machines (2 machines actually)

@Armarella Glad to hear that it works on DeepSpeed v0.14.1; I will try this later. And with Docker containers, you could have a scalable training infrastructure :)