Qinlong Wang comments

Results: 55 comments by Qinlong Wang

Why is model_optim_rng.pt saved in a separate directory?

> Just another question: Megatron-LM has supported asynchronous checkpoint saving since v0.7.0. Have you compared dlrover with v0.7.0?

Not yet.

Why is model_optim_rng.pt saved in a separate directory?

> ...ut 50sec. BTW, the memory saving time is also about 50sec when using Megatron-LM's async save. Maybe the bandwidth of my env's disk is...

Yeah, the performance disk may...

When performing multi-node, multi-GPU training with Megatron-LM, if the 'rank' is only passed in the startup script and not set in the environment variables, an exception may occur (storage type is disk)

Do you use storage shared across nodes to save the checkpoint?

Worker pod stuck in Pending state causing TimeoutError and incorrect handling by master

It is not a good idea to delete pending Pods directly. If a Pod is pending because of insufficient resources such as GPU or memory, the relaunched Pod will be pending again. You can set the...

Add socket close v2

> @workingloong **How to supply exception testing?**

You can patch the socket [method](https://github.com/intelligent-machine-learning/dlrover/pull/1168/files#diff-a530d2bc0337355d7e29a93e763c955e68b969437f8ac4df30bfca2d70cf6f88R68) in your test cases to raise an OSError, as in https://github.com/intelligent-machine-learning/dlrover/blob/0c18cc5c82c9500de5a7c1b3e5b0f330a6a52aed/dlrover/python/tests/test_elastic_training_agent.py#L202-L207
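The patching approach suggested in this reply can be sketched roughly as follows. This is only an illustration and not the code from the linked PR or test file: `close_quietly` and `CloseQuietlyTest` are hypothetical names, and the sketch assumes the code under test wraps `socket.close()` in a `try`/`except OSError`.

```python
import socket
import unittest
from unittest import mock


def close_quietly(sock):
    """Hypothetical helper: close a socket and report whether it succeeded."""
    try:
        sock.close()
        return True
    except OSError:
        # The failure is swallowed so that cleanup code can keep going.
        return False


class CloseQuietlyTest(unittest.TestCase):
    def test_close_failure_is_handled(self):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            # Patch close() on the class so every call raises OSError
            # while the patch is active.
            with mock.patch.object(
                socket.socket, "close", side_effect=OSError("forced failure")
            ):
                self.assertFalse(close_quietly(sock))
        finally:
            # The patch has been undone here, so the real close() runs.
            sock.close()
```

The patch is applied to the `socket.socket` class rather than to the instance because socket objects define `__slots__` and do not accept new instance attributes. A test like this can be run with `python -m unittest`.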


Why is model_optim_rng.pt saved in a separate directory?

When performing multi-node, multi-GPU training with Megatron-LM, if the 'rank' is only passed in the startup script and not set in the environment variables, an exception may occur (storage type is disk)

Worker pod stuck in Pending state causing TimeoutError and incorrect handling by master

Add socket close v2

Nodes restart repeatedly when deploying elasticjob-controller

The controller manager restarts frequently