Qinlong Wang
1. The nodes do communicate with each other, so the measured times are basically the same, but there are small differences: the GEMM time may differ across nodes, and CUDA launch overhead also varies a little. 2. The check cannot identify a specific faulty NIC; it can only locate which node has the problem.
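For reference, here is a minimal sketch of this kind of per-node benchmark, assuming a plain `torch.distributed` setup; the matrix size, iteration count, and function name are illustrative and not DLRover's actual network-check implementation.
```python
# Hedged sketch: time a GEMM plus an allreduce on every rank so that per-node
# differences (GEMM speed, CUDA launch overhead, network links) become visible.
# Assumes torch.distributed is already initialized, e.g. via dlrover-run/torchrun.
import time
import torch
import torch.distributed as dist

def bm_allreduce(size=4096, iters=10):
    device = torch.device("cuda", torch.cuda.current_device())
    x = torch.randn(size, size, device=device)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        y = torch.matmul(x, x)   # GEMM: compute time can differ per GPU/node
        dist.all_reduce(y)       # allreduce: exposes slow nodes or links
    torch.cuda.synchronize()
    print(f"rank {dist.get_rank()}: {time.time() - start:.3f}s for {iters} iterations")
```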
The difference should be small; it's certainly impossible for them to be exactly identical. Here is our test on two nodes, with 2 ranks per node.

worker-0
```
[2024-05-23 17:09:10,658] [INFO] [utils.py:41:wrapper] Time to execute bm_allreduce on local rank 1 is 5.369s.
[2024-05-23 17:09:10,659] [INFO] [utils.py:41:wrapper] Time to execute bm_allreduce on local rank...
```
I have already fixed this example in PR #1141. You can follow these steps:
```
kubectl -n dlrover apply -f examples/tensorflow/criteo_deeprec/manual_job.yaml
```
This job will have the following Pods:
```
NAME                                  READY   STATUS    RESTARTS   AGE
deepctr-manual-scale-edljob-chief-0   1/1     Running   0          117s
deepctr-manual-scale-edljob-ps-0      1/1     Running   0...
```
> Same question: when parallel strategies such as TP, PP, and DeepSpeed+ZeRO are set up, can elastic scaling be combined to recover from network problems, GPU failures, and node anomalies?

DLRover's fault tolerance is based on torchelastic's scheme of restarting the training subprocesses, so in theory it can recover as long as a checkpoint exists (see the sketch after this list). For a specific parallel scheme, whether recovery works depends only on whether the number of subprocesses after the restart equals the number before the failure, i.e. whether the global world size changes.
- If a failure is detected, e.g. an NCCL timeout caused by the network, and the number of available nodes is unchanged, any parallel strategy can recover; this is no different from restarting the training manually.
- If a node fails but the cluster still has spare nodes available, DLRover's ElasticJob can launch a new Pod to replace the Pod on the failed node, and then it is the same as the case above.
-...
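As a minimal sketch of the recovery path described above, assuming a generic PyTorch training loop: after torchelastic restarts the subprocesses, every rank simply reloads the latest checkpoint at startup and continues from the saved step. `build_model_and_optimizer` and `train_step` below are hypothetical placeholders, not DLRover APIs.
```python
# Hedged sketch: resume-on-restart pattern behind torchelastic-style fault
# tolerance. The helper functions are hypothetical placeholders.
import os
import torch

def train(ckpt_path="checkpoints/latest.pt", total_steps=10000):
    model, optimizer = build_model_and_optimizer()  # hypothetical helper
    start_step = 0
    if os.path.exists(ckpt_path):
        # After a restart, each rank reloads the latest checkpoint and resumes,
        # so only the progress since the last checkpoint is lost.
        state = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_step = state["step"] + 1
    for step in range(start_step, total_steps):
        train_step(model, optimizer, step)  # hypothetical helper
```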
I have successfully tested the flash checkpoint with 4 A100 nodes using the following command in the [forked repo](https://github.com/workingloong/Megatron-LM-CKPT), whose commit id is [cb995d5](https://github.com/NVIDIA/Megatron-LM/tree/cb995d571faea19d01a1bf55ed0fd89523b9ce64).
```
dlrover-run --max-restarts=2 --nnodes=$NNODES --nproc_per_node=$GPUS_PER_NODE pretrain_gpt.py...
```
dlrover[torch]==0.3.7. I can reproduce the issue when I do not use `--use-distributed-optimizer`.
You can check whether the other ranks have a non-empty state dict when calling `save_checkpoint`.
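For example, a minimal sketch of such a check, assuming `torch.distributed` is initialized; the helper name and output format are illustrative.
```python
# Hedged sketch: log the size of the state dict on every rank right before
# save_checkpoint is called, to spot ranks that pass an empty dict.
import torch.distributed as dist

def report_state_dict(state_dict):
    rank = dist.get_rank()
    num_keys = len(state_dict) if state_dict else 0
    print(f"[rank {rank}] state dict has {num_keys} top-level keys")
```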
You can run `cd dlrover/go/operator` and `make docker-build docker-push IMG=<registry>/operator:<tag>` to build the image by yourself. https://github.com/intelligent-machine-learning/dlrover/blob/master/dlrover/go/operator/README.md
The flash checkpoint in DLRover saves and loads the distributed optimizer checkpoint of Megatron-LM in parallel. That is, each rank saves and loads its own shard of optimizer states into...
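As a rough illustration of that per-rank parallelism, here is a minimal sketch assuming a hypothetical one-file-per-rank layout, not the actual Megatron-LM/DLRover checkpoint format:
```python
# Hedged sketch: each rank persists only the optimizer shard it owns, so all
# ranks write and read their files in parallel. The file layout is illustrative.
import os
import torch
import torch.distributed as dist

def save_optimizer_shard(optimizer, ckpt_dir):
    shard_path = os.path.join(ckpt_dir, f"optim_rank_{dist.get_rank():05d}.pt")
    torch.save(optimizer.state_dict(), shard_path)

def load_optimizer_shard(optimizer, ckpt_dir):
    shard_path = os.path.join(ckpt_dir, f"optim_rank_{dist.get_rank():05d}.pt")
    optimizer.load_state_dict(torch.load(shard_path, map_location="cpu"))
```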
> @workingloong Thanks for your quick reply. I got it.
>
> I tried benchmarking dlrover and found `save_to_memory` costs ~55 sec. Is it normal? From the [blogs](https://github.com/intelligent-machine-learning/dlrover/blob/master/docs/blogs/megatron_flash_checkpoint.md#experiment-using-gpt-15b-on-multi-node-multi-gpu) the cost of...