Qinlong Wang
I have submitted PR #1026 to show how to implement support for a new processor in DLRover.
You need to format your commits to pass the atorch-pre-commit check.
`make deploy` cannot download Go modules because the network cannot reach `sigs.k8s.io`.
You should execute `kubectl -n dlrover apply -f dlrover/go/operator/config/manifests/bases/default-role.yaml` to grant permission for the DLRover master to access CRDs.
The example to fine-tune llama2 is available in #1067.
Could you retry it with `dlrover[torch]==0.3.7`? We have fixed some bugs for Megatron-LM since 0.3.5, so the issue may already be resolved.
You can test it with the repo https://github.com/workingloong/Megatron-LM-CKPT, which was forked from Megatron-LM in 2024.02.
I have added an example to fine-tune llama2 with the Hugging Face trainer in PR #782.
We can support the `Trainer` in Lightning and implement a Lightning [callback](https://github.com/Lightning-AI/lightning/blob/master/src/lightning/pytorch/callbacks/callback.py).
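A minimal sketch of what such a callback could look like. The hook name `on_train_batch_end` follows Lightning's `Callback` API, but the class below is a self-contained stand-in (it does not import Lightning), and any reporting to the DLRover master is only indicated in comments; the class and attribute names are hypothetical.

```python
class ElasticStepCallback:
    """Hypothetical sketch of a Lightning-style callback for DLRover.

    In real code this would subclass
    ``lightning.pytorch.callbacks.Callback`` and be passed to the
    ``Trainer`` via ``callbacks=[...]``.
    """

    def __init__(self):
        self.completed_steps = 0

    def on_train_batch_end(self, trainer=None, pl_module=None,
                           outputs=None, batch=None, batch_idx=0):
        # Count finished training steps; a real implementation would
        # report progress to the DLRover job master here so it can
        # make elasticity decisions.
        self.completed_steps += 1


cb = ElasticStepCallback()
for step in range(3):
    cb.on_train_batch_end(batch_idx=step)
print(cb.completed_steps)  # 3
```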
We use pairwise grouping across nodes for the test. For example, suppose a job has 6 nodes, {0, 1, 2, 3, 4, 5}. First, the DLRover elasticjob master splits the 6 nodes into 3 groups of two nodes each, i.e. [{0,1}, {2,3}, {4,5}], and the nodes within each group run GEMM and allreduce. Every node collects its elapsed time and reports it to the elasticjob master. For example, if the per-node elapsed times are [5.1, 5.2, 5.6, 5.9, 10.2, 10.3] (in seconds), the two nodes within each group have similar elapsed times, but elapsed times across different groups can differ greatly. In the example above, nodes 4 and 5 are likely the slow nodes. Then ...
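The grouping and slow-node detection described above can be sketched as follows. The function names and the 1.5x cutoff ratio are assumptions for illustration, not DLRover's actual API:

```python
def group_pairs(nodes):
    """Split the node list into groups of two, as the elasticjob
    master does before running GEMM/allreduce within each pair."""
    return [nodes[i:i + 2] for i in range(0, len(nodes), 2)]


def find_slow_nodes(elapsed, ratio=1.5):
    """Flag nodes whose reported elapsed time exceeds ``ratio`` times
    the fastest node's time (the cutoff is an assumed heuristic)."""
    fastest = min(elapsed.values())
    return sorted(n for n, t in elapsed.items() if t > ratio * fastest)


pairs = group_pairs([0, 1, 2, 3, 4, 5])
times = {0: 5.1, 1: 5.2, 2: 5.6, 3: 5.9, 4: 10.2, 5: 10.3}
print(pairs)                   # [[0, 1], [2, 3], [4, 5]]
print(find_slow_nodes(times))  # [4, 5]
```

With the example times from above, nodes 4 and 5 exceed 1.5x the fastest time (5.1 s) and are flagged as slow.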