Qinlong Wang
I have submitted PR #1026 to show how to implement support for a new processor in DLRover.
You need to format your commits to pass the atorch-pre-commit check.
`make deploy` cannot download Go modules because the network cannot reach `sigs.k8s.io`.
You should execute `kubectl -n dlrover apply -f dlrover/go/operator/config/manifests/bases/default-role.yaml` to grant permission for the DLRover master to access CRDs.
The example to fine-tune llama2 is available in #1067.
Could you retry it with `dlrover[torch]==0.3.7`? We have fixed some bugs for Megatron-LM since 0.3.5, so the issue may already be resolved.
You can test it with the repo https://github.com/workingloong/Megatron-LM-CKPT, which was forked from Megatron-LM in 2024.02.
I have added an example to fine-tune llama2 with the Hugging Face trainer in PR #782.
We can support the `Trainer` in Lightning and implement a Lightning [callback](https://github.com/Lightning-AI/lightning/blob/master/src/lightning/pytorch/callbacks/callback.py).
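A minimal sketch of what such a callback could look like. The hook name `on_train_batch_end` follows Lightning's `Callback` API, but the class below is a self-contained stand-in (it does not import Lightning), and any reporting to the DLRover master is only indicated in comments; the class and attribute names are hypothetical.

```python
class ElasticStepCallback:
    """Hypothetical sketch of a Lightning-style callback for DLRover.

    In real code this would subclass
    ``lightning.pytorch.callbacks.Callback`` and be passed to the
    ``Trainer`` via ``callbacks=[...]``.
    """

    def __init__(self):
        self.completed_steps = 0

    def on_train_batch_end(self, trainer=None, pl_module=None,
                           outputs=None, batch=None, batch_idx=0):
        # Count finished training steps; a real implementation would
        # report progress to the DLRover job master here so it can
        # make elasticity decisions.
        self.completed_steps += 1


cb = ElasticStepCallback()
for step in range(3):
    cb.on_train_batch_end(batch_idx=step)
print(cb.completed_steps)  # 3
```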
We use pairwise grouping across nodes for the test. For example, suppose a job has 6 nodes, {0, 1, 2, 3, 4, 5}. First, the DLRover elasticjob master splits the 6 nodes into 3 groups of two nodes each, i.e. [{0,1}, {2,3}, {4,5}], and the nodes within each group run GEMM and allreduce. Every node collects its elapsed time and reports it to the elasticjob master. For example, if the per-node elapsed times are [5.1, 5.2, 5.6, 5.9, 10.2, 10.3] (in seconds), the two nodes within each group have similar elapsed times, but elapsed times across different groups can differ greatly. In the example above, nodes 4 and 5 are likely the slow nodes. Then ...
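The grouping and slow-node detection described above can be sketched as follows. The function names and the 1.5x cutoff ratio are assumptions for illustration, not DLRover's actual API:

```python
def group_pairs(nodes):
    """Split the node list into groups of two, as the elasticjob
    master does before running GEMM/allreduce within each pair."""
    return [nodes[i:i + 2] for i in range(0, len(nodes), 2)]


def find_slow_nodes(elapsed, ratio=1.5):
    """Flag nodes whose reported elapsed time exceeds ``ratio`` times
    the fastest node's time (the cutoff is an assumed heuristic)."""
    fastest = min(elapsed.values())
    return sorted(n for n, t in elapsed.items() if t > ratio * fastest)


pairs = group_pairs([0, 1, 2, 3, 4, 5])
times = {0: 5.1, 1: 5.2, 2: 5.6, 3: 5.9, 4: 10.2, 5: 10.3}
print(pairs)                   # [[0, 1], [2, 3], [4, 5]]
print(find_slow_nodes(times))  # [4, 5]
```

With the example times from above, nodes 4 and 5 exceed 1.5x the fastest time (5.1 s) and are flagged as slow.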