Tianyi Chen

Results 32 comments of Tianyi Chen

Still working on it. Will sync the implementation to the repo after the internal release.

Does this PR still need updates and merging? If so, please reply. This PR will be closed by the end of the month if there is no response. Thanks a...

A blacklist mechanism could be introduced for this case: when the failure comes from user code, throw an explicit error, stop retrying, and free up the resources.
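A rough sketch of that idea, not DLRover's actual implementation. The names `UserCodeError`, `NON_RETRYABLE`, `run_with_retries`, and the `release_resources` callback are all hypothetical, and which exception types count as "user code errors" is an assumption made for illustration.

```python
class UserCodeError(Exception):
    """Hypothetical marker for failures caused by the user's training code."""


# Assumed "blacklist": exception types that a retry cannot fix.
NON_RETRYABLE = (UserCodeError, SyntaxError, ImportError, TypeError)


def run_with_retries(task, release_resources, max_retries=3):
    """Run task(); retry other failures, but fail fast on blacklisted ones.

    max_retries is the number of retries allowed after the first attempt.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task()
        except NON_RETRYABLE as exc:
            # User code bug: free resources and surface an explicit error.
            release_resources()
            raise RuntimeError(
                f"user code error, not retrying: {exc!r}"
            ) from exc
        except Exception:
            # Possibly transient failure: retry until the budget runs out.
            if attempts > max_retries:
                release_resources()
                raise
```

With this shape, a job that hits a bug in the training script fails after a single attempt instead of consuming the whole retry budget while holding its resources.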

[user_code_bug_demo_log.txt](https://github.com/intelligent-machine-learning/dlrover/files/15089846/user_code_bug_demo_log.txt)

> > I add `--use-distributed-optimizer`,get a new error. Env: 4*8=32 a100 gpu, tp2 pp8 > > ``` > > [2024-06-03 08:11:01,842] [INFO] [ckpt_saver.py:892:commit_checkpoint] The number of ready shards is 26...

https://github.com/intelligent-machine-learning/dlrover/issues/1529

You should encapsulate the usage of your CLI within your training script.
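One way to read this suggestion, as a minimal sketch: the training script itself invokes the external command, so the launcher only ever starts the script. The helper `run_cli` and the placeholder command are hypothetical; substitute the real CLI and its arguments.

```python
import subprocess
import sys


def run_cli(args, timeout=60):
    """Run an external command and return its stdout.

    Raises subprocess.CalledProcessError on a non-zero exit code.
    """
    result = subprocess.run(
        args, capture_output=True, text=True, check=True, timeout=timeout
    )
    return result.stdout


def main():
    # Placeholder command standing in for the user's real CLI (assumption).
    output = run_cli([sys.executable, "-c", "print('cli step done')"])
    print(output.strip())
    # The training loop would follow here.


if __name__ == "__main__":
    main()
```

Because `check=True` raises on a non-zero exit code, a failing CLI step surfaces as a failure of the training script itself.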

This part of the implementation is under construction. Please refactor your implementation after the next release (v0.4.0).

Can you provide more information? The more detailed, the better, e.g. the details of the kill: at which checkpoint step did it fail, and which checkpoint step was loaded after failover?