Qinlong Wang
> Just another question: Megatron-LM has supported asynchronous checkpoint saving since v0.7.0. Have you compared between dlrover and v0.7.0?

Not yet.
> ut 50sec. BTW, the memory saving time is also about 50sec when using Megatron-LM's async save. Maybe the bandwidth of my env's disk is

Yeah, the performance disk may...
Do you save the checkpoint to storage that is shared across nodes?
It is not a good idea to delete pending Pods directly. If a Pod is pending because of insufficient resources such as CPU/GPU/memory, the relaunched Pod will be pending again. You can set the...
> @workingloong **How to supply exception testing?**

You can patch the [method](https://github.com/intelligent-machine-learning/dlrover/pull/1168/files#diff-a530d2bc0337355d7e29a93e763c955e68b969437f8ac4df30bfca2d70cf6f88R68) of socket in your test cases to raise an `OSError`, like https://github.com/intelligent-machine-learning/dlrover/blob/0c18cc5c82c9500de5a7c1b3e5b0f330a6a52aed/dlrover/python/tests/test_elastic_training_agent.py#L202-L207
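
For reference, here is a minimal sketch of that kind of test using `unittest.mock`. The helper `find_free_port` and the choice of `socket.socket.bind` as the patched method are assumptions made for illustration; they are not taken from the linked PR, so adapt the target to whichever socket call your code actually makes.

```python
import socket
import unittest
from unittest import mock


def find_free_port(host="127.0.0.1"):
    """Hypothetical helper: ask the OS for a free port, return None on failure."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind((host, 0))  # port 0 lets the OS pick a free port
            return sock.getsockname()[1]
    except OSError:
        return None


class FindFreePortTest(unittest.TestCase):
    def test_bind_failure_is_handled(self):
        # Replace socket.socket.bind so that every call raises OSError,
        # which exercises the exception branch of the helper.
        with mock.patch.object(
            socket.socket, "bind", side_effect=OSError("bind failed")
        ):
            self.assertIsNone(find_free_port())
```

Run it with `python -m unittest <your_test_module>`. The patch is undone when the `with` block exits, so other tests still see the real `bind`.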
It is probably an image problem. You can rebuild a new image yourself; the old image reports this error.
> I would also like to know why the ScalePlan CRD was removed.

Actually, the ScalePlan CRD has never been used.
@BalaBalaYi Could you help update the controller image on Docker Hub? I no longer have permission for the Docker repo.
> During deployment, I found that the node keeps restarting, and the log shows: exec /manager: exec format error. How should I handle this?

Is it still failing to find the scaleplan CRD?
Maybe we need to increase `LeaseDuration` or `RenewDeadline`, as suggested in the issue https://github.com/operator-framework/operator-sdk/issues/1813#issuecomment-523713555