
DLRover: An Automatic Distributed Deep Learning System

Results 50 dlrover issues

### What changes were proposed in this pull request? Add a simple Bayesian optimization implementation based on BoTorch for DLRover Brain. ### Why are the changes needed? Useful for black-box...

**dlrover version: v0.3.5, megatron version: main** **I encountered an error when using flash checkpoint in megatron**: Exception in thread checkpoint-saver: Traceback (most recent call last): File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner self.run()...

The file `docs/tutorial/tf_elasticjob_on_k8s.md` potentially contains a typo. The steps for creating an auto-scaling job are exactly the same as those for creating a manual scaling job.

While executing the tf_elasticjob_on_k8s example, the job fails with an error. ``` cd examples/tensorflow/criteo_deeprec ``` ``` kubectl apply -f autoscale_job.yaml ``` The initial error messages are as follows: ``` [root@localhost test]# kubectl get pods...

test -s /root/dlrover/dlrover/go/operator/bin/controller-gen || GOBIN=/root/dlrover/dlrover/go/operator/bin go install sigs.k8s.io/controller-tools/cmd/controller-gen@v0.9.2 go: sigs.k8s.io/controller-tools/cmd/controller-gen@v0.9.2: sigs.k8s.io/controller-tools/cmd/controller-gen@v0.9.2: Get "https://proxy.golang.org/sigs.k8s.io/controller-tools/cmd/controller-gen/@v/v0.9.2.info": dial tcp 172.217.163.49:443: i/o timeout make: *** [Makefile:127: /root/dlrover/dlrover/go/operator/bin/controller-gen] Error 1

![image](https://github.com/intelligent-machine-learning/dlrover/assets/18071380/f0d5d745-335f-4183-b7d4-5879132c3d8e)

For PyTorch elastic synchronous training jobs, the number of workers is typically set between `min_nodes` and `max_nodes`. If the number of nodes is less than `min_nodes`, the training iteration cannot...
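
The `min_nodes`/`max_nodes` bound described in this issue can be illustrated with a minimal sketch. The function name `plan_world_size` is hypothetical and is not part of DLRover or PyTorch. The sketch also assumes that a job with fewer than `min_nodes` nodes simply waits, since the issue text is truncated at that point.

```python
def plan_world_size(available_nodes: int, min_nodes: int, max_nodes: int) -> int:
    """Return how many nodes to train with, or 0 if training must wait.

    Hypothetical helper for illustration only.
    """
    if min_nodes < 1 or max_nodes < min_nodes:
        raise ValueError("expected 1 <= min_nodes <= max_nodes")
    if available_nodes < min_nodes:
        # Assumption: with too few nodes the job waits instead of training.
        return 0
    # Never use more nodes than the configured upper bound.
    return min(available_nodes, max_nodes)
```

With `min_nodes=2` and `max_nodes=4`, one available node yields 0 (wait), three nodes yield 3, and six nodes are capped at 4.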

### What changes were proposed in this pull request? Implement the training hang diagnosis. ### Why are the changes needed? Detect training hangs and avoid wasting time. ### Does...
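
One common way to detect a training hang is a progress watchdog: record the time of the last completed step and flag a hang when no new step arrives within a timeout. The sketch below is not the implementation from this pull request. The class `HangDetector` and its parameters are hypothetical names chosen for illustration.

```python
import time


class HangDetector:
    """Flags a hang when the global step stops advancing (illustrative only)."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock              # injectable so the logic can be tested
        self._last_progress = clock()    # time of the last observed progress
        self._last_step = -1

    def report_step(self, step: int) -> None:
        # Only a strictly larger step counts as progress.
        if step > self._last_step:
            self._last_step = step
            self._last_progress = self._clock()

    def is_hung(self) -> bool:
        return self._clock() - self._last_progress > self.timeout_s
```

A training loop would call `report_step` after each iteration, while a monitor thread or agent polls `is_hung()` periodically.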

The restarted worker will fail again if the training fails due to a code bug. The job should exit as soon as possible to release resources on a cluster.
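
The fail-fast behavior requested here could look roughly like the following sketch. It assumes that code bugs can be recognized by the exception name in the worker's error log. The list `NON_RETRYABLE` and the function `should_restart` are hypothetical and do not reflect DLRover's actual restart logic.

```python
# Hypothetical list of exception names treated as code bugs (assumption).
NON_RETRYABLE = (
    "SyntaxError",
    "NameError",
    "TypeError",
    "AttributeError",
    "ImportError",
    "ModuleNotFoundError",
)


def should_restart(error_log: str, restarts_so_far: int, max_restarts: int = 3) -> bool:
    """Decide whether a failed worker is worth restarting (illustrative only)."""
    # A code bug will fail again after a restart, so exit and free resources.
    if any(name in error_log for name in NON_RETRYABLE):
        return False
    # Other failures (for example node or network faults) get a bounded retry.
    return restarts_so_far < max_restarts
```

Substring matching on log text is deliberately crude. A real system would more likely rely on exit codes or structured error reports.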


### What changes were proposed in this pull request? Provide a tool for recording and parsing loss spikes. ### Why are the changes needed? So that more people who train models can use it.
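
The pull request does not show how a loss spike is defined. As a minimal sketch under one possible definition, a step is a spike when its loss exceeds `factor` times the mean of the previous `window` non-spike losses. The function `find_loss_spikes` and both parameters are hypothetical, not the tool's API.

```python
from collections import deque


def find_loss_spikes(losses, window=5, factor=2.0):
    """Return (step, loss) pairs that exceed factor * rolling mean (illustrative only)."""
    history = deque(maxlen=window)  # recent non-spike losses used as the baseline
    spikes = []
    for step, loss in enumerate(losses):
        # Only judge a step once a full window of baseline values exists.
        if len(history) == window and loss > factor * (sum(history) / window):
            spikes.append((step, loss))
            continue  # keep the spike out of the baseline
        history.append(loss)
    return spikes
```

Excluding detected spikes from the baseline keeps a single outlier from masking a second spike that follows shortly after it.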