dlrover issues

Results 50 dlrover issues


Add Bayesian Optimization for DLRover Brain

### What changes were proposed in this pull request?

Add a simple Bayesian optimization implementation based on BoTorch for DLRover Brain.

### Why are the changes needed?

Useful for black-box...

yzlnew

OSError: [Errno 98] Address already in use

**dlrover version: v0.3.5, megatron version: main**

**I encountered an error when using flash checkpoint in megatron:**

Exception in thread checkpoint-saver:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()...

chencjcj
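On Linux, errno 98 is `EADDRINUSE`: a socket tried to bind to an address and port that another socket already holds. The sketch below is not taken from dlrover or Megatron and does not claim to identify the cause of this report; it only illustrates how the error arises and how a free port can be requested from the OS. The helper names `find_free_port` and `is_port_in_use` are hypothetical.

```python
import errno
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # The OS picked the port; it is released again when the socket closes.
        return sock.getsockname()[1]


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding to (host, port) fails with EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # Any other OSError is unrelated to this check.
        return False
```

A port returned by `find_free_port` can still be taken by another process before it is reused, so this kind of check narrows down a conflict rather than preventing one.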

possible typo in the example of [tf_elasticjob_on_k8s]

The file `docs/tutorial/tf_elasticjob_on_k8s.md` potentially contains a typo: the steps for creating an auto-scaling job are exactly the same as the steps for creating a manual-scaling job.

lichadehehehe

dlrover/blob/master/docs/tutorial/tf_elasticjob_on_k8s: tf_elasticjob_on_k8s example failed to start

While executing the tf_elasticjob_on_k8s example, the job fails with an error.

```
cd examples/tensorflow/criteo_deeprec
```

```
kubectl apply -f autoscale_job.yaml
```

The initial error messages are as follows:

```
[root@localhost test]# kubectl get pods...
```

lichadehehehe

make deploy IMG=easydl/elasticjob-controller:master

test -s /root/dlrover/dlrover/go/operator/bin/controller-gen || GOBIN=/root/dlrover/dlrover/go/operator/bin go install sigs.k8s.io/controller-tools/cmd/controller-gen@v0.9.2
go: sigs.k8s.io/controller-tools/cmd/controller-gen@v0.9.2: sigs.k8s.io/controller-tools/cmd/controller-gen@v0.9.2: Get "https://proxy.golang.org/sigs.k8s.io/controller-tools/cmd/controller-gen/@v/v0.9.2.info": dial tcp 172.217.163.49:443: i/o timeout
make: *** [Makefile:127: /root/dlrover/dlrover/go/operator/bin/controller-gen] Error 1

yangzhipeng1108