DLRover: An Automatic Distributed Deep Learning System

Results 50 dlrover issues

Brain requires a job to specify several parameters that control how its requests are processed, e.g., processor, data store, and config retriever. Now those parameters are constants in...

enhancement

Provide definitions and suggested usage for NodeGroupResource, launch_nodes, and removed_nodes.

Generating the informer, client, and lister for the ElasticJob and scaler CRDs.

### What changes were proposed in this pull request? The nodes back up the checkpoint data in shared memory. ### Why are the changes needed? Shorten the time to load...

How is the elapsed time calculated when nodes are grouped? If node1 and node2 run DDP together as one group, shouldn't the elapsed time of these two nodes be the same?

# Environment 1. Running branch: master 2. k8s version: 1.22 3. CUDA version: 12.3 # Problem After running `kubectl apply -f examples/tensorflow/criteo_deeprec/manual_job.yaml`, the worker nodes never appear; only a master is present ``` kubectl get po -n dlrover NAME READY...

I am using dlrover with Megatron-DeepSpeed, and my machine has 4 GPUs. The hybrid parallel settings are as follows: TP: [0,1], [2,3]; DP: [0,2], [1,3]. At the same time, I also configured DeepSpeed with Zero...

In the process of training large-scale distributed models, we often encounter a variety of issues. Currently, we can only analyze the possible causes of these issues by examining the logs....

enhancement