caffe
Would it be possible to implement a centralized scheduler for Caffe workloads?
When using the Intel Distribution of Caffe to train multiple models, we found that it does not support workload scheduling in large shared clusters, which makes our deployment very complex. Would it be possible to implement a centralized scheduler for Caffe workloads?
In a typical commercial data center, servers are deployed in different racks, and servers in the same rack share the same Ethernet switch, so communication between them has lower latency than communication across racks. Based on this, we propose a rack-aware scheduler.
First, we would build a data structure that describes the cluster topology, recording how machines relate to racks. When a job requests computing resources, the scheduler can then choose suitable compute nodes to start it, preferring nodes in the same rack for better efficiency. In addition, every compute job has a weight vector, which the scheduler uses to decide on degradation or preemption. The weight vector can be determined by factors such as the amount of resources required, the type of network, and so on.
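
To make the idea more concrete, here is a minimal Python sketch of what such a scheduler could look like. It is only an illustration under our own assumptions and is not part of Caffe: the names `Job`, `RackAwareScheduler`, `submit`, and `release` are hypothetical, the topology is a plain mapping from rack name to node names, and the weight vector is reduced to a single priority by summing its components, which is just one possible choice. Degradation is not modeled; only preemption is shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Placement = List[Tuple[str, str]]  # list of (rack name, node name) pairs


@dataclass
class Job:
    name: str
    nodes_needed: int
    weight: Tuple[float, ...]  # hypothetical weight vector (resources, network, ...)

    def priority(self) -> float:
        # Assumption: collapse the weight vector into one number by summing it.
        return sum(self.weight)


class RackAwareScheduler:
    def __init__(self, topology: Dict[str, List[str]]):
        # Free nodes per rack; the topology itself is rack -> node names.
        self.free = {rack: list(nodes) for rack, nodes in topology.items()}
        self.running: Dict[str, Tuple[Job, Placement]] = {}

    def _pick_nodes(self, count: int) -> Placement:
        # Prefer the smallest rack that can host the whole job (best fit).
        fitting = [r for r, nodes in self.free.items() if len(nodes) >= count]
        if fitting:
            rack = min(fitting, key=lambda r: (len(self.free[r]), r))
            return [(rack, self.free[rack].pop()) for _ in range(count)]
        # Otherwise spill across racks, taking from the emptiest racks first.
        picked: Placement = []
        for rack in sorted(self.free, key=lambda r: (-len(self.free[r]), r)):
            while self.free[rack] and len(picked) < count:
                picked.append((rack, self.free[rack].pop()))
        return picked

    def submit(self, job: Job) -> Optional[Placement]:
        free_count = sum(len(nodes) for nodes in self.free.values())
        victims: List[str] = []
        # Consider running jobs from lowest to highest priority as victims.
        for cand, placement in sorted(self.running.values(),
                                      key=lambda entry: entry[0].priority()):
            if free_count >= job.nodes_needed or cand.priority() >= job.priority():
                break
            victims.append(cand.name)
            free_count += len(placement)
        if free_count < job.nodes_needed:
            return None  # cannot be satisfied; nothing is preempted
        for name in victims:
            self.release(name)
        placement = self._pick_nodes(job.nodes_needed)
        self.running[job.name] = (job, placement)
        return placement

    def release(self, name: str) -> None:
        # Return a finished or preempted job's nodes to the free pool.
        _, placement = self.running.pop(name)
        for rack, node in placement:
            self.free[rack].append(node)
```

In this sketch, a job that fits in one rack is always placed in a single rack, and a new job only preempts running jobs whose priority is strictly lower than its own. Victims are selected before anything is stopped, so a request that cannot be satisfied leaves the running jobs untouched.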