Atin Sood

15 comments by Atin Sood

@joaoderocha The cluster was deployed as a Helm chart on Kubernetes. We tested the autoscaling outside Horovod, and the cluster does seem to autoscale.

Also, we were wondering whether there is specific code in Horovod that tracks whether the cluster has autoscaled. We were not able to find a reference...

@joaoderocha Just an FYI, I think this is what we are trying to figure out in the context of Horovod and Ray: if we have a Ray cluster that can autoscale, will Horovod...

We were also wondering about the benchmarks you ran where you were adding or removing GPUs: what were you using to control that? Were you running an external...

@cloustone There's work underway to clean up the tight integration that we have, and we should have something out relatively soon. The thought process is that you can create a...

@cloustone `I just used dynamic external storage with NFS to deploy model train. It seems ok.` Curious how you got this going from a technical perspective :) Thinking more...

@cloustone Another interesting thing you can try is this: https://ai.intel.com/kubernetes-volume-controller-kvc-data-management-tailored-for-machine-learning-workloads-in-kubernetes/ and https://github.com/IntelAI/vck. We have been looking into this as well, but this can help bring data down to your nodes...

@Tomcli @fplk Did you try the Intel VCK approach with FfDL?

@ljjsalt @d4l3k I have been thinking more about this, and I am wondering if ray.util.queue is a better way of implementing this. You basically create 2 actors, a PlacementGroupManager actor and...

> The reason that we should use Placement Group is

Yeah, not disagreeing on that. I am just thinking about how we manage the interaction between pg creation and command...