[discussion] What's the best practice to run a Ludwig job with a remote Ray cluster on Kubernetes?
Currently, I am using KubeRay to start a distributed Ray cluster on Kubernetes and expose an endpoint that can be accessed externally, either via a LoadBalancer or a NodePort. Then I start a Kubernetes Job to play the role of the Ludwig job driver; I inject RAY_ADDRESS=ray://10.227.151.166:32471 into the Job so that `ray.init()` picks up the remote cluster address. After the Ludwig job is done, the driver Job finishes and I can recycle the remote Ray cluster as well.
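For context, here is a minimal sketch of the address resolution the driver relies on. The lookup helper and the commented driver snippet are illustrative (the script name, config file, and dataset path are assumptions, not from this thread); Ray itself reads the `RAY_ADDRESS` environment variable when `ray.init()` is called with no explicit address.

```python
def resolve_ray_address(env: dict) -> str:
    """Simplified version of Ray's lookup order: an explicit
    RAY_ADDRESS env var wins, otherwise fall back to "auto"
    (attach to a cluster running on the local machine)."""
    return env.get("RAY_ADDRESS", "auto")

# Inside the Kubernetes Job, the driver would then do something like
# (requires a reachable Ray cluster, so it is only sketched here):
#
#   import os
#   import ray
#   from ludwig.api import LudwigModel
#
#   ray.init(address=resolve_ray_address(dict(os.environ)))
#   model = LudwigModel(config="config.yaml", backend="ray")
#   model.train(dataset="s3://my-bucket/train.parquet")

print(resolve_ray_address({"RAY_ADDRESS": "ray://10.227.151.166:32471"}))
```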
I am asking whether that's the best practice. Are there better approaches, like submitting the Ludwig job as a Ray job to an existing Ray cluster?
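The alternative mentioned above would go through the Ray Jobs API, which talks to the cluster's dashboard endpoint (port 8265 by default) rather than the Ray Client port. A minimal sketch, assuming a hypothetical `train_ludwig.py` driver script and pip dependencies; the endpoint and paths are illustrative:

```python
def build_submission(entrypoint: str, working_dir: str) -> dict:
    """Assemble the kwargs for JobSubmissionClient.submit_job():
    the command to run, plus a runtime_env that ships the local
    working directory and installs Ludwig on the cluster."""
    return {
        "entrypoint": entrypoint,
        "runtime_env": {
            "working_dir": working_dir,
            "pip": ["ludwig[ray]"],
        },
    }

# With a reachable cluster this would be:
#
#   from ray.job_submission import JobSubmissionClient
#   client = JobSubmissionClient("http://10.227.151.166:8265")
#   job_id = client.submit_job(**build_submission("python train_ludwig.py", "./"))

print(build_submission("python train_ludwig.py", "./")["entrypoint"])
```

One upside of this route is that the driver process runs inside the cluster, so you don't need a separate Kubernetes Job to act as the driver.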
Hi, @Jeffwan
I think others on Ludwig/Predibase can advise on best practice here.
FWIW I personally have only run Ludwig AutoML against an existing Ray cluster (with the Ray cluster sometimes deployed on K8s, sometimes on individual VMs). I start the Ludwig AutoML job(s) from the Ray head (sometimes with the Ray autoscaler enabled, sometimes not). I typically then spin down the Ray cluster after I've finished all my AutoML jobs & retrieved artifacts of interest from the head after the run completes.
Hi @Jeffwan, I think the approach of using a Job should work well, particularly to support retries in the event of node failure. The docs we have for running with KubeRay only give an example of running directly on the head node, so we can definitely add detail for running production jobs.
We actually use something a bit more involved internally at Predibase, where we run a Temporal worker inside the Ray head node that pulls training requests from a job queue. But this would be a lot of infra setup if you're not already using Temporal (that said, it has been working quite well for us).
Probably the biggest gotcha when using Kubernetes is to ensure syncing is set up properly. If you have checkpoints being written to object storage, it should be fine, but if you want to copy checkpoints between nodes in the k8s cluster, we have a PR in review (https://github.com/ludwig-ai/ludwig/pull/2115) that should address some issues we were seeing with the Kubernetes namespace syncer in Ray.
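To illustrate the object-storage route above: pointing checkpoint output at an `s3://` (or `gs://`) URI sidesteps node-to-node syncing entirely, since every worker reads and writes the same bucket. A tiny sketch; the bucket name, prefix, and run ID below are illustrative assumptions:

```python
def checkpoint_uri(bucket: str, run_id: str) -> str:
    """Build an object-storage URI so workers write checkpoints
    to a shared bucket instead of copying between k8s nodes.
    The "ludwig-runs" prefix is just an illustrative convention."""
    return f"s3://{bucket}/ludwig-runs/{run_id}"

# The run's output directory would then be set to something like:
print(checkpoint_uri("my-bucket", "exp1"))
```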
Hi @Jeffwan, we've now merged #2115 that enables syncing checkpoints between k8s nodes for a Kubernetes deployment.
The above comments address all my questions and concerns. Thanks a lot for the support! I'll close the issue now.