Adding a worker group to an additional network

Open kumar-aamit opened this issue 9 months ago • 0 comments

Name of Feature or Improvement

RDMA Networks

Description of Problem the Feature Should Solve

RDMA Networks for Efficient LLM Training

Describe the Solution You Would Like to See

Allow a `k8s.v1.cni.cncf.io/networks` annotation to be set on the worker group's pod template, so that the workers are attached to an additional network:

    "workerGroupSpecs": [
        {
            "replicas": cluster.config.num_workers,
            "minReplicas": cluster.config.num_workers,
            "maxReplicas": cluster.config.num_workers,
            "groupName": f"small-group-{cluster.config.name}",
            "rayStartParams": {
                "block": "true",
                "num-gpus": str(worker_gpu_count),
                "resources": worker_resources,
            },
            "template": V1PodTemplateSpec(
                metadata=V1ObjectMeta(
                    annotations={
                        "k8s.v1.cni.cncf.io/networks": [,,...]
                    }
                ),
                spec=get_pod_spec(
                    cluster,
                    [get_worker_container_spec(cluster)],
                    cluster.config.worker_tolerations,
                )
            ),
        }
    ],
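One detail the snippet above leaves open is the annotation value. Kubernetes annotation values must be strings, and the `k8s.v1.cni.cncf.io/networks` annotation is normally given either as a comma-separated list of network attachment names or as a JSON array encoded as a string. A minimal sketch of building that value is below. The helper names `network_annotations` and `worker_template_metadata` and the network names `rdma-net-a` / `rdma-net-b` are hypothetical, used only for illustration; none of this is existing codeflare-sdk API.

```python
import json

# Annotation key taken from the snippet in this issue.
NETWORKS_KEY = "k8s.v1.cni.cncf.io/networks"


def network_annotations(networks):
    """Build the pod annotation dict for additional networks.

    `networks` is a non-empty list. Plain names are joined with commas;
    if any entry is a dict (for example {"name": ..., "namespace": ...}),
    the whole list is serialized as a JSON string instead.
    """
    if not networks:
        raise ValueError("at least one network is required")
    if all(isinstance(n, str) for n in networks):
        value = ",".join(networks)
    else:
        value = json.dumps(networks)
    return {NETWORKS_KEY: value}


def worker_template_metadata(networks):
    """Return plain-dict metadata that could back the worker pod template."""
    return {"annotations": network_annotations(networks)}


if __name__ == "__main__":
    # Hypothetical network attachment names.
    print(worker_template_metadata(["rdma-net-a", "rdma-net-b"]))
```

The resulting dict could be passed as `annotations=` to `V1ObjectMeta` in the worker template, assuming the SDK exposes a configuration field for the list of networks.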

Describe Alternatives You Have Considered

Description of any alternative solutions or features you have considered.

Additional Context

Add any other context, screenshots, console logs, etc. about the request here.

kumar-aamit, Mar 06 '25 23:03