mpi-operator

fix #639 provide NCCL tests example

Open samos123 opened this issue 1 year ago • 4 comments

Draft. I need to retest it now that I've stripped the GKE-specific parts out of the manifest.

samos123 • May 01 '24 01:05

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

google-oss-prow[bot] • May 01 '24 01:05

Is this gonna be revived?

surajssd • Apr 02 '25 23:04

@andreyvelich thanks for the pointers. But I was mainly looking at the MPI Operator from the "InfiniBand / RDMA setup validation / benchmarking" POV. As in, when a user creates a k8s cluster with worker nodes on an InfiniBand-based network, how do they know that their setup is working correctly? That's where this PR caught my attention, and I was wondering if there are plans to resuscitate it.

surajssd • Apr 07 '25 23:04

Thanks for letting us know. I think it would be nice if you could join one of our Training WG calls to discuss it further: https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit#heading=h.o8oe6e5kry87

We can talk more about where those benchmarks should live and how we can validate the InfiniBand setup with MPI Operator.

andreyvelich • Apr 08 '25 14:04