fix #639 provide NCCL tests example
Draft: I need to retest it now that I've stripped the GKE-specific parts out of the manifest.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Is this gonna be revived?
@andreyvelich thanks for the pointers. But I was mainly looking at the MPI Operator from the "InfiniBand / RDMA setup validation / benchmarking" POV. As in, when a user creates a k8s cluster with worker nodes that support an InfiniBand-based network, how do they know that their setup is working correctly? That's where this PR caught my attention, and I was wondering if there are plans to resuscitate it.
Thanks for letting us know. I think it would be nice if you could join one of our Training WG calls to discuss it further: https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit#heading=h.o8oe6e5kry87
We can talk more about where those benchmarks should live and how we can validate the InfiniBand setup with the MPI Operator.