[WIP] set head node ip in jobs

Open MichaelClifford opened this issue 2 years ago • 1 comments

Issue link

https://github.com/project-codeflare/torchx/issues/4

What changes have been made

While working through the upstreaming of our changes, I discovered that we could move some of the logic into the SDK itself. This suits our use case better, because we have specific needs and a degree of control over job submission that the average torchx user likely does not have.

I've added a new function in Cluster, get_head_ip, that returns the IP address of the Ray head node within the cluster.
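
For reviewers unfamiliar with how a head node can be located, here is a minimal sketch. It is not the code in this PR. It assumes the cluster's pods are available as plain dicts in the Kubernetes Pod JSON shape and that the head pod carries the KubeRay label `ray.io/node-type: head`. The helper name `find_head_ip` and the sample pods are hypothetical.

```python
from typing import Optional

# Assumption: KubeRay labels the head pod with ray.io/node-type=head.
HEAD_LABEL_KEY = "ray.io/node-type"
HEAD_LABEL_VALUE = "head"


def find_head_ip(pods: list[dict]) -> Optional[str]:
    """Return the pod IP of the first pod labelled as the Ray head, or None."""
    for pod in pods:
        labels = pod.get("metadata", {}).get("labels", {})
        if labels.get(HEAD_LABEL_KEY) == HEAD_LABEL_VALUE:
            return pod.get("status", {}).get("podIP")
    return None


# Hypothetical sample data in the shape of Kubernetes Pod objects.
sample_pods = [
    {"metadata": {"labels": {"ray.io/node-type": "worker"}},
     "status": {"podIP": "10.0.0.7"}},
    {"metadata": {"labels": {"ray.io/node-type": "head"}},
     "status": {"podIP": "10.0.0.5"}},
]

head_ip = find_head_ip(sample_pods)
```

Returning None when no head pod is found leaves it to the caller to decide whether a missing head node is an error.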

I've added a new function to DDPJobDefinition, _set_rdzv_as_head_node, that updates the torchx cmd arguments to force the use of the head node as the rendezvous (rdzv) endpoint.
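
As an illustration of the idea (not the implementation in this PR), the rewrite can be thought of as replacing the value of the `--rdzv_endpoint` flag in an argument list. The sketch below assumes the arguments are a flat list of strings in torchrun style. The helper name `set_rdzv_endpoint`, the sample arguments, and the port 29500 are hypothetical.

```python
def set_rdzv_endpoint(args: list[str], endpoint: str) -> list[str]:
    """Return a copy of args with --rdzv_endpoint pointing at `endpoint`.

    Handles both "--rdzv_endpoint VALUE" and "--rdzv_endpoint=VALUE".
    If the flag is absent, the arguments are returned unchanged.
    """
    out: list[str] = []
    skip_next = False
    for arg in args:
        if skip_next:
            # Drop the old value that followed a bare --rdzv_endpoint flag.
            skip_next = False
            continue
        if arg == "--rdzv_endpoint":
            out += ["--rdzv_endpoint", endpoint]
            skip_next = True
        elif arg.startswith("--rdzv_endpoint="):
            out.append(f"--rdzv_endpoint={endpoint}")
        else:
            out.append(arg)
    return out


# Hypothetical arguments as they might look before the rewrite.
original_args = [
    "--rdzv_backend", "c10d",
    "--rdzv_endpoint", "localhost:29500",
    "--nnodes", "2",
    "train.py",
]

# Assumed head node IP and port, for illustration only.
patched_args = set_rdzv_endpoint(original_args, "10.0.0.5:29500")
```

The function returns a new list instead of mutating its input, so the original arguments stay available for comparison or logging.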

I also rearranged how _dry_run calls ddp to allow editing of its outputs prior to a job submission.

Verification steps

Checks

  • [ ] I've made sure the tests are passing.
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Manual tests
    • [ ] Testing is not required for this change

MichaelClifford avatar Aug 28 '23 14:08 MichaelClifford

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please ask for approval from michaelclifford. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Aug 28 '23 14:08 openshift-ci[bot]

ddp has been removed from the SDK. Please reopen if this PR is still applicable.

KPostOffice avatar Jul 11 '24 17:07 KPostOffice