codeflare-sdk
[WIP] set head node ip in jobs
Issue link
https://github.com/project-codeflare/torchx/issues/4
What changes have been made
While working through the upstreaming of our changes, I found that some of the logic could move into the SDK itself. This suits our use case better: we have specific needs, and we have control over job submission that the average torchx user likely does not have.
I've added a new function to Cluster, get_head_ip, that returns the IP address of the Ray head node in the cluster.
I've added a new function to DDPJobDefinition, _set_rdzv_as_head_node, that updates the torchx cmd arguments to force the use of the head node as the rendezvous (rdzv) endpoint.
I also rearranged how _dry_run calls ddp so that its output can be edited before the job is submitted.
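For reviewers, here is a minimal sketch of the kind of argument rewrite described above. It is not the code in this PR. It assumes the dry-run output exposes the launch command as a flat list of strings containing a torchrun-style --rdzv_endpoint flag. The helper name set_rdzv_endpoint, the example IP address, and the default port are all hypothetical and chosen only for illustration.

```python
from typing import List


def set_rdzv_endpoint(args: List[str], head_ip: str, port: int = 29400) -> List[str]:
    """Return a copy of `args` with --rdzv_endpoint pointing at head_ip:port.

    Hypothetical helper: assumes `args` is a flat command list and that the
    flag is already present in either "--flag value" or "--flag=value" form.
    """
    endpoint = f"{head_ip}:{port}"
    out: List[str] = []
    replaced = False
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "--rdzv_endpoint":
            # "--rdzv_endpoint <value>": keep the flag, swap the value.
            out.extend([arg, endpoint])
            i += 2  # skip the old value
            replaced = True
        elif arg.startswith("--rdzv_endpoint="):
            # "--rdzv_endpoint=<value>": rewrite the whole token.
            out.append(f"--rdzv_endpoint={endpoint}")
            i += 1
            replaced = True
        else:
            out.append(arg)
            i += 1
    if not replaced:
        # Appending the flag blindly could place it after the script name,
        # so fail loudly instead of guessing where it belongs.
        raise ValueError("--rdzv_endpoint not found in command arguments")
    return out


if __name__ == "__main__":
    # Example command list; the values are made up for illustration.
    cmd = [
        "torchrun",
        "--rdzv_backend", "c10d",
        "--rdzv_endpoint", "localhost:29400",
        "train.py",
    ]
    print(set_rdzv_endpoint(cmd, "10.0.0.5"))
```

The helper returns a new list rather than mutating its input, so the original dry-run output stays available for comparison before submission.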
Verification steps
Checks
- [ ] I've made sure the tests are passing.
- Testing Strategy
- [ ] Unit tests
- [ ] Manual tests
- [ ] Testing is not required for this change
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from michaelclifford. For more information see the Kubernetes Code Review Process.
ddp has been removed from the SDK. Please reopen if this PR is still applicable.