torchx
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
#809 was flawed and didn't change the programming logic. I tried to think of any way for build_workspace_and_update_role() to return whether a change in the built image is detected. There...
Added a docker API client, and changed the build function to its low-level version. This makes it return an event stream that we can then log to screen for real-time...
## Description
Changing the docker image build to its low level implementation so it can be more verbose.
## Motivation/Background
Building the docker image can take quite some time, and...
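A minimal sketch of how a build event stream like this might be turned into real-time log output. It assumes the Docker SDK for Python, whose low-level `APIClient.build(..., decode=True)` yields decoded dict events with keys such as `stream`, `error`, and `aux`. The helper names `iter_build_log_lines` and `build_with_live_logs` are hypothetical and are not taken from the TorchX change.

```python
from typing import Iterable, Iterator


def iter_build_log_lines(events: Iterable[dict]) -> Iterator[str]:
    """Yield printable log lines from decoded Docker build events.

    Hypothetical helper: raises on the first event that reports an error
    and skips events that carry no text (for example "aux" image IDs).
    """
    for event in events:
        if "error" in event:
            raise RuntimeError(event["error"])
        for line in event.get("stream", "").splitlines():
            if line.strip():
                yield line.rstrip()


def build_with_live_logs(path: str, tag: str) -> None:
    """Build an image and print build output as it arrives.

    Assumes the `docker` package is installed and a daemon is reachable.
    """
    import docker  # imported lazily so the parser above has no dependency

    api = docker.from_env().api  # low-level APIClient
    events = api.build(path=path, tag=tag, decode=True, rm=True)
    for line in iter_build_log_lines(events):
        print(line, flush=True)
```

Keeping the event parsing separate from the Docker call means the log formatting can be exercised with plain dicts, without a running daemon.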
## Description
HuggingFace accelerate is used for some OSS models. It would be great to have support for it as a component in addition to dist.ddp.
## Motivation/Background
## Detailed...
## 🐛 Bug
Custom components that use a binop union (`X | None`) instead of `Optional` result in a validation error. Custom schedulers work as intended because there is no validation for them.
Module (check all that applies): *...
## 🐛 Bug
Device Request capabilities should be updated to "gpu", not "compute"
https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L308
```
c.kwargs["device_requests"] = [
    DeviceRequest(
        count=resource.gpu,
        capabilities=[["compute"]],
    )
]
```
Module (check all that applies): *...
## 🐛 Bug
In DockerScheduler._submit_dryrun, the `hostname` keyword argument for docker.containers.run is set to `name`: https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L280
`name` is set to
```
name = f"{app_id}-{role.name}-{replica_id}"
```
https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L259C17-L260C1
It is typical/common for...
## Description
Provide a way to use the NVIDIA Network Operator through the CLI and API of the Kubernetes scheduler.
## Motivation/Background
The [NVIDIA Network Operator](https://github.com/Mellanox/network-operator) enables RDMA devices and...
1. Modify commands to initialize scheduler with options that can be defined in config. Generally most of the schedulers can operate using scheduler options, however in some cases for multi-tenant...