torchx issues

moved image diff logging to their respective workspace

1

#809 was flawed and didn't change the programming logic. I tried to think of any way for build_workspace_and_update_role() to return whether a change in the built image is detected. There...

ccharest93

CLA Signed

Docker build

2

Added a docker API client, and changed the build function to its low-level version. This makes it return an event stream that we can then log to screen for real-time...

ccharest93

CLA Signed
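
The event-stream idea described in this PR can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `stream_build_log` and `BuildError` are hypothetical names, and the event keys (`stream`, `status`, `error`, `aux`) are assumed to match the decoded dictionaries the Docker SDK's low-level `APIClient.build(..., decode=True)` yields.

```python
from typing import Any, Callable, Dict, Iterable, Optional


class BuildError(RuntimeError):
    """Raised when the build event stream reports an error (hypothetical name)."""


def stream_build_log(
    events: Iterable[Dict[str, Any]],
    emit: Callable[[str], None] = print,
) -> Optional[str]:
    """Forward build output line by line and return the image ID, if reported.

    `events` is assumed to be an iterable of already-decoded event dicts,
    such as the generator returned by a low-level Docker build call.
    """
    image_id: Optional[str] = None
    for event in events:
        # Fail fast as soon as the stream reports an error.
        if "error" in event:
            raise BuildError(str(event["error"]))
        # The final image ID is assumed to arrive in an "aux" event.
        aux = event.get("aux")
        if isinstance(aux, dict):
            image_id = aux.get("ID", image_id)
        # Emit each non-empty line of output as soon as it is received.
        text = event.get("stream") or event.get("status") or ""
        for line in str(text).splitlines():
            if line.strip():
                emit(line.rstrip())
    return image_id
```

Passing a logger method as `emit` (for example `logging.getLogger(__name__).info`) would route the build output through the usual logging setup instead of printing it.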

Docker build verbosity

3

## Description Changing the docker image build to its low level implementation so it can be more verbose. ## Motivation/Background Building the docker image can take quite some time, and...

ccharest93

HuggingFace accelerate component

## Description HuggingFace accelerate is used for some OSS models. It would be great to have support for it as a component in addition to dist.ddp. ## Motivation/Background ## Detailed...

d4l3k

Determine scheduler from component level

1

## ❓ Questions and Help ### Please note that this issue tracker is not a help form and this issue will be closed. Before submitting, please ensure you have gone...

ryxli

[py310+] custom components binary operator not supported in file linting

## 🐛 Bug

Custom components that use a binary operator (`X | None`) instead of `Optional` result in a validation error. Custom schedulers work as intended because they are not validated.

Module (check all that applies): *...

ryxli
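
To illustrate why the two spellings can be treated differently by an AST-based check, here is a small sketch. It is not torchx's linter: `is_optional_annotation` and `optional_params` are hypothetical helpers. It relies only on the standard `ast` module, where `Optional[X]` parses as a `Subscript` node and `X | None` parses as a `BinOp` node with a `BitOr` operator.

```python
import ast
from typing import Dict


def _is_none(node: ast.expr) -> bool:
    """True if the node is the literal `None`."""
    return isinstance(node, ast.Constant) and node.value is None


def is_optional_annotation(node: ast.expr) -> bool:
    """Return True for `Optional[X]`, `typing.Optional[X]`, or `X | None`."""
    # Optional[X] is a Subscript whose value is the name (or attribute) Optional.
    if isinstance(node, ast.Subscript):
        target = node.value
        if isinstance(target, ast.Name):
            return target.id == "Optional"
        return getattr(target, "attr", None) == "Optional"
    # X | None is a BinOp with the BitOr operator; either side may be None.
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.BitOr):
        return any(
            _is_none(side) or is_optional_annotation(side)
            for side in (node.left, node.right)
        )
    return False


def optional_params(source: str) -> Dict[str, bool]:
    """Map each annotated parameter of the first function found to its optional-ness."""
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    return {
        arg.arg: is_optional_annotation(arg.annotation)
        for arg in func.args.args
        if arg.annotation is not None
    }
```

A check that only looks for the `Subscript` form would miss the `BinOp` form, which is consistent with the behavior reported above.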

local_docker scheduler unable to set gpu correctly

## 🐛 Bug

Device Request capabilities should be updated to "gpu", not "compute":
https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L308

```
c.kwargs["device_requests"] = [
    DeviceRequest(
        count=resource.gpu,
        capabilities=[["compute"]],
    )
]
```

Module (check all that applies): *...

ryxli

local_docker scheduler hostname length exceeded

## 🐛 Bug

In `DockerScheduler._submit_dryrun`, the `hostname` keyword argument for `docker.containers.run` is set to `name`:
https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L280

`name` is set to

```
name = f"{app_id}-{role.name}-{replica_id}"
```

https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L259C17-L260C1

It is typical/common for...

ryxli
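
One possible way to keep such a generated name within a hostname length limit is sketched below. This is an illustration only, not a fix taken from the issue: `make_hostname` is a hypothetical helper, and the 63-character bound is an assumption based on the DNS label limit, chosen here as a conservative value.

```python
import hashlib

# Assumed bound: 63 characters, the DNS label limit. Adjust if the target
# environment enforces a different maximum.
MAX_HOSTNAME_LEN = 63


def make_hostname(
    app_id: str,
    role_name: str,
    replica_id: int,
    max_len: int = MAX_HOSTNAME_LEN,
) -> str:
    """Build `{app_id}-{role_name}-{replica_id}`, shortening it if too long."""
    name = f"{app_id}-{role_name}-{replica_id}"
    if len(name) <= max_len:
        return name
    # Truncation alone could make two replicas share a hostname, so a short
    # digest of the full name is appended to keep shortened names distinct.
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:8]
    prefix = name[: max_len - len(digest) - 1].rstrip("-")
    return f"{prefix}-{digest}"
```

The container `name` could stay unchanged while only the `hostname` argument uses the shortened value, so existing lookups by name would not be affected.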

Add Support for NVIDIA Network Operator to the Kubernetes Scheduler

7

## Description Provide a way to use the NVIDIA Network Operator through the CLI and API of the Kubernetes scheduler. ## Motivation/Background The [NVIDIA Network Operator](https://github.com/Mellanox/network-operator) enables RDMA devices and...

benash

Azure batch scheduler implementation

3

1. Modify commands to initialize scheduler with options that can be defined in config. Generally most of the schedulers can operate using scheduler options, however in some cases for multi-tenant...

kurman

CLA Signed
