torchx

TorchX is a universal job launcher for PyTorch applications. It is designed for fast iteration during training and research, and it supports end-to-end (E2E) production ML pipelines when you're ready.

Results 135 torchx issues

Elasticity: placement groups are executed as pending tasks that GCS schedules once resources become available. Related PR: #572. Test plan: mock cluster scaling with `ray.cluster_utils`.

CLA Signed
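
The elasticity behavior described in the item above (a group that does not fit stays pending and is placed once capacity appears) can be sketched with a toy model. This is a minimal illustration in plain Python, not Ray's GCS or the PR's code; `ToyCluster`, `request_group`, and `add_node` are hypothetical names invented for this sketch.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy stand-in for a cluster scheduler (hypothetical; not Ray's GCS)."""

    free_cpus: int
    pending: deque = field(default_factory=deque)
    placed: list = field(default_factory=list)

    def request_group(self, name: str, cpus: int) -> None:
        # Queue the request, then place whatever fits right now.
        self.pending.append((name, cpus))
        self._schedule()

    def add_node(self, cpus: int) -> None:
        # Scaling up frees capacity, so pending groups get another chance.
        self.free_cpus += cpus
        self._schedule()

    def _schedule(self) -> None:
        # Place pending groups in FIFO order while the head of the queue fits.
        while self.pending and self.pending[0][1] <= self.free_cpus:
            name, cpus = self.pending.popleft()
            self.free_cpus -= cpus
            self.placed.append(name)


cluster = ToyCluster(free_cpus=2)
cluster.request_group("workers-0", cpus=2)  # fits immediately
cluster.request_group("workers-1", cpus=2)  # no capacity left, stays pending
placed_before = list(cluster.placed)
cluster.add_node(cpus=2)  # scale-up lets the pending group run
placed_after = list(cluster.placed)
```

A real test along the lines of the stated test plan would presumably replace this toy with a mock cluster from `ray.cluster_utils`; the sketch only shows the pending-then-scheduled ordering.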

This PR adds two features for elastic distributed training to jobs launched by TorchX on a Ray cluster: 1. Fault Tolerance - Node failure throws RayActorError which can be...

CLA Signed

Summary: When logging the schedule call in TorchX, capture the image used. For FB, this will be an FBPkg ID. Differential Revision: D38526631

CLA Signed
fb-exported

This bumps the kfp and lightning versions. The previous diff only bumped lightning but had a dependency incompatibility for typing-extensions. https://github.com/pytorch/torchx/pull/574 Test plan: CI

CLA Signed

Differential Revision: D38448230

CLA Signed
fb-exported

Update linter link

CLA Signed

Initial TorchX component for hyperparameter tuning (https://github.com/pytorch/torchx/issues/510). UX: exposes `grid_search` and `bayesian` candidate selection strategies and requires input that defines the search space, e.g.: ``` { "params": { "p1": {...

CLA Signed
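
To illustrate what a grid-search candidate selection strategy does in general, here is a minimal sketch using only the standard library. The search-space format in the item above is truncated, so the `search_space` dictionary and the `grid_candidates` helper below are hypothetical and do not reflect the component's actual schema or API.

```python
from itertools import product

# Hypothetical search space: parameter name -> list of values to try.
# This is NOT the component's real input format, which is truncated above.
search_space = {"lr": [0.01, 0.1], "batch_size": [32, 64]}


def grid_candidates(space):
    """Yield one dict per point in the Cartesian product of all value lists."""
    names = sorted(space)  # fixed order so results are reproducible
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(grid_candidates(search_space))
```

Each candidate is one full parameter assignment, so the number of trials grows as the product of the value-list lengths. That growth is the usual reason to offer a Bayesian strategy alongside grid search.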

## Description Add a new `torchx dashboard` command that will launch a local HTTP server that allows users to view all of their jobs with statuses, logs and integration with...

enhancement
RFC
cli

## Description Support elastic training on Ray Cluster. ## Motivation/Background Training can tolerate node failures. The number of worker nodes can expand as the size of the cluster grows. ##...

enhancement
ray

It would be nice to have GCP + TPU support in addition to our existing schedulers. Currently you can run on GCP via Kubernetes + the Kubernetes scheduler but would...

enhancement
module: runner
scheduler-request