torchx
TorchX is a universal job launcher for PyTorch applications. It is designed for fast iteration during training and research, with support for end-to-end production ML pipelines when you're ready.
See: https://github.com/pytorch/torchx/actions/workflows/components-integration-tests.yaml
See: https://github.com/pytorch/torchx/actions/workflows/kfp-integration-tests.yaml
## Description I’m currently working with TorchX in conjunction with Volcano scheduling for my training jobs on an Amazon EKS cluster. I’ve also integrated Karpenter autoscaler for effective node scaling....
## ❓ Questions and Help ### Question Hi, could anyone provide a script to run PyTorch DDP training on IBM LSF?
## 🐛 Bug Module (check all that applies): * [ ] `torchx.spec` * [ ] `torchx.component` * [ ] `torchx.apps` * [ ] `torchx.runtime` * [x] `torchx.cli` * [ ]...
## 📚 Documentation ## Link [https://pytorch.org/torchx/latest/components/distributed.html](https://pytorch.org/torchx/latest/components/distributed.html) ## What does it currently say? It is not clear whether the --cpu and --gpu arguments are overridden by the -j arguments, although in my testing (launch then run...
## Description Add support for [HashiCorp Nomad](https://www.nomadproject.io/) as a scheduler. ## Motivation/Background Nomad has a good scheduler, and PyTorch has good distributed training. However, Nomad launches batch job tasks asynchronously...
## Description Switch static type checker to mypy and include mypy compatible type stubs (PEP 561 compliant) by adding a `py.typed` file at the root of `torchx` module (see https://mypy.readthedocs.io/en/stable/installed_packages.html#creating-pep-561-compatible-packages)....
This adds a new `runopts.from_typed_dict` method and uses it to generate the runopts from the typed dict field, annotations, default parameters and docstring. This simplifies adding new fields to schedulers...
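The idea of deriving option definitions from a typed dict can be sketched in plain Python. The sketch below does not use the TorchX API: the preview above does not show the signature of `runopts.from_typed_dict`, so `ExampleOpts`, `OptSpec`, and `opts_from_typed_dict` are hypothetical names invented for illustration. It covers only field names, annotations, and defaults, and skips the docstring parsing mentioned in the description.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, TypedDict, get_type_hints


class ExampleOpts(TypedDict, total=False):
    """Hypothetical scheduler options, invented for this sketch."""

    queue: str
    priority: int
    dry_run: bool


@dataclass
class OptSpec:
    """Hypothetical stand-in for a single run option definition."""

    name: str
    opt_type: type
    default: Optional[Any]
    required: bool


def opts_from_typed_dict(td: type, defaults: Dict[str, Any]) -> Dict[str, OptSpec]:
    """Build one OptSpec per TypedDict field.

    A field is treated as required when no default is supplied for it.
    This is an assumption of the sketch, not documented TorchX behavior.
    """
    specs: Dict[str, OptSpec] = {}
    # get_type_hints resolves the annotations declared on the TypedDict.
    for name, opt_type in get_type_hints(td).items():
        specs[name] = OptSpec(
            name=name,
            opt_type=opt_type,
            default=defaults.get(name),
            required=name not in defaults,
        )
    return specs


specs = opts_from_typed_dict(ExampleOpts, {"priority": 0, "dry_run": False})
for spec in specs.values():
    print(spec)
```

With this shape, adding a new option means adding one annotated field to the typed dict instead of a separate registration call, which appears to be the simplification the change is after.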
## 🐛 Bug Module (check all that applies): * [ ] `torchx.spec` * [ ] `torchx.component` * [ ] `torchx.apps` * [ ] `torchx.runtime` * [ ] `torchx.cli` * [...