Tristan Rice
This changes `dist.ddp` so the torchelastic rank matches the `replica_id`. This is important when using Slurm with Lightning and NCCL+IB, since NCCL errors can occur, leading to crashes or deadlocks....
This contains two possible Hydra-aware TorchX components. One is a proxy to a different component and can be used instead of `.torchxconfig` and the other allows full control and...
Named resources need a bit of love - [ ] remove either get_resources or named_resources["foo"] to ensure only one entry path - [ ] make named_resources case insensitive - [...
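A minimal sketch of what the first two checklist items above could look like: a single lookup path combined with case-insensitive names. The `NamedResources` class, the dict-based `Resource` stand-in, and the `GPU_Small` entry are all hypothetical illustrations, not TorchX's actual API.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a resource spec; NOT TorchX's actual Resource class.
Resource = Dict[str, int]


class NamedResources:
    """Hypothetical registry with one entry path and case-insensitive names."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], Resource]] = {}

    def register(self, name: str, factory: Callable[[], Resource]) -> None:
        # Normalize the key once at registration time.
        self._factories[name.casefold()] = factory

    def get(self, name: str) -> Resource:
        # The only lookup path: normalize the name, then build a fresh resource.
        key = name.casefold()
        if key not in self._factories:
            raise KeyError(f"unknown named resource: {name!r}")
        return self._factories[key]()


registry = NamedResources()
# Example entry with made-up values (assumption for illustration only).
registry.register("GPU_Small", lambda: {"cpu": 8, "gpu": 1, "memMB": 32768})
```

With this shape, `registry.get("gpu_small")` and `registry.get("GPU_SMALL")` resolve to the same entry, and there is no second access path that could drift out of sync.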
There have been a number of failures for `test_start_on_file` in CI over time. We should fix this test so it's more stable. This failure was on...
## Description Quite a few of the cloud services / cluster tools for running ML jobs use OCI/Docker containers so I've been looking into how to make dealing with these...
## Description Add support for running TorchX components via the Slurm OCI interface. ## Motivation/Background Slurm 21.08+ has support for running OCI containers as the environment. This matches well with...
## Description We should add some notebook-specific integrations to make working with workspaces and launching remote jobs first class. This builds upon the workspace support tracked by #333. ##...
## 🐛 Bug Component (check all that apply): * [ ] `state api` * [ ] `train_step api` * [ ] `train_loop` * [x] `rendezvous` * [ ] `checkpoint` *...
##### ENVIRONMENT - Operating System: Arch Linux - Python version: Python 3.7.2 - VirtualBox version: virtualbox 6.0.4-4 - VirtualBox SDK version: virtualbox-sdk 6.0.4-4 - Location where VirtualBox SDK is installed:...
Desired features of a new/improved triplestore - Fast compression on disk (lz4?) - Fast queries - De-duplicate fields in memory (use tokens instead of ids/predicates?) - Use existing solution (Cayley?)...