Tristan Rice

128 comments by Tristan Rice

TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'memory': 18038862642.0, 'CPU': 8.0, 'node:10.130.6.66': 0.999, 'object_store_memory':...

We don't have any current plans in this area, but I'm happy to work with you to add it. If you're interested in contributing, this might be good to...

Yeah I can do Thursday -- feel free to throw something on my calendar at [email protected]

@ljjsalt sent you an invite to the PT slack

adding this support for slurm wouldn't be too bad: 1) generalize the workspace file logic from docker_workspace (.torchxignore) 2) add a job_dir argument to allow specifying an isolation env 3)...
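A rough sketch of what steps 1 and 2 could look like: copy a workspace into a job directory while skipping files matched by an ignore file. This is not the actual TorchX implementation. The function names `read_ignore_patterns` and `copy_workspace` are hypothetical, and the pattern matching is simplified to plain `fnmatch` globs, which may differ from how `.torchxignore` is really interpreted.

```python
import fnmatch
import os
import shutil


def read_ignore_patterns(workspace: str, ignore_name: str = ".torchxignore") -> list:
    """Read glob patterns from the ignore file, skipping blanks and # comments."""
    path = os.path.join(workspace, ignore_name)
    if not os.path.exists(path):
        return []
    with open(path) as f:
        lines = [line.strip() for line in f]
    return [line for line in lines if line and not line.startswith("#")]


def copy_workspace(workspace: str, job_dir: str, ignore_name: str = ".torchxignore") -> list:
    """Copy workspace files into job_dir (hypothetical isolation dir), honoring the ignore file.

    Returns the sorted list of relative paths that were copied.
    """
    patterns = read_ignore_patterns(workspace, ignore_name)
    copied = []
    for root, _dirs, files in os.walk(workspace):
        rel_root = os.path.relpath(root, workspace)
        for name in files:
            rel = os.path.normpath(os.path.join(rel_root, name))
            # Simplified matching: any fnmatch glob hit excludes the file.
            if any(fnmatch.fnmatch(rel, p) for p in patterns):
                continue
            dest = os.path.join(job_dir, rel)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.copy2(os.path.join(root, name), dest)
            copied.append(rel)
    return sorted(copied)
```

Keeping the ignore logic separate from the copy step is what would let the same helper be shared between a Docker-based workspace and a Slurm one.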

For the heterogeneous jobs displaying differently, that's tricky in the current model. The macros like `replica_id` generally need to be applied on a per-worker basis. If we wrap the app...
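To illustrate why per-worker application matters, here is a minimal sketch that assumes macros are `${name}`-style placeholders in the argument list. The helper `expand_args` is hypothetical and only stands in for whatever the scheduler actually does.

```python
from string import Template


def expand_args(args, num_replicas):
    """Return one argument list per replica with ${replica_id} substituted.

    Hypothetical helper: each worker needs its own copy of the args because
    the substituted value differs per replica.
    """
    return [
        [Template(a).safe_substitute(replica_id=str(i)) for a in args]
        for i in range(num_replicas)
    ]
```

Since every replica ends up with a different concrete command, a single shared job definition cannot carry the expanded values.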

KFP pipeline specs are built on Argo, so it should be fairly straightforward to add support. Argo can also launch Volcano jobs, so we can get distributed for free existing...

https://github.com/volcano-sh/volcano/issues/1765

Looks like to support this we'll have to directly query all of the pods.

It's more on the Volcano side. Kubernetes retries indefinitely when images don't exist.