Brad Miro
Hey @roarjn, sorry you're having issues! I have two preliminary suggestions: 1. Can you try manually deleting the `~/.config/gcloud/credentials.db` file and then re-authenticating? 2. If that doesn't work, can you...
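For anyone who would rather script suggestion 1, here is a minimal Python sketch. The only thing taken from the comment above is the file path; the helper name `remove_credentials_db` is hypothetical, and the path is a parameter so the function can be tried on a throwaway file first.

```python
from pathlib import Path

# Default location named in the suggestion above.
DEFAULT_CREDENTIALS_DB = Path.home() / ".config" / "gcloud" / "credentials.db"


def remove_credentials_db(path: Path = DEFAULT_CREDENTIALS_DB) -> bool:
    """Delete the cached credentials file.

    Returns True if a file was removed, False if there was nothing to remove.
    (Hypothetical helper for illustration, not part of any gcloud tooling.)
    """
    try:
        path.unlink()
    except FileNotFoundError:
        return False
    return True
```

Calling `remove_credentials_db()` with no argument targets the default path; after that, re-authenticate as usual (for example with `gcloud auth login`).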
Glad it's working for you now @roarjn !
For #2 you may be missing `projectid`: https://github.com/dask/dask-cloudprovider/blob/53d3c92098ff58029d1d98041b38d3eebf9c7713/dask_cloudprovider/gcp/instances.py#L356-L365
@ivanmkc does that repo have automated notebook testing? Maybe we can move the notebooks there. @leahecole wdyt?
Great observation and I believe you are correct: [here](https://github.com/linkedin/TonY/blob/12227c7b896388f6d37af8bd8934598031b34290/tony-examples/mnist-pytorch/mnist_distributed.py#L197) it shows the `tcp` backend being used. Adding `--backend gloo` or `--backend nccl` (on a gpu cluster) to `--task_params` changed the...
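To illustrate what passing `--backend` through amounts to, here is a rough sketch of a training script that reads the flag and hands it to `torch.distributed.init_process_group`. This is not the TonY example's actual code: the function names `build_parser` and `init_distributed`, the default of `gloo`, and the use of `env://` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing a --backend flag (assumed layout, not TonY's own)."""
    parser = argparse.ArgumentParser(description="Distributed training sketch")
    parser.add_argument(
        "--backend",
        choices=("gloo", "nccl", "mpi"),
        default="gloo",  # assumption: CPU-friendly default
        help="torch.distributed backend; nccl needs GPUs, mpi needs an MPI build",
    )
    return parser


def init_distributed(backend: str, rank: int, world_size: int,
                     init_method: str = "env://") -> None:
    """Initialise the process group with the chosen backend.

    torch is imported lazily so the parser can be used without it installed.
    With env://, MASTER_ADDR and MASTER_PORT must be set in the environment.
    """
    import torch.distributed as dist

    dist.init_process_group(
        backend=backend,
        init_method=init_method,
        rank=rank,
        world_size=world_size,
    )
```

A script would call `args = build_parser().parse_args()` and then `init_distributed(args.backend, rank, world_size)`, with `rank` and `world_size` supplied by whatever launches the workers.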
Sure, I can look into this.
@oliverhu are there special considerations for using TonY with PyTorch? The error seems to be in properly configuring [init_process_group](https://pytorch.org/docs/stable/_modules/torch/distributed/distributed_c10d.html#init_process_group). The current code is...
The `mpi` runtime does not work without an MPI installation, and we don't include one by default in the Dataproc image. The `nccl` backend does not seem to work, but I am...
`nccl` error with GPUs attached to all machines:
```
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
```
This might be a PyTorch thing, I can look into it more...