Brad Miro

26 comments by Brad Miro

Hey @roarjn, sorry you're having issues! I have two preliminary suggestions: 1. Can you try manually deleting the `~/.config/gcloud/credentials.db` file and then re-authenticating? 2. If that doesn't work, can you...

For #2 you may be missing `projectid`: https://github.com/dask/dask-cloudprovider/blob/53d3c92098ff58029d1d98041b38d3eebf9c7713/dask_cloudprovider/gcp/instances.py#L356-L365

@ivanmkc does that repo have automated notebook testing? Maybe we can move the notebooks there. @leahecole wdyt?

Great observation, and I believe you are correct: [here](https://github.com/linkedin/TonY/blob/12227c7b896388f6d37af8bd8934598031b34290/tony-examples/mnist-pytorch/mnist_distributed.py#L197) it shows the `tcp` backend being used. Adding `--backend gloo` or `--backend nccl` (on a GPU cluster) to `--task_params` changed the...
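To illustrate the backend choice discussed here, below is a minimal sketch of picking a `torch.distributed` backend before calling `init_process_group`. It is not the TonY example's code: `pick_backend` and `init_distributed` are hypothetical helper names, and the rule "use `nccl` only when CUDA and NCCL are both available, otherwise fall back to `gloo`" is an assumption about what a reasonable default looks like.

```python
def pick_backend(has_cuda: bool, nccl_available: bool) -> str:
    """Return a torch.distributed backend name (hypothetical helper).

    Assumption: prefer `nccl` on GPU machines where it is built in,
    and fall back to `gloo` everywhere else.
    """
    if has_cuda and nccl_available:
        return "nccl"
    return "gloo"


def init_distributed(rank: int, world_size: int, init_method: str) -> str:
    """Initialise the default process group and return the backend used."""
    # Imported lazily so pick_backend can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    backend = pick_backend(
        has_cuda=torch.cuda.is_available(),
        nccl_available=dist.is_nccl_available(),
    )
    dist.init_process_group(
        backend=backend,
        init_method=init_method,  # e.g. "tcp://<host>:<port>", supplied by the launcher
        rank=rank,
        world_size=world_size,
    )
    return backend
```

In a job like the one above, the value passed via `--backend` would take the place of the `pick_backend` result, so the sketch is mainly useful as a fallback when no flag is given.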

Sure, I can look into this.

@oliverhu are there special considerations re: TonY for use with PyTorch? The error seems to come from configuring [init_process_group](https://pytorch.org/docs/stable/_modules/torch/distributed/distributed_c10d.html#init_process_group) properly. The current code is...

The `mpi` runtime does not work without an MPI installation, and we don't include one by default in the Dataproc image. The `nccl` backend does not seem to work, but I am...

`nccl` error with GPUs attached to all machines: `RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8` This might be a PyTorch thing, I can look into it more...