TonY
PyTorch Support
Hi there, I'm working on an update to the TonY installation script for GCP Dataproc. While I have been able to (locally) successfully update TensorFlow, I cannot seem to get the PyTorch example working. It does not work on 0.4 (the most recent version you explicitly mention supporting) or on 1.7.1, the most recent release. I get the following error:
File "mnist_distributed.py", line 230, in <module>
main()
File "mnist_distributed.py", line 225, in main
init_process(args)
File "mnist_distributed.py", line 185, in init_process
distributed.init_process_group(
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1607810694534_0006/container_1607810694534_0006_01_000003/venv/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 413, in init_process_group
backend = Backend(backend)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1607810694534_0006/container_1607810694534_0006_01_000003/venv/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 76, in __new__
raise ValueError("TCP backend has been deprecated. Please use "
ValueError: TCP backend has been deprecated. Please use Gloo or MPI backend for collective operations on CPU tensors.
Latest attempt: PyTorch 1.7.1, torchvision 0.8.2, TonY 0.4.0, Dataproc 2.0 (Hadoop 3.2.1)
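For context, the deprecation itself can be reproduced without TonY; a minimal sketch, assuming PyTorch 1.7.1:

# Constructing the Backend value from the string "tcp" raises the same
# ValueError as the traceback above, so the failure comes purely from the
# backend name the example passes to init_process_group.
import torch.distributed as dist

try:
    dist.Backend("tcp")
except ValueError as e:
    print(e)  # TCP backend has been deprecated. Please use Gloo or MPI ...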
Config:
<configuration>
<property>
<name>tony.application.name</name>
<value>PyTorch</value>
</property>
<property>
<name>tony.application.security.enabled</name>
<value>false</value>
</property>
<property>
<name>tony.worker.instances</name>
<value>2</value>
</property>
<property>
<name>tony.worker.memory</name>
<value>4g</value>
</property>
<property>
<name>tony.ps.instances</name>
<value>1</value>
</property>
<property>
<name>tony.ps.memory</name>
<value>2g</value>
</property>
<property>
<name>tony.application.framework</name>
<value>pytorch</value>
</property>
<property>
<name>tony.worker.gpus</name>
<value>1</value>
</property>
</configuration>
The cluster has 1 master, 2 workers, and 2 NVIDIA Tesla T4s. However, every combination of configuration I have tried up to this point results in the same error. Any advice would be greatly appreciated!
@gogasca any idea? I guess we need to upgrade the PyTorch example script. I don't see TonY or GCP being the issue here.
Great observation, and I believe you are correct: here it shows the tcp backend being used. Adding --backend gloo or --backend nccl (on a GPU cluster) to --task_params changed the error message, so it looks like the example just needs a refresh.
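Roughly what a refresh could look like, as a sketch only: the --backend flag name matches what I passed via --task_params, but the RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT environment variables are assumptions for illustration; I haven't confirmed how the existing example wires up rank and world size.

import argparse
import os

import torch.distributed as dist

def init_process(args):
    # env:// rendezvous reads MASTER_ADDR / MASTER_PORT; rank and world size
    # are passed explicitly here from environment variables (an assumption,
    # not necessarily how the TonY example supplies them).
    dist.init_process_group(
        backend=args.backend,  # "gloo" for CPU tensors, "nccl" for GPU
        init_method="env://",
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", default="gloo",
                        help="gloo (CPU) or nccl (GPU); tcp is no longer valid")
    args, _ = parser.parse_known_args()
    init_process(args)

if __name__ == "__main__":
    main()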
@bradmiro would you mind contributing a patch to fix that?
Sure, I can look into this.
@oliverhu are there special considerations that need to be taken into account re: TonY for use with PyTorch? The issue seems to be with properly configuring init_process_group.
The current code is this: https://github.com/linkedin/TonY/blob/master/tony-examples/mnist-pytorch/mnist_distributed.py#L184-L189
Changing the backend to gloo throws "connection refused" errors at runtime.
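In case it helps narrow the connection refused down, a small connectivity probe I would run on each task before init_process_group; this is a sketch and assumes the standard torch.distributed MASTER_ADDR / MASTER_PORT environment variables, which may not be how the example is wired:

import os
import socket

# "Connection refused" from gloo often means a non-zero rank cannot reach the
# rendezvous endpoint; this checks plain TCP reachability first.
addr = os.environ.get("MASTER_ADDR", "localhost")
port = int(os.environ.get("MASTER_PORT", "29500"))
print(f"rank {os.environ.get('RANK', '?')} probing {addr}:{port}")
with socket.create_connection((addr, port), timeout=10):
    print("rendezvous endpoint reachable")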
That should not matter, all those backends should work 🤔 Have you tried other backends?
The mpi backend does not work without an MPI installation, and we don't include one by default in the Dataproc image.
nccl does not seem to work either, but I am also testing on a cluster that only has GPUs allocated to the workers, not the master. The TensorFlow job seemed to work with GPUs attached just to the master, but I am creating a fresh cluster with a GPU attached to the master node as well.
nccl error with gpus attached to all machines: RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
This might be a PyTorch thing; I can look into it more, probably early next week. Unsure about gloo as well.
mpi won't work because it requires SSH across workers, which is not something supported by default in Hadoop distributions.
nccl and gloo should work, though, at a glance. We use TensorFlow, so not much insight there, but anything not using MPI should work.