Distributed broadcast fails with simple GPU tensor on Windows + GLOO
🐛 Describe the bug
OS: Windows 10. The environment was created using conda, CUDA 11.4 is installed system-wide, and torch 1.10.1 was installed with pip. Here are the repro steps:
Python 3.8.10 (default, May 19 2021, 13:12:57) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import os
>>> import torch.distributed as dist
>>> torch.__version__
'1.10.1+cu113'
>>> os.environ['MASTER_ADDR'] = '127.0.0.1'
>>> os.environ['MASTER_PORT'] = '27501'
>>> os.environ['LOCAL_RANK'] = '0'
>>> os.environ['RANK'] = '0'
>>> os.environ['WORLD_SIZE'] = '1'
>>> dist.init_process_group(backend='gloo')
>>> g = dist.new_group([0])
>>> m = torch.nn.Linear(16, 16)
>>> m.cuda()
Linear(in_features=16, out_features=16, bias=True)
>>> p = list(m.parameters())[0]
>>> p.device
device(type='cuda', index=0)
>>> dist.broadcast(p, 0, group=g)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "___\miniconda3\envs\torch\lib\site-packages\torch\distributed\distributed_c10d.py", line 1167, in broadcast
work.wait()
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
The error does not happen with CPU tensors (e.g. if you skip the m.cuda() call).
If this is not a bug, is there something special to do on Windows for Gloo's broadcast to work with GPU tensors?
The same issue reproduces on a similar environment with PyTorch 1.8.2+cu111 and on Windows 11.
Versions
PyTorch version: 1.10.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro
GCC version: Could not collect
Clang version: 13.0.0 (https://github.com/llvm/llvm-project.git d7b669b3a30345cfcdb2fde2af6f48aa4b94845d)
CMake version: version 3.21.21080301-MSVC_2
Libc version: N/A

Python version: 3.8.10 (default, May 19 2021, 13:12:57) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19044-SP0
Is CUDA available: True
CUDA runtime version: 11.4.100
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 497.29
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\bin\cudnn_ops_train64_8.dll
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.2
[pip3] torch==1.10.1+cu113
[pip3] torch-tb-profiler==0.2.1
[pip3] torchaudio==0.10.1+cu113
[pip3] torchvision==0.11.2+cu113
[conda] libblas 3.9.0 11_win64_mkl conda-forge
[conda] libcblas 3.9.0 11_win64_mkl conda-forge
[conda] liblapack 3.9.0 11_win64_mkl conda-forge
[conda] mkl 2021.3.0 hb70f87d_564 conda-forge
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] numpy 1.21.0 pypi_0 pypi
[conda] torch 1.10.1+cu113 pypi_0 pypi
[conda] torch-tb-profiler 0.2.1 pypi_0 pypi
[conda] torchaudio 0.10.1+cu113 pypi_0 pypi
[conda] torchvision 0.11.2+cu113 pypi_0 pypi
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang
I believe you can add a p.detach() before the broadcast call to resolve this issue. However, the example above looks like a toy example, since with a world size of 1 you don't really need a broadcast at all. I was wondering if you could share a minimal example of the actual problem you are trying to solve here?
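For reference, a minimal sketch of that suggestion applied to the repro above (same single-rank Gloo setup and the same arbitrary port; this is just an illustration, not the recommended API). detach() only strips autograd tracking; the broadcast still writes into the parameter's storage:

import os
import torch
import torch.distributed as dist

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '27501'
os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '1'

dist.init_process_group(backend='gloo')
g = dist.new_group([0])

m = torch.nn.Linear(16, 16).cuda()
p = next(m.parameters())

# p.detach() shares storage with p but has requires_grad=False, so the
# in-place copy done by the Gloo backend no longer trips the leaf check.
dist.broadcast(p.detach(), 0, group=g)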
I am actually debugging DeepSpeed unit tests on Windows, but I suspect a similar problem would happen on Linux if you explicitly set the backend to gloo. Specifically, this test: test_zero_context.py::test_ext_param_getattr
Specifically, this code fails: https://github.com/microsoft/DeepSpeed/blob/3a4cb042433a2e8351887922f8362d3752c52a42/deepspeed/runtime/engine.py#L969
As far as I can see it tries to broadcast pretty much all parameters of the model without explicitly detaching them from the graph, and it seems to work with the NCCL backend.
On a side note, why would it be required to detach the tensor before broadcasting it in this case? The broadcast operation is not really an in-place operation, is it?
> On a side note, why would it be required to detach the tensor before broadcasting it in this case? The broadcast operation is not really an in-place operation, is it?
The broadcast operation is an in-place operation for ranks receiving the data: the tensor you pass to broadcast is modified in-place with the data received from the src rank.
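As a concrete illustration of those semantics (not specific to this issue), here is a minimal two-process CPU sketch with Gloo; the port is arbitrary. It shows that the receiving rank's tensor object itself is overwritten in place:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    dist.init_process_group(
        backend='gloo',
        init_method='tcp://127.0.0.1:29501',
        rank=rank,
        world_size=world_size,
    )
    # rank 0 starts with ones, rank 1 with zeros
    t = torch.ones(4) if rank == 0 else torch.zeros(4)
    ptr = t.data_ptr()
    dist.broadcast(t, src=0)
    # On rank 1 the very same tensor (same storage) now holds rank 0's data,
    # i.e. the receive is an in-place write into the tensor that was passed in.
    assert t.data_ptr() == ptr and torch.equal(t, torch.ones(4))
    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(worker, args=(2,), nprocs=2)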
> As far as I can see it tries to broadcast pretty much all parameters of the model without explicitly detaching them from the graph, and it seems to work with the NCCL backend.
This mostly comes from implementation differences between NCCL and Gloo: NCCL fills the underlying tensor buffer directly, whereas Gloo uses something like Tensor.copy_ to do the same.
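A tiny illustration of the check the Gloo path runs into (this is just the general autograd rule, not the actual Gloo code):

import torch

p = torch.nn.Linear(16, 16).weight  # a leaf tensor with requires_grad=True

try:
    # roughly what happens when received data is written back with copy_
    # into the tensor you passed to broadcast
    p.copy_(torch.zeros_like(p))
except RuntimeError as e:
    print(e)  # a leaf Variable that requires grad is being used in an in-place operation.

# Either of these sidesteps the check, which is why detach() (or no_grad) helps:
with torch.no_grad():
    p.copy_(torch.zeros_like(p))
p.detach().copy_(torch.ones_like(p))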
Can you explain a little more? Here's a paragraph from the detach documentation:
> In-place modifications on either of them will be seen, and may trigger errors in correctness checks.

Moreover:

> For sparse tensors: In-place indices / values changes (such as zero_ / copy_ / add_) to the returned tensor will not update the original tensor anymore, and will instead trigger an error.
So sparse tensors that constitute parameters of a model can't be broadcast using Gloo? And broadcasting a regular tensor is not guaranteed to work because of the first quote?
Did you end up figuring this out by any chance? I'm trying to use Gloo, as NCCL isn't available on Windows.
import torch
import torch.distributed as dist

def sync_params(params) -> None:
    # Broadcast rank 0's parameters to all ranks without tripping the autograd in-place check.
    with torch.no_grad():
        for p in params:
            p_copy = p.detach()        # shares storage with p, but requires_grad is False
            dist.broadcast(p_copy, 0)  # in-place write of rank 0's data
            p.copy_(p_copy)
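This would typically be called on every rank right after init_process_group, e.g. sync_params(model.parameters()), so that every rank ends up with rank 0's weights. Since the detached view shares storage with p, the broadcast already writes into the parameter, and the final copy_ under no_grad keeps autograd out of the picture entirely.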
@thorinf It works nicely with 'gloo' for the inference case (I will try the training case soon). I just added one line, "tensor_copy = tensor", and broadcast tensor_copy (instead of tensor) at line 196 of ... \anaconda3\envs\ ... \Lib\site-packages\deepspeed\comm\torch.py.
Note: my environment is Python 3.10, PyTorch 2.1, DeepSpeed 0.11.2, CUDA 11.8.