gpt-2-output-dataset

ERROR train.py: Default process group is not initialized

Open loretoparisi opened this issue 5 years ago • 5 comments

I get this error when training on a single GPU; it is raised when the distributed() function is called to disable tqdm.

To avoid this I have simply wrapped distributed() like this:

import torch.distributed as dist

def distributed():
    # Report True only when a default process group has actually been
    # initialized; on a single GPU without init_process_group this is False.
    try:
        return dist.is_available() and dist.is_initialized()
    except Exception:
        return False
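For context, a minimal sketch of how such a helper can gate tqdm (the progress() wrapper below is illustrative only; the actual wiring in train.py may differ):

import torch.distributed as dist
from tqdm import tqdm

def progress(iterable):
    # Illustrative: show the bar only on rank 0 when a process group exists;
    # on a single GPU distributed() returns False and the bar always shows.
    disable = distributed() and dist.get_rank() != 0
    return tqdm(iterable, disable=disable)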

loretoparisi avatar Nov 20 '19 14:11 loretoparisi

[UPDATE] Even with the fix above in distributed(), I'm still getting the same error, because after evaluation torch.distributed is called again in

def _all_reduce_dict(d, device):
    # wrap in tensor and use reduce to gpu0 tensor
    output_d = {}
    for (key, value) in sorted(d.items()):
        tensor_input = torch.tensor([[value]]).to(device)
        torch.distributed.all_reduce(tensor_input)
        output_d[key] = tensor_input.item()
    return output_d

so the call to torch.distributed.all_reduce(tensor_input) fails. I have changed it like this:

def _all_reduce_dict(d, device):
    # wrap in tensor and use reduce to gpu0 tensor
    output_d = {}
    for (key, value) in sorted(d.items()):
        tensor_input = torch.tensor([[value]]).to(device)
        if distributed():
            torch.distributed.all_reduce(tensor_input)
        output_d[key] = tensor_input.item()
    return output_d
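A minimal single-GPU sanity check of the patched function (the metrics dict below is made up, not from train.py); with no process group initialized, the values come back unchanged apart from the tensor round-trip:

import torch

metrics = {"eval/loss": 0.42, "eval/accuracy": 0.91}    # illustrative values
device = "cuda" if torch.cuda.is_available() else "cpu"
print(_all_reduce_dict(metrics, device))                # same numbers back on one GPU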

loretoparisi avatar Nov 20 '19 17:11 loretoparisi

Can you paste the output of the following command, to check your system information?

curl https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py | python -

I suspect two possibilities:

  • Your NCCL installation is incomplete: try this to (re)install it
  • You're on Windows: we have no plans for Windows support at this point.

jongwook avatar Nov 20 '19 18:11 jongwook

@jongwook thank you very much! I have run it with both python and python3:

ubuntu@deepblue:~$ curl https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py | python -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12461  100 12461    0     0  27635      0 --:--:-- --:--:-- --:--:-- 27691
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: Could not collect

Python version: 2.7
Is CUDA available: N/A
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: GeForce GTX 1080
GPU 1: GeForce GTX 1080

Nvidia driver version: 418.87.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.0.5

Versions of relevant libraries:
[pip] numpy==1.15.4
[conda] blas                      1.0                         mkl  
[conda] mkl                       2019.3                      199  
[conda] mkl-service               1.1.2            py37he904b0f_5  
[conda] mkl_fft                   1.0.10           py37ha843d7b_0  
[conda] mkl_random                1.0.2            py37hd81dba3_0
ubuntu@deepblue:~$ curl https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py | python3 -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12461  100 12461    0     0  56586      0 --:--:-- --:--:-- --:--:-- 56384
Collecting environment information...
PyTorch version: 1.3.1
Is debug build: No
CUDA used to build PyTorch: 10.1.243

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: Could not collect

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: GeForce GTX 1080
GPU 1: GeForce GTX 1080

Nvidia driver version: 418.87.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.0.5

Versions of relevant libraries:
[pip] numpy==1.15.4
[conda] blas                      1.0                         mkl  
[conda] mkl                       2019.3                      199  
[conda] mkl-service               1.1.2            py37he904b0f_5  
[conda] mkl_fft                   1.0.10           py37ha843d7b_0  
[conda] mkl_random                1.0.2            py37hd81dba3_0

loretoparisi avatar Nov 21 '19 08:11 loretoparisi

I still suspect the NCCL install is the culprit; I just realized that the script doesn't check the NCCL version, which can be checked with torch.cuda.nccl.version().
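For example, a quick check (assuming the PyTorch 1.3.1 build reported above):

import torch
import torch.distributed as dist

print(torch.cuda.nccl.version())    # NCCL version bundled with this PyTorch build
print(dist.is_nccl_available())     # whether the NCCL backend is usable at all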

Can you run the experiments fine with your proposed changes? I could incorporate them into this repo at some point, but this repo (like most other OpenAI repos) is in archive status, and updating it is not a priority.

jongwook avatar Nov 22 '19 18:11 jongwook

@jongwook yes, I can make it work with the changes I have made so far. I will investigate NCCL further, by the way. Thanks.

loretoparisi avatar Nov 25 '19 10:11 loretoparisi