gpt-2-output-dataset
ERROR train.py: Default process group is not initialized
I get this error when training on a single GPU. It happens when the function distributed() is called to decide whether to disable tqdm.
To avoid it, I have simply wrapped distributed() like this:
import torch.distributed as dist

def distributed():
    try:
        # True only when the distributed package is built
        # and a process group has actually been initialized.
        return dist.is_available() and dist.is_initialized()
    except RuntimeError:
        return False
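For context, here is a minimal sketch of how such a guard can silence tqdm on worker processes (the helper name and call site are illustrative, not the repo's actual code):

from tqdm import tqdm
import torch.distributed as dist

def progress(iterable, **kwargs):
    # Hypothetical helper: show the bar only when not running distributed,
    # or on rank 0; all other ranks iterate silently.
    disable = distributed() and dist.get_rank() != 0
    return tqdm(iterable, disable=disable, **kwargs)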
[UPDATE]
Even with the fix above, I'm getting the same error, because after evaluation torch.distributed.all_reduce is called again in:
def _all_reduce_dict(d, device):
    # wrap in tensor and use reduce to gpu0 tensor
    output_d = {}
    for (key, value) in sorted(d.items()):
        tensor_input = torch.tensor([[value]]).to(device)
        torch.distributed.all_reduce(tensor_input)
        output_d[key] = tensor_input.item()
    return output_d
Here torch.distributed.all_reduce(tensor_input) fails when no process group is initialized, so I changed it to:
def _all_reduce_dict(d, device):
    # wrap in tensor and use reduce to gpu0 tensor
    output_d = {}
    for (key, value) in sorted(d.items()):
        tensor_input = torch.tensor([[value]]).to(device)
        # Only reduce across ranks when a process group actually exists;
        # on a single GPU the value passes through unchanged.
        if distributed():
            torch.distributed.all_reduce(tensor_input)
        output_d[key] = tensor_input.item()
    return output_d
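A quick usage example of the guarded version (the metric values are hypothetical; without an initialized process group the dict simply round-trips through the device):

import torch

metrics = {"lm_loss": 0.25, "accuracy": 0.90}  # hypothetical values
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

reduced = _all_reduce_dict(metrics, device)
# On a single, non-distributed GPU this prints the inputs unchanged;
# under an initialized process group each value would be summed across ranks.
print(reduced)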
Can you paste the output of the following command, to check your system information?
curl https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py | python -
I suspect two possibilities:
- Your NCCL installation is incomplete: try this to (re)install it (a quick sanity check is sketched after this list).
- You're on Windows: we have no plans for Windows support at this point.
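For what it's worth, a small single-process sanity check of the NCCL backend (my own sketch, not part of the repo; the address and port are arbitrary placeholders):

import os
import torch
import torch.distributed as dist

# Placeholder rendezvous settings for the default env:// init method.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# A one-process NCCL group; if the NCCL install is broken, this raises.
dist.init_process_group(backend="nccl", rank=0, world_size=1)
t = torch.ones(1, device="cuda")
dist.all_reduce(t)  # sum over a world of size 1 is a no-op
print(t.item())     # expected: 1.0
dist.destroy_process_group()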
@jongwook thank you very much!
I have run it with both python and python3:
ubuntu@deepblue:~$ curl https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py | python -
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12461 100 12461 0 0 27635 0 --:--:-- --:--:-- --:--:-- 27691
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: Could not collect
Python version: 2.7
Is CUDA available: N/A
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: GeForce GTX 1080
GPU 1: GeForce GTX 1080
Nvidia driver version: 418.87.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.0.5
Versions of relevant libraries:
[pip] numpy==1.15.4
[conda] blas 1.0 mkl
[conda] mkl 2019.3 199
[conda] mkl-service 1.1.2 py37he904b0f_5
[conda] mkl_fft 1.0.10 py37ha843d7b_0
[conda] mkl_random 1.0.2 py37hd81dba3_0
ubuntu@deepblue:~$ curl https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py | python3 -
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 12461 100 12461 0 0 56586 0 --:--:-- --:--:-- --:--:-- 56384
Collecting environment information...
PyTorch version: 1.3.1
Is debug build: No
CUDA used to build PyTorch: 10.1.243
OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: Could not collect
Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: GeForce GTX 1080
GPU 1: GeForce GTX 1080
Nvidia driver version: 418.87.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.0.5
Versions of relevant libraries:
[pip] numpy==1.15.4
[conda] blas 1.0 mkl
[conda] mkl 2019.3 199
[conda] mkl-service 1.1.2 py37he904b0f_5
[conda] mkl_fft 1.0.10 py37ha843d7b_0
[conda] mkl_random 1.0.2 py37hd81dba3_0
I still suspect the NCCL install is the culprit; I just realized that the script doesn't check the NCCL version, which can be checked with torch.cuda.nccl.version().
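For example (a hedged sketch; the version number in the comment is illustrative, not your actual output):

import torch

print(torch.cuda.is_available())         # should be True on your machine
print(torch.distributed.is_available())  # whether the distributed package was built
print(torch.cuda.nccl.version())         # e.g. 2408, i.e. NCCL 2.4.8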
Can you run the experiments fine with your proposed changes? I could incorporate them into this repo at some point, but this repo (like most other OpenAI repos) is in archive status, and updating it is not our priority.
@jongwook Yes, I can get it working with the changes I made so far. I will investigate NCCL further, by the way. Thanks.