gpt-2
gpt-2 copied to clipboard
How to ACTUALLY train 345M on Multiple GPU using train-horovod.py?
What am I doing wrong? Is it impossible to train the 345M model on multiple GPUs? Or is my GPUs are not enough? If it's the case, what GPU size and how many GPUs would work?
Is this the right process?
- I am using an
ml.p3.8xlarge
instance on AWS with 4x v100 GPUs (16 GB each). I am trying to run train-horovod.py to train the 345M model on these 4 GPUs. I am running this command -
mpirun -np 4 -H localhost:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x PYTHONPATH=src -mca pml ob1 -mca btl ^openib python /home/ec2-user/SageMaker/gpt-2/train-horovod.py --dataset /home/ec2-user/SageMaker/gpt-2/src/Dataset/data_encoded.npz --model_name /home/ec2-user/SageMaker/gpt-2/src/models/345M --batch_size 1
I am using batch_size == 1
. Still, I am getting this error repeatedly.
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[1,1023,50257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node strided_slice_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Mean/_5215]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[1,1023,50257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node strided_slice_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
...
...
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[1,1023,50257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node strided_slice_1 (defined at /home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
........
........
- I also tried installing CUDNN as mentioned here in issue #8 using this command -
conda install -c anaconda cudnn --yes
- After running the
nvidia-smi
command, this is the state of the GPU just before facing the error.
Sat Jun 20 09:48:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1B.0 Off | 0 |
| N/A 47C P0 65W / 300W | 15602MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 49C P0 66W / 300W | 15626MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:00:1D.0 Off | 0 |
| N/A 51C P0 72W / 300W | 15626MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 49C P0 72W / 300W | 15626MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 115021 C python 15583MiB |
| 1 115022 C python 15583MiB |
| 2 115023 C python 15583MiB |
| 3 115024 C python 15583MiB |
+-----------------------------------------------------------------------------+
If anyone has solved this problem or faced it and shares his/her experience, it would help a great deal. Thanks.