When training the 0.1T model with ZeRO-Infinity (NVMe), a CUDA illegal memory access error occurs at batch sizes of 32 and above.
RuntimeError: CUDA error: an illegal memory access was encountered
Unexpectedly, it runs without any error up to batch size 24, but from batch size 32 an NCCL-related illegal memory access error occurs. An OOM error would make more sense; instead the cause is unclear, and since small batches work fine, I don't know what to do.
System info: 4x A100 (80 GB) GPUs, 1 TB CPU DRAM, 11 TB SSD
Do you have any guesses or a good link for me to refer to?
I got the error below; it occurs at the beginning of training (forward pass):
RuntimeError: CUDA error: an illegal memory access was encountered
Traceback (most recent call last):
File "../pretrain_gpt2.py", line 133, in
pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/training.py", line 109, in pretrain
iteration = train(forward_step_func,
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/training.py", line 481, in train
loss_dict, skipped_iter = train_step(forward_step_func,
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/training.py", line 332, in train_step
loss, loss_reduced = forward_step_func(data_iterator, model)
File "../pretrain_gpt2.py", line 100, in forward_step
losses = model(tokens, position_ids, attention_mask, labels=labels)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1105, in forward
loss = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/model/gpt2_model.py", line 76, in forward
lm_output = self.language_model(input_ids,
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/model/language_model.py", line 330, in forward
transformer_output = self.transformer(embedding_output,
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/model/transformer.py", line 987, in forward
hidden_states = self.checkpointed_forward(hidden_states,
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/model/transformer.py", line 963, in checkpointed_forward
hidden_states = mpu.checkpoint(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 682, in checkpoint
CheckpointFunction.apply(function, all_outputs, *args)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 486, in forward
outputs = run_function(*inputs_cuda)
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/model/transformer.py", line 955, in custom_forward
x = layer(x, inputs[1])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/model/transformer.py", line 761, in forward
self.attention(layernorm_output,
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/model/transformer.py", line 413, in forward
output, bias = self.dense(context_layer)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 736, in _call_impl
result = hook(self, input)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1445, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1552, in pre_sub_module_forward_function
self.param_coordinator.fetch_sub_module(sub_module)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 451, in fetch_sub_module
self._all_gather(partitioned_params, async_op=True)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 528, in _all_gather
handles = partitioned_params[0].all_gather(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 566, in all_gather
return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 660, in _all_gather
handle = self._allgather_param(param,
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 845, in _allgather_param
torch.cuda.synchronize()
File "/opt/conda/lib/python3.8/site-packages/torch/cuda/init.py", line 380, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: ../torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 2.8.3
ncclUnhandledCudaError: Call to CUDA function failed.
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: ../torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 2.8.3
ncclUnhandledCudaError: Call to CUDA function failed.
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: ../torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 2.8.3
ncclUnhandledCudaError: Call to CUDA function failed.
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: ../torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 2.8.3
ncclUnhandledCudaError: Call to CUDA function failed.
Killing subprocess 25464
Killing subprocess 25465
Killing subprocess 25466
Killing subprocess 25467
I'm hitting a similar problem. With ZeRO-3, setting micro_batch_size > 12 triggers "CUDA error: an illegal memory access was encountered". However, GPU memory usage is only 60% (80 GB A100) at MBS=12.
@kiehls90, from your stack trace it seems the failure is occurring during allgather and is hence NCCL-related. My guess is that the memory allocation required by allgather is failing because of the high memory pressure of large batch sizes. My further guess is that the failure is occurring in C++ land, which is why you don't see the nice familiar OOM error message.
One way to test this hypothesis is to indirectly reduce the memory footprint of allgather by disabling parameter prefetching and caching. You can do this by setting the following ZeRO stage 3 configuration knobs in ds_config:
"stage3_max_reuse_distance": 0,
"stage3_prefetch_bucket_size": 0
@drcege We ran into the same problem with DeepSpeed stage 1 on 80GB A100s where batch size > 16 causes
site-packages/torch/cuda/random.py", line 31, in get_rng_state
return default_generator.get_state()
RuntimeError: CUDA error: an illegal memory access was encountered
Have you had any progress on this?
Also, as @tjruwase stated:
My further guess is that the failure is occurring in C++ land, which is why you don't see the nice familiar OOM error message.
We are not using stage 3; do you have any suggestions? Thanks!
@drcege, how did you measure the 60% GPU memory usage at MBS=12? You could estimate the expected memory usage by extrapolating from a smaller MBS.
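In case it is useful, a small sketch (not from the DeepSpeed docs) for measuring this with PyTorch's own allocator counters instead of nvidia-smi: the peak counter also catches transient allocations such as the flattened all-gather buffers, which a sampled nvidia-smi reading can miss.

import torch

def log_cuda_memory(tag):
    # PyTorch caching-allocator statistics for the current device, in GiB.
    gib = 2 ** 30
    print(f"[{tag}] allocated={torch.cuda.memory_allocated() / gib:.1f} GiB, "
          f"peak={torch.cuda.max_memory_allocated() / gib:.1f} GiB, "
          f"reserved={torch.cuda.memory_reserved() / gib:.1f} GiB")

# Example: log the peak after a few iterations at MBS=8 and again at MBS=12;
# (peak@12 - peak@8) / 4 approximates the per-sample activation cost, which you
# can extrapolate to MBS=16 or higher to see how close you would get to 80 GB.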
@barius, @drcege, since these memory issues occur with larger batch sizes, I believe they are due to the increased activation memory footprint. Unfortunately, ZeRO does not help with that memory problem. Rather, you would need to apply gradient checkpointing if possible.
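For anyone not already doing this (the stack trace above shows the original poster already runs DeepSpeed activation checkpointing), here is a minimal sketch of gradient/activation checkpointing with plain torch.utils.checkpoint; the block modules are placeholders for your own transformer layers.

import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    # Wraps a list of blocks so that activations are recomputed during the
    # backward pass instead of being stored during the forward pass.
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, hidden_states):
        for block in self.blocks:
            # checkpoint() reruns block's forward in backward to rebuild
            # the intermediate activations needed for gradients.
            hidden_states = checkpoint(block, hidden_states)
        return hidden_states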
@barius The feasible solution is simply to reduce the batch size.
As others have stated, this memory issue may indeed be a sign of running out of memory (OOM), but one that is triggered in C++ land.
Closing this as out of scope for ZeRO-Infinity.