When training the 0.1T model with ZeRO-Infinity (NVMe), a CUDA illegal memory access error occurs at batch sizes of 32 and above.
RuntimeError: CUDA error: an illegal memory access was encountered
Unexpectedly, it runs without any error up to batch size 24, but from batch size 32 an NCCL-related illegal memory access error occurs. An OOM error would make more sense; instead the cause is unclear, and since small batches work fine, I don't know what to do.
System info: 4x A100 (80 GB) GPUs, 1 TB CPU DRAM, 11 TB SSD
Do you have any guesses or a good link for me to refer to?
I got the error below; it occurs at the beginning of training (forward pass):
RuntimeError: CUDA error: an illegal memory access was encountered
Traceback (most recent call last):
File "../pretrain_gpt2.py", line 133, in
pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/training.py", line 109, in pretrain
iteration = train(forward_step_func,
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/training.py", line 481, in train
loss_dict, skipped_iter = train_step(forward_step_func,
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/training.py", line 332, in train_step
loss, loss_reduced = forward_step_func(data_iterator, model)
File "../pretrain_gpt2.py", line 100, in forward_step
losses = model(tokens, position_ids, attention_mask, labels=labels)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1105, in forward
loss = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/model/gpt2_model.py", line 76, in forward
lm_output = self.language_model(input_ids,
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/model/language_model.py", line 330, in forward
transformer_output = self.transformer(embedding_output,
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/model/transformer.py", line 987, in forward
hidden_states = self.checkpointed_forward(hidden_states,
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/model/transformer.py", line 963, in checkpointed_forward
hidden_states = mpu.checkpoint(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 682, in checkpoint
CheckpointFunction.apply(function, all_outputs, *args)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 486, in forward
outputs = run_function(*inputs_cuda)
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/model/transformer.py", line 955, in custom_forward
x = layer(x, inputs[1])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/model/transformer.py", line 761, in forward
self.attention(layernorm_output,
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/skchoi/DeepSpeedExamples/ZeRO3-skchoi/megatron/model/transformer.py", line 413, in forward
output, bias = self.dense(context_layer)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 736, in _call_impl
result = hook(self, input)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1445, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1552, in pre_sub_module_forward_function
self.param_coordinator.fetch_sub_module(sub_module)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 451, in fetch_sub_module
self._all_gather(partitioned_params, async_op=True)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 528, in _all_gather
handles = partitioned_params[0].all_gather(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 566, in all_gather
return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 660, in _all_gather
handle = self._allgather_param(param,
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 845, in _allgather_param
torch.cuda.synchronize()
File "/opt/conda/lib/python3.8/site-packages/torch/cuda/init.py", line 380, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: ../torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 2.8.3
ncclUnhandledCudaError: Call to CUDA function failed.
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: ../torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 2.8.3
ncclUnhandledCudaError: Call to CUDA function failed.
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: ../torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 2.8.3
ncclUnhandledCudaError: Call to CUDA function failed.
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: ../torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 2.8.3
ncclUnhandledCudaError: Call to CUDA function failed.
Killing subprocess 25464
Killing subprocess 25465
Killing subprocess 25466
Killing subprocess 25467
I'm hitting a similar problem. With ZeRO-3, setting micro_batch_size > 12 triggers "CUDA error: an illegal memory access was encountered". However, GPU memory usage is only 60% (80 GB A100) at MBS=12.
@kiehls90, from your stack trace it seems the failure is occurring during allgather and is hence NCCL-related. My guess is that the memory allocation required by allgather is failing because of the high memory pressure of large batch sizes. My further guess is that the failure is occurring in C++ land, which is why you don't see the nice familiar OOM error message.
One way to test this hypothesis is to indirectly reduce the memory footprint of allgather by disabling parameter prefetching and caching. You can do this by setting the following ZeRO stage 3 configuration knobs in ds_config:
"stage3_max_reuse_distance": 0,
"stage3_prefetch_bucket_size": 0
@drcege We ran into the same problem with DeepSpeed stage 1 on 80GB A100s where batch size > 16 causes
site-packages/torch/cuda/random.py", line 31, in get_rng_state
return default_generator.get_state()
RuntimeError: CUDA error: an illegal memory access was encountered
Have you had any progress on this?
Also, as @tjruwase stated:
My further guess is that the failure is occurring in C++ land, which is why you don't see the nice familiar OOM error message.
We are not using stage 3; do you have any suggestions? Thanks!
@drcege, how did you measure the 60% GPU memory usage at MBS=12? You could estimate the expected memory usage by extrapolating from a smaller MBS.
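In case it is useful, a small sketch (not from the DeepSpeed docs) for measuring this with PyTorch's own allocator counters instead of nvidia-smi: the peak counter also catches transient allocations such as the flattened all-gather buffers, which a sampled nvidia-smi reading can miss.

import torch

def log_cuda_memory(tag):
    # PyTorch caching-allocator statistics for the current device, in GiB.
    gib = 2 ** 30
    print(f"[{tag}] allocated={torch.cuda.memory_allocated() / gib:.1f} GiB, "
          f"peak={torch.cuda.max_memory_allocated() / gib:.1f} GiB, "
          f"reserved={torch.cuda.memory_reserved() / gib:.1f} GiB")

# Example: log the peak after a few iterations at MBS=8 and again at MBS=12;
# (peak@12 - peak@8) / 4 approximates the per-sample activation cost, which you
# can extrapolate to MBS=16 or higher to see how close you would get to 80 GB.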
@barius, @drcege, since these memory issues occur with larger batch sizes, I believe they are due to the increased activation memory footprint. Unfortunately, ZeRO does not help with that memory problem. Rather, you would need to apply gradient checkpointing if possible.
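For anyone not already doing this (the stack trace above shows the original poster already runs DeepSpeed activation checkpointing), here is a minimal sketch of gradient/activation checkpointing with plain torch.utils.checkpoint; the block modules are placeholders for your own transformer layers.

import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    # Wraps a list of blocks so that activations are recomputed during the
    # backward pass instead of being stored during the forward pass.
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, hidden_states):
        for block in self.blocks:
            # checkpoint() reruns block's forward in backward to rebuild
            # the intermediate activations needed for gradients.
            hidden_states = checkpoint(block, hidden_states)
        return hidden_states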
@barius The feasible solution is simply to reduce the batch size.
As others have stated, this memory issue may indeed be a sign of running out of memory (OOM), but one that is triggered in C++ land.
Closing this as out of scope for ZeRO-Infinity.