ColossalAI
[BUG]: "RuntimeError: CUDA error: device-side assert triggered" on changing SEQ_LEN in Gemini
Describe the bug
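In the Gemini GPT example (train_gpt_demo.py), I changed SEQ_LEN from 1024 to 1600 and switched the model to gpt2_xl. The forward pass in train_step() then fails with the error log below.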
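For readers triaging the log: the assertion `srcIndex < srcSelectDimSize` is raised by an index-select kernel, which is what an embedding lookup typically runs on CUDA. One possible explanation, which is an assumption and not confirmed by anything in this report, is that an embedding table (for example the GPT-2 position embedding) still has 1024 rows while the input now spans 1600 positions. The pure-Python sketch below only mimics that bounds check; `find_out_of_range` and `N_POSITIONS` are hypothetical names used for illustration and are not part of ColossalAI or transformers.

```python
def find_out_of_range(indices, table_size):
    """Return the indices that would violate `srcIndex < srcSelectDimSize`."""
    return [i for i in indices if not 0 <= i < table_size]


# Assumption: the lookup table kept its old size after SEQ_LEN was raised.
N_POSITIONS = 1024   # hypothetical table size, not confirmed by the report
SEQ_LEN = 1600       # new sequence length, taken from the report

position_ids = range(SEQ_LEN)   # positions 0 .. SEQ_LEN - 1
bad = find_out_of_range(position_ids, N_POSITIONS)

print(f"{len(bad)} out-of-range positions, first = {bad[0]}, last = {bad[-1]}")
```

Under these assumed sizes the sketch reports 576 offending positions, from 1024 to 1599. If the same mismatch exists in the real model, sizing the table to at least SEQ_LEN should make the check pass. As the log itself suggests, rerunning with CUDA_LAUNCH_BLOCKING=1 should make the traceback point at the operation that actually failed.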
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [104,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [105,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [107,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [108,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [109,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [110,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [111,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [112,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [113,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [114,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [115,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [116,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [117,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [118,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [119,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [120,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [121,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [122,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [123,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [38,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 353, in <module>
main()
File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 343, in main
train_step()
File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 303, in train_step
outputs = model(input_ids, attn_mask)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 279, in forward
outputs = self.module(*args, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/commons/model_zoo.py", line 29, in forward
return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1043, in forward
transformer_outputs = self.transformer(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 877, in forward
outputs = torch.utils.checkpoint.checkpoint(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 235, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 96, in forward
outputs = run_function(*args)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 873, in custom_forward
return module(*inputs, use_cache, output_attentions)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 387, in forward
hidden_states = self.ln_1(hidden_states)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 189, in forward
return F.layer_norm(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/functional.py", line 2500, in layer_norm
return handle_torch_function(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
result = torch_func_method(public_api, types, args, kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/tensor/colo_parameter.py", line 85, in __torch_function__
new_args = ColoParamOpHookManager.pre_op(params, *args, *kwargs.values())
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/tensor/param_op_hook.py", line 84, in pre_op
ColoParamOpHookManager._trigger_pre_forward(params)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/tensor/param_op_hook.py", line 65, in _trigger_pre_forward
hook.pre_forward(params)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/zero/utils/gemini_hook.py", line 47, in pre_forward
self.pre_op(params)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/zero/utils/gemini_hook.py", line 32, in pre_op
self._gemini_manager.sample_overall_data()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/gemini/gemini_mgr.py", line 144, in sample_overall_data
self._mem_stats_collector.sample_overall_data()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/gemini/memory_tracer/memstats_collector.py", line 87, in sample_overall_data
cuda_overall = self._mem_monitor.finish()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/gemini/memory_tracer/memory_monitor.py", line 143, in finish
torch.cuda.synchronize()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/cuda/__init__.py", line 496, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [104,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [105,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [107,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [108,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [109,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [110,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [111,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [112,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [113,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [114,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [115,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [116,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [117,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [118,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [119,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [120,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [121,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [122,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [123,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [100,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 353, in <module>
main()
File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 343, in main
train_step()
File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 303, in train_step
outputs = model(input_ids, attn_mask)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 279, in forward
outputs = self.module(*args, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/commons/model_zoo.py", line 29, in forward
return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1043, in forward
transformer_outputs = self.transformer(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 877, in forward
outputs = torch.utils.checkpoint.checkpoint(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 235, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 96, in forward
outputs = run_function(*args)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 873, in custom_forward
return module(*inputs, use_cache, output_attentions)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 387, in forward
hidden_states = self.ln_1(hidden_states)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 189, in forward
return F.layer_norm(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/functional.py", line 2500, in layer_norm
return handle_torch_function(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
result = torch_func_method(public_api, types, args, kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/tensor/colo_parameter.py", line 85, in __torch_function__
new_args = ColoParamOpHookManager.pre_op(params, *args, *kwargs.values())
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/tensor/param_op_hook.py", line 84, in pre_op
ColoParamOpHookManager._trigger_pre_forward(params)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/tensor/param_op_hook.py", line 65, in _trigger_pre_forward
hook.pre_forward(params)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/zero/utils/gemini_hook.py", line 47, in pre_forward
self.pre_op(params)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/zero/utils/gemini_hook.py", line 32, in pre_op
self._gemini_manager.sample_overall_data()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/gemini/gemini_mgr.py", line 144, in sample_overall_data
self._mem_stats_collector.sample_overall_data()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/gemini/memory_tracer/memstats_collector.py", line 87, in sample_overall_data
cuda_overall = self._mem_monitor.finish()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/gemini/memory_tracer/memory_monitor.py", line 143, in finish
torch.cuda.synchronize()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/cuda/__init__.py", line 496, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [70,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [71,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [72,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [73,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [74,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [75,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [76,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [77,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [78,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [79,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [80,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [81,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [82,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [83,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [84,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [85,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [86,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [87,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [56,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 353, in <module>
main()
File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 343, in main
train_step()
File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 303, in train_step
outputs = model(input_ids, attn_mask)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 279, in forward
outputs = self.module(*args, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/commons/model_zoo.py", line 29, in forward
return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1043, in forward
transformer_outputs = self.transformer(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 877, in forward
outputs = torch.utils.checkpoint.checkpoint(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 235, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 96, in forward
outputs = run_function(*args)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 873, in custom_forward
return module(*inputs, use_cache, output_attentions)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [104,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [105,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [107,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [108,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [109,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [110,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [111,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [112,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [113,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [114,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [115,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [116,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [117,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [118,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [119,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [120,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [121,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 387, in forward
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [122,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [123,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [82,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
hidden_states = self.ln_1(hidden_states)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 189, in forward
return F.layer_norm(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/functional.py", line 2500, in layer_norm
return handle_torch_function(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
result = torch_func_method(public_api, types, args, kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/tensor/colo_parameter.py", line 85, in __torch_function__
new_args = ColoParamOpHookManager.pre_op(params, *args, *kwargs.values())
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/tensor/param_op_hook.py", line 84, in pre_op
ColoParamOpHookManager._trigger_pre_forward(params)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/tensor/param_op_hook.py", line 65, in _trigger_pre_forward
hook.pre_forward(params)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/zero/utils/gemini_hook.py", line 47, in pre_forward
self.pre_op(params)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/zero/utils/gemini_hook.py", line 32, in pre_op
self._gemini_manager.sample_overall_data()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/gemini/gemini_mgr.py", line 144, in sample_overall_data
self._mem_stats_collector.sample_overall_data()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/gemini/memory_tracer/memstats_collector.py", line 87, in sample_overall_data
cuda_overall = self._mem_monitor.finish()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/gemini/memory_tracer/memory_monitor.py", line 143, in finish
torch.cuda.synchronize()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/cuda/__init__.py", line 496, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 353, in <module>
main()
File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 343, in main
train_step()
File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 303, in train_step
outputs = model(input_ids, attn_mask)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 279, in forward
outputs = self.module(*args, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/commons/model_zoo.py", line 29, in forward
return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1043, in forward
transformer_outputs = self.transformer(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 877, in forward
outputs = torch.utils.checkpoint.checkpoint(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 235, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 96, in forward
outputs = run_function(*args)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 873, in custom_forward
return module(*inputs, use_cache, output_attentions)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 387, in forward
hidden_states = self.ln_1(hidden_states)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 189, in forward
return F.layer_norm(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/functional.py", line 2500, in layer_norm
return handle_torch_function(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
result = torch_func_method(public_api, types, args, kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/tensor/colo_parameter.py", line 85, in __torch_function__
new_args = ColoParamOpHookManager.pre_op(params, *args, *kwargs.values())
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/tensor/param_op_hook.py", line 84, in pre_op
ColoParamOpHookManager._trigger_pre_forward(params)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/tensor/param_op_hook.py", line 65, in _trigger_pre_forward
hook.pre_forward(params)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/zero/utils/gemini_hook.py", line 47, in pre_forward
self.pre_op(params)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/zero/utils/gemini_hook.py", line 32, in pre_op
self._gemini_manager.sample_overall_data()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/gemini/gemini_mgr.py", line 144, in sample_overall_data
self._mem_stats_collector.sample_overall_data()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/gemini/memory_tracer/memstats_collector.py", line 87, in sample_overall_data
cuda_overall = self._mem_monitor.finish()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/colossalai/gemini/memory_tracer/memory_monitor.py", line 143, in finish
torch.cuda.synchronize()
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/cuda/__init__.py", line 496, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/cuda/CUDAEvent.h:91 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa743002497 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x13c (0x7fa77cf03d8c in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fa77cf05d68 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x221 (0x7fa77cf072f1 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: <unknown function> + 0xcda93 (0x7fa7850e8a93 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fa7bab8d609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fa7ba94c133 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/cuda/CUDAEvent.h:91 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5cea2b6497 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x13c (0x7f5d241b7d8c in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f5d241b9d68 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x221 (0x7f5d241bb2f1 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: <unknown function> + 0xcda93 (0x7f5d2c39ca93 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f5d61e41609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f5d61c00133 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/cuda/CUDAEvent.h:91 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6604f1f497 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x13c (0x7f663ee20d8c in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f663ee22d68 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x221 (0x7f663ee242f1 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: <unknown function> + 0xcda93 (0x7f6647005a93 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f667caaa609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f667c869133 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/cuda/CUDAEvent.h:91 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd441753497 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x13c (0x7fd47b654d8c in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fd47b656d68 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x221 (0x7fd47b6582f1 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: <unknown function> + 0xcda93 (0x7fd483839a93 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fd4b92de609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fd4b909d133 in /lib/x86_64-linux-gnu/libc.so.6)
Environment
- export DISTPLAN=CAI_Gemini
- DISTPLAN=CAI_Gemini
- export GPUNUM=4
- GPUNUM=4
- export TPDEGREE=1
- TPDEGREE=1
- export PLACEMENT=auto
- PLACEMENT=auto
- export USE_SHARD_INIT=True
- USE_SHARD_INIT=True
- export BATCH_SIZE=32
- BATCH_SIZE=32
- export MODEL_TYPE=gpt2_xl
- MODEL_TYPE=gpt2_xl
- export TRAIN_STEP=10
- TRAIN_STEP=10
- '[' True = True ']'
- USE_SHARD_INIT=--shardinit
- mkdir -p gemini_logs
- torchrun --standalone --nproc_per_node=4 ./train_gpt_demo.py --tp_degree=1 --model_type=gpt2_xl --batch_size=32 --placement=auto --shardinit --distplan=CAI_Gemini --train_step=10
- tee ./gemini_logs/gpt2_xl_CAI_Gemini_gpu_4_bs_32_tp_1_auto.log
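The error message in the log suggests passing CUDA_LAUNCH_BLOCKING=1 so that the failing kernel is reported at the call that launched it rather than at a later synchronize. A minimal sketch of re-running the same command with that variable set follows. The helper names `debug_env` and `rerun_blocking` are hypothetical and not part of ColossalAI; the command arguments are copied from the trace above.

```python
import os
import subprocess

# torchrun command copied from the shell trace in this report.
CMD = [
    "torchrun", "--standalone", "--nproc_per_node=4", "./train_gpt_demo.py",
    "--tp_degree=1", "--model_type=gpt2_xl", "--batch_size=32",
    "--placement=auto", "--shardinit", "--distplan=CAI_Gemini",
    "--train_step=10",
]


def debug_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_blocking():
    """Re-run the training command with CUDA_LAUNCH_BLOCKING=1 (hypothetical helper)."""
    return subprocess.run(CMD, env=debug_env(), check=False)
```

Calling `rerun_blocking()` from the example directory should produce a traceback that points at the operation that actually triggered the assert, at the cost of slower execution.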
Add another argument, max_seq_len=1600, to line. It will then train normally.
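A likely reading of why this helps: the assertion `srcIndex < srcSelectDimSize` fires when an index lookup goes past the end of a table. GPT-2 uses a learned position embedding with one row per position, and the Hugging Face GPT2Config default is n_positions=1024, so a sequence of 1600 tokens asks for rows 1024..1599 that do not exist. Presumably max_seq_len is what sizes that table in the example's model builder; that mapping is an assumption here. The sketch below reproduces the check in plain Python; `check_sequence_fits` is a hypothetical helper, not a ColossalAI or transformers function.

```python
DEFAULT_N_POSITIONS = 1024  # default value of transformers.GPT2Config.n_positions


def check_sequence_fits(seq_len: int, n_positions: int) -> None:
    """Raise a readable error if position ids 0..seq_len-1 overrun the table."""
    if seq_len > n_positions:
        raise ValueError(
            f"seq_len={seq_len} needs position ids up to {seq_len - 1}, "
            f"but the position table only has {n_positions} rows"
        )


# Original setting: 1024 tokens fit a 1024-row table.
check_sequence_fits(1024, DEFAULT_N_POSITIONS)

# Setting from this report: 1600 tokens overrun the default table.
try:
    check_sequence_fits(1600, DEFAULT_N_POSITIONS)
except ValueError as err:
    print(err)

# After enlarging the table to 1600 rows, the same sequence fits.
check_sequence_fits(1600, 1600)
```

Running a check like this on the CPU before the first forward pass turns an opaque device-side assert into an ordinary Python exception.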
Yes @JThh, it works. Thank you.
Glad to hear it was resolved. Thanks.
Same problem when running with bloom. Adding another argument max_seq_len=1600 does not work. Please help me, thanks.