
Help needed: train_pretrain.py: RuntimeError: CUDA error: device kernel image is invalid

Open WingsLong opened this issue 6 days ago • 11 comments

Traceback (most recent call last):
  File "/data/aigc/model_train/minimind/train_pretrain.py", line 169, in <module>
    model, tokenizer = init_model(lm_config)
  File "/data/aigc/model_train/minimind/train_pretrain.py", line 100, in init_model
    model = MiniMindLM(lm_config).to(args.device)
  File "/data/programs/python310/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2960, in to
    return super().to(*args, **kwargs)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1175, in to
    return self._apply(convert)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 804, in _apply
    param_applied = fn(param)
  File "/data/programs/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1161, in convert
    return t.to(
RuntimeError: CUDA error: device kernel image is invalid
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

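Could someone please help? I am running the training script with python train_pretrain.py and have spent the whole afternoon on it without figuring out what is wrong.

OS: CentOS 7
CUDA: 11.8
PyTorch: 2.3.1
GPU: Tesla T4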
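One thing that may be worth checking is whether the installed PyTorch build actually ships kernels for this GPU, since this kind of error can appear when it does not. The sketch below is only a diagnostic aid under that assumption, not a confirmed cause. The helper names `arch_supported` and `report` are hypothetical; the `torch` calls it uses (`torch.version.cuda`, `torch.cuda.get_device_capability`, `torch.cuda.get_arch_list`) are standard PyTorch APIs. The exact-match check is a simplification: a build can sometimes still run on a device through binary or PTX compatibility even without an exact `sm_XY` entry.

```python
def arch_supported(capability, arch_list):
    """Return True if arch_list has an exact sm_XY entry for the device.

    capability: (major, minor) tuple, e.g. (7, 5) for a Tesla T4.
    arch_list:  list of strings such as ["sm_70", "sm_75", "sm_80"].
    Simplification: ignores binary/PTX forward compatibility.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def report():
    """Print what the installed PyTorch build reports about CUDA and the GPU."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this interpreter")
        return

    print("torch version:      ", torch.__version__)
    print("built against CUDA: ", torch.version.cuda)

    if not torch.cuda.is_available():
        print("CUDA is not available to this PyTorch build")
        return

    cap = torch.cuda.get_device_capability(0)   # (major, minor) of GPU 0
    archs = torch.cuda.get_arch_list()          # architectures compiled into the build
    print("device:             ", torch.cuda.get_device_name(0))
    print("compute capability: ", cap)
    print("compiled arch list: ", archs)
    print("exact arch match:   ", arch_supported(cap, archs))


if __name__ == "__main__":
    report()
```

If `exact arch match` prints `False`, or the CUDA version the build reports differs from what is expected, the PyTorch wheel and the driver/toolkit combination would be the first thing to compare.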

WingsLong, Feb 20 '25 11:02