llama2.c

baby llama2: training was working a moment ago, then suddenly reported an error

Open musellama opened this issue 2 years ago • 3 comments

Overriding: compile = False
Overriding: eval_iters = 1
Overriding: batch_size = 1
tokens per iteration will be: 1,024
breaks down as: 4 grad accum steps * 1 processes * 1 batch size * 256 max seq len
Initializing a new model from scratch
num decayed parameter tensors: 43, with 15,187,968 parameters
num non-decayed parameter tensors: 13, with 3,744 parameters
using fused AdamW: True
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [65,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [66,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [67,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [68,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [69,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [70,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [71,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [72,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [73,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [74,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [75,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [76,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [77,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [78,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [79,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [80,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [81,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [82,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [83,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [84,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [85,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [86,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [87,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [88,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [89,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [90,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [91,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [92,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [93,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [94,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [95,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [96,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [97,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [98,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [99,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [100,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [101,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [102,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [103,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [335,0,0], thread: [104,0,0] Assertion srcIndex < srcSelectDimSize failed.
Traceback (most recent call last):
  File "/home/maguoheng/anaconda3/envs/llama2/lib/python3.9/runpy.py", line 188, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/home/maguoheng/anaconda3/envs/llama2/lib/python3.9/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/home/maguoheng/llama2.c/train.py", line 252, in <module>
    losses = estimate_loss()
  File "/home/maguoheng/anaconda3/envs/llama2/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/maguoheng/llama2.c/train.py", line 211, in estimate_loss
    logits, loss = model(X, Y)
  File "/home/maguoheng/anaconda3/envs/llama2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/maguoheng/llama2.c/model.py", line 229, in forward
    h = layer(h, freqs_cis)
  File "/home/maguoheng/anaconda3/envs/llama2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/maguoheng/llama2.c/model.py", line 180, in forward
    h = x + self.attention.forward(self.attention_norm(x), freqs_cis)
  File "/home/maguoheng/llama2.c/model.py", line 111, in forward
    xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
  File "/home/maguoheng/anaconda3/envs/llama2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/maguoheng/anaconda3/envs/llama2/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

It was working a moment ago and then suddenly reported this error. Environment: Python 3.9, CUDA 11.7 (cuda117), torch 2.0.1.

musellama • Jul 27 '23 18:07
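
The repeated `srcIndex < srcSelectDimSize` assertion in the log is what PyTorch's CUDA index-select kernel reports when an index is not smaller than the size of the dimension being indexed. In a language model, this often points to a token id that is out of range for the embedding table, and the later cuBLAS error may just be a follow-on failure after the assertion fired. Nothing in this thread confirms that as the cause here, so the following is only a minimal sketch of a CPU-side check. The function name `find_out_of_range_ids` and the sample values are hypothetical, not taken from llama2.c.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_ids(token_ids: Iterable[int], vocab_size: int) -> List[Tuple[int, int]]:
    """Return (position, token_id) pairs that cannot index a table with vocab_size rows."""
    return [
        (pos, tok)
        for pos, tok in enumerate(token_ids)
        if not 0 <= tok < vocab_size  # valid ids are 0 .. vocab_size - 1
    ]


# Hypothetical example values, not taken from the issue.
sample_ids = [1, 17, 4095, 4096, 31999]
bad = find_out_of_range_ids(sample_ids, vocab_size=4096)
print(bad)  # [(3, 4096), (4, 31999)]
```

If such a check finds offenders, the pretokenized data and the model's configured vocabulary size probably disagree. Rerunning with the environment variable `CUDA_LAUNCH_BLOCKING=1`, or on CPU, usually makes the failing call easier to locate.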

Most likely the driver is the cause. I had a similar error, but with cuDNN. Installing the right version resolved it.

nobody4t • Jul 28 '23 00:07

> Most likely the driver is the cause. I had a similar error, but with cuDNN. Installing the right version resolved it.

My CUDA version is 12.0. Can you tell me your CUDA version?

musellama • Jul 28 '23 14:07
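
When comparing versions as in the exchange above, it can help to print what the Python environment itself reports. The sketch below is one way to do that, assuming PyTorch is importable; `describe_env` is a hypothetical helper name, and it falls back to empty fields when torch is not installed.

```python
import importlib
import platform
from typing import Dict, Optional


def describe_env() -> Dict[str, Optional[str]]:
    """Collect the version strings relevant to a CUDA/cuDNN mismatch."""
    info: Dict[str, Optional[str]] = {
        "python": platform.python_version(),
        "torch": None,
        "torch_cuda_build": None,
        "cudnn": None,
        "gpu_available": None,
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return info  # torch is not installed in this environment
    info["torch"] = str(torch.__version__)
    # CUDA version PyTorch was built against (None on CPU-only builds).
    info["torch_cuda_build"] = torch.version.cuda
    cudnn_version = torch.backends.cudnn.version()  # int, or None if unavailable
    info["cudnn"] = None if cudnn_version is None else str(cudnn_version)
    info["gpu_available"] = str(torch.cuda.is_available())
    return info


for key, value in describe_env().items():
    print(f"{key}: {value}")
```

Note that the CUDA version shown by `nvidia-smi` is the highest version the installed driver supports, so it can differ from the version PyTorch was built with.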

Hi, have you resolved the issue?

mynewstart • Oct 02 '23 23:10