Transformers-Tutorials
Error Training
Hi @NielsRogge, thanks for sharing the TrOCR training guide. I followed it step by step, but when training I get this error:
ex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [39,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [39,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [39,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [39,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
0%| | 0/67285 [00:05<?, ?it/s]
File "/home/tupk/anaconda3/envs/ocr/lib/python3.7/site-packages/transformers/models/trocr/modeling_trocr.py", line 144, in forward
self.weights = self.weights.to(self._float_tensor)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
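To follow the hint in the last line of the error, the variable can be set from Python before CUDA is initialized, so that the failing kernel is reported at the call that launched it. This is only a minimal sketch; placing it at the very top of the training script (before `torch` is imported) is my assumption about where it is safest to set.

```python
import os

# Set this before the first CUDA call (safest: before importing torch);
# setting it later may not affect a CUDA context that already exists.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch and build the model only after this point
```

The same effect can be had by exporting the variable in the shell before launching the script.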
I printed the input batch, and it looks fine to me:
[[0.7647, 0.7647, 0.7647, ..., 0.4510, 0.4510, 0.4510],
[0.7647, 0.7647, 0.7647, ..., 0.4510, 0.4510, 0.4510],
[0.7647, 0.7647, 0.7647, ..., 0.4510, 0.4510, 0.4510],
...,
[0.6235, 0.6235, 0.6235, ..., 0.5608, 0.5608, 0.5608],
[0.6235, 0.6235, 0.6235, ..., 0.5608, 0.5608, 0.5608],
[0.6235, 0.6235, 0.6235, ..., 0.5608, 0.5608, 0.5608]],
[[0.5451, 0.5451, 0.5451, ..., 0.2000, 0.2000, 0.2000],
[0.5451, 0.5451, 0.5451, ..., 0.2000, 0.2000, 0.2000],
[0.5451, 0.5451, 0.5451, ..., 0.2000, 0.2000, 0.2000],
...,
[0.3569, 0.3569, 0.3569, ..., 0.2941, 0.2941, 0.2941],
[0.3569, 0.3569, 0.3569, ..., 0.2941, 0.2941, 0.2941],
[0.3569, 0.3569, 0.3569, ..., 0.2941, 0.2941, 0.2941]]]],
device='cuda:0'), 'labels': tensor([[ 0, 53593, 5142, ..., -100, -100, -100],
[ 0, 51870, 1117, ..., -100, -100, -100],
[ 0, 1939, 38817, ..., -100, -100, -100],
...,
[ 0, 7221, 49581, ..., -100, -100, -100],
[ 0, 22980, 2870, ..., -100, -100, -100],
[ 0, 12894, 15165, ..., -100, -100, -100]], device='cuda:0')}
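One further check that could be run on this batch: the assertion `srcIndex < srcSelectDimSize` typically comes from an index lookup (such as an embedding) receiving an index outside the table, so it may be worth comparing every label id, ignoring the -100 padding, against the decoder's vocabulary size. The sketch below is plain Python; `find_out_of_range_labels` is a hypothetical helper, and the `vocab_size` value of 50265 is an assumption that should be replaced by the value from the actual decoder config.

```python
def find_out_of_range_labels(labels, vocab_size, ignore_index=-100):
    """Return (row, column, token_id) for every label outside [0, vocab_size).

    `labels` is a nested list of ints (e.g. batch["labels"].tolist()).
    Positions equal to `ignore_index` are skipped because the loss ignores them.
    """
    bad = []
    for row, sequence in enumerate(labels):
        for col, token_id in enumerate(sequence):
            if token_id == ignore_index:
                continue
            if token_id < 0 or token_id >= vocab_size:
                bad.append((row, col, token_id))
    return bad


# Only the label values visible in the printout above; elided values are omitted.
visible_labels = [
    [0, 53593, 5142, -100],
    [0, 1939, 38817, -100],
]

# ASSUMPTION: 50265 is a placeholder; read the real size from the decoder config.
assumed_vocab_size = 50265

print(find_out_of_range_labels(visible_labels, assumed_vocab_size))
```

If this returns any entries for the real vocabulary size, the tokenizer used to build the labels and the decoder's embedding table probably do not match.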