trankit
Limit input string to 512 characters to avoid CUDA crash
Problem
# If
assert len(sentence) > 512
# then
annotated = model_trankit(sentence, is_sent=True)
# results in a CUDA error, e.g.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [19635,0,0], thread: [112,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
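Cause XLM-RoBERTa can only process 512 subword tokens per input (tokens, not characters), and a string longer than 512 characters can exceed that limit.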
Possible fix https://github.com/nlp-uoregon/trankit/blob/1c19b9b7df3be1de91c2dd6879e0e325af5e2898/trankit/pipeline.py#L1066
Change
...
ori_text = deepcopy(input)
tagged_sent = self._posdep_sent(input)
...
to
...
input = input[:512] # <<< TRIM INPUT TO MAX 512 (this is what gets passed to _posdep_sent)
ori_text = deepcopy(input)
tagged_sent = self._posdep_sent(input)
...
Until this is fixed, a quick workaround for other trankit users is to trim the string before calling the pipeline:
annotated = model_trankit(sentence[:512], is_sent=True)
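A plain `sentence[:512]` can cut the last word in half, which may give an odd final token. The following is a minimal sketch of a gentler trim that backs up to the last whitespace before the limit. The helper name `truncate_at_word_boundary` and its `max_chars` parameter are hypothetical and not part of trankit; the 512 default is simply the character limit proposed above.

```python
def truncate_at_word_boundary(sentence: str, max_chars: int = 512) -> str:
    """Return at most `max_chars` characters of `sentence`.

    If the hard cut would land inside a word, drop that partial word.
    Hypothetical helper for illustration; not part of trankit.
    """
    if len(sentence) <= max_chars:
        return sentence

    head = sentence[:max_chars]
    # The cut is inside a word only if the characters on both sides
    # of the cut position are non-whitespace.
    cut_inside_word = not head[-1].isspace() and not sentence[max_chars].isspace()
    if cut_inside_word:
        parts = head.rsplit(None, 1)  # split off the trailing partial word
        if len(parts) == 2:
            head = parts[0]
        # If there is no whitespace at all, keep the hard cut.
    return head.rstrip()
```

Assuming `model_trankit` is the pipeline object from the example above, it would be used as `annotated = model_trankit(truncate_at_word_boundary(sentence), is_sent=True)`. Since this is still a character-based limit, it is only a proxy for the model's token limit; text past the cut is dropped, not annotated.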