
RuntimeError: CUDA error: device-side assert triggered

Open heyoma opened this issue 3 years ago • 5 comments

Hi, I ran into a "RuntimeError: CUDA error: device-side assert triggered" error when I attempted to run your code on a Chinese dataset. The log is as follows:

/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [165,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [165,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "run_entity.py", line 225, in <module>
    output_dict = model.run_batch(train_batches[i], training=True)
  File "/tf_group/lihongyu/PURE-main/entity/models.py", line 302, in run_batch
    attention_mask = attention_mask_tensor.to(self._model_device),
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/tf_group/lihongyu/PURE-main/entity/models.py", line 65, in forward
    spans_embedding = self._get_span_embeddings(input_ids, spans, token_type_ids=token_type_ids, 
attention_mask=attention_mask)
  File "/tf_group/lihongyu/PURE-main/entity/models.py", line 41, in _get_span_embeddings
    sequence_output, pooled_output = self.bert(input_ids=input_ids, token_type_ids=token_type_ids,             
attention_mask=attention_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 752, in forward
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 181, in forward
    embeddings = inputs_embeds + position_embeddings + token_type_embeddings
RuntimeError: CUDA error: device-side assert triggered

I have searched a few cases of this error on Stack Overflow, but I still can't figure out what has happened. I printed the dimensions of inputs_embeds, position_embeddings, and token_type_embeddings, and nothing seemed wrong (all of them have shape [1, seq_len(>350), 768]). Thanks for your time.
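For context, this particular assert (`srcIndex < srcSelectDimSize`) fires inside an embedding lookup when an index is out of range for the embedding table: either a token id at or above vocab_size, or a position id at or above max_position_embeddings. A framework-agnostic sketch of those two checks, which can be run on the CPU inputs before they reach the GPU (the helper name is mine, not part of PURE):

```python
def check_token_indices(input_ids, vocab_size, max_position_embeddings=512):
    """Return the index problems that would trigger the device-side assert."""
    problems = []
    if max(input_ids) >= vocab_size:
        problems.append("token id out of vocabulary range")
    if len(input_ids) > max_position_embeddings:
        problems.append("sequence longer than max_position_embeddings")
    return problems

# bert-base-chinese has a 21128-token vocabulary and 512 positions.
print(check_token_indices([100] * 600, vocab_size=21128))
# -> ['sequence longer than max_position_embeddings']
```

Running with the environment variable `CUDA_LAUNCH_BLOCKING=1` (or on CPU) also makes the failing call show up at the right place in the traceback instead of at a later, unrelated line.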

heyoma avatar Jan 17 '22 10:01 heyoma

I don't think it has anything to do with the data I used, but I will post one piece of it here.

{"clusters": [[]], "sentences": [["攀", "谈", "中", "我", "了", "解", "到", "衣", "裙", "出", "她", "的", "手", ",", "一", "针", "一", "线", "、", "一", "花", "一", "朵", "都", "是", "田", "边", "地", "角", "劳", "动", "之", "余", "飞", "针", "走", "线", "绣", "成", "的", "。"]], "ner": [[[7, 8, "Thing"], [10, 10, "Person"], [14, 22, "Thing"], [25, 28, "Location"]]], "relations": [[[10, 10, 7, 8, "Create"], [14, 22, 7, 8, "Part-Whole"]]], "doc_key": "dev.json_9"}

heyoma avatar Jan 17 '22 10:01 heyoma

Maybe the reason is the one discussed in https://discuss.pytorch.org/t/solved-assertion-srcindex-srcselectdimsize-failed-on-gpu-for-torch-cat/1804/15, but I still have no idea~

heyoma avatar Jan 17 '22 10:01 heyoma

Hi! Have you tried to run our pre-trained models? I have never run into this issue before. I am wondering whether this is due to a version mismatch in some libraries.

a3616001 avatar Jan 17 '22 23:01 a3616001

Hi, thank you for your reply. I have just figured out what happened: some of my instances were too long. I discarded the sentences longer than 512 tokens (I don't know the exact limit), and it worked.
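For reference, a minimal sketch of that workaround, assuming sentences are lists of subword tokens and that the limit is BERT's 512-entry position embedding table (the helper name is hypothetical):

```python
def drop_overlong_sentences(sentences, max_seq_length=512):
    # BERT checkpoints ship a fixed-size position embedding table
    # (512 rows for bert-base); feeding longer sequences indexes past
    # it and triggers the device-side assert on GPU.
    # Reserve 2 slots for the [CLS] and [SEP] special tokens.
    return [s for s in sentences if len(s) + 2 <= max_seq_length]

short = ["tok"] * 100
too_long = ["tok"] * 600
print(len(drop_overlong_sentences([short, too_long])))  # -> 1
```

Truncating instead of discarding would keep more training data, at the cost of losing any entity spans beyond the cut.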

heyoma avatar Jan 19 '22 07:01 heyoma

Hi, would you please modify the code in run_relation_approx.py to handle max_seq_length more gracefully? Unlike the corresponding lines in run_relation.py, nothing is done for sequences with more than max_seq_length tokens.

In run_relation_approx.py

line 154: 
    assert(num_tokens + 4 <= max_seq_length)

In run_relation.py

lines 114-119:
    if len(tokens) > max_seq_length:
        tokens = tokens[:max_seq_length]
        if sub_idx >= max_seq_length:
            sub_idx = 0
        if obj_idx >= max_seq_length:
            obj_idx = 0
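
A possible guard for run_relation_approx.py, sketched along the lines of the run_relation.py snippet above: truncate instead of asserting, and reset any entity-marker index that falls past the cut. The function name is mine; resetting out-of-range markers to 0 simply mirrors the existing behaviour.

```python
def truncate_example(tokens, sub_idx, obj_idx, max_seq_length):
    # Truncate overlong inputs instead of failing the assert, and reset
    # any entity-marker index beyond the cut (as run_relation.py does).
    if len(tokens) > max_seq_length:
        tokens = tokens[:max_seq_length]
        if sub_idx >= max_seq_length:
            sub_idx = 0
        if obj_idx >= max_seq_length:
            obj_idx = 0
    return tokens, sub_idx, obj_idx

# Object marker at position 550 is beyond the 512-token cut, so it is reset.
print(truncate_example(list(range(600)), 10, 550, 512)[1:])  # -> (10, 0)
```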

heyoma avatar Jan 20 '22 08:01 heyoma