
sow model training error

TITC opened this issue 3 years ago • 7 comments

When I train on a dataset made from sample_test_sow_reap.txt, it gives me the following error.

Here is the training dataset:

input: BOS Y are therefore normally incurred by car makers on the sole basis X the market incentive . EOS
gt output: BOS for automobile manufacturers based on the initiative X the market for Y EOS
BOS X EOS


dev nll per token: 8.729735
done with batch 0 / 4 in epoch 4, loss: 8.444669, time:46
train nll per token : 8.444669 

input: BOS a Y of paper had been taped to X . EOS
gt output: BOS X was a Y of paper . EOS
BOS EOS EOS


input: BOS a higher rate Y has been observed in X compared with infants . EOS
gt output: BOS in X , a higher incidence Y was observed than in infants . EOS
BOS X EOS


input: BOS Y has been observed in X compared with infants . EOS
gt output: BOS in X , Y was observed than in infants . EOS
BOS EOS EOS


dev nll per token: 8.472298
done with batch 0 / 4 in epoch 5, loss: 8.056769, time:46
train nll per token : 8.056769 

/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [339,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [339,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, 
···
···
···
Traceback (most recent call last):
  File "sow/train.py", line 301, in <module>
    main(args)
  File "sow/train.py", line 199, in main
    preds = model(curr_inp, curr_out, curr_inp_pos, curr_in_order)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/sow-reap-paraphrasing/sow/models/seq2seq_base.py", line 50, in forward
    device_ids=device_ids.get('encoder', None))
  File "/content/sow-reap-paraphrasing/sow/models/seq2seq_base.py", line 31, in encode
    return self.encoder(inputs, input_postags, input_pos_order, hidden)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/sow-reap-paraphrasing/sow/models/transformer.py", line 79, in forward
    x = block(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/sow-reap-paraphrasing/sow/models/modules/transformer_blocks.py", line 114, in forward
    x, _ = self.attention(x, x, x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/sow-reap-paraphrasing/sow/models/modules/attention.py", line 284, in forward
    MultiHeadAttention, self).forward(query, key, value, key_padding_mask=key_padding_mask, attn_mask=attn_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/activation.py", line 783, in forward
    attn_mask=attn_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 3097, in multi_head_attention_forward
    qkv_same = torch.equal(query, key) and torch.equal(key, value)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/THCReduceAll.cuh:327

TITC avatar May 18 '21 09:05 TITC

Dear author, I have found some links that further confirm the issue. In the end, I found a way to pin down the cuda runtime error (59) by adding the code below to sow/train.py:

os.environ['CUDA_LAUNCH_BLOCKING'] = "1"  
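For context, a minimal sketch of the placement (the variable must be set before the first CUDA call, so it belongs at the very top of sow/train.py, before torch is imported):

```python
import os

# Setting this before any CUDA kernel launches makes kernels run
# synchronously, so an error surfaces at the faulting call site
# instead of at a later, unrelated op in the traceback.
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
```
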

The shell then gives a comparatively clear error, as below:

  File "/content/sow-reap-paraphrasing/sow/models/transformer.py", line 71, in forward
    y = self.pos_embedder(input_postags).mul_(self.scale_embedding)

I think this part is related to the target order r in your paper? But I am not sure, because there is a multiply operation here.
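To illustrate the failing line, here is a minimal sketch (sizes are assumptions) of what `pos_embedder` does when an index exceeds the embedding table: on CPU it raises an IndexError, while on GPU it becomes the device-side assert `srcIndex < srcSelectDimSize` shown in the log above.

```python
import torch
import torch.nn as nn

# embedding table sized to 70 POS tags, i.e. valid indices 0..69
emb = nn.Embedding(num_embeddings=70, embedding_dim=8)

ok = emb(torch.tensor([0, 69]))   # in range: works, shape (2, 8)

try:
    emb(torch.tensor([70]))       # index 70 is out of range
    out_of_range_failed = False
except IndexError:
    # CPU raises IndexError here; on GPU the same lookup triggers
    # the asynchronous device-side assert instead
    out_of_range_failed = True
```
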

Any advice is welcome.

reference: https://discuss.pytorch.org/t/device-side-assert-triggered-at-error/82488/5

TITC avatar May 19 '21 12:05 TITC

OK, I think the problem is here:

    model_config['postag_size'] = len(pos)

The above should be changed to:

    model_config['postag_size'] = len(pos)+1

reference: https://blog.csdn.net/Geek_of_CSDN/article/details/86527107
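A minimal sketch of why the +1 works around the crash, assuming (hypothetically) that the data can contain an index equal to len(pos):

```python
# hypothetical vocab: 70 tags mapped to indices 0..69
pos = {f"TAG{i}": i for i in range(70)}

postag_size = len(pos)      # 70 rows -> valid indices 0..69
stray_index = len(pos)      # an index of 70 appearing in the data

out_of_range_before = stray_index >= postag_size   # True: crashes
out_of_range_after = stray_index >= len(pos) + 1   # False: fits
```
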

TITC avatar May 19 '21 16:05 TITC

There are still some things that do not make sense.

The number of POS classes is 71, so I can solve the problem by adding 1 to model_config['postag_size'].

But it is strange that 71 appears in the dev set made by your script: since the size of the POS vocabulary is 71, the index should not be able to reach 71.

The other aspect is that all values in the dev set provided on your Google Drive are below 71, yet the error still occurs, and it can likewise be fixed by adding 1.
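A small helper (hypothetical, not part of the repo) that makes this kind of range check explicit before training, instead of waiting for the device-side assert:

```python
def indices_in_range(batch, num_embeddings):
    """True iff every POS index can address a row of the embedding table."""
    return max(max(row) for row in batch) < num_embeddings

# a batch whose largest index is 70: too big for a 70-row table,
# but fine once the table has 71 rows
dev_batch = [[3, 17, 70], [5, 0, 69]]
fits_70 = indices_in_range(dev_batch, 70)   # False
fits_71 = indices_in_range(dev_batch, 71)   # True
```
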

TITC avatar May 19 '21 16:05 TITC

The reason a POS index of 71 appears is here:

            for p in pos1 + pos2:
                if p not in pos_vocab.keys():
                    pos_vocab[p] = len(pos_vocab)
                    rev_pos_vocab[pos_vocab[p]] = p


This adds the new POS tags to pos_vocab, but saves them as a new pkl file, which leads train.py to read the previous pos_vocab. The embedding size therefore becomes len(pos_vocab) == 70, not 71, at:

    model_config['postag_size'] = len(pos)
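A self-contained sketch of the mismatch (file names are hypothetical): preprocessing grows the vocab and writes it to a new pkl, while train.py loads the stale one, so the embedding table ends up one row short.

```python
import os
import pickle
import tempfile

tmp = tempfile.mkdtemp()
old_path = os.path.join(tmp, "pos_vocab.pkl")
new_path = os.path.join(tmp, "pos_vocab_new.pkl")

# original vocab: 70 tags, indices 0..69
pos_vocab = {f"TAG{i}": i for i in range(70)}
with open(old_path, "wb") as f:
    pickle.dump(pos_vocab, f)

# preprocessing meets an unseen tag and extends the vocab (index 70),
# but saves the result to a *different* file
pos_vocab["NEW_TAG"] = len(pos_vocab)
with open(new_path, "wb") as f:
    pickle.dump(pos_vocab, f)

# train.py loads the stale file, so the embedding only gets 70 rows
with open(old_path, "rb") as f:
    train_vocab = pickle.load(f)

embedding_rows = len(train_vocab)         # 70
max_data_index = max(pos_vocab.values())  # 70 -> out of range at runtime
```
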

TITC avatar May 20 '21 01:05 TITC

Hi, this is an indexing error. Are you using your own data, or is this running on the data in the Google Drive? Is this with a vocabulary that you created, or the one provided in the Google Drive?

tagoyal avatar May 21 '21 01:05 tagoyal

  • case 1: the vocabulary and the dev dataset come from your shared Google Drive, but the training dataset was created from your provided sample using your script. The error occurs.

  • case 2: the vocabulary comes from your shared Google Drive, but the training dataset and dev dataset were both created from your provided sample using your script. The error occurs.


If you don't mind, you can reproduce this error using the files I uploaded to GitHub.

TITC avatar May 21 '21 01:05 TITC

Hi, this is an indexing error. Are you using your own data, or is this running on the data in the Google Drive? Is this with a vocabulary that you created, or the one provided in the Google Drive?

If the error is caused by indexing, how do we explain that it occurs even when running on the datasets you shared on Google Drive, where the index range is 0~70 and does not exceed 71?

TITC avatar May 21 '21 06:05 TITC