sow-reap-paraphrasing
sow model training error
When I train on a dataset made from sample_test_sow_reap.txt, it gives me the following error.
Here is the training log:
input: BOS Y are therefore normally incurred by car makers on the sole basis X the market incentive . EOS
gt output: BOS for automobile manufacturers based on the initiative X the market for Y EOS
BOS X EOS
dev nll per token: 8.729735
done with batch 0 / 4 in epoch 4, loss: 8.444669, time:46
train nll per token : 8.444669
input: BOS a Y of paper had been taped to X . EOS
gt output: BOS X was a Y of paper . EOS
BOS EOS EOS
input: BOS a higher rate Y has been observed in X compared with infants . EOS
gt output: BOS in X , a higher incidence Y was observed than in infants . EOS
BOS X EOS
input: BOS Y has been observed in X compared with infants . EOS
gt output: BOS in X , Y was observed than in infants . EOS
BOS EOS EOS
dev nll per token: 8.472298
done with batch 0 / 4 in epoch 5, loss: 8.056769, time:46
train nll per token : 8.056769
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [339,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [339,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T,
···
···
···
Traceback (most recent call last):
File "sow/train.py", line 301, in <module>
main(args)
File "sow/train.py", line 199, in main
preds = model(curr_inp, curr_out, curr_inp_pos, curr_in_order)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/content/sow-reap-paraphrasing/sow/models/seq2seq_base.py", line 50, in forward
device_ids=device_ids.get('encoder', None))
File "/content/sow-reap-paraphrasing/sow/models/seq2seq_base.py", line 31, in encode
return self.encoder(inputs, input_postags, input_pos_order, hidden)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/content/sow-reap-paraphrasing/sow/models/transformer.py", line 79, in forward
x = block(x)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/content/sow-reap-paraphrasing/sow/models/modules/transformer_blocks.py", line 114, in forward
x, _ = self.attention(x, x, x)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/content/sow-reap-paraphrasing/sow/models/modules/attention.py", line 284, in forward
MultiHeadAttention, self).forward(query, key, value, key_padding_mask=key_padding_mask, attn_mask=attn_mask)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/activation.py", line 783, in forward
attn_mask=attn_mask)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 3097, in multi_head_attention_forward
qkv_same = torch.equal(query, key) and torch.equal(key, value)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/THCReduceAll.cuh:327
Dear author, I have found some links that further confirm the issue. Finally, I found a way to pin down the cuda runtime error (59)
by adding the line below in sow/train.py:
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
After that, the shell gives a relatively clear error, shown below:
File "/content/sow-reap-paraphrasing/sow/models/transformer.py", line 71, in forward
y = self.pos_embedder(input_postags).mul_(self.scale_embedding)
I think this part is related to the target order r in your paper? But I am not sure, because there is a multiplication operation here.
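For what it is worth, here is a minimal, hypothetical repro of how an out-of-range POS index into that embedding turns into the device-side assert; the sizes and variable names below are my own illustration, not the repo's actual values.

import torch
import torch.nn as nn

pos_embedder = nn.Embedding(num_embeddings=71, embedding_dim=32)  # assumed postag_size
input_postags = torch.tensor([[3, 15, 71]])  # 71 is out of range for a 71-row table

try:
    # On CPU this raises a readable error; on CUDA the same lookup surfaces
    # asynchronously as the "srcIndex < srcSelectDimSize" assert and the
    # cuda runtime error (59) shown above.
    y = pos_embedder(input_postags)
except (IndexError, RuntimeError) as e:
    print("out-of-range POS index:", e)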

Any advice is welcome.
reference: https://discuss.pytorch.org/t/device-side-assert-triggered-at-error/82488/5
OK, I think here is the problem:
model_config['postag_size'] = len(pos)
The above should be changed to
model_config['postag_size'] = len(pos)+1
reference: https://blog.csdn.net/Geek_of_CSDN/article/details/86527107
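As a hypothetical sanity check (my own naming, not part of the repo), something like the following could be dropped into sow/train.py right before the model call, so the run fails fast with a readable message instead of the asynchronous CUDA assert:

import torch

def check_postag_indices(postags, postag_size):
    # Fail fast if any POS index cannot be looked up in the embedding table.
    max_idx = int(postags.max().item())
    if max_idx >= postag_size:
        raise ValueError(
            f"POS index {max_idx} out of range for embedding of size {postag_size}")

Assuming curr_inp_pos holds the POS tag indices as in the traceback above, usage inside the training loop would be:
check_postag_indices(curr_inp_pos, model_config['postag_size'])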
There are still some things that do not make sense.
The POS vocabulary has 71 classes, so I can solve the problem by adding 1 to model_config['postag_size'].
But it is strange that index 71 appears in the dev set made by your script: since the size of the POS vocabulary is 71, index 71 should not be possible.
The other point is that all the POS values in the dev set provided in your google drive are below 71, yet the error still occurs, and it can also be fixed by adding 1.
The reason POS index 71 appears is here:
for p in pos1 + pos2:
    # any unseen POS tag is appended to the vocabulary with the next free index
    if p not in pos_vocab.keys():
        pos_vocab[p] = len(pos_vocab)
        rev_pos_vocab[pos_vocab[p]] = p
This adds the new POS tag to pos_vocab, but it is saved as a new pkl file, which leads train.py to read the previous pos_vocab and set the embedding size to len(pos_vocab) == 70, not 71, via
model_config['postag_size'] = len(pos)
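If I am reading this right, the mismatch can be reproduced with a toy version of the vocabulary handling; the tags, sizes, and pkl file names below are invented for illustration and are not the repo's actual ones.

import pickle

# 1) the first preprocessing run builds the POS vocab and saves it
pos_vocab = {p: i for i, p in enumerate(["NN", "VB", "DT"])}  # pretend this has 70 entries
with open("pos_vocab.pkl", "wb") as f:
    pickle.dump(pos_vocab, f)

# 2) a later pass meets an unseen tag and extends an in-memory copy,
#    writing the extended vocab to a different pkl than the one train.py reads
rev_pos_vocab = {i: p for p, i in pos_vocab.items()}
for p in ["NN", "JJ"]:  # "JJ" is new and gets index len(pos_vocab)
    if p not in pos_vocab:
        pos_vocab[p] = len(pos_vocab)
        rev_pos_vocab[pos_vocab[p]] = p
with open("pos_vocab_new.pkl", "wb") as f:
    pickle.dump(pos_vocab, f)

# 3) train.py loads the old pkl, so postag_size ends up one smaller than the
#    largest index that can appear in the preprocessed data
with open("pos_vocab.pkl", "rb") as f:
    pos = pickle.load(f)
print(len(pos))  # 3 here, while index 3 can appear in the data -> out-of-range lookup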
Hi, this is an indexing error. Are you using your own data or is this running on the data in the google drive? Is this on the vocabulary that you have created or the one provided in the google drive?
- Case 1: the vocabulary and dev dataset come from your shared google drive, but the training dataset was created from your provided sample through your script. The error occurs.
- Case 2: the vocabulary comes from your shared google drive, but the training dataset and dev dataset were created from your provided sample through your script. The error occurs.
If you do not mind, you can reproduce this error using the files I uploaded to GitHub.
If the error is caused by an index, how do you explain that the error occurs even when running on the datasets you provided in the google drive, where the index range is 0~70 and does not reach 71?