[/home/nihao/nihao-users2/yuhao/DSLP/env/ctcdecode/ctcdecode/src/ctc_beam_search_decoder.cpp:32] FATAL: "(probs_seq[i].size()) == (vocabulary.size())" check failed. The shape of probs_seq does not match with the shape of the vocabulary
Segmentation fault (core dumped)
I have encountered this problem. I have not modified the original code; could you tell me what is going wrong?
I'm running the "CTC with DSLP" code:
python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de --save-dir checkpoints --eval-tokenized-bleu \
--keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
--eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
--eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5 --fixed-validation-seed 7 --ddp-backend=no_c10d \
--share-all-embeddings --decoder-learned-pos --encoder-learned-pos --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
--lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
--fp16 --clip-norm 2.0 --max-update 300000 --task translation_lev --criterion nat_loss --arch nat_ctc_sd --noise full_mask \
--src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1 --concat-yhat --concat-dropout 0.0 --label-smoothing 0.0 \
--activation-fn gelu --dropout 0.1 --max-tokens 2048 --update-freq 4
FATAL: "(probs_seq[i].size()) == (vocabulary.size())" check failed. The shape of probs_seq does not match with the shape of the vocabulary
I encountered this problem when running both the GLAT+CTC+SD and CTC+SD code, and I haven't changed the DSLP code. What does this error mean? I hope the author can clarify my doubts.
Hello, @thunder123321
Unfortunately, there is not enough information for me to tell what went wrong in your setup. My best guess is that the error is related to your ctcdecode installation.
BTW, I just tested a clean clone of the repo with your script, and it works on my side.
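For reference, that failed check just means the per-step probability vector handed to ctcdecode has a different length than the label list the decoder was constructed with. Here is a minimal sanity-check sketch of that relationship (my assumption of a typical ctcdecode setup with toy labels, not the exact DSLP code):

import torch
from ctcdecode import CTCBeamDecoder

# Toy label list standing in for the fairseq target dictionary
# (in DSLP the labels would come from tgt_dict, with tgt_dict.blank_index as the blank).
labels = ["<blank>", "a", "b", "c"]
decoder = CTCBeamDecoder(labels, beam_width=1, blank_id=0, log_probs_input=True)

# Fake model output of shape (batch, time, vocab). The C++ check that aborts
# corresponds to the last dimension being equal to len(labels).
lprobs = torch.randn(2, 7, len(labels)).log_softmax(-1)
assert lprobs.size(-1) == len(labels), "probs_seq / vocabulary size mismatch"

beam_results, beam_scores, timesteps, out_lens = decoder.decode(lprobs)
print(beam_results.shape)  # (batch, beam_width, time)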
Actually, ctcdecode is only used as a post-processing step in the final version, since I only used beam size 1.
I think you can use the --plain-ctc option to avoid using ctcdecode.
However, you then need to do some post-processing here: https://github.com/chenyangh/DSLP/blob/a9d3ee154f3bc73b9dfc191ed537ee90b3896956/fairseq/models/nat/nat_ctc_sd_ss.py#L507 You may incorporate this function:
def _ctc_postprocess(tokens):
    hyp = tokens
    # Collapse consecutive repeated tokens (plain CTC), then drop blank/mask symbols.
    _toks = hyp.int().tolist()
    _toks = [v for i, v in enumerate(_toks) if i == 0 or v != _toks[i - 1]]
    hyp = hyp.new_tensor([v for v in _toks if v not in extra_symbols_to_ignore])
    return hyp

extra_symbols_to_ignore = []
if hasattr(tgt_dict, "blank_index"):
    extra_symbols_to_ignore.append(tgt_dict.blank_index)
if hasattr(tgt_dict, "mask_index"):
    extra_symbols_to_ignore.append(tgt_dict.mask_index)
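To see what this does concretely, here is a small self-contained sketch of the same collapse-and-strip step on a toy sequence (the blank index 0 is an assumption for illustration, not the real tgt_dict value):

import torch

blank_index = 0                       # assumed blank id for this toy example
extra_symbols_to_ignore = [blank_index]

def _ctc_postprocess(tokens):
    _toks = tokens.int().tolist()
    # Collapse consecutive repeats, then drop blank/mask symbols.
    _toks = [v for i, v in enumerate(_toks) if i == 0 or v != _toks[i - 1]]
    return tokens.new_tensor([v for v in _toks if v not in extra_symbols_to_ignore])

# Raw plain-CTC output such as [a a <blank> b b <blank> b] collapses to [a b b].
raw = torch.tensor([5, 5, 0, 7, 7, 0, 7])
print(_ctc_postprocess(raw).tolist())  # [5, 7, 7]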
Hi @chenyangh, thank you very much for answering my question. I noticed that the two post-processing functions you mentioned appear in the generation.py file. Does that mean I just need to add the --plain-ctc parameter? When I added --plain-ctc in my experiment, I found that the memory footprint was higher; I thought ctcdecode was used to reduce the memory footprint.
Hi @thunder123321, --plain-ctc was added to replace the ctcdecode module (which is much slower, even with beam 1). However, the --plain-ctc option does not perform the post-processing during training. That is why I suggested the above modifications if you cannot get ctcdecode working.
In terms of memory consumption, I am not sure whether that is caused by the plain-ctc option. I do remember that at some point during development the model suddenly started consuming more RAM per batch, but unfortunately I haven't identified the reason.