[/home/nihao/nihao-users2/yuhao/DSLP/env/ctcdecode/ctcdecode/src/ctc_beam_search_decoder.cpp:32] FATAL: "(probs_seq[i].size()) == (vocabulary.size())" check failed. The shape of probs_seq does not match with the shape of the vocabulary
Segmentation fault (core dumped)
I have encountered this problem. I have not modified the original code; could you tell me what is going wrong?
I'm running the "CTC with DSLP" code:
python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de --save-dir checkpoints --eval-tokenized-bleu \
--keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
--eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
--eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5 --fixed-validation-seed 7 --ddp-backend=no_c10d \
--share-all-embeddings --decoder-learned-pos --encoder-learned-pos --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
--lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
--fp16 --clip-norm 2.0 --max-update 300000 --task translation_lev --criterion nat_loss --arch nat_ctc_sd --noise full_mask \
--src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1 --concat-yhat --concat-dropout 0.0 --label-smoothing 0.0 \
--activation-fn gelu --dropout 0.1 --max-tokens 2048 --update-freq 4
FATAL: "(probs_seq[i].size()) == (vocabulary.size())" check failed. The shape of probs_seq does not match with the shape of the vocabulary
I encountered this problem when running both the GLAT+CTC+SD and CTC+SD code, and I haven't changed the DSLP code. What does this error mean? I hope the author can clarify my doubts.
Hello, @thunder123321
Unfortunately, there is not enough information for me to tell what went wrong in your setup. My best guess is that the error is related to your ctcdecode installation.
BTW, I just tested a clean clone of the repo with your script, and it works on my side.
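For reference, that failed check just means the per-step probability vector handed to ctcdecode has a different length than the label list the decoder was constructed with. Here is a minimal sanity-check sketch of that relationship (my assumption of a typical ctcdecode setup with toy labels, not the exact DSLP code):

import torch
from ctcdecode import CTCBeamDecoder

# Toy label list standing in for the fairseq target dictionary
# (in DSLP the labels would come from tgt_dict, with tgt_dict.blank_index as the blank).
labels = ["<blank>", "a", "b", "c"]
decoder = CTCBeamDecoder(labels, beam_width=1, blank_id=0, log_probs_input=True)

# Fake model output of shape (batch, time, vocab). The C++ check that aborts
# corresponds to the last dimension being equal to len(labels).
lprobs = torch.randn(2, 7, len(labels)).log_softmax(-1)
assert lprobs.size(-1) == len(labels), "probs_seq / vocabulary size mismatch"

beam_results, beam_scores, timesteps, out_lens = decoder.decode(lprobs)
print(beam_results.shape)  # (batch, beam_width, time)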
Actually, ctcdecode is only used as a post-processing step in the final version, since I only used beam size 1.
I think you can use the --plain-ctc option to avoid using ctcdecode.
However, you then need to do some post-processing here: https://github.com/chenyangh/DSLP/blob/a9d3ee154f3bc73b9dfc191ed537ee90b3896956/fairseq/models/nat/nat_ctc_sd_ss.py#L507 You may incorporate this function:
def _ctc_postprocess(tokens):
    hyp = tokens
    # Collapse consecutive repeated tokens (plain CTC), then drop blank/mask symbols.
    _toks = hyp.int().tolist()
    _toks = [v for i, v in enumerate(_toks) if i == 0 or v != _toks[i - 1]]
    hyp = hyp.new_tensor([v for v in _toks if v not in extra_symbols_to_ignore])
    return hyp

extra_symbols_to_ignore = []
if hasattr(tgt_dict, "blank_index"):
    extra_symbols_to_ignore.append(tgt_dict.blank_index)
if hasattr(tgt_dict, "mask_index"):
    extra_symbols_to_ignore.append(tgt_dict.mask_index)
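To see what this does concretely, here is a small self-contained sketch of the same collapse-and-strip step on a toy sequence (the blank index 0 is an assumption for illustration, not the real tgt_dict value):

import torch

blank_index = 0                       # assumed blank id for this toy example
extra_symbols_to_ignore = [blank_index]

def _ctc_postprocess(tokens):
    _toks = tokens.int().tolist()
    # Collapse consecutive repeats, then drop blank/mask symbols.
    _toks = [v for i, v in enumerate(_toks) if i == 0 or v != _toks[i - 1]]
    return tokens.new_tensor([v for v in _toks if v not in extra_symbols_to_ignore])

# Raw plain-CTC output such as [a a <blank> b b <blank> b] collapses to [a b b].
raw = torch.tensor([5, 5, 0, 7, 7, 0, 7])
print(_ctc_postprocess(raw).tolist())  # [5, 7, 7]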
Hi @chenyangh, thank you very much for answering my question. I noticed that the two post-processing functions you mentioned appear in the generation.py file. Does that mean I just need to add the --plain-ctc parameter? When I added --plain-ctc in my experiment, I found that the memory footprint was higher; I thought ctcdecode was used to reduce the memory footprint.
Hi @thunder123321, --plain-ctc was added to replace the ctcdecode module (which is much slower, even with beam 1). However, the --plain-ctc option does not perform the post-processing during training. That is why I suggested the above modifications if you cannot get ctcdecode working.
In terms of memory consumption, I am not sure whether that is caused by the plain-ctc option. I do remember that at some point during development the model suddenly started consuming more RAM per batch, but unfortunately I haven't identified the reason.