marian Error: DefaultVocabulary file is expected to contain an entry for (...)

`[2019-11-25 21:28:58] [config] tied-embeddings-all: false
[2019-11-25 21:28:58] [config] tied-embeddings-src: false
[2019-11-25 21:28:58] [config] train-sets:
[2019-11-25 21:28:58] [config] - /191030/1_ForTrain/train.tok.en
[2019-11-25 21:28:58] [config] - /191030/1_ForTrain/train.tok.zh
[2019-11-25 21:28:58] [config] transformer-aan-activation: swish
[2019-11-25 21:28:58] [config] transformer-aan-depth: 2
[2019-11-25 21:28:58] [config] transformer-aan-nogate: false
[2019-11-25 21:28:58] [config] transformer-decoder-autoreg: self-attention
[2019-11-25 21:28:58] [config] transformer-dim-aan: 2048
[2019-11-25 21:28:58] [config] transformer-dim-ffn: 2048
[2019-11-25 21:28:58] [config] transformer-dropout: 0
[2019-11-25 21:28:58] [config] transformer-dropout-attention: 0
[2019-11-25 21:28:58] [config] transformer-dropout-ffn: 0
[2019-11-25 21:28:58] [config] transformer-ffn-activation: swish
[2019-11-25 21:28:58] [config] transformer-ffn-depth: 2
[2019-11-25 21:28:58] [config] transformer-guided-alignment-layer: last
[2019-11-25 21:28:58] [config] transformer-heads: 8
[2019-11-25 21:28:58] [config] transformer-no-projection: false
[2019-11-25 21:28:58] [config] transformer-postprocess: dan
[2019-11-25 21:28:58] [config] transformer-postprocess-emb: d
[2019-11-25 21:28:58] [config] transformer-preprocess: ""
[2019-11-25 21:28:58] [config] transformer-tied-layers:
[2019-11-25 21:28:58] [config] []
[2019-11-25 21:28:58] [config] type: amun
[2019-11-25 21:28:58] [config] ulr: false
[2019-11-25 21:28:58] [config] ulr-dim-emb: 0
[2019-11-25 21:28:58] [config] ulr-dropout: 0
[2019-11-25 21:28:58] [config] ulr-keys-vectors: ""
[2019-11-25 21:28:58] [config] ulr-query-vectors: ""
[2019-11-25 21:28:58] [config] ulr-softmax-temperature: 1
[2019-11-25 21:28:58] [config] ulr-trainable-transformation: false
[2019-11-25 21:28:58] [config] valid-freq: 10000u
[2019-11-25 21:28:58] [config] valid-max-length: 1000
[2019-11-25 21:28:58] [config] valid-metrics:
[2019-11-25 21:28:58] [config] - cross-entropy
[2019-11-25 21:28:58] [config] valid-mini-batch: 32
[2019-11-25 21:28:58] [config] vocabs:
[2019-11-25 21:28:58] [config] - /191030/2_ForTune/Tune.en
[2019-11-25 21:28:58] [config] - /191030/2_ForTune/Tune.zh
[2019-11-25 21:28:58] [config] word-penalty: 0
[2019-11-25 21:28:58] [config] workspace: 2048
[2019-11-25 21:28:58] [config] Model is being created with Marian v1.7.6 1d4ba73 2019-05-11 17:16:31 +0100
[2019-11-25 21:28:58] Using single-device training
[2019-11-25 21:28:58] [data] Loading vocabulary from text file /191030/2_ForTune/Tune.en
[2019-11-25 21:28:58] Error: DefaultVocabulary file /191030/2_ForTune/Tune.en is expected to contain an entry for
[2019-11-25 21:28:58] Error: Aborted from marian::DefaultVocab::load(const string&, size_t)::<lambda(const string&, const string&, marian::Word)> in /marian/src/data/default_vocab.cpp:154

[CALL STACK]
[0x592ded]
[0x594c0c]
[0x586ed5]
[0x587c7b]
[0x59cccc]
[0x5a8771]
[0x4d071d]
[0x4f9392]
[0x42e213]
[0x40c0da]
[0x7f8724bb9830] __libc_start_main + 0xf0
[0x42b7f9] `

I got this error while training the engine. How can I solve this problem?

Nov 26 '19 02:11 sdlmw

Hi, did you create the vocabulary somehow by hand? It seems to be missing some symbols.

Nov 26 '19 03:11 emjotde

Hi emjotde, I simply extracted part of the training set to use as the vocabulary.

Nov 26 '19 07:11 sdlmw

You would need to use the marian-vocab binary for this. Marian requires a number of special symbols; if those are missing, you will get errors like this.

Nov 26 '19 07:11 emjotde

OK, I will retry. Usage: ./marian/build/marian-vocab [OPTIONS]

Nov 26 '19 07:11 sdlmw

Closing, the question has been answered.

Dec 09 '19 10:12 snukky

Actually, there is no need to use marian-vocab. You just need these three entries, by convention at the start of the vocab:

<unk>
<s>
</s>
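
To confirm whether an existing vocab already has these entries, something like the following sketch may help. It assumes a plain-text vocab with one token per line; `check_vocab` is a hypothetical helper name used for illustration, not part of Marian.

```shell
#!/bin/sh
# check_vocab (hypothetical helper): report which of the three required
# entries are missing from a plain-text vocab with one token per line.
check_vocab() {
  vocab=$1
  missing=0
  for sym in '<unk>' '<s>' '</s>'; do
    # -x matches the whole line, -F treats the pattern as a fixed string,
    # -q suppresses output so only the exit status is used.
    if ! grep -qxF -e "$sym" "$vocab"; then
      echo "missing: $sym"
      missing=1
    fi
  done
  # Exit status 0 when all three entries are present, 1 otherwise.
  return "$missing"
}
```

Calling `check_vocab VOCAB` prints one `missing:` line per absent entry and returns a non-zero status if any entry is absent.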

I create my vocabs with something like this:

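# Start the vocab with the three required entries (printf is more portable than echo -e).
printf '<unk>\n<s>\n</s>\n' > VOCAB
# Turn spaces and carriage returns into newlines (one token per line),
# then deduplicate, drop empty lines, and append to the vocab.
cat CORPUS \
  | tr ' \r' '\n\n' \
  | sort -u | grep . \
  >> VOCAB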
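
As a rough illustration of what this recipe produces, here is a toy run on a two-line corpus. The temporary directory, the sample sentences, and the `LC_ALL=C` setting (used only to make the sort order reproducible) are assumptions added for the example.

```shell
#!/bin/sh
# Toy run of the vocab recipe; all file names and contents are placeholders.
workdir=$(mktemp -d)

# Two sentences with repeated words; the first line ends in a carriage return.
printf 'the cat sat\r\nthe dog sat\n' > "$workdir/CORPUS"

# Required entries first, then the sorted unique tokens of the corpus.
printf '<unk>\n<s>\n</s>\n' > "$workdir/VOCAB"
tr ' \r' '\n\n' < "$workdir/CORPUS" \
  | LC_ALL=C sort -u | grep . \
  >> "$workdir/VOCAB"

cat "$workdir/VOCAB"
```

The resulting file has seven lines: the three special entries, followed by `cat`, `dog`, `sat`, and `the`. The stray carriage return and the duplicate words do not show up in the vocab.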

I believe marian-vocab creates a JSON file. JSON is not a suitable format for representing vocabularies, because JSON implies UTF-8 encoding. That is problematic for two reasons:

- I have found that corpora do contain invalid UTF-8, which causes malformed JSON files;
- Marian is encoding-agnostic.
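
To see whether a given corpus is affected by the invalid UTF-8 issue, a check along these lines could be used. This is only a sketch: it assumes an `iconv` that exits with a non-zero status on invalid input (as the glibc and GNU libiconv versions do), and `is_valid_utf8` is a hypothetical helper name.

```shell
#!/bin/sh
# is_valid_utf8 (hypothetical helper): succeed only if the whole file
# decodes as UTF-8. Assumes iconv fails on the first invalid byte sequence.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Small demonstration with one valid and one invalid file.
demo=$(mktemp -d)
printf 'caf\303\251\n' > "$demo/good"   # "café" encoded as UTF-8
printf 'caf\351\n' > "$demo/bad"        # lone 0xE9 byte, not valid UTF-8

for f in "$demo/good" "$demo/bad"; do
  if is_valid_utf8 "$f"; then
    echo "$(basename "$f"): valid UTF-8"
  else
    echo "$(basename "$f"): contains invalid UTF-8"
  fi
done
```

If the corpus fails such a check, a plain-text vocab built with byte-oriented tools avoids having to repair the encoding first.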

Dec 09 '19 15:12 frankseide

> Actually no need to use marian_vocab. You just need these three entries, and by convention at the start of the vocab:
> <unk>
> <s>
> </s>

@frankseide Thank you so much! You saved my day!

Jan 08 '24 21:01 sappho192
