marian
Error: DefaultVocabulary file is expected to contain an entry for (...)
`[2019-11-25 21:28:58] [config] tied-embeddings-all: false
[2019-11-25 21:28:58] [config] tied-embeddings-src: false
[2019-11-25 21:28:58] [config] train-sets:
[2019-11-25 21:28:58] [config] - /191030/1_ForTrain/train.tok.en
[2019-11-25 21:28:58] [config] - /191030/1_ForTrain/train.tok.zh
[2019-11-25 21:28:58] [config] transformer-aan-activation: swish
[2019-11-25 21:28:58] [config] transformer-aan-depth: 2
[2019-11-25 21:28:58] [config] transformer-aan-nogate: false
[2019-11-25 21:28:58] [config] transformer-decoder-autoreg: self-attention
[2019-11-25 21:28:58] [config] transformer-dim-aan: 2048
[2019-11-25 21:28:58] [config] transformer-dim-ffn: 2048
[2019-11-25 21:28:58] [config] transformer-dropout: 0
[2019-11-25 21:28:58] [config] transformer-dropout-attention: 0
[2019-11-25 21:28:58] [config] transformer-dropout-ffn: 0
[2019-11-25 21:28:58] [config] transformer-ffn-activation: swish
[2019-11-25 21:28:58] [config] transformer-ffn-depth: 2
[2019-11-25 21:28:58] [config] transformer-guided-alignment-layer: last
[2019-11-25 21:28:58] [config] transformer-heads: 8
[2019-11-25 21:28:58] [config] transformer-no-projection: false
[2019-11-25 21:28:58] [config] transformer-postprocess: dan
[2019-11-25 21:28:58] [config] transformer-postprocess-emb: d
[2019-11-25 21:28:58] [config] transformer-preprocess: ""
[2019-11-25 21:28:58] [config] transformer-tied-layers:
[2019-11-25 21:28:58] [config] []
[2019-11-25 21:28:58] [config] type: amun
[2019-11-25 21:28:58] [config] ulr: false
[2019-11-25 21:28:58] [config] ulr-dim-emb: 0
[2019-11-25 21:28:58] [config] ulr-dropout: 0
[2019-11-25 21:28:58] [config] ulr-keys-vectors: ""
[2019-11-25 21:28:58] [config] ulr-query-vectors: ""
[2019-11-25 21:28:58] [config] ulr-softmax-temperature: 1
[2019-11-25 21:28:58] [config] ulr-trainable-transformation: false
[2019-11-25 21:28:58] [config] valid-freq: 10000u
[2019-11-25 21:28:58] [config] valid-max-length: 1000
[2019-11-25 21:28:58] [config] valid-metrics:
[2019-11-25 21:28:58] [config] - cross-entropy
[2019-11-25 21:28:58] [config] valid-mini-batch: 32
[2019-11-25 21:28:58] [config] vocabs:
[2019-11-25 21:28:58] [config] - /191030/2_ForTune/Tune.en
[2019-11-25 21:28:58] [config] - /191030/2_ForTune/Tune.zh
[2019-11-25 21:28:58] [config] word-penalty: 0
[2019-11-25 21:28:58] [config] workspace: 2048
[2019-11-25 21:28:58] [config] Model is being created with Marian v1.7.6 1d4ba73 2019-05-11 17:16:31 +0100
[2019-11-25 21:28:58] Using single-device training
[2019-11-25 21:28:58] [data] Loading vocabulary from text file /191030/2_ForTune/Tune.en
[2019-11-25 21:28:58] Error: DefaultVocabulary file /191030/2_ForTune/Tune.en is expected to contain an entry for
[2019-11-25 21:28:58] Error: Aborted from marian::DefaultVocab::load(const string&, size_t)::<lambda(const string&, const string&, marian::Word)> in /marian/src/data/default_vocab.cpp:154
[CALL STACK]
[0x592ded]
[0x594c0c]
[0x586ed5]
[0x587c7b]
[0x59cccc]
[0x5a8771]
[0x4d071d]
[0x4f9392]
[0x42e213]
[0x40c0da]
[0x7f8724bb9830] __libc_start_main + 0xf0
[0x42b7f9]
`
I got this error while training the engine. How can I solve this problem?
Hi, did you create the vocabulary somehow by hand? It seems to be missing some symbols.
Hi emjotde, I simply extracted part of the training set to use as the vocabulary.
You would need to use the `marian-vocab` binary for this. Marian requires a number of special symbols; if those are missing, you will get errors like this.
OK, I will retry. Usage: `./marian/build/marian-vocab [OPTIONS]`
Closing, the question has been answered.
Actually, there is no need to use `marian-vocab`. You just need these three entries, by convention at the start of the vocab:
<unk>
<s>
</s>
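A missing entry can be caught before training starts. The following is a minimal sketch, assuming a plain-text vocabulary with one token per line; the function name `check_vocab` is hypothetical and not part of Marian:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Marian): report which of the three
# special entries are missing from a plain-text, one-token-per-line vocab.
check_vocab() {
  local vocab="$1" missing=0 sym
  for sym in '<unk>' '<s>' '</s>'; do
    # -F: fixed string, -x: the whole line must match, -q: no output
    if ! grep -Fxq -e "$sym" "$vocab"; then
      echo "missing: $sym"
      missing=1
    fi
  done
  # Exit status 0 if all three entries are present, 1 otherwise
  return "$missing"
}
```

For example, `check_vocab VOCAB && echo OK` prints `OK` only when all three entries are present.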
I create my vocabs with something like this:
echo -e '<unk>\n<s>\n</s>' > VOCAB
cat CORPUS \
| tr ' \r' '\n' \
| sort -u | grep . \
>> VOCAB
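A slightly more defensive variant of the pipeline above, offered as a sketch rather than a recommended recipe. It makes two assumptions that are not from the original post: that the corpus itself may contain the literal tokens `<unk>`, `<s>`, or `</s>` (which would otherwise be listed twice), and that forcing `LC_ALL=C` plus `grep -a` is desirable so that every tool compares raw bytes and does not treat oddly encoded lines as binary. The function name `build_vocab` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build a plain-text vocab with the special entries first.
# build_vocab is a hypothetical name; $1 is the corpus, $2 the output vocab.
build_vocab() (
  # Runs in a subshell, so the locale change does not leak to the caller.
  # The C locale makes tr, sort and grep work on raw bytes.
  export LC_ALL=C
  printf '%s\n' '<unk>' '<s>' '</s>' > "$2"
  tr ' \r' '\n\n' < "$1" \
    | sort -u \
    | grep -a . \
    | grep -a -Fxv -e '<unk>' -e '<s>' -e '</s>' \
    >> "$2"
)
```

Usage would be `build_vocab CORPUS VOCAB`. Note that the last `grep` exits non-zero when the corpus contributes no tokens at all, so an empty corpus is reported as a failure.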
I believe `marian-vocab` creates a JSON file. JSON is not a suitable format for representing vocabularies, because JSON implies UTF-8 encoding. That is problematic because
- I have found that corpora do contain invalid UTF-8, which causes malformed JSON files;
- Marian is encoding-agnostic.
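To find out whether a given corpus is affected by the first point, one option is to run it through `iconv`. This is a sketch that assumes `iconv` is installed and exits with a non-zero status when it meets an invalid byte sequence; the function name `is_valid_utf8` is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the file decodes cleanly as UTF-8.
# Assumes iconv exits non-zero on an invalid input sequence.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

For example, `is_valid_utf8 CORPUS || echo "CORPUS contains invalid UTF-8"`.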
@frankseide Thank you so much! You saved my day!