marian-dev
marian embed --compute-similarity errors out
Bug description
marian embed includes a --compute-similarity option. I assume that if
$MARIAN/marian embed -t data.ja -v vocab.ja.spm -m model.npz
works, then doubling up the test set and vocab (as hinted by the description of --compute-similarity):
$MARIAN/marian embed -t data.ja paraphrase.ja -v vocab.ja.spm vocab.ja.spm -m model.npz
should work too.
Instead I get
Error: Number of corpus files and vocab files does not agree
Am I doing something wrong?
Context
- Marian version: v1.11.0 f00d062 2022-02-08 08:39:24 -0800
- CMake command: cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_SENTENCEPIECE=ON -DCOMPILE_CPU=on -DUSE_STATIC_LIBS=on -DUSE_FBGEMM=on
- Full error log:
[2022-10-10 16:05:58] [marian] Marian v1.11.0 f00d062 2022-02-08 08:39:24 -0800
[2022-10-10 16:05:58] [marian] Running on host as process 737 with command line:
[2022-10-10 16:05:58] [marian] marian -t data.ja paraphrase.ja -v vocab.jp.spm vocab.jp.spm -m model.npz.best-translation.npz --compute-similarity
[2022-10-10 16:05:58] [config] authors: false
[2022-10-10 16:05:58] [config] bert-class-symbol: "[CLS]"
[2022-10-10 16:05:58] [config] bert-mask-symbol: "[MASK]"
[2022-10-10 16:05:58] [config] bert-masking-fraction: 0.15
[2022-10-10 16:05:58] [config] bert-sep-symbol: "[SEP]"
[2022-10-10 16:05:58] [config] bert-train-type-embeddings: true
[2022-10-10 16:05:58] [config] bert-type-vocab-size: 2
[2022-10-10 16:05:58] [config] best-deep: false
[2022-10-10 16:05:58] [config] binary: false
[2022-10-10 16:05:58] [config] build-info: ""
[2022-10-10 16:05:58] [config] check-nan: false
[2022-10-10 16:05:58] [config] cite: false
[2022-10-10 16:05:58] [config] compute-similarity: true
[2022-10-10 16:05:58] [config] cpu-threads: 0
[2022-10-10 16:05:58] [config] data-threads: 8
[2022-10-10 16:05:58] [config] dec-cell: gru
[2022-10-10 16:05:58] [config] dec-cell-base-depth: 2
[2022-10-10 16:05:58] [config] dec-cell-high-depth: 1
[2022-10-10 16:05:58] [config] dec-depth: 6
[2022-10-10 16:05:58] [config] devices:
[2022-10-10 16:05:58] [config] - 0
[2022-10-10 16:05:58] [config] dim-emb: 1024
[2022-10-10 16:05:58] [config] dim-rnn: 1024
[2022-10-10 16:05:58] [config] dim-vocabs:
[2022-10-10 16:05:58] [config] - 32000
[2022-10-10 16:05:58] [config] - 32000
[2022-10-10 16:05:58] [config] dump-config: ""
[2022-10-10 16:05:58] [config] enc-cell: gru
[2022-10-10 16:05:58] [config] enc-cell-depth: 1
[2022-10-10 16:05:58] [config] enc-depth: 6
[2022-10-10 16:05:58] [config] enc-type: bidirectional
[2022-10-10 16:05:58] [config] factors-combine: sum
[2022-10-10 16:05:58] [config] factors-dim-emb: 0
[2022-10-10 16:05:58] [config] ignore-model-config: false
[2022-10-10 16:05:58] [config] input-types:
[2022-10-10 16:05:58] [config] []
[2022-10-10 16:05:58] [config] interpolate-env-vars: false
[2022-10-10 16:05:58] [config] layer-normalization: false
[2022-10-10 16:05:58] [config] lemma-dependency: ""
[2022-10-10 16:05:58] [config] lemma-dim-emb: 0
[2022-10-10 16:05:58] [config] log: ""
[2022-10-10 16:05:58] [config] log-level: info
[2022-10-10 16:05:58] [config] log-time-zone: ""
[2022-10-10 16:05:58] [config] max-length: 1000
[2022-10-10 16:05:58] [config] max-length-crop: false
[2022-10-10 16:05:58] [config] maxi-batch: 100
[2022-10-10 16:05:58] [config] maxi-batch-sort: trg
[2022-10-10 16:05:58] [config] mini-batch: 64
[2022-10-10 16:05:58] [config] mini-batch-words: 0
[2022-10-10 16:05:58] [config] model: model.npz.best-translation.npz
[2022-10-10 16:05:58] [config] no-reload: false
[2022-10-10 16:05:58] [config] num-devices: 0
[2022-10-10 16:05:58] [config] output: stdout
[2022-10-10 16:05:58] [config] output-omit-bias: false
[2022-10-10 16:05:58] [config] precision:
[2022-10-10 16:05:58] [config] - float32
[2022-10-10 16:05:58] [config] quiet: false
[2022-10-10 16:05:58] [config] quiet-translation: false
[2022-10-10 16:05:58] [config] relative-paths: false
[2022-10-10 16:05:58] [config] right-left: false
[2022-10-10 16:05:58] [config] seed: 0
[2022-10-10 16:05:58] [config] skip: false
[2022-10-10 16:05:58] [config] tied-embeddings: true
[2022-10-10 16:05:58] [config] tied-embeddings-all: true
[2022-10-10 16:05:58] [config] tied-embeddings-src: false
[2022-10-10 16:05:58] [config] train-sets:
[2022-10-10 16:05:58] [config] - data.ja
[2022-10-10 16:05:58] [config] - paraphrase.ja
[2022-10-10 16:05:58] [config] transformer-aan-activation: swish
[2022-10-10 16:05:58] [config] transformer-aan-depth: 2
[2022-10-10 16:05:58] [config] transformer-aan-nogate: false
[2022-10-10 16:05:58] [config] transformer-decoder-autoreg: self-attention
[2022-10-10 16:05:58] [config] transformer-decoder-dim-ffn: 0
[2022-10-10 16:05:58] [config] transformer-decoder-ffn-depth: 0
[2022-10-10 16:05:58] [config] transformer-depth-scaling: false
[2022-10-10 16:05:58] [config] transformer-dim-aan: 2048
[2022-10-10 16:05:58] [config] transformer-dim-ffn: 4096
[2022-10-10 16:05:58] [config] transformer-ffn-activation: relu
[2022-10-10 16:05:58] [config] transformer-ffn-depth: 2
[2022-10-10 16:05:58] [config] transformer-guided-alignment-layer: last
[2022-10-10 16:05:58] [config] transformer-heads: 16
[2022-10-10 16:05:58] [config] transformer-no-projection: false
[2022-10-10 16:05:58] [config] transformer-pool: false
[2022-10-10 16:05:58] [config] transformer-postprocess: dan
[2022-10-10 16:05:58] [config] transformer-postprocess-emb: d
[2022-10-10 16:05:58] [config] transformer-postprocess-top: ""
[2022-10-10 16:05:58] [config] transformer-preprocess: ""
[2022-10-10 16:05:58] [config] transformer-tied-layers:
[2022-10-10 16:05:58] [config] []
[2022-10-10 16:05:58] [config] transformer-train-position-embeddings: false
[2022-10-10 16:05:58] [config] tsv: false
[2022-10-10 16:05:58] [config] tsv-fields: 0
[2022-10-10 16:05:58] [config] type: transformer
[2022-10-10 16:05:58] [config] ulr: false
[2022-10-10 16:05:58] [config] ulr-dim-emb: 0
[2022-10-10 16:05:58] [config] ulr-trainable-transformation: false
[2022-10-10 16:05:58] [config] version: v1.10.0 6f6d484 2021-02-06 15:35:16 -0800
[2022-10-10 16:05:58] [config] vocabs:
[2022-10-10 16:05:58] [config] - vocab.jp.spm
[2022-10-10 16:05:58] [config] - vocab.jp.spm
[2022-10-10 16:05:58] [config] workspace: 2048
[2022-10-10 16:05:58] [config] Loaded model has been created with Marian v1.10.0 6f6d484 2021-02-06 15:35:16 -0800
[2022-10-10 16:05:58] Error: Number of corpus files and vocab files does not agree
[2022-10-10 16:05:58] Error: Aborted from marian::data::CorpusBase::CorpusBase(marian::Ptr<marian::Options>, bool, size_t) in /data/smt/dev/marian-dev/src/data/corpus_base.cpp:105
[CALL STACK]
[0x5650f1e94669] marian::data::CorpusBase:: CorpusBase (std::shared_ptr<marian::Options>, bool, unsigned long) + 0x11d9
[0x5650f1ea7f3a] marian::data::Corpus:: Corpus (std::shared_ptr<marian::Options>, bool, unsigned long) + 0x6a
[0x5650f1d97034] marian::Embed<marian::Embedder>:: Embed (std::shared_ptr<marian::Options>) + 0x13d4
[0x5650f1c9ac7c] mainEmbedder (int, char**) + 0x9c
[0x5650f1b0e5a6] main + 0x106
[0x7fe7f4099083] __libc_start_main + 0xf3
[0x5650f1c963ee] _start + 0x2e
When I leave out one vocab or data file, it instead complains:
[2022-10-10 16:12:44] Error: There should be as many vocabularies as training files
[2022-10-10 16:12:44] Error: Aborted from void marian::ConfigValidator::validateOptionsParallelData() const in /data/smt/dev/marian-dev/src/common/config_validator.cpp:83
There is no further output apart from the stack trace.
Thanks a lot, Daniel
This comment: https://github.com/marian-nmt/marian-dev/blob/da6e30bfe3f12a05a74fda2737f31043afc94c18/src/embedder/embedder.h#L62..L63 suggests that the vocab is duplicated for the user. Have you maybe tried
$MARIAN/marian embed -t data.ja paraphrase.ja -v vocab.ja.spm -m model.npz --compute-similarity
?
Leaving one vocab out (with or without --compute-similarity) leads to:
[2023-02-07 14:58:23] Error: There should be as many vocabularies as training files
[2023-02-07 14:58:23] Error: Aborted from void marian::ConfigValidator::validateOptionsParallelData() const in /data/smt/dev/marian-dev/src/common/config_validator.cpp:84
[CALL STACK]
[0x5569146d3a1b] marian::ConfigValidator:: validateOptionsParallelData () const + 0xd6b
[0x5569146dbdd4] marian::ConfigValidator:: validateOptions (marian::cli::mode) const + 0x44
[0x5569146a7c7a] marian::ConfigParser:: parseOptions (int, char**, bool) + 0xaea
[0x556914694170] marian:: parseOptions (int, char**, marian::cli::mode, bool) + 0x50
[0x55691457e7d0] mainEmbedder (int, char**) + 0x30
[0x55691452dcf9] main + 0xf9
[0x7fbede24fc87] __libc_start_main + 0xe7
[0x556914578dca] _start + 0x2a
Aborted (core dumped)
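For anyone trying to follow why both invocations fail, here is a toy model of one ordering of the checks that would produce exactly the behaviour reported in this thread. It is not Marian's code. The function names are hypothetical, and the central assumption (that the embedder appends a copy of the vocab after option validation has already run, as the linked comment seems to suggest) has not been verified against the source. Only the two error strings are taken from the logs above.

```python
def validate_options(train_sets, vocabs):
    """Stand-in for the option-validation check that emits the
    'as many vocabularies as training files' error in the logs."""
    if len(vocabs) != len(train_sets):
        raise ValueError("There should be as many vocabularies as training files")


def prepare_embedder_vocabs(vocabs, compute_similarity):
    """ASSUMPTION: with --compute-similarity the last vocab is appended
    once more, after validation has already passed."""
    if compute_similarity:
        return list(vocabs) + [vocabs[-1]]
    return list(vocabs)


def build_corpus(train_sets, vocabs):
    """Stand-in for the corpus-level check that emits the
    'corpus files and vocab files' error in the logs."""
    if len(train_sets) != len(vocabs):
        raise ValueError("Number of corpus files and vocab files does not agree")


def run(train_sets, vocabs, compute_similarity=True):
    """Run the hypothetical pipeline; return 'ok' or the error message."""
    try:
        validate_options(train_sets, vocabs)
        effective = prepare_embedder_vocabs(vocabs, compute_similarity)
        build_corpus(train_sets, effective)
    except ValueError as err:
        return str(err)
    return "ok"


if __name__ == "__main__":
    two_sets = ["data.ja", "paraphrase.ja"]
    print(run(two_sets, ["vocab.ja.spm", "vocab.ja.spm"]))  # two vocabs
    print(run(two_sets, ["vocab.ja.spm"]))                  # one vocab
    print(run(["data.ja"], ["vocab.ja.spm"], compute_similarity=False))
```

Under that assumption, two vocabs pass validation but become three before the corpus is built, while one vocab never gets past validation, so no combination of -t and -v with two input files could succeed. If this is what actually happens, the fix would belong in Marian rather than in the command line.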