
marian embed --compute-similarity errors out

Open eltorre opened this issue 3 years ago • 2 comments

Bug description

marian embed includes a --compute-similarity option. I assume that if

$MARIAN/marian embed -t data.ja -v vocab.ja.spm -m model.npz

works, then doubling up the test set and vocab (as hinted at by the description of --compute-similarity):

$MARIAN/marian embed -t data.ja paraphrase.ja -v vocab.ja.spm vocab.ja.spm -m model.npz --compute-similarity

should work too.

Instead I get

Error: Number of corpus files and vocab files does not agree

Am I doing something wrong?

Context

  • Marian version: v1.11.0 f00d062 2022-02-08 08:39:24 -0800

  • CMake command: cmake .. -DCMAKE_BUILD_TYPE=Release
    -DUSE_SENTENCEPIECE=ON
    -DCOMPILE_CPU=on
    -DUSE_STATIC_LIBS=on
    -DUSE_FBGEMM=on

  • Full error log:

[2022-10-10 16:05:58] [marian] Marian v1.11.0 f00d062 2022-02-08 08:39:24 -0800 
[2022-10-10 16:05:58] [marian] Running on host as process 737 with command line:
[2022-10-10 16:05:58] [marian] marian -t data.ja paraphrase.ja  -v vocab.jp.spm vocab.jp.spm -m model.npz.best-translation.npz --compute-similarity
[2022-10-10 16:05:58] [config] authors: false          
[2022-10-10 16:05:58] [config] bert-class-symbol: "[CLS]"              
[2022-10-10 16:05:58] [config] bert-mask-symbol: "[MASK]"
[2022-10-10 16:05:58] [config] bert-masking-fraction: 0.15     
[2022-10-10 16:05:58] [config] bert-sep-symbol: "[SEP]"
[2022-10-10 16:05:58] [config] bert-train-type-embeddings: true
[2022-10-10 16:05:58] [config] bert-type-vocab-size: 2       
[2022-10-10 16:05:58] [config] best-deep: false               
[2022-10-10 16:05:58] [config] binary: false             
[2022-10-10 16:05:58] [config] build-info: ""          
[2022-10-10 16:05:58] [config] check-nan: false
[2022-10-10 16:05:58] [config] cite: false                                 
[2022-10-10 16:05:58] [config] compute-similarity: true
[2022-10-10 16:05:58] [config] cpu-threads: 0
[2022-10-10 16:05:58] [config] data-threads: 8  
[2022-10-10 16:05:58] [config] dec-cell: gru
[2022-10-10 16:05:58] [config] dec-cell-base-depth: 2
[2022-10-10 16:05:58] [config] dec-cell-high-depth: 1             
[2022-10-10 16:05:58] [config] dec-depth: 6
[2022-10-10 16:05:58] [config] devices:
[2022-10-10 16:05:58] [config]   - 0
[2022-10-10 16:05:58] [config] dim-emb: 1024
[2022-10-10 16:05:58] [config] dim-rnn: 1024
[2022-10-10 16:05:58] [config] dim-vocabs:
[2022-10-10 16:05:58] [config]   - 32000
[2022-10-10 16:05:58] [config]   - 32000
[2022-10-10 16:05:58] [config] dump-config: ""
[2022-10-10 16:05:58] [config] enc-cell: gru
[2022-10-10 16:05:58] [config] enc-cell-depth: 1
[2022-10-10 16:05:58] [config] enc-depth: 6
[2022-10-10 16:05:58] [config] enc-type: bidirectional
[2022-10-10 16:05:58] [config] factors-combine: sum
[2022-10-10 16:05:58] [config] factors-dim-emb: 0
[2022-10-10 16:05:58] [config] ignore-model-config: false
[2022-10-10 16:05:58] [config] input-types:
[2022-10-10 16:05:58] [config]   []
[2022-10-10 16:05:58] [config] interpolate-env-vars: false
[2022-10-10 16:05:58] [config] layer-normalization: false
[2022-10-10 16:05:58] [config] lemma-dependency: ""
[2022-10-10 16:05:58] [config] lemma-dim-emb: 0
[2022-10-10 16:05:58] [config] log: ""
[2022-10-10 16:05:58] [config] log-level: info
[2022-10-10 16:05:58] [config] log-time-zone: ""
[2022-10-10 16:05:58] [config] max-length: 1000
[2022-10-10 16:05:58] [config] max-length-crop: false
[2022-10-10 16:05:58] [config] maxi-batch: 100
[2022-10-10 16:05:58] [config] maxi-batch-sort: trg
[2022-10-10 16:05:58] [config] mini-batch: 64
[2022-10-10 16:05:58] [config] mini-batch-words: 0
[2022-10-10 16:05:58] [config] model: model.npz.best-translation.npz
[2022-10-10 16:05:58] [config] no-reload: false
[2022-10-10 16:05:58] [config] num-devices: 0
[2022-10-10 16:05:58] [config] output: stdout
[2022-10-10 16:05:58] [config] output-omit-bias: false
[2022-10-10 16:05:58] [config] precision:
[2022-10-10 16:05:58] [config]   - float32
[2022-10-10 16:05:58] [config] quiet: false
[2022-10-10 16:05:58] [config] quiet-translation: false
[2022-10-10 16:05:58] [config] relative-paths: false
[2022-10-10 16:05:58] [config] right-left: false
[2022-10-10 16:05:58] [config] seed: 0
[2022-10-10 16:05:58] [config] skip: false
[2022-10-10 16:05:58] [config] tied-embeddings: true
[2022-10-10 16:05:58] [config] tied-embeddings-all: true
[2022-10-10 16:05:58] [config] tied-embeddings-src: false
[2022-10-10 16:05:58] [config] train-sets:
[2022-10-10 16:05:58] [config]   - data.ja 
[2022-10-10 16:05:58] [config]   - paraphrase.ja 
[2022-10-10 16:05:58] [config] transformer-aan-activation: swish
[2022-10-10 16:05:58] [config] transformer-aan-depth: 2
[2022-10-10 16:05:58] [config] transformer-aan-nogate: false
[2022-10-10 16:05:58] [config] transformer-decoder-autoreg: self-attention
[2022-10-10 16:05:58] [config] transformer-decoder-dim-ffn: 0
[2022-10-10 16:05:58] [config] transformer-decoder-ffn-depth: 0
[2022-10-10 16:05:58] [config] transformer-depth-scaling: false
[2022-10-10 16:05:58] [config] transformer-dim-aan: 2048
[2022-10-10 16:05:58] [config] transformer-dim-ffn: 4096
[2022-10-10 16:05:58] [config] transformer-ffn-activation: relu
[2022-10-10 16:05:58] [config] transformer-ffn-depth: 2
[2022-10-10 16:05:58] [config] transformer-guided-alignment-layer: last
[2022-10-10 16:05:58] [config] transformer-heads: 16
[2022-10-10 16:05:58] [config] transformer-no-projection: false
[2022-10-10 16:05:58] [config] transformer-pool: false
[2022-10-10 16:05:58] [config] transformer-postprocess: dan
[2022-10-10 16:05:58] [config] transformer-postprocess-emb: d
[2022-10-10 16:05:58] [config] transformer-postprocess-top: ""
[2022-10-10 16:05:58] [config] transformer-preprocess: ""
[2022-10-10 16:05:58] [config] transformer-tied-layers:
[2022-10-10 16:05:58] [config]   []
[2022-10-10 16:05:58] [config] transformer-train-position-embeddings: false
[2022-10-10 16:05:58] [config] tsv: false
[2022-10-10 16:05:58] [config] tsv-fields: 0
[2022-10-10 16:05:58] [config] type: transformer
[2022-10-10 16:05:58] [config] ulr: false
[2022-10-10 16:05:58] [config] ulr-dim-emb: 0
[2022-10-10 16:05:58] [config] ulr-trainable-transformation: false
[2022-10-10 16:05:58] [config] version: v1.10.0 6f6d484 2021-02-06 15:35:16 -0800
[2022-10-10 16:05:58] [config] vocabs:
[2022-10-10 16:05:58] [config]   - vocab.jp.spm
[2022-10-10 16:05:58] [config]   - vocab.jp.spm
[2022-10-10 16:05:58] [config] workspace: 2048
[2022-10-10 16:05:58] [config] Loaded model has been created with Marian v1.10.0 6f6d484 2021-02-06 15:35:16 -0800
[2022-10-10 16:05:58] Error: Number of corpus files and vocab files does not agree
[2022-10-10 16:05:58] Error: Aborted from marian::data::CorpusBase::CorpusBase(marian::Ptr<marian::Options>, bool, size_t) in /data/smt/dev/marian-dev/src/data/corpus_base.cpp:105

[CALL STACK]
[0x5650f1e94669]    marian::data::CorpusBase::  CorpusBase  (std::shared_ptr<marian::Options>,  bool,  unsigned long) + 0x11d9
[0x5650f1ea7f3a]    marian::data::Corpus::  Corpus  (std::shared_ptr<marian::Options>,  bool,  unsigned long) + 0x6a
[0x5650f1d97034]    marian::Embed<marian::Embedder>::  Embed  (std::shared_ptr<marian::Options>) + 0x13d4
[0x5650f1c9ac7c]    mainEmbedder  (int,  char**)                       + 0x9c
[0x5650f1b0e5a6]    main                                               + 0x106
[0x7fe7f4099083]    __libc_start_main                                  + 0xf3
[0x5650f1c963ee]    _start                                             + 0x2e

When I leave one vocab or data file out, it instead complains

[2022-10-10 16:12:44] Error: There should be as many vocabularies as training files
[2022-10-10 16:12:44] Error: Aborted from void marian::ConfigValidator::validateOptionsParallelData() const in /data/smt/dev/marian-dev/src/common/config_validator.cpp:83

There is no more output apart from the stack trace.

Thanks a lot, Daniel

eltorre avatar Oct 10 '22 14:10 eltorre

This comment: https://github.com/marian-nmt/marian-dev/blob/da6e30bfe3f12a05a74fda2737f31043afc94c18/src/embedder/embedder.h#L62..L63 suggests that the vocab is duplicated for the user. Have you maybe tried $MARIAN/marian embed -t data.ja paraphrase.ja -v vocab.ja.spm -m model.npz --compute-similarity?

snukky avatar Jan 17 '23 18:01 snukky

Leaving one vocab out (regardless of having --compute-similarity) leads to:

[2023-02-07 14:58:23] Error: There should be as many vocabularies as training files
[2023-02-07 14:58:23] Error: Aborted from void marian::ConfigValidator::validateOptionsParallelData() const in /data/smt/dev/marian-dev/src/common/config_validator.cpp:84

[CALL STACK]
[0x5569146d3a1b]    marian::ConfigValidator::  validateOptionsParallelData  () const + 0xd6b
[0x5569146dbdd4]    marian::ConfigValidator::  validateOptions  (marian::cli::mode) const + 0x44
[0x5569146a7c7a]    marian::ConfigParser::  parseOptions  (int,  char**,  bool) + 0xaea
[0x556914694170]    marian::  parseOptions  (int,  char**,  marian::cli::mode,  bool) + 0x50
[0x55691457e7d0]    mainEmbedder  (int,  char**)                       + 0x30
[0x55691452dcf9]    main                                               + 0xf9
[0x7fbede24fc87]    __libc_start_main                                  + 0xe7
[0x556914578dca]    _start                                             + 0x2a

Aborted (core dumped)

eltorre avatar Feb 07 '23 14:02 eltorre