
How to get embeddings with a finetuned model [encoder-only]

Open liyouli666 opened this issue 11 months ago • 1 comment

I fine-tuned 'BAAI/bge-m3' with the following script:

nohup torchrun --nproc_per_node 8 \
    --master_port 29505 \
    -m FlagEmbedding.finetune.embedder.encoder_only.m3 \
    --model_name_or_path ../BAAI/bge-m3 \
    --cache_dir ../cache/model \
    --train_data ../general_train_data/mini-nq-like-general-train \
    --cache_path ../cache/data \
    --train_group_size 8 \
    --query_max_len 512 \
    --passage_max_len 512 \
    --pad_to_multiple_of 8 \
    --knowledge_distillation False \
    --same_dataset_within_batch True \
    --small_threshold 0 \
    --drop_threshold 0 \
    --output_dir ../test_encoder_only_m3_bge-m3_sd \
    --overwrite_output_dir \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 2 \
    --dataloader_drop_last True \
    --warmup_ratio 0.1 \
    --gradient_checkpointing \
    --deepspeed ds_stage0.json \
    --logging_steps 1 \
    --save_steps 5000 \
    --negatives_cross_device \
    --temperature 0.02 \
    --sentence_pooling_method cls \
    --normalize_embeddings True \
    --kd_loss_type m3_kd_loss \
    --unified_finetuning True \
    --use_self_distill True \
    --fix_encoder False \
    --self_distill_start_step 0 > finetune.log 2>&1 &

Then I got the saved model in checkpoint-20000:

ls -lrt
total 1.1G
-rw-r--r-- 1 root root  701 Jan 17 19:03 config.json
-rw-r--r-- 1 root root 1.1G Jan 17 19:04 model.safetensors
-rw-r--r-- 1 root root 1.2K Jan 17 19:04 tokenizer_config.json
-rw-r--r-- 1 root root  964 Jan 17 19:04 special_tokens_map.json
-rw-r--r-- 1 root root 3.0K Jan 17 19:04 sparse_linear.pt
-rw-r--r-- 1 root root 4.9M Jan 17 19:04 sentencepiece.bpe.model
-rw-r--r-- 1 root root 2.1M Jan 17 19:04 colbert_linear.pt
-rw-r--r-- 1 root root 7.0K Jan 17 19:04 training_args.bin
-rw-r--r-- 1 root root  17M Jan 17 19:04 tokenizer.json
drwxrwxrwx 3 root root 4.0K Jan 17 19:04 global_step20000/
-rw-r--r-- 1 root root  22K Jan 17 19:04 rng_state_5.pth
-rw-r--r-- 1 root root  22K Jan 17 19:04 rng_state_0.pth
-rw-r--r-- 1 root root   16 Jan 17 19:04 latest
-rw-r--r-- 1 root root 3.4M Jan 17 19:04 trainer_state.json
-rw-r--r-- 1 root root  22K Jan 17 19:04 rng_state_7.pth
-rw-r--r-- 1 root root  22K Jan 17 19:04 rng_state_6.pth
-rw-r--r-- 1 root root  22K Jan 17 19:04 rng_state_4.pth
-rw-r--r-- 1 root root  22K Jan 17 19:04 rng_state_3.pth
-rw-r--r-- 1 root root  22K Jan 17 19:04 rng_state_2.pth
-rw-r--r-- 1 root root  22K Jan 17 19:04 rng_state_1.pth
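
(A note for later readers: the global_step20000/ directory, the latest file, the rng_state_*.pth files, trainer_state.json, and training_args.bin are Trainer/DeepSpeed resume state and are not needed for inference. The remaining files should mirror the upstream BAAI/bge-m3 layout; sparse_linear.pt and colbert_linear.pt are the sparse and multi-vector heads that --unified_finetuning True trains alongside the dense encoder.)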

The saved model looks quite different from 'BAAI/bge-m3', and loading it gave me many errors. I also tried the save_ckpt_for_sentence_transformers method, but got the same error:

Traceback (most recent call last):
  File "/root/paddlejob/workspace/env_run/liuli/FlagEmbedding/to_sentence_transformer_model.py", line 19, in <module>
    save_ckpt_for_sentence_transformers(ckpt_dir, pooling_mode='cls', normlized=True)
  File "/root/paddlejob/workspace/env_run/liuli/FlagEmbedding/to_sentence_transformer_model.py", line 6, in save_ckpt_for_sentence_transformers
    word_embedding_model = models.Transformer(ckpt_dir)
  File "/root/.local/virtualenvs/xxx/lib/python3.9/site-packages/sentence_transformers/models/Transformer.py", line 78, in __init__
    self._load_model(model_name_or_path, config, cache_dir, backend, **model_args)
  File "/root/.local/virtualenvs/xxx/lib/python3.9/site-packages/sentence_transformers/models/Transformer.py", line 138, in _load_model
    self.auto_model = AutoModel.from_pretrained(
  File "/root/.local/virtualenvs/xxx/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/root/.local/virtualenvs/xxx/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3735, in from_pretrained
    with safe_open(resolved_archive_file, framework="pt") as f:
OSError: No such device (os error 19)

I have no idea how to run inference with the finetuned model. Can you help me?

liyouli666 avatar Jan 20 '25 08:01 liyouli666
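
For context, the conversion script in the traceback appears to follow the save_ckpt_for_sentence_transformers helper from the FlagEmbedding examples. Below is a minimal sketch of what it likely contains, reconstructed from the traceback (including the 'normlized' spelling); it is an approximation, not the user's exact file.

from sentence_transformers import SentenceTransformer, models

def save_ckpt_for_sentence_transformers(ckpt_dir, pooling_mode="cls", normlized=True):
    # Wrap the HF checkpoint as a sentence-transformers model:
    # Transformer -> pooling -> optional L2 normalization.
    word_embedding_model = models.Transformer(ckpt_dir)
    pooling_model = models.Pooling(
        word_embedding_model.get_word_embedding_dimension(),
        pooling_mode=pooling_mode,
    )
    modules = [word_embedding_model, pooling_model]
    if normlized:
        modules.append(models.Normalize())
    SentenceTransformer(modules=modules, device="cpu").save(ckpt_dir)

ckpt_dir = "../test_encoder_only_m3_bge-m3_sd/checkpoint-20000"
save_ckpt_for_sentence_transformers(ckpt_dir, pooling_mode="cls", normlized=True)

The script itself is not at fault here; as the follow-up below explains, the OSError most likely comes from safetensors trying to memory-map the weights from a filesystem that does not support it.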

I solved this problem later. During training the machine was short on local disk space, so the model was saved to a mounted AFS directory; the model cannot be read directly from AFS, and copying it from the AFS directory onto local disk fixed it.

For anyone who hits this error: copy the model onto your local disk and the problem will be fixed.
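
A minimal sketch of the fix for later readers, assuming FlagEmbedding is installed. The AFS mount path below is hypothetical; substitute your own paths.

import shutil

from FlagEmbedding import BGEM3FlagModel

# Hypothetical paths -- adjust to your own mount and scratch disk.
afs_ckpt = "/mnt/afs/test_encoder_only_m3_bge-m3_sd/checkpoint-20000"
local_ckpt = "/tmp/checkpoint-20000"

# safetensors memory-maps model.safetensors, which can fail on network/FUSE
# filesystems with "OSError: No such device", so copy to local disk first.
shutil.copytree(afs_ckpt, local_ckpt, dirs_exist_ok=True)

# Load the local copy; BGEM3FlagModel picks up sparse_linear.pt and
# colbert_linear.pt from the checkpoint directory automatically.
model = BGEM3FlagModel(local_ckpt, use_fp16=True)

output = model.encode(
    ["What is BGE-M3?", "BM25 is a bag-of-words retrieval function."],
    return_dense=True,         # dense CLS embeddings
    return_sparse=True,        # lexical weights
    return_colbert_vecs=True,  # multi-vector (ColBERT) output
)
print(output["dense_vecs"].shape)  # (2, 1024) for bge-m3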

liyouli666 avatar Jan 20 '25 11:01 liyouli666