When running the official finetuning example on a single GPU with the bge-base-zh-v1.5 model, why is `'grad_norm'` always nan? The command:
```shell
torchrun --nproc_per_node 1 \
    -m FlagEmbedding.finetune.embedder.encoder_only.base \
    --model_name_or_path /root/.cache/modelscope/hub/models/BAAI/bge-base-zh-v1.5 \
    --cache_dir ./cache/model \
    --train_data ./examples/finetune/embedder/example_data/retrieval \
    --cache_path ./cache/data \
    --train_group_size 8 \
    --query_max_len 512 \
    --passage_max_len 512 \
    --pad_to_multiple_of 8 \
    --query_instruction_for_retrieval 'Represent this sentence for searching relevant passages: ' \
    --query_instruction_format '{}{}' \
    --knowledge_distillation False \
    --output_dir ./test_encoder_only_base_bge-large-en-v1.5 \
    --overwrite_output_dir \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs 2 \
    --dataloader_drop_last True \
    --warmup_ratio 0.1 \
    --logging_steps 1 \
    --save_steps 1000 \
    --negatives_cross_device \
    --temperature 0.02 \
    --sentence_pooling_method cls \
    --normalize_embeddings True \
    --kd_loss_type kl_div
```
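One common cause of a permanently-nan `grad_norm` is fp16 overflow: with `--temperature 0.02` the cosine similarities are divided by 0.02 (i.e. scaled by 50) before the contrastive softmax, which can push values outside fp16 range. A quick diagnostic (my suggestion, not from the repo docs — it assumes the script accepts the standard HF `TrainingArguments` precision flags) is to change the mixed-precision setting:

```shell
# Assumption: the finetune script forwards flags to HF TrainingArguments,
# so --bf16 should be accepted in place of --fp16 (bf16 needs Ampere+ GPUs).
# bf16 keeps fp32's exponent range, so the temperature-scaled logits
# (similarity / 0.02) are much less likely to overflow to inf/nan.
#
#   --fp16          # remove this flag
#   --bf16          # add this instead; or drop both to train in full fp32
#                   # and check whether grad_norm recovers
```

If `grad_norm` becomes finite in bf16 or fp32, the issue is fp16 overflow rather than the data or the model.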
Also, adding `--gradient_checkpointing` raises: `RuntimeError: Expected to mark a variable ready only once. Parameter at index 195 with name model.encoder.layer.11.output.LayerNorm.weight has been marked as ready twice.` Does this mean the flag cannot be used for single-GPU finetuning?
Update: `--gradient_checkpointing` only stops erroring when paired with a DeepSpeed config.
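For reference, a minimal DeepSpeed config along those lines might look like the sketch below. This is an assumption, not taken from the repo: the filename `ds_config.json` and ZeRO stage 1 are my choices, and the `"auto"` values are placeholders that the HF Trainer's DeepSpeed integration fills in from the training arguments at launch time.

```shell
# Write a minimal, hypothetical DeepSpeed config. Values set to "auto"
# are resolved by the HF Trainer DeepSpeed integration, so they stay
# consistent with the torchrun flags (batch size, fp16, clipping).
cat > ds_config.json <<'EOF'
{
  "zero_optimization": { "stage": 1 },
  "fp16": { "enabled": "auto" },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
EOF
# Then append to the torchrun command above:
#   --deepspeed ds_config.json --gradient_checkpointing
```

Under DeepSpeed, gradient reduction is handled by its own engine rather than by `DistributedDataParallel`, which is why the "marked as ready twice" error (a known DDP interaction with reentrant activation checkpointing) no longer triggers.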