FlagEmbedding
torch.distributed.elastic.multiprocessing.errors.ChildFailedError
E1119 08:26:02.715000 28705 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: -11) local_rank: 0 (pid: 28773) of binary: /app/anaconda3/envs/py312/bin/python3.12
Traceback (most recent call last):
File "/app/anaconda3/envs/py312/bin/torchrun", line 7, in <module>
sys.exit(main())
^^^^^^
File "/app/anaconda3/envs/py312/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/app/anaconda3/envs/py312/lib/python3.12/site-packages/torch/distributed/run.py", line 936, in main
run(args)
File "/app/anaconda3/envs/py312/lib/python3.12/site-packages/torch/distributed/run.py", line 927, in run
elastic_launch(
File "/app/anaconda3/envs/py312/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/anaconda3/envs/py312/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
FlagEmbedding.finetune.reranker.encoder_only.base FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-11-19_08:26:02
host : search-api
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 28773)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 28773
========================================================
[HAMI-core Msg(28705:140271345607168:multiprocess_memory_limit.c:468)]: Calling exit handler 28705
The error appears after only a few minutes of training.
Training script:
OUTPUT_DIR="./output_bge_reranker_v2_m3"
# Launch training with torchrun (invoking the FlagEmbedding module directly)
echo "Starting training..."
torchrun --nproc_per_node 1 --master_port 29500 \
-m FlagEmbedding.finetune.reranker.encoder_only.base \
--model_name_or_path ${MODEL_PATH} \
--cache_dir ./cache/model \
--train_data ${TRAIN_DATA} \
--cache_path ./cache/data \
--train_group_size 4 \
--query_max_len 1024 \
--passage_max_len 1024 \
--pad_to_multiple_of 8 \
--knowledge_distillation False \
--output_dir ${OUTPUT_DIR} \
--overwrite_output_dir \
--learning_rate 6e-5 \
--num_train_epochs 2 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 4 \
--dataloader_drop_last True \
--dataloader_num_workers 0 \
--dataloader_pin_memory False \
--warmup_ratio 0.1 \
--weight_decay 0.01 \
--logging_steps 10 \
--save_steps 1000 \
--save_total_limit 2 \
--ddp_find_unused_parameters False
echo "Training complete! Model saved to: ${OUTPUT_DIR}"
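Not part of the original report, but a debugging aid for readers hitting the same crash: exitcode -11 means the worker process died from SIGSEGV, so the Python traceback is lost. A minimal sketch using the standard-library `faulthandler` (or, equivalently, exporting `PYTHONFAULTHANDLER=1` before `torchrun`) makes the interpreter dump a Python-level stack trace when the fatal signal arrives, which usually narrows the crash down to a specific extension call:

```python
import faulthandler
import sys

# Dump the Python traceback of all threads to stderr when a fatal
# signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS) is received. Placing this
# at the top of the training entry point costs nothing at runtime.
faulthandler.enable(file=sys.stderr, all_threads=True)

print(faulthandler.is_enabled())  # → True
```

The non-invasive alternative is `PYTHONFAULTHANDLER=1 torchrun ...`, which enables the same handler without touching FlagEmbedding's code.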