FlagEmbedding
torch.distributed.elastic.multiprocessing.errors.ChildFailedError
When finetuning bge-large-en-v1.5, training fails with the following traceback:
Traceback (most recent call last):
  File "/opt/homebrew/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
FlagEmbedding.baai_general_embedding.finetune.run FAILED
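
Note that ChildFailedError is only the elastic launcher's summary: it means a worker process died, and the worker's actual exception is printed above this summary (or lost entirely if the child crashed hard). The traceback comes from torchrun even though the command below starts with python3 -m, so the module was presumably launched through torchrun. To surface the real error, a single standalone worker usually helps; this is a sketch assuming the same flags as the command below, where <args> is a placeholder for those flags, not a real option:

# Run one standalone worker so the child's own traceback is printed,
# not just the launcher's FAILED summary. <args> = the flags below.
torchrun --standalone --nproc_per_node=1 \
    -m FlagEmbedding.baai_general_embedding.finetune.run <args>

# Or bypass torch.distributed.elastic entirely:
python3 -m FlagEmbedding.baai_general_embedding.finetune.run <args>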
Here's what I ran:
python3 -m FlagEmbedding.baai_general_embedding.finetune.run \
    --output_dir MODEL_PATH \
    --model_name_or_path BAAI/bge-large-zh-v1.5 \
    --train_data ./toy_finetune_data.jsonl \
    --learning_rate 1e-5 \
    --bf16 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --dataloader_drop_last True \
    --normlized True \
    --temperature 0.02 \
    --query_max_len 64 \
    --passage_max_len 256 \
    --train_group_size 2 \
    --negatives_cross_device False \
    --logging_steps 10 \
    --save_steps 1000 \
    --query_instruction_for_retrieval ""
Environment: macOS (Apple Silicon), transformers 4.42.4, torch 2.5.0
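
On Apple Silicon there is no CUDA, so NCCL (the backend most multi-GPU finetuning scripts initialize by default) is never available; a worker that tries to create a "nccl" process group dies at startup, and the launcher then reports exactly this kind of ChildFailedError. That is my guess at the cause, not something visible in the truncated log; here is a quick, FlagEmbedding-independent check:

python3 - <<'PY'
import torch
import torch.distributed as dist
print("torch:          ", torch.__version__)
print("cuda available: ", torch.cuda.is_available())          # False on Apple Silicon
print("mps available:  ", torch.backends.mps.is_available())  # True on recent macOS
print("nccl available: ", dist.is_nccl_available())           # False without NVIDIA GPUs
print("gloo available: ", dist.is_gloo_available())           # gloo is the CPU fallback
PY

If nccl comes back False, forcing the gloo backend (or running single-process without torchrun) is the usual workaround, and --bf16 may also need to be dropped depending on what the fallback device supports.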
Has this been resolved?