
About BGE-M3 finetune Distr

zhangbin1997 opened this issue 6 months ago · 1 comment

My fine-tuning command is based on the example provided in this repository: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune

Fine-tuning command:

```bash
export CUDA_VISIBLE_DEVICES=0,1

torchrun --nproc_per_node 2 \
  -m FlagEmbedding.BGE_M3.run \
  --output_dir /output \
  --model_name_or_path /embedding_model/bge-m3 \
  --train_data /test_1k.jsonl \
  --learning_rate 1e-5 \
  --fp16 \
  --num_train_epochs 5 \
  --per_device_train_batch_size 1 \
  --dataloader_drop_last True \
  --normlized True \
  --temperature 0.02 \
  --query_max_len 64 \
  --passage_max_len 256 \
  --train_group_size 2 \
  --negatives_cross_device \
  --logging_steps 10 \
  --same_task_within_batch True \
  --unified_finetuning True \
  --use_self_distill True
```

The reported error is:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/FlagEmbedding/BGE_M3/run.py", line 155, in <module>
    main()
  File "/usr/local/lib/python3.10/site-packages/FlagEmbedding/BGE_M3/run.py", line 90, in main
    model = BGEM3Model(model_name=model_args.model_name_or_path,
  File "/usr/local/lib/python3.10/site-packages/FlagEmbedding/BGE_M3/modeling.py", line 64, in __init__
    raise ValueError('Distributed training has not been initialized for representation all gather.')
ValueError: Distributed training has not been initialized for representation all gather.

(the same traceback is printed by the second rank)

I0000 00:00:1723812274.944134 1478684 coordination_service_agent.cc:472] Coordination agent has initiated Shutdown().
I0000 00:00:1723812274.944391 1478683 coordination_service_agent.cc:472] Coordination agent has initiated Shutdown().
I0000 00:00:1723812274.947102 1478684 coordination_service_agent.cc:491] Coordination agent has successfully shut down.
I0000 00:00:1723812274.947138 1478683 coordination_service_agent.cc:491] Coordination agent has successfully shut down.

[2024-08-16 20:44:39,202] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1478683) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.0', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
FlagEmbedding.BGE_M3.run FAILED
------------------------------------------------------------
Failures:
[1]:
  time       : 2024-08-16_20:44:39
  host       : dsw-86547-76c68555fc-22vfb
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 1478684)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-08-16_20:44:39
  host       : dsw-86547-76c68555fc-22vfb
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 1478683)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
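For reference, since the error says the distributed setup was never initialized even though the job was launched with torchrun, a minimal probe script run with the same launcher flags can confirm whether the process group comes up at all. This is only an illustrative sketch: the file name `check_dist.py` and the backend choice are assumptions, not something from this repo.

```python
# check_dist.py -- minimal probe, run with:
#   torchrun --nproc_per_node 2 check_dist.py
# (file name and backend choice are illustrative assumptions)
import os

import torch
import torch.distributed as dist

# torchrun is expected to export these for every worker process
for key in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(key, "=", os.environ.get(key))

# Bind each process to its own GPU before NCCL init
if torch.cuda.is_available():
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

# This is exactly the call the FlagEmbedding error complains is missing
dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
print(f"initialized: rank {dist.get_rank()} / world size {dist.get_world_size()}")
dist.destroy_process_group()
```

If this probe prints both ranks correctly, the launcher environment is fine and the problem is on the training-script side; if it hangs or the environment variables are empty, the torchrun setup itself is broken.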

It looks like a distributed-training problem. I also tried removing `--negatives_cross_device`, as suggested in another issue, but then the error changed to:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/FlagEmbedding/BGE_M3/run.py", line 155, in <module>
    main()
  File "/usr/local/lib/python3.10/site-packages/FlagEmbedding/BGE_M3/run.py", line 115, in main
    train_dataset = SameDatasetTrainDataset(args=data_args,
  File "/usr/local/lib/python3.10/site-packages/FlagEmbedding/BGE_M3/data.py", line 43, in __init__
    if dist.get_rank() == 0:
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1532, in get_rank
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 977, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

(the same traceback is printed by the second rank)

I0000 00:00:1723812568.317028 1481316 coordination_service_agent.cc:472] Coordination agent has initiated Shutdown().
I0000 00:00:1723812568.338617 1481315 coordination_service_agent.cc:472] Coordination agent has initiated Shutdown().
I0000 00:00:1723812568.341378 1481315 coordination_service_agent.cc:491] Coordination agent has successfully shut down.
I0000 00:00:1723812568.341390 1481316 coordination_service_agent.cc:491] Coordination agent has successfully shut down.

[2024-08-16 20:49:32,658] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1481315) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.0', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
FlagEmbedding.BGE_M3.run FAILED
------------------------------------------------------------
Failures:
[1]:
  time       : 2024-08-16_20:49:32
  host       : dsw-86547-76c68555fc-22vfb
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 1481316)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-08-16_20:49:32
  host       : dsw-86547-76c68555fc-22vfb
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 1481315)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
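This second message literally asks for `init_process_group` to be called before anything uses `dist.get_rank()`. For illustration only (this is not the FlagEmbedding code and not a confirmed fix, just a sketch of what the error is asking for, with the `env://` init method assumed), the usual guard looks like:

```python
import os

import torch
import torch.distributed as dist


def ensure_distributed_initialized():
    """Initialize the default process group from torchrun's environment
    variables if it has not been initialized yet (illustrative sketch)."""
    if dist.is_available() and not dist.is_initialized():
        dist.init_process_group(
            backend="nccl" if torch.cuda.is_available() else "gloo",
            init_method="env://",  # reads RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT
        )
        if torch.cuda.is_available():
            torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))


ensure_distributed_initialized()
print("rank:", dist.get_rank())
```

If such a guard runs before the dataset and model are built, `dist.get_rank()` no longer raises; but since torchrun is already supposed to trigger this initialization, the underlying question remains why it did not happen in my runs.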

Thanks, looking forward to a reply!

zhangbin1997 · Aug 16, 2024