Finetune script fails on macOS with an M4 chip
Hi team,
I'm trying to finetune bge-m3 on my Mac with an M4 chip.
The command I'm running is:
```bash
torchrun --nproc_per_node 1 \
    -m FlagEmbedding.finetune.embedder.encoder_only.m3 \
    --model_name_or_path /Users/nc/python/FlagNew/models/bge-m3 \
    --train_data /Users/nc/python/FlagNew/examples/finetune/embedder/example_data/sts/sts.jsonl \
    --train_group_size 4 \
    --query_max_len 128 \
    --passage_max_len 1024 \
    --pad_to_multiple_of 8 \
    --knowledge_distillation True \
    --same_dataset_within_batch True \
    --small_threshold 0 \
    --drop_threshold 0 \
    --output_dir /Users/nc/python/FlagNew/models/output \
    --overwrite_output_dir \
    --learning_rate 1e-5 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 2 \
    --dataloader_drop_last True \
    --warmup_ratio 0.1 \
    --gradient_checkpointing \
    --deepspeed /Users/nc/python/FlagNew/examples/finetune/ds_stage0.json \
    --logging_steps 1 \
    --save_steps 1000 \
    --negatives_cross_device \
    --temperature 0.02 \
    --sentence_pooling_method cls \
    --normalize_embeddings True \
    --kd_loss_type m3_kd_loss \
    --unified_finetuning True \
    --use_self_distill True \
    --fix_encoder False \
    --self_distill_start_step 0
```
But the script fails with the following error:

```
W0505 20:46:13.911000 19544 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
[2025-05-05 20:46:18,817] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to mps (auto detect)
W0505 20:46:19.067000 19555 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
[2025-05-05 20:46:19,181] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-05-05 20:46:19,181] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend gloo
[W505 20:46:19.158823000 ProcessGroupGloo.cpp:760] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/accelerate/state.py:262: UserWarning: OMP_NUM_THREADS/MKL_NUM_THREADS unset, we set it at 12 to improve oob performance.
warnings.warn(
05/05/2025 20:46:19 - WARNING - FlagEmbedding.abc.finetune.embedder.AbsRunner - Process rank: 0, device: cpu:0, n_gpu: 1, distributed training: True, local rank: 0, 16-bits training: False
05/05/2025 20:46:19 - INFO - FlagEmbedding.abc.finetune.embedder.AbsRunner - Model parameters EncoderOnlyEmbedderM3ModelArguments(model_name_or_path='/Users/nc/python/FlagNew/models/bge-m3', config_name=None, tokenizer_name=None, cache_dir=None, trust_remote_code=False, token=None, colbert_dim=-1)
05/05/2025 20:46:19 - INFO - FlagEmbedding.abc.finetune.embedder.AbsRunner - Data parameters AbsEmbedderDataArguments(train_data=['/Users/nc/python/FlagNew/examples/finetune/embedder/example_data/sts/sts.jsonl'], cache_path=None, train_group_size=4, query_max_len=128, passage_max_len=1024, pad_to_multiple_of=8, max_example_num_per_dataset=100000000, query_instruction_for_retrieval=None, query_instruction_format='{}{}', knowledge_distillation=True, passage_instruction_for_retrieval=None, passage_instruction_format='{}{}', shuffle_ratio=0.0, same_dataset_within_batch=True, small_threshold=0, drop_threshold=0)
05/05/2025 20:46:19 - INFO - FlagEmbedding.finetune.embedder.encoder_only.m3.runner - Config: XLMRobertaConfig {
  "architectures": [
    "XLMRobertaModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 8194,
  "model_type": "xlm-roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.51.3",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}
05/05/2025 20:46:20 - INFO - FlagEmbedding.finetune.embedder.encoder_only.m3.runner - loading existing colbert_linear and sparse_linear---------
05/05/2025 20:46:20 - INFO - FlagEmbedding.abc.finetune.embedder.AbsDataset - loading data from /Users/nc/python/FlagNew/examples/finetune/embedder/example_data/sts/sts.jsonl ...
05/05/2025 20:46:21 - INFO - FlagEmbedding.abc.finetune.embedder.AbsDataset - -- Rank 0: refresh data --
/Users/nc/python/FlagNew/FlagEmbedding/finetune/embedder/encoder_only/m3/runner.py:161: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for EncoderOnlyEmbedderM3Trainer.__init__. Use processing_class instead.
trainer = EncoderOnlyEmbedderM3Trainer(
[rank0]: Traceback (most recent call last):
[rank0]: File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/Users/nc/python/FlagNew/FlagEmbedding/finetune/embedder/encoder_only/m3/main.py", line 26, in <module>
[rank0]: main()
[rank0]: File "/Users/nc/python/FlagNew/FlagEmbedding/finetune/embedder/encoder_only/m3/main.py", line 22, in main
[rank0]: runner.run()
[rank0]: File "/Users/nc/python/FlagNew/FlagEmbedding/abc/finetune/embedder/AbsRunner.py", line 150, in run
[rank0]: self.trainer.train(resume_from_checkpoint=self.training_args.resume_from_checkpoint)
[rank0]: File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank0]: return inner_training_loop(
[rank0]: File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/transformers/trainer.py", line 2377, in _inner_training_loop
[rank0]: model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
[rank0]: File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/accelerate/accelerator.py", line 1440, in prepare
[rank0]: result = self._prepare_deepspeed(*args)
[rank0]: File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/accelerate/accelerator.py", line 2033, in _prepare_deepspeed
[rank0]: engine, optimizer, _, lr_scheduler = ds_initialize(**kwargs)
[rank0]: File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank0]: engine = DeepSpeedEngine(args=args,
[rank0]: File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 273, in __init__
[rank0]: self._configure_distributed_model(model)
[rank0]: File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1287, in _configure_distributed_model
[rank0]: self._broadcast_model()
[rank0]: File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1205, in _broadcast_model
[rank0]: dist.broadcast(p.data, groups.get_broadcast_src_rank(), group=self.seq_data_parallel_group)
[rank0]: File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
[rank0]: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank0]: File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 206, in broadcast
[rank0]: return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[rank0]: File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2780, in broadcast
[rank0]: work = group.broadcast([tensor], opts)
[rank0]: NotImplementedError: The operator 'c10d::broadcast' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash de772dae0a7854e9cb71526e2ed16ef5792cdd6c. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
E0505 20:46:23.138000 19544 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 19555) of binary: /Users/nc/miniconda3/envs/ds2/bin/python
Traceback (most recent call last):
File "/Users/nc/miniconda3/envs/ds2/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/Users/nc/miniconda3/envs/ds2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
FlagEmbedding.finetune.embedder.encoder_only.m3 FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2025-05-05_20:46:23
  host       : DXHWVXX2VQ
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 19555)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
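
For what it's worth, the traceback suggests the failure boils down to DeepSpeed broadcasting the model weights (which sit on the MPS device) over the gloo backend. Below is my own minimal reduction of what I believe is the same failing call; this is an assumption on my part, not code taken from FlagEmbedding or DeepSpeed:

```bash
# Hypothetical reduction (my assumption): broadcast an MPS tensor over a
# single-rank gloo group, which is what the traceback shows DeepSpeed's
# _broadcast_model ends up doing on this machine.
MASTER_ADDR=127.0.0.1 MASTER_PORT=29500 python - <<'PY'
import torch
import torch.distributed as dist

# Single-rank gloo group, the same backend DeepSpeed picked on macOS.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.ones(4, device="mps")   # model parameters live on the MPS device
dist.broadcast(t, src=0)          # expected: NotImplementedError for c10d::broadcast on MPS
dist.destroy_process_group()
PY
```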
The torch version is 2.8.0.dev20250416.
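
Based on the suggestion in the error message itself, I assume one temporary workaround would be to set PYTORCH_ENABLE_MPS_FALLBACK=1 before launching, roughly as sketched below. I have not verified that the whole finetune run then succeeds, and the message warns that the fallback path is slower than running natively on MPS:

```bash
# Fallback suggested by the error message itself; untested end to end.
export PYTORCH_ENABLE_MPS_FALLBACK=1
torchrun --nproc_per_node 1 \
    -m FlagEmbedding.finetune.embedder.encoder_only.m3 \
    --model_name_or_path /Users/nc/python/FlagNew/models/bge-m3 \
    --train_data /Users/nc/python/FlagNew/examples/finetune/embedder/example_data/sts/sts.jsonl \
    --output_dir /Users/nc/python/FlagNew/models/output
    # ...remaining flags identical to the original command above
```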
How can I fix this issue properly? Let me know if you need any extra info.
Thanks!