torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
CUDA 11.6, torch 1.13.1 (cu116 build).
The error is as follows:
Could this be caused by the versions being too old?
The actual error message should be further up in the screenshot; please scroll up to find the more detailed output.
Everything above it is just warnings.
It looks like an environment problem, most likely related to bitsandbytes; you could ask in its repo.
After some changes the bitsandbytes-related warnings are gone, but I still get torch.distributed.elastic.multiprocessing.errors.ChildFailedError. What are the minimum CPU RAM and GPU memory requirements for fine-tuning this model?
I am using bge-base-zh.
It depends on batch size and max_length; reducing both lowers GPU memory usage. Without the concrete error message it is hard to tell what is going wrong. You could try running on a single GPU and see whether the error still appears.
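For example, a single-card launch could look like this (just a sketch: paths and arguments are copied from the command below and should be adjusted to your setup; --negatives_cross_device is dropped because there is only one device):
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node 1 \
    -m baai_general_embedding.finetune.run \
    --output_dir /FlagEmbedding/examples/finetune/output \
    --model_name_or_path /bge-base-zh \
    --train_data /FlagEmbedding/examples/finetune/toy_finetune_data.jsonl \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --dataloader_drop_last True \
    --normlized True \
    --temperature 0.02 \
    --query_max_len 8 \
    --passage_max_len 256 \
    --train_group_size 2 \
    --logging_steps 10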
torchrun --nproc_per_node 4 \
    -m baai_general_embedding.finetune.run \
    --output_dir /FlagEmbedding/examples/finetune/output \
    --model_name_or_path /bge-base-zh \
    --train_data /FlagEmbedding/examples/finetune/toy_finetune_data.jsonl \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --dataloader_drop_last True \
    --normlized True \
    --temperature 0.02 \
    --query_max_len 8 \
    --passage_max_len 256 \
    --train_group_size 2 \
    --negatives_cross_device \
    --logging_steps 10 \
    --query_instruction_for_retrieval "为这个句子生成表示以用于检索相关文章:"
Below is the full output shown on the command line:
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
11/08/2023 03:11:46 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: True
11/08/2023 03:11:46 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: True
11/08/2023 03:11:46 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: True
11/08/2023 03:11:46 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
11/08/2023 03:11:46 - INFO - __main__ - Training/evaluation parameters RetrieverTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=True,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fix_position_embedding=False,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=1e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/FlagEmbedding/examples/finetune/output/runs/Nov08_03-11-46_976e924a5c25,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
negatives_cross_device=True,
no_cuda=False,
normlized=True,
num_train_epochs=1.0,
optim=adamw_torch,
optim_args=None,
output_dir=/FlagEmbedding/examples/finetune/output,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=/FlagEmbedding/examples/finetune/output,
save_on_each_node=False,
save_safetensors=True,
save_steps=500,
save_strategy=steps,
save_total_limit=None,
seed=42,
sentence_pooling_method=cls,
skip_memory_metrics=True,
split_batches=False,
temperature=0.02,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
11/08/2023 03:11:46 - INFO - __main__ - Model parameters ModelArguments(model_name_or_path='/bge-base-zh', config_name=None, tokenizer_name=None, cache_dir=None)
11/08/2023 03:11:46 - INFO - __main__ - Data parameters DataArguments(train_data='/FlagEmbedding/examples/finetune/toy_finetune_data.jsonl', train_group_size=2, query_max_len=8, passage_max_len=256, max_example_num_per_dataset=100000000, query_instruction_for_retrieval='为这个句子生成表示以用于检索相关文章:', passage_instruction_for_retrieval=None)
11/08/2023 03:11:46 - INFO - __main__ - Config: BertConfig {
"_name_or_path": "/bge-base-zh",
"architectures": [
"BertModel"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"classifier_dropout": null,
"directionality": "bidi",
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"label2id": {
"LABEL_0": 0
},
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"output_past": true,
"pad_token_id": 0,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"position_embedding_type": "absolute",
"torch_dtype": "float32",
"transformers_version": "4.35.0",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 21128
}
11/08/2023 03:11:48 - WARNING - accelerate.utils.other - Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6575 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 1 (pid: 6576) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
====================================================
baai_general_embedding.finetune.run FAILED
----------------------------------------------------
Failures:
[1]:
time : 2023-11-08_03:12:22
host : 976e924a5c25
rank : 2 (local_rank: 2)
exitcode : -7 (pid: 6577)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 6577
[2]:
time : 2023-11-08_03:12:22
host : 976e924a5c25
rank : 3 (local_rank: 3)
exitcode : -7 (pid: 6578)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 6578
----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-11-08_03:12:22
host : 976e924a5c25
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 6576)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 6576
====================================================
nvidia-smi
Wed Nov 8 03:12:38 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... Off | 00000000:04:00.0 Off | 0 |
| N/A 34C P0 58W / 250W | 11633MiB / 40960MiB | 54% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCI... Off | 00000000:0C:00.0 Off | 0 |
| N/A 34C P0 64W / 250W | 17831MiB / 40960MiB | 44% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-PCI... Off | 00000000:0E:00.0 Off | 0 |
| N/A 35C P0 62W / 250W | 10691MiB / 40960MiB | 24% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-PCI... Off | 00000000:16:00.0 Off | 0 |
| N/A 31C P0 44W / 250W | 10751MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 45115 C 11631MiB |
| 1 N/A N/A 45115 C 17829MiB |
| 2 N/A N/A 45115 C 10689MiB |
| 3 N/A N/A 45115 C 10639MiB |
+-----------------------------------------------------------------------------+
There are still 376 GB of CPU RAM available. See https://github.com/tatsu-lab/stanford_alpaca/issues/245. I am not sure whether this is caused by insufficient RAM or GPU memory.
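One thing worth checking: the hostname (976e924a5c25) suggests the job runs inside a container, and a SIGBUS (exit code -7) from worker processes is often caused by a /dev/shm that is too small for the data loader and NCCL to allocate shared memory, even when host RAM is plentiful. A quick check and a possible workaround are sketched below (the 16g value and image name are placeholders, not taken from this thread):
# inside the container: how big is the shared-memory mount?
df -h /dev/shm
# if it is small (Docker defaults to 64MB), recreate the container with more, e.g.
docker run --gpus all --shm-size=16g <your-image> ...
# or share the host IPC namespace instead
docker run --gpus all --ipc=host <your-image> ...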
Facing the same issue with re-ranker
@zoeChen119, did you manage to solve this? How? I am running into the same problem as above.
Looking for a solution, same error here.
[2023-12-15 14:01:57,786] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 66887) of binary: /nfs/volume-151-1/gaozhiqiang/miniconda3/envs/case_platform_hdbscan/bin/python
Traceback (most recent call last):
File "/nfs/volume-151-1/gaozhiqiang/miniconda3/envs/case_platform_hdbscan/bin/torchrun", line 8, in
sys.exit(main())
File "/nfs/volume-151-1/gaozhiqiang/miniconda3/envs/case_platform_hdbscan/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/nfs/volume-151-1/gaozhiqiang/miniconda3/envs/case_platform_hdbscan/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/nfs/volume-151-1/gaozhiqiang/miniconda3/envs/case_platform_hdbscan/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/nfs/volume-151-1/gaozhiqiang/miniconda3/envs/case_platform_hdbscan/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/nfs/volume-151-1/gaozhiqiang/miniconda3/envs/case_platform_hdbscan/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
.pretrain FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-12-15_14:01:57
host : k8s-deploy-znbhbj-1695031414061-586c4945d-bsb9s
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 66887)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
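As the last line says, the agent only reports exitcode 1; to see the worker's real traceback, the linked PyTorch docs (https://pytorch.org/docs/stable/elastic/errors.html) suggest wrapping the training entry point with the @record decorator, roughly like this (a sketch; main stands for whatever function your -m module actually runs):
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # existing training code of the entry-point module goes here
    ...

if __name__ == "__main__":
    main()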
Same error here, is there a solution yet? @zoeChen119
Same error.
+1