[Bug] Error when loading a second model on a machine where ZeRO config variables have already been created.
System Info
- OS: Ubuntu 22.04.3 LTS
- GPU count and types: one machine with 4 x NVIDIA H100 PCIe
- Python version: 3.10.12
- Any other relevant info about your setup: transformers 4.39.3
Who can help?
@ArthurZucker @younesbelkada @pacman100
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Error Messages
deepspeed --num_gpus=4 test.py --deepspeed deepspeed_config_zero3_without_offload.json
[2024-04-18 23:48:52,671] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-18 23:48:53,410] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-04-18 23:48:53,440] [INFO] [runner.py:568:main] cmd = /home/ivanfung/miniforge3/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None test.py --deepspeed deepspeed_config_zero3_without_offload.json
[2024-04-18 23:48:55,322] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-18 23:48:55,784] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2024-04-18 23:48:55,784] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2024-04-18 23:48:55,784] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2024-04-18 23:48:55,784] [INFO] [launch.py:163:main] dist_world_size=4
[2024-04-18 23:48:55,784] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2024-04-18 23:48:55,785] [INFO] [launch.py:253:main] process 493390 spawned with command: ['/home/ivanfung/miniforge3/bin/python3.10', '-u', 'test.py', '--local_rank=0', '--deepspeed', 'deepspeed_config_zero3_without_offload.json']
[2024-04-18 23:48:55,786] [INFO] [launch.py:253:main] process 493391 spawned with command: ['/home/ivanfung/miniforge3/bin/python3.10', '-u', 'test.py', '--local_rank=1', '--deepspeed', 'deepspeed_config_zero3_without_offload.json']
[2024-04-18 23:48:55,787] [INFO] [launch.py:253:main] process 493392 spawned with command: ['/home/ivanfung/miniforge3/bin/python3.10', '-u', 'test.py', '--local_rank=2', '--deepspeed', 'deepspeed_config_zero3_without_offload.json']
[2024-04-18 23:48:55,788] [INFO] [launch.py:253:main] process 493393 spawned with command: ['/home/ivanfung/miniforge3/bin/python3.10', '-u', 'test.py', '--local_rank=3', '--deepspeed', 'deepspeed_config_zero3_without_offload.json']
2024-04-18 23:49:02 - test.py[line:117] - INFO: args.__dict__ : {'deepspeed': 'deepspeed_config_zero3_without_offload.json', 'resume_from_checkpoint': False, 'local_rank': 3}
2024-04-18 23:49:02 - test.py[line:129] - INFO: per_device_train_batch_size = 32, gradient_accumulation_steps = 1
2024-04-18 23:49:02 - test.py[line:117] - INFO: args.__dict__ : {'deepspeed': 'deepspeed_config_zero3_without_offload.json', 'resume_from_checkpoint': False, 'local_rank': 2}
2024-04-18 23:49:02 - test.py[line:129] - INFO: per_device_train_batch_size = 32, gradient_accumulation_steps = 1
2024-04-18 23:49:02 - test.py[line:117] - INFO: args.__dict__ : {'deepspeed': 'deepspeed_config_zero3_without_offload.json', 'resume_from_checkpoint': False, 'local_rank': 0}
2024-04-18 23:49:02 - test.py[line:129] - INFO: per_device_train_batch_size = 32, gradient_accumulation_steps = 1
2024-04-18 23:49:02 - test.py[line:117] - INFO: args.__dict__ : {'deepspeed': 'deepspeed_config_zero3_without_offload.json', 'resume_from_checkpoint': False, 'local_rank': 1}
2024-04-18 23:49:02 - test.py[line:129] - INFO: per_device_train_batch_size = 32, gradient_accumulation_steps = 1
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.15s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.19s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.18s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.15s/it]
Map (num_proc=32): 100%|███████████████████████████████████████████████████████████████████████████████| 13883/13883 [00:00<00:00, 21859.95 examples/s]
Map (num_proc=32): 100%|███████████████████████████████████████████████████████████████████████████████| 13883/13883 [00:00<00:00, 18743.57 examples/s]
Map (num_proc=32): 100%|███████████████████████████████████████████████████████████████████████████████| 13883/13883 [00:00<00:00, 19966.79 examples/s]
Map (num_proc=32): 100%|███████████████████████████████████████████████████████████████████████████████| 13883/13883 [00:00<00:00, 22316.04 examples/s]
Filter (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████| 13883/13883 [00:00<00:00, 37141.84 examples/s]
Filter (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████| 13883/13883 [00:00<00:00, 38661.50 examples/s]
Filter (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████| 13883/13883 [00:00<00:00, 35041.42 examples/s]
Filter (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████| 13883/13883 [00:00<00:00, 38018.21 examples/s]
Disagreement of input vs. target of training data: 13883
2024-04-18 23:49:15 - test.py[line:265] - INFO: Tokenizing training set success!
Disagreement of input vs. target of training data: 13883
2024-04-18 23:49:15 - test.py[line:265] - INFO: Tokenizing training set success!
Disagreement of input vs. target of training data: 13883
2024-04-18 23:49:15 - test.py[line:265] - INFO: Tokenizing training set success!
Disagreement of input vs. target of training data: 13883
2024-04-18 23:49:15 - test.py[line:265] - INFO: Tokenizing training set success!
Map (num_proc=32): 100%|██████████████████████████████████████████████████████████████████████████████████| 1752/1752 [00:00<00:00, 4166.98 examples/s]
Map (num_proc=32): 100%|██████████████████████████████████████████████████████████████████████████████████| 1752/1752 [00:00<00:00, 4280.46 examples/s]
Map (num_proc=32): 100%|██████████████████████████████████████████████████████████████████████████████████| 1752/1752 [00:00<00:00, 4181.90 examples/s]
Filter (num_proc=32): 100%|███████████████████████████████████████████████████████████████████████████████| 1752/1752 [00:00<00:00, 5585.31 examples/s]
Map (num_proc=32): 100%|██████████████████████████████████████████████████████████████████████████████████| 1752/1752 [00:00<00:00, 4121.15 examples/s]
Filter (num_proc=32): 88%|█████████████████████████████████████████████████████████████████████▎ | 1536/1752 [00:00<00:00, 8288.97 examples/s]
Disagreement of input vs target of valid data: 1752
***** Start Training *****
2024-04-18 23:49:19 - test.py[line:288] - INFO: num_gpus = 4, training_nums = 13883, total_steps = 545, warmup_steps = 32
Filter (num_proc=32): 100%|███████████████████████████████████████████████████████████████████████████████| 1752/1752 [00:00<00:00, 5868.96 examples/s]
Filter (num_proc=32): 100%|███████████████████████████████████████████████████████████████████████████████| 1752/1752 [00:00<00:00, 5691.31 examples/s]
Disagreement of input vs target of valid data: 1752
***** Start Training *****
2024-04-18 23:49:19 - test.py[line:288] - INFO: num_gpus = 4, training_nums = 13883, total_steps = 545, warmup_steps = 32
Disagreement of input vs target of valid data: 1752
***** Start Training *****
2024-04-18 23:49:19 - test.py[line:288] - INFO: num_gpus = 4, training_nums = 13883, total_steps = 545, warmup_steps = 32
Filter (num_proc=32): 0%| | 0/1752 [00:00<?, ? examples/s]
[2024-04-18 23:49:19,923] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Filter (num_proc=32): 53%|██████████████████████████████████████████▋ | 935/1752 [00:00<00:00, 4852.83 examples/s]
[2024-04-18 23:49:20,050] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-18 23:49:20,123] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Filter (num_proc=32): 100%|███████████████████████████████████████████████████████████████████████████████| 1752/1752 [00:00<00:00, 5395.09 examples/s]
[2024-04-18 23:49:20,171] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-18 23:49:20,251] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-18 23:49:20,252] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-04-18 23:49:20,303] [INFO] [comm.py:637:init_distributed] cdb=None
Disagreement of input vs target of valid data: 1752
***** Start Training *****
2024-04-18 23:49:20 - test.py[line:288] - INFO: num_gpus = 4, training_nums = 13883, total_steps = 545, warmup_steps = 32
[2024-04-18 23:49:21,018] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-18 23:49:21,146] [INFO] [comm.py:637:init_distributed] cdb=None
trainer.train
trainer.train
trainer.train
trainer.train
hpZeRO group size: 4
Parameter Offload: Total persistent parameters: 266240 in 65 params
{'loss': 4.996, 'grad_norm': 166.72275161005012, 'learning_rate': 3.125e-06, 'epoch': 0.09}
{'loss': 2.4735, 'grad_norm': 38.799148163513834, 'learning_rate': 6.25e-06, 'epoch': 0.18}
{'loss': 2.0208, 'grad_norm': 36.615276885153875, 'learning_rate': 9.375000000000001e-06, 'epoch': 0.28}
{'loss': 1.6564, 'grad_norm': 9.72972914196507, 'learning_rate': 9.844054580896686e-06, 'epoch': 0.37}
{'loss': 1.41, 'grad_norm': 9.066500374092316, 'learning_rate': 9.649122807017545e-06, 'epoch': 0.46}
10%|███████████▏ | 54/545 [01:18<11:25, 1.40s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████| 55/55 [00:22<00:00, 2.44it/s]
Traceback (most recent call last):
File "/home/ivanfung/workspace/bug/test.py", line 366, in <module>
train(args)
File "/home/ivanfung/workspace/bug/test.py", line 340, in train
trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
return inner_training_loop(
File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2193, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2577, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3365, in evaluate
output = eval_loop(
File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3656, in evaluation_loop
metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
File "/home/ivanfung/workspace/bug/test.py", line 57, in compute_metrics
bs_f1 = BERT_SCORER.compute(
File "/home/ivanfung/workspace/app/evaluate/src/evaluate/module.py", line 462, in compute
output = self._compute(**inputs, **compute_kwargs)
File "/home/ivanfung/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--bertscore/cf4907b18f8f741f202232c0f8009a3bd49ff98802c245abcb6ea51a37a8c05b/bertscore.py", line 189, in _compute
self.cached_bertscorer = scorer(
File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/bert_score/scorer.py", line 98, in __init__
self._model = get_model(self.model_type, self.num_layers, self.all_layers)
File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/bert_score/utils.py", line 255, in get_model
model = AutoModel.from_pretrained(model_type)
File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
return model_class.from_pretrained(
File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3394, in from_pretrained
init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 939, in __init__
groups._create_zero_param_parallel_group(_ds_config.zero_config.zero_hpz_partition_size)
File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/deepspeed/utils/groups.py", line 518, in _create_zero_param_parallel_group
assert _ZERO_PARAM_INTRA_PARALLEL_GROUP is None, \
AssertionError: ZeRO parameter intra parallel group is already initialized
(The identical traceback is raised on each of the remaining three ranks.)
[2024-04-18 23:51:25,958] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 493390
[2024-04-18 23:51:26,535] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 493391
[2024-04-18 23:51:26,595] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 493392
[2024-04-18 23:51:26,595] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 493393
[2024-04-18 23:51:26,783] [ERROR] [launch.py:322:sigkill_handler] ['/home/ivanfung/miniforge3/bin/python3.10', '-u', 'test.py', '--local_rank=3', '--deepspeed', 'deepspeed_config_zero3_without_offload.json'] exits with return code = 1
Dataset
I used the following data examples for training and validation to reproduce this error.
Please download it and run unzip dataset.zip to extract it.
dataset.zip
Steps for Reproduction
- Create a DeepSpeed config deepspeed_config_zero3_without_offload.json as shown below:
{
"bf16": {
"enabled": true
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"zero_hpz_partition_size": 8,
"reduce_bucket_size": 10000000,
"reduce_scatter": true,
"stage3_gather_16bit_weights_on_model_save": false
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
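For context, here is a minimal sketch (my own illustration, not part of the original report) of the mechanism that makes this config relevant: once a ZeRO-3 config is registered globally, which the Trainer does when it receives this JSON, transformers wraps every subsequent from_pretrained call in deepspeed.zero.Init, and that is what later trips the assertion when a second model is loaded.

# Sketch only. HfDeepSpeedConfig and is_deepspeed_zero3_enabled are the
# transformers DeepSpeed integration helpers; the dict stands in for the
# JSON config above.
from transformers.integrations.deepspeed import (
    HfDeepSpeedConfig,
    is_deepspeed_zero3_enabled,
)

ds_config = {"zero_optimization": {"stage": 3, "zero_hpz_partition_size": 8}}
# Constructing HfDeepSpeedConfig registers the config in a module-level weak
# reference (keep `dschf` alive, the global only holds a weak ref); the
# Trainer does the same thing internally.
dschf = HfDeepSpeedConfig(ds_config)
print(is_deepspeed_zero3_enabled())  # True
# From this point on, every AutoModel.from_pretrained(...) in this process
# runs inside deepspeed.zero.Init(config_dict_or_path=deepspeed_config()),
# exactly the call shown in modeling_utils.py in the traceback.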
- Use DeepSpeed to run a training script test.py that loads models with Hugging Face calls such as AutoModel.from_pretrained(...).
# -*- coding: utf-8 -*-
import os
import sys
import json
import glob
import logging
import argparse
import warnings
from typing import List, Dict, Optional

import torch
import transformers
from evaluate import load
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util
from transformers import (
    LlamaForCausalLM,
    LlamaTokenizer,
    AutoModelForCausalLM,
    AutoTokenizer,
)
from transformers.tokenization_utils_base import BatchEncoding

warnings.filterwarnings("ignore")

TOKENIZER = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
TOKENIZER.pad_token_id = 0
TOKENIZER.bos_token_id = 1
TOKENIZER.eos_token_id = 2

BERT_SCORER = load("bertscore")


def preprocess_logits_for_metrics(logits, labels):
    """
    The original Trainer may cause an OOM issue.
    This is a workaround to avoid storing too many tensors that are not needed.
    """
    pred_ids = torch.argmax(logits, dim=-1)
    return pred_ids, labels


def compute_metrics(eval_preds):
    """Compute metrics for evaluation."""
    pred_ids = eval_preds.predictions[0]
    labels_ids = eval_preds.label_ids
    if isinstance(pred_ids, tuple):
        pred_ids = pred_ids[0]
    pred_ids[pred_ids == -100] = TOKENIZER.pad_token_id
    pred_str = TOKENIZER.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = TOKENIZER.pad_token_id
    label_str = TOKENIZER.batch_decode(labels_ids, skip_special_tokens=True)
    # compute BERTScore F1
    bs_f1 = BERT_SCORER.compute(
        predictions=pred_str,
        references=label_str,
        lang="en",
        nthreads=16,
        device="cuda:3",
    )["f1"][0]
    return {
        "bertscore-f1": round(bs_f1, 4) * 100,
    }


def get_logger(logger_name: str, output_dir: str) -> logging.Logger:
    """Initialize the logger."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    os.makedirs(output_dir, exist_ok=True)
    file_handler = logging.FileHandler(
        os.path.join(output_dir, "log.txt"), mode="w")
    file_handler.setLevel(logging.INFO)
    file_handler.setFormatter(
        logging.Formatter(
            fmt="%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
    )
    logger.addHandler(file_handler)
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    console_handler.setFormatter(
        logging.Formatter(
            fmt="%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
    )
    logger.addHandler(console_handler)
    return logger


def train(args: argparse.Namespace) -> None:
    """Training entry point for supervised fine-tuning."""
    model_config = {
        "batch_size": 128,
        "num_epochs": 5,
        "per_device_train_batch_size": 32,
        "eval_times": 10,
        "warmup_rate": 0.06,
        "gradient_accumulation_steps": 1,
    }
    model_type = "llama"
    model_name_or_path = "meta-llama/Llama-2-7b-chat-hf"
    data_path_train = "./train.jsonl"
    data_path_valid = "./valid.jsonl"
    output_dir = "./output"
    max_seq_len = 128
    logger = get_logger("train", "output")
    logger.info("args.__dict__ : {}".format(args.__dict__))
    assert (
        model_name_or_path
    ), "Please specify model_name_or_path, e.g. 'meta-llama/Llama-2-7b-chat-hf'"
    gradient_accumulation_steps = (
        model_config["batch_size"] // model_config["per_device_train_batch_size"]
        if "gradient_accumulation_steps" not in model_config
        else model_config["gradient_accumulation_steps"]
    )
    logger.info(
        "per_device_train_batch_size = {}, gradient_accumulation_steps = {}".format(
            model_config["per_device_train_batch_size"], gradient_accumulation_steps
        )
    )
    device_map = None
    world_size = int(
        os.environ.get("WORLD_SIZE", 1)
    )  # `world_size` corresponds to the number of GPUs
    ddp = world_size != 1
    if ddp:
        device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)}
        gradient_accumulation_steps = max(
            gradient_accumulation_steps // world_size, 1)
    # load model and tokenizer for LLaMA and its variants
    model = LlamaForCausalLM.from_pretrained(
        model_name_or_path,
        device_map=device_map,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path)
    tokenizer.pad_token_id = 0
    tokenizer.bos_token_id = 1
    tokenizer.eos_token_id = 2

    def tokenize(
        input_text: str, target_text: str, add_eos_token: bool = True
    ) -> Dict[str, str]:
        """Tokenize the given prompt and convert it into input_ids, attention_mask, and label ids."""
        result = dict()
        inputs = tokenizer(
            input_text,
            truncation=False,
            max_length=max_seq_len,
            padding=False,
            return_tensors=None,
        )
        targets = tokenizer(
            target_text,
            truncation=False,
            max_length=max_seq_len,
            padding=False,
            return_tensors=None,
        )
        inputs_len = len(inputs["input_ids"])
        targets_len = len(targets["input_ids"])
        # (1) len of inputs + len of targets < max_seq_len
        if inputs_len + targets_len < max_seq_len:
            result["input_ids"] = inputs["input_ids"] + targets["input_ids"]
            result["attention_mask"] = (
                inputs["attention_mask"] + targets["attention_mask"]
            )
        # (2) len of inputs + len of targets >= max_seq_len, shrink the inputs
        elif inputs_len + targets_len >= max_seq_len:
            inputs_len = max_seq_len - targets_len - 1
            result["input_ids"] = (
                inputs["input_ids"][:inputs_len] + targets["input_ids"]
            )
            result["attention_mask"] = (
                inputs["attention_mask"][:inputs_len]
                + targets["attention_mask"]
            )
        if inputs_len <= 8:
            print(
                f"[DROP] `inputs_len` should be greater than 8 in input of data point: {input_text}."
            )
            return {"input_ids": [], "attention_mask": [], "labels": []}
        # Add the eos token
        if (
            result["input_ids"][-1] != tokenizer.eos_token_id
            and len(result["input_ids"]) < max_seq_len
            and add_eos_token
        ):
            result["input_ids"].append(tokenizer.eos_token_id)
            result["attention_mask"].append(1)
        if add_eos_token and len(result["input_ids"]) >= max_seq_len:
            result["input_ids"][max_seq_len - 1] = tokenizer.eos_token_id
            result["attention_mask"][max_seq_len - 1] = 1
        # Construct labels; skip loss computation for prompt tokens by setting them to -100
        result["labels"] = [-100] * inputs_len + result["input_ids"][inputs_len:].copy()
        if len(result["input_ids"]) != len(result["labels"]):
            print(
                f"[DROP] Length mismatch between `input_ids` and `labels` in {input_text}!"
            )
            return {"input_ids": [], "attention_mask": [], "labels": []}
        return result

    def generate_and_tokenize_prompt(datapoint) -> Dict[str, str]:
        """Generate and construct a prompt constrained by a fixed window size.

        Dynamically generate the input sequence and target sequence for each training example.
        """
        input_text = (
            datapoint["instruction"] + "\n\n"
        )  # no prompt prefix is used for fine-tuning
        input_text = (
            tokenizer.bos_token + input_text
            if tokenizer.bos_token is not None
            else input_text
        )  # add the bos token if it exists
        target_text = (
            datapoint["definition"] + tokenizer.eos_token
            if tokenizer.eos_token is not None
            else datapoint["definition"]
        )  # add the eos token if it exists
        # Check the combined length of input_text and target_text
        if len(input_text.split()) + len(target_text.split()) <= max_seq_len:
            return tokenize(input_text, target_text)
        else:
            print(
                f"[DROP] Length of `input_text` ⨁ `target_text` should be less than {max_seq_len} in data point: {input_text}."
            )
            return {"input_ids": [], "attention_mask": [], "labels": []}

    data_train = load_dataset("json", data_files=data_path_train)["train"]
    training_nums = len(data_train)
    # tokenize datapoints of the training set
    train_data = (
        data_train.shuffle()
        .map(generate_and_tokenize_prompt, num_proc=32, keep_in_memory=True)
        .filter(lambda x: len(x["input_ids"]) > 0, num_proc=32, keep_in_memory=True)
    )
    # count examples whose input_ids and labels lengths disagree
    print(
        f"Disagreement of input vs. target of training data: {sum(len(d['input_ids']) != len(d['labels']) for d in train_data)}"
    )
    logger.info("Tokenizing training set success!")
    if os.path.isfile(data_path_valid):
        data_valid = load_dataset("json", data_files=data_path_valid)["train"]
        # tokenize datapoints of the validation set
        val_data = (
            data_valid.shuffle()
            .map(generate_and_tokenize_prompt, num_proc=32, keep_in_memory=True)
            .filter(lambda x: len(x["input_ids"]) > 0, num_proc=32, keep_in_memory=True)
        )
    else:
        val_data = None
    if val_data is not None:
        print(
            f"Disagreement of input vs target of valid data: {sum(len(d['input_ids']) != len(d['labels']) for d in val_data)}"
        )
    print("***** Start Training *****")
    num_gpus = torch.cuda.device_count()
    total_steps = (
        training_nums
        // (
            gradient_accumulation_steps
            * model_config["per_device_train_batch_size"]
            * num_gpus
        )
        + 1
    ) * model_config["num_epochs"]
    eval_interval_steps = save_interval_steps = total_steps // model_config["eval_times"]
    warmup_steps = int(total_steps * model_config.get("warmup_rate", 0.06))
    logger.info(
        "num_gpus = {}, training_nums = {}, total_steps = {}, warmup_steps = {}".format(
            num_gpus, training_nums, total_steps, warmup_steps
        )
    )
    trainer = transformers.Trainer(
        model=model,
        train_dataset=train_data,
        eval_dataset=val_data,
        preprocess_logits_for_metrics=preprocess_logits_for_metrics,
        compute_metrics=compute_metrics,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=model_config["per_device_train_batch_size"],
            gradient_accumulation_steps=gradient_accumulation_steps,
            warmup_steps=warmup_steps,
            adam_beta1=0.9,
            adam_beta2=0.95,
            weight_decay=0.01,
            num_train_epochs=model_config["num_epochs"],
            learning_rate=1e-5,
            lr_scheduler_type="linear",
            bf16=True,
            tf32=True,
            gradient_checkpointing=True,
            logging_dir="logs/tensorboard",
            logging_steps=10,
            evaluation_strategy="steps",
            eval_steps=eval_interval_steps,
            save_steps=eval_interval_steps,
            output_dir=output_dir,
            report_to=None,
            save_total_limit=2,
            load_best_model_at_end=True,
            ddp_find_unused_parameters=False if ddp else None,
            deepspeed=(
                args.deepspeed if args.deepspeed else None
            ),
            group_by_length=True,
        ),
        data_collator=transformers.DataCollatorForSeq2Seq(
            tokenizer,
            pad_to_multiple_of=8,
            return_tensors="pt",
            padding=True,
        ),
    )
    model.config.use_cache = False
    if torch.__version__ >= "2" and sys.platform != "win32":
        model = torch.compile(model)
    print("trainer.train")
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
    logger.info("***** Checkpointing *****")
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    # save the tokenizer into each detected checkpoint directory in output_dir
    for checkpoint_dir in glob.glob(os.path.join(output_dir, "checkpoint-*")):
        try:
            tokenizer.save_pretrained(checkpoint_dir)
        except Exception:
            pass
    logger.info("Training succeeded")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--deepspeed", type=str, help="deepspeed config")
    parser.add_argument(
        "--resume_from_checkpoint",
        action="store_true",
        help="either training checkpoint or final adapter",
    )
    parser.add_argument("--local_rank", type=int)
    args = parser.parse_args()
    train(args)
- Run training with DeepSpeed
deepspeed --num_gpus=4 test.py --deepspeed deepspeed_config_zero3_without_offload.json
Expected behavior
It should be possible to compute BERTScore on a GPU during the in-training evaluation steps, instead of raising this error.
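One workaround I am considering is to hide the global DeepSpeed config while the BERTScore model is loaded, so that its from_pretrained does not enter deepspeed.zero.Init. This is only a sketch based on my reading of the traceback; it touches a private attribute of transformers.integrations.deepspeed and may break across versions:

import contextlib

import transformers.integrations.deepspeed as ds_integration

@contextlib.contextmanager
def no_zero3_init():
    """Temporarily unregister the global HF DeepSpeed config so that
    from_pretrained does not wrap model creation in deepspeed.zero.Init.
    Relies on the private _hf_deepspeed_config_weak_ref, hence fragile."""
    saved = ds_integration._hf_deepspeed_config_weak_ref
    ds_integration._hf_deepspeed_config_weak_ref = None
    try:
        yield
    finally:
        ds_integration._hf_deepspeed_config_weak_ref = saved

# Usage inside compute_metrics:
# with no_zero3_init():
#     bs_f1 = BERT_SCORER.compute(predictions=pred_str, references=label_str,
#                                 lang="en", nthreads=16, device="cuda:3")["f1"][0]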
Any thoughts on this?
@pacman100
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
🤔
Sorry @jacklanda, I think @muellerzr and @SunMarc will replace @pacman100 on such issues! If one of you could have a look!
Are there any thoughts on it?
At this time we do not support multiple models with deepspeed, please see: https://github.com/huggingface/accelerate/issues/2496
> At this time we do not support multiple models with deepspeed, please see: huggingface/accelerate#2496
I see. Thanks for your message :)
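A simpler mitigation consistent with that limitation (my own suggestion, not an official fix): create the BERTScore scorer once before any TrainingArguments with a DeepSpeed config is constructed. The bertscore metric caches its underlying scorer after the first compute call, so the mid-training from_pretrained shown in the traceback never runs:

from evaluate import load

# Do this at module import time, before transformers.TrainingArguments(...,
# deepspeed=...) registers the ZeRO-3 config globally. The dummy compute
# forces bert_score to load and cache its scorer model outside zero.Init.
BERT_SCORER = load("bertscore")
_ = BERT_SCORER.compute(predictions=["warmup"], references=["warmup"], lang="en")

If the cache key changes between calls (for example, a different model_type), the scorer may still be rebuilt later under the ZeRO-3 config, so keep the compute arguments consistent with the warm-up call.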