[Bug] Error when loading a second model on a machine where ZeRO config variables have already been created.
System Info
- OS: Ubuntu 22.04.3 LTS
- GPU count and types: one machine with 4 x NVIDIA H100 PCIe
- Python version: 3.10.12
- Any other relevant info about your setup: transformers 4.39.3
Who can help?
@ArthurZucker @younesbelkada @pacman100
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Error Messages
deepspeed --num_gpus=4 test.py --deepspeed deepspeed_config_zero3_without_offload.json
[2024-04-18 23:48:52,671] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-18 23:48:53,410] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-04-18 23:48:53,440] [INFO] [runner.py:568:main] cmd = /home/ivanfung/miniforge3/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None test.py --deepspeed deepspeed_config_zero3_without_offload.json
[2024-04-18 23:48:55,322] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-18 23:48:55,784] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2024-04-18 23:48:55,784] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2024-04-18 23:48:55,784] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2024-04-18 23:48:55,784] [INFO] [launch.py:163:main] dist_world_size=4
[2024-04-18 23:48:55,784] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2024-04-18 23:48:55,785] [INFO] [launch.py:253:main] process 493390 spawned with command: ['/home/ivanfung/miniforge3/bin/python3.10', '-u', 'test.py', '--local_rank=0', '--deepspeed', 'deepspeed_config_zero3_without_offload.json']
[2024-04-18 23:48:55,786] [INFO] [launch.py:253:main] process 493391 spawned with command: ['/home/ivanfung/miniforge3/bin/python3.10', '-u', 'test.py', '--local_rank=1', '--deepspeed', 'deepspeed_config_zero3_without_offload.json']
[2024-04-18 23:48:55,787] [INFO] [launch.py:253:main] process 493392 spawned with command: ['/home/ivanfung/miniforge3/bin/python3.10', '-u', 'test.py', '--local_rank=2', '--deepspeed', 'deepspeed_config_zero3_without_offload.json']
[2024-04-18 23:48:55,788] [INFO] [launch.py:253:main] process 493393 spawned with command: ['/home/ivanfung/miniforge3/bin/python3.10', '-u', 'test.py', '--local_rank=3', '--deepspeed', 'deepspeed_config_zero3_without_offload.json']
2024-04-18 23:49:02 - test.py[line:117] - INFO: args.__dict__ : {'deepspeed': 'deepspeed_config_zero3_without_offload.json', 'resume_from_checkpoint': False, 'local_rank': 3}
2024-04-18 23:49:02 - test.py[line:129] - INFO: per_device_train_batch_size = 32, gradient_accumulation_steps = 1
2024-04-18 23:49:02 - test.py[line:117] - INFO: args.__dict__ : {'deepspeed': 'deepspeed_config_zero3_without_offload.json', 'resume_from_checkpoint': False, 'local_rank': 2}
2024-04-18 23:49:02 - test.py[line:129] - INFO: per_device_train_batch_size = 32, gradient_accumulation_steps = 1
2024-04-18 23:49:02 - test.py[line:117] - INFO: args.__dict__ : {'deepspeed': 'deepspeed_config_zero3_without_offload.json', 'resume_from_checkpoint': False, 'local_rank': 0}
2024-04-18 23:49:02 - test.py[line:129] - INFO: per_device_train_batch_size = 32, gradient_accumulation_steps = 1
2024-04-18 23:49:02 - test.py[line:117] - INFO: args.__dict__ : {'deepspeed': 'deepspeed_config_zero3_without_offload.json', 'resume_from_checkpoint': False, 'local_rank': 1}
2024-04-18 23:49:02 - test.py[line:129] - INFO: per_device_train_batch_size = 32, gradient_accumulation_steps = 1
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.15s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.19s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.18s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.15s/it]
Map (num_proc=32): 100%|███████████████████████████████████████████████████████████████████████████████| 13883/13883 [00:00<00:00, 21859.95 examples/s]
Map (num_proc=32): 100%|███████████████████████████████████████████████████████████████████████████████| 13883/13883 [00:00<00:00, 18743.57 examples/s]
Map (num_proc=32): 100%|███████████████████████████████████████████████████████████████████████████████| 13883/13883 [00:00<00:00, 19966.79 examples/s]
Map (num_proc=32): 100%|███████████████████████████████████████████████████████████████████████████████| 13883/13883 [00:00<00:00, 22316.04 examples/s]
Filter (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████| 13883/13883 [00:00<00:00, 37141.84 examples/s]
Filter (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████| 13883/13883 [00:00<00:00, 38661.50 examples/s]
Filter (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████| 13883/13883 [00:00<00:00, 35041.42 examples/s]
Filter (num_proc=32): 100%|████████████████████████████████████████████████████████████████████████████| 13883/13883 [00:00<00:00, 38018.21 examples/s]
Disagreement of input vs. target of training data: 13883
2024-04-18 23:49:15 - test.py[line:265] - INFO: Tokenizing training set success!
Disagreement of input vs. target of training data: 13883
2024-04-18 23:49:15 - test.py[line:265] - INFO: Tokenizing training set success!
Disagreement of input vs. target of training data: 13883
2024-04-18 23:49:15 - test.py[line:265] - INFO: Tokenizing training set success!
Disagreement of input vs. target of training data: 13883
2024-04-18 23:49:15 - test.py[line:265] - INFO: Tokenizing training set success!
Map (num_proc=32): 100%|██████████████████████████████████████████████████████████████████████████████████| 1752/1752 [00:00<00:00, 4166.98 examples/s]
Map (num_proc=32): 100%|██████████████████████████████████████████████████████████████████████████████████| 1752/1752 [00:00<00:00, 4280.46 examples/s]
Map (num_proc=32): 100%|██████████████████████████████████████████████████████████████████████████████████| 1752/1752 [00:00<00:00, 4181.90 examples/s]
Filter (num_proc=32): 100%|███████████████████████████████████████████████████████████████████████████████| 1752/1752 [00:00<00:00, 5585.31 examples/s]
Map (num_proc=32): 100%|██████████████████████████████████████████████████████████████████████████████████| 1752/1752 [00:00<00:00, 4121.15 examples/s]
Filter (num_proc=32): 88%|█████████████████████████████████████████████████████████████████████▎ | 1536/1752 [00:00<00:00, 8288.97 examples/s]
Disagreement of input vs target of valid data: 1752
***** Start Training *****
2024-04-18 23:49:19 - test.py[line:288] - INFO: num_gpus = 4, training_nums = 13883, total_steps = 545, warmup_steps = 32
Filter (num_proc=32): 100%|███████████████████████████████████████████████████████████████████████████████| 1752/1752 [00:00<00:00, 5868.96 examples/s]
Filter (num_proc=32): 100%|███████████████████████████████████████████████████████████████████████████████| 1752/1752 [00:00<00:00, 5691.31 examples/s]
Disagreement of input vs target of valid data: 1752
***** Start Training *****
2024-04-18 23:49:19 - test.py[line:288] - INFO: num_gpus = 4, training_nums = 13883, total_steps = 545, warmup_steps = 32
Disagreement of input vs target of valid data: 1752
***** Start Training *****
2024-04-18 23:49:19 - test.py[line:288] - INFO: num_gpus = 4, training_nums = 13883, total_steps = 545, warmup_steps = 32
Filter (num_proc=32): 0%| | 0/1752 [00:00<?, ? examples/s]
[2024-04-18 23:49:19,923] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Filter (num_proc=32): 53%|██████████████████████████████████████████▋ | 935/1752 [00:00<00:00, 4852.83 examples/s]
[2024-04-18 23:49:20,050] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-18 23:49:20,123] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Filter (num_proc=32): 100%|███████████████████████████████████████████████████████████████████████████████| 1752/1752 [00:00<00:00, 5395.09 examples/s]
[2024-04-18 23:49:20,171] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-18 23:49:20,251] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-18 23:49:20,252] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-04-18 23:49:20,303] [INFO] [comm.py:637:init_distributed] cdb=None
Disagreement of input vs target of valid data: 1752
***** Start Training *****
2024-04-18 23:49:20 - test.py[line:288] - INFO: num_gpus = 4, training_nums = 13883, total_steps = 545, warmup_steps = 32
[2024-04-18 23:49:21,018] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-18 23:49:21,146] [INFO] [comm.py:637:init_distributed] cdb=None
trainer.train
trainer.train
trainer.train
trainer.train
hpZeRO group size: 4
Parameter Offload: Total persistent parameters: 266240 in 65 params
{'loss': 4.996, 'grad_norm': 166.72275161005012, 'learning_rate': 3.125e-06, 'epoch': 0.09}
{'loss': 2.4735, 'grad_norm': 38.799148163513834, 'learning_rate': 6.25e-06, 'epoch': 0.18}
{'loss': 2.0208, 'grad_norm': 36.615276885153875, 'learning_rate': 9.375000000000001e-06, 'epoch': 0.28}
{'loss': 1.6564, 'grad_norm': 9.72972914196507, 'learning_rate': 9.844054580896686e-06, 'epoch': 0.37}
{'loss': 1.41, 'grad_norm': 9.066500374092316, 'learning_rate': 9.649122807017545e-06, 'epoch': 0.46}
10%|███████████▏ | 54/545 [01:18<11:25, 1.40s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████| 55/55 [00:22<00:00, 2.44it/s]
Traceback (most recent call last):
File "/home/ivanfung/workspace/bug/test.py", line 366, in <module>
train(args)
File "/home/ivanfung/workspace/bug/test.py", line 340, in train
trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
return inner_training_loop(
File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2193, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2577, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3365, in evaluate
output = eval_loop(
File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3656, in evaluation_loop
metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
File "/home/ivanfung/workspace/bug/test.py", line 57, in compute_metrics
bs_f1 = BERT_SCORER.compute(
File "/home/ivanfung/workspace/app/evaluate/src/evaluate/module.py", line 462, in compute
output = self._compute(**inputs, **compute_kwargs)
File "/home/ivanfung/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--bertscore/cf4907b18f8f741f202232c0f8009a3bd49ff98802c245abcb6ea51a37a8c05b/bertscore.py", line 189, in _compute
self.cached_bertscorer = scorer(
File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/bert_score/scorer.py", line 98, in __init__
self._model = get_model(self.model_type, self.num_layers, self.all_layers)
File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/bert_score/utils.py", line 255, in get_model
model = AutoModel.from_pretrained(model_type)
File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
return model_class.from_pretrained(
File "/home/ivanfung/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3394, in from_pretrained
init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 939, in __init__
groups._create_zero_param_parallel_group(_ds_config.zero_config.zero_hpz_partition_size)
File "/home/ivanfung/miniforge3/lib/python3.10/site-packages/deepspeed/utils/groups.py", line 518, in _create_zero_param_parallel_group
assert _ZERO_PARAM_INTRA_PARALLEL_GROUP is None, \
AssertionError: ZeRO parameter intra parallel group is already initialized
(The identical traceback is raised on each of the remaining three ranks.)
[2024-04-18 23:51:25,958] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 493390
[2024-04-18 23:51:26,535] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 493391
[2024-04-18 23:51:26,595] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 493392
[2024-04-18 23:51:26,595] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 493393
[2024-04-18 23:51:26,783] [ERROR] [launch.py:322:sigkill_handler] ['/home/ivanfung/miniforge3/bin/python3.10', '-u', 'test.py', '--local_rank=3', '--deepspeed', 'deepspeed_config_zero3_without_offload.json'] exits with return code = 1
Dataset
I used the following data examples for training and validation to reproduce this error.
Please download it and run unzip dataset.zip to extract it.
dataset.zip
Steps for Reproduction
- Create a DeepSpeed config deepspeed_config_zero3_without_offload.json as shown below:
{
"bf16": {
"enabled": true
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"zero_hpz_partition_size": 8,
"reduce_bucket_size": 10000000,
"reduce_scatter": true,
"stage3_gather_16bit_weights_on_model_save": false
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
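For context, here is a minimal sketch (my own illustration, not part of the original report) of the mechanism that makes this config relevant: once a ZeRO-3 config is registered globally, which the Trainer does when it receives this JSON, transformers wraps every subsequent from_pretrained call in deepspeed.zero.Init, and that is what later trips the assertion when a second model is loaded.

# Sketch only. HfDeepSpeedConfig and is_deepspeed_zero3_enabled are the
# transformers DeepSpeed integration helpers; the dict stands in for the
# JSON config above.
from transformers.integrations.deepspeed import (
    HfDeepSpeedConfig,
    is_deepspeed_zero3_enabled,
)

ds_config = {"zero_optimization": {"stage": 3, "zero_hpz_partition_size": 8}}
# Constructing HfDeepSpeedConfig registers the config in a module-level weak
# reference (keep `dschf` alive, the global only holds a weak ref); the
# Trainer does the same thing internally.
dschf = HfDeepSpeedConfig(ds_config)
print(is_deepspeed_zero3_enabled())  # True
# From this point on, every AutoModel.from_pretrained(...) in this process
# runs inside deepspeed.zero.Init(config_dict_or_path=deepspeed_config()),
# exactly the call shown in modeling_utils.py in the traceback.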
- Use DeepSpeed to run a training script test.py that loads models with Hugging Face calls such as AutoModel.from_pretrained(...).
# -*- coding: utf-8 -*-
import os
import sys
import json
import glob
import logging
import argparse
import warnings
from typing import List, Dict, Optional

import torch
import transformers
from evaluate import load
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util
from transformers import (
    LlamaForCausalLM,
    LlamaTokenizer,
    AutoModelForCausalLM,
    AutoTokenizer,
)
from transformers.tokenization_utils_base import BatchEncoding

warnings.filterwarnings("ignore")

TOKENIZER = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
TOKENIZER.pad_token_id = 0
TOKENIZER.bos_token_id = 1
TOKENIZER.eos_token_id = 2

BERT_SCORER = load("bertscore")


def preprocess_logits_for_metrics(logits, labels):
    """
    The original Trainer may cause an OOM issue.
    This is a workaround to avoid storing too many tensors that are not needed.
    """
    pred_ids = torch.argmax(logits, dim=-1)
    return pred_ids, labels


def compute_metrics(eval_preds):
    """Compute metrics for evaluation."""
    pred_ids = eval_preds.predictions[0]
    labels_ids = eval_preds.label_ids
    if isinstance(pred_ids, tuple):
        pred_ids = pred_ids[0]
    pred_ids[pred_ids == -100] = TOKENIZER.pad_token_id
    pred_str = TOKENIZER.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = TOKENIZER.pad_token_id
    label_str = TOKENIZER.batch_decode(labels_ids, skip_special_tokens=True)
    # compute BERTScore F1
    bs_f1 = BERT_SCORER.compute(
        predictions=pred_str,
        references=label_str,
        lang="en",
        nthreads=16,
        device="cuda:3",
    )["f1"][0]
    return {
        "bertscore-f1": round(bs_f1, 4) * 100,
    }


def get_logger(logger_name: str, output_dir: str) -> logging.Logger:
    """Initialize the logger."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    os.makedirs(output_dir, exist_ok=True)
    file_handler = logging.FileHandler(
        os.path.join(output_dir, "log.txt"), mode="w")
    file_handler.setLevel(logging.INFO)
    file_handler.setFormatter(
        logging.Formatter(
            fmt="%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
    )
    logger.addHandler(file_handler)
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    console_handler.setFormatter(
        logging.Formatter(
            fmt="%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
    )
    logger.addHandler(console_handler)
    return logger


def train(args: argparse.Namespace) -> None:
    """Training entry point for supervised fine-tuning."""
    model_config = {
        "batch_size": 128,
        "num_epochs": 5,
        "per_device_train_batch_size": 32,
        "eval_times": 10,
        "warmup_rate": 0.06,
        "gradient_accumulation_steps": 1,
    }
    model_type = "llama"
    model_name_or_path = "meta-llama/Llama-2-7b-chat-hf"
    data_path_train = "./train.jsonl"
    data_path_valid = "./valid.jsonl"
    output_dir = "./output"
    max_seq_len = 128
    logger = get_logger("train", "output")
    logger.info("args.__dict__ : {}".format(args.__dict__))
    assert (
        model_name_or_path
    ), "Please specify model_name_or_path, e.g. 'meta-llama/Llama-2-7b-chat-hf'"
    gradient_accumulation_steps = (
        model_config["batch_size"] // model_config["per_device_train_batch_size"]
        if "gradient_accumulation_steps" not in model_config
        else model_config["gradient_accumulation_steps"]
    )
    logger.info(
        "per_device_train_batch_size = {}, gradient_accumulation_steps = {}".format(
            model_config["per_device_train_batch_size"], gradient_accumulation_steps
        )
    )
    device_map = None
    world_size = int(
        os.environ.get("WORLD_SIZE", 1)
    )  # `world_size` corresponds to the number of GPUs
    ddp = world_size != 1
    if ddp:
        device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)}
        gradient_accumulation_steps = max(
            gradient_accumulation_steps // world_size, 1)
    # load model and tokenizer for LLaMA and its variants
    model = LlamaForCausalLM.from_pretrained(
        model_name_or_path,
        device_map=device_map,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path)
    tokenizer.pad_token_id = 0
    tokenizer.bos_token_id = 1
    tokenizer.eos_token_id = 2

    def tokenize(
        input_text: str, target_text: str, add_eos_token: bool = True
    ) -> Dict[str, str]:
        """Tokenize the given prompt and convert it into input_ids, attention_mask, and label ids."""
        result = dict()
        inputs = tokenizer(
            input_text,
            truncation=False,
            max_length=max_seq_len,
            padding=False,
            return_tensors=None,
        )
        targets = tokenizer(
            target_text,
            truncation=False,
            max_length=max_seq_len,
            padding=False,
            return_tensors=None,
        )
        inputs_len = len(inputs["input_ids"])
        targets_len = len(targets["input_ids"])
        # (1) len of inputs + len of targets < max_seq_len
        if inputs_len + targets_len < max_seq_len:
            result["input_ids"] = inputs["input_ids"] + targets["input_ids"]
            result["attention_mask"] = (
                inputs["attention_mask"] + targets["attention_mask"]
            )
        # (2) len of inputs + len of targets >= max_seq_len, shrink the inputs
        elif inputs_len + targets_len >= max_seq_len:
            inputs_len = max_seq_len - targets_len - 1
            result["input_ids"] = (
                inputs["input_ids"][:inputs_len] + targets["input_ids"]
            )
            result["attention_mask"] = (
                inputs["attention_mask"][:inputs_len]
                + targets["attention_mask"]
            )
        if inputs_len <= 8:
            print(
                f"[DROP] `inputs_len` should be greater than 8 in input of data point: {input_text}."
            )
            return {"input_ids": [], "attention_mask": [], "labels": []}
        # Add the eos token
        if (
            result["input_ids"][-1] != tokenizer.eos_token_id
            and len(result["input_ids"]) < max_seq_len
            and add_eos_token
        ):
            result["input_ids"].append(tokenizer.eos_token_id)
            result["attention_mask"].append(1)
        if add_eos_token and len(result["input_ids"]) >= max_seq_len:
            result["input_ids"][max_seq_len - 1] = tokenizer.eos_token_id
            result["attention_mask"][max_seq_len - 1] = 1
        # Construct labels; skip loss computation for prompt tokens by setting them to -100
        result["labels"] = [-100] * inputs_len + result["input_ids"][inputs_len:].copy()
        if len(result["input_ids"]) != len(result["labels"]):
            print(
                f"[DROP] Length mismatch between `input_ids` and `labels` in {input_text}!"
            )
            return {"input_ids": [], "attention_mask": [], "labels": []}
        return result

    def generate_and_tokenize_prompt(datapoint) -> Dict[str, str]:
        """Generate and construct a prompt constrained by a fixed window size.

        Dynamically generate the input sequence and target sequence for each training example.
        """
        input_text = (
            datapoint["instruction"] + "\n\n"
        )  # no prompt prefix is used for fine-tuning
        input_text = (
            tokenizer.bos_token + input_text
            if tokenizer.bos_token is not None
            else input_text
        )  # add the bos token if it exists
        target_text = (
            datapoint["definition"] + tokenizer.eos_token
            if tokenizer.eos_token is not None
            else datapoint["definition"]
        )  # add the eos token if it exists
        # Check the combined length of input_text and target_text
        if len(input_text.split()) + len(target_text.split()) <= max_seq_len:
            return tokenize(input_text, target_text)
        else:
            print(
                f"[DROP] Length of `input_text` ⨁ `target_text` should be less than {max_seq_len} in data point: {input_text}."
            )
            return {"input_ids": [], "attention_mask": [], "labels": []}

    data_train = load_dataset("json", data_files=data_path_train)["train"]
    training_nums = len(data_train)
    # tokenize datapoints of the training set
    train_data = (
        data_train.shuffle()
        .map(generate_and_tokenize_prompt, num_proc=32, keep_in_memory=True)
        .filter(lambda x: len(x["input_ids"]) > 0, num_proc=32, keep_in_memory=True)
    )
    # count examples whose input_ids and labels lengths disagree
    print(
        f"Disagreement of input vs. target of training data: {sum(len(d['input_ids']) != len(d['labels']) for d in train_data)}"
    )
    logger.info("Tokenizing training set success!")
    if os.path.isfile(data_path_valid):
        data_valid = load_dataset("json", data_files=data_path_valid)["train"]
        # tokenize datapoints of the validation set
        val_data = (
            data_valid.shuffle()
            .map(generate_and_tokenize_prompt, num_proc=32, keep_in_memory=True)
            .filter(lambda x: len(x["input_ids"]) > 0, num_proc=32, keep_in_memory=True)
        )
    else:
        val_data = None
    if val_data is not None:
        print(
            f"Disagreement of input vs target of valid data: {sum(len(d['input_ids']) != len(d['labels']) for d in val_data)}"
        )
    print("***** Start Training *****")
    num_gpus = torch.cuda.device_count()
    total_steps = (
        training_nums
        // (
            gradient_accumulation_steps
            * model_config["per_device_train_batch_size"]
            * num_gpus
        )
        + 1
    ) * model_config["num_epochs"]
    eval_interval_steps = save_interval_steps = total_steps // model_config["eval_times"]
    warmup_steps = int(total_steps * model_config.get("warmup_rate", 0.06))
    logger.info(
        "num_gpus = {}, training_nums = {}, total_steps = {}, warmup_steps = {}".format(
            num_gpus, training_nums, total_steps, warmup_steps
        )
    )
    trainer = transformers.Trainer(
        model=model,
        train_dataset=train_data,
        eval_dataset=val_data,
        preprocess_logits_for_metrics=preprocess_logits_for_metrics,
        compute_metrics=compute_metrics,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=model_config["per_device_train_batch_size"],
            gradient_accumulation_steps=gradient_accumulation_steps,
            warmup_steps=warmup_steps,
            adam_beta1=0.9,
            adam_beta2=0.95,
            weight_decay=0.01,
            num_train_epochs=model_config["num_epochs"],
            learning_rate=1e-5,
            lr_scheduler_type="linear",
            bf16=True,
            tf32=True,
            gradient_checkpointing=True,
            logging_dir="logs/tensorboard",
            logging_steps=10,
            evaluation_strategy="steps",
            eval_steps=eval_interval_steps,
            save_steps=eval_interval_steps,
            output_dir=output_dir,
            report_to=None,
            save_total_limit=2,
            load_best_model_at_end=True,
            ddp_find_unused_parameters=False if ddp else None,
            deepspeed=(
                args.deepspeed if args.deepspeed else None
            ),
            group_by_length=True,
        ),
        data_collator=transformers.DataCollatorForSeq2Seq(
            tokenizer,
            pad_to_multiple_of=8,
            return_tensors="pt",
            padding=True,
        ),
    )
    model.config.use_cache = False
    if torch.__version__ >= "2" and sys.platform != "win32":
        model = torch.compile(model)
    print("trainer.train")
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
    logger.info("***** Checkpointing *****")
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    # save the tokenizer into each detected checkpoint directory in output_dir
    for checkpoint_dir in glob.glob(os.path.join(output_dir, "checkpoint-*")):
        try:
            tokenizer.save_pretrained(checkpoint_dir)
        except Exception:
            pass
    logger.info("Training succeeded")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--deepspeed", type=str, help="deepspeed config")
    parser.add_argument(
        "--resume_from_checkpoint",
        action="store_true",
        help="either training checkpoint or final adapter",
    )
    parser.add_argument("--local_rank", type=int)
    args = parser.parse_args()
    train(args)
- Run training with DeepSpeed
deepspeed --num_gpus=4 test.py --deepspeed deepspeed_config_zero3_without_offload.json
Expected behavior
It should be possible to compute BERTScore on a GPU during the in-training evaluation steps, instead of raising this error.
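One workaround I am considering is to hide the global DeepSpeed config while the BERTScore model is loaded, so that its from_pretrained does not enter deepspeed.zero.Init. This is only a sketch based on my reading of the traceback; it touches a private attribute of transformers.integrations.deepspeed and may break across versions:

import contextlib

import transformers.integrations.deepspeed as ds_integration

@contextlib.contextmanager
def no_zero3_init():
    """Temporarily unregister the global HF DeepSpeed config so that
    from_pretrained does not wrap model creation in deepspeed.zero.Init.
    Relies on the private _hf_deepspeed_config_weak_ref, hence fragile."""
    saved = ds_integration._hf_deepspeed_config_weak_ref
    ds_integration._hf_deepspeed_config_weak_ref = None
    try:
        yield
    finally:
        ds_integration._hf_deepspeed_config_weak_ref = saved

# Usage inside compute_metrics:
# with no_zero3_init():
#     bs_f1 = BERT_SCORER.compute(predictions=pred_str, references=label_str,
#                                 lang="en", nthreads=16, device="cuda:3")["f1"][0]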
Any thoughts on this?
@pacman100
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
🤔
Sorry @jacklanda, I think @muellerzr and @SunMarc will replace @pacman100 on such issues! If one of you could have a look!
Are there any thoughts on it?
At this time we do not support multiple models with deepspeed, please see: https://github.com/huggingface/accelerate/issues/2496
> At this time we do not support multiple models with deepspeed, please see: huggingface/accelerate#2496
I see. Thanks for your message :)
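A simpler mitigation consistent with that limitation (my own suggestion, not an official fix): create the BERTScore scorer once before any TrainingArguments with a DeepSpeed config is constructed. The bertscore metric caches its underlying scorer after the first compute call, so the mid-training from_pretrained shown in the traceback never runs:

from evaluate import load

# Do this at module import time, before transformers.TrainingArguments(...,
# deepspeed=...) registers the ZeRO-3 config globally. The dummy compute
# forces bert_score to load and cache its scorer model outside zero.Init.
BERT_SCORER = load("bertscore")
_ = BERT_SCORER.compute(predictions=["warmup"], references=["warmup"], lang="en")

If the cache key changes between calls (for example, a different model_type), the scorer may still be rebuilt later under the ZeRO-3 config, so keep the compute arguments consistent with the warm-up call.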