
[BUG]: WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 73958 closing signal SIGTERM

Open jialesmu opened this issue 2 years ago • 6 comments

🐛 Describe the bug

[04/04/23 03:02:31] INFO     colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
[04/04/23 03:02:31] INFO     colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0
                    INFO     colossalai - colossalai - INFO: process rank 1 is bound to device 1
[04/04/23 03:02:37] INFO     colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557 set_seed
[04/04/23 03:02:37] INFO     colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557 set_seed
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 42, python random: 42, ParallelMode.DATA: 42, ParallelMode.TENSOR: 42,the default parallel seed is ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/initialize.py:116 launch
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 42, python random: 42, ParallelMode.DATA: 42, ParallelMode.TENSOR: 42,the default parallel seed is ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 2, pipeline parallel size: 1, tensor parallel size: 1
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 73958 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 73959) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train_sft.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-04_03:04:14
  host      : AgreeML
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 73959)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 73959
======================================================

I'm trying to run train_sft.sh with the LLaMA 7B model, but it fails with the output above. What's wrong here, and how can I fix it? Thanks.

torchrun --standalone --nproc_per_node=2 train_sft.py \
    --pretrain "ColossalAI/applications/models/llama-7b-hf/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --log_interval 10 \
    --save_path /Coati-7B/output \
    --dataset ColossalAI/applications/dataset/xp3_codeparrot_xlcost-text-to-code_Java-program-level_train_soljava.jsonl \
    --batch_size 1 \
    --accimulation_steps 2 \
    --lr 2e-5 \
    --max_epochs 3

Environment

Python 3.9, torch 1.13.1, CUDA 11.6, 2x RTX 3090 (24 GB each)

jialesmu avatar Apr 04 '23 03:04 jialesmu

Unfortunately, this was an OOM on your machine. Two 3090 GPUs (24 GB each) plus your current main memory might not be enough to train a 7B model.
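
Exit code -9 means the worker received SIGKILL, which on Linux usually indicates the kernel OOM killer reclaiming memory while the checkpoint is being loaded; unless lazy or sharded loading is used, each rank typically loads its own full copy of the checkpoint first, so host RAM needs scale with the number of ranks. A minimal pre-flight check you could run before torchrun (a sketch, assuming psutil is installed; it is not part of the ColossalAI scripts):

```python
# Hypothetical check, not part of ColossalAI: report free host RAM before the
# workers start, since a 7B fp16 checkpoint per rank quickly exhausts system memory.
import psutil

vm = psutil.virtual_memory()
print(f"total RAM : {vm.total / 2**30:.1f} GiB")
print(f"available : {vm.available / 2**30:.1f} GiB")
```

After a failure you can also check the kernel log (e.g. via dmesg) for "Out of memory: Killed process" entries to confirm the OOM killer was responsible.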

JThh avatar Apr 04 '23 06:04 JThh

> Unfortunately, this was an OOM on your machine. Two 3090 GPUs (24 GB each) plus your current main memory might not be enough to train a 7B model.

Loading extension module fused_optim...
[04/04/23 09:33:01] INFO     colossalai - colossalai - INFO: /home/agree/jz/ColossalAI/applications/Chat/examples/train_sft_m.py:128 train
                    INFO     colossalai - colossalai - INFO: Using Distributed Sampler
[04/04/23 09:33:02] INFO     colossalai - colossalai - INFO: /home/agree/jz/ColossalAI/applications/Chat/examples/train_sft_m.py:128 train
                    INFO     colossalai - colossalai - INFO: Using Distributed Sampler
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/agree/jz/ColossalAI/applications/Chat/examples/train_sft_m.py:176 in <module>              │
│                                                                                                  │
│   173 │   parser.add_argument('--lr', type=float, default=5e-6)                                  │
│   174 │   parser.add_argument('--accimulation_steps', type=int, default=8)                       │
│   175 │   args = parser.parse_args()                                                             │
│ ❱ 176 │   train(args)                                                                            │
│   177                                                                                            │
│                                                                                                  │
│ /home/agree/jz/ColossalAI/applications/Chat/examples/train_sft_m.py:148 in train                 │
│                                                                                                  │
│   145 │   │   │   │   │   │    max_epochs=args.max_epochs,                                       │
│   146 │   │   │   │   │   │    accimulation_steps=args.accimulation_steps)                       │
│   147 │                                                                                          │
│ ❱ 148 │   trainer.fit(logger=logger, log_interval=args.log_interval)                             │
│   149 │                                                                                          │
│   150 │   # save model checkpoint after fitting on only rank0                                    │
│   151 │   trainer.save_model(path=args.save_path, only_rank0=True, tokenizer=tokenizer)          │
│                                                                                                  │
│ /opt/conda/lib/python3.9/site-packages/coati/trainer/sft.py:88 in fit                            │
│                                                                                                  │
│    85 │   │   │   for batch_id, batch in enumerate(self.train_dataloader):                       │
│    86 │   │   │   │                                                                              │
│    87 │   │   │   │   prompt_ids = batch["input_ids"]                                            │
│ ❱  88 │   │   │   │   p_mask = batch["attention_mask"]                                           │
│    89 │   │   │   │   labels = batch["labels"]                                                   │
│    90 │   │   │   │   prompt_ids = prompt_ids.squeeze(1).cuda()                                  │
│    91 │   │   │   │   p_mask = p_mask.squeeze(1).cuda()                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'attention_mask'
steps:   0%|          | 0/186 [00:00<?, ?it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/agree/jz/ColossalAI/applications/Chat/examples/train_sft_m.py:176 in <module>              │
│                                                                                                  │
│   173 │   parser.add_argument('--lr', type=float, default=5e-6)                                  │
│   174 │   parser.add_argument('--accimulation_steps', type=int, default=8)                       │
│   175 │   args = parser.parse_args()                                                             │
│ ❱ 176 │   train(args)                                                                            │
│   177                                                                                            │
│                                                                                                  │
│ /home/agree/jz/ColossalAI/applications/Chat/examples/train_sft_m.py:148 in train                 │
│                                                                                                  │
│   145 │   │   │   │   │   │    max_epochs=args.max_epochs,                                       │
│   146 │   │   │   │   │   │    accimulation_steps=args.accimulation_steps)                       │
│   147 │                                                                                          │
│ ❱ 148 │   trainer.fit(logger=logger, log_interval=args.log_interval)                             │
│   149 │                                                                                          │
│   150 │   # save model checkpoint after fitting on only rank0                                    │
│   151 │   trainer.save_model(path=args.save_path, only_rank0=True, tokenizer=tokenizer)          │
│                                                                                                  │
│ /opt/conda/lib/python3.9/site-packages/coati/trainer/sft.py:88 in fit                            │
│                                                                                                  │
│    85 │   │   │   for batch_id, batch in enumerate(self.train_dataloader):                       │
│    86 │   │   │   │                                                                              │
│    87 │   │   │   │   prompt_ids = batch["input_ids"]                                            │
│ ❱  88 │   │   │   │   p_mask = batch["attention_mask"]                                           │
│    89 │   │   │   │   labels = batch["labels"]                                                   │
│    90 │   │   │   │   prompt_ids = prompt_ids.squeeze(1).cuda()                                  │
│    91 │   │   │   │   p_mask = p_mask.squeeze(1).cuda()                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'attention_mask'
steps:   0%|          | 0/186 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 76781) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_sft_m.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-04-04_09:33:09
  host      : AgreeML
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 76782)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-04_09:33:09
  host      : AgreeML
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 76781)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I tried bloom-560 and it seems to hit the same error with another KeyError. Thanks for your help~

jialesmu avatar Apr 04 '23 09:04 jialesmu

Maybe you should check which dataset object is being created. In train_sft.py, if you use SupervisedDataset (line 106), the batch will eventually contain the 'attention_mask' key. Check DataCollatorForSupervisedDataset for more detail:

from dataclasses import dataclass
from typing import Dict, Sequence

import torch
import transformers

IGNORE_INDEX = -100  # label value ignored by the loss, defined alongside the dataset code

# dataset / collator construction in train_sft.py
train_dataset = SupervisedDataset(tokenizer=tokenizer,
                                  data_path=args.dataset,
                                  max_datasets_size=args.max_datasets_size)
eval_dataset = None
data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)


@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""

    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels = tuple([instance[key] for instance in instances]
                                  for key in ("input_ids", "labels"))
        input_ids = torch.nn.utils.rnn.pad_sequence(input_ids,
                                                    batch_first=True,
                                                    padding_value=self.tokenizer.pad_token_id)
        labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
        # the attention mask is derived from the padded input_ids: every position
        # that is not the pad token is attended to
        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )
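
For comparison, here is a minimal sketch of how the collator has to be wired into the DataLoader for each batch to contain 'attention_mask' (assuming the train_dataset, data_collator, and args names from the snippet above; this is not the repo's exact code):

```python
# If the DataLoader is created without collate_fn=data_collator, PyTorch's default
# collation only passes through the keys produced by the dataset itself, so the
# 'attention_mask' key added by DataCollatorForSupervisedDataset never appears,
# which matches the KeyError in the log above.
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset,
                              batch_size=args.batch_size,
                              shuffle=True,
                              collate_fn=data_collator)

batch = next(iter(train_dataloader))
assert "attention_mask" in batch  # present only when the collator is used
```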

chengeharrison avatar Apr 05 '23 01:04 chengeharrison

(see the attached image) Should this part be input_id, since it was already changed above?

jialesmu avatar Apr 07 '23 06:04 jialesmu

anyone get a fix for this?

DamascusGit avatar May 22 '23 13:05 DamascusGit

  1. You can set export TORCH_CPP_LOG_LEVEL=DEBUG to print more info (a small sketch follows this list).
  2. You can decrease the number of GPU worker processes, e.g. from 8 to 2, so that fewer model copies are loaded at once and the memory stays sufficient.
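
A small sketch of the first suggestion, if you prefer to set the variable from Python instead of exporting it in the shell (assumption: you set it in the launching process before torch is imported; otherwise simply export it in the shell that runs torchrun):

```python
# Hypothetical alternative to exporting TORCH_CPP_LOG_LEVEL=DEBUG in the shell:
# set the variable in the process environment before torch is imported so the
# C++-side distributed logging can pick it up.
import os

os.environ["TORCH_CPP_LOG_LEVEL"] = "DEBUG"

import torch  # noqa: E402  deliberately imported after setting the env var
print(torch.__version__)
```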

UpCoder avatar Oct 30 '23 16:10 UpCoder