ColossalAI
[BUG]: WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 73958 closing signal SIGTERM
🐛 Describe the bug
[04/04/23 03:02:31] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
[04/04/23 03:02:31] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
[04/04/23 03:02:37] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557 set_seed
[04/04/23 03:02:37] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 42, python random: 42, ParallelMode.DATA: 42, ParallelMode.TENSOR: 42,the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/initialize.py:116 launch
INFO colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 42, python random: 42, ParallelMode.DATA: 42, ParallelMode.TENSOR: 42,the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 2, pipeline parallel size: 1, tensor parallel size: 1
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 73958 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 73959) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train_sft.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-04_03:04:14
host : AgreeML
rank : 1 (local_rank: 1)
exitcode : -9 (pid: 73959)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 73959
======================================================
I'm trying to run train_sft.sh with the LLaMA 7B model, but it fails with the output above. What's wrong, and how can I fix it? Thanks. This is the command I ran:
torchrun --standalone --nproc_per_node=2 train_sft.py \
    --pretrain "ColossalAI/applications/models/llama-7b-hf/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --log_interval 10 \
    --save_path /Coati-7B/output \
    --dataset ColossalAI/applications/dataset/xp3_codeparrot_xlcost-text-to-code_Java-program-level_train_soljava.jsonl \
    --batch_size 1 \
    --accimulation_steps 2 \
    --lr 2e-5 \
    --max_epochs 3
Environment
Python 3.9, torch 1.13.1, CUDA 11.6, 2x RTX 3090 GPUs (24 GB each)
Unfortunately, there was an OOM on your machine. Exit code -9 means the process was killed with SIGKILL, which usually indicates the kernel OOM killer stepped in. Two RTX 3090 GPUs (24 GB each) plus your current main memory are probably not enough to train a 7B model.
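As a rough back-of-the-envelope sketch (my own numbers, not from this thread), the memory needed just for the model states of a 7B-parameter model trained with mixed-precision Adam can be estimated like this:

# Rough estimate of model-state memory for a 7B-parameter model with
# mixed-precision Adam (fp16 params/grads + fp32 optimizer states).
n_params = 7e9

fp16_params = 2 * n_params    # 2 bytes per fp16 parameter
fp16_grads  = 2 * n_params    # 2 bytes per fp16 gradient
adam_states = 12 * n_params   # fp32 master copy + momentum + variance (4+4+4 bytes)

total_gib = (fp16_params + fp16_grads + adam_states) / 2**30
print(f"model states: {total_gib:.0f} GiB")   # ~104 GiB, before activations

Even sharded by ZeRO-2 across two GPUs, that is far more than 2 x 24 GiB of GPU memory, so the remainder spills into host RAM (or the run gets killed); a smaller model, LoRA, or an offloading strategy (if your setup supports one) is usually needed on this hardware.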
Loading extension module fused_optim...
[04/04/23 09:33:01] INFO colossalai - colossalai - INFO: /home/agree/jz/ColossalAI/applications/Chat/examples/train_sft_m.py:128 train
INFO colossalai - colossalai - INFO: Using Distributed Sampler
[04/04/23 09:33:02] INFO colossalai - colossalai - INFO: /home/agree/jz/ColossalAI/applications/Chat/examples/train_sft_m.py:128 train
INFO colossalai - colossalai - INFO: Using Distributed Sampler
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/agree/jz/ColossalAI/applications/Chat/examples/train_sft_m.py:176 in <module> │
│ │
│ 173 │ parser.add_argument('--lr', type=float, default=5e-6) │
│ 174 │ parser.add_argument('--accimulation_steps', type=int, default=8) │
│ 175 │ args = parser.parse_args() │
│ ❱ 176 │ train(args) │
│ 177 │
│ │
│ /home/agree/jz/ColossalAI/applications/Chat/examples/train_sft_m.py:148 in train │
│ │
│ 145 │ │ │ │ │ │ max_epochs=args.max_epochs, │
│ 146 │ │ │ │ │ │ accimulation_steps=args.accimulation_steps) │
│ 147 │ │
│ ❱ 148 │ trainer.fit(logger=logger, log_interval=args.log_interval) │
│ 149 │ │
│ 150 │ # save model checkpoint after fitting on only rank0 │
│ 151 │ trainer.save_model(path=args.save_path, only_rank0=True, tokenizer=tokenizer) │
│ │
│ /opt/conda/lib/python3.9/site-packages/coati/trainer/sft.py:88 in fit │
│ │
│ 85 │ │ │ for batch_id, batch in enumerate(self.train_dataloader): │
│ 86 │ │ │ │ │
│ 87 │ │ │ │ prompt_ids = batch["input_ids"] │
│ ❱ 88 │ │ │ │ p_mask = batch["attention_mask"] │
│ 89 │ │ │ │ labels = batch["labels"] │
│ 90 │ │ │ │ prompt_ids = prompt_ids.squeeze(1).cuda() │
│ 91 │ │ │ │ p_mask = p_mask.squeeze(1).cuda() │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'attention_mask'
steps: 0%| | 0/186 [00:00<?, ?it/s]╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/agree/jz/ColossalAI/applications/Chat/examples/train_sft_m.py:176 in <module> │
│ │
│ 173 │ parser.add_argument('--lr', type=float, default=5e-6) │
│ 174 │ parser.add_argument('--accimulation_steps', type=int, default=8) │
│ 175 │ args = parser.parse_args() │
│ ❱ 176 │ train(args) │
│ 177 │
│ │
│ /home/agree/jz/ColossalAI/applications/Chat/examples/train_sft_m.py:148 in train │
│ │
│ 145 │ │ │ │ │ │ max_epochs=args.max_epochs, │
│ 146 │ │ │ │ │ │ accimulation_steps=args.accimulation_steps) │
│ 147 │ │
│ ❱ 148 │ trainer.fit(logger=logger, log_interval=args.log_interval) │
│ 149 │ │
│ 150 │ # save model checkpoint after fitting on only rank0 │
│ 151 │ trainer.save_model(path=args.save_path, only_rank0=True, tokenizer=tokenizer) │
│ │
│ /opt/conda/lib/python3.9/site-packages/coati/trainer/sft.py:88 in fit │
│ │
│ 85 │ │ │ for batch_id, batch in enumerate(self.train_dataloader): │
│ 86 │ │ │ │ │
│ 87 │ │ │ │ prompt_ids = batch["input_ids"] │
│ ❱ 88 │ │ │ │ p_mask = batch["attention_mask"] │
│ 89 │ │ │ │ labels = batch["labels"] │
│ 90 │ │ │ │ prompt_ids = prompt_ids.squeeze(1).cuda() │
│ 91 │ │ │ │ p_mask = p_mask.squeeze(1).cuda() │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'attention_mask'
steps: 0%| | 0/186 [00:00<?, ?it/s]ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 76781) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_sft_m.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-04-04_09:33:09
host : AgreeML
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 76782)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-04_09:33:09
host : AgreeML
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 76781)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
I tried bloom-560 and it seems to hit the same kind of error, just with a different KeyError. Thanks for your help~
Maybe you should check which dataset object has been created. In train_sft.py, if SupervisedDataset is used (line 106), the batch will eventually contain the 'attention_mask' key. Check DataCollatorForSupervisedDataset for more detail; a small usage check follows the snippets below.
train_dataset = SupervisedDataset(tokenizer=tokenizer,
                                  data_path=args.dataset,
                                  max_datasets_size=args.max_datasets_size)
eval_dataset = None
data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
# Imports needed by this snippet; IGNORE_INDEX is the usual -100 sentinel
# that the loss function skips.
from dataclasses import dataclass
from typing import Dict, Sequence

import torch
import transformers

IGNORE_INDEX = -100


@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""

    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels = tuple([instance[key] for instance in instances]
                                  for key in ("input_ids", "labels"))
        input_ids = torch.nn.utils.rnn.pad_sequence(input_ids,
                                                    batch_first=True,
                                                    padding_value=self.tokenizer.pad_token_id)
        labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )
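For illustration (my own sketch, not code from the repo), the collator above can be exercised directly with a stand-in tokenizer object, since __call__ only reads tokenizer.pad_token_id; every batch it builds carries the 'attention_mask' key:

from types import SimpleNamespace

# Stand-in tokenizer: the collator only reads pad_token_id here.
fake_tokenizer = SimpleNamespace(pad_token_id=0)
collator = DataCollatorForSupervisedDataset(tokenizer=fake_tokenizer)

instances = [
    {"input_ids": torch.tensor([1, 2, 3]), "labels": torch.tensor([1, 2, 3])},
    {"input_ids": torch.tensor([4, 5]),    "labels": torch.tensor([4, 5])},
]
batch = collator(instances)
print(sorted(batch.keys()))   # ['attention_mask', 'input_ids', 'labels']

If the batches reaching trainer.fit do not have that key, the dataloader was most likely built through a different dataset/collator code path than the one shown here.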
Should this part use input_ids, since it has already been changed (padded) above?
Has anyone found a fix for this?
- you can set export TORCH_CPP_LOG_LEVEL=DEBUG to print more info
- you can decrease the number of GPUs, e.g. from 8 to 2, to make sure the memory is sufficient; a small memory-check sketch follows below
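As a small, hypothetical diagnostic along the same lines (not from this thread), you can print how much GPU and host memory the machine actually has before launching training:

import torch

# Per-GPU total memory as reported by CUDA.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.1f} GiB")

# Host memory from /proc/meminfo (Linux only).
with open("/proc/meminfo") as f:
    for line in f:
        if line.startswith(("MemTotal", "MemAvailable")):
            print(line.strip())

If the numbers reported here are far below the rough 7B estimate earlier in the thread, the exit code -9 / SIGKILL failure is almost certainly an out-of-memory kill rather than a ColossalAI bug.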