[BUG]: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 7 (pid: 2860777) of binary:
🐛 Describe the bug
Using config `gpt2_configs/gpt2_zero3.py`, the run fails with:

```
Error: failed to run torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch on 127.0.0.1
```
Bug log:

```
WARNING colossalai - ShardedOptimizerV2 - WARNING: found inf during ShardedOptimV2 step
WARNING colossalai - ShardedOptimizerV2 - WARNING: found inf during ShardedOptimV2 step
[Epoch 0 / Train]: 1%|▌ | 19/2839 [15:02<33:39:30, 42.97s/it, loss=2580.0, lr=0.0167, throughput=0.38035 sample_per_sec, 12.996 Tflops]
[Epoch 0 / Train]: 1%|▌ | 20/2839 [15:02<33:31:36, 42.82s/it, loss=2580.0, lr=0.0167, throughput=0.38035 sample_per_sec, 12.996 Tflops]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2860767 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2860768 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2860769 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2860770 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2860771 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2860772 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2860776 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 7 (pid: 2860777) of binary: /home/jovyan/miniconda3/envs/colossalai-test/bin/python
Traceback (most recent call last):
  File "/home/jovyan/miniconda3/envs/colossalai-test/bin/torchrun", line 8, in
```
Environment
latest
This is expected because torchrun cannot do a graceful shutdown for now. It is not a bug.
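For context: the SIGTERM warnings in the log are torchrun's elastic agent tearing down the remaining ranks after one rank exits. If you want the workers themselves to exit more cleanly, one option (a minimal sketch, not part of the example code) is to trap SIGTERM in each worker:

```python
import signal
import sys

def _handle_sigterm(signum, frame):
    # torchrun sends SIGTERM to the surviving ranks when another rank exits;
    # do any last-moment cleanup (flush logs, close files) here.
    print("received SIGTERM from the elastic agent, exiting", file=sys.stderr)
    sys.exit(0)

# install once at worker startup, in the main thread
signal.signal(signal.SIGTERM, _handle_sigterm)
```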
This example only runs 20 steps on purpose, since it is only a demo. You should be able to find the place in the code where the program is stopped after 20 steps.
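In general, such a cap is just an early break in the training loop. A minimal, self-contained sketch (all names here are illustrative, not the example's actual code):

```python
MAX_STEPS = 20                   # demo cap; illustrative name
num_epochs = 60
train_dataloader = range(2839)   # stand-in for the real dataloader

def train_step(batch):
    # stand-in for one forward/backward/optimizer step
    return 0.0

for epoch in range(num_epochs):
    for step, batch in enumerate(train_dataloader):
        loss = train_step(batch)
        if step + 1 >= MAX_STEPS:
            break        # stop the demo after 20 steps
    else:
        continue         # inner loop ended normally: go to the next epoch
    break                # inner loop hit the cap: stop training entirely
```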
I used the code in https://github.com/hpcaitech/ColossalAI-Examples/blob/main/language/gpt/train_gpt.py. My config.py:

```python
from colossalai.nn.optimizer import HybridAdam
from colossalai.zero.shard_utils import TensorShardStrategy
from titans.model.gpt import gpt2_small, gpt2_36B

BATCH_SIZE = 2
NUM_EPOCHS = 60
SEQ_LEN = 1024

zero = dict(
    model_config=dict(
        tensor_placement_policy='cpu',
        shard_strategy=TensorShardStrategy(),
        reuse_fp16_shard=True
    ),
    optimizer_config=dict()
)

optimizer = dict(
    type=HybridAdam,
    lr=0.001,
    weight_decay=1e-2,
)

model = dict(
    type=gpt2_small,
    checkpoint=True,
)

# note: this second assignment overrides the gpt2_small config above
model = dict(
    type=gpt2_36B,
    vocab_size=53227,
    checkpoint=True,
)
```
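For reference, a config file like this is consumed by the example roughly as follows. This is a sketch from memory of the legacy ColossalAI API (`launch_from_torch` taking a config path, with values exposed via `gpc.config`), so details may differ across versions; it must be started under torchrun so the distributed env vars are set:

```python
import colossalai
from colossalai.core import global_context as gpc

# initialize distributed state from the env vars torchrun sets
# (RANK, WORLD_SIZE, MASTER_ADDR, ...), and load the python config file
colossalai.launch_from_torch(config='gpt2_configs/gpt2_zero3.py')

# top-level names defined in the config file become attributes of gpc.config
print(gpc.config.BATCH_SIZE, gpc.config.NUM_EPOCHS, gpc.config.SEQ_LEN)
```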
Where in the code can I find the place where it stops after 20 steps?
Hi @Tron1994, we have updated a lot. You can check our new example: https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt

This issue was closed due to inactivity. Thanks.