
[BUG]: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 7 (pid: 2860777) of binary:

Open Tron1994 opened this issue 2 years ago • 3 comments

🐛 Describe the bug

use: gpt2_configs/gpt2_zero3.py
run: Error: failed to run torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch on 127.0.0.1

bug log:

WARNING colossalai - ShardedOptimizerV2 - WARNING: found inf during ShardedOptimV2 step
WARNING colossalai - ShardedOptimizerV2 - WARNING: found inf during ShardedOptimV2 step
[Epoch 0 / Train]: 1%|▌ | 19/2839 [15:02<33:39:30, 42.97s/it, loss=2580.0, lr=0.0167, throughput=0.38035 sample_per_sec, 12.996 Tflops
[Epoch 0 / Train]: 1%|▌ | 20/2839 [15:02<33:31:36, 42.82s/it, loss=2580.0, lr=0.0167, throughput=0.38035 sample_per_sec, 12.996 Tflops]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2860767 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2860768 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2860769 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2860770 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2860771 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2860772 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2860776 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 7 (pid: 2860777) of binary: /home/jovyan/miniconda3/envs/colossalai-test/bin/python
Traceback (most recent call last):
  File "/home/jovyan/miniconda3/envs/colossalai-test/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/jovyan/miniconda3/envs/colossalai-test/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/jovyan/miniconda3/envs/colossalai-test/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/jovyan/miniconda3/envs/colossalai-test/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/jovyan/miniconda3/envs/colossalai-test/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jovyan/miniconda3/envs/colossalai-test/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: bffb46bd587f822d975696245edef6d
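A note for reading the log above: when a Python launcher reports a negative exit code for a child process, the absolute value is by convention the number of the signal that terminated it, so `exitcode: -9` corresponds to signal 9. The sketch below decodes such a code with the standard `signal` module. `describe_exitcode` is a hypothetical helper name, not part of torchrun or ColossalAI, and the sketch assumes a POSIX system.

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Hypothetical helper: turn a child-process exit code into a readable label."""
    if exitcode < 0:
        # Negative codes mean "terminated by signal number -exitcode".
        return f"killed by {signal.Signals(-exitcode).name}"
    if exitcode == 0:
        return "exited cleanly"
    return f"exited with status {exitcode}"


print(describe_exitcode(-9))  # killed by SIGKILL
```

The decoded name only says which signal arrived, not who sent it, so it narrows the search rather than identifying the cause.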

Environment

latest No response

Tron1994, Jan 11 '23 03:01

This is expected because torchrun cannot do graceful shutdown for now. It is not a bug.

FrankLeeeee, Jan 11 '23 03:01

This code only runs 20 steps on purpose, since it is only a demo. You should be able to find in the code that the program is stopped after 20 steps.
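As an illustration of what such a cap can look like, here is a minimal sketch. It is not the code from train_gpt.py; `MAX_STEPS` and `run_demo` are hypothetical names, and the training step is a placeholder.

```python
from itertools import islice

MAX_STEPS = 20  # hypothetical cap, chosen to match the 20 steps mentioned above


def run_demo(batches, max_steps=MAX_STEPS):
    """Consume at most `max_steps` batches and return how many were processed."""
    steps_done = 0
    for _batch in islice(batches, max_steps):
        # A real loop would run forward, backward and optimizer.step() here.
        steps_done += 1
    return steps_done


# A loader with 2839 batches (as in the progress bar) stops after 20 steps.
print(run_demo(range(2839)))  # 20
```

Searching the training script for a step counter or an early `break` inside the epoch loop is one way to check whether a cap of this kind is present.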

FrankLeeeee, Jan 11 '23 03:01

I used the code in https://github.com/hpcaitech/ColossalAI-Examples/blob/main/language/gpt/train_gpt.py.

config.py:

from colossalai.nn.optimizer import HybridAdam
from colossalai.zero.shard_utils import TensorShardStrategy
from titans.model.gpt import gpt2_small, gpt2_36B

BATCH_SIZE = 2
NUM_EPOCHS = 60
SEQ_LEN = 1024

zero = dict(
    model_config=dict(
        tensor_placement_policy='cpu',
        shard_strategy=TensorShardStrategy(),
        reuse_fp16_shard=True
    ),
    optimizer_config=dict()
)

optimizer = dict(
    type=HybridAdam,
    lr=0.001,
    weight_decay=1e-2,
)

# model = dict(
#     type=gpt2_small,
#     checkpoint=True,
# )

model = dict(
    type=gpt2_36B,
    vocab_size=53227,
    checkpoint=True,
)

Where can I find the code that stops the program after 20 steps?

Tron1994, Jan 11 '23 04:01

Hi @Tron1994, we have updated a lot. You can check our new example: https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt. This issue was closed due to inactivity. Thanks.

binmakeswell, Apr 18 '23 07:04