Run into Deadlocks when training or inference

Open SusuXu opened this issue 3 years ago • 1 comments

When I train the model or run inference in a Docker container, the model runs forever and appears to deadlock. It occupies all four GPUs at 100% utilization, but uses only about 1200 MB of memory on each GPU. Do you have any idea why it deadlocks? I suspect a multiprocessing issue.

    sudo docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config hpcaitech/energon-ai:latest
    ++ dirname /config/server.sh
    + cd /config
    + export BASE=/config
    + BASE=/config
    + export PYTHONPATH=/config
    + PYTHONPATH=/config
    + energonai service init --config_file=/config/opt_config.py
    *** Energon Init Configurations: ***
    opt_30B : <function opt_30B at 0x7fa870a60550>
    opt_125M : <function opt_125M at 0x7fa870a604c0>
    opt_175B : <function opt_175B at 0x7fa870a60670>
    launch_engine : <function launch_engine at 0x7fa85e379b80>
    model_class : <function opt_125M at 0x7fa870a604c0>
    model_type : gpt
    host : 127.0.0.1
    port : 29402
    half : True
    checkpoint : home/susu/opt_metaseq_125m/model/restored.pt
    backend : nccl
    tp_init_size : 4
    pp_init_size : 1
    engine_server : <function launch_engine at 0x7fa85e379b80>
    tokenizer_path : facebook/opt-30b
    server_host : 0.0.0.0
    server_port : 8020
    log_level : info
    allow_cors : True
    executor_max_batch_size : 16
    cache_size : 50
    cache_list_size : 2
    timeout_keep_alive : 180
    executor_max_queue_size : 0
    fixed_cache_keys : [('Question: What is the name of the largest continent on earth?\nAnswer: Asia\n\nQuestion: What is at the center of the solar system?\nAnswer:', 64), ('A chat between a salesman and a student.\n\nSalesman: Hi boy, are you looking for a new phone?\nStudent: Yes, my phone is not functioning well.\nSalesman: What is your budget? \nStudent: I have received my scholarship so I am fine with any phone.\nSalesman: Great, then perhaps this latest flagship phone is just right for you.', 64), ("English: I am happy today.\nChinese: 我今天很开心。\n\nEnglish: I am going to play basketball.\nChinese: 我一会去打篮球。\n\nEnglish: Let's celebrate our anniversary.\nChinese:", 64)]
    max_batch_size : 32
    dtype : torch.float16
    rm_padding : False
    seed : 1024
    verbose : True
    trt_sample : None
    Downloading vocab.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 878k/878k [00:00<00:00, 20.4MB/s]
    Downloading merges.txt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 446k/446k [00:00<00:00, 10.6MB/s]
    Downloading special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 221/221 [00:00<00:00, 149kB/s]
    Downloading tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 522kB/s]
    Downloading config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 651/651 [00:00<00:00, 412kB/s]
    [09/08/22 13:59:19] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 1
    [09/08/22 13:59:19] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 2
    [09/08/22 13:59:19] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 3
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:1
    with 4 nodes.
    [09/08/22 13:59:19] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:1
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 0
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 1
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 2
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 3
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:2
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:2
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:2
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:2
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 3
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 2
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:3
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:3
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 0
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 1
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:4 to store for rank: 2
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:4 to store for rank: 3
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:3
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:3
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:4 to store for rank: 1
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:4 to store for rank: 0
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:4
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:4
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:5 to store for rank: 1
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:5 to store for rank: 0
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:4
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:4
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:5 to store for rank: 3
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:5
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:5 to store for rank: 2
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 3
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:5
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:5
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:5
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 0
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 1
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 2
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:6
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:6
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:6
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 2
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 0
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 1
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:6
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 3
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:7
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 3
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:7
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:7
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:7
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 0
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 2
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 1
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:8
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:9 to store for rank: 0
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:8
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:8
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:9 to store for rank: 2
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:9 to store for rank: 1
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:8
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:9 to store for rank: 3
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:9
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:10 to store for rank: 3
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:9
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:10 to store for rank: 0
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:9
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:9
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:10 to store for rank: 2
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:10 to store for rank: 1
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:10
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:10
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:11 to store for rank: 1
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:10
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:11 to store for rank: 3
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:11 to store for rank: 2
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:10
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:11 to store for rank: 0
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:11
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:11
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:11
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:11
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:12 to store for rank: 0
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:12 to store for rank: 1
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:12 to store for rank: 2
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:12 to store for rank: 3
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:12
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:12
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:12
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:12
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 0
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 2
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 3
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 1
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:13
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:13
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:13
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:13
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 0
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 1
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 2
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 3
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:14
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:14
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:14
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:14
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 3
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 0
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 2
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 1
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:15
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:15
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:15
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:15
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 0
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 1
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 3
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 2
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:16
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:16
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:16
    with 4 nodes.
    INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:16
    with 4 nodes.
    INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
    INFO colossalai - colossalai - INFO: process rank 3 is bound to device 3
    INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
    INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
    [09/08/22 13:59:20] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
    INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
    [09/08/22 13:59:20] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
    INFO colossalai - colossalai - INFO: process rank 2 is bound to device 2

SusuXu avatar Sep 08 '22 14:09 SusuXu
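
One way to narrow down where a hang like this starts is to check whether every rank finished every store-based barrier recorded in the log. The sketch below is only an illustration: `incomplete_barriers` is a hypothetical helper, not part of ColossalAI or EnergonAI, and it assumes the `Rank N: Completed store-based barrier for key:store_based_barrier_key:K` line format shown in the log above and a known world size (4 here, matching `tp_init_size`).

```python
import re
from collections import defaultdict

# Matches log lines such as:
#   "... INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:1"
COMPLETED = re.compile(
    r"Rank (\d+): Completed store-based barrier for key:store_based_barrier_key:(\d+)"
)


def incomplete_barriers(log_text, world_size):
    """Return {barrier_key: [missing ranks]} for barriers some rank did not complete.

    A barrier that no rank completed never shows up in the "Completed" lines,
    so it cannot be reported here.
    """
    done = defaultdict(set)
    for rank, key in COMPLETED.findall(log_text):
        done[int(key)].add(int(rank))

    expected = set(range(world_size))
    return {
        key: sorted(expected - ranks)
        for key, ranks in sorted(done.items())
        if ranks != expected
    }


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python check_barriers.py < server.log
    print(incomplete_barriers(sys.stdin.read(), world_size=4))
```

If the result is an empty dict, process-group setup probably finished on all ranks, which would suggest looking for the stall in a later step rather than in the initial rendezvous.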

Hi, we are currently refactoring the code. For now, the server and the inference engine compete for the CPU, which may cause lag. This will be fixed soon.

ver217 avatar Sep 13 '22 04:09 ver217
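
To tell a true deadlock apart from slow progress caused by CPU contention, it can help to see where each Python process is actually stuck. The following is a minimal sketch using only the standard-library `faulthandler` module; the `hang_watchdog` name and the idea of wrapping the engine launch with it are assumptions for illustration, not something EnergonAI provides.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, file=sys.stderr):
    """Dump the stack of every thread each time timeout_s elapses inside the block.

    `file` must be backed by a real file descriptor, because faulthandler
    writes to the descriptor directly.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
    try:
        yield
    finally:
        # Disarm the watchdog once the block finishes (or raises).
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the suspect call in each worker process, for example `with hang_watchdog(120): run_step()` where `run_step` is a placeholder for whatever call never returns, prints a traceback every two minutes while the call is still running. Identical tracebacks on every dump point to a blocked call; changing ones suggest the process is only slow.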

Has the issue been fixed?

semal avatar Mar 22 '23 06:03 semal

Has the issue been fixed?

Yes, it has been fixed. We have made many updates (see https://github.com/hpcaitech/EnergonAI). This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 13 '23 04:04 binmakeswell