Deadlock during training or inference
When I train the model or run inference in a Docker container, the process runs forever and appears to deadlock. It occupies all four GPUs at 100% GPU utilization but uses only about 1200 MB of GPU memory on each GPU. Do you have any idea why it deadlocks? I suspect a multiprocessing issue.
sudo docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config hpcaitech/energon-ai:latest
++ dirname /config/server.sh
+ cd /config
+ export BASE=/config
+ BASE=/config
+ export PYTHONPATH=/config
+ PYTHONPATH=/config
+ energonai service init --config_file=/config/opt_config.py
*** Energon Init Configurations: ***
opt_30B : <function opt_30B at 0x7fa870a60550>
opt_125M : <function opt_125M at 0x7fa870a604c0>
opt_175B : <function opt_175B at 0x7fa870a60670>
launch_engine : <function launch_engine at 0x7fa85e379b80>
model_class : <function opt_125M at 0x7fa870a604c0>
model_type : gpt
host : 127.0.0.1
port : 29402
half : True
checkpoint : home/susu/opt_metaseq_125m/model/restored.pt
backend : nccl
tp_init_size : 4
pp_init_size : 1
engine_server : <function launch_engine at 0x7fa85e379b80>
tokenizer_path : facebook/opt-30b
server_host : 0.0.0.0
server_port : 8020
log_level : info
allow_cors : True
executor_max_batch_size : 16
cache_size : 50
cache_list_size : 2
timeout_keep_alive : 180
executor_max_queue_size : 0
fixed_cache_keys : [('Question: What is the name of the largest continent on earth?\nAnswer: Asia\n\nQuestion: What is at the center of the solar system?\nAnswer:', 64), ('A chat between a salesman and a student.\n\nSalesman: Hi boy, are you looking for a new phone?\nStudent: Yes, my phone is not functioning well.\nSalesman: What is your budget? \nStudent: I have received my scholarship so I am fine with any phone.\nSalesman: Great, then perhaps this latest flagship phone is just right for you.', 64), ("English: I am happy today.\nChinese: 我今天很开心。\n\nEnglish: I am going to play basketball.\nChinese: 我一会去打篮球。\n\nEnglish: Let's celebrate our anniversary.\nChinese:", 64)]
max_batch_size : 32
dtype : torch.float16
rm_padding : False
seed : 1024
verbose : True
trt_sample : None
Downloading vocab.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 878k/878k [00:00<00:00, 20.4MB/s]
Downloading merges.txt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 446k/446k [00:00<00:00, 10.6MB/s]
Downloading special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 221/221 [00:00<00:00, 149kB/s]
Downloading tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 522kB/s]
Downloading config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 651/651 [00:00<00:00, 412kB/s]
[09/08/22 13:59:19] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 1
[09/08/22 13:59:19] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 2
[09/08/22 13:59:19] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:1
with 4 nodes.
[09/08/22 13:59:19] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:1
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:2
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:2
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:2
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:2
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:3
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:3
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:4 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:4 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:3
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:3
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:4 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:4 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:4
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:4
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:5 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:5 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:4
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:4
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:5 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:5
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:5 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:5
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:5
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:5
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:6
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:6
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:6
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:6
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:7
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:7
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:7
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:7
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:8
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:9 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:8
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:8
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:9 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:9 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:8
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:9 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:9
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:10 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:9
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:10 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:9
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:9
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:10 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:10 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:10
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:10
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:11 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:10
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:11 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:11 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:10
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:11 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:11
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:11
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:11
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:11
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:12 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:12 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:12 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:12 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:12
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:12
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:12
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:12
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:13
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:13
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:13
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:13
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:14
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:14
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:14
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:14
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 3 is bound to device 3
INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
[09/08/22 13:59:20] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[09/08/22 13:59:20] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 2 is bound to device 2
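The log stops after device binding, so it does not show where each rank is stuck. One way to narrow this down is a minimal sketch like the following, which uses Python's standard `faulthandler` module to write the stack of every thread to a file if the process is still running after a timeout. The helper names `enable_hang_dump` and `disable_hang_dump`, the log path, and the timeout are hypothetical, and this is not part of EnergonAI; it would have to be called early in each worker process.

```python
import faulthandler


def enable_hang_dump(log_path, timeout_s=300.0):
    """Dump all thread stacks to log_path every timeout_s seconds.

    Hypothetical helper for diagnosing hangs; not an EnergonAI API.
    The returned file object must stay open while the watchdog is active.
    """
    log_file = open(log_path, "w")
    # repeat=True keeps writing a fresh traceback each time the timeout elapses,
    # so a long hang produces several snapshots that can be compared.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    return log_file


def disable_hang_dump(log_file):
    """Stop the watchdog and close the dump file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()
```

If every snapshot shows the same frame (for example a collective call or a queue read), that frame is a reasonable first suspect. Using a separate log path per rank keeps the dumps from interleaving.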
Hi, we are refactoring the code. The server and the inference engine currently contend for the CPU, which may cause lag. This will be fixed soon.
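Until a fix lands, a generic way to reduce CPU contention between a server process and several worker processes is to cap the number of CPU threads each worker may use. The sketch below is an assumption about a possible mitigation, not the fix the maintainers applied: the helper names are hypothetical, and `OMP_NUM_THREADS` / `MKL_NUM_THREADS` only take effect if they are set before the numerical libraries initialize their thread pools.

```python
import os


def available_cpus():
    """Number of CPUs this process may run on."""
    # sched_getaffinity reflects CPU pinning (e.g. docker --cpuset-cpus) on Linux,
    # but not CFS quotas (docker --cpus); fall back to cpu_count elsewhere.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def threads_per_worker(num_workers, reserved=1, total=None):
    """Split the CPUs left after reserving some for the server evenly across workers."""
    total = available_cpus() if total is None else total
    usable = max(total - reserved, 1)
    return max(usable // num_workers, 1)


def cap_worker_threads(num_workers, reserved=1):
    """Set thread-pool limits via environment variables (hypothetical helper)."""
    n = threads_per_worker(num_workers, reserved)
    os.environ["OMP_NUM_THREADS"] = str(n)
    os.environ["MKL_NUM_THREADS"] = str(n)
    return n
```

With the configuration shown in this issue (`tp_init_size : 4`), `cap_worker_threads(4)` would be called before the workers are spawned, leaving one core for the HTTP server.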
Has the issue been fixed?
Yes, it has been fixed, and we have made many updates since then. See https://github.com/hpcaitech/EnergonAI. This issue was closed due to inactivity. Thanks.