                        [BUG]: load_checkpoint error
🐛 Describe the bug
GPU info: 3 nodes, 4 GPUs per node (GeForce RTX 2080 Ti), 12 GPUs in total; pp=3, tp=2, dp=2.
I used train_test.py from the ColossalAI-Examples project to produce a checkpoint file, and now I want to load it for testing, like this:
```python
trainer = Trainer(engine=engine, logger=logger, timer=timer)
last_epoch = 0
# Note: `and`, not `&` -- bitwise `&` binds tighter than `>`, so
# `len(args.from_cpt) > 0 & os.path.exists(...)` would not test what is intended.
if len(args.from_cpt) > 0 and os.path.exists(args.from_cpt):
    last_epoch = load_checkpoint(args.from_cpt, model, _, _, False)
```
but it raises an error; here is the error output:
```
Traceback (most recent call last):
  File "/workspace/ColossalAI-Examples/language/gpt/test_gpt.py", line 150, in <module>
```
Environment
```
CONDA_DEFAULT_ENV="base"
CONDA_PROMPT_MODIFIER="(base) "
CONDA_PYTHON_EXE="/opt/conda/bin/python"
CONDA_SHLVL="1"
CUDA_HOME="/usr/local/cuda"
CUDA_VERSION="11.6.1"
DATA="/workspace/gpt/traindata/train_data.json"
LD_LIBRARY_PATH="/root/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"
LESSOPEN="| /usr/bin/lesspipe %s"
LIBRARY_PATH="/usr/local/cuda/lib64/stubs"
NVARCH="x86_64"
NV_CUDA_COMPAT_PACKAGE="cuda-compat-11-6"
NV_CUDA_CUDART_DEV_VERSION="11.6.55-1"
NV_CUDA_CUDART_VERSION="11.6.55-1"
NV_CUDA_LIB_VERSION="11.6.1-1"
NV_CUDNN_PACKAGE_DEV="libcudnn8-dev=8.4.0.27-1+cuda11.6"
NV_CUDNN_PACKAGE="libcudnn8=8.4.0.27-1+cuda11.6"
NV_CUDNN_PACKAGE_NAME="libcudnn8"
NV_CUDNN_VERSION="8.4.0.27"
NVIDIA_DRIVER_CAPABILITIES="compute,utility"
NVIDIA_REQUIRE_CUDA="cuda>=11.6 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471"
NV_LIBCUBLAS_DEV_PACKAGE="libcublas-dev-11-6=11.8.1.74-1"
NVIDIA_VISIBLE_DEVICES="all"
NV_LIBCUBLAS_DEV_PACKAGE_NAME="libcublas-dev-11-6"
NV_LIBCUBLAS_DEV_VERSION="11.8.1.74-1"
NV_LIBCUBLAS_PACKAGE="libcublas-11-6=11.8.1.74-1"
NV_LIBCUBLAS_PACKAGE_NAME="libcublas-11-6"
NV_LIBCUBLAS_VERSION="11.8.1.74-1"
NV_LIBCUSPARSE_DEV_VERSION="11.7.2.112-1"
NV_LIBCUSPARSE_VERSION="11.7.2.112-1"
NV_LIBNCCL_DEV_PACKAGE="libnccl-dev=2.12.7-1+cuda11.6"
NV_LIBNCCL_DEV_PACKAGE_NAME="libnccl-dev"
NV_LIBNCCL_DEV_PACKAGE_VERSION="2.12.7-1"
NV_LIBNCCL_PACKAGE="libnccl2=2.12.7-1+cuda11.6"
NV_LIBNCCL_PACKAGE_NAME="libnccl2"
NV_LIBNCCL_PACKAGE_VERSION="2.12.7-1"
NV_LIBNPP_DEV_PACKAGE="libnpp-dev-11-6=11.6.2.112-1"
NV_LIBNPP_DEV_VERSION="11.6.2.112-1"
NV_LIBNPP_PACKAGE="libnpp-11-6=11.6.2.112-1"
NV_LIBNPP_VERSION="11.6.2.112-1"
NCCL_VERSION="2.12.7-1"
NV_NVML_DEV_VERSION="11.6.55-1"
NV_NVPROF_DEV_PACKAGE="cuda-nvprof-11-6=11.6.112-1"
NV_NVPROF_VERSION="11.6.112-1"
NV_NVTX_VERSION="11.6.112-1"
PATH="/opt/conda/bin:/opt/conda/condabin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
```
Could you share the contents of your config file?
Yes, I used this config file:
```python
from colossalai.amp import AMP_TYPE
from colossalai.zero.shard_utils import TensorShardStrategy
from titans.loss.lm_loss import GPTLMLoss
from titans.model.gpt import gpt2_xl, gpt2_small
from torch.optim import Adam
from titans.loss.vocab_cross_entropy import vocab_parallel_cross_entropy
import torch

BATCH_SIZE = 8
NUM_EPOCHS = 60
SEQ_LEN = 1024
NUM_MICRO_BATCHES = 4
HIDDEN_SIZE = 768
PIPELINE = 3
TENSOR_PARALLEL = 2
MODE = '1d'

fp16 = dict(mode=AMP_TYPE.NAIVE)

parallel = dict(pipeline=PIPELINE, tensor=dict(mode=MODE, size=TENSOR_PARALLEL))

optimizer = dict(
    type=Adam,
    lr=0.00015,
    weight_decay=1e-2,
)

model = dict(
    type=gpt2_xl,
    checkpoint=True,
    dtype=torch.half,
)

loss_fn = dict(type=vocab_parallel_cross_entropy)
```
I think tp + pp mode is not well supported in this example. If you have spare compute, you can increase the DP dimension instead!
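For context, a hypothetical sketch (not from this thread) of what that suggestion could look like: ColossalAI does not set the data-parallel size explicitly in the config; it is inferred as world_size / (pipeline size × tensor size), so with the 12 GPUs used here, shrinking the pipeline raises the implied DP:

```python
# Hypothetical config sketch. ColossalAI infers the data-parallel size as
#   dp = world_size / (pipeline_size * tensor_size)
# With 12 GPUs (3 nodes x 4 GPUs), shrinking pp from 3 to 2 while keeping
# tp = 2 raises the implied dp from 2 to 3.
PIPELINE = 2
TENSOR_PARALLEL = 2

parallel = dict(
    pipeline=PIPELINE,                             # 2 pipeline stages
    tensor=dict(mode='1d', size=TENSOR_PARALLEL),  # 2-way 1D tensor parallelism
)
# implied dp = 12 / (2 * 2) = 3
```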
With tp+pp mode, how can I save and load a checkpoint for testing or prediction? Are there any examples?
Following that suggestion, I tried running with pp=4, tp=1, dp=3, using this config:
```python
from colossalai.amp import AMP_TYPE
from titans.loss.lm_loss import GPTLMLoss
from titans.model.gpt import gpt2_small, gpt2_large, gpt2_xl
# from model_zoo.gpt.gpt import gpt2_small_pipeline
from torch.optim import Adam

BATCH_SIZE = 8
SEQ_LEN = 1024
NUM_EPOCHS = 60
HIDDEN_SIZE = 768
NUM_MICRO_BATCHES = 4
PIPELINE = 4

optimizer = dict(
    type=Adam,
    lr=0.00015,
    weight_decay=1e-2,
)

fp16 = dict(mode=AMP_TYPE.NAIVE)

loss = dict(type=GPTLMLoss)

model = dict(
    type=gpt2_large,
    checkpoint=True,
)

parallel = dict(
    pipeline=PIPELINE,
    tensor=dict(size=1, mode=None),
)
```
but it also raises an error:
```
RuntimeError: Error(s) in loading state_dict for PipelinableModel:
    Missing key(s) in state_dict: "_module_list.0.norm1.weight", "_module_list.0.norm1.bias", "_module_list.0.attn.query_key_value.weight", "_module_list.0.attn.query_key_value.bias", "_module_list.0.attn.dense.weight", "_module_list.0.attn.dense.bias", "_module_list.0.norm2.weight", "_module_list.0.norm2.bias", "_module_list.0.mlp.linear_1.weight", "_module_list.0.mlp.linear_1.bias", "_module_list.0.mlp.linear_2.weight", "_module_list.0.mlp.linear_2.bias", "_module_list.1.norm1.weight", "_module_list.1.norm1.bias", "_module_list.1.attn.query_key_value.weight", "_module_list.1.attn.query_key_value.bias", "_module_list.1.attn.dense.weight", "_module_list.1.attn.dense.bias", "_module_list.1.norm2.weight", "_module_list.1.norm2.bias", "_module_list.1.mlp.linear_1.weight", "_module_list.1.mlp.linear_1.bias", "_module_list.1.mlp.linear_2.weight", "_module_list.1.mlp.linear_2.bias", "_module_list.2.norm1.weight", "_module_list.2.norm1.bias", "_module_list.2.attn.query_key_value.weight", "_module_list.2.attn.query_key_value.bias", "_module_list.2.attn.dense.weight", "_module_list.2.attn.dense.bias", "_module_list.2.norm2.weight", "_module_list.2.norm2.bias", "_module_list.2.mlp.linear_1.weight", "_module_list.2.mlp.linear_1.bias", "_module_list.2.mlp.linear_2.weight", "_module_list.2.mlp.linear_2.bias", "_module_list.3.norm1.weight", "_module_list.3.norm1.bias", "_module_list.3.attn.query_key_value.weight", "_module_list.3.attn.query_key_value.bias", "_module_list.3.attn.dense.weight", "_module_list.3.attn.dense.bias", "_module_list.3.norm2.weight", "_module_list.3.norm2.bias", "_module_list.3.mlp.linear_1.weight", "_module_list.3.mlp.linear_1.bias", "_module_list.3.mlp.linear_2.weight", "_module_list.3.mlp.linear_2.bias".
```
Hi @readme2gh, could you please try setting `strict=False` in `load_checkpoint`?
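For reference, a minimal sketch of that call, assuming the positional signature (file, model, optimizer, lr_scheduler, strict) used in the snippets above, and that `load_checkpoint` is importable from `colossalai.utils` as the tracebacks suggest:

```python
from colossalai.utils import load_checkpoint

# strict=False is forwarded to load_state_dict, so keys absent on this
# rank (e.g. layers owned by other pipeline stages) are skipped instead
# of raising the "Missing key(s)" RuntimeError shown above.
last_epoch = load_checkpoint(args.from_cpt, model, None, None, strict=False)
```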
@kurisusnowdeng Yes, when I set `strict=False`, the error looks like this:
```
Traceback (most recent call last):
  File "/workspace/ColossalAI-Examples/language/gpt/test_gpt.py", line 150, in <module>
    main()
  File "/workspace/ColossalAI-Examples/language/gpt/test_gpt.py", line 128, in main
    last_epoch = load_checkpoint(args.from_cpt, model, _, _, False)
  File "/opt/conda/lib/python3.9/site-packages/colossalai/utils/checkpointing.py", line 276, in load_checkpoint
    raise e
  File "/opt/conda/lib/python3.9/site-packages/colossalai/utils/checkpointing.py", line 263, in load_checkpoint
    broadcast_model(model)
  File "/opt/conda/lib/python3.9/site-packages/colossalai/utils/checkpointing.py", line 200, in broadcast_model
    dist.broadcast(p, src_rank, group=group)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1197, in broadcast
    work = group.broadcast([tensor], opts)
RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer

Traceback (most recent call last):
  File "/workspace/ColossalAI-Examples/language/gpt/test_gpt.py", line 150, in <module>
    main()
  File "/workspace/ColossalAI-Examples/language/gpt/test_gpt.py", line 128, in main
    last_epoch = load_checkpoint(args.from_cpt, model, _, _, False)
  File "/opt/conda/lib/python3.9/site-packages/colossalai/utils/checkpointing.py", line 279, in load_checkpoint
    state_dict = broadcast_state_dict(state_dict, ParallelMode.MODEL)
  File "/opt/conda/lib/python3.9/site-packages/colossalai/utils/checkpointing.py", line 23, in broadcast_state_dict
    dist.broadcast_object_list(state_dict, src=src_rank, group=gpc.get_cpu_group(parallel_mode))
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1877, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1201, in broadcast
    work.wait()
```
Thank you. I fixed it by passing the model into the save-checkpoint function.
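For anyone landing here with the same error, a hedged sketch of the fix described above, assuming `colossalai.utils` exposes `save_checkpoint(file, epoch, model, ...)` as the counterpart of the `load_checkpoint` call used in this thread (`args.to_cpt` is a hypothetical argument name):

```python
from colossalai.utils import load_checkpoint, save_checkpoint

# Save: pass the live (pipeline-partitioned) model object so each rank
# serializes the parameters it actually owns. `args.to_cpt` is hypothetical.
save_checkpoint(args.to_cpt, epoch, model)

# Load: reuse the same model / parallel layout that produced the file.
last_epoch = load_checkpoint(args.from_cpt, model, None, None, strict=False)
```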