ColossalAI
[BUG]: Fail to load huggingface pretraining when use shardinit
🐛 Describe the bug
```python
world_size = torch.distributed.get_world_size()
shard_pg = ProcessGroup(tp_degree=world_size) if args.shardinit else None
default_dist_spec = ShardSpec([-1], [world_size]) if args.shardinit else None
with ColoInitContext(device=get_current_device(),
                     dtype=torch.half,
                     default_dist_spec=default_dist_spec,
                     default_pg=shard_pg):
    model = BloomForCausalLM.from_pretrained(args.model_name_or_path)
```
When shardinit is used, the model parameters are sharded across the GPUs at construction time and only then are the HuggingFace pretrained weights loaded, so the full-size checkpoint tensors no longer match the sharded parameters and a size mismatch occurs.
```
RuntimeError: Error(s) in loading state_dict for BloomForCausalLM:
    size mismatch for transformer.word_embeddings.weight: copying a param with shape torch.Size([46145, 4096]) from checkpoint, the shape in current model is torch.Size([46145, 512]).
    size mismatch for transformer.word_embeddings_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([512]).
    size mismatch for transformer.word_embeddings_layernorm.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([512]).
    size mismatch for transformer.h.0.input_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([512]).
```
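For reference, a quick sketch of where the 512 in the error comes from (assuming a world size of 8, which is what the 4096 -> 512 mismatch implies): ShardSpec([-1], [world_size]) splits the last dimension of each parameter across the ranks.

```python
import torch

world_size = 8  # assumed; inferred from the 4096 -> 512 mismatch above
full = torch.empty(46145, 4096)            # parameter shape in the HF checkpoint
shard = full.chunk(world_size, dim=-1)[0]  # per-rank shape under ShardSpec([-1], [world_size])
print(shard.shape)                         # torch.Size([46145, 512])
```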
I would like to know how to successfully load the HuggingFace pretrained weights when using shardinit, since this seems necessary for fine-tuning a very large model.
Environment
No response
Hi, any update?
The opt-30 example with --shardinit gives the same error. How can I fix it?
I also encountered this problem. Any solution?
any update? @YuliangLiu0306 @FrankLeeeee
A workaround is to construct the model first and then load the weights manually.
```python
with ColoInitContext(
    device=get_current_device(),
    dtype=torch.half,
    default_pg=ProcessGroup(tp_degree=world_size),
    default_dist_spec=ShardSpec([-1], [world_size]),
):
    # Construct from the config so no weights are loaded here
    # (no BloomForCausalLM.from_pretrained).
    model = BloomForCausalLM(BloomConfig.from_pretrained(pretrained_path))

# state_dict is the full HuggingFace checkpoint loaded to CPU beforehand,
# and global_rank is this process's rank (e.g. dist.get_rank()).
for n, p in model.named_parameters():
    x = state_dict[n]
    x = x.chunk(world_size, dim=-1)  # split along the sharded (last) dimension
    x = x[global_rank]               # keep only this rank's chunk
    p.data.copy_(x)
```
@ShinoharaHare Thx for the reply. I'll give it a try. By the way, did you compare the accelerate performance between this strategy and Megatron?
Hi @ShinoharaHare, I came across the same error. Thanks for your solution, I will give it a try. But I still have a question about `x = state_dict[n]`: does it mean deserializing the HuggingFace checkpoint into a state_dict, e.g. state_dict = torch.load("xxx") (to CPU maybe), before the ColoInitContext block?
> @ShinoharaHare Thx for the reply. I'll give it a try. By the way, did you compare the accelerate performance between this strategy and Megatron?
Nope, didn't test.
> Hi @ShinoharaHare, I came across the same error. Thanks for your solution, I will give it a try. But I still have a question about `x = state_dict[n]`: does it mean deserializing the HuggingFace checkpoint into a state_dict, e.g. state_dict = torch.load("xxx") (to CPU maybe), before the ColoInitContext block?
Yes, that's correct.
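For completeness, a minimal sketch of how that state_dict could be prepared before entering ColoInitContext (pretrained_path is a placeholder for a local HuggingFace checkpoint directory):

```python
import os
import torch
import torch.distributed as dist

pretrained_path = "/path/to/bloom"  # placeholder: local HF checkpoint directory

# Load the full HuggingFace checkpoint to CPU; for a sharded checkpoint,
# merge every pytorch_model-*.bin file into one state dict.
state_dict = {}
for fname in sorted(os.listdir(pretrained_path)):
    if fname.startswith("pytorch_model") and fname.endswith(".bin"):
        state_dict.update(torch.load(os.path.join(pretrained_path, fname), map_location="cpu"))

global_rank = dist.get_rank()  # this rank's index, used to pick its chunk later
```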
And there is a better but more cumbersome way:
- Convert the pretrained weights into safetensors.
- Use `LazyInitContext` with `ColoInitContext` to construct the model faster (you might need to pass `to_meta=False`).
- To load only the required parts on each rank, you can utilize the `get_slice` API from safetensors (see the sketch below). This way, the weights can be loaded directly onto the GPUs without first being loaded into CPU memory.
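A minimal sketch of the `get_slice` idea, assuming the weights were already converted to a single model.safetensors file, that sharding is along the last dimension, and that world_size, global_rank and model come from the earlier snippets:

```python
import torch
from safetensors import safe_open

# Read only this rank's shard of each parameter, directly onto the GPU.
with safe_open("model.safetensors", framework="pt",
               device=f"cuda:{torch.cuda.current_device()}") as f:
    for name, param in model.named_parameters():
        sl = f.get_slice(name)
        shape = sl.get_shape()                 # full shape in the checkpoint
        step = shape[-1] // world_size         # size of one shard along the last dim
        start = global_rank * step
        # 1D tensors (biases, layer norms) vs 2D weight matrices
        shard = sl[start:start + step] if len(shape) == 1 else sl[:, start:start + step]
        param.data.copy_(shard)
```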
Hi guys, we are developing lazy initialization of models, which provides a much better user experience and will come soon (hopefully next week).
Hi @ver217 Could you please help guys with shardinit? Thanks.
> A workaround is to construct the model first and then load the weights manually.
I modified the full-parameter fine-tuning process of Bloom according to your code, but I still encountered a dimension inconsistency issue. Specifically, x is one quarter of p, which is theoretically normal because I did split it into 4 chunks. How did you manage to run it successfully?
> Specifically, x is one quarter of p, which is theoretically normal because I did split it into 4 chunks
According to your description, I think your sharding size and chunking size didn't match. Specifically, the chunking size is four times greater than the sharding size.
In my code, I sharded the parameters with `world_size` (`default_pg=ProcessGroup(tp_degree=world_size)`, `default_dist_spec=ShardSpec([-1], [world_size])`) and then chunked `x` into `world_size` chunks (`x = x.chunk(world_size, dim=-1)`). As long as the `world_size` is consistent, there shouldn't be a problem.
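A quick way to verify this before copying (a small sketch; model, state_dict and world_size are the ones from the snippets above):

```python
import torch.distributed as dist

rank = dist.get_rank()
for n, p in model.named_parameters():
    x = state_dict[n].chunk(world_size, dim=-1)[rank]
    # If this assertion fails, the tp_degree passed to ColoInitContext and the
    # number of chunks used here are inconsistent.
    assert x.shape == p.shape, f"{n}: chunk {tuple(x.shape)} vs param {tuple(p.shape)}"
```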
Thanks for your help. I have successfully fine-tuned using this method in a single-node multi-GPU environment. However, I am now encountering issues when saving the model in a multi-node multi-GPU environment, using the save_model method from applications/Chat/coati/trainer/strategies/colossalai.py. The error log is as follows:
```
Traceback (most recent call last):
  File "colossal_private/applications/Chat/examples/bloom_sft_train.py", line 303, in <module>
    train(args)
  File "colossal_private/applications/Chat/examples/bloom_sft_train.py", line 270, in train
    trainer.save_model(path=args.save_path, only_rank0=True, tokenizer=tokenizer)
  File "colossal_private/applications/Chat/coati/trainer/sft.py", line 260, in save_model
    self.strategy.save_model(model=self.model, path=path, only_rank0=only_rank0, tokenizer=tokenizer)
  File "colossal_private/applications/Chat/coati/trainer/strategies/colossalai.py", line 204, in save_model
    unwrapped_model = self._unwrap_model(model)
  File "colossal_private/applications/Chat/coati/trainer/strategies/colossalai.py", line 191, in _unwrap_model
    model = get_static_torch_model(model)
  File "colossal_private/colossalai/zero/gemini/utils.py", line 84, in get_static_torch_model
    state_dict = zero_ddp_model.state_dict(only_rank_0=only_rank_0)
  File "colossal_private/colossalai/zero/gemini/gemini_ddp.py", line 222, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars, only_rank_0)
  File "colossal_private/colossalai/zero/gemini/gemini_ddp.py", line 275, in _save_to_state_dict
    param_to_save_data = self._get_param_to_save_data(self.fp32_params, only_rank_0)
  File "colossal_private/colossalai/zero/gemini/gemini_ddp.py", line 245, in _get_param_to_save_data
    temp_chunk = get_temp_total_chunk_on_cuda(chunk)
  File "colossal_private/colossalai/zero/gemini/utils.py", line 25, in get_temp_total_chunk_on_cuda
    dist.all_gather(tensor_list=gather_list, tensor=shard_temp, group=chunk.torch_pg)
  File "work/miniconda3/envs/test/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2277, in all_gather
    work = group.allgather([tensor_list], [tensor])
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: NCCL error: remote process exited or there was a network error, NCCL version 2.14.3
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
```
LazyInit is a good new option. For large models that cause OOM during initialization, you can use lazy init + the HuggingFace config, go through Gemini, and then load the checkpoint:

```python
from colossalai.utils.model.experimental import LazyInitContext

with LazyInitContext():
    model = xxx()
```

Then pass the model to Gemini. Note that you cannot use `from_pretrained` inside LazyInitContext().
The detailed documentation will come soon.
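For example, a minimal sketch of this idea for BLOOM (pretrained_path is a placeholder; the Gemini wrapping and the checkpoint loading are only outlined here, not shown in full):

```python
from transformers import BloomConfig, BloomForCausalLM
from colossalai.utils.model.experimental import LazyInitContext

pretrained_path = "/path/to/bloom"                      # placeholder: local HF checkpoint directory
config = BloomConfig.from_pretrained(pretrained_path)   # config only, no weights

with LazyInitContext():
    # Construct from the config; from_pretrained is not supported inside the context.
    model = BloomForCausalLM(config)

# Next steps (outlined): wrap the model with Gemini as usual, then load the
# checkpoint weights into the materialized model.
```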
How do I use LazyInitContext in train_sft.py? Path: https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/examples/train_sft.py
> A workaround is to construct the model first and then load the weights manually.
Thanks for your solution. I gave it a try in a single-node 4-GPU environment with the coati SFT example, but I got a size mismatch error: before chunking, x has size [vocab_size, 4096]; after chunking, x has size [vocab_size, 1024]; but p has size [vocab_size, 256], so p and x do not match and I get a dimension inconsistency issue. Is some config wrong that I need to change?
This is my code:

```python
state_dict = {}
state_dict.update(torch.load(f"{chatglm_model_path}/chatglm_6b/pytorch_model-00001-of-00008.bin"))
state_dict.update(torch.load(f"{chatglm_model_path}/chatglm_6b/pytorch_model-00002-of-00008.bin"))
state_dict.update(torch.load(f"{chatglm_model_path}/chatglm_6b/pytorch_model-00003-of-00008.bin"))
state_dict.update(torch.load(f"{chatglm_model_path}/chatglm_6b/pytorch_model-00004-of-00008.bin"))
state_dict.update(torch.load(f"{chatglm_model_path}/chatglm_6b/pytorch_model-00005-of-00008.bin"))
state_dict.update(torch.load(f"{chatglm_model_path}/chatglm_6b/pytorch_model-00006-of-00008.bin"))
state_dict.update(torch.load(f"{chatglm_model_path}/chatglm_6b/pytorch_model-00007-of-00008.bin"))
state_dict.update(torch.load(f"{chatglm_model_path}/chatglm_6b/pytorch_model-00008-of-00008.bin"))
if is_rank_0():
    print(f"------------------> load state dict: {len(state_dict)} ")

world_size = dist.get_world_size()
shard_pg = ProcessGroup(tp_degree=world_size) if self.shard_init else None
default_dist_spec = ShardSpec([-1], [world_size]) if self.shard_init else None
with ColoInitContext(device=get_current_device(),
                     dtype=torch.half,
                     default_pg=shard_pg,
                     default_dist_spec=default_dist_spec):
    model = ChatGLMForConditionalGeneration(ChatGLMConfig.from_pretrained(chatglm_model_path))

world_size = dist.get_world_size()
for n, p in model.named_parameters():  # p size is [m, 256]
    x = state_dict[n]                  # x size is [m, 4096]
    x = x.chunk(world_size, dim=-1)
    x = x[dist.get_rank()]
    p.data.copy_(x)                    # x size is [m, 1024], p size is [m, 256] -> reports an error
```
I printed shard_pg and default_dist_spec: shard_pg is ProcessGroup(ranks=[0, 1, 2, 3], rank=3, dp=1, tp=4) and default_dist_spec is DistSpec(dims=(-1,), num_partitions=(4,), placement=DistPlacementPattern.SHARD). They seem fine.
> Thanks for your help. I have successfully fine-tuned using this method in a single-node multi-GPU environment. However, I am now encountering issues when saving the model in a multi-node multi-GPU environment, using the save_model method from applications/Chat/coati/trainer/strategies/colossalai.py.
Can you share your bloom_sft_train.py code? I met the same error in a single-node 4-GPU environment.