
[BUG]: Fail to load huggingface pretraining when use shardinit

Open sega-hsj opened this issue 2 years ago • 14 comments

🐛 Describe the bug

        world_size = torch.distributed.get_world_size()
        shard_pg = ProcessGroup(tp_degree=world_size) if args.shardinit else None
        default_dist_spec = ShardSpec([-1], [world_size]) if args.shardinit else None

        with ColoInitContext(device=get_current_device(),
                             dtype=torch.half,
                             default_dist_spec=default_dist_spec,
                             default_pg=shard_pg):
            model = BloomForCausalLM.from_pretrained(args.model_name_or_path)

When using shardinit, the model is first sharded across multiple GPUs and only then is the Hugging Face pretrained checkpoint loaded, so a checkpoint size mismatch occurs.

RuntimeError: Error(s) in loading state_dict for BloomForCausalLM:
        size mismatch for transformer.word_embeddings.weight: copying a param 
with shape torch.Size([46145, 4096]) from checkpoint, the shape in current model
is torch.Size([46145, 512]).
        size mismatch for transformer.word_embeddings_layernorm.weight: copying 
a param with shape torch.Size([4096]) from checkpoint, the shape in current 
model is torch.Size([512]).
        size mismatch for transformer.word_embeddings_layernorm.bias: copying a 
param with shape torch.Size([4096]) from checkpoint, the shape in current model 
is torch.Size([512]).
        size mismatch for transformer.h.0.input_layernorm.weight: copying a 
param with shape torch.Size([4096]) from checkpoint, the shape in current model 
is torch.Size([512]).

I would like to know how to successfully load the Hugging Face pretrained checkpoint when using shardinit; it seems necessary when fine-tuning a very large model.

Environment

No response

sega-hsj avatar Feb 16 '23 11:02 sega-hsj

Hi, any update?

ShinoharaHare avatar Mar 17 '23 20:03 ShinoharaHare

The opt-30 example with --shardinit hits the same error. How can it be fixed?

donghucey avatar Mar 18 '23 07:03 donghucey

I also encountered this problem. Any solution?

caoyu-noob avatar Mar 20 '23 07:03 caoyu-noob

any update? @YuliangLiu0306 @FrankLeeeee

lwmlyy avatar Mar 21 '23 12:03 lwmlyy

A workaround is to construct the model first and then load the weights manually.

with ColoInitContext(
    device=get_current_device(),
    dtype=torch.half,
    default_pg=ProcessGroup(tp_degree=world_size),
    default_dist_spec=ShardSpec([-1], [world_size]),
):
    model = BloomForCausalLM(BloomConfig.from_pretrained(pretrained_path))
    for n, p in model.named_parameters():
        x = state_dict[n]
        x = x.chunk(world_size, dim=-1)
        x = x[global_rank]
        p.data.copy_(x)

ShinoharaHare avatar Mar 21 '23 13:03 ShinoharaHare

@ShinoharaHare Thanks for the reply. I'll give it a try. By the way, did you compare the acceleration performance of this strategy against Megatron?

lwmlyy avatar Mar 22 '23 02:03 lwmlyy

Hi @ShinoharaHare, I ran into the same error. Thanks for your solution, I will give it a try. I still have one question about "x = state_dict[n]": does it mean deserializing the Hugging Face model into a state_dict, e.g. state_dict = torch.load("xxx") (to CPU, perhaps), before the ColoInitContext step?

taishiciR avatar Mar 24 '23 02:03 taishiciR

@ShinoharaHare Thanks for the reply. I'll give it a try. By the way, did you compare the acceleration performance of this strategy against Megatron?

Nope, didn't test.

Hi @ShinoharaHare, I ran into the same error. Thanks for your solution, I will give it a try. I still have one question about "x = state_dict[n]": does it mean deserializing the Hugging Face model into a state_dict, e.g. state_dict = torch.load("xxx") (to CPU, perhaps), before the ColoInitContext step?

Yes, that's correct.
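For reference, a minimal sketch of that preparation step (the single-file checkpoint name and map_location are assumptions, adjust them to your checkpoint layout):

import torch

# Assumed single-file checkpoint name; a sharded checkpoint would need every
# shard loaded and merged into one dict. Load to CPU before entering ColoInitContext.
state_dict = torch.load(f"{pretrained_path}/pytorch_model.bin", map_location="cpu")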

And there is a better, though more cumbersome, way:

  1. Convert the pretrained weights into safetensors.
  2. Use LazyInitContext together with ColoInitContext to construct the model faster (you might need to pass to_meta=False).
  3. Use the get_slice API from safetensors to load only the required parts on each rank. This way, the weights can be loaded directly onto the GPUs without first being loaded into CPU memory (a rough sketch follows below).
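
A rough sketch of step 3, assuming the weights were converted to a single model.safetensors file and that every parameter is sharded along its last dimension (both are assumptions, adapt them to your layout; older safetensors versions may require device="cpu"):

import torch.distributed as dist
from safetensors import safe_open

rank = dist.get_rank()
world_size = dist.get_world_size()

# `model` is assumed to have been constructed already (steps 1-2 above), and
# parameter names are assumed to match the keys stored in the safetensors file.
with safe_open("model.safetensors", framework="pt", device=f"cuda:{rank}") as f:
    for n, p in model.named_parameters():
        t = f.get_slice(n)                 # lazy handle, nothing is read yet
        shape = t.get_shape()
        shard = shape[-1] // world_size    # size of this rank's slice along the last dim
        start = rank * shard
        if len(shape) == 1:
            p.data.copy_(t[start:start + shard])
        else:
            p.data.copy_(t[:, start:start + shard])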

ShinoharaHare avatar Mar 24 '23 05:03 ShinoharaHare

Hi guys, we are developing lazy initialization of models, which provides a much better user experience and will come soon (hopefully next week).

binmakeswell avatar Apr 07 '23 07:04 binmakeswell

Hi @ver217, could you please help these guys with shardinit? Thanks.

binmakeswell avatar Apr 07 '23 07:04 binmakeswell

A workaround is to construct the model first and then load the weights manually.

with ColoInitContext(
    device=get_current_device(),
    dtype=torch.half,
    default_pg=ProcessGroup(tp_degree=world_size),
    default_dist_spec=ShardSpec([-1], [world_size]),
):
    model = BloomForCausalLM(BloomConfig.from_pretrained(pretrained_path))
    for n, p in model.named_parameters():
        x = state_dict[n]
        x = x.chunk(world_size, dim=-1)
        x = x[global_rank]
        p.data.copy_(x)

I modified the full-parameter fine-tuning process of Bloom according to your code, but I still encountered a dimension inconsistency issue. Specifically, x is one quarter of p, which is theoretically normal because I did split it into 4 chunks. How did you manage to run it successfully?

xrandx avatar Apr 10 '23 06:04 xrandx

Specifically, x is one quarter of p, which is theoretically normal because I did split it into 4 chunks

According to your description, I think your sharding size and chunking size didn't match. Specifically, the chunking size is four times greater than the sharding size.

In my code, I sharded the parameters with world_size

default_pg=ProcessGroup(tp_degree=world_size),
default_dist_spec=ShardSpec([-1], [world_size])

and then chunk x into world_size chunks

x = x.chunk(world_size, dim=-1)

As long as the world_size is consistent, there shouldn't be a problem.
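
If you want to double check this before copying, a quick sanity check like the following (just a sketch, reusing the same names as the workaround above) will surface any mismatch immediately:

for n, p in model.named_parameters():
    x = state_dict[n].chunk(world_size, dim=-1)[global_rank]
    # The sharded parameter and its checkpoint chunk must have identical shapes.
    assert p.shape == x.shape, f"{n}: shard {tuple(p.shape)} vs chunk {tuple(x.shape)}"
    p.data.copy_(x)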

ShinoharaHare avatar Apr 11 '23 21:04 ShinoharaHare

Specifically, x is one quarter of p, which is theoretically normal because I did split it into 4 chunks

According to your description, I think your sharding size and chunking size didn't match. Specifically, the chunking size is four times greater than the sharding size.

In my code, I sharded the parameters with world_size

default_pg=ProcessGroup(tp_degree=world_size),
default_dist_spec=ShardSpec([-1], [world_size])

and then chunk x into world_size chunks

x = x.chunk(world_size, dim=-1)

As long as the world_size is consistent, there shouldn't be a problem.

Thanks for your help. I have successfully fine-tuned using this method in a single-node multi-GPU environment. However, I am now encountering issues when saving the model in a multi-node multi-GPU environment, using the save_model method from applications/Chat/coati/trainer/strategies/colossalai.py. The error log is as follows:


Traceback (most recent call last):
  File "colossal_private/applications/Chat/examples/bloom_sft_train.py", line 303, in <module>
    train(args)
  File "colossal_private/applications/Chat/examples/bloom_sft_train.py", line 270, in train
    trainer.save_model(path=args.save_path, only_rank0=True, tokenizer=tokenizer)
  File "colossal_private/applications/Chat/coati/trainer/sft.py", line 260, in save_model
    self.strategy.save_model(model=self.model, path=path, only_rank0=only_rank0, tokenizer=tokenizer)
  File "colossal_private/applications/Chat/coati/trainer/strategies/colossalai.py", line 204, in save_model
    unwrapped_model = self._unwrap_model(model)
  File "colossal_private/applications/Chat/coati/trainer/strategies/colossalai.py", line 191, in _unwrap_model
    model = get_static_torch_model(model)
  File "colossal_private/colossalai/zero/gemini/utils.py", line 84, in get_static_torch_model
    state_dict = zero_ddp_model.state_dict(only_rank_0=only_rank_0)
  File "colossal_private/colossalai/zero/gemini/gemini_ddp.py", line 222, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars, only_rank_0)
  File "colossal_private/colossalai/zero/gemini/gemini_ddp.py", line 275, in _save_to_state_dict
    param_to_save_data = self._get_param_to_save_data(self.fp32_params, only_rank_0)
  File "colossal_private/colossalai/zero/gemini/gemini_ddp.py", line 245, in _get_param_to_save_data
    temp_chunk = get_temp_total_chunk_on_cuda(chunk)
  File "colossal_private/colossalai/zero/gemini/utils.py", line 25, in get_temp_total_chunk_on_cuda
    dist.all_gather(tensor_list=gather_list, tensor=shard_temp, group=chunk.torch_pg)
  File "work/miniconda3/envs/test/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2277, in all_gather
    work = group.allgather([tensor_list], [tensor])
RuntimeError: NCCL communicator was aborted on rank 0.  Original reason for failure was: NCCL error: remote process exited or there was a network error, NCCL version 2.14.3
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.

xrandx avatar Apr 12 '23 06:04 xrandx

LazyInit is a good new option. For large models that cause OOM during initialization, you can use "lazy init + from huggingface config": build the model lazily from the config, pass it through Gemini, and then load the checkpoint.

from colossalai.utils.model.experimental import LazyInitContext 

with LazyInitContext():
    model = xxx()

Then pass the model to Gemini. Note that you cannot use from_pretrained inside LazyInitContext().

The detailed documentation will come soon.
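
For reference, a rough sketch of that flow (the Gemini wrapper call below is a placeholder and the checkpoint filename is an assumption; the exact wrapper API depends on your ColossalAI version):

import torch
from transformers import BloomConfig, BloomForCausalLM
from colossalai.utils.model.experimental import LazyInitContext

# Build the model lazily from the config only; from_pretrained is not allowed here.
with LazyInitContext():
    model = BloomForCausalLM(BloomConfig.from_pretrained(pretrained_path))

# Wrap with Gemini (hypothetical helper standing in for the version-specific wrapper),
# then materialize the pretrained weights.
model = wrap_with_gemini(model)
state_dict = torch.load(f"{pretrained_path}/pytorch_model.bin", map_location="cpu")  # assumed filename
model.load_state_dict(state_dict, strict=False)  # key names may need adjusting for the wrapper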

binmakeswell avatar Apr 19 '23 10:04 binmakeswell

LazyInit is a good new option. For large models that cause OOM during initialization, you can use "lazy init + from huggingface config": build the model lazily from the config, pass it through Gemini, and then load the checkpoint.

from colossalai.utils.model.experimental import LazyInitContext 

with LazyInitContext():
    model = xxx()

Then pass the model to Gemini. Note that you cannot use from_pretrained inside LazyInitContext().

The detailed documentation will come soon.

How can I use LazyInitContext in train_sft.py? Path: https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/examples/train_sft.py

zhangyuanscall avatar May 16 '23 14:05 zhangyuanscall

A workaround is to construct the model first and then load the weights manually.

with ColoInitContext(
    device=get_current_device(),
    dtype=torch.half,
    default_pg=ProcessGroup(tp_degree=world_size),
    default_dist_spec=ShardSpec([-1], [world_size]),
):
    model = BloomForCausalLM(BloomConfig.from_pretrained(pretrained_path))
    for n, p in model.named_parameters():
        x = state_dict[n]
        x = x.chunk(world_size, dim=-1)
        x = x[global_rank]
        p.data.copy_(x)

Thanks for your solution. I tried it in a single-node 4-GPU environment with the coati SFT example, but I get a size mismatch error: before chunking, param x has size [vocab_size, 4096]; after chunking, x has size [vocab_size, 1024]; but param p has size [vocab_size, 256], so p and x do not match and a dimension inconsistency error is raised. Is my config wrong? Does something need to change?

This is my code:

state_dict = {}
# Load all 8 checkpoint shards and merge them into a single state dict.
for i in range(1, 9):
    state_dict.update(torch.load(f"{chatglm_model_path}/chatglm_6b/pytorch_model-0000{i}-of-00008.bin"))
if is_rank_0():
    print(f"------------------> load state dict: {len(state_dict)} ")

world_size = dist.get_world_size()
shard_pg = ProcessGroup(tp_degree=world_size) if self.shard_init else None
default_dist_spec = ShardSpec([-1], [world_size]) if self.shard_init else None

with ColoInitContext(device=get_current_device(),
                     dtype=torch.half,
                     default_pg=shard_pg,
                     default_dist_spec=default_dist_spec):
    model = ChatGLMForConditionalGeneration(ChatGLMConfig.from_pretrained(chatglm_model_path))
    world_size = dist.get_world_size()
    for n, p in model.named_parameters():  # p size is [m, 256]
        x = state_dict[n]                  # x size is [m, 4096], p size is [m, 256]
        x = x.chunk(world_size, dim=-1)
        x = x[dist.get_rank()]
        p.data.copy_(x)                    # x size is [m, 1024], p size is [m, 256], raises the error here

I printed shard_pg and default_dist_spec: shard_pg is ProcessGroup(ranks=[0, 1, 2, 3], rank=3, dp=1, tp=4) and default_dist_spec is DistSpec(dims=(-1,), num_partitions=(4,), placement=DistPlacementPattern.SHARD). They seem fine.

zhangyuanscall avatar May 17 '23 03:05 zhangyuanscall

Thanks for your help. I have successfully fine-tuned using this method in a single-node multi-GPU environment. However, I am now encountering issues when saving the model in a multi-node multi-GPU environment, using the save_model method from applications/Chat/coati/trainer/strategies/colossalai.py. The error log is as follows:

Could you share your bloom_sft_train.py code? I met the same error in a single-node 4-GPU environment.

zhangyuanscall avatar May 17 '23 04:05 zhangyuanscall