
ZERO3 + Offload CPU Error when fine-tuning InternLM-XComposer2

Open Coobiw opened this issue 1 year ago • 2 comments

Hi, thanks for your great work! When fine-tuning InternLM-XComposer2 (unfreezing the projector and the whole LLM, freezing the ViT), I use ZeRO-3 and offload the optimizer to the CPU to avoid OOM (by changing the device at https://github.com/InternLM/InternLM-XComposer/blob/main/InternLM-XComposer-2.0/finetune/ds_config_zero2.json#L17 to cpu). This raises the error below, while the original ds_config_zero2.json does not. How can I solve it? Thanks for your advice and reply!
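For reference, the modification described above amounts to something like the following fragment (a sketch only, using the standard DeepSpeed `zero_optimization` schema; only the `stage` and `offload_optimizer` keys differ from the stock zero2 config):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```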

Error Message:

Traceback (most recent call last):
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/qbw/ChartLLM/InternLM-XComposer/finetune/finetune_smoe.py", line 396, in <module>
    train()
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/qbw/ChartLLM/InternLM-XComposer/finetune/finetune_smoe.py", line 297, in train
    model = transformers.AutoModelForCausalLM.from_pretrained(
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/miniconda3/envs/intern_clean/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/miniconda3/envs/intern_clean/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2966, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/miniconda3/envs/intern_clean/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/qbw/cache/huggingface/modules/transformers_modules/internlm-xcomposer2-vl-7b/modeling_internlm_xcomposer2.py", line 67, in __init__
    self.vit = build_vision_tower()
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/qbw/cache/huggingface/modules/transformers_modules/internlm-xcomposer2-vl-7b/build_mlp.py", line 11, in build_vision_tower
    return CLIPVisionTower(vision_tower)
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/miniconda3/envs/intern_clean/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/qbw/cache/huggingface/modules/transformers_modules/internlm-xcomposer2-vl-7b/build_mlp.py", line 59, in __init__
    self.resize_pos()
  File "/data/FinAi_Mapping_Knowledge/qiyiyan/qbw/cache/huggingface/modules/transformers_modules/internlm-xcomposer2-vl-7b/build_mlp.py", line 85, in resize_pos
    pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size,
RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 24, 24, 0] because the unspecified dimension size -1 can be any value and is ambiguous
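The shape `[-1, 24, 24, 0]` in the error is the tell: under ZeRO-3, `deepspeed.zero.Init` partitions parameters away at construction time, so when `resize_pos` runs, the position-embedding weight it reads is a 0-element tensor and the embed dim derived from it is 0. Reshaping a 0-element tensor with a `-1` dimension is then ambiguous, because `-1` could be any value. A minimal NumPy illustration of the same failure mode (NumPy stands in for torch here):

```python
import numpy as np

# Stand-in for the ZeRO-3-partitioned position-embedding weight:
# the local shard holds 0 elements, so the embed dim read from it is 0.
pos_tokens = np.zeros(0)
orig_size, embed_dim = 24, 0

try:
    pos_tokens.reshape(-1, orig_size, orig_size, embed_dim)
except ValueError as e:
    # -1 is ambiguous when the remaining dims multiply to 0
    print("reshape failed:", e)
```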

Coobiw avatar Jul 12 '24 19:07 Coobiw

I got a similar error, did you manage to fix this?

YerongLi avatar Sep 15 '24 22:09 YerongLi

I found this is caused by ZeRO-3; it happens with both plain ZeRO-3 and ZeRO-3 with offload.
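A common workaround for this class of ZeRO-3 error is to materialize the partitioned weights before code that inspects or reshapes them, e.g. with `deepspeed.zero.GatheredParameters`. A hedged sketch (the helper name `gather_params_if_zero3` is hypothetical and not part of this repo; `GatheredParameters` itself is standard DeepSpeed API):

```python
import contextlib

def gather_params_if_zero3(params):
    """Context manager gathering ZeRO-3-partitioned parameters so their
    full shapes are visible inside the block (hypothetical helper).
    Falls back to a no-op when DeepSpeed is not installed."""
    try:
        import deepspeed
        # Temporarily reassembles partitioned weights; modifier_rank=0
        # lets rank 0 modify them and broadcasts the result on exit.
        return deepspeed.zero.GatheredParameters(list(params), modifier_rank=0)
    except ImportError:
        return contextlib.nullcontext()

# Schematically, inside resize_pos in build_mlp.py one could then wrap
# the offending reshape (exact attribute path may differ):
#
#     with gather_params_if_zero3(self.vision_tower.parameters()):
#         pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embed_dim)
```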

YerongLi avatar Oct 26 '24 08:10 YerongLi