VisualGLM-6B

Running my own finetuned weights with cli_demo.py fails: The size of tensor a (12288) must match the size of tensor b (25165824) at non-singleton dimension 0

Open · shituo123456 opened this issue · 29 comments

[screenshot of the error traceback]

shituo123456 · Aug 03 '23

QLoRA weights can only run on CUDA, not on CPU. That is how bitsandbytes is implemented; there is nothing I can do about it.

1049451037 · Aug 03 '23
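
A minimal sketch of the constraint just described, assuming the checkpoint was saved by the QLoRA finetuning recipe (the guard is illustrative, not code from cli_demo.py):

    import torch

    # QLoRA checkpoints contain 4-bit weights that depend on bitsandbytes'
    # CUDA kernels, so they cannot be loaded or run on a CPU-only host.
    if not torch.cuda.is_available():
        raise RuntimeError(
            "This checkpoint was saved with QLoRA (bitsandbytes 4-bit); "
            "a CUDA device is required to load it."
        )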

QLoRA weights can only run on CUDA, not on CPU. That is how bitsandbytes is implemented; there is nothing I can do about it.

So if I use cli_demo.py to run the finetuned weights, I must not pass --quant; and if I do pass it, the model can only run on CPU. Is that right?

shituo123456 · Aug 03 '23

Don't pass --quant with QLoRA; they are two independent features. --quant is meant for models that were not trained with QLoRA.

1049451037 · Aug 03 '23
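
To make the distinction concrete, the two modes would be invoked roughly like this (the checkpoint path is a placeholder; per the repo's README, --quant takes 4 or 8):

    # quantize the original, non-QLoRA model at load time:
    python cli_demo.py --quant 4

    # run a QLoRA finetune: no --quant, its linear layers are already 4-bit
    python cli_demo.py --from_pretrained checkpoints/your-qlora-checkpoint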

Then how do I run my own finetuned weights?

shituo123456 · Aug 03 '23

Just run it directly, following the commands in the README:

https://github.com/THUDM/VisualGLM-6B#模型微调

1049451037 · Aug 03 '23

I have a 2080 Ti with 11 GB of VRAM; a single card can't hold the model, it goes OOM right away.

shituo123456 · Aug 03 '23

Then how did you manage to finetune it in the first place...

1049451037 · Aug 03 '23

I finetuned with QLoRA. So the saved weights are the qlora type, right? But when I run cli_demo.py without --quant it goes straight to OOM.

shituo123456 · Aug 03 '23

Yes, the saved weights are QLoRA.

1049451037 · Aug 03 '23

Can you tell where the OOM is raised? Maybe the generated sequence is too long and memory blows up.

1049451037 · Aug 03 '23

It fails right after running python cli_demo.py --from_pretrained xxx:

    [2023-08-03 14:26:06,452] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)

    ===================================BUG REPORT===================================
    Welcome to bitsandbytes. For bug reports, please run

    python -m bitsandbytes

    and submit this information together with your error trace to:
    https://github.com/TimDettmers/bitsandbytes/issues

    bin /root/anaconda3/envs/llm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda115.so
    /root/anaconda3/envs/llm/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /root/anaconda3/envs/llm did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
      warn(msg)
    CUDA SETUP: CUDA runtime path found: /root/.kiwi/lib/cuda11.5/lib64/libcudart.so
    CUDA SETUP: Highest compute capability among GPUs detected: 7.5
    CUDA SETUP: Detected CUDA version 115
    CUDA SETUP: Loading binary /root/anaconda3/envs/llm/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda115.so...
    [2023-08-03 14:26:08,291] [INFO] building FineTuneVisualGLMModel model ...
    [2023-08-03 14:26:08,293] [INFO] [RANK 0] > initializing model parallel with size 1
    [2023-08-03 14:26:08,294] [INFO] [RANK 0] You are using model-only mode. For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
    Traceback (most recent call last):
      File "/data/data_01/llm/VisualGLM-6B/cli_demo.py", line 103, in <module>
        main()
      File "/data/data_01/llm/VisualGLM-6B/cli_demo.py", line 30, in main
        model, model_args = AutoModel.from_pretrained(
      File "/root/anaconda3/envs/llm/lib/python3.9/site-packages/sat/model/base_model.py", line 310, in from_pretrained
        return cls.from_pretrained_base(name, args=args, home_path=home_path, url=url, prefix=prefix, build_only=build_only, overwrite_args=overwrite_args, **kwargs)
      File "/root/anaconda3/envs/llm/lib/python3.9/site-packages/sat/model/base_model.py", line 302, in from_pretrained_base
        model = get_model(args, model_cls, **kwargs)
      File "/root/anaconda3/envs/llm/lib/python3.9/site-packages/sat/model/base_model.py", line 352, in get_model
        model = model_cls(args, params_dtype=params_dtype, **kwargs)
      File "/data/data_01/llm/VisualGLM-6B/finetune_visualglm.py", line 13, in __init__
        super().__init__(args, transformer=transformer, parallel_output=parallel_output, **kw_args)
      File "/data/data_01/llm/VisualGLM-6B/model/visualglm.py", line 32, in __init__
        super().__init__(args, transformer=transformer, **kwargs)
      File "/root/anaconda3/envs/llm/lib/python3.9/site-packages/sat/model/official/chatglm_model.py", line 167, in __init__
        super(ChatGLMModel, self).__init__(args, transformer=transformer, activation_func=gelu, **kwargs)
      File "/root/anaconda3/envs/llm/lib/python3.9/site-packages/sat/model/base_model.py", line 92, in __init__
        self.transformer = BaseTransformer(
      File "/root/anaconda3/envs/llm/lib/python3.9/site-packages/sat/model/transformer.py", line 459, in __init__
        [get_layer(layer_id) for layer_id in range(num_layers)])
      File "/root/anaconda3/envs/llm/lib/python3.9/site-packages/sat/model/transformer.py", line 459, in <listcomp>
        [get_layer(layer_id) for layer_id in range(num_layers)])
      File "/root/anaconda3/envs/llm/lib/python3.9/site-packages/sat/model/transformer.py", line 432, in get_layer
        return BaseTransformerLayer(
      File "/root/anaconda3/envs/llm/lib/python3.9/site-packages/sat/model/transformer.py", line 336, in __init__
        self.mlp = MLP(
      File "/root/anaconda3/envs/llm/lib/python3.9/site-packages/sat/model/transformer.py", line 208, in __init__
        self.dense_h_to_4h = ColumnParallelLinear(
      File "/root/anaconda3/envs/llm/lib/python3.9/site-packages/sat/mpu/layers.py", line 253, in __init__
        self.weight = Parameter(torch.empty(self.output_size_per_partition,
    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 10.76 GiB total capacity; 10.12 GiB already allocated; 113.56 MiB free; 10.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

shituo123456 · Aug 03 '23
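
Aside: the last line of that error message points at PYTORCH_CUDA_ALLOC_CONF and max_split_size_mb, PyTorch's documented knob for allocator fragmentation. Something like the line below could be tried, though as the next replies show, the real problem here is that the full model is built directly on the GPU, so this knob would not help:

    # suggested by the error text itself; not the fix that worked in this thread
    PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python cli_demo.py --from_pretrained xxx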

Do I need to merge the qlora weights first?

shituo123456 · Aug 03 '23

Oh, I see: the model is placed on cuda while it is being built... For now the only workaround is this:

In cli_demo.py, replace this code:

    # load model
    model, model_args = AutoModel.from_pretrained(
        args.from_pretrained,
        args=argparse.Namespace(
            fp16=True,
            skip_init=True,
            use_gpu_initialization=True if (torch.cuda.is_available() and args.quant is None) else False,
            device='cuda' if (torch.cuda.is_available() and args.quant is None) else 'cpu',
        ))

with this:

    # load model
    # build the module structure on CPU only, without loading any weights yet
    model, model_args = AutoModel.from_pretrained(
        args.from_pretrained,
        args=argparse.Namespace(
            fp16=True,
            skip_init=True,
            use_gpu_initialization=False,
            device='cpu',
        ), build_only=True)
    # move the built model to the GPU, then load the qlora checkpoint there
    model = model.to('cuda')
    from sat.training.model_io import load_checkpoint
    load_checkpoint(model, model_args, load_path=args.from_pretrained)

1049451037 · Aug 03 '23
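
Presumably this works because build_only=True constructs the module tree without loading weights, and device='cpu' keeps that construction off the GPU. For a qlora checkpoint, sat swaps the ChatGLM linear layers to 4-bit during construction (visible in the "replacing chatglm linear layer with 4bit" log line later in this thread), so what model.to('cuda') moves is the quantized model rather than a full fp16 copy, and load_checkpoint then fills in the weights on the GPU. With the patch applied, the launch command is unchanged (checkpoint path is a placeholder):

    python cli_demo.py --from_pretrained checkpoints/your-qlora-checkpoint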

Bro, it still OOMs right away.

shituo123456 · Aug 03 '23

That shouldn't happen. Are you sure you finetuned on this very card... If so, it obviously has to be loadable.

1049451037 · Aug 03 '23

That shouldn't happen. Are you sure you finetuned on this very card... If so, it obviously has to be loadable.

It is the same card... It OOMs right at this point: [screenshot of the OOM location]

shituo123456 · Aug 03 '23

If you had replaced this part with my code, it would not touch the GPU at all, because device='cpu'. So you must not have changed the code.

1049451037 · Aug 03 '23

(Re-posting the same cli_demo.py patch as above.)

Please do exactly as I described.

1049451037 · Aug 03 '23

I had indeed changed it before; this time it works. Thanks a lot!

shituo123456 · Aug 03 '23

I just tested it and found the finetuning results are quite poor.

shituo123456 · Aug 03 '23

Anything related to model training you will have to explore on your own.

1049451037 · Aug 03 '23

OK, I'll keep experimenting.

shituo123456 · Aug 03 '23

parser.add_argument("--prompt_zh", type=str, default="描述这张图片", help='Chinese prompt for the first round') — When you run cli_demo.py, what prompt do you use here? I used several different prompts during training; why does a single prompt need to be fixed in advance here?

shituo123456 · Aug 03 '23

You can modify cli_demo.py as needed.

1049451037 · Aug 03 '23
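
For reference, the default only seeds the first round of the conversation and can be overridden per run without editing the file; a later comment in this thread does exactly that (the checkpoint path is a placeholder):

    python cli_demo.py --from_pretrained checkpoints/your-qlora-checkpoint \
        --prompt_zh "这张图片的背景里有什么内容?"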

Thanks for the patient replies.

shituo123456 · Aug 03 '23

You can modify cli_demo.py as needed.

(Quoting the cli_demo.py patch posted above.)

Bro, after making that change I got a new error. What is causing this one?

    [2023-11-12 19:24:25,752] [INFO] [RANK 0] global rank 0 is loading checkpoint /home/llf/VisualGLM-6B/checkpoints/finetune-visualglm-6b-11-12-19-08/100/mp_rank_00_model_states.pt
    Traceback (most recent call last):
      File "/home/llf/VisualGLM-6B/cli_demo.py", line 115, in <module>
        main()
      File "/home/llf/VisualGLM-6B/cli_demo.py", line 48, in main
        load_checkpoint(model, model_args, load_path=args.from_pretrained)
      File "/home/llf/anaconda3/envs/visualglm/lib/python3.10/site-packages/sat/training/model_io.py", line 238, in load_checkpoint
        missing_keys, unexpected_keys = module.load_state_dict(sd['module'], strict=False)
      File "/home/llf/anaconda3/envs/visualglm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2027, in load_state_dict
        load(self, state_dict)
      File "/home/llf/anaconda3/envs/visualglm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2015, in load
        load(child, child_state_dict, child_prefix)
      File "/home/llf/anaconda3/envs/visualglm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2015, in load
        load(child, child_state_dict, child_prefix)
      File "/home/llf/anaconda3/envs/visualglm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2015, in load
        load(child, child_state_dict, child_prefix)
      [Previous line repeated 3 more times]
      File "/home/llf/anaconda3/envs/visualglm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2009, in load
        module._load_from_state_dict(
      File "/home/llf/anaconda3/envs/visualglm/lib/python3.10/site-packages/sat/model/finetune/lora2.py", line 49, in _load_from_state_dict
        copy_nested_list(state_dict[prefix+'quant_state'], self.weight.quant_state)
      File "/home/llf/anaconda3/envs/visualglm/lib/python3.10/site-packages/sat/model/finetune/lora2.py", line 37, in copy_nested_list
        for i in range(len(dst)):
    TypeError: object of type 'QuantState' has no len()

PeekLee · Nov 12 '23

(Quoting PeekLee's comment and traceback above.)

Has this been solved? I ran the official finetuning with the QLoRA method and hit the same error when loading the model.

KinokoY · Nov 28 '23

(Quoting the same comment and traceback again.)

Has this been solved? I ran the official finetuning with the QLoRA method and hit the same error when loading the model.

wwlaoxi · Dec 05 '23

I hit the same problem. I finetuned the official project with the QLoRA method; running on a 4090 it errors out:

    (py310_chat) yl@4-gpu:~/llm_ll/VisualGLM-6B$ CUDA_VISIBLE_DEVICES=1 python cli_demo.py --from_pretrained checkpoints/finetune-visualglm-6b-12-11-16-53/ --prompt_zh 这张图片的背景里有什么内容?
    [2023-12-12 12:34:49,199] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2023-12-12 12:34:52,703] [INFO] building FineTuneVisualGLMModel model ...
    [2023-12-12 12:34:52,705] [INFO] [RANK 0] > initializing model parallel with size 1
    [2023-12-12 12:34:52,706] [INFO] [RANK 0] You didn't pass in LOCAL_WORLD_SIZE environment variable. We use the guessed LOCAL_WORLD_SIZE=1. If this is wrong, please pass the LOCAL_WORLD_SIZE manually.
    [2023-12-12 12:34:52,707] [INFO] [RANK 0] You are using model-only mode. For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
    /data1/yl/anaconda3/envs/py310_chat/lib/python3.10/site-packages/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
      warnings.warn("Initializing zero-element tensors is a no-op")
    [2023-12-12 12:35:30,798] [INFO] [RANK 0] replacing layer 0 attention with lora
    [2023-12-12 12:35:32,391] [INFO] [RANK 0] replacing layer 14 attention with lora
    [2023-12-12 12:35:34,383] [INFO] [RANK 0] replacing chatglm linear layer with 4bit
    [2023-12-12 12:38:06,999] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7802848768
    [2023-12-12 12:38:20,010] [INFO] [RANK 0] global rank 0 is loading checkpoint checkpoints/finetune-visualglm-6b-12-11-16-53/300/mp_rank_00_model_states.pt
    Traceback (most recent call last):
      File "/data1/yl/llm_ll/VisualGLM-6B/cli_demo.py", line 116, in <module>
        main()
      File "/data1/yl/llm_ll/VisualGLM-6B/cli_demo.py", line 49, in main
        load_checkpoint(model, model_args, load_path=args.from_pretrained)
      File "/data1/yl/anaconda3/envs/py310_chat/lib/python3.10/site-packages/sat/training/model_io.py", line 242, in load_checkpoint
        missing_keys, unexpected_keys = module.load_state_dict(sd['module'], strict=False)
      File "/data1/yl/anaconda3/envs/py310_chat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2138, in load_state_dict
        load(self, state_dict)
      File "/data1/yl/anaconda3/envs/py310_chat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2126, in load
        load(child, child_state_dict, child_prefix)
      File "/data1/yl/anaconda3/envs/py310_chat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2126, in load
        load(child, child_state_dict, child_prefix)
      File "/data1/yl/anaconda3/envs/py310_chat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2126, in load
        load(child, child_state_dict, child_prefix)
      [Previous line repeated 3 more times]
      File "/data1/yl/anaconda3/envs/py310_chat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2120, in load
        module._load_from_state_dict(
      File "/data1/yl/anaconda3/envs/py310_chat/lib/python3.10/site-packages/sat/model/finetune/lora2.py", line 49, in _load_from_state_dict
        copy_nested_list(state_dict[prefix+'quant_state'], self.weight.quant_state)
      File "/data1/yl/anaconda3/envs/py310_chat/lib/python3.10/site-packages/sat/model/finetune/lora2.py", line 37, in copy_nested_list
        for i in range(len(dst)):
    TypeError: object of type 'QuantState' has no len()

@1049451037

Caro-zll · Dec 12 '23
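
A closing note on the unresolved TypeError: the traceback shows sat's copy_nested_list (sat/model/finetune/lora2.py) calling len() on self.weight.quant_state, which in newer bitsandbytes releases is a QuantState object rather than the nested list the function expects. That suggests a version mismatch between bitsandbytes and SwissArmyTransformer rather than a broken checkpoint, so aligning the two package versions is the obvious thing to try. A minimal probe, under the assumption that the presence of the QuantState class signals the newer API:

    # Hypothetical compatibility probe, not code from this repo.
    import bitsandbytes
    from bitsandbytes import functional as bnbf

    print("bitsandbytes:", bitsandbytes.__version__)
    # Newer releases wrap 4-bit quantization metadata in a QuantState object;
    # older releases used plain nested lists, which is what sat's
    # lora2.copy_nested_list expects to iterate over.
    print("has QuantState class:", hasattr(bnbf, "QuantState"))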