
[Bug]: After exporting a quantized llama model to a static graph, static-mode inference fails

Open 1826133674 opened this issue 1 year ago • 9 comments

Software environment

paddle-bfloat            0.1.7
paddle2onnx              1.1.0
paddlefsl                1.1.0
paddlenlp                2.7.0.post0
paddlenlp-ops            0.0.0
paddlepaddle-gpu         2.6.0.post112

Duplicate check

  • [X] I have searched the existing issues

Error description

Running inference with the exported static-graph LLM fails with the following error:

Traceback (most recent call last):
  File "/projects/pdllm/PaddleNLP-develop/llm/predictor.py", line 1622, in <module>
    predict()
  File "/projects/pdllm/PaddleNLP-develop/llm/predictor.py", line 1550, in predict
    outputs = predictor.predict(batch_source_text)
  File "/projects/pdllm/PaddleNLP-develop/llm/predictor.py", line 259, in predict
    predictions = self._infer(tokenized_source)
  File "/anaconda3/envs/padllm/lib/python3.10/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/anaconda3/envs/padllm/lib/python3.10/site-packages/paddle/base/dygraph/base.py", line 352, in _decorate_function
    return func(*args, **kwargs)
  File "/projects/pdllm/PaddleNLP-develop/llm/predictor.py", line 690, in _infer
    self.predictor.run()
RuntimeError: (NotFound) Operator (matmul) does not have kernel for {data_type[int8_t]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]}.
  [Hint: Expected kernel_iter != kernels.end(), but received kernel_iter == kernels.end().] (at ../paddle/fluid/framework/operator.cc:2380)
  [operator < matmul > error]

Steps to reproduce & code

python export_model.py --model_name_or_path ./llama/Llama-2-7b-chat_ptq_ckpts/ --inference_model --output_path ./llama/Llama-2-7b-chat_a8w8_inf/ --dtype float16

export FLAGS_use_autotune=1
export FLAGS_cublaslt_exhaustive_search_times=10
export FLAGS_cache_inference_while_scope=1

python predictor.py --model_name_or_path ./llama/Llama-2-7b-chat_a8w8_inf/ --inference_model --quant_type weight_only_int8 --dtype "float16" --mode "static"

1826133674 avatar Feb 04 '24 06:02 1826133674

Relevant environment variables:

declare -x FLAGS_allocator_strategy="naive_best_fit"
declare -x FLAGS_cache_inference_while_scope="1"
declare -x FLAGS_control_flow_use_new_executor="1"
declare -x FLAGS_cublaslt_exhaustive_search_times="10"
declare -x FLAGS_fraction_of_gpu_memory_to_use="0.92"
declare -x FLAGS_new_executor_serial_run="1"
declare -x FLAGS_use_autotune="1"

1826133674 avatar Feb 04 '24 06:02 1826133674

Are you running native GPU inference or TensorRT inference?

lizexu123 avatar Feb 06 '24 09:02 lizexu123

No int8 kernel was found for the matmul operator. If you are running Paddle-TRT inference, call config.exp_disable_tensorrt_ops(["name"]), where "name" is the output name of that op.

lizexu123 avatar Feb 06 '24 11:02 lizexu123

@lizexu123 How can I confirm whether native GPU or TRT inference is being used?

1826133674 avatar Feb 07 '24 01:02 1826133674

Check whether a line like "detected a subgraph with ***nodes" appears in the run output.

lizexu123 avatar Feb 07 '24 02:02 lizexu123

@lizexu123 It does not appear:

--- Running analysis [ir_graph_build_pass]
I0207 15:37:05.177603 2824 executor.cc:187] Old Executor is Running.
--- Running analysis [ir_analysis_pass]
--- Running IR pass [map_op_to_another_pass]
--- Running IR pass [identity_op_clean_pass]
I0207 15:37:09.198072 2824 fuse_pass_base.cc:59] --- detected 2 subgraphs
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [silu_fuse_pass]
--- Running IR pass [delete_quant_dequant_linear_op_pass]
--- Running IR pass [delete_weight_dequant_linear_op_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
--- Running IR pass [fused_conv2d_add_act_layout_transfer_pass]
--- Running IR pass [multihead_matmul_fuse_pass_v2]
--- Running IR pass [fused_multi_transformer_encoder_pass]
--- Running IR pass [fused_multi_transformer_decoder_pass]
--- Running IR pass [fused_multi_transformer_encoder_fuse_qkv_pass]
--- Running IR pass [fused_multi_transformer_decoder_fuse_qkv_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_encoder_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_encoder_fuse_qkv_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_decoder_fuse_qkv_pass]
--- Running IR pass [fuse_multi_transformer_layer_pass]
--- Running IR pass [gpu_cpu_map_matmul_v2_to_mul_pass]
I0207 15:37:10.197242 2824 fuse_pass_base.cc:59] --- detected 1 subgraphs
--- Running IR pass [gpu_cpu_map_matmul_v2_to_matmul_pass]
I0207 15:37:10.239908 2824 fuse_pass_base.cc:59] --- detected 128 subgraphs
--- Running IR pass [gpu_cpu_map_matmul_to_mul_pass]
--- Running IR pass [fc_fuse_pass]
--- Running IR pass [embedding_eltwise_layernorm_fuse_pass]
--- Running IR pass [inplace_op_var_pass]
--- Running analysis [save_optimized_model_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
I0207 15:37:10.284910 2824 ir_params_sync_among_devices_pass.cc:53] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
I0207 15:37:13.814369 2824 analysis_predictor.cc:1838] ======= optimize end =======

1826133674 avatar Feb 07 '24 07:02 1826133674
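The check the maintainer describes can be automated with a small stdlib-only helper (illustrative; the name `trt_subgraph_detected` is my own, not a Paddle API). Note that the `detected N subgraphs` lines in the log above come from IR fuse passes, which are unrelated to TensorRT; only Paddle-TRT prints the full phrase "detected a subgraph with N nodes", so the regex matches that phrase exactly to avoid false positives.

```python
import re

# Paddle-TRT prints a line like
#     --- detected a subgraph with 42 nodes
# when it carves out a TensorRT engine. The IR fuse passes instead print
#     --- detected 128 subgraphs
# so we match the full "subgraph with ... nodes" phrase only.
TRT_PATTERN = re.compile(r"detected a subgraph with\s*\d+\s*nodes")


def trt_subgraph_detected(log_text: str) -> bool:
    """Return True if the predictor log shows TensorRT engines being built."""
    return bool(TRT_PATTERN.search(log_text))
```

Run it over the captured predictor output, e.g. `trt_subgraph_detected(open("predictor.log").read())`; for the log pasted above it returns False, which matches the maintainer's conclusion that native GPU inference is in use.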

Did you install the CUDA build of Paddle? Looking at matmul_kernel.cu, int8 is only supported inside ifdef PADDLE_WITH_CUDA.

lizexu123 avatar Feb 07 '24 08:02 lizexu123

All my versions are listed below; I'm not sure whether they match what you described. My question is: with the same quantized weights, dynamic-graph inference works but static-graph inference does not. Is that related to Paddle's implementation? Do the two inference modes ultimately call different Paddle compute functions?

paddle-bfloat            0.1.7
paddle2onnx              1.1.0
paddlefsl                1.1.0
paddlenlp                2.7.0.post0
paddlenlp-ops            0.0.0
paddlepaddle-gpu         2.6.0.post112

1826133674 avatar Feb 07 '24 09:02 1826133674

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] avatar Apr 27 '24 00:04 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar May 11 '24 00:05 github-actions[bot]