PaddleNLP
[Bug]: After exporting a quantized llama model to a static model, inference in static mode fails
Software Environment
paddle-bfloat 0.1.7
paddle2onnx 1.1.0
paddlefsl 1.1.0
paddlenlp 2.7.0.post0
paddlenlp-ops 0.0.0
paddlepaddle-gpu 2.6.0.post112
Duplicate Check
- [X] I have searched the existing issues
Error Description
Inference with the exported static large model fails with the following error:
Traceback (most recent call last):
File "/projects/pdllm/PaddleNLP-develop/llm/predictor.py", line 1622, in <module>
predict()
File "/projects/pdllm/PaddleNLP-develop/llm/predictor.py", line 1550, in predict
outputs = predictor.predict(batch_source_text)
File "/projects/pdllm/PaddleNLP-develop/llm/predictor.py", line 259, in predict
predictions = self._infer(tokenized_source)
File "/anaconda3/envs/padllm/lib/python3.10/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/anaconda3/envs/padllm/lib/python3.10/site-packages/paddle/base/dygraph/base.py", line 352, in _decorate_function
return func(*args, **kwargs)
File "/projects/pdllm/PaddleNLP-develop/llm/predictor.py", line 690, in _infer
self.predictor.run()
RuntimeError: (NotFound) Operator (matmul) does not have kernel for {data_type[int8_t]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]}.
[Hint: Expected kernel_iter != kernels.end(), but received kernel_iter == kernels.end().] (at ../paddle/fluid/framework/operator.cc:2380)
[operator < matmul > error]
Steps to Reproduce & Code
python export_model.py --model_name_or_path ./llama/Llama-2-7b-chat_ptq_ckpts/ --inference_model --output_path ./llama/Llama-2-7b-chat_a8w8_inf/ --dtype float16
export FLAGS_use_autotune=1
export FLAGS_cublaslt_exhaustive_search_times=10
export FLAGS_cache_inference_while_scope=1
python predictor.py --model_name_or_path ./llama/Llama-2-7b-chat_a8w8_inf/ --inference_model --quant_type weight_only_int8 --dtype "float16" --mode "static"
Related environment variables:
declare -x FLAGS_allocator_strategy="naive_best_fit"
declare -x FLAGS_cache_inference_while_scope="1"
declare -x FLAGS_control_flow_use_new_executor="1"
declare -x FLAGS_cublaslt_exhaustive_search_times="10"
declare -x FLAGS_fraction_of_gpu_memory_to_use="0.92"
declare -x FLAGS_new_executor_serial_run="1"
declare -x FLAGS_use_autotune="1"
Are you running native GPU inference or TRT inference?
No int8 kernel was found for the matmul operator. If you are using Paddle-TRT inference, call config.exp_disable_tensorrt_ops(["name"]), where name is the name of that op's output.
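A sketch of the suggestion above, assuming Paddle-TRT inference. The model file paths, TensorRT settings, and the op output name are all placeholders to be replaced with your own values; only `exp_disable_tensorrt_ops` is the call the suggestion refers to.

```python
import paddle.inference as paddle_infer

# Placeholder paths: point these at your exported static model files.
config = paddle_infer.Config("model.pdmodel", "model.pdiparams")
config.enable_use_gpu(4096, 0)

# Illustrative TRT settings; tune workspace/batch/precision for your model.
config.enable_tensorrt_engine(
    workspace_size=1 << 30,
    max_batch_size=1,
    min_subgraph_size=3,
    precision_mode=paddle_infer.PrecisionType.Half,
)

# Keep the offending op on the native GPU path instead of TensorRT.
# The string is the *output variable name* of the failing matmul (placeholder here).
config.exp_disable_tensorrt_ops(["matmul_output_name_here"])

predictor = paddle_infer.create_predictor(config)
```

The op's output name can be read from the exported program (e.g. by inspecting the .pdmodel with visualization tools) rather than guessed.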
@lizexu123 How can I confirm whether native GPU or TRT inference is being used?
Check whether your run log contains a line like "detected a subgraph with *** nodes".
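That check can be made mechanical with a small helper (a minimal sketch; the function name is hypothetical, and the marker string is the phrase quoted above). Note that the fuse-pass lines "detected N subgraphs" in the log below are a different message and do not indicate TRT:

```python
def trt_engaged(log_text: str) -> bool:
    """Return True if a Paddle inference log shows a TensorRT subgraph.

    Paddle-TRT reports lines like 'detected a subgraph with N nodes' when it
    carves subgraphs out for TensorRT; a native GPU run never prints this.
    The IR fuse passes print 'detected N subgraphs', which is unrelated.
    """
    return "detected a subgraph with" in log_text

# Example: feed it the captured stdout/stderr of predictor.py.
log = "--- Running IR pass [fc_fuse_pass]\n--- detected 2 subgraphs"
print(trt_engaged(log))  # False: only fuse-pass output, no TRT subgraph line
```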
@lizexu123 It does not appear:

--- Running analysis [ir_graph_build_pass]
I0207 15:37:05.177603 2824 executor.cc:187] Old Executor is Running.
--- Running analysis [ir_analysis_pass]
--- Running IR pass [map_op_to_another_pass]
--- Running IR pass [identity_op_clean_pass]
I0207 15:37:09.198072 2824 fuse_pass_base.cc:59] --- detected 2 subgraphs
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [silu_fuse_pass]
--- Running IR pass [delete_quant_dequant_linear_op_pass]
--- Running IR pass [delete_weight_dequant_linear_op_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
--- Running IR pass [fused_conv2d_add_act_layout_transfer_pass]
--- Running IR pass [multihead_matmul_fuse_pass_v2]
--- Running IR pass [fused_multi_transformer_encoder_pass]
--- Running IR pass [fused_multi_transformer_decoder_pass]
--- Running IR pass [fused_multi_transformer_encoder_fuse_qkv_pass]
--- Running IR pass [fused_multi_transformer_decoder_fuse_qkv_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_encoder_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_encoder_fuse_qkv_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_decoder_fuse_qkv_pass]
--- Running IR pass [fuse_multi_transformer_layer_pass]
--- Running IR pass [gpu_cpu_map_matmul_v2_to_mul_pass]
I0207 15:37:10.197242 2824 fuse_pass_base.cc:59] --- detected 1 subgraphs
--- Running IR pass [gpu_cpu_map_matmul_v2_to_matmul_pass]
I0207 15:37:10.239908 2824 fuse_pass_base.cc:59] --- detected 128 subgraphs
--- Running IR pass [gpu_cpu_map_matmul_to_mul_pass]
--- Running IR pass [fc_fuse_pass]
--- Running IR pass [embedding_eltwise_layernorm_fuse_pass]
--- Running IR pass [inplace_op_var_pass]
--- Running analysis [save_optimized_model_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
I0207 15:37:10.284910 2824 ir_params_sync_among_devices_pass.cc:53] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
I0207 15:37:13.814369 2824 analysis_predictor.cc:1838] ======= optimize end =======
Did you install the CUDA build of Paddle? Looking at matmul_kernel.cu, int8 is only supported inside the ifdef PADDLE_WITH_CUDA block.
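One way to confirm which build is installed (a minimal sketch; `paddle_flavor` is a hypothetical helper that classifies pip distribution names, while the definitive runtime check is `paddle.device.is_compiled_with_cuda()`):

```python
def paddle_flavor(installed_dists: list[str]) -> str:
    """Classify the Paddle build from installed pip distribution names.

    The CUDA build ships as 'paddlepaddle-gpu' (e.g. 2.6.0.post112 is
    built against CUDA 11.2); the CPU-only build ships as 'paddlepaddle'.
    """
    names = {d.lower() for d in installed_dists}
    if "paddlepaddle-gpu" in names:
        return "gpu"
    if "paddlepaddle" in names:
        return "cpu"
    return "not installed"

# The environment above lists paddlepaddle-gpu 2.6.0.post112, i.e. a CUDA build.
print(paddle_flavor(["paddlenlp", "paddlepaddle-gpu"]))  # gpu
```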
Here are all my versions; I am not sure whether they match what you described. One question: with the same quantized weights, dynamic-graph inference works but static-graph inference fails. Is this related to Paddle's implementation? Do the two inference paths ultimately call different Paddle compute functions?
paddle-bfloat 0.1.7
paddle2onnx 1.1.0
paddlefsl 1.1.0
paddlenlp 2.7.0.post0
paddlenlp-ops 0.0.0
paddlepaddle-gpu 2.6.0.post112
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.