PaddleNLP
support chatglmv2 infer with block_attn fp16 & wint8
PR types
Feature
PR changes
Models
Description
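This PR adds high-performance inference support for chatglmv2 and chatglmv3 in block_attn mode, covering both fp16 and weight-only int8 quantization.
All block_attn fp16 and weight-only int8 variants of chatglmv2 and chatglmv3 now run end to end. The commands to reproduce are listed below
(for chatglm3, swap in the corresponding model name to reproduce the same way).
fp16, dynamic graph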
python predict/predictor.py --model_name_or_path THUDM/chatglm2-6b --dtype float16 --output_file ./output.json --decode_strategy greedy_search --mode dynamic --inference_model --batch_size 1 --block_attn 1
fp16, dynamic-to-static export
python predict/export_model.py --model_name_or_path THUDM/chatglm2-6b --output_path /root/.cache/paddlenlp/exported_model/THUDM/chatglm2-6b --dtype float16 --inference_model --block_attn 1 --batch_size 1
fp16, static graph
python predict/predictor.py --model_name_or_path /root/.cache/paddlenlp/exported_model/THUDM/chatglm2-6b --dtype float16 --output_file ./output.json --mode static --inference_model --batch_size 1 --block_attn 1
weight-only int8, dynamic graph
python predict/predictor.py --model_name_or_path THUDM/chatglm2-6b --dtype float16 --output_file ./output.json --decode_strategy greedy_search --mode dynamic --inference_model --batch_size 1 --block_attn 1 --quant_type weight_only_int8
weight-only int8, dynamic-to-static export
python predict/export_model.py --model_name_or_path THUDM/chatglm2-6b --output_path /root/.cache/paddlenlp/exported_model/THUDM/chatglm2-6b-wint8 --dtype float16 --inference_model --block_attn 1 --batch_size 1 --quant_type weight_only_int8
weight-only int8, static graph
python predict/predictor.py --model_name_or_path /root/.cache/paddlenlp/exported_model/THUDM/chatglm2-6b-wint8 --dtype float16 --output_file ./output.json --mode static --inference_model --batch_size 1 --block_attn 1 --quant_type weight_only_int8
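The six commands above differ only in the model path, the mode, and the optional `--quant_type` flag, so they can be generated from one template. The following is a minimal sketch, not part of this PR: the helper `build_commands` and the `EXPORT_ROOT` constant are hypothetical names, and the `-wint8` directory suffix simply mirrors the export path used above. Every flag is copied from the commands in this description; the script only prints the command lines and does not run them.

```python
import shlex

# Export root copied from the commands in this PR description.
EXPORT_ROOT = "/root/.cache/paddlenlp/exported_model"


def build_commands(model_name, quant_type=None):
    """Return the dynamic, export, and static command lines for one model.

    quant_type is None for plain fp16, or "weight_only_int8" for wint8.
    """
    # The "-wint8" suffix mirrors the export directory used in this PR.
    suffix = "-wint8" if quant_type == "weight_only_int8" else ""
    export_dir = f"{EXPORT_ROOT}/{model_name}{suffix}"
    quant = ["--quant_type", quant_type] if quant_type else []

    dynamic = [
        "python", "predict/predictor.py",
        "--model_name_or_path", model_name,
        "--dtype", "float16",
        "--output_file", "./output.json",
        "--decode_strategy", "greedy_search",
        "--mode", "dynamic",
        "--inference_model", "--batch_size", "1", "--block_attn", "1",
    ] + quant
    export = [
        "python", "predict/export_model.py",
        "--model_name_or_path", model_name,
        "--output_path", export_dir,
        "--dtype", "float16",
        "--inference_model", "--block_attn", "1", "--batch_size", "1",
    ] + quant
    static = [
        "python", "predict/predictor.py",
        "--model_name_or_path", export_dir,
        "--dtype", "float16",
        "--output_file", "./output.json",
        "--mode", "static",
        "--inference_model", "--batch_size", "1", "--block_attn", "1",
    ] + quant

    # shlex.join quotes any argument that needs it and separates with spaces.
    return {
        "dynamic": shlex.join(dynamic),
        "export": shlex.join(export),
        "static": shlex.join(static),
    }


# Print the fp16 and weight-only int8 command sets for chatglm2.
for quant_type in (None, "weight_only_int8"):
    for step, command in build_commands("THUDM/chatglm2-6b", quant_type).items():
        print(f"[{quant_type or 'fp16'} / {step}] {command}")
```

Passing a different model name to `build_commands` yields the matching command set for that model; the export step has to be run before the static step, since the static command reads from the exported directory.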
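One known issue remains: running chatglm3 in block_attn mode produces output that looks like it has a slight precision problem. This will be addressed in the next PR.
The output is as follows: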
Codecov Report
Attention: Patch coverage is 0.91743% with 108 lines in your changes missing coverage. Please review.
Project coverage is 53.89%. Comparing base (aaacb32) to head (a3952d5). Report is 667 commits behind head on develop.
:x: Your patch check has failed because the patch coverage (0.91%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. :x: Your project check has failed because the head coverage (53.89%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## develop #8881 +/- ##
===========================================
- Coverage 54.38% 53.89% -0.49%
===========================================
Files 648 650 +2
Lines 103266 104337 +1071
===========================================
+ Hits 56161 56236 +75
- Misses 47105 48101 +996
This Pull Request is stale because it has been open for 60 days with no activity.