PaddleNLP support chatglmv2 infer with block_attn fp16 & wint8

PR types

Feature

PR changes

Models

Description

This PR adds high-performance inference for chatglmv2 and chatglmv3 in block_attn mode, covering fp16 and weight-only int8 quantization. All block_attn fp16 and weight-only int8 variants of chatglmv2 and chatglmv3 now run end to end. The reproduction commands are listed below (for chatglm3, swap in the corresponding model name).

fp16, dynamic graph:
python predict/predictor.py --model_name_or_path THUDM/chatglm2-6b --dtype float16 --output_file ./output.json --decode_strategy greedy_search --mode dynamic --inference_model --batch_size 1 --block_attn 1

fp16, dynamic-to-static export:
python predict/export_model.py --model_name_or_path THUDM/chatglm2-6b --output_path /root/.cache/paddlenlp/exported_model/THUDM/chatglm2-6b --dtype float16 --inference_model --block_attn 1 --batch_size 1

fp16, static graph:
python predict/predictor.py --model_name_or_path /root/.cache/paddlenlp/exported_model/THUDM/chatglm2-6b --dtype float16 --output_file ./output.json --mode static --inference_model --batch_size 1 --block_attn 1

weight-only int8, dynamic graph:
python predict/predictor.py --model_name_or_path THUDM/chatglm2-6b --dtype float16 --output_file ./output.json --decode_strategy greedy_search --mode dynamic --inference_model --batch_size 1 --block_attn 1 --quant_type weight_only_int8

weight-only int8, dynamic-to-static export:
python predict/export_model.py --model_name_or_path THUDM/chatglm2-6b --output_path /root/.cache/paddlenlp/exported_model/THUDM/chatglm2-6b-wint8 --dtype float16 --inference_model --block_attn 1 --batch_size 1 --quant_type weight_only_int8

weight-only int8, static graph:
python predict/predictor.py --model_name_or_path /root/.cache/paddlenlp/exported_model/THUDM/chatglm2-6b-wint8 --dtype float16 --output_file ./output.json --mode static --inference_model --batch_size 1 --block_attn 1 --quant_type weight_only_int8
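The six reproduction commands differ only in the model name, the mode, and the optional quantization flag, so they can be generated from one small helper. The following is a minimal sketch, not part of the PR: the function name `build_commands` is hypothetical, the flags and paths are copied from the commands in this description, and `THUDM/chatglm3-6b` is an assumed model name for the chatglm3 case. It only prints the commands and does not run them.

```python
import shlex

# Export root taken from the commands in the PR description.
EXPORT_ROOT = "/root/.cache/paddlenlp/exported_model"

# Flags shared by every command in the description.
COMMON = ["--dtype", "float16", "--inference_model", "--batch_size", "1", "--block_attn", "1"]


def build_commands(model_name, quant_type=None):
    """Return the dynamic, export, and static commands as argv lists (hypothetical helper)."""
    # The description stores the int8 export under a "-wint8" suffix.
    suffix = "-wint8" if quant_type == "weight_only_int8" else ""
    export_path = f"{EXPORT_ROOT}/{model_name}{suffix}"
    quant = ["--quant_type", quant_type] if quant_type else []

    dynamic = ["python", "predict/predictor.py", "--model_name_or_path", model_name,
               "--output_file", "./output.json", "--decode_strategy", "greedy_search",
               "--mode", "dynamic"] + COMMON + quant
    export = ["python", "predict/export_model.py", "--model_name_or_path", model_name,
              "--output_path", export_path] + COMMON + quant
    static = ["python", "predict/predictor.py", "--model_name_or_path", export_path,
              "--output_file", "./output.json", "--mode", "static"] + COMMON + quant
    return {"dynamic": dynamic, "export": export, "static": static}


if __name__ == "__main__":
    # "THUDM/chatglm3-6b" is an assumption; substitute the model name you actually use.
    for model in ("THUDM/chatglm2-6b", "THUDM/chatglm3-6b"):
        for quant in (None, "weight_only_int8"):
            for step, argv in build_commands(model, quant).items():
                print(f"# {model} {quant or 'fp16'} {step}")
                print(shlex.join(argv))
```

Building each command as an argument list and joining it with `shlex.join` keeps every flag separated by a space, which avoids the kind of slip where a flag and its value run together.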

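One known issue remains: running chatglm3 in block_attn mode shows what looks like a minor precision problem. This will be addressed in the next PR. The output is as follows: CC455630784357224B21BACAFBB83451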

Aug 06 '24 10:08 xue-yun-liang

Thanks for your contribution!

Aug 06 '24 10:08 paddle-bot[bot]

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Aug 06 '24 10:08 CLAassistant

Codecov Report

Attention: Patch coverage is 0.91743% with 108 lines in your changes missing coverage. Please review.

Project coverage is 53.89%. Comparing base (aaacb32) to head (a3952d5). Report is 667 commits behind head on develop.

Files with missing lines	Patch %	Lines
...p/experimental/transformers/chatglm_v2/modeling.py	0.00%	88 Missing :warning:
paddlenlp/utils/llm_utils.py	0.00%	19 Missing :warning:
...enlp/experimental/transformers/generation_utils.py	0.00%	1 Missing :warning:

:x: Your patch check has failed because the patch coverage (0.91%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. :x: Your project check has failed because the head coverage (53.89%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #8881      +/-   ##
===========================================
- Coverage    54.38%   53.89%   -0.49%     
===========================================
  Files          648      650       +2     
  Lines       103266   104337    +1071     
===========================================
+ Hits         56161    56236      +75     
- Misses       47105    48101     +996
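The figures in this report can be cross-checked with a little arithmetic. The sketch below assumes percentages are truncated (not rounded) to two decimals, which is consistent with 0.91743% being reported as 0.91%. The patch total of 109 lines is inferred from 108 missing lines at 0.91743% coverage and is not stated in the report, and the function name `truncated_pct` is illustrative.

```python
def truncated_pct(hits: int, total: int) -> float:
    """Coverage percentage truncated (not rounded) to two decimals."""
    # Integer arithmetic avoids floating-point surprises when truncating.
    return (hits * 10000 // total) / 100


# Patch coverage: missing lines summed over the three files in the table.
missing = 88 + 19 + 1
patch_total = 109  # inferred: 1 covered line out of 109 gives 0.91743%
patch_hits = patch_total - missing

# Project coverage from the Hits and Lines rows of the coverage diff.
base = truncated_pct(56161, 103266)
head = truncated_pct(56236, 104337)

print(f"patch:   {patch_hits / patch_total * 100:.5f}% (shown as {truncated_pct(patch_hits, patch_total)}%)")
print(f"project: {base}% -> {head}% ({head - base:+.2f})")
print(f"patch check passes (target 80%):   {truncated_pct(patch_hits, patch_total) >= 80.00}")
print(f"project check passes (target 58%): {head >= 58.00}")
```

Both checks fail for the reasons the report gives: nearly all of the new lines are untested, and the head coverage sits below the 58.00% project target.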

Aug 06 '24 12:08 codecov[bot]

This Pull Request is stale because it has been open for 60 days with no activity. 当前Pull Request 60天内无活动，被标记为stale。

Dec 14 '24 00:12 github-actions[bot]

This Pull Request is stale because it has been open for 60 days with no activity. 当前Pull Request 60天内无活动，被标记为stale。

Feb 14 '25 00:02 github-actions[bot]

This Pull Request is stale because it has been open for 60 days with no activity. 当前Pull Request 60天内无活动，被标记为stale。

Apr 17 '25 00:04 github-actions[bot]
