Paddle
GemmEpilogueOp with a series of CUTLASS kernels
PR Category
Others
PR Types
Others
Description
P-card-71501
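The goal is to fuse patterns of the form matmul + add + act. GemmEpilogueOp is written with CUTLASS and generates multiple kernel configurations in order to find a better fused implementation.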
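matmul_add_act_fuse_pass supports two paths: cublasLt (FcOp) and cutlass (GemmEpilogueOp). Users choose whether to enable the CUTLASS-implemented op (GemmEpilogueOp) by modifying analysis_config through the Exp_EnableUseCutlass() API. During create_predictor, analysis_config is read, the use_cutlass attribute is set on matmul_add_act_fuse_pass, and the pass is added to the passManager. When the passManager is run, the InitializePatterns method of the matmul_add_act_fuse_pass object is called; based on the use_cutlass value it reads, the pass generates either the pattern for GemmEpilogueOp or the pattern for FcOp, which is how one of the two paths is selected.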
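The control flow described above can be sketched in plain Python. This is only an illustration of the attribute hand-off, not Paddle code: every class and function below (AnalysisConfigSketch, MatmulAddActFusePassSketch, PassManagerSketch, create_predictor_sketch) is a hypothetical stand-in, and the assumption that the FcOp path is taken when the flag is unset is mine.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisConfigSketch:
    """Hypothetical stand-in for analysis_config."""
    use_cutlass: bool = False  # assumed default: FcOp (cublasLt) path

    def exp_enable_use_cutlass(self):
        # Mirrors the role of Exp_EnableUseCutlass(): flip the flag in the config.
        self.use_cutlass = True


@dataclass
class MatmulAddActFusePassSketch:
    """Hypothetical stand-in for matmul_add_act_fuse_pass."""
    attrs: dict = field(default_factory=dict)
    patterns: list = field(default_factory=list)

    def initialize_patterns(self):
        # The pass reads its use_cutlass attribute and picks one target op.
        target = "GemmEpilogueOp" if self.attrs.get("use_cutlass", False) else "FcOp"
        self.patterns = ["matmul + add + act -> " + target]


class PassManagerSketch:
    """Hypothetical pass manager: running it initializes each pass's patterns."""

    def __init__(self):
        self.passes = []

    def add_pass(self, p):
        self.passes.append(p)

    def run(self):
        for p in self.passes:
            p.initialize_patterns()


def create_predictor_sketch(config):
    # Read the config, set the attribute on the pass, register the pass, run it.
    fuse_pass = MatmulAddActFusePassSketch()
    fuse_pass.attrs["use_cutlass"] = config.use_cutlass
    manager = PassManagerSketch()
    manager.add_pass(fuse_pass)
    manager.run()
    return fuse_pass.patterns
```

Calling exp_enable_use_cutlass() on the config before create_predictor_sketch switches the generated pattern from the FcOp target to the GemmEpilogueOp target.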
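For the elementwise add, the new op (GemmEpilogueOp) supports two bias shapes: [1, N] and [M, N] (where [M, N] is the output shape of the matmul). The new op supports both paddle.add(paddle.matmul(x, w), y) and paddle.add(y, paddle.matmul(x, w)) (the add operands swapped). The new op supports Relu and Gelu activations.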
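As a reference for what the fused op is expected to compute, here is a minimal pure-Python sketch of act(matmul(x, w) + bias) with both bias shapes. The function name gemm_epilogue_ref is hypothetical, and the erf-based Gelu is an assumption; the actual kernel may use a different formulation or precision.

```python
import math


def relu(v):
    return max(0.0, v)


def gelu(v):
    # erf-based GELU (assumed; the CUTLASS kernel may use another formulation).
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


ACTIVATIONS = {"relu": relu, "gelu": gelu}


def gemm_epilogue_ref(x, w, bias, act):
    """Reference for act(x @ w + bias) on nested lists.

    x is M x K, w is K x N, bias is 1 x N (broadcast over rows) or M x N.
    """
    m, k, n = len(x), len(w), len(w[0])
    if any(len(row) != k for row in x):
        raise ValueError("x must have shape [M, K] with K == len(w)")
    if len(bias) not in (1, m) or any(len(row) != n for row in bias):
        raise ValueError("bias must have shape [1, N] or [M, N]")
    f = ACTIVATIONS[act]
    out = []
    for i in range(m):
        # A [1, N] bias reuses its single row for every output row.
        bias_row = bias[0] if len(bias) == 1 else bias[i]
        out.append([
            f(sum(x[i][p] * w[p][j] for p in range(k)) + bias_row[j])
            for j in range(n)
        ])
    return out
```

Because addition is commutative, the same reference covers both paddle.add(paddle.matmul(x, w), y) and paddle.add(y, paddle.matmul(x, w)).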
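The new op shares the FCInferMeta function with the existing FcOp. I relaxed the constraints of that function so that it matches the additional patterns. In other words, patterns that FcOp cannot handle are currently filtered only by the constraints in the pass; the corresponding checks in FCInferMeta have been removed.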
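Performance: compared with the unfused ops, GemmEpilogueOp was measured end to end on large models with batch size 2. It gives roughly a 2.0% speedup on llama and roughly an 8.5% speedup on chatglm2.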
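TODO: the pass currently fuses only the Relu and Gelu activations. Three more activations are already implemented in the kernel (currently commented out) but are not yet supported by the pass. At the kernel level, they can be enabled by uncommenting them.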
Your PR has been submitted. Thanks for your contribution! Please wait for the results of the automated CI tests. See the Paddle CI Manual for details.
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.
Sorry to inform you that f959b87's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
Sorry to inform you that 519a02b's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
Some occurrences of "fc" in the PR description also need to be updated, and please improve the PR title~
Sorry to inform you that 008e268's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.