fused_embedding_eltwise_layernorm_op and skip_layernorm_op support fp16
PR types
Others
PR changes
Others
Describe
Effect of this PR: adds fp16 support to the fused_embedding_eltwise_layernorm_op and skip_layernorm_op operators.
Test models:
import paddle
from paddle.static import InputSpec
import numpy as np
import os
import paddle.inference as inference

class EmbEwLnNet(paddle.nn.Layer):
    def __init__(self):
        super(EmbEwLnNet, self).__init__()
        self.embedding_layer1 = paddle.nn.Embedding(1024, 256, sparse=True)
        self.embedding_layer2 = paddle.nn.Embedding(1024, 256, sparse=True)
        self.embedding_layer3 = paddle.nn.Embedding(1024, 256, sparse=True)
        self.layer_norm = paddle.nn.LayerNorm(256)

    def forward(self, x1, x2, x3):
        x = self.embedding_layer1(x1) + self.embedding_layer2(x2)
        x = x + self.embedding_layer3(x3)
        x = self.layer_norm(x)
        return x

class SkipLnNet(paddle.nn.Layer):
    def __init__(self):
        super(SkipLnNet, self).__init__()
        self.layer_norm = paddle.nn.LayerNorm(256)

    def forward(self, x1, x2):
        x = x1 + x2
        x = self.layer_norm(x)
        return x
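For reference, the math that the two test models express can be sketched in plain Python on a single feature vector. This is only an illustration of the computation, not the CUDA kernel touched by this PR. The helper names (`layer_norm`, `skip_layernorm_ref`, `emb_eltwise_layernorm_ref`) are hypothetical, and the sketch assumes the LayerNorm scale and shift are left at 1 and 0 with an epsilon of 1e-5.

```python
import math

def layer_norm(row, eps=1e-5):
    """Normalize one feature vector (scale=1, shift=0 assumed)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)  # biased variance
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in row]

def skip_layernorm_ref(x1, x2):
    """Mirror of SkipLnNet.forward: elementwise add, then layer norm."""
    return layer_norm([a + b for a, b in zip(x1, x2)])

def emb_eltwise_layernorm_ref(ids, tables):
    """Mirror of EmbEwLnNet.forward for one token.

    ids[k] selects a row from tables[k]; the selected rows are summed
    elementwise and the sum is layer-normalized.
    """
    rows = [table[i] for i, table in zip(ids, tables)]
    summed = [sum(vals) for vals in zip(*rows)]
    return layer_norm(summed)

# Tiny usage example with made-up numbers.
skip_out = skip_layernorm_ref([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5])
```

A reference like this can be used to sanity-check fp16 outputs against an fp32 baseline within a tolerance.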
Test environment: GPU Compute Capability 7.5, Driver API Version 11.4, Runtime API Version 11.2, cuDNN Version 8.1, with 100 warmup runs and 10000 timed repeats.
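The PR does not include its timing code. A warmup-then-repeat measurement of the kind described could look roughly like the sketch below. The `benchmark` function and its `run_once` parameter are hypothetical names. A real GPU workload would also need a device synchronization before each clock read, which is omitted here.

```python
import time

def benchmark(run_once, warmup=100, repeats=10000):
    """Return the mean latency of run_once in milliseconds."""
    for _ in range(warmup):
        run_once()  # warmup iterations are executed but not timed
    start = time.perf_counter()
    for _ in range(repeats):
        run_once()
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0 / repeats

# Usage with a stand-in workload; a predictor's run call would go here.
mean_ms = benchmark(lambda: sum(range(100)), warmup=10, repeats=100)
```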
fused_embedding_eltwise_layernorm_op operator
Native GPU
| model | gpu |
|---|---|
| float | 0.336549 ms |
| float16 | 0.165548 ms (native path has no cast) |
TensorRT acceleration
| model | trt32 | trt16 |
|---|---|---|
| float | 0.336706 ms | 0.346024 ms |
| float16 | 0.338106 ms | 0.284455 ms (TRT adds a cast) |
skip_layernorm_op operator
Native GPU
| model | gpu |
|---|---|
| float | 0.288717 ms |
| float16 | 0.161595 ms (native path has no cast) |
TensorRT acceleration
| model | trt32 | trt16 |
|---|---|---|
| float | 0.293951 ms | - |
| float16 | - | 0.270939 ms (TRT adds a cast) |
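To make the tables easier to compare, the fp32-to-fp16 speedups can be worked out directly from the reported latencies. The snippet below is a small arithmetic sketch using only the numbers in the tables above. The dictionary layout and the `speedup` helper are illustrative choices, and the TRT comparison pairs the float model on trt32 with the float16 model on trt16.

```python
# Latencies in milliseconds, copied from the tables above.
latency_ms = {
    "fused_embedding_eltwise_layernorm_op": {
        "gpu_fp32": 0.336549, "gpu_fp16": 0.165548,
        "trt_fp32": 0.336706, "trt_fp16": 0.284455,
    },
    "skip_layernorm_op": {
        "gpu_fp32": 0.288717, "gpu_fp16": 0.161595,
        "trt_fp32": 0.293951, "trt_fp16": 0.270939,
    },
}

def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms

speedups = {
    op: {
        "gpu": speedup(t["gpu_fp32"], t["gpu_fp16"]),
        "trt": speedup(t["trt_fp32"], t["trt_fp16"]),
    }
    for op, t in latency_ms.items()
}

for op, s in speedups.items():
    print(f"{op}: native GPU {s['gpu']:.2f}x, TRT {s['trt']:.2f}x")
```

By this arithmetic, fp16 is about 2.0x and 1.8x faster than fp32 on the native GPU path for the two operators, and about 1.18x and 1.08x faster under TRT, where the added cast accounts for at least part of the smaller gain.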
Your PR has been submitted. Thanks for your contribution to the open-source project! Please wait for the CI results first. See the Paddle CI Manual for details.