ncnn
[WIP] rnn/lstm/gru dynamic quantization
- [x] rnn
- [x] rnn-arm
- [x] lstm
- [x] lstm-arm
- [x] lstm-x86
- [x] gru
- [x] gru-arm
- [x] fix s8 overload
- [ ] coverage
- [ ] doc
- [ ] speed test
- [x] rnn aq
- [x] rnn-arm aq
- [x] lstm aq
- [x] lstm-arm aq
- [x] lstm-x86 aq
- [x] gru aq
- [x] gru-arm aq
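For background, dynamic quantization in this style computes the input scale at runtime from each input tensor, runs the gemm in int8, and dequantizes the int32 accumulator. A minimal numpy sketch of the idea (illustrative only, not the ncnn kernels):

```python
import numpy as np

def dynamic_quant_matmul(x, w_int8, w_scale):
    # dynamic per-tensor scale: map the current max |x| onto the int8 range
    x_scale = 127.0 / max(np.abs(x).max(), 1e-8)
    x_int8 = np.clip(np.round(x * x_scale), -127, 127).astype(np.int8)
    # accumulate in int32, then dequantize with both scales
    acc = x_int8.astype(np.int32) @ w_int8.astype(np.int32).T
    return acc.astype(np.float32) / (x_scale * w_scale)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
x = rng.standard_normal(256).astype(np.float32)

# weights are quantized once, offline
w_scale = 127.0 / np.abs(w).max()
w_int8 = np.clip(np.round(w * w_scale), -127, 127).astype(np.int8)

out = dynamic_quant_matmul(x, w_int8, w_scale)
ref = w @ x
rel_err = np.abs(out - ref).mean() / np.abs(ref).mean()
```

Only the activation scale is computed per call; the weight scale is fixed, which is what makes the weights half the size on disk.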
Awesome!
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import pnnx

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.rnn = nn.RNN(input_size=256, hidden_size=256, num_layers=30)
        self.lstm = nn.LSTM(input_size=256, hidden_size=256, num_layers=30)
        self.gru = nn.GRU(input_size=256, hidden_size=256, num_layers=30)

    def forward(self, x):
        out0, _ = self.rnn(x)
        out1, _ = self.lstm(x)
        out2, _ = self.gru(x)
        return out0, out1, out2

net = Model().half().float()
net.eval()

torch.manual_seed(0)
x = torch.rand(300, 1, 256)

pnnx.export(net, "rnn.pt", x)
```
```shell
ncnn2int8 rnn.ncnn.param rnn.ncnn.bin rnn-int8.ncnn.param rnn-int8.ncnn.bin /dev/null
```
| rnn/rnn-int8.bin | fp16 | int8 |
|---|---|---|
| model size | 60.1M | 30.6M |
| qcom855plus MAE | fp32 | fp16 | int8 |
|---|---|---|---|
| 30-layer rnn | 0 | 2.29E-08 | 7.31E-08 |
| 30-layer lstm | 0 | 4.39E-09 | 5.54E-09 |
| 30-layer gru | 0 | 6.75E-09 | 1.96E-08 |
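The MAE numbers compare each precision's output against the fp32 result. As a rough illustration of where the fp16 error comes from, one can measure the effect of fp16 weight storage alone (the `half().float()` trick from the export script), here with a hypothetical smaller 2-layer LSTM to keep it fast; this is a sketch, not the actual benchmark:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.LSTM(input_size=256, hidden_size=256, num_layers=2)
net.eval()

# fp16-stored weights, fp32 compute, as in the export script above
net16 = copy.deepcopy(net).half().float()

x = torch.rand(300, 1, 256)
with torch.no_grad():
    ref, _ = net(x)      # fp32 reference
    out16, _ = net16(x)  # same compute, fp16-rounded weights

mae = (out16 - ref).abs().mean().item()
print(f"fp16 weight-rounding MAE: {mae:.3e}")
```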
| qcom855plus single-thread time (ms) | fp32 | fp16 | int8 |
|---|---|---|---|
| 30-layer rnn | 45.16 | 24.81 | 19.87 |
| 30-layer lstm | 256.51 | 121.99 | 60.7 |
| 30-layer gru | 167.52 | 94.68 | 46.29 |
| i5-12400 single-thread, 30-layer lstm-int8 model | time (ms) |
|---|---|
| naive(sse2) | 95.24 |
| sse2 | 87.02 |
| avx | 64.85 |
| avx2 | 42.22 |
| avxvnni | 23.24 |
| avx512 | 27.95 |
| avx512vnni | 15.8 |
| imx6d single-thread time (ms) | fp32 | int8 |
|---|---|---|
| 30-layer rnn | 1392.22 | 504.83 |
| 30-layer lstm | 6063.91 | 1833.46 |
| 30-layer gru | 4357.59 | 1300.93 |