Optimize inference performance of ERNIE INT8 on CPU
Now, Paddle ERNIE FP32 inference performance on CPU is as below: single thread: 251.464 ms, 20 threads: 29.8818 ms. Our goal is to prove that ERNIE can get a performance gain with real INT8 kernels. @Sand3r- @wojtuss Please update your benchmark progress here. @wzzju @luotao1 Please track the status here.
We are currently working with the following two models: ernie_quant.zip and fp32_model.tar.gz.
Results gathered with the optimized fp32_model model will be labeled FP32.
Results gathered with the fake-quantized FP32 ernie_quant model will be labeled QAT FP32.
Results gathered with an optimized INT8 model obtained from the ernie_quant model will be labeled QAT INT8.
Initial performance results using the former (for FP32, 4 inputs, test_ds dataset) and the latest (for QAT FP32/INT8, 2 inputs, 1.8w.bs1 dataset) unit test obtained from Baidu.
(20 threads, bs 1, CLX 6248)
FP32:
Run 5010 samples, average latency: 55.7539 ms per sample.
Run 5009 samples, average latency [exclude 1 warmup steps]: 55.7415 ms per sample.
QAT FP32:
Run 2490 samples, average latency: 92.2313 ms per sample.
QAT INT8 (with mul):
Run 2490 samples, average latency: 47.1024 ms per sample.
I have just noticed that the above results were mixed up. They have been updated now.
FP32 results with @GaoWei8's fix (https://github.com/PaddlePaddle/Paddle/pull/20972):
Run 5010 samples, average latency: 32.0582 ms per sample.
Attached are the baseline profile results:
- FP32 with Gaowei's padding patch
Event | Calls | Total | Min. | Max. | Ave. | Ratio. |
---|---|---|---|---|---|---|
fc | 370666 | 127408 | 0.01446 | 20.14810 | 0.34373 | 0.75962 |
elementwise_add | 190342 | 10271.7 | 0.03167 | 4.93884 | 0.05396 | 0.06124 |
transpose2 | 240432 | 7704.78 | 0.02098 | 3.71560 | 0.03205 | 0.04594 |
matmul | 125225 | 7266.52 | 0.02186 | 3.96825 | 0.05803 | 0.04332 |
scale | 70126 | 4823.14 | 0.00407 | 4.92560 | 0.06878 | 0.02876 |
layer_norm | 125225 | 4184.69 | 0.02592 | 4.72700 | 0.03342 | 0.02495 |
softmax | 60108 | 2565.96 | 0.03276 | 2.04443 | 0.04269 | 0.01530 |
reshape2 | 240432 | 1383.47 | 0.00393 | 2.92184 | 0.00575 | 0.00825 |
lookup_table | 15027 | 1331.17 | 0.06888 | 1.23500 | 0.08859 | 0.00794 |
stack | 5009 | 462.643 | 0.06707 | 0.30953 | 0.09236 | 0.00276 |
tanh | 5009 | 181.429 | 0.02849 | 0.12309 | 0.03622 | 0.00108 |
slice | 5009 | 68.8743 | 0.01152 | 0.02409 | 0.01375 | 0.00041 |
fetch | 5009 | 40.985 | 0.00489 | 0.01910 | 0.00818 | 0.00024 |
feed | 20036 | 33.6655 | 0.00066 | 0.01238 | 0.00168 | 0.00020 |
- Real INT8 ERNIE profile
Event | Calls | Total | Min. | Max. | Ave. | Ratio. |
---|---|---|---|---|---|---|
elementwise_add | 290080 | 48707.3 | 0.00604 | 7.29221 | 0.16791 | 0.39068 |
mul | 191660 | 41987.3 | 0.02023 | 4.70104 | 0.21907 | 0.33678 |
reshape2 | 124320 | 9530.43 | 0.02449 | 0.37624 | 0.07666 | 0.07644 |
quantize | 129500 | 5801.52 | 0.00967 | 2.96898 | 0.04480 | 0.04653 |
transpose2 | 124320 | 4886.13 | 0.01829 | 3.68880 | 0.03930 | 0.03919 |
matmul | 64750 | 3845.23 | 0.02139 | 15.42340 | 0.05939 | 0.03084 |
scale | 36260 | 2754.43 | 0.00546 | 0.14959 | 0.07596 | 0.02209 |
relu | 31080 | 2548.26 | 0.05381 | 2.56348 | 0.08199 | 0.02044 |
layer_norm | 64750 | 2127.01 | 0.02338 | 1.18454 | 0.03285 | 0.01706 |
softmax | 31080 | 1192.7 | 0.02731 | 2.38116 | 0.03838 | 0.00957 |
lookup_table | 7770 | 707.691 | 0.04808 | 0.23893 | 0.09108 | 0.00568 |
stack | 2590 | 260.936 | 0.06343 | 0.27896 | 0.10075 | 0.00209 |
slice | 7770 | 74.5761 | 0.00523 | 0.02923 | 0.00960 | 0.00060 |
tanh | 2590 | 57.4421 | 0.01791 | 0.08974 | 0.02218 | 0.00046 |
unsqueeze2 | 5180 | 45.0293 | 0.00550 | 0.02471 | 0.00869 | 0.00036 |
fill_constant | 10360 | 37.9978 | 0.00214 | 0.02271 | 0.00367 | 0.00030 |
fetch | 2590 | 23.3609 | 0.00753 | 0.01779 | 0.00902 | 0.00019 |
range | 2590 | 23.1647 | 0.00674 | 0.01581 | 0.00894 | 0.00019 |
expand | 2590 | 18.4988 | 0.00649 | 0.01439 | 0.00714 | 0.00015 |
cast | 5180 | 16.361 | 0.00223 | 0.01195 | 0.00316 | 0.00013 |
equal | 2590 | 12.2261 | 0.00411 | 0.01321 | 0.00472 | 0.00010 |
feed | 5180 | 11.6075 | 0.00127 | 0.00955 | 0.00224 | 0.00009 |
shape | 2590 | 5.52088 | 0.00176 | 0.01129 | 0.00213 | 0.00004 |
- FP32 ERNIE model and INT8 ERNIE model have slightly different numbers of inputs.
- INT8 optimized latency is currently slower than FP32 optimized latency because:
  - elementwise_add takes a lot of time after quantizing the mul ops; an fc INT8 op should be introduced (with support for 3-dimensional fc inputs) to cover the bias/residual elementwise_add (see the sketch right after this list);
  - too much time is spent on reorders; the INT8 pipeline for the feed-forward part can be optimized further;
  - INT8 GEMM execution with MKL runs without padding (a unit test benchmarking MKL-DNN INT8 GEMM with the same dimensions as the FP32 GEMM is needed).
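As a side note on the elementwise_add point above: a fused fc INT8 kernel would compute the GEMM and the bias addition in one op, instead of running a separate fp32 elementwise_add pass over the whole activation tensor after the quantized mul. A minimal numpy sketch of the equivalence such a fusion relies on (shapes are illustrative only, not taken from the model):

```python
import numpy as np

# Illustrative shapes: activations flattened to [batch * seq_len, hidden],
# 2D weights and a per-output-channel bias.
x = np.random.rand(128, 768).astype(np.float32)
w = np.random.rand(768, 768).astype(np.float32)
b = np.random.rand(768).astype(np.float32)

# Current INT8 path: the quantized mul yields x @ w, then a separate
# elementwise_add pass walks over the whole output again to add the bias.
partial = x @ w
out_two_ops = partial + b

# Planned fc INT8 kernel: GEMM and bias addition in a single op, so the
# extra pass over the output tensor disappears.
out_fused = x @ w + b
assert np.allclose(out_two_ops, out_fused)
```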
- Schedule plan of ERNIE INT8 optimization:
ERNIE INT8 Task | Status | Plan to Finish | Risk | Developer |
---|---|---|---|---|
Accuracy check in UT | WIP | 12.06 | Low | Asia |
MKL-DNN inner product performance benchmark | UT benchmark done | 12.06 | Low | Wojtuss |
3D support in FC INT8 kernel | WIP | 12.13 | High | Michal |
Fusion: FC INT8 | Not started | 12.13 | Dependency | - |
Fusion: FC+scale | WIP | 12.13 | Dependency | Danqing |
Reshape2 + Transpose2 INT8 kernel verification | WIP | 12.13 | Dependency | Asia & Wojtuss |
Overall performance & accuracy tuning | Not started | 12.20 | Dependency | - |
MKL-DNN 0.20 from Paddle (commit aef88b7c2), 20 threads:

Dimension | 128x768x768 | 128x3072x768 | 128x768x3072 |
---|---|---|---|
u8s8s32s32 | 0.0545178 | 0.170797 | 0.129593 |
s8s8s32s32 | 0.102495 | 0.326905 | 0.320389 |
f32 | 0.135169 | 0.419238 | 0.271828 |
f32 -> s8 increase | 24% | 22% | -18% |

1 thread:

Dimension | 128x768x768 | 128x3072x768 | 128x768x3072 |
---|---|---|---|
u8s8s32s32 | 0.21483 | 0.884561 | 0.904923 |
s8s8s32s32 | 0.638012 | 2.62521 | 2.57349 |
f32 | 0.848266 | 3.6114 | 3.40055 |
f32 -> s8 increase | 25% | 27% | 24% |
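For reference, the "f32 -> s8 increase" rows above are consistent with the relative latency reduction of the s8s8s32s32 run with respect to the f32 run; a small script reproducing the 20-thread percentages:

```python
# Relative latency reduction of s8s8s32s32 vs f32 per problem size
# (numbers copied from the 20-thread table above).
f32 = {"128x768x768": 0.135169, "128x3072x768": 0.419238, "128x768x3072": 0.271828}
s8  = {"128x768x768": 0.102495, "128x3072x768": 0.326905, "128x768x3072": 0.320389}

for dims, t_f32 in f32.items():
    gain = (t_f32 - s8[dims]) / t_f32
    print(f"{dims}: {gain:+.0%}")   # +24%, +22%, -18%
```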
@luotao1
I'm working on the accuracy test for ernie2 right now.
I've run the test on:
- ernie_origin_2_inputs.tar.gz (fp32 model with 2 inputs),
- data file: 1.8w.bs1,
- label file: label.xnli.dev, which looks like 2 0 1 2 0 1 0 1 2 1 0 2 ...
I got outputs like below:
--- iteration 204 --- 0.0957182 0.0770019 0.243675
--- iteration 205 --- 0.0957183 0.0770018 0.243675
--- iteration 206 --- 0.0957183 0.0770018 0.243675
--- iteration 207 --- 0.0957182 0.0770018 0.243675
--- iteration 208 --- 0.0957182 0.0770017 0.243675
My question is: am I using the correct file, or is there some procedure for calculating accuracy?
MKL-DNN master (v1.1, commit 1ee831fa6a2f802de1d399fe1de4e6cc629ad855), 20 threads:

problem descriptor | 128x768x768 | 128x3072x768 | 128x768x3072 |
---|---|---|---|
u8s8s32 | 0.0446907 | 0.162134 | 0.132759 |
s8s8s32 | 0.0455766 | 0.15978 | 0.13624 |
f32 | 0.081141 | 0.283924 | 0.467588 |
f32 -> s8 increase | 44% | 44% | 71% |

1 thread:

problem descriptor | 128x768x768 | 128x3072x768 | 128x768x3072 |
---|---|---|---|
u8s8s32 | 0.216017 | 0.925034 | 0.901553 |
s8s8s32 | 0.221654 | 0.947028 | 0.942317 |
f32 | 0.742698 | 2.96747 | 6.87472 |
f32 -> s8 increase | 70% | 68% | 86% |
Below are our latest performance results for Ernie FP32 and INT8 runs. The tests were run with affinity settings
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
on CLX 6248. INT8 tests were run with the memory_optimize_pass disabled, due to https://github.com/PaddlePaddle/Paddle/issues/21492.
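For reproducibility, below is a rough sketch of how such a CPU run can be configured through the Paddle inference API. The method names assume the Paddle ~1.6 Python bindings and are not copied from our benchmark harness, so please verify them against the revision used for the numbers above:

```python
# Hypothetical configuration sketch, not the actual benchmark code.
from paddle.fluid.core import AnalysisConfig, create_paddle_predictor

config = AnalysisConfig("fp32_model")            # directory of the model
config.disable_gpu()                             # CPU-only run
config.switch_ir_optim(True)                     # enable IR passes / fuses
config.enable_mkldnn()                           # use MKL-DNN kernels
config.set_cpu_math_library_num_threads(20)      # 1 or 20 in the runs above
# config.enable_memory_optim() is deliberately left out for the INT8 runs
# because of https://github.com/PaddlePaddle/Paddle/issues/21492.

predictor = create_paddle_predictor(config)
```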
With the current develop branch (1fd1f06) and the origin FP32 model, the latency was:
FP32, 20 threads: 29.10 ms,
FP32, 1 thread: 180.95 ms.
With additional commits from PRs
- FC INT8 3D support: https://github.com/PaddlePaddle/Paddle/pull/21746
- FC+dequantize squash: https://github.com/PaddlePaddle/Paddle/pull/21714
- MKL-DNN 1.1.1 update: https://github.com/PaddlePaddle/Paddle/pull/21754
the latency for INT8 (with FC quantized only) was:
INT8, 20 threads: 23.73 ms (18.45% faster than FP32),
INT8, 1 thread: 84.32 ms (53.40% faster than FP32).
After enabling INT8 kernels for reshape2 and transpose2 (support already present in the develop branch), the results were:
INT8, 20 threads: 20.47 ms (29.68% faster than FP32),
INT8, 1 thread: 80.64 ms (55.43% faster than FP32).
With additional optimizations possible to implement:
- avoiding the reorder nwc->ncw before layer_norm (here imitated by removing the reorder),
- fusing scale ops into fc or into dequantize (here imitated by removing the scale ops; see the sketch at the end of this comment),

the latency for INT8 (with FC quantized only) was:
INT8, 20 threads: 20.68 ms (28.93% faster than FP32),
INT8, 1 thread: 77.84 ms (56.99% faster than FP32).
With INT8 kernels also for reshape2 and transpose2, the latency was:
INT8, 20 threads: 17.39 ms (40.24% faster than FP32),
INT8, 1 thread: 75.66 ms (58.19% faster than FP32).
Other optimizations, like quantization of more operators and more fuses, are also being investigated.
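As background for the scale fuse mentioned in the list above: both dequantize and scale are simple affine maps, so folding scale into dequantize only changes the effective dequantization factor. A small sketch of the identity (the operator semantics below are assumed for illustration, not taken from the Paddle source):

```python
import numpy as np

# Assumed semantics: dequantize(x) = x / q_scale, scale(y) = a * y + b;
# q_scale, a and b are made-up values.
x_int8 = np.random.randint(-128, 128, size=(8,), dtype=np.int8)
q_scale, a, b = 127.0, 0.125, 0.0

y_two_ops = a * (x_int8.astype(np.float32) / q_scale) + b   # dequantize, then scale
y_fused = x_int8.astype(np.float32) * (a / q_scale) + b     # dequantize with folded scale
assert np.allclose(y_two_ops, y_fused)
```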
I repeat my question.
I'm checking accuracy on the trained ernie models with 2 inputs, but the outputs are still of float type.
It looks like this:
--- iteration 0 --- -2.22873 -1.84079 3.08339 ...
And the label file label.xnli.dev consists of integers: 2 0 1 2 0 1 0 1 2 1 0 2 ...
Should the output labels have a floating point type?
The output float numbers are the logits of a 3-label classification task, and the label file consists of the class_id of the true label, so you can evaluate the accuracy with the Python code below:
import sys
import numpy as np

# Each input line: three tab-separated logits followed by the true label.
total, acc = 0, 0
for line in sys.stdin:
    a, b, c, d = line.strip().split('\t')
    # argmax over the three logits gives the predicted class id
    pred = np.array([float(v) for v in (a, b, c)]).argmax()
    if pred == int(d):
        acc += 1
    total += 1
print(float(acc) / total)
run with:
paste score_3_columns label.xnli.dev | python eval.py
@Meiyim, you are right. Thank you for the explanation.
I have updated the results in https://github.com/PaddlePaddle/benchmark/issues/275#issuecomment-565858479 above. Results with additionally quantized reshape2 and transpose2 have been added.
> FP32, 1 thread: 180.95 ms.

@bingyanghuang @wojtuss Could you tell us why the MKL-DNN update to 1.1 brings such a great improvement (250 ms -> 180 ms)? We really want to know what changed in MKL-DNN 1.1.
@luotao1, today we have found that since our last benchmarks there have been some changes in the FC INT8 3D support PR's code (PaddlePaddle/Paddle#21746) which influenced the performance. After merging the PRs into the develop branch, there is a difference between our previous results and today's, both on CLX 6248 and on SKX 6148. Here are results showing the difference on SKX 6148:
- previous FP32: 1 thread: 183.92 ms, 20 threads: 31.09 ms
- current FP32: 1 thread: 214.39 ms, 20 threads: 53.69 ms
We are investigating that.
@luotao1, the PR with FC INT8 3D support has already been fixed and updated.
The latest FP32 results for the current clean develop branch (25e765a4fe) on SKX 6148:
- 4-dimensional input (fp32_model, test_ds): 1 thread: 189.39 ms, 20 threads: 30.20 ms.
- 2-dimensional input (origin, 1.8w.bs1): 1 thread: 192.31 ms, 20 threads: 33.90 ms.

After merging the PRs PaddlePaddle/Paddle#21746 and PaddlePaddle/Paddle#21754, the FP32 results are basically the same:
- 2-dimensional input (origin, 1.8w.bs1): 1 thread: 194.45 ms, 20 threads: 33.93 ms.
@GaoWei8 Please double-check the FP32 performance.
The latest FP32 results for the current clean develop branch (b9dbde12b3) on SKX 6148:
4-dimensional input (fp32_model, test_ds): 1 thread: 249.714 ms, 20 threads: 29.741 ms
@GaoWei8 Could you please paste your profile log of FP32 baseline in this issue? Let's try to align our baseline.
@bingyanghuang The latest FP32 results for the current clean develop branch (b9dbde12b3) on SKX 6148, 4-dimensional input (fp32_model, test_ds). 1 thread profile:
I0102 03:32:40.693799 12448 inference.cc:357] Run 5010 samples, average latency: 249.052 ms per sample.
I0102 03:32:40.693889 12448 inference.cc:362] Run 5009 samples, average latency [exclude 1 warmup steps]: 249.05 ms per sample.
-------------------------> Profiling Report <-------------------------
Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total Min. Max. Ave. Ratio.
thread0::fc 370740 1.10423e+06 0.016733 6.69891 2.97844 0.885518
thread0::elementwise_add 190380 43708.8 0.115389 0.955863 0.229587 0.0350516
thread0::matmul 125250 33400 0.025127 2.02142 0.266667 0.0267846
thread0::transpose2 240480 22726.5 0.069161 0.33388 0.0945046 0.0182251
thread0::layer_norm 125250 19296 0.138214 0.232502 0.15406 0.0154741
thread0::softmax 60120 14590.1 0.236914 0.48581 0.242683 0.0117003
thread0::scale 70140 4239.09 0.006753 0.116512 0.0604376 0.00339947
thread0::reshape2 240480 2274.57 0.0074 0.055957 0.00945847 0.00182406
thread0::lookup_table 15030 1274.28 0.076111 0.160232 0.0847821 0.00102188
thread0::stack 5010 646.453 0.121365 0.180305 0.129033 0.000518413
thread0::load 202 270.43 0.008871 169.244 1.33876 0.000216867
thread0::tanh 5010 146.669 0.026564 0.197547 0.0292752 0.000117618
thread0::slice 5010 85.3612 0.015276 0.045218 0.0170382 6.84541e-05
thread0::feed 20040 56.402 0.001333 0.011693 0.00281447 4.52307e-05
thread0::fetch 5010 43.0507 0.00684 0.043593 0.00859295 3.45238e-05
20 threads profile:
I0102 03:11:21.370021 12402 inference.cc:357] Run 5010 samples, average latency: 29.5513 ms per sample.
I0102 03:11:21.370100 12402 inference.cc:362] Run 5009 samples, average latency [exclude 1 warmup steps]: 29.5477 ms per sample.
-------------------------> Profiling Report <-------------------------
Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total Min. Max. Ave. Ratio.
thread0::fc 370740 108360 0.016532 18.4293 0.292281 0.73505
thread0::elementwise_add 190380 9398.93 0.027077 7.27942 0.0493693 0.0637567
thread0::matmul 125250 7594.48 0.021428 9.68409 0.0606346 0.0515164
thread0::transpose2 240480 7444.04 0.02215 7.14557 0.0309549 0.0504959
thread0::layer_norm 125250 4383.38 0.029262 6.79018 0.0349971 0.0297342
thread0::scale 70140 3930.57 0.00581 7.21213 0.0560389 0.0266626
thread0::softmax 60120 2166.03 0.029904 7.0274 0.0360284 0.014693
thread0::reshape2 240480 1896.58 0.006059 7.18181 0.00788662 0.0128652
thread0::lookup_table 15030 1283.8 0.070953 7.12537 0.0854156 0.0087085
thread0::stack 5010 376.056 0.063878 0.828854 0.075061 0.00255093
thread0::load 202 272.917 0.009124 171.089 1.35107 0.0018513
thread0::tanh 5010 136.909 0.024189 0.195245 0.0273272 0.00092871
thread0::slice 5010 81.6364 0.014465 0.043562 0.0162947 0.000553772
thread0::feed 20040 54.9063 0.00108 0.024591 0.00273984 0.000372451
thread0::fetch 5010 38.3997 0.00608 0.039843 0.00766461 0.00026048
The latest FP32 results for the clean develop branch (c7b03d308c, Jan 2nd) on CLX 6248, 4-dimensional input (fp32_model, test_ds). 1 thread profile:
I0103 04:47:21.063290 111775 inference.cc:354] Run 5010 samples, average latency: 186.189 ms per sample.
I0103 04:47:21.063355 111775 inference.cc:359] Run 5009 samples, average latency [exclude 1 warmup steps]: 186.18 ms per sample.
-------------------------> Profiling Report <-------------------------
Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total Min. Max. Ave. Ratio.
thread0::fc 370740 789200 0.012302 8.10634 2.12871 0.846926
thread0::elementwise_add 190380 67363.2 0.124292 11.9827 0.353835 0.0722904
thread0::matmul 125250 25282.5 0.021207 9.8986 0.201856 0.0271318
thread0::transpose2 240480 18951.5 0.061624 0.722954 0.078807 0.0203377
thread0::layer_norm 125250 14751.8 0.102778 0.470418 0.117778 0.0158308
thread0::softmax 60120 8870.11 0.138533 0.872575 0.14754 0.00951892
thread0::scale 70140 3878.18 0.004727 0.113928 0.055292 0.00416185
thread0::reshape2 240480 1665.39 0.005301 0.057911 0.00692528 0.00178721
thread0::lookup_table 15030 1000.76 0.054471 0.126544 0.0665843 0.00107396
thread0::stack 5010 470.501 0.083346 0.181878 0.0939125 0.000504916
thread0::load 202 151.379 0.007597 19.1899 0.7494 0.000162451
thread0::tanh 5010 116.446 0.020674 0.164315 0.0232427 0.000124964
thread0::slice 5010 63.1702 0.011075 0.037661 0.0126088 6.77908e-05
thread0::feed 20040 44.1006 0.000886 0.006938 0.00220063 4.73264e-05
thread0::fetch 5010 31.8417 0.00492 0.040041 0.00635563 3.41708e-05
Here are our latest performance results for Ernie FP32 and INT8 runs. The tests were run with affinity settings
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
on CLX 6248.
With the current develop branch (ad0dfb1) and the origin FP32 model, the latency was:
FP32, 20 threads: 29.73 ms,
FP32, 1 thread: 196.86 ms.
INT8 results with fc, reshape2 and transpose2 quantized:
INT8, 20 threads: 20.26 ms (31.8 % faster),
INT8, 1 thread: 78.40 ms (60.2 % faster).
We have finally resolved the accuracy issue for Ernie INT8 run.
Background: FC operators in the Ernie model have 2-dimensional weights and receive 3-dimensional inputs. For the MKL-DNN inner product to be used it requires transforming the weights and inputs into proper 3-dimensional tensors. As long as the input is in NCHW format (for FP32), the transformation is possible and Ernie FP32 achieves good accuracy. However, when the input is in NHWC format (for INT8), transforming the input and the weights into proper tensors requires modifications of strides, which are not supported by the MKL-DNN inner product.
Solution:
The NHWC format of INT8 input data is beneficial for the conv2d INT8 kernel (because of the nature of conv2d's operations), but for the FC the NCHW format remains optimal also for INT8. The solution is to add support for an NCHW INT8 output of the quantize operator and use it before FC INT8. This fixes the accuracy problem. Having applied the fix, we got 0.795582 accuracy for Ernie INT8 (on an SKX; on a CLX it will probably be even better; for the original QAT model, we got 0.79759 on a CLX). A PR with a fix and benchmark results will be prepared next week.
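To make the layout issue more concrete, here is a plain numpy illustration (not Paddle or MKL-DNN code, with made-up shapes) of why the 3D FC input collapses to a single 2D GEMM for free only when it is stored contiguously in its logical order, while a permuted (NHWC-like) storage first needs a reorder:

```python
import numpy as np

batch, seq, hidden, out = 1, 128, 768, 768
x = np.random.rand(batch, seq, hidden).astype(np.float32)   # 3D FC input
w = np.random.rand(hidden, out).astype(np.float32)          # 2D FC weights

# Contiguous (NCHW-like) storage: collapsing the leading dims is a zero-copy
# reshape, so the FC becomes one [batch*seq, hidden] x [hidden, out] GEMM.
y = (x.reshape(batch * seq, hidden) @ w).reshape(batch, seq, out)

# NHWC-like storage of the same values: the logical [batch, seq, hidden]
# view is strided, so the same collapse requires a reorder (copy) before
# the inner product can consume it.
x_permuted = np.ascontiguousarray(np.moveaxis(x, -1, 1))   # physically permuted storage
x_view = np.moveaxis(x_permuted, 1, -1)                    # logical view, strided
assert not x_view.flags["C_CONTIGUOUS"]

y2 = (np.ascontiguousarray(x_view)                         # explicit reorder
      .reshape(batch * seq, hidden) @ w).reshape(batch, seq, out)
assert np.allclose(y, y2)
```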
With the fix for FC INT8 (PR https://github.com/PaddlePaddle/Paddle/pull/22404, branch Sand3r-:mgallus/3d-fc-acc) we got the following results for Ernie on a CLX 6248 machine:
Model | accuracy | latency (ms), 1 thread | latency (ms), 20 threads |
---|---|---|---|
FP32 | 0.802008 | 190.1 | 31.2 |
INT8 | 0.799598 | 70.6 | 17.3 |
QAT FP32 | 0.797590 | | |