Optimize inference performance of ERNIE INT8 on CPU
Now, Paddle ERNIE FP32 inference performance on CPU is as below: single thread: 251.464 ms, 20 threads: 29.8818 ms. Our goal is to prove that ERNIE can get a performance gain with real INT8 kernels. @Sand3r- @wojtuss Please update your benchmark progress here. @wzzju @luotao1 Please track the status here.
We are currently working with the following two models: ernie_quant.zip and fp32_model.tar.gz.
Results gathered with the optimized fp32_model model will be labeled FP32.
Results gathered with the fake-quantized FP32 ernie_quant model will be labeled QAT FP32.
Results gathered with an optimized INT8 model obtained from the ernie_quant model will be labeled QAT INT8.
Initial performance results using the former (for FP32, 4 inputs, test_ds dataset) and the latest (for QAT FP32/INT8, 2 inputs, 1.8w.bs1 dataset) unit test obtained from Baidu.
(20 threads, bs 1, CLX 6248)
FP32:
Run 5010 samples, average latency: 55.7539 ms per sample.
Run 5009 samples, average latency [exclude 1 warmup steps]: 55.7415 ms per sample.
QAT FP32:
Run 2490 samples, average latency: 92.2313 ms per sample.
QAT INT8 (with mul):
Run 2490 samples, average latency: 47.1024 ms per sample.
I have just noticed that the above results were mixed up. They have been updated now.
FP32 results with @GaoWei8's fix (https://github.com/PaddlePaddle/Paddle/pull/20972):
Run 5010 samples, average latency: 32.0582 ms per sample.
Attached are the baseline profile results:
- FP32 with Gaowei's padding patch
Event | Calls | Total | Min. | Max. | Ave. | Ratio. |
---|---|---|---|---|---|---|
fc | 370666 | 127408 | 0.01446 | 20.14810 | 0.34373 | 0.75962 |
elementwise_add | 190342 | 10271.7 | 0.03167 | 4.93884 | 0.05396 | 0.06124 |
transpose2 | 240432 | 7704.78 | 0.02098 | 3.71560 | 0.03205 | 0.04594 |
matmul | 125225 | 7266.52 | 0.02186 | 3.96825 | 0.05803 | 0.04332 |
scale | 70126 | 4823.14 | 0.00407 | 4.92560 | 0.06878 | 0.02876 |
layer_norm | 125225 | 4184.69 | 0.02592 | 4.72700 | 0.03342 | 0.02495 |
softmax | 60108 | 2565.96 | 0.03276 | 2.04443 | 0.04269 | 0.01530 |
reshape2 | 240432 | 1383.47 | 0.00393 | 2.92184 | 0.00575 | 0.00825 |
lookup_table | 15027 | 1331.17 | 0.06888 | 1.23500 | 0.08859 | 0.00794 |
stack | 5009 | 462.643 | 0.06707 | 0.30953 | 0.09236 | 0.00276 |
tanh | 5009 | 181.429 | 0.02849 | 0.12309 | 0.03622 | 0.00108 |
slice | 5009 | 68.8743 | 0.01152 | 0.02409 | 0.01375 | 0.00041 |
fetch | 5009 | 40.985 | 0.00489 | 0.01910 | 0.00818 | 0.00024 |
feed | 20036 | 33.6655 | 0.00066 | 0.01238 | 0.00168 | 0.00020 |
- Real INT8 ERNIE profile
Event | Calls | Total | Min. | Max. | Ave. | Ratio. |
---|---|---|---|---|---|---|
elementwise_add | 290080 | 48707.3 | 0.00604 | 7.29221 | 0.16791 | 0.39068 |
mul | 191660 | 41987.3 | 0.02023 | 4.70104 | 0.21907 | 0.33678 |
reshape2 | 124320 | 9530.43 | 0.02449 | 0.37624 | 0.07666 | 0.07644 |
quantize | 129500 | 5801.52 | 0.00967 | 2.96898 | 0.04480 | 0.04653 |
transpose2 | 124320 | 4886.13 | 0.01829 | 3.68880 | 0.03930 | 0.03919 |
matmul | 64750 | 3845.23 | 0.02139 | 15.42340 | 0.05939 | 0.03084 |
scale | 36260 | 2754.43 | 0.00546 | 0.14959 | 0.07596 | 0.02209 |
relu | 31080 | 2548.26 | 0.05381 | 2.56348 | 0.08199 | 0.02044 |
layer_norm | 64750 | 2127.01 | 0.02338 | 1.18454 | 0.03285 | 0.01706 |
softmax | 31080 | 1192.7 | 0.02731 | 2.38116 | 0.03838 | 0.00957 |
lookup_table | 7770 | 707.691 | 0.04808 | 0.23893 | 0.09108 | 0.00568 |
stack | 2590 | 260.936 | 0.06343 | 0.27896 | 0.10075 | 0.00209 |
slice | 7770 | 74.5761 | 0.00523 | 0.02923 | 0.00960 | 0.00060 |
tanh | 2590 | 57.4421 | 0.01791 | 0.08974 | 0.02218 | 0.00046 |
unsqueeze2 | 5180 | 45.0293 | 0.00550 | 0.02471 | 0.00869 | 0.00036 |
fill_constant | 10360 | 37.9978 | 0.00214 | 0.02271 | 0.00367 | 0.00030 |
fetch | 2590 | 23.3609 | 0.00753 | 0.01779 | 0.00902 | 0.00019 |
range | 2590 | 23.1647 | 0.00674 | 0.01581 | 0.00894 | 0.00019 |
expand | 2590 | 18.4988 | 0.00649 | 0.01439 | 0.00714 | 0.00015 |
cast | 5180 | 16.361 | 0.00223 | 0.01195 | 0.00316 | 0.00013 |
equal | 2590 | 12.2261 | 0.00411 | 0.01321 | 0.00472 | 0.00010 |
feed | 5180 | 11.6075 | 0.00127 | 0.00955 | 0.00224 | 0.00009 |
shape | 2590 | 5.52088 | 0.00176 | 0.01129 | 0.00213 | 0.00004 |
- FP32 ERNIE model and INT8 ERNIE model have slightly different numbers of inputs.
- INT8 optimized latency is currently slower than FP32 optimized latency because:
  - elementwise_add takes a lot of time after quantizing the mul ops; an fc INT8 op should be introduced (with support for 3-dimensional fc inputs) to cover the bias/residual elementwise_add (see the sketch right after this list);
  - too much time is spent on reorders; the INT8 pipeline for the feed-forward part can be optimized further;
  - INT8 GEMM execution with MKL runs without padding (a unit test benchmarking MKL-DNN INT8 GEMM with the same dimensions as the FP32 GEMM is needed).
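As a side note on the elementwise_add point above: a fused fc INT8 kernel would compute the GEMM and the bias addition in one op, instead of running a separate fp32 elementwise_add pass over the whole activation tensor after the quantized mul. A minimal numpy sketch of the equivalence such a fusion relies on (shapes are illustrative only, not taken from the model):

```python
import numpy as np

# Illustrative shapes: activations flattened to [batch * seq_len, hidden],
# 2D weights and a per-output-channel bias.
x = np.random.rand(128, 768).astype(np.float32)
w = np.random.rand(768, 768).astype(np.float32)
b = np.random.rand(768).astype(np.float32)

# Current INT8 path: the quantized mul yields x @ w, then a separate
# elementwise_add pass walks over the whole output again to add the bias.
partial = x @ w
out_two_ops = partial + b

# Planned fc INT8 kernel: GEMM and bias addition in a single op, so the
# extra pass over the output tensor disappears.
out_fused = x @ w + b
assert np.allclose(out_two_ops, out_fused)
```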
- Schedule plan of ERNIE INT8 optimization:
ERNIE INT8 Task | Status | Plan to Finish | Risk | Developer |
---|---|---|---|---|
Accuracy check in UT | WIP | 12.06 | Low | Asia |
MKL-DNN inner product performance benchmark | UT benchmark done | 12.06 | Low | Wojtuss |
3D support in FC INT8 kernel | WIP | 12.13 | High | Michal |
Fusion: FC INT8 | Not started | 12.13 | Dependency | - |
Fusion: FC+scale | WIP | 12.13 | Dependency | Danqing |
Reshape2 + Transpose2 INT8 kernel verification | WIP | 12.13 | Dependency | Asia & Wojtuss |
Overall performance & accuracy tuning | Not started | 12.20 | Dependency | - |
MKL-DNN 0.20 from Paddle (commit aef88b7c2), 20 threads:

Dimension | 128x768x768 | 128x3072x768 | 128x768x3072 |
---|---|---|---|
u8s8s32s32 | 0.0545178 | 0.170797 | 0.129593 |
s8s8s32s32 | 0.102495 | 0.326905 | 0.320389 |
f32 | 0.135169 | 0.419238 | 0.271828 |
f32 -> s8 increase | 24% | 22% | -18% |

1 thread:

Dimension | 128x768x768 | 128x3072x768 | 128x768x3072 |
---|---|---|---|
u8s8s32s32 | 0.21483 | 0.884561 | 0.904923 |
s8s8s32s32 | 0.638012 | 2.62521 | 2.57349 |
f32 | 0.848266 | 3.6114 | 3.40055 |
f32 -> s8 increase | 25% | 27% | 24% |
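For reference, the "f32 -> s8 increase" rows above are consistent with the relative latency reduction of the s8s8s32s32 run with respect to the f32 run; a small script reproducing the 20-thread percentages:

```python
# Relative latency reduction of s8s8s32s32 vs f32 per problem size
# (numbers copied from the 20-thread table above).
f32 = {"128x768x768": 0.135169, "128x3072x768": 0.419238, "128x768x3072": 0.271828}
s8  = {"128x768x768": 0.102495, "128x3072x768": 0.326905, "128x768x3072": 0.320389}

for dims, t_f32 in f32.items():
    gain = (t_f32 - s8[dims]) / t_f32
    print(f"{dims}: {gain:+.0%}")   # +24%, +22%, -18%
```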
@luotao1
I'm working on the accuracy test for ernie2 right now.
I've run the test on:
- ernie_origin_2_inputs.tar.gz (fp32 model with 2 inputs),
- data file: 1.8w.bs1,
- label file: label.xnli.dev, which looks like 2 0 1 2 0 1 0 1 2 1 0 2 ...
I got outputs like below:
--- iteration 204 --- 0.0957182 0.0770019 0.243675
--- iteration 205 --- 0.0957183 0.0770018 0.243675
--- iteration 206 --- 0.0957183 0.0770018 0.243675
--- iteration 207 --- 0.0957182 0.0770018 0.243675
--- iteration 208 --- 0.0957182 0.0770017 0.243675
My question is: am I using the correct file, or is there some procedure for calculating accuracy?
MKL-DNN master (v1.1, commit 1ee831fa6a2f802de1d399fe1de4e6cc629ad855), 20 threads:

problem descriptor | 128x768x768 | 128x3072x768 | 128x768x3072 |
---|---|---|---|
u8s8s32 | 0.0446907 | 0.162134 | 0.132759 |
s8s8s32 | 0.0455766 | 0.15978 | 0.13624 |
f32 | 0.081141 | 0.283924 | 0.467588 |
f32 -> s8 increase | 44% | 44% | 71% |

1 thread:

problem descriptor | 128x768x768 | 128x3072x768 | 128x768x3072 |
---|---|---|---|
u8s8s32 | 0.216017 | 0.925034 | 0.901553 |
s8s8s32 | 0.221654 | 0.947028 | 0.942317 |
f32 | 0.742698 | 2.96747 | 6.87472 |
f32 -> s8 increase | 70% | 68% | 86% |
Below are our latest performance results for Ernie FP32 and INT8 runs. The tests were run with affinity settings
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
on CLX 6248. INT8 tests were run with the memory_optimize_pass disabled, due to https://github.com/PaddlePaddle/Paddle/issues/21492.
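For reproducibility, below is a rough sketch of how such a CPU run can be configured through the Paddle inference API. The method names assume the Paddle ~1.6 Python bindings and are not copied from our benchmark harness, so please verify them against the revision used for the numbers above:

```python
# Hypothetical configuration sketch, not the actual benchmark code.
from paddle.fluid.core import AnalysisConfig, create_paddle_predictor

config = AnalysisConfig("fp32_model")            # directory of the model
config.disable_gpu()                             # CPU-only run
config.switch_ir_optim(True)                     # enable IR passes / fuses
config.enable_mkldnn()                           # use MKL-DNN kernels
config.set_cpu_math_library_num_threads(20)      # 1 or 20 in the runs above
# config.enable_memory_optim() is deliberately left out for the INT8 runs
# because of https://github.com/PaddlePaddle/Paddle/issues/21492.

predictor = create_paddle_predictor(config)
```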
With the current develop branch (1fd1f06) and the origin FP32 model, the latency was:
FP32, 20 threads: 29.10 ms,
FP32, 1 thread: 180.95 ms.
With additional commits from PRs
- FC INT8 3D support: https://github.com/PaddlePaddle/Paddle/pull/21746
- FC+dequantize squash: https://github.com/PaddlePaddle/Paddle/pull/21714
- MKL-DNN 1.1.1 update: https://github.com/PaddlePaddle/Paddle/pull/21754
the latency for INT8 (with FC quantized only) was:
INT8, 20 threads: 23.73 ms (18.45% faster than FP32),
INT8, 1 thread: 84.32 ms (53.40% faster than FP32).
After enabling INT8 kernels for reshape2 and transpose2 (support already present in the develop branch), the results were:
INT8, 20 threads: 20.47 ms (29.68% faster than FP32),
INT8, 1 thread: 80.64 ms (55.43% faster than FP32).
With additional optimizations possible to implement:
- avoiding the reorder nwc->ncw before layer_norm (here imitated by removing the reorder),
- fusing scale ops into fc or into dequantize (here imitated by removing the scale ops; see the sketch at the end of this comment),

the latency for INT8 (with FC quantized only) was:
INT8, 20 threads: 20.68 ms (28.93% faster than FP32),
INT8, 1 thread: 77.84 ms (56.99% faster than FP32).
With INT8 kernels also for reshape2 and transpose2, the latency was:
INT8, 20 threads: 17.39 ms (40.24% faster than FP32),
INT8, 1 thread: 75.66 ms (58.19% faster than FP32).
Other optimizations, like quantization of more operators and more fuses, are also being investigated.
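As background for the scale fuse mentioned in the list above: both dequantize and scale are simple affine maps, so folding scale into dequantize only changes the effective dequantization factor. A small sketch of the identity (the operator semantics below are assumed for illustration, not taken from the Paddle source):

```python
import numpy as np

# Assumed semantics: dequantize(x) = x / q_scale, scale(y) = a * y + b;
# q_scale, a and b are made-up values.
x_int8 = np.random.randint(-128, 128, size=(8,), dtype=np.int8)
q_scale, a, b = 127.0, 0.125, 0.0

y_two_ops = a * (x_int8.astype(np.float32) / q_scale) + b   # dequantize, then scale
y_fused = x_int8.astype(np.float32) * (a / q_scale) + b     # dequantize with folded scale
assert np.allclose(y_two_ops, y_fused)
```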
I repeat my question.
I'm checking accuracy on the trained ernie models with 2 inputs, but the outputs are still of float type.
It looks like this:
--- iteration 0 --- -2.22873 -1.84079 3.08339 ...
And the label file label.xnli.dev consists of integers: 2 0 1 2 0 1 0 1 2 1 0 2 ...
Should the output labels have a floating point type?
The output float numbers are the logits of a 3-label classification task, and the label file consists of the class_id of the true label, so you can evaluate the accuracy with the Python code below:
import sys
import numpy as np

# Each input line: three tab-separated logits followed by the true label.
total, acc = 0, 0
for line in sys.stdin:
    a, b, c, d = line.strip().split('\t')
    # argmax over the three logits gives the predicted class id
    pred = np.array([float(v) for v in (a, b, c)]).argmax()
    if pred == int(d):
        acc += 1
    total += 1
print(float(acc) / total)
run with:
paste score_3_columns label.xnli.dev | python eval.py
@Meiyim, you are right. Thank you for the explanation.
I have updated the results in https://github.com/PaddlePaddle/benchmark/issues/275#issuecomment-565858479 above. Results with additionally quantized reshape2 and transpose2 have been added.
> FP32, 1 thread: 180.95 ms.

@bingyanghuang @wojtuss Could you tell us why the MKL-DNN update to 1.1 brings such a great improvement (250 ms -> 180 ms)? We really want to know what changed in MKL-DNN 1.1.
@luotao1, today we have found that since our last benchmarks there have been some changes in the FC INT8 3D support PR's code (PaddlePaddle/Paddle#21746) which influenced the performance. After merging the PRs into the develop branch, there is a difference between our previous results and today's, both on CLX 6248 and on SKX 6148. Here are results showing the difference on SKX 6148:
- previous FP32: 1 thread: 183.92 ms, 20 threads: 31.09 ms
- current FP32: 1 thread: 214.39 ms, 20 threads: 53.69 ms
We are investigating that.
@luotao1, the PR with FC INT8 3D support has already been fixed and updated.
The latest FP32 results for the current clean develop branch (25e765a4fe) on SKX 6148:
- 4-dimensional input (fp32_model, test_ds): 1 thread: 189.39 ms, 20 threads: 30.20 ms.
- 2-dimensional input (origin, 1.8w.bs1): 1 thread: 192.31 ms, 20 threads: 33.90 ms.

After merging the PRs PaddlePaddle/Paddle#21746 and PaddlePaddle/Paddle#21754, the FP32 results are basically the same:
- 2-dimensional input (origin, 1.8w.bs1): 1 thread: 194.45 ms, 20 threads: 33.93 ms.
@GaoWei8 Please double-check the FP32 performance.
The latest FP32 results for the current clean develop branch (b9dbde12b3) on SKX 6148:
4-dimensional input (fp32_model, test_ds): 1 thread: 249.714 ms, 20 threads: 29.741 ms
@GaoWei8 Could you please paste your profile log of FP32 baseline in this issue? Let's try to align our baseline.
@bingyanghuang The latest FP32 results for the current clean develop branch (b9dbde12b3) on SKX 6148, 4-dimensional input (fp32_model, test_ds). 1 thread profile:
I0102 03:32:40.693799 12448 inference.cc:357] Run 5010 samples, average latency: 249.052 ms per sample.
I0102 03:32:40.693889 12448 inference.cc:362] Run 5009 samples, average latency [exclude 1 warmup steps]: 249.05 ms per sample.
-------------------------> Profiling Report <-------------------------
Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total Min. Max. Ave. Ratio.
thread0::fc 370740 1.10423e+06 0.016733 6.69891 2.97844 0.885518
thread0::elementwise_add 190380 43708.8 0.115389 0.955863 0.229587 0.0350516
thread0::matmul 125250 33400 0.025127 2.02142 0.266667 0.0267846
thread0::transpose2 240480 22726.5 0.069161 0.33388 0.0945046 0.0182251
thread0::layer_norm 125250 19296 0.138214 0.232502 0.15406 0.0154741
thread0::softmax 60120 14590.1 0.236914 0.48581 0.242683 0.0117003
thread0::scale 70140 4239.09 0.006753 0.116512 0.0604376 0.00339947
thread0::reshape2 240480 2274.57 0.0074 0.055957 0.00945847 0.00182406
thread0::lookup_table 15030 1274.28 0.076111 0.160232 0.0847821 0.00102188
thread0::stack 5010 646.453 0.121365 0.180305 0.129033 0.000518413
thread0::load 202 270.43 0.008871 169.244 1.33876 0.000216867
thread0::tanh 5010 146.669 0.026564 0.197547 0.0292752 0.000117618
thread0::slice 5010 85.3612 0.015276 0.045218 0.0170382 6.84541e-05
thread0::feed 20040 56.402 0.001333 0.011693 0.00281447 4.52307e-05
thread0::fetch 5010 43.0507 0.00684 0.043593 0.00859295 3.45238e-05
20 threads profile:
I0102 03:11:21.370021 12402 inference.cc:357] Run 5010 samples, average latency: 29.5513 ms per sample.
I0102 03:11:21.370100 12402 inference.cc:362] Run 5009 samples, average latency [exclude 1 warmup steps]: 29.5477 ms per sample.
-------------------------> Profiling Report <-------------------------
Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total Min. Max. Ave. Ratio.
thread0::fc 370740 108360 0.016532 18.4293 0.292281 0.73505
thread0::elementwise_add 190380 9398.93 0.027077 7.27942 0.0493693 0.0637567
thread0::matmul 125250 7594.48 0.021428 9.68409 0.0606346 0.0515164
thread0::transpose2 240480 7444.04 0.02215 7.14557 0.0309549 0.0504959
thread0::layer_norm 125250 4383.38 0.029262 6.79018 0.0349971 0.0297342
thread0::scale 70140 3930.57 0.00581 7.21213 0.0560389 0.0266626
thread0::softmax 60120 2166.03 0.029904 7.0274 0.0360284 0.014693
thread0::reshape2 240480 1896.58 0.006059 7.18181 0.00788662 0.0128652
thread0::lookup_table 15030 1283.8 0.070953 7.12537 0.0854156 0.0087085
thread0::stack 5010 376.056 0.063878 0.828854 0.075061 0.00255093
thread0::load 202 272.917 0.009124 171.089 1.35107 0.0018513
thread0::tanh 5010 136.909 0.024189 0.195245 0.0273272 0.00092871
thread0::slice 5010 81.6364 0.014465 0.043562 0.0162947 0.000553772
thread0::feed 20040 54.9063 0.00108 0.024591 0.00273984 0.000372451
thread0::fetch 5010 38.3997 0.00608 0.039843 0.00766461 0.00026048
The latest FP32 results for the clean develop branch (c7b03d308c, Jan 2nd) on CLX 6248, 4-dimensional input (fp32_model, test_ds). 1 thread profile:
I0103 04:47:21.063290 111775 inference.cc:354] Run 5010 samples, average latency: 186.189 ms per sample.
I0103 04:47:21.063355 111775 inference.cc:359] Run 5009 samples, average latency [exclude 1 warmup steps]: 186.18 ms per sample.
-------------------------> Profiling Report <-------------------------
Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total Min. Max. Ave. Ratio.
thread0::fc 370740 789200 0.012302 8.10634 2.12871 0.846926
thread0::elementwise_add 190380 67363.2 0.124292 11.9827 0.353835 0.0722904
thread0::matmul 125250 25282.5 0.021207 9.8986 0.201856 0.0271318
thread0::transpose2 240480 18951.5 0.061624 0.722954 0.078807 0.0203377
thread0::layer_norm 125250 14751.8 0.102778 0.470418 0.117778 0.0158308
thread0::softmax 60120 8870.11 0.138533 0.872575 0.14754 0.00951892
thread0::scale 70140 3878.18 0.004727 0.113928 0.055292 0.00416185
thread0::reshape2 240480 1665.39 0.005301 0.057911 0.00692528 0.00178721
thread0::lookup_table 15030 1000.76 0.054471 0.126544 0.0665843 0.00107396
thread0::stack 5010 470.501 0.083346 0.181878 0.0939125 0.000504916
thread0::load 202 151.379 0.007597 19.1899 0.7494 0.000162451
thread0::tanh 5010 116.446 0.020674 0.164315 0.0232427 0.000124964
thread0::slice 5010 63.1702 0.011075 0.037661 0.0126088 6.77908e-05
thread0::feed 20040 44.1006 0.000886 0.006938 0.00220063 4.73264e-05
thread0::fetch 5010 31.8417 0.00492 0.040041 0.00635563 3.41708e-05
Here are our latest performance results for Ernie FP32 and INT8 runs. The tests were run with affinity settings
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
on CLX 6248.
With the current develop branch (ad0dfb1) and the origin FP32 model, the latency was:
FP32, 20 threads: 29.73 ms,
FP32, 1 thread: 196.86 ms.
INT8 results with fc, reshape2 and transpose2 quantized:
INT8, 20 threads: 20.26 ms (31.8 % faster),
INT8, 1 thread: 78.40 ms (60.2 % faster).
We have finally resolved the accuracy issue for Ernie INT8 run.
Background: FC operators in the Ernie model have 2-dimensional weights and receive 3-dimensional inputs. For the MKL-DNN inner product to be used it requires transforming the weights and inputs into proper 3-dimensional tensors. As long as the input is in NCHW format (for FP32), the transformation is possible and Ernie FP32 achieves good accuracy. However, when the input is in NHWC format (for INT8), transforming the input and the weights into proper tensors requires modifications of strides, which are not supported by the MKL-DNN inner product.
Solution:
The NHWC format of INT8 input data is beneficial for the conv2d INT8 kernel (because of the nature of conv2d's operations), but for the FC the NCHW format remains optimal also for INT8. The solution is to add support for an NCHW INT8 output of the quantize operator and use it before FC INT8. This fixes the accuracy problem. Having applied the fix, we got 0.795582 accuracy for Ernie INT8 (on an SKX; on a CLX it will probably be even better; for the original QAT model, we got 0.79759 on a CLX). A PR with a fix and benchmark results will be prepared next week.
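To make the layout issue more concrete, here is a plain numpy illustration (not Paddle or MKL-DNN code, with made-up shapes) of why the 3D FC input collapses to a single 2D GEMM for free only when it is stored contiguously in its logical order, while a permuted (NHWC-like) storage first needs a reorder:

```python
import numpy as np

batch, seq, hidden, out = 1, 128, 768, 768
x = np.random.rand(batch, seq, hidden).astype(np.float32)   # 3D FC input
w = np.random.rand(hidden, out).astype(np.float32)          # 2D FC weights

# Contiguous (NCHW-like) storage: collapsing the leading dims is a zero-copy
# reshape, so the FC becomes one [batch*seq, hidden] x [hidden, out] GEMM.
y = (x.reshape(batch * seq, hidden) @ w).reshape(batch, seq, out)

# NHWC-like storage of the same values: the logical [batch, seq, hidden]
# view is strided, so the same collapse requires a reorder (copy) before
# the inner product can consume it.
x_permuted = np.ascontiguousarray(np.moveaxis(x, -1, 1))   # physically permuted storage
x_view = np.moveaxis(x_permuted, 1, -1)                    # logical view, strided
assert not x_view.flags["C_CONTIGUOUS"]

y2 = (np.ascontiguousarray(x_view)                         # explicit reorder
      .reshape(batch * seq, hidden) @ w).reshape(batch, seq, out)
assert np.allclose(y, y2)
```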
With the fix for FC INT8 (PR https://github.com/PaddlePaddle/Paddle/pull/22404, branch Sand3r-:mgallus/3d-fc-acc) we got the following results for Ernie on a CLX 6248 machine:
Model | accuracy | latency (ms), 1 thread | latency (ms), 20 threads |
---|---|---|---|
FP32 | 0.802008 | 190.1 | 31.2 |
INT8 | 0.799598 | 70.6 | 17.3 |
QAT FP32 | 0.797590 | | |