Zhai Feiyue comments


Optimize inference performance of ERNIE on CPU

@tensor-tang Regarding the poor multi-threaded performance of `Ernie`, we ran the following tests:

- Environment variables (*disable HT and Turbo Boost*)

```bash
export OMP_NUM_THREADS=20
export KMP_BLOCKTIME=1
export KMP_AFFINITY=granularity=fine,compact,1,0
numactl --cpunodebind=0 --membind=0 CMD
```

- Intel-MKLML small package UT

|GEMM Size(M,N,K)|TH=1|TH=20|
|--|--|--|
| 128...

Optimize inference performance of ERNIE on CPU

> We also measured Paddle's unit-test numbers; with 20 threads it is about 600 us, which seems inconsistent with your result here.

Is the UT data the same as ours?

Optimize inference performance of ERNIE on CPU

I updated the results for `128 * 800 * 3100`. They show that the UT with padding can match TF.

Optimize inference performance of ERNIE on CPU

# Conclusion

With 20 threads, Ernie is bottlenecked on L3 and DDR. The reason is that at 20 threads Ernie's MKL call takes a different MKL path from TF's. TF adds padding, as follows:

>128 * 768 * 3072 > 128 * 772 * 3076 128 * 3072 *...
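
The sizes quoted above grow by 4 in two dimensions (768 to 772, 3072 to 3076). The following is a minimal sketch of why such padding leaves the GEMM result unchanged, assuming the extra rows and columns are filled with zeros. It uses toy sizes and a naive pure-Python product; the helper names `matmul`, `pad`, and `padded_matmul` are hypothetical and are not Paddle, TF, or MKL code.

```python
def matmul(a, b):
    """Naive product of an m x k matrix a and a k x n matrix b (lists of rows)."""
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def pad(mat, rows, cols):
    """Return a copy of mat zero-padded to rows x cols."""
    out = [list(r) + [0] * (cols - len(r)) for r in mat]
    out += [[0] * cols for _ in range(rows - len(mat))]
    return out


def padded_matmul(a, b, extra=4):
    """Multiply after zero-padding K and N by `extra`, then slice the result.

    Assumption: zero padding. The padded K entries contribute 0 to every
    dot product, and the padded N columns are dropped at the end.
    """
    m, k, n = len(a), len(b), len(b[0])
    a_p = pad(a, m, k + extra)          # m x (k + extra)
    b_p = pad(b, k + extra, n + extra)  # (k + extra) x (n + extra)
    c_p = matmul(a_p, b_p)              # m x (n + extra)
    return [row[:n] for row in c_p]     # back to m x n


a = [[1, 2, 3], [4, 5, 6]]
b = [[1, 0], [0, 1], [2, 2]]
print(padded_matmul(a, b) == matmul(a, b))
```

The sketch only checks numerical equivalence. Whether the padded shape is faster depends on the MKL code path and the cache behavior discussed in this thread, which the sketch does not model.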

Optimize inference performance of ERNIE on CPU

The conclusion above applies only to AVX512; AVX2 behaves normally.

Optimize inference performance of ERNIE on CPU

- 1 thread: UT, TF, and Ernie all end up on the same MKL path
- 20 threads: UT and Ernie take the same path; TF takes a different one (the same as with 1 thread)
- 20 threads: after adding padding, UT can match TF, and the MKL path...

Optimize inference performance of ERNIE on CPU

VTune shows that the Gaowei8 padding branch takes the same path as TF.

Timing comparison (MKL VERBOSE):

|size|Ernie|TF|
|:--:|--:|--:|
| 3072,128,768 | 352 us | 340 us |
| 768,128,3072 | 340 us | 338 us |...

Optimize inference performance of ERNIE on CPU

The current conclusion is that, after padding:

- In the docker environment, Ernie is 37% higher than TF
- Outside docker, Ernie is 65% higher than TF

Optimize inference performance of ERNIE on CPU

@GaoWei8 Updated the single-thread data after padding.

Optimize inference performance of ERNIE on CPU

The padding memory time cost includes `memory allocation`, `data copy`, and `memory release`, right?
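
To make the three components in the question concrete, here is a rough sketch that times allocation, copy, and release separately for a row-padded buffer. It is an illustration only: the function name `time_padding_phases` is hypothetical, `bytearray` stands in for a float32 buffer, and Python timings say nothing about the actual allocator or copy cost in Paddle.

```python
import time


def time_padding_phases(rows, cols, pad_cols, elem_size=4):
    """Time allocate / copy / release for a row-padded buffer.

    Assumption: row-major layout with `elem_size` bytes per element
    (4 for float32); each row gets `pad_cols` zero elements appended.
    """
    src = bytearray(rows * cols * elem_size)  # stand-in for the input matrix
    row_bytes = cols * elem_size
    padded_bytes = (cols + pad_cols) * elem_size

    t0 = time.perf_counter()
    dst = bytearray(rows * padded_bytes)      # allocation (zero-filled)
    t1 = time.perf_counter()

    src_view, dst_view = memoryview(src), memoryview(dst)
    for r in range(rows):                     # copy row by row into padded rows
        start = r * padded_bytes
        dst_view[start:start + row_bytes] = src_view[r * row_bytes:(r + 1) * row_bytes]
    t2 = time.perf_counter()

    del dst_view                              # drop the view first, then the buffer
    del dst                                   # release
    t3 = time.perf_counter()

    return {"alloc": t1 - t0, "copy": t2 - t1, "free": t3 - t2}


phases = time_padding_phases(128, 768, 4)
print(sorted(phases))
```

Reporting the three phases separately makes it possible to tell whether a measured padding overhead is dominated by the copy or by allocation and release, which is what the question above is trying to pin down.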