
Turbo slower than Torch on V100

xutianming opened this issue 4 years ago · 15 comments

Dear developers,

I am trying to reproduce the bert benchmarking result on my machine.

[screenshot: reference benchmark results]

I just ran bash run_gpu_benchmark.sh, but the QPS is much lower than the declared value. When seq_len becomes larger than 80, Turbo becomes slower than Torch.
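For reference, QPS here is just completed forward passes per second. A minimal sketch of how a benchmark script computes it, with a stand-in workload in place of the real BERT forward pass:

```python
import time

def measure_qps(fn, n_iters=100):
    """Call fn n_iters times and return queries per second."""
    start = time.perf_counter()
    for _ in range(n_iters):
        fn()
    elapsed = time.perf_counter() - start
    return n_iters / elapsed

# Stand-in workload; the real benchmark calls the model's forward pass.
qps = measure_qps(lambda: sum(i * i for i in range(10000)))
print(f"QPS: {qps:.1f}")
```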

[screenshot: local benchmark results]

I installed TurboTransformers from source

mkdir -p build && cd build
cmake .. -DWITH_GPU=ON
make -j 4
pip install `find . -name "*.whl"`

xutianming avatar Jul 20 '20 06:07 xutianming

Thanks for your report. Compared with my previous results, the QPS of both Torch and Turbo in your screenshot is not correct. With the 0418 version, at seq_len 100 and batch size 1, the QPS of Turbo is 298.10 while the QPS of Torch is 122.33 on V100. I will check it with you.

feifeibear avatar Jul 20 '20 06:07 feifeibear

Thanks for your report. Compared with my previous results, the QPS of both Torch and Turbo in your screenshot is not correct. With the 0418 version, at seq_len 100 and batch size 1, the QPS of Turbo is 298.10 while the QPS of Torch is 122.33 on V100. I will check it with you.

Thanks for your quick reply. I installed TurboTransformers from source with the following commands. Is there any way to verify the installation?

mkdir -p build && cd build
cmake .. -DWITH_GPU=ON
make -j 4
pip install `find . -name "*.whl"`
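One quick sanity check is to import the installed wheel from outside the build tree, so Python does not pick up the source directory instead of the installed package. A sketch (it assumes the wheel exposes a turbo_transformers module):

```python
# Verify the wheel is importable; run this from outside the build directory.
def check_install():
    try:
        import turbo_transformers
        return getattr(turbo_transformers, "__file__", "<built-in>")
    except ImportError:
        return None

location = check_install()
if location:
    print("turbo_transformers installed at:", location)
else:
    print("turbo_transformers is not installed")
```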

xutianming avatar Jul 20 '20 06:07 xutianming

I have no V100 on hand. Could you please try our previous commit and check the benchmark results? git reset --hard 64dd569da9ce8bf1f78fcd108356607371b742ed

feifeibear avatar Jul 20 '20 06:07 feifeibear

I tried the benchmark again with thufeifeibear/turbo_transformers_gpu:latest.

[screenshot: benchmark results with the docker image]

The speed of TurboTransformers seemed reasonable, but Torch was slower than expected.

xutianming avatar Jul 20 '20 07:07 xutianming

Check whether your torch build is using CUDA: torch.cuda.is_available()
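For completeness, a small check script along those lines (guarded so it also runs where PyTorch is absent):

```python
# Confirm PyTorch is present and was built with CUDA support.
def cuda_status():
    try:
        import torch
    except ImportError:
        return None  # PyTorch missing entirely
    return torch.cuda.is_available()

status = cuda_status()
if status is None:
    print("PyTorch is not installed in this environment")
elif status:
    import torch
    print("CUDA OK, device:", torch.cuda.get_device_name(0))
else:
    print("PyTorch installed, but CUDA is NOT available")
```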

feifeibear avatar Jul 20 '20 07:07 feifeibear

I have no V100 on hand. Could you please try our previous commit and check the benchmark results? git reset --hard 64dd569

For the latest master, the problem still exists: [screenshot]

For commit 64dd569, it looks good: [screenshot]

By the way, Torch is just as slow. torch.cuda.is_available() returns True.

xutianming avatar Jul 20 '20 08:07 xutianming

What I can confirm is that everything is OK on RTX 2060. Avoid using the Docker Hub image; maybe you should build a docker image from scratch yourself.

feifeibear avatar Jul 20 '20 08:07 feifeibear

What I can make sure is that everything is OK on RTX 2060.

Do you mean that the performance of the latest master meets expectations on RTX 2060?

We mainly run inference on T4 online. I'll try to build a docker image.

xutianming avatar Jul 20 '20 08:07 xutianming

I will apply for a V100 and check the code on it. BTW, you can also profile the latest Turbo version to see which kernel is wrong: https://github.com/Tencent/TurboTransformers/blob/master/docs/profiler.md
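Turbo's own profiler (linked above) reports per-kernel elapsed time. As a rough, framework-agnostic stand-in, per-stage wall-clock timing can be sketched with a context manager (the stage names here are made up):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)

@contextmanager
def stage(name):
    """Accumulate wall-clock time spent inside the with-block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - start

# Hypothetical stages standing in for real kernels.
with stage("embedding"):
    sum(range(100_000))
with stage("self_attention"):
    sum(range(300_000))

# Report stages sorted by time spent, slowest first.
for name, t in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {t * 1e3:8.3f} ms")
```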

feifeibear avatar Jul 20 '20 08:07 feifeibear

[screenshot: unit-test timing report]

According to the unit-test report, the performance of Albert seems abnormal, too.

xutianming avatar Jul 20 '20 13:07 xutianming

[screenshot: Albert profiling output]

I profiled Albert.

xutianming avatar Jul 20 '20 14:07 xutianming

[screenshot: unit-test timing report]

According to the unit-test report, the performance of Albert seems abnormal, too.

It seems turbo albert's performance is randomly higher or lower than torch?

lsy641 avatar Jul 20 '20 17:07 lsy641

[screenshot: unit-test timing report] According to the unit-test report, the performance of Albert seems abnormal, too.

It seems turbo albert's performance is randomly higher or lower than torch?

The tuple after AlbertLayer in the log is (batch_size, seq_length). So as the workload increases, Turbo's performance declines faster than Torch's.

xutianming avatar Jul 21 '20 01:07 xutianming

It may be a bug in the allocator. We now use NVlabs/cub; try the hand-crafted allocator instead.

git reset --hard bebe404b4d9ea8e18c72c19625dadcc184188236

feifeibear avatar Jul 21 '20 02:07 feifeibear

After git reset --hard 78356c54ad8fee0e4dad6357948f5af3cf02d2d3, the performance is as expected on V100 + CUDA 10.

xutianming avatar Jul 21 '20 03:07 xutianming