TurboTransformers
Turbo slower than Torch on V100
Dear developers,
I am trying to reproduce the BERT benchmarking results on my machine. I just ran
bash run_gpu_benchmark.sh
but the QPS is much lower than the declared value. When seq_len becomes larger than 80, Turbo becomes slower than Torch.
I installed TurboTransformers from source:
mkdir -p build && cd build
cmake .. -DWITH_GPU=ON
make -j 4
pip install `find . -name *whl`
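For reference, I am cross-checking the Torch QPS number with a crude timing loop of my own (just an illustration with a randomly initialized bert-base config, not the repo's benchmark code):

import time
import torch
import transformers

cfg = transformers.BertConfig()                        # bert-base sized, randomly initialized
model = transformers.BertModel(cfg).eval().cuda()

batch_size, seq_len, n_iter = 1, 100, 100
input_ids = torch.randint(low=0, high=cfg.vocab_size,
                          size=(batch_size, seq_len), dtype=torch.long).cuda()

with torch.no_grad():
    for _ in range(10):                                # warm-up
        model(input_ids)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iter):
        model(input_ids)
    torch.cuda.synchronize()
print("torch QPS:", n_iter / (time.time() - start))    # forward passes per second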
Thanks for your report. Compared with my previous results, the QPS of both Torch and Turbo in your screenshot looks incorrect. According to a 0418 version, with seq_len 100 and batch size 1, the QPS of Turbo is 298.10 while the QPS of Torch is 122.33 on V100. I will check it with you.
Thanks for your quick reply. I installed TurboTransformers from source with the following commands. Is there any way to verify the installation?
mkdir -p build && cd build
cmake .. -DWITH_GPU=ON
make -j 4
pip install `find . -name *whl`
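For example, would a quick output comparison against plain Torch like the sketch below be a reasonable check? I copied the from_torch pattern from the README example, so please treat the exact API and the output layout as my assumptions.

import torch
import transformers
import turbo_transformers                    # should import cleanly if the wheel installed correctly

cfg = transformers.BertConfig()
torch_model = transformers.BertModel(cfg).eval().cuda()
turbo_model = turbo_transformers.BertModel.from_torch(torch_model)   # pattern from the README example; assumption

input_ids = torch.randint(low=0, high=cfg.vocab_size, size=(1, 40), dtype=torch.long).cuda()
with torch.no_grad():
    torch_out = torch_model(input_ids)[0]    # last hidden states, shape (1, 40, 768)
turbo_out = turbo_model(input_ids)           # return layout is my assumption
turbo_first = turbo_out[0] if isinstance(turbo_out, (tuple, list)) else turbo_out

print("shapes:", torch_out.shape, getattr(turbo_first, "shape", type(turbo_first)))
if getattr(turbo_first, "shape", None) == torch_out.shape:
    # If the layouts line up, the two outputs should agree to within numerical noise.
    print("max abs diff:", (torch_out - turbo_first).abs().max().item())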
I have no V100 on hand. Could you please try our previous commit and check the benchmark results?
git reset --hard 64dd569da9ce8bf1f78fcd108356607371b742ed
I tried the benchmark again with thufeifeibear/turbo_transformers_gpu:latest.
The speed of TurboTransformers seemed reasonable, but Torch was slower than expected.
Check whether your torch is using CUDA:
torch.cuda.is_available()
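For example, printing these (plain torch calls only) should confirm the GPU setup:

import torch
print(torch.cuda.is_available())        # must be True, otherwise the benchmark falls back to CPU
print(torch.version.cuda)               # CUDA version torch was built against
print(torch.cuda.get_device_name(0))    # should report the V100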
For the latest master, the problem still exists.
For commit 64dd569, it looks good.
By the way, Torch is still as slow, and torch.cuda.is_available() returns True.
What I can confirm is that everything is OK on an RTX 2060. Avoid using the Docker Hub image; you should probably build a Docker image from scratch yourself.
Do you mean that the performance of the latest master meets expectations on an RTX 2060?
We mainly run inference on T4 online. I'll try to build a Docker image.
I will apply for a V100 and check the code on it. BTW, you can also profile the latest Turbo version to see which kernel is wrong: https://github.com/Tencent/TurboTransformers/blob/master/docs/profiler.md
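The profiler doc covers Turbo's own kernels. On the Torch side, a per-operator breakdown from torch's built-in autograd profiler is a useful reference point, e.g. (just a sketch with a randomly initialized bert-base):

import torch
import transformers

model = transformers.BertModel(transformers.BertConfig()).eval().cuda()
input_ids = torch.randint(low=0, high=30522, size=(1, 100), dtype=torch.long).cuda()   # 30522 = bert-base vocab size

with torch.no_grad():
    model(input_ids)                                             # warm-up
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for _ in range(10):
            model(input_ids)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))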
According to the unit-test report, the performance of Albert seems abnormal, too.
I profiled Albert.
It seems Turbo Albert's performance is randomly higher or lower than Torch's?
The tuple after AlbertLayer in the log is (batch_size, seq_length). So as the workload increases, the performance of Turbo declines faster than that of Torch.
It may be a bug in the allocator. We now use NVLab/cub; try a hand-crafted allocator instead:
git reset --hard bebe404b4d9ea8e18c72c19625dadcc184188236
After git reset --hard 78356c54ad8fee0e4dad6357948f5af3cf02d2d3, the performance is as expected on V100 + CUDA 10.