make TFusedMHAKernelFactory thread_local
Issue
When running FasterTransformer with multiple threads, where each thread drives a different GPU, and use_trt_kernels=true, kernel launches throw CUDA_ERROR_INVALID_HANDLE. Exactly one of the GPUs always runs fine; the others all fail.
Cause
The custom kernels (and hence the CUfunction handles) are loaded dynamically and are tied to the driver context that loaded them. When a different thread's context becomes current, those handles are no longer valid, so launching through them fails.
Solution
Make the factory object thread_local, so that each thread keeps its own copy of the loaded custom kernels, valid in that thread's context.
Hi Bangsheng, could you please share more information, e.g. which GPU architecture? A small Gist with a reproducer would be very helpful. Thanks!
@yjk21 thanks for your reply.
I'm using a system with multiple A100 cards.
In order to reproduce, I made some changes to the C++ sample, essentially running two threads on two separate devices; please refer to: https://github.com/bangshengtang/FasterTransformer/blob/repro/sample/cpp/encoder_sample.cc (diff)
Run it with ./encoder_sample 1 12 128 12 64 1 0 0
(essentially the same arguments as in the original example, except using fp16 instead of fp32)
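The structure of the modified sample is roughly the following. This is a schematic sketch only: `run_encoder_on_device` and `run_all` are hypothetical stand-ins for the per-thread body in encoder_sample.cc (device binding plus the fp16 encoder forward pass), with the CUDA calls left as comments:

```cpp
#include <functional>
#include <thread>
#include <vector>

// Stand-in for "cudaSetDevice(dev); build the encoder; run forward()".
// Without the thread_local factory fix, the forward pass on all but one
// device fails with CUDA_ERROR_INVALID_HANDLE.
void run_encoder_on_device(int dev, std::vector<int>& done) {
    // cudaSetDevice(dev);   // bind this worker thread to its own GPU
    // ... allocate buffers and run the fp16 encoder forward pass ...
    done[dev] = 1;           // mark this device's run as finished
}

// Launch one worker thread per device, mirroring the modified sample.
std::vector<int> run_all(int num_devices) {
    std::vector<int> done(num_devices, 0);
    std::vector<std::thread> workers;
    for (int dev = 0; dev < num_devices; ++dev)
        workers.emplace_back(run_encoder_on_device, dev, std::ref(done));
    for (auto& t : workers) t.join();
    return done;
}
```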
You'll see the following error message repeatedly:
CUDA Error: CUDA_ERROR_INVALID_HANDLE fastertransformer/trt_fused_multihead_attention/fused_multihead_attention_v2.h 564
This particular code path is only triggered when fp16 or int8 is in use and use_trt_kernels=true.
This PR had a bug when running multi-GPU BERT with multiple threads, so we fixed the issue directly in the latest release. Thank you for the feedback and the PR.