make TFusedMHAKernelFactory thread_local
Issue
When running FasterTransformer with multiple threads, where each thread drives a different GPU, and use_trt_kernels=true, kernel launches throw CUDA_ERROR_INVALID_HANDLE. Exactly one of the GPUs always runs fine; the others all fail.
Cause
The custom kernels (and hence the CUfunction handles) are loaded dynamically and are tied to the driver context that loaded them. When a different thread's context becomes current, those handles are no longer valid, so launching through them fails.
Solution
Make the factory object thread_local, so that each thread keeps its own copy of the loaded custom kernels, valid in that thread's context.
Hi Bangsheng, could you please share more information, e.g. which GPU architecture? A small Gist with a reproducer would be very helpful. Thanks!
@yjk21 thanks for your reply.
I'm using a system with multiple A100 cards.
In order to reproduce, I made some changes to the C++ sample, essentially running two threads on two separate devices; please refer to: https://github.com/bangshengtang/FasterTransformer/blob/repro/sample/cpp/encoder_sample.cc (diff)
Run it with ./encoder_sample 1 12 128 12 64 1 0 0
(essentially the same arguments as in the original example, except using fp16 instead of fp32)
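The structure of the modified sample is roughly the following. This is a schematic sketch only: `run_encoder_on_device` and `run_all` are hypothetical stand-ins for the per-thread body in encoder_sample.cc (device binding plus the fp16 encoder forward pass), with the CUDA calls left as comments:

```cpp
#include <functional>
#include <thread>
#include <vector>

// Stand-in for "cudaSetDevice(dev); build the encoder; run forward()".
// Without the thread_local factory fix, the forward pass on all but one
// device fails with CUDA_ERROR_INVALID_HANDLE.
void run_encoder_on_device(int dev, std::vector<int>& done) {
    // cudaSetDevice(dev);   // bind this worker thread to its own GPU
    // ... allocate buffers and run the fp16 encoder forward pass ...
    done[dev] = 1;           // mark this device's run as finished
}

// Launch one worker thread per device, mirroring the modified sample.
std::vector<int> run_all(int num_devices) {
    std::vector<int> done(num_devices, 0);
    std::vector<std::thread> workers;
    for (int dev = 0; dev < num_devices; ++dev)
        workers.emplace_back(run_encoder_on_device, dev, std::ref(done));
    for (auto& t : workers) t.join();
    return done;
}
```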
You'll see the following error message repeatedly:
CUDA Error: CUDA_ERROR_INVALID_HANDLE fastertransformer/trt_fused_multihead_attention/fused_multihead_attention_v2.h 564
This particular code path is only triggered when fp16 or int8 is in use and use_trt_kernels=true.
This PR had a bug when running multi-GPU BERT with multiple threads, so we fixed the issue directly in the latest release. Thank you for the feedback and the PR.