generative-recommenders
Triton is running too slow?
Compared to the same structure (the QKV attention) that I implemented with TensorFlow, Triton runs 10 to 20 times slower. Using Nsight Systems, I found that cudaMemcpySync takes up much of the time while Triton is executing. Would you happen to have any ideas about that?
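One thing that can inflate a comparison like this is what the timing includes: the first call of a JIT-compiled kernel may pay for compilation, and GPU work is asynchronous unless the host waits for it. Below is a minimal sketch of a timing harness that separates warm-up from measurement. The helper name `benchmark` and its parameters are hypothetical, not part of Triton or this repository, and the workload shown is a CPU stand-in. With PyTorch, one would pass `torch.cuda.synchronize` as `sync`.

```python
import time
from statistics import median


def benchmark(fn, sync=lambda: None, warmup=3, iters=10):
    """Return the median wall-clock seconds per call of fn.

    Hypothetical helper for illustration only.
    fn   -- the callable to time (e.g. one forward pass)
    sync -- called to wait for pending device work; a no-op by default,
            e.g. torch.cuda.synchronize when timing GPU kernels
    """
    # Warm-up calls are not timed, so one-off costs such as JIT
    # compilation or autotuning stay out of the measurement.
    for _ in range(warmup):
        fn()
    sync()

    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()  # wait until the work launched by fn has finished
        times.append(time.perf_counter() - start)
    return median(times)


# Stand-in workload so the sketch runs without a GPU.
seconds = benchmark(lambda: sum(range(10_000)), warmup=2, iters=5)
print(f"median time per call: {seconds:.6f} s")
```

If the steady-state number from a harness like this is still 10 to 20 times slower, the gap is unlikely to come from first-call overhead alone.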
I feed data like this: batch = 8, seq_len = 8192, where all sequences have the same length, and emb_size = attn_size = linear_size. As I changed the data size by a multiplier of 2
This is on an NVIDIA A30.
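For a sense of scale at these shapes, here is a rough back-of-the-envelope calculation of the attention score matrix size. It assumes a naive implementation that materializes the full seq_len × seq_len scores, a single head, and 4-byte (fp32) elements. None of these are stated above, and the function name `score_matrix_bytes` is hypothetical.

```python
def score_matrix_bytes(batch, seq_len, heads=1, bytes_per_elem=4):
    """Bytes needed to hold the full Q @ K^T score matrix.

    Assumptions (not from the original post): the scores are fully
    materialized, heads defaults to 1, and elements are fp32 (4 bytes).
    """
    return batch * heads * seq_len * seq_len * bytes_per_elem


base = score_matrix_bytes(batch=8, seq_len=8192)
print(base / 2**30, "GiB")  # 2.0 GiB

# Doubling seq_len quadruples the size; doubling batch only doubles it.
print(score_matrix_bytes(batch=8, seq_len=16384) / base)  # 4.0
print(score_matrix_bytes(batch=16, seq_len=8192) / base)  # 2.0
```

Under these assumptions, scaling seq_len by 2 grows this buffer much faster than scaling batch by 2, which is worth keeping in mind when comparing runs at different data sizes.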