bzxc

Results: 1 issue from bzxc

Compared with the same structure (QKV attention) that I implemented in TensorFlow, Triton runs 10 to 20 times slower. Using Nsight Systems, I found that cudaMemcpySync takes off...