Use bias epilogue in GPU affine operation if CUDA >= 10.1
https://github.com/marian-nmt/marian-dev/blob/master/src/graph/node_operators_binary.h#L256-L265 is rather inefficient in its bias application: it uses a second GEMM against a vector of 1s to broadcast and add the bias term.
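For reference, this is roughly the pattern being described (a minimal sketch with hypothetical names and column-major FP32 shapes, not marian's actual code): the bias is broadcast across columns by a rank-1 GEMM against a vector of ones.

```cpp
#include <cublas_v2.h>

// Hypothetical sketch of the bias-via-second-GEMM pattern.
// C is m x n (column-major), bias has length m, ones has n elements all 1.0f.
// The rank-1 update C += bias * ones^T broadcasts the bias to every column.
void addBiasViaGemm(cublasHandle_t handle,
                    float* C, const float* bias, const float* ones,
                    int m, int n) {
  const float alpha = 1.0f, beta = 1.0f;
  // Treat bias as an m x 1 matrix and ones as a 1 x n matrix:
  // C = alpha * bias * ones + beta * C
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              m, n, 1,
              &alpha, bias, m,
              ones, 1,
              &beta, C, m);
}
```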
In CUDA 10.1 there is explicit bias support via the cublasLt epilogue https://docs.nvidia.com/cuda/cublas/index.html#cublasLtEpilogue_t, specifically CUBLASLT_EPILOGUE_BIAS = 4:
"Apply (broadcasted) bias from the bias vector. Bias vector length must match matrix D rows, and it must be packed (i.e., stride between vector elements is 1). Bias vector is broadcasted to all columns and added before applying the final postprocessing."
I'm not sure if it's there in prior versions.
Hat tip to @XapaJIaMnu.
Oh, good to know, will take a look.
Do we have something similar for MKL?
Regarding MKL, paging @sidkashyap-at-Intel
Regarding the CPU, I recall the current implementation was efficient. I tried several different options for adding the bias, and this one was fastest for the student models on a single core at the time of WNGT 19.
I had similar experiences on the GPU, though I did not try anything like what is proposed here. Unsurprisingly, this approach is particularly efficient in the backward step.
@emjotde yes, that makes sense. fbgemm also has a bias epilogue, but it didn't help.
This is a new feature in CUDA 10.1, which was released long after this code was written. I think it's worth investigating.
It may help more on a GPU than on a CPU, as GPUs have more cores and are therefore more bottlenecked by memory bandwidth.
@XapaJIaMnu absolutely.
I have added a code path for CUDA >= 11 that will use the cublasLt bias fused op when possible.
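Such a guard could look roughly like this (a hypothetical dispatch reusing the two sketches above; CUDA_VERSION comes from cuda.h):

```cpp
#include <cuda.h> // defines CUDA_VERSION, e.g. 11000 for CUDA 11.0

// Hypothetical dispatch: prefer the fused cublasLt bias epilogue when built
// against CUDA >= 11, otherwise fall back to plain GEMM plus the ones-vector
// bias GEMM. Assumes the two sketch functions above are in scope.
void affine(cublasLtHandle_t ltHandle, cublasHandle_t handle,
            const float* A, const float* B, const float* bias, const float* ones,
            float* D, int m, int n, int k,
            void* workspace, size_t workspaceSize, cudaStream_t stream) {
#if CUDA_VERSION >= 11000
  affineWithBiasEpilogue(ltHandle, A, B, bias, D, m, n, k,
                         workspace, workspaceSize, stream);
#else
  const float alpha = 1.0f, beta = 0.0f;
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
              &alpha, A, m, B, k, &beta, D, m);
  addBiasViaGemm(handle, D, bias, ones, m, n); // from the first sketch
#endif
}
```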
FYI - CUDA versions before 11 only support this op for int8.
Will make a new PR with this soon™