[Sparse Attention][Performance] Accelerate the performance of sparse attention + Benchmark
There is an ongoing effort to support sparse attention in GluonNLP: https://github.com/dmlc/gluon-nlp/pull/1395. To accelerate the related kernels, we can compare the performance of several potential solutions, including:
- Use a block-sparse kernel to implement the operator. We may try out these implementations:
  - OpenAI blocksparse: https://github.com/openai/blocksparse
  - Hugging Face pytorch_block_sparse: https://github.com/huggingface/pytorch_block_sparse
  - TVM Block Sparse: https://github.com/ceruleangu/Block-Sparse-Benchmark
- Directly implement window (sliding-window) attention (a NumPy reference of this pattern is sketched after this list)
- Use CUTLASS and implement our own version
- Use TVM + Ansor: https://tvm.apache.org/docs/tutorials/auto_scheduler/tune_conv2d_layer_cuda.html#sphx-glr-tutorials-auto-scheduler-tune-conv2d-layer-cuda-py (a tuning sketch adapted from this tutorial is included at the end of this issue)
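
For reference when benchmarking, here is a minimal NumPy sketch of the computation the block-sparse and window-attention options above would accelerate: each query block attends only to key blocks within a fixed-size band. The function name, shapes, and block size are illustrative assumptions and not code from the PR; a dense masked implementation can serve as the correctness check.

```python
# Minimal NumPy sketch of blocked sliding-window attention
# (illustrative shapes and block size; not GluonNLP code).
import numpy as np

def block_window_attention(q, k, v, block=64, num_block_window=1):
    """q, k, v: (seq_len, head_dim). Each query block attends to the key
    blocks within +/- num_block_window blocks of itself (a block-level band).
    """
    seq_len, d = q.shape
    assert seq_len % block == 0
    nb = seq_len // block
    out = np.zeros_like(q)
    for i in range(nb):
        lo = max(0, i - num_block_window) * block
        hi = min(nb, i + num_block_window + 1) * block
        qi = q[i * block:(i + 1) * block]              # (block, d)
        scores = qi @ k[lo:hi].T / np.sqrt(d)          # (block, window)
        scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        out[i * block:(i + 1) * block] = probs @ v[lo:hi]
    return out

# Example: 1024 tokens, head dim 64, window of one block on each side.
q, k, v = (np.random.rand(1024, 64).astype("float32") for _ in range(3))
print(block_window_attention(q, k, v).shape)  # (1024, 64)
```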
@ZiyueHuang I created the issue here to discuss how we may use TVM to accelerate these kernels.
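
As a concrete starting point for the TVM + Ansor route, a tuning sketch could look like the following. The batched matmul stands in for the Q @ K^T score computation; the shapes, trial count, and log-file name are illustrative assumptions, and the `auto_scheduler` calls should be checked against the TVM version we end up targeting.

```python
# Sketch: tune a batched matmul (stand-in for Q @ K^T attention scores) with
# Ansor (tvm.auto_scheduler), adapted from the linked tutorial.
import tvm
from tvm import te, auto_scheduler

@auto_scheduler.register_workload
def attn_scores(batch, seq_q, seq_k, head_dim, dtype):
    Q = te.placeholder((batch, seq_q, head_dim), name="Q", dtype=dtype)
    K = te.placeholder((batch, seq_k, head_dim), name="K", dtype=dtype)
    r = te.reduce_axis((0, head_dim), name="r")
    S = te.compute(
        (batch, seq_q, seq_k),
        lambda b, i, j: te.sum(Q[b, i, r] * K[b, j, r], axis=r),
        name="S",
    )
    return [Q, K, S]

target = tvm.target.Target("cuda")
task = auto_scheduler.SearchTask(
    func=attn_scores, args=(8, 512, 512, 64, "float32"), target=target
)

log_file = "attn_scores_tuning.json"
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=200,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)
task.tune(tune_option)

# Apply the best schedule found during the search and build the kernel.
sch, args = task.apply_best(log_file)
func = tvm.build(sch, args, target)
```

The same workflow should extend to the sparse pattern itself (e.g., registering the blocked computation above as a workload) once we fix the layout.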