
[Sparse Attention][Performance] Accelerate the performance of sparse attention + Benchmark

Open sxjscience opened this issue 5 years ago • 1 comment

We have an ongoing effort to support sparse attention in GluonNLP: https://github.com/dmlc/gluon-nlp/pull/1395. To accelerate the related kernels, we can compare the performance of these potential solutions:

  • Use a block-sparse kernel to implement the operator. We may try out these implementations (a minimal usage sketch follows this list):
    • https://github.com/openai/blocksparse
    • https://github.com/huggingface/pytorch_block_sparse
    • TVM Block Sparse: https://github.com/ceruleangu/Block-Sparse-Benchmark
  • Directly implement window attention (a reference baseline is sketched after this list)
    • Use CUTLASS and implement our own version
    • Use TVM + Ansor: https://tvm.apache.org/docs/tutorials/auto_scheduler/tune_conv2d_layer_cuda.html#sphx-glr-tutorials-auto-scheduler-tune-conv2d-layer-cuda-py
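
As a first data point for the block-sparse route, here is a minimal sketch that swaps a dense projection for a block-sparse one, assuming the `BlockSparseLinear` API from the pytorch_block_sparse README; the `density=0.1` value and the shapes are illustrative, and the library uses fixed 32x32 blocks and requires CUDA:

```python
# Minimal sketch: compare a dense projection against a block-sparse one.
# Assumptions: pytorch_block_sparse's BlockSparseLinear API (per its README),
# an illustrative density of 0.1, and a CUDA device (the kernels are CUDA-only).
import torch
from pytorch_block_sparse import BlockSparseLinear

dense = torch.nn.Linear(1024, 1024).cuda()
# Keeps only ~10% of the 32x32 weight blocks.
sparse = BlockSparseLinear(1024, 1024, density=0.1).cuda()

x = torch.randn(64, 1024, device="cuda")
with torch.no_grad():
    y_dense = dense(x)
    y_sparse = sparse(x)
print(y_dense.shape, y_sparse.shape)  # both torch.Size([64, 1024])
```

Timing this against `nn.Linear` at several densities would give a quick upper bound on what the block-sparse kernels can buy before we wire them into the attention operator itself.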
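For the window-attention route, a masked dense implementation is the natural correctness and performance baseline to benchmark any custom kernel against. The sketch below is a minimal PyTorch version; the shapes and window size are illustrative assumptions, and note that it still materializes the full O(n²) score matrix, which is exactly the cost a dedicated windowed kernel would avoid:

```python
# Reference baseline: dense attention restricted to a +/- w token band.
# Shapes and window size are illustrative, not values from the GluonNLP PR.
import torch
import torch.nn.functional as F

def window_attention(q, k, v, w):
    """q, k, v: (batch, heads, seq_len, head_dim); returns same shape as q."""
    seq_len = q.shape[2]
    scores = torch.matmul(q, k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    # Band mask: position i may only attend to positions j with |i - j| <= w.
    idx = torch.arange(seq_len, device=q.device)
    band = (idx[None, :] - idx[:, None]).abs() <= w
    scores = scores.masked_fill(~band, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

if __name__ == "__main__":
    b, h, n, d, w = 2, 8, 512, 64, 32
    q, k, v = (torch.randn(b, h, n, d) for _ in range(3))
    out = window_attention(q, k, v, w)
    print(out.shape)  # torch.Size([2, 8, 512, 64])
```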

sxjscience, Oct 21 '20 19:10

@ZiyueHuang Created the issue here to discuss how we may use TVM to accelerate these kernels (see the sketch below).
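
For reference, a hedged sketch of the TVM auto-scheduler (Ansor) flow from the tutorial linked above, applied to the batched matmul at the heart of the attention score computation; the workload, shapes, log file name, and trial count are illustrative assumptions, and the API follows TVM's `auto_scheduler` module:

```python
# Sketch: tune a QK^T-style batched matmul with TVM's auto-scheduler (Ansor).
# Shapes (batch=16, n=m=512, k=64) and num_measure_trials are illustrative.
import tvm
from tvm import te, auto_scheduler

@auto_scheduler.register_workload
def batch_matmul(batch, n, m, k):
    A = te.placeholder((batch, n, k), name="A")   # queries
    B = te.placeholder((batch, m, k), name="B")   # keys
    r = te.reduce_axis((0, k), name="r")
    # C[b, i, j] = sum_r A[b, i, r] * B[b, j, r], i.e. A @ B^T per batch.
    C = te.compute(
        (batch, n, m),
        lambda b, i, j: te.sum(A[b, i, r] * B[b, j, r], axis=r),
        name="C",
    )
    return [A, B, C]

target = tvm.target.Target("cuda")
task = auto_scheduler.SearchTask(
    func=batch_matmul, args=(16, 512, 512, 64), target=target
)
log_file = "batch_matmul.json"
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=200,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)
task.tune(tune_option)
sch, args = task.apply_best(log_file)
mod = tvm.build(sch, args, target)  # tuned kernel, ready to benchmark
```

The same flow should extend to a windowed variant by shrinking the reduction or output extents to the band, which is where the real comparison against the block-sparse libraries would happen.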

sxjscience, Oct 22 '20 00:10