[Sparse Attention][Performance] Accelerate the performance of sparse attention + Benchmark
There is an ongoing effort to support sparse attention in GluonNLP: https://github.com/dmlc/gluon-nlp/pull/1395. To accelerate the related kernels, we can compare the performance of several potential solutions, including:
- Use a block-sparse kernel to implement the operator. We may try out these implementations:
  - OpenAI blocksparse: https://github.com/openai/blocksparse
  - Hugging Face pytorch_block_sparse: https://github.com/huggingface/pytorch_block_sparse
  - TVM Block Sparse: https://github.com/ceruleangu/Block-Sparse-Benchmark
- Directly implement window (sliding-window) attention (a NumPy reference of this pattern is sketched after this list)
- Use CUTLASS and implement our own version
- Use TVM + Ansor: https://tvm.apache.org/docs/tutorials/auto_scheduler/tune_conv2d_layer_cuda.html#sphx-glr-tutorials-auto-scheduler-tune-conv2d-layer-cuda-py (a tuning sketch adapted from this tutorial is included at the end of this issue)
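
For reference when benchmarking, here is a minimal NumPy sketch of the computation the block-sparse and window-attention options above would accelerate: each query block attends only to key blocks within a fixed-size band. The function name, shapes, and block size are illustrative assumptions and not code from the PR; a dense masked implementation can serve as the correctness check.

```python
# Minimal NumPy sketch of blocked sliding-window attention
# (illustrative shapes and block size; not GluonNLP code).
import numpy as np

def block_window_attention(q, k, v, block=64, num_block_window=1):
    """q, k, v: (seq_len, head_dim). Each query block attends to the key
    blocks within +/- num_block_window blocks of itself (a block-level band).
    """
    seq_len, d = q.shape
    assert seq_len % block == 0
    nb = seq_len // block
    out = np.zeros_like(q)
    for i in range(nb):
        lo = max(0, i - num_block_window) * block
        hi = min(nb, i + num_block_window + 1) * block
        qi = q[i * block:(i + 1) * block]              # (block, d)
        scores = qi @ k[lo:hi].T / np.sqrt(d)          # (block, window)
        scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        out[i * block:(i + 1) * block] = probs @ v[lo:hi]
    return out

# Example: 1024 tokens, head dim 64, window of one block on each side.
q, k, v = (np.random.rand(1024, 64).astype("float32") for _ in range(3))
print(block_window_attention(q, k, v).shape)  # (1024, 64)
```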
@ZiyueHuang I created the issue here to discuss how we may use TVM to accelerate these kernels.
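
As a concrete starting point for the TVM + Ansor route, a tuning sketch could look like the following. The batched matmul stands in for the Q @ K^T score computation; the shapes, trial count, and log-file name are illustrative assumptions, and the `auto_scheduler` calls should be checked against the TVM version we end up targeting.

```python
# Sketch: tune a batched matmul (stand-in for Q @ K^T attention scores) with
# Ansor (tvm.auto_scheduler), adapted from the linked tutorial.
import tvm
from tvm import te, auto_scheduler

@auto_scheduler.register_workload
def attn_scores(batch, seq_q, seq_k, head_dim, dtype):
    Q = te.placeholder((batch, seq_q, head_dim), name="Q", dtype=dtype)
    K = te.placeholder((batch, seq_k, head_dim), name="K", dtype=dtype)
    r = te.reduce_axis((0, head_dim), name="r")
    S = te.compute(
        (batch, seq_q, seq_k),
        lambda b, i, j: te.sum(Q[b, i, r] * K[b, j, r], axis=r),
        name="S",
    )
    return [Q, K, S]

target = tvm.target.Target("cuda")
task = auto_scheduler.SearchTask(
    func=attn_scores, args=(8, 512, 512, 64, "float32"), target=target
)

log_file = "attn_scores_tuning.json"
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=200,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)
task.tune(tune_option)

# Apply the best schedule found during the search and build the kernel.
sch, args = task.apply_best(log_file)
func = tvm.build(sch, args, target)
```

The same workflow should extend to the sparse pattern itself (e.g., registering the blocked computation above as a workload) once we fix the layout.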