[Roadmap] Release Plan for tilelang 0.2.0

Release Plan for v0.2.0

[x] Explicit warp specialization
[ ] Tile scheduler
[x] Support transposeB=False for ROCm
- [x] Correctness Evaluation
- [x] Layout Swizzling

[x] Implement Flash MLA kernel
- [x] initial version
- [x] optimize to SoTA
- [x] MI300
[x] Implement NSA kernel
- [x] initial version
- [x] decoding
- [x] varlen
- [x] fuse topk
- [x] bwd
- [x] MI300
[x] Implement Flash SeerAttention
- [x] initial version
- [x] different q/kv sequence lengths
- [x] varlen
- [x] bwd
[x] Optimize TileLang Flash Attention kernel to SoTA
- [x] H100
- [x] MI300
[ ] Complete support for commonly used attributes in Flash Attention
- [x] varlen
- [ ] mask/bias
- [ ] list all supported dims (benchmark)
- [ ] fa3 dim 256 fwd + bwd
- [ ] fa3 bwd (64, 128)

[x] Migrate CI to H100 and get it passing
- [ ] fix fp16xfp4 dequant: testing/python/kernel/test_tilelang_kernel_dequantize_gemm.py:test_simple_impl_float16xfp4_gemm
- [ ] fix TMA load for float32: testing/python/kernel/test_tilelang_kernel_gemm.py:test_gemm_f32f32f32_nn
[x] Add support for WebGPU
[ ] Add support for Metal
[ ] Add support for Hexagon

[x] Nightly Build
[ ] Update API: Replace all uses of tilelang.lower with tilelang.compile in examples and tests.
[ ] Reduce LLVM dependencies
[ ] Provide prebuilt and PyPI packages for ROCm platforms
[ ] Integrate TileLang with Torch Inductor
[ ] Configure API access level to enable advanced features
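
For the API update item above, a first step could be to list the remaining `tilelang.lower` call sites. The following is a minimal sketch, not part of the project tooling: the helper name `find_lower_calls`, the regular expression, and the `examples` directory name are assumptions (`testing` is taken from the test paths listed above). It only reports call sites and does not rewrite them, since the two entry points may not be drop-in replacements for each other.

```python
import re
from pathlib import Path

# Assumed call-site shape: `tilelang.lower(` with optional whitespace.
# Aliased imports such as `from tilelang import lower` are NOT detected.
LOWER_CALL = re.compile(r"\btilelang\.lower\s*\(")


def find_lower_calls(root, subdirs=("examples", "testing")):
    """Return (relative_path, line_number, stripped_line) for each match.

    `subdirs` are hypothetical defaults; adjust them to the repository layout.
    """
    root = Path(root)
    hits = []
    for sub in subdirs:
        base = root / sub
        if not base.is_dir():
            continue  # skip directories that do not exist in this checkout
        for path in sorted(base.rglob("*.py")):
            text = path.read_text(encoding="utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if LOWER_CALL.search(line):
                    rel = path.relative_to(root).as_posix()
                    hits.append((rel, lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Run from the repository root to print one line per remaining call site.
    for rel, lineno, line in find_lower_calls("."):
        print(f"{rel}:{lineno}: {line}")
```

Each reported line can then be migrated by hand and the corresponding test rerun.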