Initial support for Blackwell
- 10.0: Blackwell B100/B200
- 12.0: Blackwell RTX 50
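A minimal sketch of how these two targets can be requested at build time through PyTorch's standard `TORCH_CUDA_ARCH_LIST` environment variable (whether this PR wires up the arch list exactly this way is an assumption):

```python
import os

# Assumed sketch: extend the arch list so extension builds also emit
# SASS for Blackwell (10.0 = B100/B200, 12.0 = RTX 50), plus PTX for
# forward compatibility. Requires nvcc from CUDA 12.8+, and must be
# set before the extension build is triggered.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0 9.0 10.0 12.0+PTX"
```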
Hi @johnnynunez , thanks for bringing this up! Could we hold this PR and wait for the official release of torch 2.6 and blackwell software stack?
Yeah, for sure! I added codegen for the whole Blackwell family to PyTorch. You also have references here: https://github.com/NVIDIA/cccl/issues/3493
FYI: https://github.com/pytorch/pytorch/pull/145436
FYI: https://docs.nvidia.com/cuda/pdf/ptx_isa_8.7.pdf
This is huge!
@yzh119 can you merge?
@johnnynunez A reminder of https://github.com/flashinfer-ai/flashinfer/pull/747#issuecomment-2610198665
Well, sure... PyTorch 2.6 is coming this week: M6: Release Day (1/29/25).
Is there a prebuilt wheel that works on B200?
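For reference, a quick way to confirm what such a wheel would need to cover (assuming a CUDA 12.8+ build of PyTorch; a B200 should report compute capability 10.0):

```python
import torch

# Report the device and its compute capability; prebuilt FlashInfer
# kernels can only run here if sm_100 (or PTX) was in their arch list.
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0))
print(f"compute capability: {major}.{minor}")  # expect 10.0 on B200
```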
What performance improvement should we expect out of the box on B200 compared to H100 SXM5 for different model sizes (8B, 70B, 400B)? I expected some benefit even for 8B (e.g., 30% at low batch sizes), but I am seeing no benefit with Llama 8B (micro-benchmark sketch below).
Also, is there any planned or in-progress work on flashinfer utilizing B200-specific capabilities (e.g., the Tensor Memory Accelerator)?
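For the 8B numbers above, this is roughly the single-request decode micro-benchmark I am running (the Llama-8B-like head shapes are my assumption, and `flashinfer.single_decode_with_kv_cache` is the public entry point as I understand it, so treat this as a sketch):

```python
import torch
import flashinfer

# Llama-8B-like decode shapes (assumed): 32 query heads, 8 KV heads (GQA),
# head_dim 128, one request with a 4K-token KV cache.
num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 4096
q = torch.randn(num_qo_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(kv_len, num_kv_heads, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn_like(k)

# Warm up, then time with CUDA events.
for _ in range(10):
    flashinfer.single_decode_with_kv_cache(q, k, v)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    flashinfer.single_decode_with_kv_cache(q, k, v)
end.record()
torch.cuda.synchronize()
print(f"avg decode latency: {start.elapsed_time(end) / 100:.3f} ms")
```

Since single-request decode is memory-bandwidth bound, I would expect the B200/H100 gap to roughly track HBM bandwidth, unless the installed build lacks sm_100 code and is falling back to JIT-compiled PTX.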
