Initial support for Blackwell
- 10.0: Blackwell B100/B200
- 12.0: Blackwell RTX 50
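A minimal sketch of how these two targets can be requested at build time through PyTorch's standard `TORCH_CUDA_ARCH_LIST` environment variable (whether this PR wires up the arch list exactly this way is an assumption):

```python
import os

# Assumed sketch: extend the arch list so extension builds also emit
# SASS for Blackwell (10.0 = B100/B200, 12.0 = RTX 50), plus PTX for
# forward compatibility. Requires nvcc from CUDA 12.8+, and must be
# set before the extension build is triggered.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0 9.0 10.0 12.0+PTX"
```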
Hi @johnnynunez , thanks for bringing this up! Could we hold this PR and wait for the official release of torch 2.6 and blackwell software stack?
Yeah, for sure! I added codegen for the whole Blackwell family to PyTorch. You also have references here: https://github.com/NVIDIA/cccl/issues/3493
FYI: https://github.com/pytorch/pytorch/pull/145436
FYI: https://docs.nvidia.com/cuda/pdf/ptx_isa_8.7.pdf
This is huge!
@yzh119 can you merge?
@johnnynunez A reminder of https://github.com/flashinfer-ai/flashinfer/pull/747#issuecomment-2610198665
Well, sure... PyTorch 2.6 is coming this week: M6: Release Day (1/29/25).
Is there a prebuilt wheel that works on B200?
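For reference, a quick way to confirm what such a wheel would need to cover (assuming a CUDA 12.8+ build of PyTorch; a B200 should report compute capability 10.0):

```python
import torch

# Report the device and its compute capability; prebuilt FlashInfer
# kernels can only run here if sm_100 (or PTX) was in their arch list.
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0))
print(f"compute capability: {major}.{minor}")  # expect 10.0 on B200
```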
What performance improvement should we expect out of the box on B200 compared to H100 SXM5 for different model sizes (8B, 70B, 400B)? I expected some benefit even for 8B (e.g., 30% at low batch sizes), but I am seeing no benefit with Llama 8B (micro-benchmark sketch below).
Also, is there any planned or in-progress work on flashinfer utilizing B200-specific capabilities (e.g., the Tensor Memory Accelerator)?
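For the 8B numbers above, this is roughly the single-request decode micro-benchmark I am running (the Llama-8B-like head shapes are my assumption, and `flashinfer.single_decode_with_kv_cache` is the public entry point as I understand it, so treat this as a sketch):

```python
import torch
import flashinfer

# Llama-8B-like decode shapes (assumed): 32 query heads, 8 KV heads (GQA),
# head_dim 128, one request with a 4K-token KV cache.
num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 4096
q = torch.randn(num_qo_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(kv_len, num_kv_heads, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn_like(k)

# Warm up, then time with CUDA events.
for _ in range(10):
    flashinfer.single_decode_with_kv_cache(q, k, v)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    flashinfer.single_decode_with_kv_cache(q, k, v)
end.record()
torch.cuda.synchronize()
print(f"avg decode latency: {start.elapsed_time(end) / 100:.3f} ms")
```

Since single-request decode is memory-bandwidth bound, I would expect the B200/H100 gap to roughly track HBM bandwidth, unless the installed build lacks sm_100 code and is falling back to JIT-compiled PTX.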
