Jackmin801


I've checked the other model checkpoints. All of them have bias tensors that are all zero (except 30b, which has no bias tensors). This is the info I have about...
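
For reference, a minimal sketch of the kind of check described, assuming the checkpoints are safetensors shards; the file name is a placeholder, not the actual checkpoint from the thread:

```python
from safetensors.torch import load_file

# Placeholder path; substitute the actual checkpoint shard being inspected.
state_dict = load_file("model-00001-of-00002.safetensors")

for name, tensor in state_dict.items():
    if name.endswith(".bias"):
        # Report whether the bias tensor is identically zero.
        print(name, "all zero:", bool((tensor == 0).all()))
```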

@tridao https://github.com/Dao-AILab/flash-attention/pull/2007

When can we expect the next version to be released?

This should remove the need to have it in the server args and instead allow it as a kwarg passed to `generate`: https://github.com/Jackmin801/sglang/pull/2. Something like this would then work: ```python...
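
For context, a hypothetical sketch of the pattern described, using sglang's offline `Engine` API; the kwarg name `return_hidden_states` is an assumption for illustration, and the actual argument is defined in the linked PR:

```python
import sglang as sgl

# Previously the option had to be set as a server/engine arg at startup.
llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

# With the change, it is passed per request as a kwarg to generate instead.
# `return_hidden_states` is an assumed name for illustration only.
out = llm.generate(
    "The capital of France is",
    sampling_params={"max_new_tokens": 8},
    return_hidden_states=True,
)
```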

@justheuristic @borzunov Does this implementation look roughly correct to you? It doesn't seem to be working and hangs while trying to process outputs in `def process_output(output, output_actions: Dict[Arg, Callable[[torch.Tensor, int],...
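
A self-contained sketch of the dispatch shape the truncated signature suggests, where each argument id maps to a callback applied to the matching output tensor; `Arg` and the loop body here are assumptions, not the actual implementation:

```python
from typing import Callable, Dict, Hashable

import torch

Arg = Hashable  # Assumed stand-in for whatever identifies an output slot.

def process_output(
    output: Dict[Arg, torch.Tensor],
    output_actions: Dict[Arg, Callable[[torch.Tensor, int], None]],
    step: int,
) -> None:
    # Dispatch each produced tensor to its registered callback. If any
    # callback blocks (e.g. waiting on a result that never arrives), the
    # whole loop hangs, matching the symptom described above.
    for arg, action in output_actions.items():
        action(output[arg], step)
```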

Testing that the build and deploy action will work in a fork here: https://github.com/Jackmin801/flash-attention/actions/runs/19301136888 It seems some of the matrix elements don't build. Will look into it further.

It seems to occur every batch. Hrmm, I don't think it's about pos_embeds; otherwise it would happen for flash attn too?

I'm wondering if it's a regression from cuDNN, so I'm building PyTorch with older cuDNN versions to see if anything changes. Changing the cuDNN version to 9.2.0 doesn't seem to help...

Seems to be from the CUDNN_ATTENTION implementation of SDPA. With the `viable/strict` build of PyTorch, you can toggle the bug by forcing the FLASH or CUDNN implementation. ```python from torch.nn.attention import...
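
For completeness, a minimal runnable sketch of toggling the backend via `torch.nn.attention.sdpa_kernel` (available in PyTorch 2.3+); the shapes and dtype are arbitrary, and a CUDA device is assumed:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Force the flash-attention backend (behaves correctly).
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out_flash = F.scaled_dot_product_attention(q, k, v)

# Force the cuDNN backend (the implementation the bug is traced to).
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out_cudnn = F.scaled_dot_product_attention(q, k, v)

# Compare; a large difference here indicates the cuDNN path misbehaving.
print((out_flash - out_cudnn).abs().max().item())
```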