Terry Chen
Thanks for fixing this issue. I tried it for the case (M=16, K=64, N=10), which needs alignment_c(softmax) = 2, but we still have inf/NaN in the output softmax tensor when we...
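For context, here is a minimal sketch of how the alignment value for that case might be derived. The rule assumed here (alignment_c is the widest power-of-two vector width, in elements, that evenly divides the softmax row length N) is my reading of the constraint, not necessarily CUTLASS's exact logic:

```python
def softmax_alignment(n: int, max_align: int = 8) -> int:
    """Largest power-of-two vector width (in elements) that divides n, capped at max_align.

    Assumed model of the alignment_c(softmax) constraint, for illustration only.
    """
    align = max_align
    while align > 1 and n % align != 0:
        align //= 2
    return align

# N = 10 is divisible by 2 but not by 4, so the widest usable vector access is 2.
print(softmax_alignment(10))  # -> 2
```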
> > Thanks for fixing this issue. I tried it for the case(M=16, K=64, N=10), which needs alignment_c(softmax) = 2, then we still have inf/nan in the output softmax tensor...
Thank you! The new algo works well on all of my current problem sizes; no numerical issues now. Do you have any timeline for bmm support? Looking forward to it.
Tested with B=16, M=16, K=64, N=24: the result of the first batch is correct, but from the 2nd batch onward the output contains inf values. I set batch_stride_Max_ and batch_stride_Sum_ to M*N.
Still not working. Before PR https://github.com/NVIDIA/cutlass/pull/546 the stride should be M*N; it would be good if you could provide an example/code snippet for BMM. I did a benchmark for fused bmm+softmax...
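To isolate whether the inf values come from the fused kernel rather than the inputs, a plain NumPy batched softmax can serve as a reference (a generic max-subtracted sketch, not the CUTLASS kernel; the B=16, M=16, N=24 shape matches the case reported above):

```python
import numpy as np

def batched_softmax_ref(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax over the last axis of a (B, M, N) tensor.

    Subtracting the per-row max before exp keeps the computation
    numerically stable, so the output should never contain inf/NaN.
    """
    m = x.max(axis=-1, keepdims=True)          # per-row max, shape (B, M, 1)
    e = np.exp(x - m)
    return e / e.sum(axis=-1, keepdims=True)   # each row sums to 1

x = np.random.randn(16, 16, 24).astype(np.float32)  # B=16, M=16, N=24
y = batched_softmax_ref(x)
assert np.isfinite(y).all()                    # no inf/NaN in any batch
```

Comparing every batch of the fused kernel's output against this reference would show exactly where the 2nd batch starts to diverge.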
That's super interesting, but unfortunately circular padding is not supported in v0.1 release.
> @terrychenism maybe we can prioritize it and make a v0.11/v0.12 release.

Yes, I'll add it to the wishlist.
We didn't notice a memory usage increase when running multiple inferences. Could you please provide a reproducible script?
Just want to double-check: you mean CPU memory, not GPU memory?
1024x1024 is easy to get; you would need to compile the VAE model with 128x128 input: https://github.com/facebookincubator/AITemplate/blob/main/examples/05_stable_diffusion/compile.py#L180-L181 For memory, we don't support xformers yet, but AIT should be very efficient compared...
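The 128x128 figure follows from Stable Diffusion's VAE decoding latents at 1/8 of the image resolution. A quick sanity check of that arithmetic (not AITemplate code):

```python
VAE_SCALE = 8  # Stable Diffusion's VAE works on latents at 1/8 of image resolution

def latent_size(image_size: int) -> int:
    """Latent spatial size needed to decode an image of the given size."""
    assert image_size % VAE_SCALE == 0, "image size must be a multiple of 8"
    return image_size // VAE_SCALE

# Compiling the VAE with 128x128 latent input yields 1024x1024 images.
print(latent_size(1024))  # -> 128
```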