Terry Chen
Thanks for fixing this issue. I tried it for the case (M=16, K=64, N=10), which needs alignment_c(softmax) = 2, but we still have inf/NaN in the output softmax tensor when we...
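For context, here is a minimal sketch of how the alignment value for that case might be derived. The rule assumed here (alignment_c is the widest power-of-two vector width, in elements, that evenly divides the softmax row length N) is my reading of the constraint, not necessarily CUTLASS's exact logic:

```python
def softmax_alignment(n: int, max_align: int = 8) -> int:
    """Largest power-of-two vector width (in elements) that divides n, capped at max_align.

    Assumed model of the alignment_c(softmax) constraint, for illustration only.
    """
    align = max_align
    while align > 1 and n % align != 0:
        align //= 2
    return align

# N = 10 is divisible by 2 but not by 4, so the widest usable vector access is 2.
print(softmax_alignment(10))  # -> 2
```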
> > Thanks for fixing this issue. I tried it for the case(M=16, K=64, N=10), which needs alignment_c(softmax) = 2, then we still have inf/nan in the output softmax tensor...
Thank you! The new algo works well on all of my current problem sizes; no numerical issues now. Do you have any timeline for bmm support? Looking forward to it.
Tested with B=16, M=16, K=64, N=24: the result of the first batch is correct, but from the 2nd batch onward the output contains inf values. I set batch_stride_Max_ and batch_stride_Sum_ to M*N.
Still not working. Before PR https://github.com/NVIDIA/cutlass/pull/546 the stride should be M*N; it would be good if you could provide an example/code snippet for BMM. I did a benchmark for fused bmm+softmax...
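To isolate whether the inf values come from the fused kernel rather than the inputs, a plain NumPy batched softmax can serve as a reference (a generic max-subtracted sketch, not the CUTLASS kernel; the B=16, M=16, N=24 shape matches the case reported above):

```python
import numpy as np

def batched_softmax_ref(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax over the last axis of a (B, M, N) tensor.

    Subtracting the per-row max before exp keeps the computation
    numerically stable, so the output should never contain inf/NaN.
    """
    m = x.max(axis=-1, keepdims=True)          # per-row max, shape (B, M, 1)
    e = np.exp(x - m)
    return e / e.sum(axis=-1, keepdims=True)   # each row sums to 1

x = np.random.randn(16, 16, 24).astype(np.float32)  # B=16, M=16, N=24
y = batched_softmax_ref(x)
assert np.isfinite(y).all()                    # no inf/NaN in any batch
```

Comparing every batch of the fused kernel's output against this reference would show exactly where the 2nd batch starts to diverge.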
That's super interesting, but unfortunately circular padding is not supported in v0.1 release.
> @terrychenism maybe we can prioritize it and make a v0.11/v0.12 release.

Yes, I'll add it to the wishlist.
We didn't notice a memory usage increase when running multiple inferences. Could you please provide a reproducible script?
Just want to double-check: you mean CPU memory, not GPU memory?
1024x1024 is easy to get; you would need to compile the VAE model with 128x128 input: https://github.com/facebookincubator/AITemplate/blob/main/examples/05_stable_diffusion/compile.py#L180-L181 For memory, we don't support xformers yet, but AIT should be very efficient compared...
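The 128x128 figure follows from Stable Diffusion's VAE decoding latents at 1/8 of the image resolution. A quick sanity check of that arithmetic (not AITemplate code):

```python
VAE_SCALE = 8  # Stable Diffusion's VAE works on latents at 1/8 of image resolution

def latent_size(image_size: int) -> int:
    """Latent spatial size needed to decode an image of the given size."""
    assert image_size % VAE_SCALE == 0, "image size must be a multiple of 8"
    return image_size // VAE_SCALE

# Compiling the VAE with 128x128 latent input yields 1024x1024 images.
print(latent_size(1024))  # -> 128
```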