Tri Dao

Results: 432 comments by Tri Dao

The backward pass is not deterministic due to atomic adds. The forward pass should be deterministic.
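For reference, a minimal sketch (not from the thread) of checking this behavior, assuming `model` is a module whose attention uses FlashAttention and `x` is a CUDA input tensor:

```python
import torch

# Sketch only: `model` and `x` are assumed placeholders, not taken from the repo.
def param_grads(model, x):
    """One forward+backward pass; return a snapshot of all parameter gradients."""
    model.zero_grad(set_to_none=True)
    model(x).sum().backward()
    return [p.grad.clone() for p in model.parameters() if p.grad is not None]

with torch.no_grad():
    out1, out2 = model(x), model(x)
print("forward identical:", torch.equal(out1, out2))  # expected: True

g1, g2 = param_grads(model, x), param_grads(model, x)
# May print False: the backward kernels use atomic adds, so the reduction order
# (and hence floating-point rounding) can differ between runs.
print("backward identical:", all(torch.equal(a, b) for a, b in zip(g1, g2)))
```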

This is normal if you're training the model, but not normal if you're only doing inference (forward pass only).

> > The backward pass is not deterministic due to atomic adds. The forward pass should be deterministic. > > Hi, I am wondering if there is a way to...

> Even for the forward pass, I noticed that results are somewhat unstable in my experiments. Given two inputs `x1` and `x2`, the result of `model(torch.stack([x1, x2]))` (i.e. batching) differs...
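A sketch of that batched-vs-unbatched comparison (assuming `model`, `x1`, and `x2` as in the quote, with `x1`/`x2` being single examples without a batch dimension):

```python
import torch

# Sketch only; shapes are assumed, not taken from the original report.
with torch.no_grad():
    batched = model(torch.stack([x1, x2]))
    singles = torch.stack([model(x1.unsqueeze(0)).squeeze(0),
                           model(x2.unsqueeze(0)).squeeze(0)])
# Small differences here usually come from different reduction orders at different
# batch sizes / kernel configurations rather than from run-to-run non-determinism.
print("max abs diff:", (batched - singles).abs().max().item())
```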

> I also found that mamba will bring randomness during forward propagation and greatly affect model convergence. Can you isolate which layer or function first produces different outputs?
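One way to do that isolation (a sketch, not from the thread; it assumes `model` is an `nn.Module` and `x` a fixed input) is to capture per-module outputs with forward hooks and compare two runs on the same input:

```python
import torch

def capture_outputs(model, x):
    """Run the model once and record each submodule's output tensor."""
    outputs, hooks = {}, []
    for name, module in model.named_modules():
        def hook(mod, inp, out, name=name):
            if torch.is_tensor(out):
                outputs[name] = out.detach().clone()
        hooks.append(module.register_forward_hook(hook))
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return outputs

# Hooks fire as each submodule finishes, so dict insertion order approximates
# execution order; the first mismatching entry is the earliest diverging module.
run1, run2 = capture_outputs(model, x), capture_outputs(model, x)
for name in run1:
    if name in run2 and not torch.equal(run1[name], run2[name]):
        print("first mismatch at:", name)
        break
```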

This is very helpful, thanks @Akatsuki030 and @Panchovix. @Akatsuki030 is it possible to fix it by declaring these variables (Headdim, kBlockM) with `constexpr static int` instead of `constexpr int`? I've...

Great, thanks for the confirmation @Panchovix. I'll cut a release now (v2.3.2). Ideally we'd set up prebuilt CUDA wheels for Windows at some point so folks can just download instead...

I see, thanks for the confirmation. I guess we rely on Cutlass, and Cutlass requires CUDA 12.x to build on [Windows](https://github.com/NVIDIA/cutlass/blob/main/media/docs/build/building_in_windows_with_visual_studio.md).

> Another note, it may be a good idea to build wheels for cu121 as well, since github actions currently doesn't build for that version. Right now github actions only...