divchenko

Results 7 comments of divchenko

- 'overlapped' is quite an odd name imho. Something more concrete is better, e.g. 'in autograd optimizer'.
- What happens if I don't specify overlapped optimizers for some parameters...

@IonThruster full code is here. I've played w/ tiles. This is the best config. ``` #include #include #include #include #include #include #include #include #include #include #include #include #include namespace cuscratch...

@IonThruster for fp8 version, you can just look at my old post https://github.com/NVIDIA/cutlass/issues/1139

Thanks @rawnhenry. The memory-bound case for fp8 (where I have 64x16x256 tiles) actually works quite well, reaching close to 60% of memory b/w. It's the mixed precision case w/ tile...

GPU is a scarce resource. We have many hosts w/ fully occupied GPUs and only a few hosts w/ free GPUs. Runner pods can be scheduled on any node. They will...

I also see the same pattern for non-grouped gemm (same --m and --k as above, but I pass a very large --n to ensure that there is enough data to load).

Update: I've transposed the gemm, i.e. I'm using TNN instead of TNT with a 256x64x256 tile on 2 SMs. This way the N dimension is the smallest. This gives me a bit better perf,...