Ivan Komarov

49 comments by Ivan Komarov

> I think this has been fixed only in [February](https://github.com/microsoft/terminal/pull/14745/files) this year

Yep, I just built and ran Windows Terminal from [the commit fixing the issue](https://github.com/microsoft/terminal/commit/599b55081762af1594cd8419320e79b9be533944) and [the previous one](https://github.com/microsoft/terminal/commit/cf87590b31edc6e9d9816425ee8a3ec5f38ee41c)....

Using CUDA graphs would make sense if the duration of our kernels were comparable with the launch overhead (a couple of microseconds). As far as I understand, we intentionally use...

@slaren
> I think it is very unlikely that a general purpose allocator like that will be faster than this

Yeah, I agree a general purpose allocator would have no...

> cuBLAS already uses tensor cores

As far as I can see (I'm running your changes from #1207), we only use cuBLAS for SGEMMs. TF32 tensor cores on Ampere and...

I'm a little confused about what's happening with `V x KQ_soft_max` and `K x Q` matmuls (the ones that are parallel) in my nsys profile. I don't see them happening...

> in this screenshot it is the noisy lines that look to happen at the start of the layer (actually it is the end of the previous layer).

Maybe your...

> For AMD, I can use `rocprof`, I tried to see if something similar is available on Nvidia, and there was just some confusing info

For high-level tracing of what's...

> This is a single tensor mat mul of `(512 x 128 x 32) x (512 x 512 x 32)`, which is split into 32 mat muls, and each of...

> `cudaStreamAttachMemAsync()` is what provides the synchronization between the compute and copy streams

Wait, what? Could you please clarify how exactly `cudaStreamAttachMemAsync()` in `copyStream` prevents `cudaMemcpyAsync()` (`A`) in `copyStream` from...

@slaren Since you and @SlyEcho have the cuBLAS stuff proper covered, I decided to take a look at the quantization kernels in #1221. Turns out they can be optimized quite...