Ivan Komarov comments

Results: 49 comments by Ivan Komarov

[User] Interactive mode immediately exits on Windows with Zig

> I think this has been fixed only in [February](https://github.com/microsoft/terminal/pull/14745/files) this year

Yep, I just built and ran Windows Terminal from [the commit fixing the issue](https://github.com/microsoft/terminal/commit/599b55081762af1594cd8419320e79b9be533944) and [the previous one](https://github.com/microsoft/terminal/commit/cf87590b31edc6e9d9816425ee8a3ec5f38ee41c)....

perf(CuBLAS): explore reduction in launch overhead via CUDA graphs

Using CUDA graphs would make sense if the duration of our kernels were comparable with the launch overhead (a couple of microseconds). As far as I understand, we intentionally use...

Improve cuBLAS performance by using a memory pool

@slaren

> I think it is very unlikely that a general purpose allocator like that will be faster than this

Yeah, I agree a general purpose allocator would have no...

Improve cuBLAS performance by using a memory pool

> cuBLAS already uses tensors cores

As far as I can see (I'm running your changes from #1207), we only use cuBLAS for SGEMMs. TF32 tensor cores on Ampere and...

Improve cuBLAS performance by using a memory pool

I'm a little confused about what's happening with `V x KQ_soft_max` and `K x Q` matmuls (the ones that are parallel) in my nsys profile. I don't see them happening...

Improve cuBLAS performance by using a memory pool

> in this screenshot it is the noisy lines that look to happen at the start of the layer (actually it is the end of the previous layer).

Maybe your...

Performance issues with cuBLAS and a bug

> For AMD, I can use `rocprof`, I tried to see if something similar is available on Nvidia, and there was just some confusing info

For high-level tracing of what's...

Performance issues with cuBLAS and a bug

> This is a single tensor mat mul of `(512 x 128 x 32) x (512 x 512 x 32)`, which is split into 32 mat muls, and each of...

Performance issues with cuBLAS and a bug

> `cudaStreamAttachMemAsync()` is what provides the synchronization between the compute and copy streams

Wait, what? Could you please clarify how exactly `cudaStreamAttachMemAsync()` in `copyStream` prevents `cudaMemcpyAsync()` (`A`) in `copyStream` from...

Performance issues with cuBLAS and a bug

@slaren Since you and @SlyEcho have the cuBLAS stuff proper covered, I decided to take a look at the quantization kernels in #1221. Turns out they can be optimized quite...