Thomas Faingnaert

8 issues by Thomas Faingnaert

This PR contains an initial implementation of (my proposal for) an API to instantiate flexible matrix multiplication kernels. It is divided into two large parts:
- A Tiling API that...

It would be nice to have automated integration tests with several language servers:
- clangd
- vim LS
- ...

enhancement

As requested: we're getting a `CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)` in the benchmark suite of GemmKernels. I placed a reproducer at https://github.com/JuliaGPU/GemmKernels.jl/tree/tf/repro-illegal-memacces. Interestingly, the...

bug

Add alternative pipelining kernel. Compared to the old pipelining kernel, the loads/stores are reordered somewhat, and shared memory is split in two stages. This reduces the number of necessary bar.syncs...

- [ ] GitHub comment is too long (ref. https://github.com/JuliaGPU/GemmKernels.jl/pull/186#issuecomment-1912468519)
- [ ] Benchmarks take a long time, requiring us to bump the timeout to 2h+. We should either reduce...

Given the fact that SMs in Volta, Turing, Ampere, and Hopper have four processing blocks, each with one warp scheduler, I don't think it makes sense to try configurations during...

Something I noticed while working on optimisations for Volta. Most of the time, our explicitly vectorised loads and stores for 8 Float16 elements are emitted as e.g. `ld.shared.v4.b32`, as expected....

Extracted from #179