Thomas Faingnaert issues

Results: 8 issues of Thomas Faingnaert

Add matmul API

This PR contains an initial implementation of (my proposal for) an API to instantiate flexible matrix multiplication kernels. It is divided into two large parts:
- A Tiling API that...

Integration tests with language servers

It would be nice to have automated integration tests with several language servers:
- clangd
- vim LS
- ...

enhancement

Illegal memory access in GemmKernels on Julia 1.9

As requested: we're getting a `CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)` in the benchmark suite of GemmKernels. I placed a reproducer at https://github.com/JuliaGPU/GemmKernels.jl/tree/tf/repro-illegal-memacces. Interestingly, the...

bug

Add new pipelined kernel

Add alternative pipelining kernel. Compared to the old pipelining kernel, the loads/stores are reordered somewhat, and shared memory is split in two stages. This reduces the number of necessary bar.syncs...

Resolve remaining issues with benchmarking

- [ ] GitHub comment is too long (ref. https://github.com/JuliaGPU/GemmKernels.jl/pull/186#issuecomment-1912468519)
- [ ] Benchmarks take a long time, requiring us to bump the timeout to 2h+. We should either reduce...

Skip configurations with fewer than 4 warps in tuning

Given the fact that SMs in Volta, Turing, Ampere, and Hopper have four processing blocks, each with one warp scheduler, I don't think it makes sense to try configurations during...
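The filtering idea in this issue can be sketched roughly as follows. This is only an illustration in Python, not GemmKernels code: `Config`, `warps_m`, `warps_n`, and `worth_tuning` are hypothetical names, and the threshold of 4 simply mirrors the four processing blocks per SM mentioned above.

```python
from dataclasses import dataclass

# One warp per processing block (warp scheduler), as argued in the issue.
MIN_WARPS_PER_BLOCK = 4


@dataclass(frozen=True)
class Config:
    """Hypothetical tuning configuration: warps along the M and N dimensions."""
    warps_m: int
    warps_n: int

    @property
    def num_warps(self) -> int:
        return self.warps_m * self.warps_n


def worth_tuning(cfg: Config) -> bool:
    """Keep only configurations with at least MIN_WARPS_PER_BLOCK warps."""
    return cfg.num_warps >= MIN_WARPS_PER_BLOCK


# Enumerate a small, made-up search space and drop the configurations
# that have fewer than four warps per thread block.
candidates = [Config(m, n) for m in (1, 2, 4) for n in (1, 2, 4)]
kept = [c for c in candidates if worth_tuning(c)]
```

With this made-up search space, configurations such as 1x1, 1x2, and 2x1 warps are skipped before any benchmarking happens.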

Incomplete vectorisation of FP16 loads and stores

Something I noticed while working on optimisations for Volta. Most of the time, our explicitly vectorised loads and stores for 8 Float16 elements are emitted as e.g. `ld.shared.v4.b32`, as expected....
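As a quick check of why `v4.b32` is the expected form here: 8 Float16 elements are 8 x 16 = 128 bits, which is four 32-bit registers. The helper below is a hypothetical illustration of that arithmetic only, not part of GemmKernels or any PTX tooling.

```python
def ptx_vector_suffix(n_elems: int, elem_bits: int, reg_bits: int = 32) -> str:
    """Return the vector suffix (e.g. 'v4.b32') covering n_elems packed elements.

    Hypothetical helper: it only does the bit arithmetic and assumes the
    total width is a whole number of registers.
    """
    total_bits = n_elems * elem_bits
    if total_bits % reg_bits != 0:
        raise ValueError("total width is not a multiple of the register width")
    width = total_bits // reg_bits
    return f"v{width}.b{reg_bits}"


# 8 Float16 values: 128 bits in total, i.e. four 32-bit registers.
suffix = ptx_vector_suffix(8, 16)
```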

Check size limits of LocalArray

Extracted from #179