Tobias Ribizel
This improves the symbolic Cholesky performance by preprocessing the matrix on the GPU with a Minimum Spanning Tree algorithm. Example rgg_22 from SuiteSparse with METIS nested dissection on H100: *...
This adds a primitive that allows the distribution of variable-sized chunks of work across a warp for better memory coalescing and warp utilization. This can be used as a component...
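One common way to distribute variable-sized chunks evenly is a load-balanced search: take a prefix sum over the chunk sizes, then let each flat work item locate its chunk by binary search. The following is a minimal sequential sketch of that idea, not the primitive added here; the name `distribute_work` and its return type are hypothetical. On a GPU, lane `l` of a warp would handle the flat work items `l`, `l + 32`, and so on, so neighbouring lanes touch neighbouring entries.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper (not part of any library): maps every flat work item
// to the pair (chunk index, offset inside that chunk).
std::vector<std::pair<int, int>> distribute_work(
    const std::vector<int>& chunk_sizes)
{
    // Exclusive prefix sum: offsets[i] is the first flat index of chunk i.
    std::vector<int> offsets(chunk_sizes.size() + 1, 0);
    std::partial_sum(chunk_sizes.begin(), chunk_sizes.end(),
                     offsets.begin() + 1);
    const int total = offsets.back();
    std::vector<std::pair<int, int>> assignment(total);
    for (int work = 0; work < total; ++work) {
        // Find the last chunk whose starting offset is <= work.
        // Empty chunks are skipped automatically.
        const auto it =
            std::upper_bound(offsets.begin(), offsets.end(), work);
        const int chunk = static_cast<int>(it - offsets.begin()) - 1;
        assignment[work] = {chunk, work - offsets[chunk]};
    }
    return assignment;
}
```

For chunk sizes `{3, 0, 2}`, the five work items map to chunks 0, 0, 0, 2, 2, so the empty chunk never occupies a lane.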
Extracted from the symbolic Cholesky, but it might also be useful for other things, e.g. parallel OpenMP COO sorting.
This adds another optional column to the `ProfilerHook::create_(nested_)summary` logger that computes memory bandwidths/FLOPS/custom rates for kernels with work estimates.

Merge stack:
- [x] #1782
- [ ] #802
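To illustrate what such a rate column boils down to, here is a minimal sketch that divides a kernel's work estimate by its measured runtime. The types `work_estimate` and `rates` and the function `compute_rates` are hypothetical names chosen for this example and are not the logger's actual interface.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical work estimate for one kernel launch.
struct work_estimate {
    std::int64_t flops;  // floating-point operations performed
    std::int64_t bytes;  // bytes read from and written to memory
};

// Hypothetical result type: achieved rates in GFLOP/s and GB/s.
struct rates {
    double gflops;
    double gbytes_per_s;
};

// Divides the estimated work by the measured runtime in seconds.
// Returns zero rates for a non-positive runtime to avoid dividing by zero.
inline rates compute_rates(const work_estimate& work, double seconds)
{
    if (seconds <= 0.0) {
        return {0.0, 0.0};
    }
    return {static_cast<double>(work.flops) / seconds / 1e9,
            static_cast<double>(work.bytes) / seconds / 1e9};
}
```

A kernel estimated at 2e9 FLOP and 8e9 bytes that runs for two seconds would be reported at 1 GFLOP/s and 4 GB/s under this definition.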
As a starting point and example for adding work estimates to kernels, this adds the necessary operations to all non-trivial kernels in a simple unpreconditioned Cg solve. Example output for...
We could make the performance of `matrix_assembly_data` much better by building a row-wise flat hash map ourselves. For that we only need an upper bound for the number of columns...
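As a rough illustration of the idea, the sketch below stores one open-addressing hash table per row inside a single flat array. It is not the `matrix_assembly_data` implementation: the class name `row_hash_assembly`, its methods, and the assumption that the upper bound is given per row are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Marker for an unused slot; assumes column indices are non-negative.
constexpr std::int64_t empty_col = -1;

// Hypothetical sketch: one linear-probing hash table per row, all tables
// stored back to back in flat arrays.
class row_hash_assembly {
public:
    // max_cols_per_row[i] is an assumed upper bound on the number of
    // distinct columns in row i. Exceeding it is a precondition violation.
    explicit row_hash_assembly(
        const std::vector<std::size_t>& max_cols_per_row)
        : row_begin_(max_cols_per_row.size() + 1, 0)
    {
        for (std::size_t row = 0; row < max_cols_per_row.size(); ++row) {
            // Power-of-two capacity keeps the load factor at or below 1/2.
            std::size_t capacity = 1;
            while (capacity < 2 * max_cols_per_row[row]) {
                capacity *= 2;
            }
            row_begin_[row + 1] = row_begin_[row] + capacity;
        }
        cols_.assign(row_begin_.back(), empty_col);
        vals_.assign(row_begin_.back(), 0.0);
    }

    // Adds value to the entry (row, col), creating the entry if needed.
    void add_value(std::size_t row, std::int64_t col, double value)
    {
        const auto slot = find_slot(row, col);
        cols_[slot] = col;
        vals_[slot] += value;
    }

    // Returns the accumulated value, or 0.0 if the entry does not exist.
    double get_value(std::size_t row, std::int64_t col) const
    {
        const auto slot = find_slot(row, col);
        return cols_[slot] == col ? vals_[slot] : 0.0;
    }

private:
    // Linear probing: returns the slot holding col or the first empty slot.
    std::size_t find_slot(std::size_t row, std::int64_t col) const
    {
        const std::size_t begin = row_begin_[row];
        const std::size_t mask = row_begin_[row + 1] - begin - 1;
        // Multiplicative hash; the constant is a common mixing choice.
        std::size_t pos = static_cast<std::size_t>(
            (static_cast<std::uint64_t>(col) * 0x9E3779B97F4A7C15ull) >> 32);
        pos &= mask;
        while (cols_[begin + pos] != empty_col && cols_[begin + pos] != col) {
            pos = (pos + 1) & mask;
        }
        return begin + pos;
    }

    std::vector<std::size_t> row_begin_;
    std::vector<std::int64_t> cols_;
    std::vector<double> vals_;
};
```

Because each row's table is sized up front from its bound, no rehashing is needed during assembly, and all entries of a row stay contiguous in memory.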
The former is being phased out; see https://github.com/ROCm/roctracer/issues/56#issuecomment-2385675072 for more details.
- [x] memory atomics
- [x] sorting
- [x] bitvectors
- [ ] searching
- [ ] merging
- [ ] sync-free operations
### Steps to reproduce

When installing multiple versions of LLVM that use the same patch files, I am seeing test failures that look like a race condition between the creation/access...
### Steps to reproduce

Somewhat related to #50696. When building multiple LLVM versions that share the same patch files, for some reason the write locks on the patch files are...