Julian Samaroo
This would reduce the need to use `rocprof` or the Profile stdlib to observe kernel execution ordering and latency-hiding efficiency.
...because it loads `libhsa-runtime64.so`, not `libhsa-runtime64.so.1`.
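One way to avoid depending on the unversioned name is to try the versioned soname first and fall back to the bare name. The following is only a sketch under that assumption: the helper `find_hsa_library` and the candidate list are hypothetical and are not taken from the package; only the `Libdl` calls are standard library functions.

```julia
using Libdl

# Hypothetical candidate list: versioned soname first, then the bare name.
const HSA_CANDIDATES = ["libhsa-runtime64.so.1", "libhsa-runtime64"]

# Hypothetical helper: return the resolved path of the first candidate that
# can be opened, or `nothing` if none can be loaded.
function find_hsa_library(candidates = HSA_CANDIDATES)
    for name in candidates
        # With `throw_error = false`, `dlopen` returns `nothing` on failure.
        handle = Libdl.dlopen(name; throw_error = false)
        handle === nothing && continue
        path = Libdl.dlpath(handle)
        Libdl.dlclose(handle)
        return path
    end
    return nothing
end
```

Calling `find_hsa_library()` on a machine without the runtime installed returns `nothing` rather than throwing, so the caller can decide how to report the missing library.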
The current implementation has multiple flaws:
- Resizing operations on `Array` are not thread-safe
- wait-to-mark exhibits TOCTTOU races

We need a solution for this that doesn't involve taking the...
The current approach of escaping kernel inputs during kernel execution, and having finalizers directly free HSA memory allocations, is problematic when considering the potential benefits of https://github.com/JuliaLang/julia/pull/44056. We could instead...
When the GPU is under high load and spinning on `AMDGPU.device_signal_wait`, `hsa_executable_freeze` can hang as it tries to synchronize with the GPU. We should switch to using an `InterruptSignal` (exposed...
This shouldn't be necessary, but let's have CI confirm that for us.
Currently, a single thread serves hostcalls, but it should be possible to spawn multiple threads that service the same hostcall concurrently (especially once #50 is merged). This would be useful...
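The general pattern of several tasks draining one shared request queue can be sketched as follows. This is an illustration only: a `Channel` stands in for the hostcall mechanism, and `serve_concurrently` is a hypothetical name, not part of the package's API.

```julia
# Hypothetical stand-in: a Channel of requests plays the role of pending
# hostcall invocations; `handler` plays the role of the host-side function.
function serve_concurrently(handler, requests::Channel, nworkers::Integer)
    results = Channel{Any}(Inf)  # unbounded, so workers never block on put!
    workers = map(1:nworkers) do _
        # Each worker takes requests until the channel is closed and drained.
        Threads.@spawn for req in requests
            put!(results, handler(req))
        end
    end
    foreach(wait, workers)
    close(results)
    return collect(results)
end

# Usage: eight requests served by four workers.
requests = Channel{Int}(16)
foreach(i -> put!(requests, i), 1:8)
close(requests)
out = serve_concurrently(x -> x^2, requests, 4)
```

Results arrive in completion order rather than submission order, so a real implementation would need some way to match each reply to its request.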