Anton Smirnov
And at this moment matrix multiplication is not a bottleneck in DL applications for AMDGPU. Timely memory freeing is.
@jaydeeppatel1111 am I missing something, or do the ROCm docs [contain](https://rocm.docs.amd.com/projects/HIP/en/docs-6.1.0/doxygen/html/structhip_mem_pool_props.html#a214586a7598eb73e4ff5ebb8aed5294d) information about the `maxSize` field while the actual release does not include https://github.com/ROCm/clr/commit/b72d8da1bdd6547c86baa119f1bacab4d418a5ea ? I'm not able to find...
I also ran the tests using debug builds of Julia and HIP, and besides hitting [this](https://github.com/ROCm-Developer-Tools/clr/issues/36) assert (which I commented out), there were no other issues.
Unfortunately, I was unable to create an MWE as it is unclear to me what causes it. Running the tests one by one does not reproduce it, only when running them all....
Also, on Windows there are no issues at all with RX7900XT, it passes all AMDGPU.jl tests without hanging.
@iassiour, not sure if this is expected, but I noticed that async malloc/free is ~300x slower than non-async (tried on RX6700 XT and RX7900 XT). MWE: ```cpp #include #include using...
Indeed, allocations smaller than 8 bytes are much slower. Thanks! However, with e.g. 16 bytes it is still 3-5x slower: ``` pxl-th@Leleka:~/code$ time ./a.out Regular real 0m0,255s user 0m0,203s sys...
Thank you for the fix! Regarding `hipFreeAsync` and hangs, I recently upgraded to ROCm 6 and when running AMDGPU.jl tests it reported some page faults (and errored instead of hanging),...
There are tests that reliably trigger the hang. In Julia we use Task-Local State (TLS) as opposed to Thread-Local State. And each Task in Julia has its own HIP stream,...
Reviving this as I have a fairly small MWE that consistently reproduces the issue on ROCm 6.0.2 and RX7900 XTX. Again in Julia as it is much easier to set...