German Andryeyev

Results 25 comments of German Andryeyev

Try 256 threads per block. In general, you have to profile each kernel and report the slowest one to the compiler.
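As a rough sketch of what a 256-thread block size means for the launch configuration, the grid size is usually the work size divided by the block size, rounded up. The helper name `blocks_for` is hypothetical and not part of HIP; it only shows the arithmetic.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a HIP API): number of blocks needed to cover
// n work items when each block runs threads_per_block threads.
// Uses ceiling division so a partial last block is still launched.
inline std::size_t blocks_for(std::size_t n, std::size_t threads_per_block = 256) {
    assert(threads_per_block > 0);  // a zero block size makes no sense
    return (n + threads_per_block - 1) / threads_per_block;
}
```

With this, a kernel over `n` elements would be launched with `blocks_for(n)` blocks of 256 threads, and the kernel itself has to guard against indices past `n` in the last block.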

Use rocprofiler with the --hip-trace option; it should produce JSON compatible with chrome://tracing. Then identify where the runtime/HW spends most of its time.

There is no limit on constant buffer size on AMD HW; it is equal to the max single allocation, which is about the total memory size. Hence you see an...

I'm not sure about that, since HIP matches CUDA on the API side and I don't see any reason to add an extra extension. IMO, this particular query is useless on any...

hipDeviceGetLimit() with hipLimitMallocHeapSize. That's the closest thing the app can query right now, because I couldn't find a max single allocation query in HIP. The returned value should be the...

Allocate any size; there is no limit except the max single allocation size on the GPU, which is almost the same as the total device memory size. I really don't understand why...

Thank you for reporting the issue; it will be fixed. Basically, for some reason your build enabled _GLIBCXX_ASSERTIONS, which is disabled by default in our environment.

Sure, if you rebuild the runtime. In hsaCopy() and copyBufferRect() in rocblit.cpp, there are calls to hsa_amd_memory_async_copy(). Replace the argument &wait_events[0] with "(wait_events.size() > 0) ? &wait_events[0] : nullptr". Basically if...

Try removing -Wno-dev and adding an explicit release build with -DCMAKE_BUILD_TYPE=Release. However, I don't really know what exactly triggered _GLIBCXX_ASSERTIONS in your build. Maybe some global setting in the compiler...

Actually, instead of that condition you can just use wait_events.data(). That will produce slightly better code.
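To illustrate why the pointer expression matters: with _GLIBCXX_ASSERTIONS enabled, libstdc++ checks the index in std::vector::operator[], so &wait_events[0] on an empty vector aborts, while data() never indexes an element. The sketch below is standalone C++ and does not use the real runtime; `Signal`, `count_deps`, `deps_checked` and `deps_data` are hypothetical stand-ins for the signal type and the async copy call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a signal handle; the real runtime has its own type.
struct Signal { int id; };

// Hypothetical stand-in for a call that takes a dependency count and a
// pointer to the dependencies. The pointer is never read when num is 0.
std::size_t count_deps(std::size_t num, const Signal* deps) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < num; ++i) {
        if (deps[i].id >= 0) ++seen;
    }
    return seen;
}

// Variant 1: check for emptiness before indexing, as in the first suggestion.
const Signal* deps_checked(const std::vector<Signal>& w) {
    return (w.size() > 0) ? &w[0] : nullptr;
}

// Variant 2: data() is valid on an empty vector and avoids the branch.
// On an empty vector the result may or may not be null, so it must not
// be dereferenced; passing it together with a count of 0 is fine.
const Signal* deps_data(const std::vector<Signal>& w) {
    return w.data();
}
```

Both variants return the same pointer for a non-empty vector. The data() form avoids the branch, which is presumably the "slightly better code" mentioned above.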