[Issue]: ROCm 6.3.3 allocation failure with stream
Problem Description
The 4.6 kokkos release lead to a change in the way the allocations are done. They now default to hipMallocAsync. https://github.com/kokkos/kokkos/pull/7659
The behaviors of hipMallocAsync leads to OOM where hipMalloc does not. I have a reproducer for the allocation issue. Using rocm 6.3.3:
#include <hip/hip_runtime.h>
#include <cstdio>
#include <iostream>
#include <vector>
void checkhiperror(hipError_t error) {
if(error != hipSuccess) {
std::cout << hipGetErrorString(error) << '\n';
std::exit(1);
}
}
static size_t current_allocated = 0;
// #define NOSTREAM
void* GPUMalloc(size_t byte_count, hipStream_t stream) {
void* ptr;
#if defined(NOSTREAM)
checkhiperror(hipMalloc(&ptr, byte_count));
#else
checkhiperror(hipMallocAsync(&ptr, byte_count, stream));
checkhiperror(hipDeviceSynchronize());
#endif
current_allocated += byte_count;
std::cout << "a current_allocated: " << current_allocated << std::endl;
return ptr;
}
void GPUFree(void* ptr, size_t byte_count, hipStream_t stream) {
#if defined(NOSTREAM)
checkhiperror(hipFree(ptr));
#else
checkhiperror(hipFreeAsync(ptr, stream));
checkhiperror(hipDeviceSynchronize());
#endif
current_allocated -= byte_count;
std::cout << "f current_allocated: " << current_allocated << std::endl;
}
void Workload(hipStream_t stream, double* host_stuff, size_t count) {
void* device_ptr = GPUMalloc(count * sizeof(double), stream);
checkhiperror(hipMemcpyAsync(device_ptr, host_stuff, count * sizeof(double), hipMemcpyHostToDevice, stream));
GPUFree(device_ptr, count * sizeof(double), stream);
}
void Test1() {
static constexpr size_t kCount = (1ULL << 32) + /* kokkos stuff */ 100;
std::vector<double> host_stuff;
host_stuff.resize(kCount);
hipStream_t stream;
checkhiperror(hipStreamCreate(&stream));
for(int i = 0; i < 10; ++i)
Workload(stream, host_stuff.data(), kCount);
}
int main() {
Test1();
}
I get the following results if stream alloc are used:
$ hipcc -O3 --offload-arch=gfx90a reproducer.hip && ./a.out
a current_allocated: 34359739168
f current_allocated: 0
out of memory
$ hipcc -O3 --offload-arch=gfx90a -DNOSTREAM reproducer.hip && ./a.out
a current_allocated: 34359739168
f current_allocated: 0
a current_allocated: 34359739168
f current_allocated: 0
a current_allocated: 34359739168
f current_allocated: 0
...
$echo "OS:" && cat /etc/os-release | grep -E "^(NAME=|VERSION=)";
OS:
NAME="Red Hat Enterprise Linux"
VERSION="8.10 (Ootpa)"
$ echo "CPU: " && cat /proc/cpuinfo | grep "model name" | sort --unique;
CPU:
model name : AMD EPYC 7A53 64-Core Processor
$ echo "GPU:" && /opt/rocm/bin/rocminfo | grep -E "^\s*(Name|Marketing Name)";
GPU:
Name: AMD EPYC 7A53 64-Core Processor
Marketing Name: AMD EPYC 7A53 64-Core Processor
Name: AMD EPYC 7A53 64-Core Processor
Marketing Name: AMD EPYC 7A53 64-Core Processor
Name: AMD EPYC 7A53 64-Core Processor
Marketing Name: AMD EPYC 7A53 64-Core Processor
Name: AMD EPYC 7A53 64-Core Processor
Marketing Name: AMD EPYC 7A53 64-Core Processor
Name: gfx90a
Marketing Name: AMD Instinct MI250X
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Hi @etiennemlb. Internal ticket has been created to investigate your issue. Thanks!
Hi @etiennemlb,
Thanks for the report! I was able to reproduce the issue on ROCm 6.3.3 and can confirm that we have a fix that will be released with ROCm 7.
This issue has been migrated to: https://github.com/ROCm/rocm-systems/issues/399