failed to allocate memory when using malloc_device
template<typename T>
class Linear {
private:
    T* weight;
    T* input;
    T* result;
    T* bias;
    T* dz;
    const int M;
    const int N;
    const int K;
public:
    Linear(T* x, T* r, int m, int n, int k, queue& Q) : input(x), result(r), M(m), N(n), K(k) {
        weight = malloc_device<T>(M * N, Q);
        bias = malloc_device<T>(M, Q);
        dz = malloc_device<T>(N * K, Q);
    }
    ...
};

The caller allocates x and r beforehand with x = malloc_device<T>(N * K, Q); and r = malloc_device<T>(M * K, Q);.
In my code, when I create multiple Linear instances sequentially, all of them allocate weight and bias successfully. However, only the last Linear instance allocates dz successfully; the others fail to allocate dz and get 0 back (dz == nullptr is true).
I use dz to store a temporary result in each Linear.
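For context, the construction looks roughly like this (a minimal sketch only: the layer count and the sizes m, n, k are made-up values to show the pattern, and the header name may differ on older oneAPI releases):

#include <sycl/sycl.hpp> // or <CL/sycl.hpp> on older oneAPI releases
#include <vector>
using namespace sycl;

int main() {
    queue Q;
    constexpr int m = 256, n = 256, k = 64; // placeholder sizes

    std::vector<Linear<float>> layers;
    for (int i = 0; i < 3; ++i) {
        // x and r are device buffers allocated before each Linear is built
        float* x = malloc_device<float>(n * k, Q);
        float* r = malloc_device<float>(m * k, Q);
        layers.emplace_back(x, r, m, n, k, Q);
    }
    // Observed behaviour: weight and bias are non-null in every instance,
    // but dz comes back as nullptr in all but the last one.
}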
Furthermore, if I change my code and put dz in a member function of Linear, like below:
T* update(T* diff, queue& Q) {
    T* dz = malloc_device<T>(N * K, Q); // also fails to allocate
    T* dw = malloc_device<T>(M * N, Q); // but dw always allocates successfully
    /* events here */
    ...
    free(dw, Q);
    return dz;
}
I call update in a for loop:
T* diff = inputs.back(); // all elements in inputs are allocated by malloc_device
for (auto linear = layers.rbegin(); linear != layers.rend(); linear++) {
    diff = linear->update(diff, Q);
}
free(diff, Q);
If I change the above code to:
void update(T* diff, queue& Q) {
    T* dz = malloc_device<T>(N * K, Q); // also fails to allocate
    T* dw = malloc_device<T>(M * N, Q); // but dw always allocates successfully
    /* events here */
    ...
    free(dw, Q);
    free(diff, Q);
    Q.memcpy(diff, dz, N * K * sizeof(T)).wait();
    free(dz, Q);
}
Then, call it like this:
T* diff = inputs.back(); // all elements in inputs are allocated by malloc_device
for (auto linear = layers.rbegin(); linear != layers.rend(); linear++) {
    linear->update(diff, Q);
}
free(diff, Q);
In all of these cases, only the last Linear can allocate dz successfully; the others get 0 (nullptr).
This problem has driven me crazy! I hope someone can explain why it happens. Thanks!
I tested this code on Windows 10 with an Intel CPU (using the oneAPI toolkit) and on Ubuntu 18.04 with an Nvidia GPU. Both gave me the same errors.
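For reference, this is the kind of check I wrap around the failing allocation to confirm it (a sketch only: checked_malloc_device is my own helper name, while the device-info queries are standard SYCL descriptors; the printout lets me compare the requested size against the device limits):

#include <sycl/sycl.hpp> // or <CL/sycl.hpp> on older oneAPI releases
#include <cstdio>

// Allocate device USM and, on failure, print the requested size next to the
// device limits so they can be compared directly.
template <typename T>
T* checked_malloc_device(size_t count, sycl::queue& Q) {
    T* p = sycl::malloc_device<T>(count, Q);
    if (p == nullptr) {
        auto dev = Q.get_device();
        std::printf("malloc_device failed: requested %zu bytes, "
                    "global_mem_size=%llu, max_mem_alloc_size=%llu\n",
                    count * sizeof(T),
                    (unsigned long long)dev.get_info<sycl::info::device::global_mem_size>(),
                    (unsigned long long)dev.get_info<sycl::info::device::max_mem_alloc_size>());
    }
    return p;
}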