GPUMD
nep executable error - "no kernel image is available for execution on the device"
Hello,
I would like to report an issue I found using GPUMD version 3.9.1.
I was trying to create my first test neuroevolution potential (NEP) using the "nep" executable on the cluster I use. After preparing the input files (nep.in, test.xyz, and train.xyz) and running the "nep" command, GPUMD prints the following:
number of GPUs = 1
Device id: 0
Device name: Tesla K80
Compute capability: 3.7
Amount of global memory: 11.1731 GB
Number of SMs: 13
Then the nep.in file is read successfully. Later:
Started reading train.xyz.
Number of configurations = 20.
Number of devices = 1
Number of batches = 1
(Note: I had changed batch_size from 1000 to 20, hence the single batch.)
Batch 0:
Number of configurations = 20.
Constructing train_set in device 0.
Total number of atoms = 1000.
Number of atoms in the largest configuration = 50.
Number of configurations having virial = 0.
CUDA Error:
File: main_nep/dataset.cu Line: 266 Error code: 209
Error text: no kernel image is available for execution on the device
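For context: CUDA error code 209 is cudaErrorNoKernelImageForDevice, i.e. the executable contains no compiled kernel image (SASS or PTX) that the driver can run on this GPU. A sketch of how to check which architectures a binary actually embeds (cuobjdump ships with the CUDA toolkit; ./nep is assumed to be the path to the executable):

```shell
# List the GPU code images embedded in the nep binary.
# A build with -arch=sm_37 should show an ELF image for sm_37.
if command -v cuobjdump >/dev/null 2>&1; then
  cuobjdump --list-elf ./nep
else
  echo "cuobjdump not found; load the CUDA toolkit first"
fi
```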
With the help of the cluster admins, we checked that the error is raised by the macro CUDA_CHECK_KERNEL, defined in utilities/error.cuh (inside a debug conditional) as:
#ifdef STRONG_DEBUG
#define CUDA_CHECK_KERNEL \
  { \
    CHECK(cudaGetLastError()); \
    CHECK(cudaDeviceSynchronize()); \
  }
#else
#define CUDA_CHECK_KERNEL \
  { \
    CHECK(cudaGetLastError()); \
  }
#endif
The function we think is causing the error is cudaDeviceSynchronize(). However, this call works when we run it outside GPUMD. (Or perhaps cudaDeviceSynchronize() is only surfacing an asynchronous failure of the preceding kernel launch, rather than failing itself.)
Cluster configuration: driver version 470.129.06, CUDA version 11.4, GPU card: Tesla K80. Compiling with nvcc from NVHPC 23.3 and CUDA 11.8 gave the same result.
I do not know how to solve this issue. I would be very grateful for your help!
Kind regards, Antoni
You can try to change
CFLAGS = -std=c++14 -O3 -arch=sm_60
to
CFLAGS = -std=c++14 -O3 -arch=sm_37
in src/makefile, then run make clean followed by make, and try again.
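As an aside, a sketch using standard nvcc options (these are not the flags GPUMD ships with): nvcc can embed code for several targets in one binary, which avoids this class of error when a cluster mixes GPU generations:

```makefile
# Embed Kepler SASS plus forward-compatible PTX; the driver JIT-compiles
# the PTX for any newer GPU the binary encounters.
CFLAGS = -std=c++14 -O3 \
         -gencode arch=compute_37,code=sm_37 \
         -gencode arch=compute_37,code=compute_37
```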
Thank you for your answer. Unfortunately, the error persists. Below I am sending the makefile that was used during the compilation. Different combinations of CFLAGS were tried:
CFLAGS = -std=c++11 -O3 -arch=sm_37
CFLAGS = -std=c++14 -O3 -arch=sm_37
… as well as compilation with and without PLUMED and NetCDF, giving the same effect.
Moreover, the error message remains the same while using the input files (nep.in, train.xyz, and test.xyz) from the repository (GPUMD/examples/11_NEP_potential_PbTe/).
Then I guess CUDA code does not work on your platform at all. You can try to compile and run the following simplest CUDA code:
#include <stdio.h>
__global__ void hello_from_gpu()
{
printf("Hello World from the GPU!\n");
}
int main(void)
{
hello_from_gpu<<<1, 1>>>();
cudaDeviceSynchronize();
return 0;
}
Save the above code into a file hello.cu, compile it with nvcc -arch=sm_37 hello.cu -o hello, and run the executable with ./hello. If it succeeds, you will see the message "Hello World from the GPU!".
I am sending the output from the commands after creating the hello.cu file:
nvcc -arch=sm_37 hello.cu -o hello
nvcc warning : The 'compute_35', 'compute_37', 'sm_35', and 'sm_37' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
./hello
Hello World from the GPU!
It seems that the cudaDeviceSynchronize() call works correctly outside GPUMD on the cluster I use. Unfortunately, I do not know the origin of this behaviour (the call works outside GPUMD but fails within it). Do you have an idea?
Being able to compile and run the simplest CUDA code means you have a working CUDA platform.
Did you run gpumd (or nep) from the command line directly? The error
Error text: no kernel image is available for execution on the device
means that your executable was not compiled to target your GPU architecture. However, you showed that you compiled with -arch=sm_37, which matches the K80 GPU you mentioned. So I am really puzzled.
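One more thing worth ruling out (my assumption, not something verified in this thread): that the nep you execute is the binary you just rebuilt, rather than a stale copy elsewhere on the PATH. A quick check:

```shell
# Confirm which nep binary the shell actually resolves, and when it was built;
# if this is not the file you just compiled, the old binary explains the error.
if p=$(command -v nep); then
  ls -l "$p"
else
  echo "nep not on PATH; it was probably invoked by explicit path"
fi
```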
The error log I reported at the beginning of this issue, was shown after running the "nep" command directly from the command line. This was done in the directory with the input files (nep.in, train.xyz, and test.xyz). I did not use "gpumd" command yet.
If possible, could you test on a different platform?
I encountered a similar problem before. I just changed -arch=sm_XX to a smaller number and the problem was solved.
Thanks for the tip. Unfortunately, in my case, compiling with a lower number in -arch=sm_XX gave the same result. The tested options were:
-arch=sm_35 (the lowest value accepted by each of the two nvcc versions available on the cluster I use)
-arch=native (probably equivalent to sm_37 here; this was also tried with different nvcc versions)
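For reference, the correspondence between these -arch flags and some common data-center GPUs can be sketched as follows (values taken from NVIDIA's compute-capability tables; the helper function is illustrative, not part of GPUMD):

```shell
# Map a GPU model to the matching -arch flag (illustrative subset).
arch_for_gpu() {
  case "$1" in
    "Tesla K80")  echo "sm_37" ;;  # Kepler, compute capability 3.7
    "Tesla P100") echo "sm_60" ;;  # Pascal, compute capability 6.0
    "Tesla V100") echo "sm_70" ;;  # Volta, compute capability 7.0
    "A100")       echo "sm_80" ;;  # Ampere, compute capability 8.0
    *)            echo "unknown" ;;
  esac
}
arch_for_gpu "Tesla K80"   # prints sm_37
```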
I would like to close this if there is no more discussion. I believe this is a problem related to the CUDA environment instead of GPUMD.