
Bug in CCMpred (CUDA)

Open fsimkovic opened this issue 8 years ago • 4 comments

Running CCMpred on a sequence alignment with a CUDA-compiled build sometimes crashes. The error output:

adenine: felix > ccmpred alignments/1bdo.jones 1bdo.mat
Found 1 CUDA devices, using device #0: Quadro K4000
Total GPU RAM:      3,217,752,064
Free GPU RAM:       2,617,708,544
Needed GPU RAM:       792,606,940 ✓
CUDA error No. 0 in /opt/CCMpred/src/evaluate_cuda_kernels.cu at line 819

The same command with the flag -t 2 runs fine.

fsimkovic, Oct 13 '16 13:10

Hi Felix, I don't have access to a suitable GPU/computer combination to debug this at the moment so I'm afraid that I will not be able to help 😞

sseemayer, Oct 13 '16 18:10

No worries, the CPU version works fine so there's no rush. Just thought I'd report it ...

fsimkovic, Oct 13 '16 18:10

I encountered a similar error. The cause seems to be that I fed CCMpred too many sequences (~70k); the error code I got was 6. In addition, the macro CHECK_ERR(err), defined in include/evaluate_cuda_kernels.h and lib/libconjugrad/include/conjugrad_kernels.h (and maybe other files), may call cudaGetLastError() more than once after expansion, as at the call sites in src/evaluate_cuda_kernels.cu. According to http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__ERROR.html#group__CUDART__ERROR_1g3529f94cb530a83a76613616782bd233, cudaGetLastError() resets the error state to cudaSuccess each time it is called, so by the time the message is printed the error code has already been reset. That is why we always get "CUDA error No. 0". Something like https://codeyarns.com/2011/03/02/how-to-do-error-checking-in-cuda/ may be a solution.

tianmingzhou, May 21 '17 03:05
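
The double evaluation described in the comment above can be reproduced without a GPU. The following is a minimal sketch in plain C, not CCMpred code: fake_get_last_error() is a hypothetical stand-in that mimics the documented return-and-reset behaviour of cudaGetLastError(), and CHECK_ERR_TWICE is a hypothetical macro with the same shape as CHECK_ERR, except that it records the reported code instead of exiting.

```c
#include <stdio.h>

/* Hypothetical stand-in for the CUDA runtime (an assumption for this demo):
 * like cudaGetLastError(), it returns the pending error code and then
 * resets it to 0, which plays the role of cudaSuccess here. */
static int pending_error = 0;

static int fake_get_last_error(void) {
    int e = pending_error;
    pending_error = 0;
    return e;
}

/* Code that the macro ends up reporting; -1 means "nothing reported". */
static int reported_code = -1;

/* Same shape as CHECK_ERR: the argument `err` appears twice in the body,
 * so a function call passed as `err` is evaluated twice. */
#define CHECK_ERR_TWICE(err) { if (0 != (err)) { \
    reported_code = (err); \
    printf("error No. %d in %s at line %d\n", reported_code, __FILE__, __LINE__); } }

/* Simulates a failure with code 6 followed by the check,
 * and returns the code that the macro reports. */
int reported_after_failure(void) {
    pending_error = 6;
    reported_code = -1;
    CHECK_ERR_TWICE(fake_get_last_error());
    return reported_code;
}
```

Calling reported_after_failure() returns 0 rather than 6: the first expansion of the argument (in the comparison) consumes the real code, and the second (in the assignment and message) sees the already-reset state.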

This issue is still present: the real error code is hidden and the message always shows No. 0. The cause is the error check CHECK_ERR(cudaGetLastError());. CHECK_ERR is not a function but a preprocessor macro, defined in evaluate_cuda_kernels.h, line 9, as #define CHECK_ERR(err) {if (cudaSuccess != (err)) { printf("CUDA error No. %d in %s at line %d\n", (err), __FILE__, __LINE__); exit(EXIT_FAILURE); } }. Because (err) appears twice in the macro body, the expansion calls cudaGetLastError() twice: the first call, in the comparison, consumes the actual error code, and the second call, in the printf, returns cudaSuccess (0).

I suggest changing the macro so that it evaluates its argument only once: #define CHECK_ERR(err) { int e = (err); if (cudaSuccess != e) { printf("CUDA error No. %d in %s at line %d\n", e, __FILE__, __LINE__); exit(EXIT_FAILURE); } }

kWeissenow, Feb 13 '20 14:02
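
The single-evaluation fix suggested in the comment above can be checked the same way, again without a GPU. This is a minimal sketch in plain C under stated assumptions: mock_get_last_error() is a hypothetical stand-in for cudaGetLastError() (return the pending code, then reset it to 0), and CHECK_ERR_ONCE is a hypothetical variant of the proposed macro that records the code instead of exiting. The do { ... } while (0) wrapper is a common C idiom added here, not part of the original suggestion.

```c
#include <stdio.h>

/* Hypothetical stand-in for the CUDA runtime (an assumption for this demo):
 * returns the pending error code and resets it to 0 ("success"). */
static int last_error = 0;

static int mock_get_last_error(void) {
    int e = last_error;
    last_error = 0;
    return e;
}

/* Code that the macro ends up reporting; -1 means "nothing reported". */
static int seen_code = -1;

/* The argument is evaluated exactly once and stored in a local variable;
 * both the comparison and the message use that stored value.
 * do { ... } while (0) makes the macro behave like a single statement. */
#define CHECK_ERR_ONCE(err) do { \
    int e_ = (err); \
    if (0 != e_) { \
        seen_code = e_; \
        printf("error No. %d in %s at line %d\n", e_, __FILE__, __LINE__); \
    } \
} while (0)

/* Simulates a failure with code 6 followed by the check,
 * and returns the code that the macro reports. */
int reported_after_failure_fixed(void) {
    last_error = 6;
    seen_code = -1;
    CHECK_ERR_ONCE(mock_get_last_error());
    return seen_code;
}
```

Here reported_after_failure_fixed() returns 6, the code that was actually pending. In real CUDA code the temporary would normally be declared as cudaError_t rather than int, and cudaGetErrorString() can turn the code into a readable message.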