
Non-OK-status: GpuLaunchKernel in softmax_op_gpu.cu.cc

Open guillaumekln opened this issue 6 years ago • 5 comments

@luozhouyang reported this error during evaluation:

INFO:tensorflow:Step = 4300 ; source words/s = 78948, target words/s = 2012 ; Learning rate = 0.000100 ; Loss = 2.087587
INFO:tensorflow:Step = 4400 ; source words/s = 79358, target words/s = 2059 ; Learning rate = 0.000100 ; Loss = 2.108997
INFO:tensorflow:Step = 4500 ; source words/s = 79888, target words/s = 1977 ; Learning rate = 0.000100 ; Loss = 2.675094
INFO:tensorflow:Step = 4600 ; source words/s = 77566, target words/s = 2015 ; Learning rate = 0.000100 ; Loss = 2.173948
INFO:tensorflow:Step = 4700 ; source words/s = 80029, target words/s = 1967 ; Learning rate = 0.000100 ; Loss = 2.588823
INFO:tensorflow:Step = 4800 ; source words/s = 80122, target words/s = 1950 ; Learning rate = 0.000100 ; Loss = 2.336910
INFO:tensorflow:Step = 4900 ; source words/s = 78610, target words/s = 1998 ; Learning rate = 0.000100 ; Loss = 2.527997
INFO:tensorflow:Step = 5000 ; source words/s = 79802, target words/s = 1957 ; Learning rate = 0.000100 ; Loss = 2.110916
INFO:tensorflow:Running evaluation for step 5000
2019-10-17 06:07:54.980482: F tensorflow/core/kernels/softmax_op_gpu.cu.cc:192] Non-OK-status: GpuLaunchKernel( GenerateNormalizedProb<T, acc_type>, numBlocks, numThreadsPerBlock, 0, cu_stream, reinterpret_cast<const T*>(logits_in_.flat<T>().data()), reinterpret_cast<const acc_type*>(sum_probs.flat<acc_type>().data()), reinterpret_cast<const T*>(max_logits.flat<T>().data()), const_cast<T*>(softmax_out->flat<T>().data()), rows, cols, log_) status: Internal: invalid configuration argument

Here is a similar TensorFlow issue: "Non-OK-status for CudaLaunchKernel when torch is also imported" (tensorflow #27487).

Originally posted by @luozhouyang in https://github.com/OpenNMT/OpenNMT-tf/issues/519#issuecomment-543029009
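Not from the original thread, but a commonly suggested mitigation when TensorFlow shares the GPU with another framework in the same process (as in the linked issue) is to enable GPU memory growth before any kernels run. A minimal TF 2.x sketch:

```python
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of grabbing it all
# up front, which reduces conflicts when another framework (e.g. torch) is
# also imported in the same process. Must run before the first GPU op.
for gpu in tf.config.experimental.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```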

guillaumekln · Oct 17 '19 08:10

@luozhouyang I have never seen this issue. Is there anything special about your installation?

guillaumekln · Oct 17 '19 08:10

I installed OpenNMT-tf using pip and trained this model in a Docker container based on the tensorflow/tensorflow:2.0.0-gpu-py3 image.
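For anyone trying to reproduce this, a quick sanity check of the TensorFlow version and GPU visibility inside that container (not part of the original report, just a suggested verification):

```python
import tensorflow as tf

# Expected to print 2.0.0 in the tensorflow/tensorflow:2.0.0-gpu-py3 image.
print(tf.__version__)

# Should list the GPUs visible to TensorFlow inside the container.
print(tf.config.experimental.list_physical_devices("GPU"))
```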

luozhouyang · Oct 17 '19 12:10

Do you still face this issue?

guillaumekln · Oct 25 '19 12:10

Closing this one. I don't think this is related to anything we do in OpenNMT-tf.

guillaumekln · Oct 29 '19 17:10

Could this have something to do with batching? From the TensorFlow issues:

In case anyone else is going crazy because of the GpuLaunchKernel(...) status: Internal: invalid configuration argument error, please note that this may also occur if your batch size is such that there is an odd final batch containing a single record. In my case, the error occurred with the following numbers when distributed across 4 GPUs: To fix the issue, change your batch size so that there won't be an odd batch with a single record.
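A minimal sketch of that workaround, assuming a plain tf.data input pipeline and made-up numbers (4 GPUs, per-replica batch size 8); OpenNMT-tf builds its input pipeline internally, so this only illustrates the idea of dropping the undersized final batch:

```python
import tensorflow as tf

# Hypothetical numbers for illustration: 4 replicas, per-replica batch size 8.
num_replicas = 4
per_replica_batch = 8
global_batch = per_replica_batch * num_replicas  # 32

# 1025 % 32 == 1, so the last batch would contain a single record.
dataset = tf.data.Dataset.range(1025)

# Dropping the remainder avoids the undersized final batch that was reported
# to trigger the invalid-configuration error when sharded across GPUs.
dataset = dataset.batch(global_batch, drop_remainder=True)

for batch in dataset:
    assert batch.shape[0] == global_batch  # every batch is now full-sized
```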

I currently see this error when using 'score' repeatedly. I am trying a different TF version.

FPBHW · Oct 12 '21 08:10