OpenNMT-tf
Non-OK-status: GpuLaunchKernel in softmax_op_gpu.cu.cc
@luozhouyang reported this error during evaluation:
INFO:tensorflow:Step = 4300 ; source words/s = 78948, target words/s = 2012 ; Learning rate = 0.000100 ; Loss = 2.087587
INFO:tensorflow:Step = 4400 ; source words/s = 79358, target words/s = 2059 ; Learning rate = 0.000100 ; Loss = 2.108997
INFO:tensorflow:Step = 4500 ; source words/s = 79888, target words/s = 1977 ; Learning rate = 0.000100 ; Loss = 2.675094
INFO:tensorflow:Step = 4600 ; source words/s = 77566, target words/s = 2015 ; Learning rate = 0.000100 ; Loss = 2.173948
INFO:tensorflow:Step = 4700 ; source words/s = 80029, target words/s = 1967 ; Learning rate = 0.000100 ; Loss = 2.588823
INFO:tensorflow:Step = 4800 ; source words/s = 80122, target words/s = 1950 ; Learning rate = 0.000100 ; Loss = 2.336910
INFO:tensorflow:Step = 4900 ; source words/s = 78610, target words/s = 1998 ; Learning rate = 0.000100 ; Loss = 2.527997
INFO:tensorflow:Step = 5000 ; source words/s = 79802, target words/s = 1957 ; Learning rate = 0.000100 ; Loss = 2.110916
INFO:tensorflow:Running evaluation for step 5000
2019-10-17 06:07:54.980482: F tensorflow/core/kernels/softmax_op_gpu.cu.cc:192] Non-OK-status: GpuLaunchKernel( GenerateNormalizedProb<T, acc_type>, numBlocks, numThreadsPerBlock, 0, cu_stream, reinterpret_cast<const T*>(logits_in_.flat<T>().data()), reinterpret_cast<const acc_type*>(sum_probs.flat<acc_type>().data()), reinterpret_cast<const T*>(max_logits.flat<T>().data()), const_cast<T*>(softmax_out->flat<T>().data()), rows, cols, log_) status: Internal: invalid configuration argument
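The "invalid configuration argument" status comes from the CUDA runtime rejecting the parameters passed to the kernel launch, typically because a grid or block dimension is zero or exceeds the device limit. Below is a minimal sketch of that failure mode, assuming a hypothetical one-thread-per-element launch; the `launch_config` helper and its constants are illustrative, not TensorFlow's actual code:

```python
def launch_config(rows, cols, threads_per_block=128):
    # Hypothetical launch arithmetic: one thread per element,
    # rounded up to whole blocks.
    total_threads = rows * cols
    num_blocks = (total_threads + threads_per_block - 1) // threads_per_block
    return num_blocks, threads_per_block

print(launch_config(rows=32, cols=1000))  # (250, 128): a valid configuration
print(launch_config(rows=0, cols=1000))   # (0, 128): numBlocks == 0, which CUDA
                                          # rejects as an invalid configuration
```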
Here is a similar TensorFlow issue: Non-OK-status for CudaLaunchKernel when torch is also imported (#27487).
Originally posted by @luozhouyang in https://github.com/OpenNMT/OpenNMT-tf/issues/519#issuecomment-543029009
@luozhouyang I have never seen this issue. Is there something special about your installation?
I installed OpenNMT-tf using pip and trained the model in a Docker container based on the tensorflow/tensorflow:2.0.0-gpu-py3 image.
Do you still face this issue?
Closing this one. I don't think this is related to something we do in OpenNMT-tf.
Could this maybe have something to do with batching? From the TensorFlow issues:
In case anyone else is going crazy because of the GpuLaunchKernel(...) status: Internal: invalid configuration argument error, please note that this may also occur if your batch size leaves an odd final batch containing a single record. In my case, the error occurred when training was distributed across 4 GPUs. To fix the issue, change your batch size so that there won't be an odd batch with a single record.
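A plausible mechanism tying this to the launch arithmetic sketched above: when a 1-record batch is split across 4 replicas, some replicas receive zero rows, which would produce the zero-block launch that CUDA rejects. If that trailing batch is indeed the trigger, the generic tf.data workaround is `drop_remainder`; a minimal sketch (this is plain TensorFlow, not necessarily how OpenNMT-tf builds its input pipeline):

```python
import tensorflow as tf

# 101 examples with batch_size=4 would leave a final batch with a single
# record; drop_remainder=True discards it so every batch has exactly 4.
dataset = tf.data.Dataset.range(101).batch(4, drop_remainder=True)

sizes = [int(batch.shape[0]) for batch in dataset]
assert sizes == [4] * 25  # the trailing 1-record batch was dropped
```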
I currently see this error when running 'score' repeatedly. I am trying a different TF version.