installTensorFlowTX2

Memory issue (?) : failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED

Open ericj974 opened this issue 6 years ago • 2 comments

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04 LTS
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.3.0
Python version: 2.7.12
CUDA/cuDNN version: 8.0/6.0.21
GPU model and memory: Nvidia Tegra X2

Describe the problem

I'm trying to run inference using ResNet-50 as a feature encoder (semantic segmentation with 2 classes). Depending on my memory load, I get the following error log sooner or later:

2017-11-10 05:10:43.484563: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Invalid reduction dimension (-1146944963 for input with 4 dimension(s)
2017-11-10 05:10:44.646881: E tensorflow/stream_executor/cuda/cuda_driver.cc:1068] failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED
2017-11-10 05:10:44.646946: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Internal: error destroying CUDA event in context 0x30eb3d0: CUDA_ERROR_LAUNCH_FAILED
2017-11-10 05:10:44.646975: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Internal: error destroying CUDA event in context 0x30eb3d0: CUDA_ERROR_LAUNCH_FAILED
2017-11-10 05:10:44.647369: E tensorflow/stream_executor/cuda/cuda_blas.cc:551] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
2017-11-10 05:10:44.647478: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: slice index 1000163558 of dimension 0 out of bounds.
2017-11-10 05:10:44.647529: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: slice index 1021428837 of dimension 0 out of bounds.
2017-11-10 05:10:44.647573: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: slice index 1004492442 of dimension 0 out of bounds.
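For anyone triaging a log like the one above, here is a minimal sketch that splits TensorFlow log output into entries and picks out the first CUDA or cuBLAS error. The names `LogEntry`, `parse_tf_log` and `first_cuda_failure` are made up for illustration, and the regular expression assumes the `timestamp: severity file:line] message` layout seen in this log. It only orders and filters entries; it does not diagnose the cause.

```python
import re
from collections import namedtuple

# Hypothetical helper type: one parsed TensorFlow log entry.
LogEntry = namedtuple("LogEntry", "timestamp severity source line message")

# Assumed layout: "<date> <time>: <I|W|E|F> <file>:<line>] <message>"
_STAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"
_ENTRY = re.compile(r"(" + _STAMP + r"): ([IWEF]) (\S+):(\d+)\] (.*)", re.S)
# Split on whitespace that is immediately followed by a new entry header,
# so the same code handles one-entry-per-line and flattened logs.
_BOUNDARY = re.compile(r"\s+(?=" + _STAMP + r": [IWEF] )")


def parse_tf_log(text):
    """Split raw log text into LogEntry tuples, skipping unparseable chunks."""
    entries = []
    for chunk in _BOUNDARY.split(text.strip()):
        match = _ENTRY.match(chunk)
        if match:
            stamp, severity, source, line, message = match.groups()
            entries.append(
                LogEntry(stamp, severity, source, int(line), message.strip())
            )
    return entries


def first_cuda_failure(entries):
    """Return the first error-level entry mentioning CUDA/cuBLAS, or None."""
    for entry in entries:
        is_error = entry.severity in ("E", "F")
        if is_error and (
            "CUDA_ERROR" in entry.message or "CUBLAS_STATUS" in entry.message
        ):
            return entry
    return None


# Two entries copied from the log above, deliberately left on one line.
SAMPLE = (
    "2017-11-10 05:10:43.484563: W tensorflow/core/framework/op_kernel.cc:1192] "
    "Invalid argument: Invalid reduction dimension (-1146944963 for input with "
    "4 dimension(s) "
    "2017-11-10 05:10:44.646881: E tensorflow/stream_executor/cuda/"
    "cuda_driver.cc:1068] failed to synchronize the stop event: "
    "CUDA_ERROR_LAUNCH_FAILED"
)

entries = parse_tf_log(SAMPLE)
failure = first_cuda_failure(entries)
```

Comparing the timestamp of `failure` with the entries before it shows which warnings were logged ahead of the first CUDA error and which ones only followed it.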

This happens whether or not a swapfile is in use. Once it happens, no further inference run is possible, even with a network that has a small memory footprint. I'm wondering whether this is a memory issue and, if so, how to deal with it.
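One way to test the memory hypothesis, sketched under assumptions: on Jetson boards the GPU shares system DRAM with the CPU, so `/proc/meminfo` gives a rough picture of what is left before a session is created. The helpers `parse_meminfo`, `gpu_memory_fraction` and `make_session` are hypothetical names, and the headroom and cap values are arbitrary placeholders, not recommendations. The `gpu_options` fields used are the TF 1.x `ConfigProto` ones. Nothing in this thread confirms that capping memory resolves the error.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def gpu_memory_fraction(info, headroom_kb=512 * 1024, max_fraction=0.8):
    """Hypothetical heuristic: share of total RAM TensorFlow may claim.

    Leaves `headroom_kb` for the rest of the system and never exceeds
    `max_fraction`. Both defaults are assumptions, not tuned values.
    """
    total = info["MemTotal"]
    available = info.get("MemAvailable", info.get("MemFree", 0))
    usable = max(available - headroom_kb, 0)
    return min(usable / float(total), max_fraction)


def make_session(fraction):
    """Create a TF 1.x session with a capped, on-demand GPU allocation."""
    if fraction <= 0:
        raise RuntimeError("not enough free memory to start a session")
    import tensorflow as tf  # imported lazily; TF 1.x API assumed

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    return tf.Session(config=config)
```

On the device this could be used as `info = parse_meminfo(open("/proc/meminfo").read())` followed by `sess = make_session(gpu_memory_fraction(info))`. If the error still appears with a low cap, that would point away from plain memory exhaustion.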

For info, I get a similar error log on a TX1 (both source-compiled and binary TensorFlow were tried, with the same OS / TF configuration as above).

ericj974, Nov 13 '17 08:11

Hi Eric, I just hit the same problem on a Jetson TX2. Have you solved this?

LanYangXiXi, Apr 17 '18 13:04

+1 @ericj974 @LanYangXiXi any update?

nvnnghia, Apr 23 '18 07:04