
InternalError (see above for traceback): Blas SGEMM launch failed : m=802816, n=64, k=32

Open • to-be-snail opened this issue 6 years ago • 23 comments

When I run channel pruning on MobileNet with the ILSVRC-12 dataset, this error occurs. Pruning on the CIFAR-10 dataset, however, completes normally.

to-be-snail, Feb 12 '19

Maybe this is related to GPU memory? https://stackoverflow.com/questions/37337728/tensorflow-internalerror-blas-sgemm-launch-failed

jiaxiang-wu, Feb 12 '19

My machine has a GTX2080 with 8 GB of GPU memory. I don't know whether I can finish the pruning with it...

to-be-snail, Feb 12 '19

Could you try the solutions in the Stack Overflow link above and see if anything helps?

jiaxiang-wu, Feb 12 '19

I'm sure I only run one TensorFlow program at a time, and I have reinstalled tensorflow-gpu, but it didn't work.

to-be-snail, Feb 12 '19

Maybe this one? https://stackoverflow.com/a/43130779/10611647

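import tensorflow as tf  # TensorFlow 1.x API
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.3)  # cap this process at 30% of GPU memory
sess = tf.Session(config=tf.ConfigProto(
  gpu_options=gpu_options,  # without this argument the memory cap is never applied
  allow_soft_placement=True, log_device_placement=True))

jiaxiang-wu, Feb 12 '19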

I have tried it, although I'm not sure where to put it.

to-be-snail, Feb 12 '19

How many GPU cards do you have?

jiaxiang-wu, Feb 12 '19

Only one...

to-be-snail, Feb 12 '19

Could you try reducing the batch size?

jiaxiang-wu, Feb 12 '19

I have already reduced batch_size_eval to 1.

to-be-snail, Feb 12 '19

If the error occurs during training, you should reduce FLAGS.batch_size instead of FLAGS.batch_size_eval.

jiaxiang-wu, Feb 12 '19
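A minimal sketch for checking which batch actually reached the failing op. It assumes, as I understand TensorFlow 1.x's GPU Conv2D kernel, that a 1×1, stride-1 NHWC convolution is run as a single cuBLAS SGEMM with m = batch × height × width, k = input channels and n = output channels. It further assumes the failing layer is MobileNet v1's first pointwise convolution (112×112 output, 32 → 64 channels, consistent with k=32 and n=64 in the title). The helpers `implied_batch` and `gemm_operand_mib` are hypothetical names for illustration.

```python
# Sketch: decode the cuBLAS SGEMM shape reported for a 1x1, stride-1 Conv2D.
# Assumption: m = batch * out_height * out_width, k = in_channels,
# n = out_channels (how TF 1.x lowers 1x1 NHWC convolutions to a GEMM).

FLOAT32_BYTES = 4


def implied_batch(m: int, height: int, width: int) -> int:
    """Batch size implied by the GEMM row count m for an HxW feature map."""
    pixels = height * width
    if m % pixels != 0:
        raise ValueError(f"m={m} is not a multiple of {height}x{width}={pixels}")
    return m // pixels


def gemm_operand_mib(m: int, n: int, k: int) -> dict:
    """Size in MiB of the float32 operands of C[m, n] = A[m, k] @ B[k, n]."""
    mib = 1024 * 1024
    return {
        "input (A)": m * k * FLOAT32_BYTES / mib,
        "weights (B)": k * n * FLOAT32_BYTES / mib,
        "output (C)": m * n * FLOAT32_BYTES / mib,
    }


if __name__ == "__main__":
    # Values from the error in this issue's title.
    m, n, k = 802816, 64, 32
    # Assumed layer: MobileNet v1 first pointwise conv with a 112x112 output.
    print("implied batch size:", implied_batch(m, 112, 112))
    for name, size in gemm_operand_mib(m, n, k).items():
        print(f"{name}: {size:.1f} MiB")
```

Under these assumptions, m=802816 decodes to a batch of 64, not the batch_size_eval of 1, which is consistent with the advice to lower FLAGS.batch_size. Because m grows linearly with the batch, halving the batch halves the A and C operands (here 98 MiB and 196 MiB in float32). This single GEMM is small next to 8 GB, so if memory is the problem, it is the footprint of the whole network that the batch size controls.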

It didn't work...

to-be-snail, Feb 12 '19

Any updates? Is it still not working?

jiaxiang-wu, Feb 25 '19

Hey bro, have you figured it out? I ran into the same issue.

ShuteLee, Mar 18 '19

Please, if you solve this problem, let me know how you solved it.

0113bernoyoun, Apr 04 '19

I ran into the same issue when running my code on a GTX2080 machine (8 GB of memory per GPU, two cards in total). The error is:

InternalError (see above for traceback): Blas SGEMM launch failed : m=53290, n=80, k=64
	 [[node while/AdvInceptionV3/AdvInceptionV3/Conv2d_3b_1x1/Conv2D (defined at /home/suy/.pyenv/versions/mypython3.6/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py:1057)  = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](while/AdvInceptionV3/AdvInceptionV3/MaxPool_3a_3x3/MaxPool, while/AdvInceptionV3/AdvInceptionV3/Conv2d_3b_1x1/kernel/Regularizer/l2_regularizer/L2Loss/Enter)]]
	 [[{{node while/Exit/_791}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4223_while/Exit", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

However, I can run the same code on another GTX2080 machine (10 GB of memory per GPU, two cards in total).

I still don't know why.

Donald-Su, Aug 07 '19
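The same SGEMM decoding can be cross-checked on this traceback, where the node name confirms a 1×1 convolution (Conv2d_3b_1x1, padding VALID, fed by MaxPool_3a_3x3). A sketch, assuming the standard 299×299 Inception V3 input and the TF-slim stem kernels, strides and paddings listed in `STEM` (my reconstruction of that architecture, not read from the failing code); `out_size` and `stem_sizes` are hypothetical helpers.

```python
# Sketch: recompute the spatial size feeding Conv2d_3b_1x1 in Inception V3.
# Assumptions: 299x299 input and the TF-slim stem layers listed in STEM.


def out_size(size: int, kernel: int, stride: int, padding: str) -> int:
    """Output length of one spatial dimension under TF 'VALID'/'SAME' padding."""
    if padding == "VALID":
        return (size - kernel) // stride + 1
    if padding == "SAME":
        return -(-size // stride)  # ceil(size / stride)
    raise ValueError(f"unknown padding {padding!r}")


# (name, kernel, stride, padding) for the stem up to Conv2d_3b_1x1.
STEM = [
    ("Conv2d_1a_3x3", 3, 2, "VALID"),
    ("Conv2d_2a_3x3", 3, 1, "VALID"),
    ("Conv2d_2b_3x3", 3, 1, "SAME"),
    ("MaxPool_3a_3x3", 3, 2, "VALID"),
    ("Conv2d_3b_1x1", 1, 1, "VALID"),
]


def stem_sizes(input_size: int = 299) -> dict:
    """Spatial size after each stem layer, keyed by layer name."""
    sizes, size = {}, input_size
    for name, kernel, stride, padding in STEM:
        size = out_size(size, kernel, stride, padding)
        sizes[name] = size
    return sizes


if __name__ == "__main__":
    sizes = stem_sizes()
    side = sizes["Conv2d_3b_1x1"]
    m = 53290  # from the traceback above
    print(sizes)
    print(f"{side}x{side} map, implied batch = {m / (side * side):g}")
```

With these assumptions MaxPool_3a_3x3 outputs 73×73, Conv2d_3b_1x1 maps 64 channels to 80 (matching k=64 and n=80), and m=53290 = 10 × 73 × 73, i.e. a batch of 10. If the other machine ran the same code with the same batch, the GEMM shapes are identical on both, which points toward differences between the machines (drivers, CUDA patches, free GPU memory) rather than the model.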

I fixed this issue just by installing the patches for the CUDA Toolkit. @Donald-Su @0113bernoyoun

ShuteLee, Aug 07 '19

Hi ShuteLee, this machine has the CUDA Toolkit installed, but it still has the issue:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

Donald-Su, Aug 07 '19
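A small sketch for comparing the nvcc output above with the CUDA release a TensorFlow build expects. It assumes the `release X.Y, VX.Y.Z` line format shown above; `EXPECTED_CUDA = "9.0"` reflects my understanding that the official tensorflow-gpu 1.12.0 wheels were built against CUDA 9.0, and `cuda_release` is a hypothetical helper. Note that nvcc reports the toolkit release only, so this check likely cannot tell whether the separate CUDA 9.0 patches are installed.

```python
import re

# CUDA release the official tensorflow-gpu 1.12.0 wheels were built against
# (assumption based on TensorFlow's tested build table; adjust for your build).
EXPECTED_CUDA = "9.0"

# nvcc --version output pasted in the comment above.
NVCC_OUTPUT = """\
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
"""


def cuda_release(nvcc_text: str) -> str:
    """Extract 'X.Y' from the 'release X.Y, VX.Y.Z' line of nvcc --version."""
    match = re.search(r"release (\d+\.\d+)", nvcc_text)
    if match is None:
        raise ValueError("no 'release X.Y' line found in nvcc output")
    return match.group(1)


if __name__ == "__main__":
    found = cuda_release(NVCC_OUTPUT)
    status = "matches" if found == EXPECTED_CUDA else "differs from"
    print(f"CUDA {found} {status} the expected CUDA {EXPECTED_CUDA}")
```

On a real machine, pass `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout` (Python 3.7+) instead of the pasted text.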

Please make sure that you have installed all four patches:

https://developer.nvidia.com/cuda-90-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=runfilelocal

ShuteLee, Aug 07 '19

There is no package for my OS, Ubuntu 18.04.

Donald-Su, Aug 08 '19

So maybe CUDA Toolkit 9.0 is not fully compatible with your Ubuntu 18.04. You could choose a more recent version.

ShuteLee, Aug 08 '19
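One caveat when moving to a newer CUDA: prebuilt tensorflow-gpu wheels are tied to a specific CUDA release, so a newer toolkit usually also means a newer TensorFlow, or building TensorFlow from source. The mapping below is my recollection of TensorFlow's tested build configurations for a few 1.x releases; treat it as an assumption and check it against the official table. `required_cuda` is a hypothetical helper.

```python
# CUDA release used by the official tensorflow-gpu wheels, per TensorFlow's
# tested build configurations (assumed here; verify against the official table).
TF_TO_CUDA = {
    "1.12": "9.0",
    "1.13": "10.0",
    "1.14": "10.0",
    "1.15": "10.0",
}


def required_cuda(tf_version: str) -> str:
    """Return the CUDA release for a 'major.minor[.patch]' TensorFlow version."""
    major_minor = ".".join(tf_version.split(".")[:2])
    try:
        return TF_TO_CUDA[major_minor]
    except KeyError:
        raise ValueError(f"no CUDA entry for TensorFlow {tf_version}") from None


if __name__ == "__main__":
    for version in ("1.12.0", "1.13.1"):
        print(version, "-> CUDA", required_cuda(version))
```

Under this mapping, installing CUDA 10.0 on Ubuntu 18.04 would pair with TensorFlow 1.13 or later rather than with tensorflow-gpu 1.12.0.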

Have you made sure TensorFlow is at version 1.12.0, as specified in main.sh?

pip install tensorflow-gpu==1.12.0

bryanbocao, Dec 29 '20
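To check which TensorFlow distribution is actually installed without importing it, here is a sketch using the standard library's `importlib.metadata` (Python 3.8+), falling back to setuptools' `pkg_resources` on older interpreters such as those TensorFlow 1.12 supports; `pip show tensorflow-gpu` gives the same information from the shell. `installed_version` is a hypothetical helper, and it checks both the `tensorflow-gpu` and `tensorflow` distributions since either may be present.

```python
from typing import Optional

REQUIRED = "1.12.0"  # version named in the comment above


def installed_version(dist_name: str) -> Optional[str]:
    """Installed version of a pip distribution, or None if it is absent."""
    try:
        from importlib import metadata  # Python 3.8+
    except ImportError:
        import pkg_resources  # setuptools fallback for older Pythons
        try:
            return pkg_resources.get_distribution(dist_name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for dist in ("tensorflow-gpu", "tensorflow"):
        found = installed_version(dist)
        if found is None:
            print(f"{dist}: not installed")
        elif found == REQUIRED:
            print(f"{dist}: {found} (OK)")
        else:
            print(f"{dist}: {found} (expected {REQUIRED})")
```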