AutonomousDrivingCookbook
Help! training not starting! #urgent
My training is not starting. I am using Python 3.6 with tensorflow-gpu 1.8.0 and Keras 2.1.2. My computer has a GeForce GTX 3060, so the hardware shouldn't be a problem. I also installed Norton antivirus on this new computer. On my older computer, which has a bad GPU, I had Panda Dome, and training did run there, but after more than an hour it was only at 1%. That's why I bought a new computer with a good GPU and CPU. Some of this work is going to be presented in my master's thesis. I would appreciate any help soon.
I now get this error:
InternalError                             Traceback (most recent call last)
C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
   1321     try:
-> 1322       return fn(*args)
   1323     except errors.OpError as e:

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1306       return self._call_tf_sessionrun(
-> 1307           options, feed_dict, fetch_list, target_list, run_metadata)
   1308

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1408           self._session, options, feed_dict, fetch_list, target_list,
-> 1409           run_metadata)
   1410     else:

InternalError: Blas GEMM launch failed : a.shape=(30, 64), b.shape=(64, 10), m=30, n=10, k=64
[[Node: dense2/MatMul = MatMul[T=DT_FLOAT, _class=["loc:@training/Nadam/gradients/dropout_2/cond/Merge_grad/cond_grad"], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dropout_2/cond/Merge, dense2/kernel/read)]]
[[Node: loss/mul/_129 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1107_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
InternalError Traceback (most recent call last)
----> 2 validation_data=eval_generator, validation_steps=num_eval_examples//batch_size, verbose=2)
C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\legacy\interfaces.py in wrapper(*args, **kwargs)
     85                 warnings.warn('Update your `' + object_name +
     86                               '` call to the Keras 2 API: ' + signature, stacklevel=2)
---> 87             return func(*args, **kwargs)
     88         wrapper._original_function = func
     89         return wrapper
C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\engine\training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   2145                 outs = self.train_on_batch(x, y,
   2146                                            sample_weight=sample_weight,
-> 2147                                            class_weight=class_weight)
   2148
   2149                 if not isinstance(outs, list):

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\engine\training.py in train_on_batch(self, x, y, sample_weight, class_weight)
   1837         ins = x + y + sample_weights
   1838         self._make_train_function()
-> 1839         outputs = self.train_function(ins)
   1840         if len(outputs) == 1:
   1841             return outputs[0]

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\keras\backend\tensorflow_backend.py in __call__(self, inputs)
   2355         session = get_session()
   2356         updated = session.run(fetches=fetches, feed_dict=feed_dict,
-> 2357                               **self.session_kwargs)
   2358         return updated[:len(self.outputs)]
   2359

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in run(self, fetches, feed_dict, options, run_metadata)
    898     try:
    899       result = self._run(None, fetches, feed_dict, options_ptr,
--> 900                          run_metadata_ptr)
    901       if run_metadata:
    902         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1133     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1134       results = self._do_run(handle, final_targets, final_fetches,
-> 1135                              feed_dict_tensor, options, run_metadata)
   1136     else:
   1137       results = []

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1314     if handle is None:
   1315       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1316                            run_metadata)
   1317     else:
   1318       return self._do_call(_prun_fn, handle, feeds, fetches)

C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
   1333         except KeyError:
   1334           pass
-> 1335       raise type(e)(node_def, op, message)
   1336
   1337   def _extend_graph(self):

InternalError: Blas GEMM launch failed : a.shape=(30, 64), b.shape=(64, 10), m=30, n=10, k=64
[[Node: dense2/MatMul = MatMul[T=DT_FLOAT, _class=["loc:@training/Nadam/gradients/dropout_2/cond/Merge_grad/cond_grad"], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dropout_2/cond/Merge, dense2/kernel/read)]]
[[Node: loss/mul/_129 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1107_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'dense2/MatMul', defined at:
  File "C:\ProgramData\anaconda3\envs\airsim\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\anaconda3\envs\airsim\lib\site-packages\ipykernel_launcher.py", line 16, in <module>

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(30, 64), b.shape=(64, 10), m=30, n=10, k=64
[[Node: dense2/MatMul = MatMul[T=DT_FLOAT, _class=["loc:@training/Nadam/gradients/dropout_2/cond/Merge_grad/cond_grad"], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dropout_2/cond/Merge, dense2/kernel/read)]]
[[Node: loss/mul/_129 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1107_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
@mitchellspryn please help
@adshar
depencies.txt Here is the list of dependencies I have in my anaconda env:
I am not at MSFT currently, so I am not actively supporting this repo any more.
That said, I took a look at your stack trace. It looks like CUDA isn't installed properly. Relevant portion:
InternalError: Blas GEMM launch failed : a.shape=(30, 64), b.shape=(64, 10), m=30, n=10, k=64
[[Node: dense2/MatMul = MatMul[T=DT_FLOAT, _class=["loc:@training/Nadam/gradients/dropout_2/cond/Merge_grad/cond_grad"], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dropout_2/cond/Merge, dense2/kernel/read)]]
I'd check whether you can run any Keras training operation at all, e.g. try training a linear model on some random data points and see whether forward and backpropagation work properly. My guess is no, and that will help you debug what the situation is with your CUDA install.
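The check suggested above could be sketched roughly as follows. This is a minimal sketch and not code from the cookbook: the function name `run_sanity_check`, the data sizes, and the choice of SGD with a mean squared error loss are all assumptions made for illustration. It fits a single `Dense` layer to random linear data and returns a status string instead of raising, so the result is easy to paste into the thread. A `Dense` layer is a matrix multiply, so if the GPU math libraries are not working, an error similar to `Blas GEMM launch failed` would be expected here too, with none of the cookbook code involved.

```python
def run_sanity_check(num_samples=300, num_features=4, epochs=3):
    """Fit a one-layer linear model on random data and report the outcome.

    All names and sizes are illustrative assumptions, not part of the cookbook.
    """
    try:
        import numpy as np
        from keras.models import Sequential
        from keras.layers import Dense
    except Exception as exc:
        # Broad on purpose: broken native libraries do not always raise ImportError.
        return "import failed: {}: {}".format(type(exc).__name__, exc)

    # Random inputs, with targets that follow a known linear rule plus small noise.
    rng = np.random.RandomState(0)
    x = rng.rand(num_samples, num_features).astype("float32")
    true_weights = rng.rand(num_features, 1).astype("float32")
    noise = 0.01 * rng.randn(num_samples, 1).astype("float32")
    y = x.dot(true_weights) + noise

    try:
        model = Sequential()
        model.add(Dense(1, input_shape=(num_features,)))
        model.compile(optimizer="sgd", loss="mse")
        # Both the forward pass and the gradient step go through a matrix multiply.
        history = model.fit(x, y, batch_size=30, epochs=epochs, verbose=0)
    except Exception as exc:
        return "training failed: {}: {}".format(type(exc).__name__, exc)

    losses = history.history["loss"]
    return "training ran, loss {:.4f} -> {:.4f}".format(losses[0], losses[-1])
```

Running `print(run_sanity_check())` in a fresh notebook cell inside the same conda environment gives one of three outcomes: an import failure, a training failure (with the exception type and message), or a pair of loss values. A training failure here points at the environment rather than at the tutorial notebooks.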
Thank you for answering. I have tried reinstalling to check whether it has something to do with CUDA. I also tried installing cudatoolkit and cuDNN before installing TensorFlow, following these steps:
conda install cudatoolkit=9.0
conda install cudnn=7.1.4=cuda9.0_0
conda install -c anaconda tensorflow-gpu=1.8.0
conda install -c anaconda keras-gpu=2.1.2
python -m pip install --upgrade pip
conda update -n base conda
pip install msgpack-rpc-python
pip uninstall tornado
conda install -c conda-forge tornado=4.5.3
conda install jupyter
pip install matplotlib==2.1.2
pip install image
pip install keras_tqdm
conda install -c conda-forge opencv
conda install pandas
pip install --upgrade numpy==1.16.4
conda install scipy
pip install opencv-python
pip install --upgrade h5py==2.10.0
python -m ipykernel install --user
I still have the same problem. Do you have any idea how I can solve this? I have really tried to look it up, and it seems many people have had the same problem, but none of the suggested solutions worked for me. As I am using this as part of my master's thesis, I also have limited time.
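When a reinstall does not change the error, it may help to look at what TensorFlow itself reports about the GPU. The snippet below is a hedged sketch: the helper name `describe_tf_devices` is made up for illustration, and it relies on `device_lib.list_local_devices()` from `tensorflow.python.client`, which lists the devices the installed TensorFlow build can use. For GPUs, the description string normally includes the device name and its compute capability.

```python
def describe_tf_devices():
    """Return human-readable lines describing the devices TensorFlow can see.

    The function name is a hypothetical helper, not part of TensorFlow or the cookbook.
    """
    try:
        import tensorflow as tf
        from tensorflow.python.client import device_lib
    except Exception as exc:
        # Broad on purpose: a broken CUDA/cuDNN setup can fail in several ways on import.
        return ["tensorflow import failed: {}: {}".format(type(exc).__name__, exc)]

    lines = ["tensorflow version: {}".format(tf.__version__)]
    try:
        for device in device_lib.list_local_devices():
            lines.append("{} ({})".format(device.name, device.device_type))
            if device.physical_device_desc:
                # For GPUs this usually contains the model name and compute capability.
                lines.append("  " + device.physical_device_desc)
    except Exception as exc:
        lines.append("device query failed: {}: {}".format(type(exc).__name__, exc))
    return lines
```

Printing the result with `print("\n".join(describe_tf_devices()))` shows whether a GPU device is listed at all. If one is, the reported compute capability is worth comparing against what the installed CUDA toolkit supports, since CUDA 9.0 was released before the GeForce 30-series cards.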