simrdwn

CUDA driver version is insufficient for CUDA runtime version

Open iboates opened this issue 5 years ago • 3 comments

I cannot get simrdwn to train. TensorFlow reports that my CUDA driver version is insufficient for the CUDA runtime version. I know this may not be a problem with this repository specifically, but everything seems to be configured properly on my end, so I am at a loss to explain this behaviour.

I first tried the default repository configuration and got this same error. I am only on CUDA 9.1 because I changed the first line of the Dockerfile from

nvidia/cuda:9.0-devel-ubuntu16.04

to

nvidia/cuda:9.1-devel-ubuntu16.04

This is the error I get:

Traceback (most recent call last):
  File "/tensorflow/models/research/object_detection/model_main.py", line 109, in <module>
    tf.app.run()
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/tensorflow/models/research/object_detection/model_main.py", line 105, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
    return self.run_local()
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
    saving_listeners=saving_listeners)
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
    saving_listeners)
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1403, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 508, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 934, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1122, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1127, in _create_session
    return self._sess_creator.create_session()
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 805, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 571, in create_session
    init_fn=self._scaffold.init_fn)
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 281, in prepare_session
    config=config)
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 184, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1551, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/opt/conda/envs/simrdwn/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 676, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
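The CUDA runtime raises this error when the installed driver supports an older CUDA version than the runtime library the process has loaded. The following is a minimal sketch for printing both numbers from inside the container. It assumes that `libcudart` can be loaded through `ctypes`; `find_library("cudart")` may pick up a different copy than the one TensorFlow uses, so passing the exact path is safer. The helper names are hypothetical. `cudaDriverGetVersion` and `cudaRuntimeGetVersion` both encode versions as `1000*major + 10*minor`.

```python
import ctypes
import ctypes.util


def decode_cuda_version(code):
    """Turn an int such as 9010 into (9, 1); CUDA encodes 1000*major + 10*minor."""
    return code // 1000, (code % 1000) // 10


def query_cuda_versions(lib_path=None):
    """Return (driver, runtime) as (major, minor) tuples, read through libcudart.

    Assumption: lib_path should point at the libcudart TensorFlow loads; if it is
    omitted, ctypes.util.find_library("cudart") is used and may find another copy.
    """
    path = lib_path or ctypes.util.find_library("cudart")
    if path is None:
        raise OSError("libcudart not found; pass lib_path explicitly")
    cudart = ctypes.CDLL(path)
    driver = ctypes.c_int(0)
    runtime = ctypes.c_int(0)
    # Both calls return a cudaError_t; 0 is cudaSuccess.
    if cudart.cudaDriverGetVersion(ctypes.byref(driver)) != 0:
        raise RuntimeError("cudaDriverGetVersion failed")
    if cudart.cudaRuntimeGetVersion(ctypes.byref(runtime)) != 0:
        raise RuntimeError("cudaRuntimeGetVersion failed")
    return decode_cuda_version(driver.value), decode_cuda_version(runtime.value)


def driver_is_sufficient(driver, runtime):
    """The 'driver version is insufficient' error corresponds to driver < runtime."""
    return driver >= runtime


if __name__ == "__main__":
    drv, rt = query_cuda_versions()
    print("driver supports CUDA %d.%d, runtime is CUDA %d.%d" % (drv + rt))
    print("OK" if driver_is_sufficient(drv, rt) else "driver too old for this runtime")
```

The runtime version that this reports belongs to the library the process actually loaded, which is not necessarily the toolkit version that `nvcc` reports.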

This is the output of nvcc --version (run from inside the container):

Cuda compilation tools, release 9.1, V9.1.85

(Again, I know the Dockerfile specifies 9.0, but I was getting the same error with it, which is why I tried bumping the version.)

This is the output of nvidia-smi (run from outside the container):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.116                Driver Version: 390.116                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce 845M        Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   52C    P0    N/A /  N/A |    167MiB /  2004MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1748      G   /usr/lib/xorg/Xorg                           166MiB |
+-----------------------------------------------------------------------------+
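The two outputs above can be reduced to comparable values. The following is a minimal sketch that assumes the text is captured exactly as printed; the function names are hypothetical. Note that `nvcc` reports the toolkit installed in the image. If another copy of the runtime library is on the library path, the `libcudart` that TensorFlow loads at run time can be a different version.

```python
import re


def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output, e.g. 'release 9.1, V9.1.85'."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no 'release X.Y' in nvcc output")
    return int(match.group(1)), int(match.group(2))


def parse_smi_driver(text):
    """Extract the driver version string from the nvidia-smi banner, e.g. '390.116'."""
    match = re.search(r"Driver Version:\s*([\d.]+)", text)
    if match is None:
        raise ValueError("no 'Driver Version:' in nvidia-smi output")
    return match.group(1)


nvcc_out = "Cuda compilation tools, release 9.1, V9.1.85"
smi_out = "| NVIDIA-SMI 390.116                Driver Version: 390.116                   |"
print(parse_nvcc_release(nvcc_out), parse_smi_driver(smi_out))  # (9, 1) 390.116
```

`nvidia-smi --query-gpu=driver_version --format=csv,noheader` prints only the driver version, so you do not have to parse the banner.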

And according to the CUDA Toolkit release notes, these should be compatible:

CUDA Toolkit      | Linux x86_64 Driver Version
CUDA 9.1 (9.1.85) | >= 390.46
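The check in this table can be written out directly. Compare driver versions field by field as integers: as strings, "390.116" sorts before "390.46". The following is a minimal sketch. Only the 9.1 row comes from the table above. The other minimums are taken from the CUDA Toolkit release notes for those versions and should be re-checked there. `MIN_LINUX_DRIVER` and the helper names are hypothetical.

```python
# Minimum Linux x86_64 driver per CUDA toolkit (major, minor). Only the 9.1 row
# is from the table above; the others are assumptions to re-check in the release notes.
MIN_LINUX_DRIVER = {
    (9, 0): "384.81",
    (9, 1): "390.46",
    (10, 0): "410.48",
    (10, 1): "418.39",
}


def version_key(version):
    """'390.116' -> (390, 116), so fields compare numerically rather than as text."""
    return tuple(int(part) for part in version.split("."))


def driver_supports(driver, cuda):
    """True if `driver` meets the listed minimum for CUDA toolkit `cuda`."""
    return version_key(driver) >= version_key(MIN_LINUX_DRIVER[cuda])


print(driver_supports("390.116", (9, 1)))   # True
print(driver_supports("390.116", (10, 0)))  # False
print("390.116" >= "390.46")                # False: plain string comparison gets it wrong
```

Under these numbers, a 390.x driver is enough for a CUDA 9.1 runtime but not for a CUDA 10.x runtime. Whether 390.116 works therefore depends on which runtime TensorFlow actually loads.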

So, since I have driver version 390.116 and CUDA Toolkit 9.1, I can't explain why the container keeps throwing this error.

Do you have any idea?

iboates, Jul 28 '19 11:07

You need to upgrade your graphics driver. My driver version is 418.39.

younkun, Jul 30 '19 03:07

What is your graphics card model?

iboates, Jul 30 '19 19:07

> What is your graphics card model?

P100.

younkun, Jul 31 '19 06:07