Error running on GPU: device renaming issue?

Open FranckLejzerowicz opened this issue 4 years ago • 2 comments

Hi,

So here's a command run on a GPU node in an interactive Slurm srun session:

$ rhapsody mmvec \
   --microbe-file A.biom \
   --metabolite-file B.biom  \
   --min-feature-count 5  \
   --epochs 20000 \
   --batch-size 1000  \
   --latent-dim 3  \
   --input-prior 1  \
   --learning-rate 1e-4  \
   --beta1 0.85 \
   --beta2 0.90  \
   --checkpoint-interval 60  \
   --summary-interval 60 \
   --arm-the-gpu  \
   --summary-dir gpu_1000_1e-4_20000  \
   --ranks-file gpu_1000_1e-4_20000/ranks.csv

The (long) error (sorry):


WARNING: Logging before flag parsing goes to stderr.
W0828 12:38:30.259999 140077172123456 deprecation_wrapper.py:119] From /home/flejzerowicz/rhapsody_ve_new/bin/rhapsody:156: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0828 12:38:30.262325 140077172123456 deprecation_wrapper.py:119] From /home/flejzerowicz/rhapsody_ve_new/bin/rhapsody:157: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-08-28 12:38:30.262596: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-08-28 12:38:30.273506: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-08-28 12:38:32.273961: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560b1e030b60 executing computations on platform CUDA. Devices:
2019-08-28 12:38:32.274039: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla V100-PCIE-32GB, Compute Capability 7.0
2019-08-28 12:38:32.291287: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2019-08-28 12:38:32.294314: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560b1d6caf10 executing computations on platform Host. Devices:
2019-08-28 12:38:32.294405: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-08-28 12:38:32.297357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:5e:00.0
2019-08-28 12:38:32.298520: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/slurm-18.08.0/lib::/home/flejzerowicz/local/lib:/home/flejzerowicz/local/lib64:/home/flejzerowicz/openssl/lib:/home/flejzerowicz/usr/lib/lib/:/home/flejzerowicz/local/lib:/home/flejzerowicz/local/lib64
2019-08-28 12:38:32.299494: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/slurm-18.08.0/lib::/home/flejzerowicz/local/lib:/home/flejzerowicz/local/lib64:/home/flejzerowicz/openssl/lib:/home/flejzerowicz/usr/lib/lib/:/home/flejzerowicz/local/lib:/home/flejzerowicz/local/lib64
2019-08-28 12:38:32.300329: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/slurm-18.08.0/lib::/home/flejzerowicz/local/lib:/home/flejzerowicz/local/lib64:/home/flejzerowicz/openssl/lib:/home/flejzerowicz/usr/lib/lib/:/home/flejzerowicz/local/lib:/home/flejzerowicz/local/lib64
2019-08-28 12:38:32.301209: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/slurm-18.08.0/lib::/home/flejzerowicz/local/lib:/home/flejzerowicz/local/lib64:/home/flejzerowicz/openssl/lib:/home/flejzerowicz/usr/lib/lib/:/home/flejzerowicz/local/lib:/home/flejzerowicz/local/lib64
2019-08-28 12:38:32.302105: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/slurm-18.08.0/lib::/home/flejzerowicz/local/lib:/home/flejzerowicz/local/lib64:/home/flejzerowicz/openssl/lib:/home/flejzerowicz/usr/lib/lib/:/home/flejzerowicz/local/lib:/home/flejzerowicz/local/lib64
2019-08-28 12:38:32.302962: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/slurm-18.08.0/lib::/home/flejzerowicz/local/lib:/home/flejzerowicz/local/lib64:/home/flejzerowicz/openssl/lib:/home/flejzerowicz/usr/lib/lib/:/home/flejzerowicz/local/lib:/home/flejzerowicz/local/lib64
2019-08-28 12:38:32.304020: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/slurm-18.08.0/lib::/home/flejzerowicz/local/lib:/home/flejzerowicz/local/lib64:/home/flejzerowicz/openssl/lib:/home/flejzerowicz/usr/lib/lib/:/home/flejzerowicz/local/lib:/home/flejzerowicz/local/lib64
2019-08-28 12:38:32.304122: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2019-08-28 12:38:32.304182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-28 12:38:32.304231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-08-28 12:38:32.304265: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
W0828 12:38:32.641206 140077172123456 deprecation_wrapper.py:119] From /home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/rhapsody/multimodal.py:94: The name tf.log is deprecated. Please use tf.math.log instead.

W0828 12:38:32.643565 140077172123456 deprecation.py:323] From /home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/rhapsody/multimodal.py:95: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
W0828 12:38:32.655179 140077172123456 deprecation_wrapper.py:119] From /home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/rhapsody/multimodal.py:106: The name tf.random_normal is deprecated. Please use tf.random.normal instead.

W0828 12:38:32.694295 140077172123456 deprecation.py:323] From /home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/rhapsody/multimodal.py:122: Normal.__init__ (from tensorflow.python.ops.distributions.normal) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
W0828 12:38:32.695811 140077172123456 deprecation.py:323] From /home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/tensorflow/python/ops/distributions/normal.py:160: Distribution.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
W0828 12:38:32.724381 140077172123456 deprecation.py:323] From /home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/rhapsody/multimodal.py:139: Multinomial.__init__ (from tensorflow.python.ops.distributions.multinomial) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
W0828 12:38:32.802299 140077172123456 deprecation_wrapper.py:119] From /home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/rhapsody/multimodal.py:187: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

W0828 12:38:32.805364 140077172123456 deprecation_wrapper.py:119] From /home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/rhapsody/multimodal.py:189: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.

W0828 12:38:32.810857 140077172123456 deprecation_wrapper.py:119] From /home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/rhapsody/multimodal.py:193: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

W0828 12:38:32.812450 140077172123456 deprecation_wrapper.py:119] From /home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/rhapsody/multimodal.py:195: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

W0828 12:38:32.851014 140077172123456 deprecation_wrapper.py:119] From /home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/rhapsody/multimodal.py:200: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

W0828 12:38:33.204426 140077172123456 deprecation.py:323] From /home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/tensorflow/python/ops/clip_ops.py:286: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0828 12:38:33.331943 140077172123456 deprecation_wrapper.py:119] From /home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/rhapsody/multimodal.py:210: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

Traceback (most recent call last):
  File "/home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1339, in _run_fn
    self._extend_graph()
  File "/home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1374, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation random_normal/RandomStandardNormal: {{node random_normal/RandomStandardNormal}}was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0 ]. Make sure the device specification refers to a valid device.
	 [[random_normal/RandomStandardNormal]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/flejzerowicz/rhapsody_ve_new/bin/rhapsody", line 221, in <module>
    rhapsody()
  File "/home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/flejzerowicz/rhapsody_ve_new/bin/rhapsody", line 168, in mmvec
    test_microbes_coo, test_metabolites_df.values)
  File "/home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/rhapsody/multimodal.py", line 210, in __call__
    tf.global_variables_initializer().run()
  File "/home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2679, in run
    _run_using_default_session(self, feed_dict, self.graph, session)
  File "/home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 5614, in _run_using_default_session
    session.run(operation, feed_dict)
  File "/home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/home/flejzerowicz/rhapsody_ve_new/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation random_normal/RandomStandardNormal: node random_normal/RandomStandardNormal (defined at /lib/python3.6/site-packages/rhapsody/multimodal.py:106) was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0 ]. Make sure the device specification refers to a valid device.
	 [[random_normal/RandomStandardNormal]]
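One possible reading of the log above: every `Could not dlopen library` line names a CUDA library that the loader could not find on `LD_LIBRARY_PATH`, after which TensorFlow prints `Skipping registering GPU devices...`, and that would explain why `/device:GPU:0` is absent from the available devices. Below is a minimal sketch for checking this, assuming that reading is correct. The helper names `missing_libraries` and `can_load` are hypothetical and not part of rhapsody or TensorFlow; the script uses only the Python standard library.

```python
import ctypes
import re

# Matches TensorFlow 1.x loader messages such as:
#   Could not dlopen library 'libcudart.so.10.0'; dlerror: ...
DLOPEN_FAILURE = re.compile(r"Could not dlopen library '([^']+)'")


def missing_libraries(log_text):
    """Return library names the log reports as failing to load (ordered, no duplicates)."""
    seen = []
    for name in DLOPEN_FAILURE.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


def can_load(name):
    """Try to load a shared library through the current loader search path."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # Shortened sample lines; in practice, paste the real stderr output here.
    sample = (
        "... Could not dlopen library 'libcudart.so.10.0'; dlerror: ...\n"
        "... Could not dlopen library 'libcudnn.so.7'; dlerror: ...\n"
    )
    for lib in missing_libraries(sample):
        print(lib, "loadable" if can_load(lib) else "NOT loadable")
```

Running this inside the same srun session, with the same environment as the failing command, would show whether the libraries are still unreachable there or whether the loader path differs from what the log recorded.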

Note the possibly relevant sinfo output:

$ sinfo -p gpu -N -o "%c %D %G %m %P"

CPUS NODES GRES MEMORY PARTITION
32 1 gpu:1 94208 gpu
32 1 gpu:1 94208 gpu

Any help greatly appreciated :) Thanks! Franck

FranckLejzerowicz, Aug 28 '19 19:08