
Broken with deeplabv3+

Open · wingated opened this issue 6 years ago · 1 comment

tl;dr: the code works fine without nGraph; with nGraph enabled, it dies with the errors shown below.

Details:

I've been trying to get nGraph working with Google's DeepLab v3+, without any luck. The code is run inside a Docker container (the nvcr.io/nvidia/tensorflow:18.12-py3 image) on an NVIDIA DGX-2 (16 GPUs).

Versions:

Python: 3.5.2 (default, Nov 12 2018, 13:43:14) [GCC 5.4.0 20160609] on linux
TensorFlow version installed: 1.12.0 (unknown)
nGraph bridge built with: 1.12.0 (v1.12.0-0-ga6d8ffa)
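(For reference, a version report like the one above can be reproduced with a snippet along these lines; the `ngraph_bridge.__version__` attribute is an assumption about the bridge's Python API and may differ between releases.)

import sys
import tensorflow as tf
import ngraph_bridge  # importing the bridge also registers it with TensorFlow

# Python build string, e.g. "3.5.2 (default, Nov 12 2018, 13:43:14) ..."
print("Python:", sys.version)

# TensorFlow release and the git tag it was built from
print("TensorFlow:", tf.__version__, tf.__git_version__)

# Assumed attribute; fall back to "unknown" if this bridge build lacks it
print("nGraph bridge:", getattr(ngraph_bridge, "__version__", "unknown"))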

The docker container was started with the following command line:

nvidia-docker run -it \
    --rm \
    --shm-size=1g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --privileged=true \
    -v /raid/wingated:/raid/wingated \
    -v /home/wingated:/home/wingated \
    -v /mnt/pccfs:/mnt/pccfs \
    nvcr.io/nvidia/tensorflow:18.12-py3

Here are the errors. I have no idea how to diagnose this. :)

[snip]
INFO:tensorflow:Restoring parameters from /raid/wingated/cancer/deeplab_data/init_models/xception/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /raid/wingated/cancer/deeplab_data/logs/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Error reported to Coordinator: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 495, in run
    self.run_loop()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/supervisor.py", line 1034, in run_loop
    self._sv.global_step])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster

wingated avatar Feb 10 '19 00:02 wingated
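The error above indicates that the bridge grouped ops with different device assignments (GPU:1 and CPU:0) into a single nGraph cluster. A minimal sketch of one way to avoid mixed placements, assuming the graph can tolerate running entirely on the CPU; DeepLab's training script has its own device-placement logic, so this is illustrative rather than a drop-in fix:

import tensorflow as tf

# Build the graph under one explicit, uniform device assignment so that
# every op the bridge might cluster together shares the same placement.
with tf.device('/cpu:0'):
    x = tf.placeholder(tf.float32, shape=[None, 10])
    w = tf.Variable(tf.random_normal([10, 1]))
    y = tf.matmul(x, w)

# allow_soft_placement lets TensorFlow relocate ops whose requested
# placement cannot be honored instead of failing outright.
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[0.0] * 10]}).shape)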

nGraph won't work with TensorFlow built for GPU. It will only work with TensorFlow built for CPU.

avijit-nervana avatar Feb 13 '19 13:02 avijit-nervana
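Given that constraint, a quick sketch for checking which TensorFlow build is installed before pulling in the bridge; tf.test.is_built_with_cuda() is standard TensorFlow, while the assumption that importing ngraph_bridge is what activates it reflects the bridge's typical usage and may vary by release:

import tensorflow as tf

if tf.test.is_built_with_cuda():
    # GPU build of TensorFlow: skip the bridge to avoid mixed-device clusters.
    print("TensorFlow was built with CUDA; not loading the nGraph bridge.")
else:
    # CPU-only build: importing the bridge registers it for subsequent sessions.
    import ngraph_bridge  # noqa: F401
    print("CPU-only TensorFlow detected; nGraph bridge loaded.")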