Loading pretrained tensorflow model

Open AWilcke opened this issue 7 years ago • 8 comments

Using DIGITS-6.0-rc, with the new TensorFlow support, I am trying to port nets from TF-slim into DIGITS. I have implemented InceptionV1, and it trains from scratch in DIGITS; however, when I try to load a pretrained model, I get the following error.

2017-08-22 11:22:00 [INFO] Train batch size is 16 and validation batch size is 16
2017-08-22 11:22:00 [INFO] Training epochs to be completed for each validation : 1
2017-08-22 11:22:00 [INFO] Training epochs to be completed before taking a snapshot : 1.0
2017-08-22 11:22:00 [INFO] Model weights will be saved as snapshot_<EPOCH>_Model.ckpt
2017-08-22 11:22:00 [INFO] Loading mean tensor from /jobs/20170613-080203-895a/mean.binaryproto file
2017-08-22 11:22:00 [INFO] Loading label definitions from /jobs/20170613-080203-895a/labels.txt file
2017-08-22 11:22:00 [INFO] Found 69 classes
2017-08-22 11:22:00 [INFO] Found 4189 images in db /jobs/20170613-080203-895a/train_db
2017-08-22 11:22:00.593158: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-22 11:22:00.593188: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-22 11:22:00.593201: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-22 11:22:00.593213: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-22 11:22:00.593226: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-08-22 11:22:00.836922: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 970
major: 5 minor: 2 memoryClockRate (GHz) 1.253
pciBusID 0000:04:00.0
Total memory: 3.94GiB
Free memory: 3.88GiB
2017-08-22 11:22:00.836979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-08-22 11:22:00.836997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y
2017-08-22 11:22:00.837029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:04:00.0)
2017-08-22 11:22:02 [INFO] Optimizer:sgd
2017-08-22 11:22:03 [INFO] Found 1429 images in db /jobs/20170613-080203-895a/val_db
2017-08-22 11:22:03.471877: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:04:00.0)
2017-08-22 11:22:03.973740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:04:00.0)
2017-08-22 11:22:04 [INFO] Loading weights from pretrained model - /path/to/ckpt/inception_v1.ckpt
2017-08-22 11:22:05 [INFO] NOT restoring global_step -> global_step:0
2017-08-22 11:22:05 [INFO] Restoring 0 variable ops.
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/digits/tools/tensorflow/main.py", line 707, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/usr/local/lib/python2.7/dist-packages/digits/tools/tensorflow/main.py", line 544, in main
load_snapshot(sess, FLAGS.weights, tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES))
File "/usr/local/lib/python2.7/dist-packages/digits/tools/tensorflow/main.py", line 264, in load_snapshot
tf.train.Saver(vars_restore, max_to_keep=0, sharded=FLAGS.serving_export).restore(sess, weight_path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1139, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1161, in build
raise ValueError("No variables to save")
ValueError: No variables to save

Would it be possible to add support for restoring from non-DIGITS checkpoints?
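
One possible reading of the log above: "Restoring 0 variable ops." suggests that no variable in the graph matched a name stored in the checkpoint (apart from global_step, which is skipped), so the Saver was built with an empty list and raised "No variables to save". Below is a minimal plain-Python sketch of that kind of exact-name matching. The function name `split_restorable`, the `model/` scope, and the example variable names are hypothetical and only illustrate the idea; this is not DIGITS code.

```python
def split_restorable(graph_var_names, ckpt_var_names):
    """Partition graph variables by whether the checkpoint has the same name.

    graph_var_names: names as TensorFlow reports them, e.g. "scope/weights:0"
    ckpt_var_names:  names stored in the checkpoint, e.g. "scope/weights"
    """
    ckpt = set(ckpt_var_names)
    restorable, skipped = [], []
    for name in graph_var_names:
        key = name.split(":")[0]  # drop the ":0" output index
        # global_step is deliberately left out, mirroring the log line above
        if key in ckpt and "global_step" not in key:
            restorable.append(name)
        else:
            skipped.append(name)
    return restorable, skipped


if __name__ == "__main__":
    # Hypothetical names: the graph wraps the net in an extra "model/" scope,
    # while the checkpoint stores the names without it.
    graph = ["model/InceptionV1/Conv2d_1a_7x7/weights:0", "global_step:0"]
    ckpt = ["InceptionV1/Conv2d_1a_7x7/weights", "global_step"]
    restorable, skipped = split_restorable(graph, ckpt)
    print("restorable:", restorable)
    print("skipped:", skipped)
```

If an extra scope prefix like this were the cause, nothing would be restorable until the names were made to agree, for example by building the network under the same scope names the checkpoint uses.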

Aug 22 '17 11:08 AWilcke

TensorFlow is a bit finicky here: the weights are saved not in one file but in three. The latest commit on the master branch now takes that into account, so when you upload, it will process all of the .ckpt files (data, index, meta). The checkpoint is still referenced as .ckpt, though. Basically, when you load pretrained weights, make sure the path you give ends in just .ckpt, and the backend will search for all three files.
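
To make the three-file layout concrete, here is a small standard-library sketch that checks which parts sit next to a given .ckpt prefix. It assumes the usual naming of `<prefix>.index`, `<prefix>.meta`, and `<prefix>.data-XXXXX-of-YYYYY`; the helper names `checkpoint_files` and `missing_parts` are made up for this example and are not part of DIGITS or TensorFlow.

```python
import glob
import os


def checkpoint_files(prefix):
    """Return all files that share a checkpoint prefix such as 'model.ckpt'."""
    return sorted(glob.glob(glob.escape(prefix) + ".*"))


def missing_parts(prefix):
    """List which of the data/index/meta parts are absent for this prefix."""
    base = os.path.basename(prefix)
    # Keep only the suffix after the prefix, e.g. ".index" or ".data-00000-of-00001"
    suffixes = [os.path.basename(p)[len(base):] for p in checkpoint_files(prefix)]
    missing = []
    if not any(s.startswith(".data-") for s in suffixes):
        missing.append("data")
    if ".index" not in suffixes:
        missing.append("index")
    if ".meta" not in suffixes:
        missing.append("meta")
    return missing


if __name__ == "__main__":
    # Hypothetical path; replace with the prefix of your own checkpoint.
    print(missing_parts("/path/to/ckpt/model.ckpt"))
```

An empty list from `missing_parts` means all three parts are present alongside the prefix.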

Aug 25 '17 16:08 ethantang95

Does this still stand for the released v6.0.0? I see only options to upload Torch or Caffe pre-trained models.

Oct 10 '17 15:10 ividal

You should be able to upload Tensorflow pre-trained models now in V6

Oct 10 '17 16:10 ethantang95

If it's meant to be alongside the options for Torch and Caffe, it doesn't seem to be there. Here's a screenshot of my options under "Pretrained models" > "Upload...", for Digits v6.0.0 (installed from the docker hub image: nvidia/digits:latest, which currently points to :6.0.0).

[Screenshot: digits60_snapshot_pretrained]

I'd expect a third "Tensorflow" option to enable loading a checkpoint. Is this incorrect? (And if so, how should one load the tensorflow pre-trained model?)

Thanks for your help!

Oct 13 '17 16:10 ividal

Same issue. When will DIGITS support pre-trained TensorFlow models?

Jan 17 '18 02:01 michaeldong

Hi. I am uploading a pretrained network to DIGITS: [image]. And I am getting this error: [image].

Can I get some help with this?

May 13 '18 08:05 Dexter123193

You should add a labels.txt file. The file can be empty, but it must be present. This should solve your problem.

Jun 18 '19 09:06 DFreeMind

Any updates on this? I am seeing the same two options (Torch and Caffe) when trying to manually upload a pretrained TensorFlow model.

Aug 11 '19 01:08 agoila