MZSR

When I run the large-scale training code, I run into some problems. Could you help me?

Open wh2333 opened this issue 4 years ago • 4 comments

(base) wit@wit:/media/wit/Data1/WH/MZSR-master-new/Large-Scale_Training$ python3 main.py --gpu 0 --trial 2 --step 0
Initialize Training
Build Model MODEL
Initialize weights MODEL
Setting Train Configuration
Model Params: 225 K
========== Reading Checkpoints ============
============= Failed to find a checkpoint =============
========== No model to load ===========
Training Starts
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1278, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Feature: image (data type: string) is required but could not be found.
  [[Node: ParseSingleExample/ParseSingleExample = ParseSingleExample[Tdense=[DT_STRING, DT_STRING], dense_keys=["image", "label"], dense_shapes=[[], []], num_sparse=0, sparse_keys=[], sparse_types=[]](arg0, ParseSingleExample/Const, ParseSingleExample/Const)]]
  [[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,64,64,3], [?,16,16,3]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 49, in <module>
    main()
  File "main.py", line 46, in main
    Trainer.run()
  File "/media/wit/Data1/WH/MZSR-master-new/Large-Scale_Training/train.py", line 88, in run
    label_train_, input_train_ = sess.run([label_train, input_train])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 877, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1100, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1272, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1291, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Feature: image (data type: string) is required but could not be found.
  [[Node: ParseSingleExample/ParseSingleExample = ParseSingleExample[Tdense=[DT_STRING, DT_STRING], dense_keys=["image", "label"], dense_shapes=[[], []], num_sparse=0, sparse_keys=[], sparse_types=[]](arg0, ParseSingleExample/Const, ParseSingleExample/Const)]]
  [[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,64,64,3], [?,16,16,3]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

wh2333 • Oct 27 '20 09:10
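
The `InvalidArgumentError` in the traceback above says that the parser asked for string features named `image` and `label` (the `dense_keys` in the message) and that at least one record did not contain `image`. That usually points at a TFRecord file written with different feature keys, or at a different file than the one the training script expects. A minimal sketch for checking which keys a file actually holds, assuming TensorFlow 2.x in eager mode; `record_feature_keys` and `missing_keys` are hypothetical helper names, not part of MZSR:

```python
def record_feature_keys(tfrecord_path):
    """Return {feature key: value kind} for the first record in the file."""
    import tensorflow as tf  # lazy import; assumes TensorFlow 2.x eager mode

    for raw in tf.data.TFRecordDataset(tfrecord_path).take(1):
        example = tf.train.Example()
        example.ParseFromString(raw.numpy())
        # "kind" is the oneof holding bytes_list, float_list or int64_list.
        return {key: feature.WhichOneof("kind")
                for key, feature in example.features.feature.items()}
    return {}  # the file contains no records


def missing_keys(found_keys, required=("image", "label")):
    """List the keys the parser requires that the record does not provide."""
    return [key for key in required if key not in found_keys]
```

Calling `missing_keys(record_feature_keys(path))` on the training TFRecord, where `path` is whatever file the training script is configured to read, would show whether `image` or `label` is absent. If the keys differ, the file would most likely need to be regenerated with matching key names.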

Dear Sir, amazing work, congratulations! I have a question: could you tell me the full checkpoint path I should set for the trained large-scale training model, so that I can use it as the pre-trained model for meta-transfer training? I look forward to your reply. Thanks in advance.

BassantTolba1234 • Dec 14 '20 20:12

@wh2333, I run into a problem when I load the pretrained model, specifically when it reads the checkpoint. This is the error. How did you solve it, please?

NotFoundError (see above for traceback): Key MODEL/conv7/kernel/Adam_3 not found in checkpoint [[Node: save/RestoreV2_69 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_69/tensor_names, save/RestoreV2_69/shape_and_slices)]]

BassantTolba1234 • Dec 21 '20 08:12
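
The `NotFoundError` above means the saver tried to restore a variable named `MODEL/conv7/kernel/Adam_3`, which by its name is an Adam optimizer slot, and the checkpoint file does not contain it. One possible cause, which is an assumption and not confirmed anywhere in this thread, is that the graph being restored creates more optimizer variables than the graph that wrote the checkpoint. A sketch for listing what the checkpoint really stores; `tf.train.list_variables` is an existing TensorFlow call, while the two helper names are hypothetical:

```python
def checkpoint_variable_names(checkpoint_path):
    """Return the names of all variables stored in a checkpoint."""
    import tensorflow as tf  # lazy import so the helper below works without it

    return [name for name, _shape in tf.train.list_variables(checkpoint_path)]


def split_optimizer_slots(names, marker="Adam"):
    """Separate model variables from Adam slot variables by name.

    A slot is recognised by its last path component starting with `marker`
    (for example "Adam", "Adam_1", "Adam_3"). Other optimizer variables
    such as "beta1_power" are not caught by this simple rule.
    """
    model, slots = [], []
    for name in names:
        last = name.rsplit("/", 1)[-1]
        (slots if last.startswith(marker) else model).append(name)
    return model, slots
```

If the checkpoint holds `MODEL/conv7/kernel` but no `Adam_3` slot, restoring only the model variables (for instance by passing a filtered `var_list` to `tf.train.Saver`) may be enough when the optimizer state does not need to be carried over.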

Could you please explain how these loss weights are calculated?

def get_loss_weights(self):
    loss_weights = tf.ones(shape=[self.TASK_ITER]) * (1.0/self.TASK_ITER)
    decay_rate = 1.0 / self.TASK_ITER / (10000 / 3)
    min_value= 0.03 / self.TASK_ITER

    loss_weights_pre = tf.maximum(loss_weights[:-1] - (tf.multiply(tf.to_float(self.global_step), decay_rate)), min_value)

    loss_weight_cur= tf.minimum(loss_weights[-1] + (tf.multiply(tf.to_float(self.global_step),(self.TASK_ITER- 1) * decay_rate)), 1.0 - ((self.TASK_ITER - 1) * min_value))
    loss_weights = tf.concat([[loss_weights_pre], [[loss_weight_cur]]], axis=1)
    return loss_weights

BassantTolba1234 • Jan 06 '21 10:01
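
Reading the function above line by line: every inner step starts with the same weight `1/TASK_ITER`. As `global_step` grows, each of the first `TASK_ITER - 1` weights decays linearly down to a floor of `0.03/TASK_ITER`, and the last weight grows by exactly the amount the others lose, up to a cap of `1 - (TASK_ITER - 1) * min_value`. The weights therefore always sum to 1, and the floor and the cap are both reached after `0.97 * 10000 / 3`, roughly 3233 steps. A plain-Python sketch of the same arithmetic, without TensorFlow; the function name and the value `task_iter=5` are illustrative assumptions:

```python
def loss_weights(global_step, task_iter):
    """Plain-Python version of the arithmetic in get_loss_weights."""
    base = 1.0 / task_iter                      # initial weight of every step
    decay_rate = 1.0 / task_iter / (10000 / 3)  # per-step decay of early weights
    min_value = 0.03 / task_iter                # floor for the early weights

    # Weights of the first task_iter - 1 inner steps shrink toward the floor.
    pre = max(base - global_step * decay_rate, min_value)
    # The last step absorbs what the others lose, up to a cap.
    cur = min(base + global_step * (task_iter - 1) * decay_rate,
              1.0 - (task_iter - 1) * min_value)
    return [pre] * (task_iter - 1) + [cur]


# Illustrative values only: task_iter=5 is an assumption, not read from MZSR.
for step in (0, 1000, 10000):
    print(step, loss_weights(step, task_iter=5))
```

Early in training all inner steps contribute equally to the loss; later almost all of the weight sits on the final step.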

> @wh2333, I run into a problem when I load the pretrained model, specifically when it reads the checkpoint. This is the error. How did you solve it, please?
>
> NotFoundError (see above for traceback): Key MODEL/conv7/kernel/Adam_3 not found in checkpoint [[Node: save/RestoreV2_69 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_69/tensor_names, save/RestoreV2_69/shape_and_slices)]]

Hi, I ran into the same problem and would really like to know what you did to fix it.

NothingToSay99 • Apr 09 '22 02:04