
kernel crashes when making a checkpoint

Open PCSimo opened this issue 3 months ago • 0 comments

I've been trying to run your demo code `demo_deepcgh.py` with:

- Spyder IDE (Windows 10)
- Python 3.9 environment
- TensorFlow 2.10.0
- CUDA 11.8
- cuDNN 9.0.0

The data and model get generated, but as soon as training starts, TensorFlow states at some point that a checkpoint is being made, and the kernel resets right there without any warning.

Is there a way to circumvent this error? I've followed TensorFlow's procedure for making and loading checkpoints as stated here, and that all went well. But as soon as I run `demo_deepcgh.py`, it just crashes.
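For reference, the standalone checkpoint test I ran was roughly the following. This is a minimal sketch following the TF checkpoint guide, not DeepCGH's own code (the names `ckpt_dir`, `net`, etc. are just placeholders), and it completes without any crash on my machine:

```python
import tempfile

import numpy as np
import tensorflow as tf

# Minimal save/restore round trip, following the TF checkpoint guide.
ckpt_dir = tempfile.mkdtemp()

net = tf.keras.layers.Dense(4)
net.build(input_shape=(None, 8))

ckpt = tf.train.Checkpoint(net=net)
manager = tf.train.CheckpointManager(ckpt, ckpt_dir, max_to_keep=3)
save_path = manager.save()  # writes a checkpoint without crashing

# Restore into a freshly built layer and check the weights came back.
net2 = tf.keras.layers.Dense(4)
net2.build(input_shape=(None, 8))
tf.train.Checkpoint(net=net2).restore(save_path).assert_existing_objects_matched()

assert np.allclose(net.kernel.numpy(), net2.kernel.numpy())
```

So plain `tf.train.Checkpoint` saving and restoring works fine in this environment; the crash only happens inside the demo's training loop.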

Here is the console output. I've replaced my actual path with something simple (user and folders), but it is quite a nested folder structure if you need to know:

```
Current working directory is:
C:\Users\user\folders\DeepCGH

Data already exists.
Looking for trained models in:
C:\Users\user\folders\DeepCGH

Model already exists.
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\user\\folders\\DeepCGH\\DeepCGH_Models\\Disks\\Model_Disk_SHP(512, 512, 3)_IF16_Dst0.005_WL1e-06_PS1.5e-05_CNTFalse_64', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
WARNING:tensorflow:From C:\Users\user\anaconda3\envs\tf\lib\site-packages\tensorflow\python\training\training_util.py:396: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From C:\Users\user\anaconda3\envs\tf\lib\site-packages\keras\layers\normalization\batch_normalization.py:562: _colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\user\folders\DeepCGH\DeepCGH_Models\Disks\Model_Disk_SHP(512, 512, 3)_IF16_Dst0.005_WL1e-06_PS1.5e-05_CNTFalse_64\model.ckpt-4000
WARNING:tensorflow:From C:\Users\user\anaconda3\envs\tf\lib\site-packages\tensorflow\python\training\saver.py:1173: get_checkpoint_mtimes (from tensorflow.python.checkpoint.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 4000...
INFO:tensorflow:Saving checkpoints for 4000 into C:\Users\user\folders\DeepCGH\DeepCGH_Models\Disks\Model_Disk_SHP(512, 512, 3)_IF16_Dst0.005_WL1e-06_PS1.5e-05_CNTFalse_64\model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 4000...
Traceback (most recent call last):

  File ~\anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py:1378 in _do_call
    return fn(*args)

  File ~\anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py:1361 in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,

  File ~\anaconda3\envs\tf\lib\site-packages\tensorflow\python\client\session.py:1454 in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,

InvalidArgumentError: Conv2DBackpropFilter: input depth must be evenly divisible by filter depth
	 [[{{node Adam/gradients/gradients/model/conv2d_13/Conv2D_grad/Conv2DBackpropFilter}}]]
```
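In case it helps with diagnosing, this is how I checked which CUDA/cuDNN versions my TensorFlow build actually expects, as opposed to what is installed system-wide. Treat it as a sketch: I believe `tf.sysconfig.get_build_info` is available in TF 2.10, and on CPU-only builds the CUDA keys may simply be absent (hence `.get`):

```python
import tensorflow as tf

# Report the runtime version, the CUDA/cuDNN versions this TF build was
# compiled against, and whether the GPU is visible at all.
print("TF:", tf.__version__)
build = tf.sysconfig.get_build_info()
print("built against CUDA:", build.get("cuda_version"))
print("built against cuDNN:", build.get("cudnn_version"))
print("GPUs:", tf.config.list_physical_devices("GPU"))
```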

PCSimo · Mar 27 '24 12:03