DeepFaceLab Can't load saved model

Expected behavior

When running 6) train SAEHD.bat again after having run it once and saved the model, I would expect it to continue training.

Actual behavior

Throws the following error:

Running trainer.

Choose one of saved models, or enter a name to create a new model.
[r] : rename
[d] : delete

[0] : AubreyGillian - latest
 :
0
Loading AubreyGillian_SAEHD model...

Choose one or several GPU idxs (separated by comma).

[CPU] : CPU
  [0] : GeForce RTX 2060

[0] Which GPU indexes to choose? :
0

Initializing models: 100%|###############################################################| 5/5 [00:12<00:00,  2.43s/it]
Loading samples: 100%|############################################################| 1045/1045 [00:08<00:00, 123.68it/s]
Loading samples: 100%|###############################################################| 617/617 [00:16<00:00, 38.24it/s]
================ Model Summary =================
==                                            ==
==            Model name: AubreyGillian_SAEHD ==
==                                            ==
==     Current iteration: 10368               ==
==                                            ==
==-------------- Model Options ---------------==
==                                            ==
==            resolution: 128                 ==
==             face_type: f                   ==
==     models_opt_on_gpu: True                ==
==                 archi: liae-ud             ==
==               ae_dims: 256                 ==
==                e_dims: 64                  ==
==                d_dims: 64                  ==
==           d_mask_dims: 22                  ==
==       masked_training: True                ==
==       eyes_mouth_prio: False               ==
==           uniform_yaw: False               ==
==             adabelief: True                ==
==            lr_dropout: n                   ==
==           random_warp: True                ==
==       true_face_power: 0.0                 ==
==      face_style_power: 0.0                 ==
==        bg_style_power: 0.0                 ==
==               ct_mode: none                ==
==              clipgrad: True                ==
==              pretrain: False               ==
==       autobackup_hour: 1                   ==
== write_preview_history: False               ==
==           target_iter: 0                   ==
==           random_flip: True                ==
==            batch_size: 20                  ==
==             gan_power: 0.0                 ==
==        gan_patch_size: 16                  ==
==              gan_dims: 16                  ==
==                                            ==
==---------------- Running On ----------------==
==                                            ==
==          Device index: 0                   ==
==                  Name: GeForce RTX 2060    ==
==                  VRAM: 6.00GB              ==
==                                            ==
================================================
Starting. Press "Enter" to stop training and save model.
Error: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[512,512,3,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node Conv2D_12 (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:99) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[concat_39/concat/_809]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[512,512,3,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node Conv2D_12 (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:99) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node Conv2D_12:
 Pad_12 (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:97)
 decoder/res0/conv1/weight/read (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:76)

Input Source operations connected to node Conv2D_12:
 Pad_12 (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:97)
 decoder/res0/conv1/weight/read (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:76)

Original stack trace for 'Conv2D_12':
  File "threading.py", line 884, in _bootstrap
  File "threading.py", line 916, in _bootstrap_inner
  File "threading.py", line 864, in run
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\mainscripts\Trainer.py", line 58, in trainerThread
    debug=debug,
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\ModelBase.py", line 189, in __init__
    self.on_initialize()
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\Model_SAEHD\Model.py", line 381, in on_initialize
    gpu_pred_src_src, gpu_pred_src_srcm = self.decoder(gpu_src_code)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\models\ModelBase.py", line 117, in __call__
    return self.forward(*args, **kwargs)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\archis\DeepFakeArchi.py", line 154, in forward
    x = self.res0(x)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\models\ModelBase.py", line 117, in __call__
    return self.forward(*args, **kwargs)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\archis\DeepFakeArchi.py", line 68, in forward
    x = self.conv1(inp)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\LayerBase.py", line 14, in __call__
    return self.forward(*args, **kwargs)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py", line 99, in forward
    x = tf.nn.conv2d(x, weight, self.strides, 'VALID', dilations=self.dilations, data_format=nn.data_format)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\util\dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 2279, in conv2d
    name=name)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 972, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\ops.py", line 3536, in _create_op_internal
    op_def=op_def)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()

Traceback (most recent call last):
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1375, in _do_call
    return fn(*args)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1360, in _run_fn
    target_list, run_metadata)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[512,512,3,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node Conv2D_12}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[concat_39/concat/_809]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[512,512,3,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node Conv2D_12}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\mainscripts\Trainer.py", line 130, in trainerThread
    iter, iter_time = model.train_one_iter()
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\ModelBase.py", line 462, in train_one_iter
    losses = self.onTrainOneIter()
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\Model_SAEHD\Model.py", line 678, in onTrainOneIter
    src_loss, dst_loss = self.src_dst_train (warped_src, target_src, target_srcm, target_srcm_em, warped_dst, target_dst, target_dstm, target_dstm_em)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\Model_SAEHD\Model.py", line 538, in src_dst_train
    self.target_dstm_em:target_dstm_em,
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 968, in run
    run_metadata_ptr)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1369, in _do_run
    run_metadata)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[512,512,3,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node Conv2D_12 (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:99) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[concat_39/concat/_809]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[512,512,3,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node Conv2D_12 (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:99) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node Conv2D_12:
 Pad_12 (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:97)
 decoder/res0/conv1/weight/read (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:76)

Input Source operations connected to node Conv2D_12:
 Pad_12 (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:97)
 decoder/res0/conv1/weight/read (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:76)

Original stack trace for 'Conv2D_12':
  File "threading.py", line 884, in _bootstrap
  File "threading.py", line 916, in _bootstrap_inner
  File "threading.py", line 864, in run
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\mainscripts\Trainer.py", line 58, in trainerThread
    debug=debug,
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\ModelBase.py", line 189, in __init__
    self.on_initialize()
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\Model_SAEHD\Model.py", line 381, in on_initialize
    gpu_pred_src_src, gpu_pred_src_srcm = self.decoder(gpu_src_code)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\models\ModelBase.py", line 117, in __call__
    return self.forward(*args, **kwargs)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\archis\DeepFakeArchi.py", line 154, in forward
    x = self.res0(x)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\models\ModelBase.py", line 117, in __call__
    return self.forward(*args, **kwargs)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\archis\DeepFakeArchi.py", line 68, in forward
    x = self.conv1(inp)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\LayerBase.py", line 14, in __call__
    return self.forward(*args, **kwargs)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py", line 99, in forward
    x = tf.nn.conv2d(x, weight, self.strides, 'VALID', dilations=self.dilations, data_format=nn.data_format)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\util\dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 2279, in conv2d
    name=name)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 972, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\ops.py", line 3536, in _create_op_internal
    op_def=op_def)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()
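
For scale, here is a rough sketch that reads the size of the failing allocation out of the OOM line above. The helper name oom_request_bytes and the dtype table are made up for illustration and are not part of DeepFaceLab or TensorFlow; the only assumption is that "type float" in the message means 32-bit floats.

```python
import re
from math import prod

# Bytes per element for dtype names as they appear in TensorFlow OOM messages.
# Illustrative subset only; "float" is assumed to be 32-bit.
DTYPE_BYTES = {"float": 4, "half": 2, "double": 8}

OOM_RE = re.compile(
    r"OOM when allocating tensor with shape\[([\d,]+)\] and type (\w+)"
)

def oom_request_bytes(log_line):
    """Return the size in bytes of the allocation named in an OOM line, or None."""
    m = OOM_RE.search(log_line)
    if m is None:
        return None
    elem_size = DTYPE_BYTES.get(m.group(2))
    if elem_size is None:
        return None  # dtype not in the illustrative table
    shape = [int(d) for d in m.group(1).split(",")]
    return prod(shape) * elem_size

line = ("(0) Resource exhausted: OOM when allocating tensor with "
        "shape[512,512,3,3] and type float on "
        "/job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc")

size = oom_request_bytes(line)
print(size, "bytes =", size / 2**20, "MiB")  # 9437184 bytes = 9.0 MiB
```

The request that failed is only about 9 MiB, so it seems likely that the 6.00GB of VRAM was already almost fully allocated when this tensor was requested, rather than this one tensor being too large on its own.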

Steps to reproduce

Run 6) train SAEHD.bat once, press Enter to stop and save, then run 6) train SAEHD.bat again.

Other relevant information

Command line used (if not specified in steps to reproduce): The content of 6) train SAEHD.bat:

@echo off
call _internal\setenv.bat

"%PYTHON_EXECUTABLE%" "%DFL_ROOT%\main.py" train ^
    --training-data-src-dir "%WORKSPACE%\data_src\aligned" ^
    --training-data-dst-dir "%WORKSPACE%\data_dst\aligned" ^
    --pretraining-data-dir "%INTERNAL%\pretrain_CelebA" ^
    --model-dir "%WORKSPACE%\model" ^
    --model SAEHD

pause

Operating system and version: Windows 10 Home [64-bit] (10.0.19042)
Python version: 3.9.1

Jan 08 '21 15:01 SonOfDiablo

Did you ever find the answer?

Jun 08 '23 22:06 joolstorrentecalo

I honestly can't remember...

Jun 10 '23 08:06 SonOfDiablo