
Can't load saved model

Open SonOfDiablo opened this issue 4 years ago • 2 comments

Expected behavior

When running 6) train SAEHD.bat again after having run it once and saved the model, I would expect it to continue training.

Actual behavior

Throws the following error:

Running trainer.

Choose one of saved models, or enter a name to create a new model.
[r] : rename
[d] : delete

[0] : AubreyGillian - latest
 :
0
Loading AubreyGillian_SAEHD model...

Choose one or several GPU idxs (separated by comma).

[CPU] : CPU
  [0] : GeForce RTX 2060

[0] Which GPU indexes to choose? :
0

Initializing models: 100%|###############################################################| 5/5 [00:12<00:00,  2.43s/it]
Loading samples: 100%|############################################################| 1045/1045 [00:08<00:00, 123.68it/s]
Loading samples: 100%|###############################################################| 617/617 [00:16<00:00, 38.24it/s]
================ Model Summary =================
==                                            ==
==            Model name: AubreyGillian_SAEHD ==
==                                            ==
==     Current iteration: 10368               ==
==                                            ==
==-------------- Model Options ---------------==
==                                            ==
==            resolution: 128                 ==
==             face_type: f                   ==
==     models_opt_on_gpu: True                ==
==                 archi: liae-ud             ==
==               ae_dims: 256                 ==
==                e_dims: 64                  ==
==                d_dims: 64                  ==
==           d_mask_dims: 22                  ==
==       masked_training: True                ==
==       eyes_mouth_prio: False               ==
==           uniform_yaw: False               ==
==             adabelief: True                ==
==            lr_dropout: n                   ==
==           random_warp: True                ==
==       true_face_power: 0.0                 ==
==      face_style_power: 0.0                 ==
==        bg_style_power: 0.0                 ==
==               ct_mode: none                ==
==              clipgrad: True                ==
==              pretrain: False               ==
==       autobackup_hour: 1                   ==
== write_preview_history: False               ==
==           target_iter: 0                   ==
==           random_flip: True                ==
==            batch_size: 20                  ==
==             gan_power: 0.0                 ==
==        gan_patch_size: 16                  ==
==              gan_dims: 16                  ==
==                                            ==
==---------------- Running On ----------------==
==                                            ==
==          Device index: 0                   ==
==                  Name: GeForce RTX 2060    ==
==                  VRAM: 6.00GB              ==
==                                            ==
================================================
Starting. Press "Enter" to stop training and save model.
Error: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[512,512,3,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node Conv2D_12 (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:99) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[concat_39/concat/_809]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[512,512,3,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node Conv2D_12 (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:99) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node Conv2D_12:
 Pad_12 (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:97)
 decoder/res0/conv1/weight/read (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:76)

Input Source operations connected to node Conv2D_12:
 Pad_12 (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:97)
 decoder/res0/conv1/weight/read (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:76)

Original stack trace for 'Conv2D_12':
  File "threading.py", line 884, in _bootstrap
  File "threading.py", line 916, in _bootstrap_inner
  File "threading.py", line 864, in run
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\mainscripts\Trainer.py", line 58, in trainerThread
    debug=debug,
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\ModelBase.py", line 189, in __init__
    self.on_initialize()
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\Model_SAEHD\Model.py", line 381, in on_initialize
    gpu_pred_src_src, gpu_pred_src_srcm = self.decoder(gpu_src_code)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\models\ModelBase.py", line 117, in __call__
    return self.forward(*args, **kwargs)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\archis\DeepFakeArchi.py", line 154, in forward
    x = self.res0(x)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\models\ModelBase.py", line 117, in __call__
    return self.forward(*args, **kwargs)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\archis\DeepFakeArchi.py", line 68, in forward
    x = self.conv1(inp)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\LayerBase.py", line 14, in __call__
    return self.forward(*args, **kwargs)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py", line 99, in forward
    x = tf.nn.conv2d(x, weight, self.strides, 'VALID', dilations=self.dilations, data_format=nn.data_format)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\util\dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 2279, in conv2d
    name=name)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 972, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\ops.py", line 3536, in _create_op_internal
    op_def=op_def)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()

Traceback (most recent call last):
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1375, in _do_call
    return fn(*args)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1360, in _run_fn
    target_list, run_metadata)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[512,512,3,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node Conv2D_12}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[concat_39/concat/_809]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[512,512,3,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node Conv2D_12}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\mainscripts\Trainer.py", line 130, in trainerThread
    iter, iter_time = model.train_one_iter()
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\ModelBase.py", line 462, in train_one_iter
    losses = self.onTrainOneIter()
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\Model_SAEHD\Model.py", line 678, in onTrainOneIter
    src_loss, dst_loss = self.src_dst_train (warped_src, target_src, target_srcm, target_srcm_em, warped_dst, target_dst, target_dstm, target_dstm_em)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\Model_SAEHD\Model.py", line 538, in src_dst_train
    self.target_dstm_em:target_dstm_em,
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 968, in run
    run_metadata_ptr)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1369, in _do_run
    run_metadata)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\client\session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[512,512,3,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node Conv2D_12 (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:99) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[concat_39/concat/_809]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[512,512,3,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node Conv2D_12 (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:99) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node Conv2D_12:
 Pad_12 (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:97)
 decoder/res0/conv1/weight/read (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:76)

Input Source operations connected to node Conv2D_12:
 Pad_12 (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:97)
 decoder/res0/conv1/weight/read (defined at C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py:76)

Original stack trace for 'Conv2D_12':
  File "threading.py", line 884, in _bootstrap
  File "threading.py", line 916, in _bootstrap_inner
  File "threading.py", line 864, in run
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\mainscripts\Trainer.py", line 58, in trainerThread
    debug=debug,
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\ModelBase.py", line 189, in __init__
    self.on_initialize()
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\models\Model_SAEHD\Model.py", line 381, in on_initialize
    gpu_pred_src_src, gpu_pred_src_srcm = self.decoder(gpu_src_code)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\models\ModelBase.py", line 117, in __call__
    return self.forward(*args, **kwargs)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\archis\DeepFakeArchi.py", line 154, in forward
    x = self.res0(x)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\models\ModelBase.py", line 117, in __call__
    return self.forward(*args, **kwargs)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\archis\DeepFakeArchi.py", line 68, in forward
    x = self.conv1(inp)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\LayerBase.py", line 14, in __call__
    return self.forward(*args, **kwargs)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\DeepFaceLab\core\leras\layers\Conv2D.py", line 99, in forward
    x = tf.nn.conv2d(x, weight, self.strides, 'VALID', dilations=self.dilations, data_format=nn.data_format)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\util\dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 2279, in conv2d
    name=name)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 972, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\ops.py", line 3536, in _create_op_internal
    op_def=op_def)
  File "C:\Programs\DeepFaceLab_NVIDIA\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()
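
For scale, the size of the allocation that failed can be worked out from the shape printed in the OOM message. The sketch below is only an illustration: the helper name failed_allocation_bytes is made up for this example, and it assumes that TensorFlow's "type float" means 32-bit floats (4 bytes per element). It parses one OOM line copied from the log above and multiplies the dimensions.

```python
import re
from math import prod

# One OOM line copied verbatim from the log above.
OOM_LINE = (
    "Resource exhausted: OOM when allocating tensor with "
    "shape[512,512,3,3] and type float on "
    "/job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc"
)

# Assumption: "type float" in the message is float32, i.e. 4 bytes per element.
BYTES_PER_FLOAT32 = 4


def failed_allocation_bytes(line: str) -> int:
    """Hypothetical helper: size in bytes of the tensor named in an OOM line."""
    match = re.search(r"shape\[([\d,]+)\]", line)
    if match is None:
        raise ValueError("no tensor shape found in line")
    dims = [int(d) for d in match.group(1).split(",")]
    return prod(dims) * BYTES_PER_FLOAT32


size = failed_allocation_bytes(OOM_LINE)
print(f"{size} bytes = {size / 2**20:.1f} MiB")  # 9437184 bytes = 9.0 MiB
```

The request that failed is only about 9 MiB, so the 6.00GB card was probably already close to full when it was made. That points at overall memory pressure from the model options (for example batch_size: 20 at resolution: 128) rather than at this single tensor, although the log alone does not confirm the cause.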

Steps to reproduce

Run 6) train SAEHD.bat once, press Enter to stop training and save the model, then run 6) train SAEHD.bat again.

Other relevant information

  • Command line used (if not specified in steps to reproduce): the content of 6) train SAEHD.bat:
@echo off
call _internal\setenv.bat

"%PYTHON_EXECUTABLE%" "%DFL_ROOT%\main.py" train ^
    --training-data-src-dir "%WORKSPACE%\data_src\aligned" ^
    --training-data-dst-dir "%WORKSPACE%\data_dst\aligned" ^
    --pretraining-data-dir "%INTERNAL%\pretrain_CelebA" ^
    --model-dir "%WORKSPACE%\model" ^
    --model SAEHD

pause
  • Operating system and version: Windows 10 Home [64-bit] (10.0.19042)
  • Python version: 3.9.1

SonOfDiablo, Jan 08 '21 15:01

Did you ever find the answer?

joolstorrentecalo, Jun 08 '23 22:06

I honestly can't remember...

SonOfDiablo, Jun 10 '23 08:06