DeepFaceLab_Linux

XSeg_train unable to run

Open kwokyto opened this issue 3 years ago • 3 comments

Environment

$ uname -a
Linux GPU-01 5.4.0-120-generic #136-Ubuntu SMP Fri Jun 10 13:40:48 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ nvidia-smi
Sat Jul  2 12:23:56 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:17:00.0 Off |                  N/A |
|  0%   35C    P8    12W / 250W |   1091MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:65:00.0 Off |                  N/A |
|  0%   30C    P8    11W / 250W |      8MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:66:00.0 Off |                  N/A |
|  0%   31C    P8    10W / 250W |   6690MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
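
The table above can also be read programmatically to see how much memory each card has free before training starts. The following is a minimal sketch that only parses the text of the table shown here; the helper name `free_memory_mib` is hypothetical, and nothing in this thread establishes that free memory is related to the failure.

```python
import re

# Matches the "used / total" memory column of an nvidia-smi table row,
# e.g. "1091MiB / 11264MiB". Power readings such as "12W / 250W" do not match.
MEM_RE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def free_memory_mib(nvidia_smi_text):
    """Return the free memory in MiB for each GPU row, in table order."""
    return [int(total) - int(used)
            for used, total in MEM_RE.findall(nvidia_smi_text)]


# The three memory rows copied from the output above.
sample = """
|  0%   35C    P8    12W / 250W |   1091MiB / 11264MiB |      0%      Default |
|  0%   30C    P8    11W / 250W |      8MiB / 11264MiB |      0%      Default |
|  0%   31C    P8    10W / 250W |   6690MiB / 11264MiB |      0%      Default |
"""

free = free_memory_mib(sample)
# Index of the GPU with the most free memory.
best = max(range(len(free)), key=free.__getitem__)
print(free, best)  # [10173, 11256, 4574] 1
```

On this machine GPU 1 is the least loaded card, while the model summary further down reports that training ran on device index 0.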

Steps to reproduce

I am following the guide from Druuzil Tech and Games here.

  • Copy data_src.mp4 into workspace
  • Copy the 1.7GB RTT model into workspace/model
  • Copy the 9GB RTM WF Faceset into workspace/data_dst/aligned
  • ./2_extract_image_from_data_src.sh
  • ./4_data_src_extract_faces_S3FD.sh
    • Faced a GPU error and downgraded tensorflow-gpu to 2.3.1 as per #20
  • ./5_XSeg_data_src_mask_apply.sh
  • ./5_XSeg_train.sh
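
For anyone trying to reproduce this, the script steps above can be chained so that the run stops at the first script that fails. This is only a sketch: it assumes the scripts sit in the current working directory and signal failure through a non-zero exit status, and the `run_steps` helper and its `runner` parameter are hypothetical names introduced here for illustration.

```python
import subprocess

# Script names as listed in the steps above; assumed to be in the
# current working directory.
STEPS = [
    "./2_extract_image_from_data_src.sh",
    "./4_data_src_extract_faces_S3FD.sh",
    "./5_XSeg_data_src_mask_apply.sh",
    "./5_XSeg_train.sh",
]


def run_steps(steps, runner=subprocess.call):
    """Run each script in order; return the first failing step, or None.

    `runner` takes an argument list and returns an exit status. It defaults
    to subprocess.call, which inherits the terminal so prompts still work.
    """
    for step in steps:
        if runner([step]) != 0:
            return step
    return None


# Usage (uncomment to actually run the scripts):
# failed = run_steps(STEPS)
# print("all steps finished" if failed is None else "failed at " + failed)
```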

Error Output

Loading samples: 100%|#########################################################################################| 25461/25461 [00:57<00:00, 439.62it/s]
Loaded 63012 packed faces from /data/home/kryan/DeepFaceLab_Linux/workspace/data_dst/aligned
Filtering: 100%|##############################################################################################| 88473/88473 [00:58<00:00, 1514.01it/s]
Using 278 segmented samples.
================== Model Summary ==================
==                                               ==
==        Model name: XSeg                       ==
==                                               ==
== Current iteration: 1                          ==
==                                               ==
==---------------- Model Options ----------------==
==                                               ==
==         face_type: wf                         ==
==          pretrain: False                      ==
==        batch_size: 8                          ==
==                                               ==
==----------------- Running On ------------------==
==                                               ==
==      Device index: 0                          ==
==              Name: NVIDIA GeForce GTX 1080 Ti ==
==              VRAM: 9.03GB                     ==
==                                               ==
===================================================
Starting. Press "Enter" to stop training and save model.
: cannot connect to X server .8308]
Error: DNN Backward Data function launch failure : input shape([8,32,258,258]) filter shape([3,3,32,1])
         [[node gradients/Conv2D_30_grad/Conv2DBackpropInput (defined at /DeepFaceLab_Linux/DeepFaceLab/core/leras/ops/__init__.py:55) ]]

Errors may have originated from an input operation.
Input Source operations connected to node gradients/Conv2D_30_grad/Conv2DBackpropInput:
 XSeg/out_conv/weight/read (defined at /DeepFaceLab_Linux/DeepFaceLab/core/leras/layers/Conv2D.py:61)

Original stack trace for 'gradients/Conv2D_30_grad/Conv2DBackpropInput':
  File "/.conda/envs/deepfacelab/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/.conda/envs/deepfacelab/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/.conda/envs/deepfacelab/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/DeepFaceLab_Linux/DeepFaceLab/mainscripts/Trainer.py", line 58, in trainerThread
    debug=debug)
  File "/DeepFaceLab_Linux/DeepFaceLab/models/Model_XSeg/Model.py", line 17, in __init__
    super().__init__(*args, force_model_class_name='XSeg', **kwargs)
  File "/DeepFaceLab_Linux/DeepFaceLab/models/ModelBase.py", line 193, in __init__
    self.on_initialize()
  File "/DeepFaceLab_Linux/DeepFaceLab/models/Model_XSeg/Model.py", line 118, in on_initialize
    gpu_loss_gvs += [ nn.gradients ( gpu_loss, self.model.get_weights() ) ]
  File "/DeepFaceLab_Linux/DeepFaceLab/core/leras/ops/__init__.py", line 55, in tf_gradients
    grads = gradients.gradients(loss, vars, colocate_gradients_with_ops=True )
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 172, in gradients
    unconnected_gradients)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 669, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 336, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 669, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/nn_grad.py", line 596, in _Conv2DGrad
    data_format=data_format),
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1300, in conv2d_backprop_input
    name=name)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3485, in _create_op_internal
    op_def=op_def)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1949, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'Conv2D_30', defined at:
  File "/.conda/envs/deepfacelab/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
[elided 4 identical lines from previous traceback]
  File "/DeepFaceLab_Linux/DeepFaceLab/models/ModelBase.py", line 193, in __init__
    self.on_initialize()
  File "/DeepFaceLab_Linux/DeepFaceLab/models/Model_XSeg/Model.py", line 103, in on_initialize
    gpu_pred_logits_t, gpu_pred_t = self.model.flow(gpu_input_t, pretrain=self.pretrain)
  File "/DeepFaceLab_Linux/DeepFaceLab/facelib/XSegNet.py", line 85, in flow
    return self.model(x, pretrain=pretrain)
  File "/DeepFaceLab_Linux/DeepFaceLab/core/leras/models/ModelBase.py", line 117, in __call__
    return self.forward(*args, **kwargs)
  File "/DeepFaceLab_Linux/DeepFaceLab/core/leras/models/XSeg.py", line 167, in forward
    logits = self.out_conv(x)
  File "/DeepFaceLab_Linux/DeepFaceLab/core/leras/layers/LayerBase.py", line 14, in __call__
    return self.forward(*args, **kwargs)
  File "/DeepFaceLab_Linux/DeepFaceLab/core/leras/layers/Conv2D.py", line 101, in forward
    x = tf.nn.conv2d(x, weight, strides, 'VALID', dilations=dilations, data_format=nn.data_format)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/nn_ops.py", line 2273, in conv2d
    name=name)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 979, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3485, in _create_op_internal
    op_def=op_def)

Traceback (most recent call last):
  File "/data/home/kryan/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/data/home/kryan/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/data/home/kryan/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: DNN Backward Data function launch failure : input shape([8,32,258,258]) filter shape([3,3,32,1])
         [[{{node gradients/Conv2D_30_grad/Conv2DBackpropInput}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/home/kryan/DeepFaceLab_Linux/DeepFaceLab/mainscripts/Trainer.py", line 129, in trainerThread
    iter, iter_time = model.train_one_iter()
  File "/data/home/kryan/DeepFaceLab_Linux/DeepFaceLab/models/ModelBase.py", line 474, in train_one_iter
    losses = self.onTrainOneIter()
  File "/data/home/kryan/DeepFaceLab_Linux/DeepFaceLab/models/Model_XSeg/Model.py", line 194, in onTrainOneIter
    loss = self.train (image_np, target_np)
  File "/data/home/kryan/DeepFaceLab_Linux/DeepFaceLab/models/Model_XSeg/Model.py", line 136, in train
    l, _ = nn.tf_sess.run ( [loss, loss_gv_op], feed_dict={self.model.input_t :input_np, self.model.target_t :target_np })
  File "/data/home/kryan/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 958, in run
    run_metadata_ptr)
  File "/data/home/kryan/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1181, in _run
    feed_dict_tensor, options, run_metadata)
  File "/data/home/kryan/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/data/home/kryan/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: DNN Backward Data function launch failure : input shape([8,32,258,258]) filter shape([3,3,32,1])
         [[node gradients/Conv2D_30_grad/Conv2DBackpropInput (defined at /DeepFaceLab_Linux/DeepFaceLab/core/leras/ops/__init__.py:55) ]]

Errors may have originated from an input operation.
Input Source operations connected to node gradients/Conv2D_30_grad/Conv2DBackpropInput:
 XSeg/out_conv/weight/read (defined at /DeepFaceLab_Linux/DeepFaceLab/core/leras/layers/Conv2D.py:61)

Original stack trace for 'gradients/Conv2D_30_grad/Conv2DBackpropInput':
  File "/.conda/envs/deepfacelab/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/.conda/envs/deepfacelab/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/.conda/envs/deepfacelab/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/DeepFaceLab_Linux/DeepFaceLab/mainscripts/Trainer.py", line 58, in trainerThread
    debug=debug)
  File "/DeepFaceLab_Linux/DeepFaceLab/models/Model_XSeg/Model.py", line 17, in __init__
    super().__init__(*args, force_model_class_name='XSeg', **kwargs)
  File "/DeepFaceLab_Linux/DeepFaceLab/models/ModelBase.py", line 193, in __init__
    self.on_initialize()
  File "/DeepFaceLab_Linux/DeepFaceLab/models/Model_XSeg/Model.py", line 118, in on_initialize
    gpu_loss_gvs += [ nn.gradients ( gpu_loss, self.model.get_weights() ) ]
  File "/DeepFaceLab_Linux/DeepFaceLab/core/leras/ops/__init__.py", line 55, in tf_gradients
    grads = gradients.gradients(loss, vars, colocate_gradients_with_ops=True )
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 172, in gradients
    unconnected_gradients)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 669, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 336, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 669, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/nn_grad.py", line 596, in _Conv2DGrad
    data_format=data_format),
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1300, in conv2d_backprop_input
    name=name)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3485, in _create_op_internal
    op_def=op_def)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1949, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'Conv2D_30', defined at:
  File "/.conda/envs/deepfacelab/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
[elided 4 identical lines from previous traceback]
  File "/DeepFaceLab_Linux/DeepFaceLab/models/ModelBase.py", line 193, in __init__
    self.on_initialize()
  File "/DeepFaceLab_Linux/DeepFaceLab/models/Model_XSeg/Model.py", line 103, in on_initialize
    gpu_pred_logits_t, gpu_pred_t = self.model.flow(gpu_input_t, pretrain=self.pretrain)
  File "/DeepFaceLab_Linux/DeepFaceLab/facelib/XSegNet.py", line 85, in flow
    return self.model(x, pretrain=pretrain)
  File "/DeepFaceLab_Linux/DeepFaceLab/core/leras/models/ModelBase.py", line 117, in __call__
    return self.forward(*args, **kwargs)
  File "/DeepFaceLab_Linux/DeepFaceLab/core/leras/models/XSeg.py", line 167, in forward
    logits = self.out_conv(x)
  File "/DeepFaceLab_Linux/DeepFaceLab/core/leras/layers/LayerBase.py", line 14, in __call__
    return self.forward(*args, **kwargs)
  File "/DeepFaceLab_Linux/DeepFaceLab/core/leras/layers/Conv2D.py", line 101, in forward
    x = tf.nn.conv2d(x, weight, strides, 'VALID', dilations=dilations, data_format=nn.data_format)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/nn_ops.py", line 2273, in conv2d
    name=name)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 979, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/.conda/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3485, in _create_op_internal
    op_def=op_def)
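
As a sanity check on the shapes in the error message, the arithmetic below works out what the failing convolution produces and how large its input tensor is. It is a sketch under stated assumptions: the input shape is read as NCHW and the filter shape as height, width, in-channels, out-channels; stride 1, dilation 1 and float32 storage are assumed because the log does not show them. The helper names are hypothetical.

```python
def valid_conv_output(size, kernel, stride=1, dilation=1):
    """Output length of a 'VALID' (unpadded) convolution along one axis."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size - effective_kernel) // stride + 1


def tensor_bytes(shape, bytes_per_element=4):
    """Storage size of a dense tensor; 4 bytes per element assumes float32."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


input_shape = [8, 32, 258, 258]  # from the error message, read as NCHW
filter_shape = [3, 3, 32, 1]     # from the error message, read as H, W, in, out

out_side = valid_conv_output(input_shape[2], filter_shape[0])
size = tensor_bytes(input_shape)
print(out_side)          # 256
print(size)              # 68161536
print(size / 2 ** 20)    # roughly 65 MiB
```

Under these assumptions the input to `out_conv` is about 65 MiB at batch size 8, which is small next to the 9.03GB of VRAM reported in the model summary. That observation alone does not identify the cause of the launch failure.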

kwokyto, Jul 02 '22 04:07

@kwokyto Hi, where did you get the 1.7GB RTT model?

otsebriy, Jul 29 '22 10:07

@otsebriy I got it from here, as per the instructions from here.

kwokyto, Jul 29 '22 15:07

conda create -n deepfacelab -c main python=3.7 cudnn=7.6.5 cudatoolkit=10.1.243

Replace the contents of requirements-cuda.txt with this:

tqdm
numpy
numexpr
h5py==3.1.0
opencv-python==4.1.0.25
ffmpeg-python==0.1.17
scikit-image==0.14.2
scipy==1.4.1
colorama
tensorflow-gpu==2.4.0
pyqt5
tf2onnx==1.9.3
ffmpeg

python -m pip install -r ./DeepFaceLab/requirements-cuda.txt

zabique, Aug 20 '22 21:08
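
When mixing a conda `cudatoolkit`/`cudnn` pair with a pip `tensorflow-gpu` pin, it can help to compare the combination against the versions each TensorFlow release was built and tested with. The sketch below hard-codes two rows as recalled from TensorFlow's published tested build configurations (2.3 with CUDA 10.1 and cuDNN 7.6, 2.4 with CUDA 11.0 and cuDNN 8.0); treat the table as an assumption and verify it against the official documentation. The `TESTED` table and `check` helper are hypothetical names.

```python
# Assumed values, recalled from TensorFlow's tested build configurations.
# Verify against the official table before relying on them.
TESTED = {
    "2.3": {"cuda": "10.1", "cudnn": "7.6"},
    "2.4": {"cuda": "11.0", "cudnn": "8.0"},
}


def major_minor(version):
    """Reduce a version such as '10.1.243' to '10.1'."""
    return ".".join(version.split(".")[:2])


def check(tf_version, cudatoolkit, cudnn):
    """Return a list of mismatch messages, or None if the release is unknown."""
    expected = TESTED.get(major_minor(tf_version))
    if expected is None:
        return None
    have = {"cuda": major_minor(cudatoolkit), "cudnn": major_minor(cudnn)}
    return [
        "%s: have %s, tested with %s" % (name, have[name], expected[name])
        for name in ("cuda", "cudnn")
        if have[name] != expected[name]
    ]


# The combination from the conda/pip commands above.
print(check("2.4.0", "10.1.243", "7.6.5"))
# The tensorflow-gpu version the original poster downgraded to.
print(check("2.3.1", "10.1.243", "7.6.5"))
```

If the assumed table is right, the first call reports two mismatches and the second reports none, so pairing these conda packages with tensorflow-gpu 2.3.x may be the more consistent choice.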