
RTX 2060 Super - SAEHD training Memory Error even with Batch size 1

Open · Stingrayseven opened this issue 3 years ago • 16 comments

Tried lowering all kinds of resolutions and batch sizes to no avail. Even at batch size 1 it still gives me a memory error. I don't think this should happen on my GPU.

Windows 10.0.18363 Build 18363, AMD Ryzen 5 3600, NVIDIA RTX 2060 Super

CMD

===================== Model Summary ======================
==                                                      ==
==            Model name: new_SAEHD                     ==
==                                                      ==
==     Current iteration: 0                             ==
==                                                      ==
==------------------- Model Options --------------------==
==                                                      ==
==            resolution: 64                            ==
==             face_type: wf                            ==
==     models_opt_on_gpu: True                          ==
==                 archi: df                            ==
==               ae_dims: 64                            ==
==                e_dims: 32                            ==
==                d_dims: 32                            ==
==           d_mask_dims: 22                            ==
==       masked_training: True                          ==
==       eyes_mouth_prio: False                         ==
==           uniform_yaw: True                          ==
==         blur_out_mask: False                         ==
==             adabelief: True                          ==
==            lr_dropout: n                             ==
==           random_warp: False                         ==
==      random_hsv_power: 0.0                           ==
==       true_face_power: 0.0                           ==
==      face_style_power: 0.0                           ==
==        bg_style_power: 0.0                           ==
==               ct_mode: none                          ==
==              clipgrad: True                          ==
==              pretrain: True                          ==
==       autobackup_hour: 1                             ==
== write_preview_history: True                          ==
==           target_iter: 0                             ==
==       random_src_flip: True                          ==
==       random_dst_flip: True                          ==
==            batch_size: 1                             ==
==             gan_power: 0.0                           ==
==        gan_patch_size: 16                            ==
==              gan_dims: 16                            ==
==                                                      ==
==--------------------- Running On ---------------------==
==                                                      ==
==          Device index: 0                             ==
==                  Name: NVIDIA GeForce RTX 2060 SUPER ==
==                  VRAM: 6.17GB                        ==
==                                                      ==
==========================================================
Starting. Press "Enter" to stop training and save model.

Trying to do the first iteration. If an error occurs, reduce the model parameters.

!!!
Windows 10 users IMPORTANT notice. You should set this setting in order to work correctly.
https://i.imgur.com/B7cmDCB.jpg
!!!
Error:
Traceback (most recent call last):
  File "D:\Program Files (x86)\Everything Deepfake\DeepFaceLab\DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\DeepFaceLab\mainscripts\Trainer.py", line 159, in trainerThread
    model_save()
  File "D:\Program Files (x86)\Everything Deepfake\DeepFaceLab\DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\DeepFaceLab\mainscripts\Trainer.py", line 68, in model_save
    model.save()
  File "D:\Program Files (x86)\Everything Deepfake\DeepFaceLab\DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\DeepFaceLab\models\ModelBase.py", line 393, in save
    self.onSave()
  File "D:\Program Files (x86)\Everything Deepfake\DeepFaceLab\DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\DeepFaceLab\models\Model_SAEHD\Model.py", line 759, in onSave
    model.save_weights ( self.get_strpath_storage_for_file(filename) )
  File "D:\Program Files (x86)\Everything Deepfake\DeepFaceLab\DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\DeepFaceLab\core\leras\layers\Saveable.py", line 61, in save_weights
    d_dumped = pickle.dumps (d, 4)
MemoryError

Stingrayseven avatar Jan 07 '22 10:01 Stingrayseven

Yeah, you could try without adabelief, but these settings already look pretty minimal. The last release seems to be broken; you could try increasing your pagefile size, as that seems to fix the issue for some people. However, this should really be fixed in the code, as it is not normal to have to allocate tens of GB of pagefile for a program that shouldn't need that much RAM.
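
If psutil is available (it is not bundled with DeepFaceLab, so installing it is an extra step), a quick standalone check of how much RAM and pagefile headroom the machine actually has before training:

import psutil

GIB = 1024 ** 3
vm = psutil.virtual_memory()
sw = psutil.swap_memory()  # on Windows this approximates the pagefile

print(f"RAM:      {vm.available / GIB:.1f} GiB free of {vm.total / GIB:.1f} GiB")
print(f"Pagefile: {sw.free / GIB:.1f} GiB free of {sw.total / GIB:.1f} GiB")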

havetc avatar Jan 07 '22 20:01 havetc

I solved my issue by reducing the number of workers used for data loading. On my version, I edited DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\DeepFaceLab\models\Model_SAEHD\Model.py by just changing line 669 to

cpu_count = 4 #multiprocessing.cpu_count()

I suppose that, due to a bug, each CPU worker uses far too much RAM, so limiting their number mitigates the problem. The real fix would probably be in SampleGeneratorFace.py, to limit its resource usage.
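
As a minimal, runnable sketch of what that change does (the surrounding Model.py code is paraphrased and may differ between releases):

import multiprocessing

# Original behaviour: spawn one data-loading worker per logical core,
# which can exhaust RAM on many-core machines.
# cpu_count = multiprocessing.cpu_count()

# Workaround discussed here: hard-code a small worker count.
cpu_count = 4

# Model.py then splits the workers between the src and dst sample
# generators (approximately; the exact code varies by version).
src_generators_count = cpu_count // 2
dst_generators_count = cpu_count // 2
print(f"src workers: {src_generators_count}, dst workers: {dst_generators_count}")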

havetc avatar Jan 08 '22 01:01 havetc

> I solved my issue by reducing the number of workers used for data loading. On my version, I edited DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\DeepFaceLab\models\Model_SAEHD\Model.py by just changing line 669 to
>
> cpu_count = 4 #multiprocessing.cpu_count()
>
> I suppose that, due to a bug, each CPU worker uses far too much RAM, so limiting their number mitigates the problem. The real fix would probably be in SampleGeneratorFace.py, to limit its resource usage.

Also, are you using the DF arch intentionally? (Most people use either the DF-UD or LIAE-UD variant.)

The -D variant makes it possible to train at higher res/dims on the same hardware...

If you want, try using DF-UD and see how that goes.

varunpro avatar Jan 08 '22 05:01 varunpro

> Yeah, you could try without adabelief

That actually did the trick, thanks. I can even train at "higher" batch sizes/resolutions now, for whatever reason.

Stingrayseven avatar Jan 09 '22 07:01 Stingrayseven

So an RTX 2080 Ti would work with this line edited?

2blackbar avatar Feb 23 '22 23:02 2blackbar

> I solved my issue by reducing the number of workers used for data loading. On my version, I edited DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\DeepFaceLab\models\Model_SAEHD\Model.py by just changing line 669 to cpu_count = 4 #multiprocessing.cpu_count() I suppose that, due to a bug, each CPU worker uses far too much RAM, so limiting their number mitigates the problem. The real fix would probably be in SampleGeneratorFace.py, to limit its resource usage.
>
> Also, are you using the DF arch intentionally? (Most people use either the DF-UD or LIAE-UD variant.)
>
> The -D variant makes it possible to train at higher res/dims on the same hardware...
>
> If you want, try using DF-UD and see how that goes.

This also helped me out; I was getting array and memory errors even after disabling the settings that consume a lot of VRAM. Thank you so much for sharing.

PhantomX700 avatar Jun 05 '22 19:06 PhantomX700

Same problem here with a Quadro RTX 8000; changing cpu_count to 4 seems to fix this error.

Monedon avatar Sep 15 '22 19:09 Monedon

This also solved the issue for me. I recommend trying it.

Mshriver2 avatar Oct 10 '22 11:10 Mshriver2

Changing the cpu_count also worked for me! Specs are an i5-12600K and an RTX 3070.

mattwhitson avatar Nov 25 '22 07:11 mattwhitson

This solved the issue for me too.

Xeon E5-2690 v3, RTX 3060 12 GB

Changed cpu_count to 12 and it worked fine!

JeanFrancoZ avatar Feb 24 '23 05:02 JeanFrancoZ

> I solved my issue by reducing the number of workers used for data loading. On my version, I edited DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\DeepFaceLab\models\Model_SAEHD\Model.py by just changing line 669 to
>
> cpu_count = 4 #multiprocessing.cpu_count()
>
> I suppose that, due to a bug, each CPU worker uses far too much RAM, so limiting their number mitigates the problem. The real fix would probably be in SampleGeneratorFace.py, to limit its resource usage.

Thank you! 👍 It works with 12! (AMD Ryzen 7 5800H + RTX 3070 Mobile)

zodchiy-ua avatar Mar 26 '23 19:03 zodchiy-ua

It seems to be cpu_count = min(multiprocessing.cpu_count(), 8) in the repository now; maybe the packaged versions just aren't up to date. Maybe it's time to close the issue!
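
A tiny standalone illustration of why that min() cap helps on many-core machines (the core counts below are just example values):

import multiprocessing

def data_loader_workers(cores: int, cap: int = 8) -> int:
    # Mirrors cpu_count = min(multiprocessing.cpu_count(), cap)
    return min(cores, cap)

for cores in (4, 8, 16, 32):
    print(f"{cores:2d} logical cores -> {data_loader_workers(cores)} workers")

print(f"this machine: {data_loader_workers(multiprocessing.cpu_count())} workers")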

havetc avatar Mar 28 '23 00:03 havetc

> I solved my issue by reducing the number of workers used for data loading. On my version, I edited DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\DeepFaceLab\models\Model_SAEHD\Model.py by just changing line 669 to
>
> cpu_count = 4 #multiprocessing.cpu_count()
>
> I suppose that, due to a bug, each CPU worker uses far too much RAM, so limiting their number mitigates the problem. The real fix would probably be in SampleGeneratorFace.py, to limit its resource usage.

I've signed up only to thank you. Thank you, sir!

orspkbra avatar Apr 26 '23 23:04 orspkbra

How can we reduce the number of CPU workers for XSeg? With cpu_count = min(multiprocessing.cpu_count(), 8), XSeg training loads the samples and then nothing happens; it just gets stuck. Can anyone kindly help me?

orspkbra avatar Apr 30 '23 18:04 orspkbra

Issue solved / already answered (or it seems like user error), please close it.

joolstorrentecalo avatar Jun 08 '23 22:06 joolstorrentecalo

> I solved my issue by reducing the number of workers used for data loading. On my version, I edited DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\DeepFaceLab\models\Model_SAEHD\Model.py by just changing line 669 to
>
> cpu_count = 4 #multiprocessing.cpu_count()
>
> I suppose that, due to a bug, each CPU worker uses far too much RAM, so limiting their number mitigates the problem. The real fix would probably be in SampleGeneratorFace.py, to limit its resource usage.

Thank you for this fix, it worked like a charm! And damn, CPUs age badly these days...

johns44-sys avatar Oct 03 '23 02:10 johns44-sys