
Training with the example code crashes

HildaM opened this issue 2 years ago · 2 comments

I am training on a 4090 graphics card, but the run crashes partway through the process, and it always crashes at step 50. Example command:

python MotionDirector_train.py --config ./configs/config_single_video.yaml

Error Output:

(motiondirector) PS D:\Coding\AILearning\AI_Art_Technology_Demo\MotionDirector> python MotionDirector_train.py --config ./configs/config_single_video.yaml
Initializing the conversion map
D:\Applications\Miniconda3\envs\motiondirector\lib\site-packages\accelerate\accelerator.py:359: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
02/11/2024 14:26:18 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'rescale_betas_zero_snr', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into UNet3DConditionModel.
{'rescale_betas_zero_snr', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
Caching Latents.:   0%|                                                                                                                       | 0/1 [00:00<?, ?it/s]D:\Applications\Miniconda3\envs\motiondirector\lib\site-packages\diffusers\models\attention_processor.py:1129: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
  hidden_states = F.scaled_dot_product_attention(
{'rescale_betas_zero_snr', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:10<00:00,  4.91it/s]
Caching Latents.: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.88s/it]
02/11/2024 14:26:36 - INFO - __main__ - ***** Running training *****
02/11/2024 14:26:36 - INFO - __main__ -   Num examples = 1
02/11/2024 14:26:36 - INFO - __main__ -   Num Epochs = 150
02/11/2024 14:26:36 - INFO - __main__ -   Instantaneous batch size per device = 1
02/11/2024 14:26:36 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
02/11/2024 14:26:36 - INFO - __main__ -   Gradient Accumulation steps = 1
02/11/2024 14:26:36 - INFO - __main__ -   Total optimization steps = 150
Steps:  33%|███████████████████████████████████████▋                                                                               | 50/150 [00:46<01:31,  1.10it/s]
{'rescale_betas_zero_snr', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
(motiondirector) PS D:\Coding\AILearning\AI_Art_Technology_Demo\MotionDirector>

HildaM commented on Feb 11, 2024

I noticed that when training reaches step 50, memory consumption climbs to nearly 32 GB, but my PC only has 32 GB of RAM. Does this mean training the LoRA requires more than 32 GB of memory?
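
For reference, a minimal sketch of how one could log host and GPU memory per step to confirm where the spike happens (assumptions: psutil is installed, and log_memory is a hypothetical helper added by hand to the training loop, not part of MotionDirector):

import psutil
import torch

def log_memory(step):
    # Host RAM currently in use, in GiB
    ram_gb = psutil.virtual_memory().used / 1024**3
    # GPU memory currently allocated by PyTorch tensors, in GiB
    vram_gb = torch.cuda.memory_allocated() / 1024**3
    print(f"step {step}: RAM {ram_gb:.1f} GiB, VRAM {vram_gb:.1f} GiB")

Calling this once per training step would show whether host RAM ramps up right before step 50.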

HildaM commented on Feb 11, 2024

I didn't run into this issue, since I used a 24 GB GPU. Could you please provide more information, for example what value you set for "validation_steps"?
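
As a quick check, a hedged sketch for printing that value from the training config (assuming PyYAML is installed; "validation_steps" is the key asked about above, and the flat, top-level lookup is an assumption about the file's layout):

import yaml  # pip install pyyaml

with open("./configs/config_single_video.yaml") as f:
    cfg = yaml.safe_load(f)

# Print the validation interval if the key exists at the top level;
# if the config nests it under a subsection, adjust the lookup path.
print("validation_steps:", cfg.get("validation_steps"))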

ruizhaocv commented on Feb 21, 2024