MotionDirector
Training with the example code crashes
I am using a 4090 graphics card to train, but it crashes halfway through the process, always at step 50. Example code:
python MotionDirector_train.py --config ./configs/config_single_video.yaml
Error Output:
(motiondirector) PS D:\Coding\AILearning\AI_Art_Technology_Demo\MotionDirector> python MotionDirector_train.py --config ./configs/config_single_video.yaml
Initializing the conversion map
D:\Applications\Miniconda3\envs\motiondirector\lib\site-packages\accelerate\accelerator.py:359: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
02/11/2024 14:26:18 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
{'rescale_betas_zero_snr', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into UNet3DConditionModel.
{'rescale_betas_zero_snr', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
Caching Latents.: 0%| | 0/1 [00:00<?, ?it/s]D:\Applications\Miniconda3\envs\motiondirector\lib\site-packages\diffusers\models\attention_processor.py:1129: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
hidden_states = F.scaled_dot_product_attention(
{'rescale_betas_zero_snr', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:10<00:00, 4.91it/s]
Caching Latents.: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.88s/it]
02/11/2024 14:26:36 - INFO - __main__ - ***** Running training *****
02/11/2024 14:26:36 - INFO - __main__ - Num examples = 1
02/11/2024 14:26:36 - INFO - __main__ - Num Epochs = 150
02/11/2024 14:26:36 - INFO - __main__ - Instantaneous batch size per device = 1
02/11/2024 14:26:36 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
02/11/2024 14:26:36 - INFO - __main__ - Gradient Accumulation steps = 1
02/11/2024 14:26:36 - INFO - __main__ - Total optimization steps = 150
Steps: 33%|███████████████████████████████████████▋ | 50/150 [00:46<01:31, 1.10it/s]
{'rescale_betas_zero_snr', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
(motiondirector) PS D:\Coding\AILearning\AI_Art_Technology_Demo\MotionDirector>
I noticed that when the steps reach 50, memory consumption is nearly 32 GB, but my PC only has 32 GB of RAM. Does this mean training the LoRA requires more than 32 GB of memory?
I didn't encounter this issue, since I used a 24GB GPU. Could you please provide more information? For example, what value did you set for "validation_steps"?
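For context on why "validation_steps" matters here: in training configs like this one, validation (generating sample videos mid-training) is typically triggered every `validation_steps` steps, and that sampling pass can spike memory well above the steady training footprint — which would explain a crash exactly at step 50 if `validation_steps: 50`. A sketch of the relevant config section, with key names assumed from typical MotionDirector-style YAML configs (check your actual config_single_video.yaml for the exact fields):

```yaml
# Hypothetical excerpt -- key names are assumptions, not verified against
# the shipped config_single_video.yaml.
validation_steps: 300   # larger than max_train_steps (150 here) effectively
                        # skips in-training validation entirely
validation_data:
  width: 384            # smaller sample resolution lowers the peak memory
  height: 384           # used during the validation sampling pass
  num_frames: 16
```

If the crash disappears with validation effectively disabled, that would confirm the validation sampling pass, not the training loop itself, is what exhausts the 32 GB.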