CogVideo
[Help] Finetuning script hangs after checkpoint loading with no additional logs
I didn't modify the script, and I used the Disney dataset that was provided as an example.
python & torch version
Python 3.12.0 | packaged by Anaconda, Inc. | (main, Oct 2 2023, 17:29:18) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
2.4.1+cu121
>>> print(torch.cuda.is_available())
True
nvidia-smi & nvcc version
(cogvideo) root@kkhong-ttv-a100-80g-gpu:~# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
(cogvideo) root@kkhong-ttv-a100-80g-gpu:~# nvidia-smi
Mon Sep 23 18:24:34 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:A1:00.0 Off | Off |
| N/A 31C P0 63W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:A2:00.0 Off | Off |
| N/A 32C P0 56W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:B1:00.0 Off | Off |
| N/A 31C P0 60W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:B2:00.0 Off | Off |
| N/A 32C P0 62W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB On | 00000000:C1:00.0 Off | Off |
| N/A 31C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB On | 00000000:C2:00.0 Off | Off |
| N/A 32C P0 59W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB On | 00000000:D1:00.0 Off | Off |
| N/A 31C P0 63W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB On | 00000000:D2:00.0 Off | Off |
| N/A 31C P0 62W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Log
(cogvideo) root@kkhong-ttv-a100-80g-gpu:~/CogVideo/finetune# bash finetune_single_rank.sh
[W923 18:12:29.862038611 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [kkhong-ttv-a100-80g-gpu]:29500 (errno: 97 - Address family not supported by protocol).
[W923 18:12:32.457827724 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [kkhong-ttv-a100-80g-gpu]:29500 (errno: 97 - Address family not supported by protocol).
[W923 18:12:33.257230351 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [kkhong-ttv-a100-80g-gpu]:29500 (errno: 97 - Address family not supported by protocol).
[W923 18:12:33.272857716 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [kkhong-ttv-a100-80g-gpu]:29500 (errno: 97 - Address family not supported by protocol).
[W923 18:12:33.435763180 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [kkhong-ttv-a100-80g-gpu]:29500 (errno: 97 - Address family not supported by protocol).
[W923 18:12:33.504450703 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [kkhong-ttv-a100-80g-gpu]:29500 (errno: 97 - Address family not supported by protocol).
[W923 18:12:33.506889312 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [kkhong-ttv-a100-80g-gpu]:29500 (errno: 97 - Address family not supported by protocol).
[W923 18:12:33.630879569 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [kkhong-ttv-a100-80g-gpu]:29500 (errno: 97 - Address family not supported by protocol).
[W923 18:12:33.633465902 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [kkhong-ttv-a100-80g-gpu]:29500 (errno: 97 - Address family not supported by protocol).
09/23/2024 18:12:33 - INFO - __main__ - Distributed environment: DistributedType.MULTI_CPU Backend: gloo
Num processes: 8
Process index: 1
Local process index: 1
Device: cpu:0
Mixed precision type: bf16
09/23/2024 18:12:33 - INFO - __main__ - Distributed environment: DistributedType.MULTI_CPU Backend: gloo
Num processes: 8
Process index: 0
Local process index: 0
Device: cpu:0
Mixed precision type: bf16
09/23/2024 18:12:33 - INFO - __main__ - Distributed environment: DistributedType.MULTI_CPU Backend: gloo
Num processes: 8
Process index: 3
Local process index: 3
Device: cpu:0
Mixed precision type: bf16
09/23/2024 18:12:33 - INFO - __main__ - Distributed environment: DistributedType.MULTI_CPU Backend: gloo
Num processes: 8
Process index: 5
Local process index: 5
Device: cpu:0
Mixed precision type: bf16
09/23/2024 18:12:33 - INFO - __main__ - Distributed environment: DistributedType.MULTI_CPU Backend: gloo
Num processes: 8
Process index: 2
Local process index: 2
Device: cpu:0
Mixed precision type: bf16
09/23/2024 18:12:33 - INFO - __main__ - Distributed environment: DistributedType.MULTI_CPU Backend: gloo
Num processes: 8
Process index: 6
Local process index: 6
Device: cpu:0
Mixed precision type: bf16
09/23/2024 18:12:33 - INFO - __main__ - Distributed environment: DistributedType.MULTI_CPU Backend: gloo
Num processes: 8
Process index: 4
Local process index: 4
Device: cpu:0
Mixed precision type: bf16
09/23/2024 18:12:33 - INFO - __main__ - Distributed environment: DistributedType.MULTI_CPU Backend: gloo
Num processes: 8
Process index: 7
Local process index: 7
Device: cpu:0
Mixed precision type: bf16
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 14691.08it/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 13819.78it/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 13842.59it/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 15391.94it/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 15857.48it/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11966.63it/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 15477.14it/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 15797.76it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00, 2.93s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00, 2.95s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.04s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.03s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.01s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.03s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.06s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.09s/it]
{'use_learned_positional_embeddings'} was not found in config. Values will be initialized to default values.
How long have you been waiting, and how large is your dataset? I think this is probably the time spent processing the dataset. Can you check whether htop shows any memory or CPU usage?
0[| 0.7%] 4[ 0.0%] 8[|||||||||||| 56.6%] 12[ 0.0%] 16[ 0.0%] 20[ 0.0%] 24[||||||||||||||100.0%] 28[||||||||||||||100.0%] 32[||||||||||||||100.0%] 36[ 0.0%] 40[ 0.0%] 44[ 0.0%] 48[ 0.0%] 52[ 0.0%]
1[ 0.0%] 5[ 0.0%] 9[||||||||| 43.7%] 13[ 0.0%] 17[ 0.0%] 21[ 0.0%] 25[ 0.0%] 29[ 0.0%] 33[ 0.0%] 37[ 0.0%] 41[ 0.0%] 45[ 0.0%] 49[ 0.0%] 53[ 0.0%]
2[ 0.0%] 6[ 0.0%] 10[| 0.7%] 14[ 0.0%] 18[ 0.0%] 22[ 0.0%] 26[ 0.0%] 30[||||||||||||||100.0%] 34[ 0.0%] 38[||||||||||||||100.0%] 42[||||||||||||||100.0%] 46[ 0.0%] 50[ 0.0%] 54[| 0.7%]
3[| 0.7%] 7[ 0.0%] 11[ 0.0%] 15[ 0.0%] 19[ 0.0%] 23[||||||||||||||100.0%] 27[ 0.0%] 31[ 0.0%] 35[ 0.0%] 39[ 0.0%] 43[ 0.0%] 47[ 0.0%] 51[ 0.0%] 55[ 0.0%]
Mem[|||||||||||||||||||||||||||||||| 262G/1.88T] Tasks: 61, 194 thr; 9 running
Swp[ 0K/0K] Load average: 8.17 5.85 2.89
Uptime: 01:02:55
PID USER PRI NI VIRT RES SHR S CPU%▽MEM% TIME+ Command
6030 root 20 0 1246M 74484 42044 S 0.0 0.0 0:00.00 /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/node /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/out/bootstrap-fork --type=ptyHost --logsPath /root/.vscode-server/data/logs/20240923T191304
6558 root 20 0 972M 53488 39984 S 0.0 0.0 0:00.29 /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/node /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/extensions/json-language-features/server/dist/node/jsonServerMain --node-ipc --clientProcessId=5938
6559 root 20 0 972M 53488 39984 S 0.0 0.0 0:00.00 /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/node /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/extensions/json-language-features/server/dist/node/jsonServerMain --node-ipc --clientProcessId=5938
6560 root 20 0 972M 53488 39984 S 0.0 0.0 0:00.00 /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/node /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/extensions/json-language-features/server/dist/node/jsonServerMain --node-ipc --clientProcessId=5938
6561 root 20 0 972M 53488 39984 S 0.0 0.0 0:00.00 /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/node /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/extensions/json-language-features/server/dist/node/jsonServerMain --node-ipc --clientProcessId=5938
6562 root 20 0 972M 53488 39984 S 0.0 0.0 0:00.00 /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/node /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/extensions/json-language-features/server/dist/node/jsonServerMain --node-ipc --clientProcessId=5938
6563 root 20 0 972M 53488 39984 S 0.0 0.0 0:00.00 /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/node /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/extensions/json-language-features/server/dist/node/jsonServerMain --node-ipc --clientProcessId=5938
6564 root 20 0 972M 53488 39984 S 0.0 0.0 0:00.00 /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/node /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/extensions/json-language-features/server/dist/node/jsonServerMain --node-ipc --clientProcessId=5938
17450 root 20 0 7768 4608 3828 S 0.0 0.0 0:00.00 /bin/bash --init-file /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/out/vs/workbench/contrib/terminal/browser/media/shellIntegration-bash.sh
17451 root 20 0 1246M 74484 42044 S 0.0 0.0 0:00.00 /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/node /root/.vscode-server/cli/servers/Stable-38c31bc77e0dd6ae88a4e9cc93428cc27a56ba40/server/out/bootstrap-fork --type=ptyHost --logsPath /root/.vscode-server/data/logs/20240923T191304
18459 root 20 0 288M 20688 17784 S 0.0 0.0 0:00.03 /usr/libexec/packagekitd
18460 root 20 0 288M 20688 17784 S 0.0 0.0 0:00.00 /usr/libexec/packagekitd
18461 root 20 0 288M 20688 17784 S 0.0 0.0 0:00.00 /usr/libexec/packagekitd
35042 root 20 0 7504 3488 3216 S 0.0 0.0 0:00.00 bash finetune_single_rank.sh
35043 root 20 0 4901M 383M 162M S 0.0 0.0 0:06.60 /root/miniconda3/envs/cogvideo/bin/python /root/miniconda3/envs/cogvideo/bin/accelerate launch --config_file accelerate_config_machine_single.yaml --multi_gpu train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGenerat
35112 root 20 0 4901M 383M 162M S 0.0 0.0 0:00.00 /root/miniconda3/envs/cogvideo/bin/python /root/miniconda3/envs/cogvideo/bin/accelerate launch --config_file accelerate_config_machine_single.yaml --multi_gpu train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGenerat
35185 root 20 0 37.4G 33.3G 239M S 0.0 1.7 0:00.24 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35197 root 20 0 37.4G 33.3G 237M S 0.0 1.7 0:00.23 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35199 root 20 0 37.2G 33.0G 239M S 0.0 1.7 0:00.27 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35201 root 20 0 37.4G 33.3G 239M S 0.0 1.7 0:00.25 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35203 root 20 0 37.4G 33.3G 239M S 0.7 1.7 0:00.30 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35205 root 20 0 37.4G 33.3G 239M S 0.0 1.7 0:00.28 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35209 root 20 0 37.2G 33.1G 238M S 0.0 1.7 0:00.23 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35210 root 20 0 37.4G 33.3G 239M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35211 root 20 0 37.4G 33.3G 239M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35212 root 20 0 37.4G 33.3G 239M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35213 root 20 0 37.4G 33.3G 239M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35214 root 20 0 37.4G 33.3G 236M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35215 root 20 0 37.4G 33.3G 236M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35216 root 20 0 37.2G 33.1G 238M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35217 root 20 0 37.2G 33.1G 238M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35218 root 20 0 37.4G 33.3G 237M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35219 root 20 0 37.4G 33.3G 237M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35220 root 20 0 37.4G 33.3G 239M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35221 root 20 0 37.4G 33.3G 239M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35222 root 20 0 37.2G 33.0G 239M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35223 root 20 0 37.4G 33.3G 239M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35224 root 20 0 37.4G 33.3G 239M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35225 root 20 0 37.2G 33.0G 239M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35244 root 20 0 37.4G 33.3G 239M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35245 root 20 0 37.4G 33.3G 239M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35246 root 20 0 37.4G 33.3G 239M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35247 root 20 0 37.4G 33.3G 236M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35248 root 20 0 37.4G 33.3G 237M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35249 root 20 0 37.2G 33.1G 238M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35250 root 20 0 37.4G 33.3G 239M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
35251 root 20 0 37.2G 33.0G 239M S 0.0 1.7 0:00.00 /root/miniconda3/envs/cogvideo/bin/python -u train_cogvideox_lora.py --gradient_checkpointing --pretrained_model_name_or_path THUDM/CogVideoX-2b --cache_dir ~/.cache --enable_tiling --enable_slicing --instance_data_root /mnt/a/Disney-VideoGeneration-Dataset --caption_column prompt.txt --video_column videos.txt --validation_prompt DISNEY A black and white ani
38544 root 20 0 12500 8648 3492 R 0.7 0.0 0:03.09 htop
46304 root 20 0 5772 996 904 S 0.0 0.0 0:00.00 sleep 180
F1Help F2Setup F3SearchF4FilterF5Tree F6SortByF7Nice -F8Nice +F9Kill F10Quit
Closing!
I resolved the issue by modifying the CUDA_VISIBLE_DEVICES variable in finetune_single_rank.sh.
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
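For anyone hitting the same hang: the log earlier in this thread reports `DistributedType.MULTI_CPU`, `Backend: gloo` and `Device: cpu:0` for every process, which suggests the training processes could not see any GPU. That would be consistent with the CUDA_VISIBLE_DEVICES fix, although the thread does not say what the variable was set to before. Below is a minimal diagnostic sketch under that assumption. The helper names `visible_gpu_ids`, `fell_back_to_cpu` and `report` are hypothetical and are not part of CogVideo or Accelerate, and the parser is a simplification of how CUDA actually reads the variable.

```python
import os
import re


def visible_gpu_ids(value):
    """Split a CUDA_VISIBLE_DEVICES value into its comma-separated entries.

    Simplified on purpose: None (variable unset) means no restriction is
    applied, and an empty string means every GPU is hidden.
    """
    if value is None:
        return None
    return [part.strip() for part in value.split(",") if part.strip()]


def fell_back_to_cpu(log_text):
    """Return True if the startup log reports a CPU-only distributed setup.

    Matches the banner lines seen in this thread's log
    ("DistributedType.MULTI_CPU" or "Device: cpu...").
    """
    return bool(re.search(r"DistributedType\.MULTI_CPU|Device: cpu", log_text))


def report():
    """Print what the current process is allowed to see."""
    ids = visible_gpu_ids(os.environ.get("CUDA_VISIBLE_DEVICES"))
    print("CUDA_VISIBLE_DEVICES:", "unset (no restriction)" if ids is None else ids)
    try:
        import torch  # only available inside the training environment
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print("torch.cuda.device_count():", torch.cuda.device_count())


if __name__ == "__main__":
    report()
```

Running this in the same shell and conda environment used for `bash finetune_single_rank.sh` should show whether the GPU count matches the 8 processes being launched. If the startup banner still reports `Device: cpu:0` after changing the variable, the cause is probably something else.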
May I ask what your training speed per step is?