
Training stage2 interruption

Open · Little-devil1 opened this issue 9 months ago · 4 comments

Hello, I am trying to train on H100 using the /configs/diffusion/train/stage2.py configuration. Because of GPU memory limits, I set the 768px batch size in stage2.py to 1 (e.g. 81: (1.0, 1)), but training is interrupted when the model is saved, as shown in the figure below.

Image
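(For context, each entry in these bucket configs appears to map a frame count to a (sampling_probability, batch_size) pair per resolution, so the change described above amounts to roughly the following; this is a sketch, not the full stage2.py.)

bucket_config = {
    "768px": {
        # 81-frame bucket: keep probability 1.0, per-GPU batch size reduced to 1
        81: (1.0, 1),
        # ... other frame buckets left as in the original stage2.py
    },
}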

Little-devil1 · Mar 17 '25

Could you share your revised stage2.py? Are there any other modifications besides this file?

SimonWXW · Mar 18 '25

@SimonWXW Since the video data I use is all 768px and longer than 2 seconds, the original batch sizes cause GPU memory overflow, so I reduced the batch sizes for the 49-129 frame buckets. The full configuration is below; no other files were modified.

_base_ = ["image.py"]

# new config
grad_ckpt_settings = (8, 100)
grad_ckpt_buffer_size = 25 * 1024**3  # 25GB
plugin = "hybrid"
plugin_config = dict(
    tp_size=1,
    pp_size=1,
    sp_size=4,
    sequence_parallelism_mode="ring_attn",
    enable_sequence_parallelism=True,
    static_graph=True,
    zero_stage=2,
)

bucket_config = {
    "_delete_": True,
    "256px": {
        1: (1.0, 28),
        5: (1.0, 14), 9: (1.0, 14), 13: (1.0, 14), 17: (1.0, 14), 21: (1.0, 14), 25: (1.0, 14), 29: (1.0, 14), 33: (1.0, 14),
        37: (1.0, 10), 41: (1.0, 10), 45: (1.0, 10), 49: (1.0, 10), 53: (1.0, 10), 57: (1.0, 10), 61: (1.0, 10), 65: (1.0, 10),
        73: (1.0, 7), 77: (1.0, 7), 81: (1.0, 7), 85: (1.0, 7), 89: (1.0, 7), 93: (1.0, 7), 97: (1.0, 7),
        101: (1.0, 6), 105: (1.0, 6), 109: (1.0, 6), 113: (1.0, 6), 117: (1.0, 6), 121: (1.0, 6), 125: (1.0, 6), 129: (1.0, 6),
    },
    "768px": {
        1: (1.0, 38),
        5: (1.0, 6), 9: (1.0, 6), 13: (1.0, 6), 17: (1.0, 6), 21: (1.0, 6), 25: (1.0, 6), 29: (1.0, 6), 33: (1.0, 6),
        37: (1.0, 4), 41: (1.0, 4), 45: (1.0, 4),
        49: (1.0, 1), 53: (1.0, 1), 57: (1.0, 1), 61: (1.0, 1), 65: (1.0, 1), 69: (1.0, 1), 73: (1.0, 1), 77: (1.0, 1),
        81: (1.0, 1), 85: (1.0, 1), 89: (1.0, 1), 93: (1.0, 1), 97: (1.0, 1),
        101: (1.0, 1), 105: (1.0, 1), 109: (1.0, 1), 113: (1.0, 1), 117: (1.0, 1), 121: (1.0, 1), 125: (1.0, 1), 129: (1.0, 1),
    },
}

model = dict(grad_ckpt_settings=grad_ckpt_settings)
lr = 5e-5
optim = dict(lr=lr)
ckpt_every = 20
keep_n_latest = 20
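(A note on the "_delete_" key above: Open-Sora configs inherit from the files listed in _base_, and this key presumably tells the loader to replace the inherited bucket_config entirely instead of merging into it, following the common mmengine-style convention. A minimal sketch of that assumed merge rule, illustrative only:)

# Assumed semantics of _base_ inheritance with a "_delete_" marker; Open-Sora's actual loader may differ.
def merge_config(base: dict, override: dict) -> dict:
    if override.get("_delete_", False):
        # replace the inherited value entirely (minus the marker key)
        return {k: v for k, v in override.items() if k != "_delete_"}
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            merged[key] = merge_config(base[key], value)  # recursive merge for nested dicts
        else:
            merged[key] = value
    return merged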

Little-devil1 · Mar 20 '25

Thanks for sharing, we will try to solve it next week.

SimonWXW · Mar 21 '25

@SimonWXW Alright, thank you very much! 😊

Little-devil1 · Mar 21 '25

Hey @Little-devil1, this should now be resolved; you are welcome to pull the main branch and try again.

botbw · Mar 27 '25

Hello, after pulling the main branch and running a training test on 8xH100 (80G) with the configuration above, the same failure still appears. Is this an 8xH100 (80G) memory limitation?

Image

Little-devil1 · Mar 28 '25

> Hello, after pulling the main branch and running a training test on 8xH100 (80G) with the configuration above, the same failure still appears. Is this an 8xH100 (80G) memory limitation? Image

@Little-devil1 Have you tried increasing the parallelism degree, for example changing sp to 8?
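(Concretely, on a single 8-GPU node that suggestion would look roughly like the plugin_config change below; with tp = pp = 1 and sp = 8, that presumably leaves a data-parallel size of world_size / (tp * pp * sp) = 1.)

plugin_config = dict(
    tp_size=1,
    pp_size=1,
    sp_size=8,  # shard each sequence across all 8 GPUs instead of 4
    sequence_parallelism_mode="ring_attn",
    enable_sequence_parallelism=True,
    static_graph=True,
    zero_stage=2,
)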

botbw · Mar 28 '25

> Hello, after pulling the main branch and running a training test on 8xH100 (80G) with the configuration above, the same failure still appears. Is this an 8xH100 (80G) memory limitation? Image

> @Little-devil1 Have you tried increasing the parallelism degree, for example changing sp to 8?

Hello, I tried setting the parallelism sp to 8 today, but there was still a dimension mismatch, shown in the figure below; the mismatched dimension is half of yesterday's, down to 24576. I then tried reducing sp to 1, but the 8xH100 (80G) setup ran out of memory on a single card and training could not start.

Image
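(For what it's worth, the factor-of-two change in the mismatched dimension is consistent with how sequence-parallel sharding works: with ring-attention SP each of the sp ranks holds roughly seq_len / sp tokens, so going from sp = 4 to sp = 8 halves the per-rank length, e.g. 49152 -> 24576. A rough sketch under that assumption:)

# Assumption: ring-attention sequence parallelism splits the sequence evenly across sp ranks.
def per_rank_tokens(total_tokens: int, sp_size: int) -> int:
    assert total_tokens % sp_size == 0, "sequence length must be divisible by sp_size"
    return total_tokens // sp_size

# For a hypothetical full sequence of 196608 tokens, this reproduces the observed shard sizes:
assert per_rank_tokens(196608, 4) == 49152
assert per_rank_tokens(196608, 8) == 24576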

Little-devil1 · Mar 29 '25

> Hello, after pulling the main branch and running a training test on 8xH100 (80G) with the configuration above, the same failure still appears. Is this an 8xH100 (80G) memory limitation? Image

> @Little-devil1 Have you tried increasing the parallelism degree, for example changing sp to 8?

> Hello, I tried setting the parallelism sp to 8 today, but there was still a dimension mismatch, shown in the figure below; the mismatched dimension is half of yesterday's, down to 24576. I then tried reducing sp to 1, but the 8xH100 (80G) setup ran out of memory on a single card and training could not start.
>
> Image

@Little-devil1 Some updates:

sp = 4

I tried the config above with the command:

torchrun --nproc_per_node 8 scripts/diffusion/train.py configs/diffusion/train/stage2.py --dataset.data-path $DATASET_CSV

Distributed info:

 'plugin_config': {'enable_sequence_parallelism': True,
                   'overlap_allgather': False,
                   'pp_size': 1,
                   'reduce_bucket_size_in_m': 128,
                   'sequence_parallelism_mode': 'ring_attn',
                   'sp_size': 4,
                   'static_graph': True,
                   'tp_size': 1,
                   'zero_stage': 2},

_base_ = ["image.py"]

# new config
grad_ckpt_settings = (8, 100)
grad_ckpt_buffer_size = 25 * 1024**3  # 25GB
plugin = "hybrid"
plugin_config = dict(
    tp_size=1,
    pp_size=1,
    sp_size=4,
    sequence_parallelism_mode="ring_attn",
    enable_sequence_parallelism=True,
    static_graph=True,
    zero_stage=2,
)

bucket_config = {
    "_delete_": True,
    "256px": {
        1: (1.0, 28),
        5: (1.0, 14),
        9: (1.0, 14),
        13: (1.0, 14),
        17: (1.0, 14),
        21: (1.0, 14),
        25: (1.0, 14),
        29: (1.0, 14),
        33: (1.0, 14),
        37: (1.0, 10),
        41: (1.0, 10),
        45: (1.0, 10),
        49: (1.0, 10),
        53: (1.0, 10),
        57: (1.0, 10),
        61: (1.0, 10),
        65: (1.0, 10),
        73: (1.0, 7),
        77: (1.0, 7),
        81: (1.0, 7),
        85: (1.0, 7),
        89: (1.0, 7),
        93: (1.0, 7),
        97: (1.0, 7),
        101: (1.0, 6),
        105: (1.0, 6),
        109: (1.0, 6),
        113: (1.0, 6),
        117: (1.0, 6),
        121: (1.0, 6),
        125: (1.0, 6),
        129: (1.0, 6),
    },
    "768px": {
        1: (1.0, 38),
        5: (1.0, 6),
        9: (1.0, 6),
        13: (1.0, 6),
        17: (1.0, 6),
        21: (1.0, 6),
        25: (1.0, 6),
        29: (1.0, 6),
        33: (1.0, 6),
        37: (1.0, 4),
        41: (1.0, 4),
        45: (1.0, 4),
        49: (1.0, 1),
        53: (1.0, 1),
        57: (1.0, 1),
        61: (1.0, 1),
        65: (1.0, 1),
        69: (1.0, 1),
        73: (1.0, 1),
        77: (1.0, 1),
        81: (1.0, 1),
        85: (1.0, 1),
        89: (1.0, 1),
        93: (1.0, 1),
        97: (1.0, 1),
        101: (1.0, 1),
        105: (1.0, 1),
        109: (1.0, 1),
        113: (1.0, 1),
        117: (1.0, 1),
        121: (1.0, 1),
        125: (1.0, 1),
        129: (1.0, 1),
    },
}

model = dict(grad_ckpt_settings=grad_ckpt_settings)
lr = 5e-5
optim = dict(lr=lr)
ckpt_every = 1
keep_n_latest = 1

So I was able to run training as well as checkpoint saving on 8 GPUs; some memory statistics are:

[2025-03-30 15:25:35] CUDA memory usage at diffusion: 22.1 GB
[2025-03-30 15:25:35] No EMA model created.
[2025-03-30 15:25:35] CUDA memory usage at EMA: 22.1 GB
[2025-03-30 15:25:35] CUDA memory usage at autoencoder: 22.3 GB
[2025-03-30 15:27:23] CUDA memory usage at t5: 31.2 GB
[2025-03-30 15:27:25] CUDA memory usage at optimizer: 31.4 GB
[2025-03-30 15:27:26] CUDA memory usage at boost: 37.1 GB
[2025-03-30 15:34:13] CUDA max memory max memory allocated at final: 57.0 GB
[2025-03-30 15:34:13] CUDA max memory max memory reserved at final: 65.5 GB

Ideally, 8xH800 should be able to run the training script using sp=4 without OOM.

Also tried some other settings, which you can refer to:

  • sp = 8
[2025-03-30 15:49:29] CUDA max memory max memory allocated at final: 57.0 GB
[2025-03-30 15:49:29] CUDA max memory max memory reserved at final: 60.0 GB
  • sp = 1
[2025-03-30 16:10:33] CUDA max memory max memory allocated at final: 56.9 GB
[2025-03-30 16:10:33] CUDA max memory max memory reserved at final: 100.9 GB
  • sp = 4, tp = 2
[2025-03-30 16:24:10] CUDA max memory max memory allocated at final: 55.9 GB
[2025-03-30 16:24:10] CUDA max memory max memory reserved at final: 57.9 GB
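
(For reference, figures like these can be reproduced with PyTorch's standard CUDA memory counters; a minimal logging helper, assuming the stats are read at the end of training:)

import torch

def log_cuda_memory(tag: str) -> None:
    # Peak memory allocated to tensors vs. memory reserved by the caching allocator.
    allocated_gb = torch.cuda.max_memory_allocated() / 1024**3
    reserved_gb = torch.cuda.max_memory_reserved() / 1024**3
    print(f"CUDA max memory allocated at {tag}: {allocated_gb:.1f} GB")
    print(f"CUDA max memory reserved at {tag}: {reserved_gb:.1f} GB")

# e.g. log_cuda_memory("final") after the training loop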

botbw · Mar 30 '25

@botbw Thank you for your detailed explanation. Everything now works and checkpoints are saved successfully. Excellent!

Little-devil1 · Mar 31 '25