
Training stage2 interruption

Open · Little-devil1 opened this issue 9 months ago · 4 comments

Hello, I am trying to train on H100 using the /configs/diffusion/train/stage2.py configuration. Because of GPU memory limits, I set the 768px batch size in stage2.py to 1 (e.g. 81: (1.0, 1)), but training is interrupted when the model is saved, as shown in the figure below.

Image
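(For context, each entry in these bucket configs appears to map a frame count to a (sampling_probability, batch_size) pair per resolution, so the change described above amounts to roughly the following; this is a sketch, not the full stage2.py.)

bucket_config = {
    "768px": {
        # 81-frame bucket: keep probability 1.0, per-GPU batch size reduced to 1
        81: (1.0, 1),
        # ... other frame buckets left as in the original stage2.py
    },
}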

Little-devil1 · Mar 17 '25

Could you share your revised stage2.py? Are there any other modifications besides this file?

SimonWXW · Mar 18 '25

@SimonWXW Since the video data I use is all 768px and longer than 2 seconds, the original batch sizes cause GPU memory overflow, so I reduced the batch sizes for the 49-129 frame buckets. The full configuration is below; no other files were modified.

_base_ = ["image.py"]

# new config
grad_ckpt_settings = (8, 100)
grad_ckpt_buffer_size = 25 * 1024**3  # 25GB
plugin = "hybrid"
plugin_config = dict(
    tp_size=1,
    pp_size=1,
    sp_size=4,
    sequence_parallelism_mode="ring_attn",
    enable_sequence_parallelism=True,
    static_graph=True,
    zero_stage=2,
)

bucket_config = {
    "_delete_": True,
    "256px": {
        1: (1.0, 28),
        5: (1.0, 14), 9: (1.0, 14), 13: (1.0, 14), 17: (1.0, 14), 21: (1.0, 14), 25: (1.0, 14), 29: (1.0, 14), 33: (1.0, 14),
        37: (1.0, 10), 41: (1.0, 10), 45: (1.0, 10), 49: (1.0, 10), 53: (1.0, 10), 57: (1.0, 10), 61: (1.0, 10), 65: (1.0, 10),
        73: (1.0, 7), 77: (1.0, 7), 81: (1.0, 7), 85: (1.0, 7), 89: (1.0, 7), 93: (1.0, 7), 97: (1.0, 7),
        101: (1.0, 6), 105: (1.0, 6), 109: (1.0, 6), 113: (1.0, 6), 117: (1.0, 6), 121: (1.0, 6), 125: (1.0, 6), 129: (1.0, 6),
    },
    "768px": {
        1: (1.0, 38),
        5: (1.0, 6), 9: (1.0, 6), 13: (1.0, 6), 17: (1.0, 6), 21: (1.0, 6), 25: (1.0, 6), 29: (1.0, 6), 33: (1.0, 6),
        37: (1.0, 4), 41: (1.0, 4), 45: (1.0, 4),
        49: (1.0, 1), 53: (1.0, 1), 57: (1.0, 1), 61: (1.0, 1), 65: (1.0, 1), 69: (1.0, 1), 73: (1.0, 1), 77: (1.0, 1),
        81: (1.0, 1), 85: (1.0, 1), 89: (1.0, 1), 93: (1.0, 1), 97: (1.0, 1),
        101: (1.0, 1), 105: (1.0, 1), 109: (1.0, 1), 113: (1.0, 1), 117: (1.0, 1), 121: (1.0, 1), 125: (1.0, 1), 129: (1.0, 1),
    },
}

model = dict(grad_ckpt_settings=grad_ckpt_settings)
lr = 5e-5
optim = dict(lr=lr)
ckpt_every = 20
keep_n_latest = 20
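(A note on the "_delete_" key above: Open-Sora configs inherit from the files listed in _base_, and this key presumably tells the loader to replace the inherited bucket_config entirely instead of merging into it, following the common mmengine-style convention. A minimal sketch of that assumed merge rule, illustrative only:)

# Assumed semantics of _base_ inheritance with a "_delete_" marker; Open-Sora's actual loader may differ.
def merge_config(base: dict, override: dict) -> dict:
    if override.get("_delete_", False):
        # replace the inherited value entirely (minus the marker key)
        return {k: v for k, v in override.items() if k != "_delete_"}
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            merged[key] = merge_config(base[key], value)  # recursive merge for nested dicts
        else:
            merged[key] = value
    return merged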

Little-devil1 · Mar 20 '25

Thanks for sharing, we will try to solve it next week.

SimonWXW · Mar 21 '25

@SimonWXW Alright, thank you very much! 😊

Little-devil1 · Mar 21 '25

Hey @Little-devil1, this should now be resolved; you are welcome to pull the main branch and try again.

botbw · Mar 27 '25

Hello, after pulling the main branch and running a training test on 8xH100 (80G) with the configuration above, the same failure still appears. Is this an 8xH100 (80G) memory limitation?

Image

Little-devil1 · Mar 28 '25

> Hello, after pulling the main branch and running a training test on 8xH100 (80G) with the configuration above, the same failure still appears. Is this an 8xH100 (80G) memory limitation? Image

@Little-devil1 Have you tried increasing the parallelism degree, for example changing sp to 8?
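(Concretely, on a single 8-GPU node that suggestion would look roughly like the plugin_config change below; with tp = pp = 1 and sp = 8, that presumably leaves a data-parallel size of world_size / (tp * pp * sp) = 1.)

plugin_config = dict(
    tp_size=1,
    pp_size=1,
    sp_size=8,  # shard each sequence across all 8 GPUs instead of 4
    sequence_parallelism_mode="ring_attn",
    enable_sequence_parallelism=True,
    static_graph=True,
    zero_stage=2,
)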

botbw · Mar 28 '25

> Hello, after pulling the main branch and running a training test on 8xH100 (80G) with the configuration above, the same failure still appears. Is this an 8xH100 (80G) memory limitation? Image

> @Little-devil1 Have you tried increasing the parallelism degree, for example changing sp to 8?

Hello, I tried setting the parallelism sp to 8 today, but there was still a dimension mismatch, shown in the figure below; the mismatched dimension is half of yesterday's, down to 24576. I then tried reducing sp to 1, but the 8xH100 (80G) setup ran out of memory on a single card and training could not start.

Image
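(For what it's worth, the factor-of-two change in the mismatched dimension is consistent with how sequence-parallel sharding works: with ring-attention SP each of the sp ranks holds roughly seq_len / sp tokens, so going from sp = 4 to sp = 8 halves the per-rank length, e.g. 49152 -> 24576. A rough sketch under that assumption:)

# Assumption: ring-attention sequence parallelism splits the sequence evenly across sp ranks.
def per_rank_tokens(total_tokens: int, sp_size: int) -> int:
    assert total_tokens % sp_size == 0, "sequence length must be divisible by sp_size"
    return total_tokens // sp_size

# For a hypothetical full sequence of 196608 tokens, this reproduces the observed shard sizes:
assert per_rank_tokens(196608, 4) == 49152
assert per_rank_tokens(196608, 8) == 24576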

Little-devil1 · Mar 29 '25

> Hello, after pulling the main branch and running a training test on 8xH100 (80G) with the configuration above, the same failure still appears. Is this an 8xH100 (80G) memory limitation? Image

> @Little-devil1 Have you tried increasing the parallelism degree, for example changing sp to 8?

> Hello, I tried setting the parallelism sp to 8 today, but there was still a dimension mismatch, shown in the figure below; the mismatched dimension is half of yesterday's, down to 24576. I then tried reducing sp to 1, but the 8xH100 (80G) setup ran out of memory on a single card and training could not start.
>
> Image

@Little-devil1 Some updates:

sp = 4

I tried the config above with the command:

torchrun --nproc_per_node 8 scripts/diffusion/train.py configs/diffusion/train/stage2.py --dataset.data-path $DATASET_CSV

Distributed info:

 'plugin_config': {'enable_sequence_parallelism': True,
                   'overlap_allgather': False,
                   'pp_size': 1,
                   'reduce_bucket_size_in_m': 128,
                   'sequence_parallelism_mode': 'ring_attn',
                   'sp_size': 4,
                   'static_graph': True,
                   'tp_size': 1,
                   'zero_stage': 2},

_base_ = ["image.py"]

# new config
grad_ckpt_settings = (8, 100)
grad_ckpt_buffer_size = 25 * 1024**3  # 25GB
plugin = "hybrid"
plugin_config = dict(
    tp_size=1,
    pp_size=1,
    sp_size=4,
    sequence_parallelism_mode="ring_attn",
    enable_sequence_parallelism=True,
    static_graph=True,
    zero_stage=2,
)

bucket_config = {
    "_delete_": True,
    "256px": {
        1: (1.0, 28),
        5: (1.0, 14),
        9: (1.0, 14),
        13: (1.0, 14),
        17: (1.0, 14),
        21: (1.0, 14),
        25: (1.0, 14),
        29: (1.0, 14),
        33: (1.0, 14),
        37: (1.0, 10),
        41: (1.0, 10),
        45: (1.0, 10),
        49: (1.0, 10),
        53: (1.0, 10),
        57: (1.0, 10),
        61: (1.0, 10),
        65: (1.0, 10),
        73: (1.0, 7),
        77: (1.0, 7),
        81: (1.0, 7),
        85: (1.0, 7),
        89: (1.0, 7),
        93: (1.0, 7),
        97: (1.0, 7),
        101: (1.0, 6),
        105: (1.0, 6),
        109: (1.0, 6),
        113: (1.0, 6),
        117: (1.0, 6),
        121: (1.0, 6),
        125: (1.0, 6),
        129: (1.0, 6),
    },
    "768px": {
        1: (1.0, 38),
        5: (1.0, 6),
        9: (1.0, 6),
        13: (1.0, 6),
        17: (1.0, 6),
        21: (1.0, 6),
        25: (1.0, 6),
        29: (1.0, 6),
        33: (1.0, 6),
        37: (1.0, 4),
        41: (1.0, 4),
        45: (1.0, 4),
        49: (1.0, 1),
        53: (1.0, 1),
        57: (1.0, 1),
        61: (1.0, 1),
        65: (1.0, 1),
        69: (1.0, 1),
        73: (1.0, 1),
        77: (1.0, 1),
        81: (1.0, 1),
        85: (1.0, 1),
        89: (1.0, 1),
        93: (1.0, 1),
        97: (1.0, 1),
        101: (1.0, 1),
        105: (1.0, 1),
        109: (1.0, 1),
        113: (1.0, 1),
        117: (1.0, 1),
        121: (1.0, 1),
        125: (1.0, 1),
        129: (1.0, 1),
    },
}

model = dict(grad_ckpt_settings=grad_ckpt_settings)
lr = 5e-5
optim = dict(lr=lr)
ckpt_every = 1
keep_n_latest = 1

So I was able to run training as well as checkpoint saving on 8 GPUs; some memory statistics are:

[2025-03-30 15:25:35] CUDA memory usage at diffusion: 22.1 GB
[2025-03-30 15:25:35] No EMA model created.
[2025-03-30 15:25:35] CUDA memory usage at EMA: 22.1 GB
[2025-03-30 15:25:35] CUDA memory usage at autoencoder: 22.3 GB
[2025-03-30 15:27:23] CUDA memory usage at t5: 31.2 GB
[2025-03-30 15:27:25] CUDA memory usage at optimizer: 31.4 GB
[2025-03-30 15:27:26] CUDA memory usage at boost: 37.1 GB
[2025-03-30 15:34:13] CUDA max memory max memory allocated at final: 57.0 GB
[2025-03-30 15:34:13] CUDA max memory max memory reserved at final: 65.5 GB

Ideally, 8xH800 should be able to run the training script using sp=4 without OOM.

Also tried some other settings, which you can refer to:

  • sp = 8
[2025-03-30 15:49:29] CUDA max memory max memory allocated at final: 57.0 GB
[2025-03-30 15:49:29] CUDA max memory max memory reserved at final: 60.0 GB
  • sp = 1
[2025-03-30 16:10:33] CUDA max memory max memory allocated at final: 56.9 GB
[2025-03-30 16:10:33] CUDA max memory max memory reserved at final: 100.9 GB
  • sp = 4, tp = 2
[2025-03-30 16:24:10] CUDA max memory max memory allocated at final: 55.9 GB
[2025-03-30 16:24:10] CUDA max memory max memory reserved at final: 57.9 GB
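
(For reference, figures like these can be reproduced with PyTorch's standard CUDA memory counters; a minimal logging helper, assuming the stats are read at the end of training:)

import torch

def log_cuda_memory(tag: str) -> None:
    # Peak memory allocated to tensors vs. memory reserved by the caching allocator.
    allocated_gb = torch.cuda.max_memory_allocated() / 1024**3
    reserved_gb = torch.cuda.max_memory_reserved() / 1024**3
    print(f"CUDA max memory allocated at {tag}: {allocated_gb:.1f} GB")
    print(f"CUDA max memory reserved at {tag}: {reserved_gb:.1f} GB")

# e.g. log_cuda_memory("final") after the training loop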

botbw · Mar 30 '25

@botbw Thank you for your detailed explanation. Everything now works and checkpoints are saved successfully. Excellent!

Little-devil1 · Mar 31 '25