
How to speed up the inference in stepvideo_text_to_video_low_vram.py in Stepvideo model?

Open tiga-dudu opened this issue 10 months ago • 6 comments

pipe.enable_vram_management(num_persistent_param_in_dit=0) In the code above, if I set num_persistent_param_in_dit to None, GPU memory usage explodes. With 0, GPU memory usage is about 7700 MB and each inference step takes 25 s + 25 s. I then set num_persistent_param_in_dit to a very large value, e.g. num_persistent_param_in_dit=9000000000; GPU memory usage rises to 23 GB+, but inference does not get any faster (not even slightly) and is still 25 s + 25 s. Why is this? How can I speed up inference? Can anyone help me?

I also noticed that stepvideo_text_to_video.py and stepvideo_text_to_video_low_vram.py appear to be identical except for num_persistent_param_in_dit: one uses None and the other 0. So why are their comments different? stepvideo_text_to_video.py says "This model requires 80G VRAM," while stepvideo_text_to_video_low_vram.py says "This model requires 24G VRAM." Can anyone help me?

tiga-dudu avatar Feb 24 '25 07:02 tiga-dudu

Let me explain it to you.

  • What is num_persistent_param_in_dit?

Due to the enormous size of this model, we must use layer-by-layer offloading for inference. However, offloading causes frequent communication between system memory and GPU memory, which costs computation speed. Therefore, we provide the parameter num_persistent_param_in_dit. When this parameter is set to 1,000,000,000, about 1 billion model parameters will persist in GPU memory, avoiding transfers to and from system memory. In theory, the larger this value, the more GPU memory is required, but the faster inference runs.
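To make the memory trade-off concrete, here is a rough back-of-the-envelope sketch (not a measurement, and not part of DiffSynth-Studio): it assumes each persistent parameter is stored in 16-bit precision (bf16/fp16, 2 bytes), which is typical for inference.

```python
def persistent_vram_gib(num_persistent_params: int, bytes_per_param: int = 2) -> float:
    """Estimate GPU memory held by persistent DiT parameters.

    Assumes 16-bit weights (2 bytes per parameter); adjust bytes_per_param
    for other dtypes. A rough model only -- activations, the VAE, and the
    text encoder consume additional VRAM on top of this.
    """
    return num_persistent_params * bytes_per_param / 1024**3

# 1 billion persistent parameters at bf16 is roughly 1.86 GiB of VRAM,
# while 9 billion (as in the question above) is roughly 16.8 GiB.
print(round(persistent_vram_gib(1_000_000_000), 2))  # → 1.86
```

Under these assumptions, num_persistent_param_in_dit=9000000000 by itself accounts for most of the 23 GB+ usage the question reports.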

  • How much does num_persistent_param_in_dit affect speed?

This depends on the bandwidth between system memory and GPU memory. With high-frequency, multi-channel memory, transfers are fast enough that num_persistent_param_in_dit has almost no noticeable impact on speed. On machines with weaker memory subsystems, however, the impact is more significant.
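This reasoning can be illustrated with a toy timing model (all numbers below are illustrative assumptions, not measurements of StepVideo): per-step time is compute time plus the time to stream the offloaded weights from system RAM to the GPU. When the host-to-device bandwidth is high, the transfer term is small relative to compute, so keeping more parameters resident barely helps.

```python
def step_time_s(compute_s: float, offloaded_bytes: float, h2d_bandwidth_gbs: float) -> float:
    """Toy model of one denoising step with layer-by-layer offloading:
    total = compute time + time to copy offloaded weights host-to-device."""
    transfer_s = offloaded_bytes / (h2d_bandwidth_gbs * 1e9)
    return compute_s + transfer_s

offloaded = 60e9  # assumed ~30B params at bf16, fully offloaded (illustrative)
fast_host = step_time_s(20.0, offloaded, 25.0)  # fast multi-channel memory
slow_host = step_time_s(20.0, offloaded, 8.0)   # slower memory subsystem
print(round(fast_host, 1), round(slow_host, 1))  # → 22.4 27.5
```

On the fast host, eliminating the transfer entirely would save only ~2.4 s per step, which matches the observation in this thread that raising num_persistent_param_in_dit sometimes yields no visible speedup.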

  • What is the difference between the two Python scripts?

The only difference is num_persistent_param_in_dit. Since stepvideo_text_to_video_low_vram.py enables offloading, it requires less GPU memory, which is why we labeled it as 24G.
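The single-argument difference can be shown as a one-line diff. The line text below is a sketch based on this thread, not the exact source of either script:

```python
import difflib

# Assumed content of the only differing line in each script (illustrative).
standard = "pipe.enable_vram_management(num_persistent_param_in_dit=None)"
low_vram = "pipe.enable_vram_management(num_persistent_param_in_dit=0)"

for line in difflib.unified_diff([standard], [low_vram], lineterm=""):
    print(line)
```

With None, no offloading limit is applied (hence the 80G label); with 0, every DiT parameter is offloaded (hence 24G).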

Artiprocher avatar Feb 25 '25 07:02 Artiprocher


Thanks for your answer. I have 24 GB of GPU memory, but when I run stepvideo_text_to_video_low_vram.py only 7 GB is used. What should I do to speed up inference? Generating 97 frames at 960x544 takes about 30 minutes, which is a bit too long, while 17 GB of GPU memory sits idle (as long as it doesn't run out of memory).

tiga-dudu avatar Feb 25 '25 09:02 tiga-dudu

Thanks for your answer! When I move from a 24 GB GPU to a 40 GB GPU, each generation step drops from about 32 seconds to 24 seconds, and the total time drops by 4 minutes. But I currently have 8 graphics cards, and I want to use the remaining 7 as well. How do I need to set that up?

tensorflowt avatar Feb 26 '25 08:02 tensorflowt

My gpu resource information is 8x4090x24GiB and 8x4090x40GiB

tensorflowt avatar Feb 26 '25 08:02 tensorflowt

@tiga-dudu Setting num_persistent_param_in_dit=9000000000 is the right way to speed things up. However, there is a limit to the acceleration: it seems that on your device you can no longer gain speed by increasing VRAM usage.
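For anyone who wants to turn their idle VRAM into a concrete parameter value, here is an unofficial heuristic (not a DiffSynth-Studio API; the 2-bytes-per-parameter and headroom figures are assumptions): convert spare GiB into a parameter count, keeping a reserve for activations, and pass the result to pipe.enable_vram_management(...).

```python
def pick_num_persistent(free_vram_gib: float, reserve_gib: float = 4.0,
                        bytes_per_param: int = 2) -> int:
    """Heuristic: convert spare VRAM into a num_persistent_param_in_dit value.

    Keeps reserve_gib of headroom for activations and assumes 16-bit weights.
    Tune the reserve downward/upward if you hit OOM or still see idle memory.
    """
    usable_gib = max(free_vram_gib - reserve_gib, 0.0)
    return int(usable_gib * 1024**3 / bytes_per_param)

# e.g. ~17 GiB idle on a 24 GiB card, keeping 4 GiB of headroom:
print(pick_num_persistent(17.0))  # → 6979321856 (about 7 billion parameters)
```

As noted above, whether this actually speeds up a given machine depends on its host-to-device bandwidth; on fast memory subsystems the gain may be negligible.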

Artiprocher avatar Feb 26 '25 08:02 Artiprocher

@tensorflowt We sincerely apologize, but we currently do not support multi-GPU parallel processing. This feature is still under development, so please stay tuned.

Artiprocher avatar Feb 26 '25 08:02 Artiprocher