StepVideo 8x4090 (24 GiB) inference OOM!
At present, I am running inference with DiffSynth-Studio/examples/stepvideo/stepvideo_text_to_video_low_vram.py on 8 x 4090 (24 GiB each). According to the README, only 24 GB of VRAM should be needed, but in actual testing more is required. I modified stepvideo_text_to_video_low_vram.py to support multi-GPU loading for inference, but it still fails. The modified code is below. I hope to get some help, thanks!
import os
import torch
import torch.distributed as dist
from modelscope import snapshot_download
from diffsynth import ModelManager, StepVideoPipeline, save_video
# Initialize the distributed environment
def init_distributed_mode():
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    return local_rank
# Download models
# snapshot_download(model_id="stepfun-ai/stepvideo-t2v", cache_dir="models")
# Load the compiled attention for the LLM text encoder.
# If you encounter errors here, please select another compiled file that matches your environment, or delete this line.
torch.ops.load_library("/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/lib/liboptimus_ths-torch2.5-cu124.cpython-310-x86_64-linux-gnu.so")
# Initialize distributed mode
local_rank = init_distributed_mode()
# Load models
model_manager = ModelManager()
model_manager.load_models(
["/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/hunyuan_clip/clip_text_encoder/pytorch_model.bin"],
torch_dtype=torch.float32, device=f"cuda:{local_rank}"
)
model_manager.load_models(
[
"/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/step_llm",
[
"/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/transformer/diffusion_pytorch_model-00001-of-00006.safetensors",
"/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/transformer/diffusion_pytorch_model-00002-of-00006.safetensors",
"/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/transformer/diffusion_pytorch_model-00003-of-00006.safetensors",
"/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/transformer/diffusion_pytorch_model-00004-of-00006.safetensors",
"/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/transformer/diffusion_pytorch_model-00005-of-00006.safetensors",
"/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/transformer/diffusion_pytorch_model-00006-of-00006.safetensors",
]
],
torch_dtype=torch.bfloat16, device=f"cuda:{local_rank}"
)
model_manager.load_models(
["/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/vae/vae_v2.safetensors"],
torch_dtype=torch.bfloat16, device=f"cuda:{local_rank}"
)
# Create the pipeline
pipe = StepVideoPipeline.from_model_manager(model_manager, torch_dtype=torch.bfloat16, device=f"cuda:{local_rank}")
# Enable VRAM management
pipe.enable_vram_management(num_persistent_param_in_dit=0)
# Run!
if local_rank == 0:
    video = pipe(
        # Prompt (Chinese): "An astronaut discovers a stone tablet on the moon, imprinted with the
        # glowing word 'stepfun'. Ultra-HD, HDR video, ambient light, Dolby Atmos, stable footage,
        # smooth motion, realistic details, professional composition, surrealism, natural, vivid,
        # ultra-detailed, sharp."
        prompt="一名宇航员在月球上发现一块石碑,上面印有“stepfun”字样,闪闪发光。超高清、HDR 视频、环境光、杜比全景声、画面稳定、流畅动作、逼真的细节、专业级构图、超现实主义、自然、生动、超细节、清晰。",
        # Negative prompt (Chinese): "Dark frame, low resolution, bad hands, text, missing fingers,
        # extra fingers, cropped, low quality, grainy, signature, watermark, username, blurry."
        negative_prompt="画面暗、低分辨率、不良手、文本、缺少手指、多余的手指、裁剪、低质量、颗粒状、签名、水印、用户名、模糊。",
        num_inference_steps=30, cfg_scale=9, num_frames=51, seed=1,
        tiled=True, tile_size=(34, 34), tile_stride=(16, 16)
    )
    save_video(
        video, "video.mp4", fps=25, quality=5,
        ffmpeg_params=["-vf", "atadenoise=0a=0.1:0b=0.1:1a=0.1:1b=0.1"]
    )
# Clean up
dist.destroy_process_group()
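For reproducibility: the script reads LOCAL_RANK, so I assume it is started with a torchrun-style launcher (e.g. torchrun --nproc_per_node=8 stepvideo_text_to_video_low_vram.py), which sets that variable for every process. A small guard near the top of the script would make that assumption explicit:

# Assumption: the script is started via torchrun (or an equivalent launcher) that sets
# LOCAL_RANK; otherwise init_distributed_mode() above raises a KeyError.
if "LOCAL_RANK" not in os.environ:
    raise RuntimeError("LOCAL_RANK is not set; launch this script with torchrun --nproc_per_node=8.")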
Error info:
The current problem is that even when the diffusion model is loaded across multiple GPUs, it still OOMs:
24G VRAM is not enough to store the whole model. Please set device="cpu" when you are using model_manager, and set device="cuda:xxx" when you are creating the pipeline. So that our program can offload each layer.
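For reference, the change that message asks for looks roughly like this (a sketch adapted from my script above, untested): every model_manager.load_models call gets device="cpu", and only the pipeline is bound to the local GPU so DiffSynth can offload layer by layer.

# Load all weights onto the CPU; the pipeline streams layers to the GPU on demand.
model_manager = ModelManager()
model_manager.load_models(
    ["/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/hunyuan_clip/clip_text_encoder/pytorch_model.bin"],
    torch_dtype=torch.float32, device="cpu"
)
transformer_shards = [
    f"/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/transformer/diffusion_pytorch_model-0000{i}-of-00006.safetensors"
    for i in range(1, 7)
]
model_manager.load_models(
    ["/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/step_llm", transformer_shards],
    torch_dtype=torch.bfloat16, device="cpu"
)
model_manager.load_models(
    ["/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/vae/vae_v2.safetensors"],
    torch_dtype=torch.bfloat16, device="cpu"
)
# Only the pipeline itself targets the local GPU; with num_persistent_param_in_dit=0,
# DiT weights are moved to the GPU one layer at a time.
pipe = StepVideoPipeline.from_model_manager(model_manager, torch_dtype=torch.bfloat16, device=f"cuda:{local_rank}")
pipe.enable_vram_management(num_persistent_param_in_dit=0)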
I currently have a machine with 8 cards (4090, 40 GiB). How can I configure it to achieve the fastest inference?
@tensorflowt We sincerely apologize, but we currently do not support multi-GPU parallel processing. This feature is still under development, so please stay tuned.
I am currently running 8 processes on 8 GPUs to generate 8 groups of tasks, and it works; speed is not affected, but CPU memory usage has grown many times over. My understanding is that this part could be shared across processes. Is it possible to load the shared CPU-resident weights only once, while each process loads its own GPU part separately?
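As far as I know, DiffSynth-Studio does not expose this today, but the pattern I am asking about can be sketched with torch.multiprocessing: load the weights once on the CPU, call share_memory() so all worker processes map the same storage, and let each worker copy layers to its own GPU on demand. The nn.Sequential below is only a stand-in for the real StepVideo weights (hypothetical; wiring an already-shared model into ModelManager would still need changes on the library side):

import copy

import torch
import torch.multiprocessing as mp
import torch.nn as nn


def worker(rank: int, shared_model: nn.Sequential):
    # Every process maps the same shared-memory storage for the CPU weights,
    # so host RAM stays roughly constant no matter how many workers run.
    torch.cuda.set_device(rank)
    x = torch.randn(2, 1024, device=f"cuda:{rank}")
    # Offload-style execution: copy one layer at a time to this rank's GPU,
    # run it, then free it; the shared CPU master copy is never duplicated.
    for layer in shared_model:
        layer_gpu = copy.deepcopy(layer).to(f"cuda:{rank}")
        x = layer_gpu(x)
        del layer_gpu
    print(f"rank {rank}: output shape {tuple(x.shape)}")


if __name__ == "__main__":
    # Stand-in for the StepVideo weights loaded once on the CPU.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.Linear(4096, 1024))
    model.share_memory()  # move parameters and buffers into shared memory
    mp.spawn(worker, args=(model,), nprocs=torch.cuda.device_count())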