StepVideo 8x4090 (24 GiB) inference OOM!
At present, I am running inference with DiffSynth-Studio/examples/stepvideo/stepvideo_text_to_video_low_vram.py on 8 x 4090 (24 GiB each). According to the README, only 24 GB of VRAM should be needed, but in actual testing more is required. I modified stepvideo_text_to_video_low_vram.py to support multi-GPU loading for inference, but it still fails. The modified code is below. I hope to get some help, thanks!
import os
import torch
import torch.distributed as dist
from modelscope import snapshot_download
from diffsynth import ModelManager, StepVideoPipeline, save_video
# Initialize the distributed environment
def init_distributed_mode():
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    return local_rank
# Download models
# snapshot_download(model_id="stepfun-ai/stepvideo-t2v", cache_dir="models")
# Load the compiled attention for the LLM text encoder.
# If you encounter errors here, please select another compiled file that matches your environment, or delete this line.
torch.ops.load_library("/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/lib/liboptimus_ths-torch2.5-cu124.cpython-310-x86_64-linux-gnu.so")
# Initialize distributed mode
local_rank = init_distributed_mode()
# Load models
model_manager = ModelManager()
model_manager.load_models(
["/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/hunyuan_clip/clip_text_encoder/pytorch_model.bin"],
torch_dtype=torch.float32, device=f"cuda:{local_rank}"
)
model_manager.load_models(
[
"/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/step_llm",
[
"/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/transformer/diffusion_pytorch_model-00001-of-00006.safetensors",
"/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/transformer/diffusion_pytorch_model-00002-of-00006.safetensors",
"/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/transformer/diffusion_pytorch_model-00003-of-00006.safetensors",
"/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/transformer/diffusion_pytorch_model-00004-of-00006.safetensors",
"/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/transformer/diffusion_pytorch_model-00005-of-00006.safetensors",
"/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/transformer/diffusion_pytorch_model-00006-of-00006.safetensors",
]
],
torch_dtype=torch.bfloat16, device=f"cuda:{local_rank}"
)
model_manager.load_models(
["/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/vae/vae_v2.safetensors"],
torch_dtype=torch.bfloat16, device=f"cuda:{local_rank}"
)
# Create the pipeline
pipe = StepVideoPipeline.from_model_manager(model_manager, torch_dtype=torch.bfloat16, device=f"cuda:{local_rank}")
# Enable VRAM management
pipe.enable_vram_management(num_persistent_param_in_dit=0)
# Run!
if local_rank == 0:
    video = pipe(
        # Prompt (Chinese): "An astronaut discovers a stone tablet on the moon, imprinted with the
        # glowing word 'stepfun'. Ultra-HD, HDR video, ambient light, Dolby Atmos, stable footage,
        # smooth motion, realistic details, professional composition, surrealism, natural, vivid,
        # ultra-detailed, sharp."
        prompt="一名宇航员在月球上发现一块石碑,上面印有“stepfun”字样,闪闪发光。超高清、HDR 视频、环境光、杜比全景声、画面稳定、流畅动作、逼真的细节、专业级构图、超现实主义、自然、生动、超细节、清晰。",
        # Negative prompt (Chinese): "Dark frame, low resolution, bad hands, text, missing fingers,
        # extra fingers, cropped, low quality, grainy, signature, watermark, username, blurry."
        negative_prompt="画面暗、低分辨率、不良手、文本、缺少手指、多余的手指、裁剪、低质量、颗粒状、签名、水印、用户名、模糊。",
        num_inference_steps=30, cfg_scale=9, num_frames=51, seed=1,
        tiled=True, tile_size=(34, 34), tile_stride=(16, 16)
    )
    save_video(
        video, "video.mp4", fps=25, quality=5,
        ffmpeg_params=["-vf", "atadenoise=0a=0.1:0b=0.1:1a=0.1:1b=0.1"]
    )
# Clean up
dist.destroy_process_group()
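For reproducibility: the script reads LOCAL_RANK, so I assume it is started with a torchrun-style launcher (e.g. torchrun --nproc_per_node=8 stepvideo_text_to_video_low_vram.py), which sets that variable for every process. A small guard near the top of the script would make that assumption explicit:

# Assumption: the script is started via torchrun (or an equivalent launcher) that sets
# LOCAL_RANK; otherwise init_distributed_mode() above raises a KeyError.
if "LOCAL_RANK" not in os.environ:
    raise RuntimeError("LOCAL_RANK is not set; launch this script with torchrun --nproc_per_node=8.")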
Error info:
The current problem is that even when the diffusion model is loaded across multiple GPUs, it still OOMs:
24G VRAM is not enough to store the whole model. Please set device="cpu" when you are using model_manager, and set device="cuda:xxx" when you are creating the pipeline. So that our program can offload each layer.
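For reference, the change that message asks for looks roughly like this (a sketch adapted from my script above, untested): every model_manager.load_models call gets device="cpu", and only the pipeline is bound to the local GPU so DiffSynth can offload layer by layer.

# Load all weights onto the CPU; the pipeline streams layers to the GPU on demand.
model_manager = ModelManager()
model_manager.load_models(
    ["/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/hunyuan_clip/clip_text_encoder/pytorch_model.bin"],
    torch_dtype=torch.float32, device="cpu"
)
transformer_shards = [
    f"/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/transformer/diffusion_pytorch_model-0000{i}-of-00006.safetensors"
    for i in range(1, 7)
]
model_manager.load_models(
    ["/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/step_llm", transformer_shards],
    torch_dtype=torch.bfloat16, device="cpu"
)
model_manager.load_models(
    ["/data/code/DiffSynth-Studio/models/stepfun-ai/stepvideo-t2v/vae/vae_v2.safetensors"],
    torch_dtype=torch.bfloat16, device="cpu"
)
# Only the pipeline itself targets the local GPU; with num_persistent_param_in_dit=0,
# DiT weights are moved to the GPU one layer at a time.
pipe = StepVideoPipeline.from_model_manager(model_manager, torch_dtype=torch.bfloat16, device=f"cuda:{local_rank}")
pipe.enable_vram_management(num_persistent_param_in_dit=0)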
I currently have a machine with 8 cards (4090, 40 GiB). How can I configure it to achieve the fastest inference?
@tensorflowt We sincerely apologize, but we currently do not support multi-GPU parallel processing. This feature is still under development, so please stay tuned.
I am currently running 8 processes on 8 GPUs to generate 8 groups of tasks, and it works; speed is not affected, but CPU memory usage has grown many times over. My understanding is that this part could be shared across processes. Is it possible to load the shared CPU-resident weights only once, while each process loads its own GPU part separately?
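As far as I know, DiffSynth-Studio does not expose this today, but the pattern I am asking about can be sketched with torch.multiprocessing: load the weights once on the CPU, call share_memory() so all worker processes map the same storage, and let each worker copy layers to its own GPU on demand. The nn.Sequential below is only a stand-in for the real StepVideo weights (hypothetical; wiring an already-shared model into ModelManager would still need changes on the library side):

import copy

import torch
import torch.multiprocessing as mp
import torch.nn as nn


def worker(rank: int, shared_model: nn.Sequential):
    # Every process maps the same shared-memory storage for the CPU weights,
    # so host RAM stays roughly constant no matter how many workers run.
    torch.cuda.set_device(rank)
    x = torch.randn(2, 1024, device=f"cuda:{rank}")
    # Offload-style execution: copy one layer at a time to this rank's GPU,
    # run it, then free it; the shared CPU master copy is never duplicated.
    for layer in shared_model:
        layer_gpu = copy.deepcopy(layer).to(f"cuda:{rank}")
        x = layer_gpu(x)
        del layer_gpu
    print(f"rank {rank}: output shape {tuple(x.shape)}")


if __name__ == "__main__":
    # Stand-in for the StepVideo weights loaded once on the CPU.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.Linear(4096, 1024))
    model.share_memory()  # move parameters and buffers into shared memory
    mp.spawn(worker, args=(model,), nprocs=torch.cuda.device_count())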