
HunyuanVideoImageToVideoPipeline memory leak

Open openSourcerer9000 opened this issue 5 months ago • 6 comments

Describe the bug

RuntimeError: Invalid buffer size: 635.13 GB on Apple Silicon with mps, when the readme claims 60 GB. Smaller resolutions are possible, but they still use roughly 10x the memory claimed. What's going on here? Is it leaking memory?

Reproducible from boilerplate example code:

import torch

from diffusers import HunyuanVideoImageToVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import load_image, export_to_video

# Available checkpoints: "hunyuanvideo-community/HunyuanVideo-I2V" and "hunyuanvideo-community/HunyuanVideo-I2V-33ch"
model_id = "hunyuanvideo-community/HunyuanVideo-I2V"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
pipe.vae.enable_tiling()
pipe.to("mps")

prompt = "A man with short gray hair plays a red electric guitar."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/guitar-man.png"
)

output = pipe(image=image, prompt=prompt).frames[0]
export_to_video(output, "output.mp4", fps=15)
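
The smaller-resolution runs I mentioned look roughly like this (a sketch; the exact height/width/frame values below are illustrative, not the precise ones I used):

# Illustrative reduced-settings call -- values chosen to shrink the token count,
# not the exact numbers from my runs.
output = pipe(
    image=image,
    prompt=prompt,
    height=320,
    width=512,
    num_frames=33,       # (num_frames - 1) should be divisible by 4
    num_inference_steps=20,
).frames[0]
export_to_video(output, "output_small.mp4", fps=15)

Even at sizes like this, memory use is still far above what the readme suggests.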

macOS Sequoia 15.5

Reproduction

see above

Logs


System Info

diffusers 0.34.0.dev0, Python 3.11, macOS Sequoia 15.5

Who can help?

No response

openSourcerer9000 avatar Jun 07 '25 00:06 openSourcerer9000

Hi @openSourcerer9000 unfortunately MPS sometimes lacks efficient kernels for certain operations, which could be the reason for the memory spike. Does the stack trace mention where in the inference process it OOMs?

DN6 avatar Jun 10 '25 17:06 DN6

No, just that it wants 600 GB of memory. Do y'all have more info on memory benchmarks? The official repo mentions 60 GB but not which optimizations they may be using. With a multi-model system it's hard to gauge how much memory would actually be expected.

I also don't see a difference when loading a Q4 version of the transformer! I load it first from a file and then pass it to the pipeline.
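
Roughly like this (a sketch; the gguf path is a placeholder and the quantization config is an assumption, not my exact code):

import torch
from diffusers import (
    GGUFQuantizationConfig,
    HunyuanVideoImageToVideoPipeline,
    HunyuanVideoTransformer3DModel,
)

model_id = "hunyuanvideo-community/HunyuanVideo-I2V"

# Placeholder path -- point this at whatever Q4 gguf of the I2V transformer is on disk.
transformer = HunyuanVideoTransformer3DModel.from_single_file(
    "path/to/hunyuan-video-i2v-Q4_K_M.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
pipe.vae.enable_tiling()
pipe.to("mps")

If the spike is coming from activations rather than weights, I suppose a quantized transformer wouldn't change the headline number much.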

openSourcerer9000 avatar Jun 10 '25 19:06 openSourcerer9000

hi @openSourcerer9000 we don't run memory benchmarks on mps, and this is likely specific to mps. Sharing a complete stack trace would help the community chime in with solutions.

yiyixuxu avatar Jun 10 '25 22:06 yiyixuxu

Any memory benchmarks would be good (cuda or mps). The only published numbers are from the official hunyuan repo, and these arxiv write-ups are never peer reviewed anyway, so like any marketing they should be taken with a grain of salt.

Ran it again, there is actually a traceback:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[16], line 27

---> 27 output = pipe(image=img, 
     28               prompt=prompt,
     29               prompt_2=prompt,
     30               num_inference_steps=20,
     31               width=1280,
     32               height=720,
     33               num_frames=129,
     34             # #   guidance_scale=7.5,
     35             #   negative_prompt="low quality, worst quality, low resolution, blurry, deformed, distorted, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, out of frame",
     36               ).frames[0]

File /opt/anaconda3/envs/streem/lib/python3.11/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File /opt/anaconda3/envs/streem/lib/python3.11/site-packages/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video_image2video.py:914, in HunyuanVideoImageToVideoPipeline.__call__(self, image, prompt, prompt_2, negative_prompt, negative_prompt_2, height, width, num_frames, num_inference_steps, sigmas, true_cfg_scale, guidance_scale, num_videos_per_prompt, generator, latents, prompt_embeds, pooled_prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_pooled_prompt_embeds, negative_prompt_attention_mask, output_type, return_dict, attention_kwargs, callback_on_step_end, callback_on_step_end_tensor_inputs, prompt_template, max_sequence_length, image_embed_interleave)
    911 elif image_condition_type == "token_replace":
    912     latent_model_input = torch.cat([image_latents, latents[:, :, 1:]], dim=2).to(transformer_dtype)
--> 914 noise_pred = self.transformer(
    915     hidden_states=latent_model_input,
    916     timestep=timestep,
    917     encoder_hidden_states=prompt_embeds,
    918     encoder_attention_mask=prompt_attention_mask,
    919     pooled_projections=pooled_prompt_embeds,
    920     guidance=guidance,
    921     attention_kwargs=attention_kwargs,
    922     return_dict=False,
    923 )[0]
    925 if do_true_cfg:
    926     neg_noise_pred = self.transformer(
    927         hidden_states=latent_model_input,
    928         timestep=timestep,
   (...)    934         return_dict=False,
    935     )[0]

File /opt/anaconda3/envs/streem/lib/python3.11/site-packages/torch/nn/modules/module.py:1751, in Module._wrapped_call_impl(self, *args, **kwargs)
   1749     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1750 else:
-> 1751     return self._call_impl(*args, **kwargs)

File /opt/anaconda3/envs/streem/lib/python3.11/site-packages/torch/nn/modules/module.py:1762, in Module._call_impl(self, *args, **kwargs)
   1757 # If we don't have any hooks, we want to skip the rest of the logic in
   1758 # this function, and just call forward.
   1759 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1760         or _global_backward_pre_hooks or _global_backward_hooks
   1761         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1762     return forward_call(*args, **kwargs)
   1764 result = None
   1765 called_always_called_hooks = set()

File /opt/anaconda3/envs/streem/lib/python3.11/site-packages/diffusers/models/transformers/transformer_hunyuan_video.py:1109, in HunyuanVideoTransformer3DModel.forward(self, hidden_states, timestep, encoder_hidden_states, encoder_attention_mask, pooled_projections, guidance, attention_kwargs, return_dict)
   1107 else:
   1108     for block in self.transformer_blocks:
-> 1109         hidden_states, encoder_hidden_states = block(
   1110             hidden_states,
   1111             encoder_hidden_states,
   1112             temb,
   1113             attention_mask,
   1114             image_rotary_emb,
   1115             token_replace_emb,
   1116             first_frame_num_tokens,
   1117         )
   1119     for block in self.single_transformer_blocks:
   1120         hidden_states, encoder_hidden_states = block(
   1121             hidden_states,
   1122             encoder_hidden_states,
   (...)   1127             first_frame_num_tokens,
   1128         )

File /opt/anaconda3/envs/streem/lib/python3.11/site-packages/torch/nn/modules/module.py:1751, in Module._wrapped_call_impl(self, *args, **kwargs)
   1749     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1750 else:
-> 1751     return self._call_impl(*args, **kwargs)

File /opt/anaconda3/envs/streem/lib/python3.11/site-packages/torch/nn/modules/module.py:1762, in Module._call_impl(self, *args, **kwargs)
   1757 # If we don't have any hooks, we want to skip the rest of the logic in
   1758 # this function, and just call forward.
   1759 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1760         or _global_backward_pre_hooks or _global_backward_hooks
   1761         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1762     return forward_call(*args, **kwargs)
   1764 result = None
   1765 called_always_called_hooks = set()

File /opt/anaconda3/envs/streem/lib/python3.11/site-packages/diffusers/models/transformers/transformer_hunyuan_video.py:789, in HunyuanVideoTokenReplaceTransformerBlock.forward(self, hidden_states, encoder_hidden_states, temb, attention_mask, freqs_cis, token_replace_emb, num_tokens)
    784 norm_encoder_hidden_states, c_gate_msa, c_shift_mlp, c_scale_mlp, c_gate_mlp = self.norm1_context(
    785     encoder_hidden_states, emb=temb
    786 )
    788 # 2. Joint attention
--> 789 attn_output, context_attn_output = self.attn(
    790     hidden_states=norm_hidden_states,
    791     encoder_hidden_states=norm_encoder_hidden_states,
    792     attention_mask=attention_mask,
    793     image_rotary_emb=freqs_cis,
    794 )
    796 # 3. Modulation and residual connection
    797 hidden_states_zero = hidden_states[:, :num_tokens] + attn_output[:, :num_tokens] * tr_gate_msa.unsqueeze(1)

File /opt/anaconda3/envs/streem/lib/python3.11/site-packages/torch/nn/modules/module.py:1751, in Module._wrapped_call_impl(self, *args, **kwargs)
   1749     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1750 else:
-> 1751     return self._call_impl(*args, **kwargs)

File /opt/anaconda3/envs/streem/lib/python3.11/site-packages/torch/nn/modules/module.py:1762, in Module._call_impl(self, *args, **kwargs)
   1757 # If we don't have any hooks, we want to skip the rest of the logic in
   1758 # this function, and just call forward.
   1759 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1760         or _global_backward_pre_hooks or _global_backward_hooks
   1761         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1762     return forward_call(*args, **kwargs)
   1764 result = None
   1765 called_always_called_hooks = set()

File /opt/anaconda3/envs/streem/lib/python3.11/site-packages/diffusers/models/attention_processor.py:605, in Attention.forward(self, hidden_states, encoder_hidden_states, attention_mask, **cross_attention_kwargs)
    600     logger.warning(
    601         f"cross_attention_kwargs {unused_kwargs} are not expected by {self.processor.__class__.__name__} and will be ignored."
    602     )
    603 cross_attention_kwargs = {k: w for k, w in cross_attention_kwargs.items() if k in attn_parameters}
--> 605 return self.processor(
    606     self,
    607     hidden_states,
    608     encoder_hidden_states=encoder_hidden_states,
    609     attention_mask=attention_mask,
    610     **cross_attention_kwargs,
    611 )

File /opt/anaconda3/envs/streem/lib/python3.11/site-packages/diffusers/models/transformers/transformer_hunyuan_video.py:120, in HunyuanVideoAttnProcessor2_0.__call__(self, attn, hidden_states, encoder_hidden_states, attention_mask, image_rotary_emb)
    117     value = torch.cat([value, encoder_value], dim=2)
    119 # 5. Attention
--> 120 hidden_states = F.scaled_dot_product_attention(
    121     query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False
    122 )
    123 hidden_states = hidden_states.transpose(1, 2).flatten(2, 3)
    124 hidden_states = hidden_states.to(query.dtype)

RuntimeError: Invalid buffer size: 635.13 GB
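
Back-of-envelope guess at what that buffer is: if the mps scaled_dot_product_attention path materializes the full attention score matrix, the size works out to roughly this number (the 24-head count, 256 text tokens, and 2-byte scores below are my assumptions from the default config, not anything I measured):

# Rough sketch of the attention score matrix size at my settings (720x1280, 129 frames).
frames, height, width = 129, 720, 1280
latent_frames = (frames - 1) // 4 + 1            # 4x temporal VAE compression -> 33
latent_h, latent_w = height // 8, width // 8     # 8x spatial VAE compression -> 90 x 160
video_tokens = latent_frames * (latent_h // 2) * (latent_w // 2)  # 2x2 patchify -> 118800
text_tokens = 256                                # assumed max_sequence_length
seq_len = video_tokens + text_tokens
num_heads, bytes_per_score = 24, 2               # assumed default heads, bf16 scores
scores_gb = num_heads * seq_len * seq_len * bytes_per_score / 1024**3
print(f"{scores_gb:.1f} GB")                     # ~633.6 GB, close to the 635.13 GB in the error

If that's roughly right, it's not a leak so much as the full attention matrix for ~119k tokens being allocated in one go, i.e. the mps backend isn't using a memory-efficient attention kernel here.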

openSourcerer9000 avatar Jun 11 '25 02:06 openSourcerer9000

ok this is a known issue with mps, same as https://github.com/huggingface/diffusers/issues/9972

yiyixuxu avatar Jun 11 '25 21:06 yiyixuxu

Hm, over in MLX it already takes extra care just to generate random tensors, so it's possible MPS also requires extra boilerplate to manage memory correctly: https://github.com/ml-explore/mlx/issues/2254#event-18086129505
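
The kind of extra MPS memory housekeeping I have in mind would be something like this (an untested sketch, not a confirmed fix for this issue):

import torch

def report_and_flush_mps():
    # Untested sketch: manual mps memory housekeeping between pipeline calls.
    print(f"mps allocated: {torch.mps.current_allocated_memory() / 1024**3:.2f} GB")
    torch.mps.synchronize()   # wait for pending kernels to finish
    torch.mps.empty_cache()   # release cached blocks back to the OS

report_and_flush_mps()

Though I doubt that would shrink a single 600+ GB allocation inside attention; it would only help between runs.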

openSourcerer9000 avatar Jun 11 '25 22:06 openSourcerer9000