MuseTalk

Realtime inference runs out of GPU memory on a 4090D

[Open] codestart-zhu opened this issue 8 months ago · 5 comments

```
(venv) nayota@dell-Precision-3660:~/source/MuseTalk$ sh inference.sh v1.5 realtime
please download ffmpeg-static and export to FFMPEG_PATH. For example: export FFMPEG_PATH=/musetalk/ffmpeg-4.4-amd64-static
Loads checkpoint by local backend from path: ./models/dwpose/dw-ll_ucoco_384.pth
cuda start
/home/nayota/source/MuseTalk/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py:125: UserWarning: Decorating classes is deprecated and will be disabled in future versions. You should only decorate functions or methods. To preserve the current behavior of class decoration, you can directly decorate the __init__ method and nothing else.
  warnings.warn("Decorating classes is deprecated and will be disabled in "
load unet model from ./models/musetalkV15/unet.pth
{'avator_1': {'preparation': True, 'bbox_shift': 5, 'video_path': 'data/video/yongen.mp4', 'audio_clips': {'audio_0': 'data/audio/yongen.wav'}}}
avator_1 exists, Do you want to re-create it ? (y/n)y

creating avator: avator_1

preparing data materials ... ...
extracting landmarks... reading images...
100%|██████████| 259/259 [00:02<00:00, 113.76it/s]
get key_landmark and face bounding boxes with the default value
100%|██████████| 259/259 [00:08<00:00, 29.02it/s]
bbox_shift parameter adjustment**************
Total frame:「259」 Manually adjust range : [ -21~23 ] , the current value: 0
100%|██████████| 518/518 [00:17<00:00, 29.40it/s]
Inferring using: data/audio/yongen.wav
start inference
2025-04-11 19:44:35.014889: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2025-04-11 19:44:35.032564: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-04-11 19:44:35.349526: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
processing audio:data/audio/yongen.wav costs 1122.328281402588ms
200
  0%|          | 0/8 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/nayota/source/MuseTalk/scripts/realtime_inference.py", line 387, in <module>
    avatar.inference(audio_path,
  File "/home/nayota/source/MuseTalk/scripts/realtime_inference.py", line 269, in inference
    recon = vae.decode_latents(pred_latents)
  File "/home/nayota/source/MuseTalk/musetalk/models/vae.py", line 103, in decode_latents
    image = self.vae.decode(latents.to(self.vae.dtype)).sample
  File "/home/nayota/source/MuseTalk/venv/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/home/nayota/source/MuseTalk/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 304, in decode
    decoded = self._decode(z).sample
  File "/home/nayota/source/MuseTalk/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 275, in _decode
    dec = self.decoder(z)
  File "/home/nayota/source/MuseTalk/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nayota/source/MuseTalk/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/vae.py", line 338, in forward
    sample = up_block(sample, latent_embeds)
  File "/home/nayota/source/MuseTalk/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nayota/source/MuseTalk/venv/lib/python3.10/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 2737, in forward
    hidden_states = resnet(hidden_states, temb=temb)
  File "/home/nayota/source/MuseTalk/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nayota/source/MuseTalk/venv/lib/python3.10/site-packages/diffusers/models/resnet.py", line 346, in forward
    hidden_states = self.conv1(hidden_states)
  File "/home/nayota/source/MuseTalk/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nayota/source/MuseTalk/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/nayota/source/MuseTalk/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 400.00 MiB (GPU 0; 23.64 GiB total capacity; 22.15 GiB already allocated; 302.38 MiB free; 22.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

Why does it run out of VRAM even with the official sample video? Is the frame count too high?
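As the error text itself suggests, one first-aid measure is to cap the allocator's split size via `PYTORCH_CUDA_ALLOC_CONF`, which can reduce fragmentation-related OOMs. The value below is only a starting point to tune, not a recommendation verified on this setup:

```shell
# Documented PyTorch CUDA allocator knob; 128 MiB is a starting value, tune as needed.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# then relaunch as before:
# sh inference.sh v1.5 realtime
```

Note that in this log reserved (22.88 GiB) is only slightly above allocated (22.15 GiB), so fragmentation is a minor factor here; reducing the batch size (see below) matters more.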

codestart-zhu · Apr 11 '25 11:04

@codestart-zhu You can try reducing the batch_size.

zzzweakman · Apr 11 '25 14:04
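Besides lowering the batch size, the traceback shows the OOM actually fires inside `vae.decode_latents`, so decoding the latents in small chunks also bounds peak VRAM regardless of the UNet batch. A minimal sketch; `decode_latents_chunked` is a hypothetical helper, not part of MuseTalk, and it assumes a diffusers-style `AutoencoderKL` whose `decode` returns an object with a `.sample` tensor:

```python
import torch

def decode_latents_chunked(vae, latents, chunk_size=2):
    """Decode a (N, C, H, W) latent batch a few frames at a time.

    Peak decoder memory now scales with chunk_size instead of N.
    Hypothetical helper for illustration, not MuseTalk's own API.
    """
    outputs = []
    with torch.no_grad():
        for i in range(0, latents.shape[0], chunk_size):
            chunk = latents[i:i + chunk_size].to(vae.dtype)
            outputs.append(vae.decode(chunk).sample)
    return torch.cat(outputs, dim=0)
```

Chunking trades a little throughput for a hard cap on the decoder's peak memory, which is useful when the UNet batch size alone does not bring usage under the card's limit.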

@zzzweakman Hi, which file do I need to change? Could you point me to it? Thanks.

codestart-zhu · Apr 11 '25 15:04

@zzzweakman Hi, I've reduced the batch_size; at 15, VRAM usage is now around 20 GB. One question: during realtime inference it seems only the frames are generated in real time, while the audio only comes out at the end?

[screenshot]

codestart-zhu · Apr 12 '25 02:04

> @zzzweakman Hi, I've reduced the batch_size; at 15, VRAM usage is now around 20 GB. One question: during realtime inference it seems only the frames are generated in real time, while the audio only comes out at the end?
>
> [screenshot]

Yes. The code includes a video-compositing step that muxes the audio together with the image sequence into the final video.

zzzweakman · Apr 12 '25 15:04
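The muxing step the reply describes is typically a single ffmpeg invocation over the finished frame directory plus the audio file, which is why the audio only appears once generation finishes. A hedged sketch of what such a command looks like; `mux`, the frame pattern, and the fps are illustrative assumptions, not MuseTalk's actual code:

```python
import subprocess  # noqa: F401  (caller uses subprocess.run on the returned list)

def mux(frames_dir, audio_path, out_path, fps=25):
    """Build an ffmpeg command joining an image sequence with an audio track.

    Hypothetical helper illustrating the post-hoc muxing step; the frame
    naming pattern and fps are assumptions for this example.
    """
    return [
        "ffmpeg", "-y",
        "-r", str(fps), "-i", f"{frames_dir}/%08d.png",  # frame sequence input
        "-i", audio_path,                                # audio input
        "-c:v", "libx264", "-pix_fmt", "yuv420p",        # widely compatible video
        "-c:a", "aac", "-shortest",                      # stop at the shorter stream
        out_path,
    ]

# usage: subprocess.run(mux("frames", "data/audio/yongen.wav", "out.mp4"), check=True)
```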

> @zzzweakman Hi, I've reduced the batch_size; at 15, VRAM usage is now around 20 GB. One question: during realtime inference it seems only the frames are generated in real time, while the audio only comes out at the end?
>
> [screenshot]

With the 4090D's compute, even a batch_size of 4 or 2 will meet your realtime inference requirements. In my tests, VRAM usage can be brought down to about 11 GB, so a 3080, 4080, or 5080 can run it as well. The bottleneck is not VRAM but GPU compute: GPU utilization is already maxed out, so running multiple instances cannot sustain realtime performance.

wanlichina · May 25 '25 06:05
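Figures like "about 11 GB at batch_size 2" are easy to check on your own machine with PyTorch's built-in memory counters. A small sketch; `report_gpu` is an illustrative helper, and it only reports real numbers on a CUDA machine:

```python
import torch

def report_gpu(tag=""):
    """Print peak allocated VRAM since the process (or last reset) started.

    Call torch.cuda.reset_peak_memory_stats() before the section you want
    to measure, then call this after it. Illustrative helper.
    """
    if torch.cuda.is_available():
        peak = torch.cuda.max_memory_allocated() / 2**30
        print(f"{tag} peak allocated: {peak:.2f} GiB")
    else:
        print(f"{tag} no CUDA device available")
```

Comparing the peak across batch_size values makes it clear when the limiting factor has shifted from VRAM to raw GPU utilization, as the comment above describes.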