Out-of-memory (OOM) error when generating a 10-second video with CogVideoX-1.5 on a machine with 80 GB of GPU memory
System Info
Package Version
------------------------ -----------
accelerate 1.1.1
aiofiles 23.2.1
aiohappyeyeballs 2.4.3
aiohttp 3.10.10
aiosignal 1.3.1
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.6.2.post1
async-timeout 4.0.3
attrs 24.2.0
beartype 0.19.0
boto3 1.35.57
botocore 1.35.57
braceexpand 0.1.7
certifi 2024.8.30
charset-normalizer 3.4.0
click 8.1.7
cpm-kernels 1.0.11
datasets 3.1.0
decorator 4.4.2
deepspeed 0.15.4
diffusers 0.31.0
dill 0.3.8
distro 1.9.0
docker-pycreds 0.4.0
einops 0.8.0
exceptiongroup 1.2.2
fastapi 0.115.4
ffmpy 0.4.0
filelock 3.16.1
frozenlist 1.5.0
fsspec 2024.9.0
gitdb 4.0.11
GitPython 3.1.43
gradio 5.5.0
gradio_client 1.4.2
h11 0.14.0
hjson 3.1.0
httpcore 1.0.6
httpx 0.27.2
huggingface-hub 0.26.2
idna 3.10
imageio 2.36.0
imageio-ffmpeg 0.5.1
importlib_metadata 8.5.0
Jinja2 3.1.4
jiter 0.7.0
jmespath 1.0.1
kornia 0.7.4
kornia_rs 0.1.7
lightning-utilities 0.11.8
markdown-it-py 3.0.0
MarkupSafe 2.1.5
mdurl 0.1.2
moviepy 1.0.3
mpmath 1.3.0
msgpack 1.1.0
multidict 6.1.0
multiprocess 0.70.16
networkx 3.4.2
ninja 1.11.1.1
numpy 1.26.0
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
omegaconf 2.3.0
openai 1.54.3
orjson 3.10.11
packaging 24.2
pandas 2.2.3
pillow 11.0.0
pip 24.2
platformdirs 4.3.6
proglog 0.1.10
propcache 0.2.0
protobuf 5.28.3
psutil 6.1.0
py-cpuinfo 9.0.0
pyarrow 18.0.0
pydantic 2.9.2
pydantic_core 2.23.4
pydub 0.25.1
Pygments 2.18.0
python-dateutil 2.9.0.post0
python-multipart 0.0.12
pytorch-lightning 2.4.0
pytz 2024.2
PyYAML 6.0.2
regex 2024.11.6
requests 2.32.3
rich 13.9.4
ruff 0.7.3
s3transfer 0.10.3
safehttpx 0.1.1
safetensors 0.4.5
scikit-video 1.1.11
scipy 1.14.1
semantic-version 2.10.0
sentencepiece 0.2.0
sentry-sdk 2.18.0
setproctitle 1.3.3
setuptools 75.1.0
shellingham 1.5.4
six 1.16.0
smmap 5.0.1
sniffio 1.3.1
starlette 0.41.2
SwissArmyTransformer 0.4.12
sympy 1.13.1
tensorboardX 2.6.2.2
tokenizers 0.20.3
tomlkit 0.12.0
torch 2.5.1
torchmetrics 1.5.2
torchvision 0.20.1
tqdm 4.67.0
transformers 4.46.2
triton 3.1.0
typer 0.13.0
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
uvicorn 0.32.0
wandb 0.18.6
webdataset 0.2.100
websockets 12.0
wheel 0.44.0
xxhash 3.5.0
yarl 1.17.1
zipp 3.21.0
Information
- [X] The official example scripts
- [X] My own modified scripts
Reproduction
Following the SAT workflow, I changed only the model path in sat/configs/cogvideox1.5_5b.yaml and sat/configs/inference.yaml, then ran inference for a 10-second video on a single 80 GB GPU.
# inference.yaml
args:
  # image2video: False # True for image2video, False for text2video
  latent_channels: 16
  mode: inference
  load: "MY_PATH" # This is for Full model without lora adapter
  batch_size: 1
  input_type: txt
  input_file: configs/test.txt
  sampling_image_size: [768, 1360] # remove this for I2V
  sampling_num_frames: 42 # 42 for 10 seconds and 22 for 5 seconds
  sampling_fps: 16
  bf16: True
  output_dir: cogvideox1.5
  force_inference: True
However, I got an OOM error at line 236 of sample_video_ori.py, in the VAE decoding step:
samples_x = model.decode_first_stage(samples_z).to(torch.float32)
The full OOM log is:
############################## Sampling setting ##############################
Sampler: VPSDEDPMPP2MSampler
Discretization: ZeroSNRDDPMDiscretization
Guider: DynamicCFG
Sampling with VPSDEDPMPP2MSampler for 51 steps: 98%|█████████▊| 50/51 [43:58<00:52, 52.76s/it]
/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/conv.py:720: UserWarning: cuDNN cannot be used for large non-batch-splittable convolutions if the V8 API is not enabled or before cuDNN version 9.3+. Consider upgrading cuDNN and/or enabling the V8 API for better efficiency. (Triggered internally at ../aten/src/ATen/native/Convolution.cpp:430.)
return F.conv3d(
0it [44:19, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/sample_video_ori.py", line 262, in <module>
[rank0]: sampling_main(args, model_cls=SATVideoDiffusionEngine)
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/sample_video_ori.py", line 236, in sampling_main
[rank0]: samples_x = model.decode_first_stage(samples_z).to(torch.float32)
[rank0]: File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/diffusion_video.py", line 198, in decode_first_stage
[rank0]: recon = self.first_stage_model.decode(
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/autoencoder.py", line 620, in decode
[rank0]: x = super().decode(z, use_cp=use_cp, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/autoencoder.py", line 214, in decode
[rank0]: x = self.decoder(z, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/cp_enc_dec.py", line 960, in forward
[rank0]: h = self.up[i_level].block[i_block](
[rank0]: File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/cp_enc_dec.py", line 681, in forward
[rank0]: h = self.conv1(h, clear_cache=clear_fake_cp_cache)
[rank0]: File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/cp_enc_dec.py", line 391, in forward
[rank0]: output_parallel = self.conv(input_parallel)
[rank0]: File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/utils.py", line 88, in forward
[rank0]: output = torch.cat(output_chunks, dim=2)
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.98 GiB. GPU 0 has a total capacity of 79.32 GiB of which 2.24 GiB is free. Including non-PyTorch memory, this process has 77.08 GiB memory in use. Of the allocated memory 69.85 GiB is allocated by PyTorch, and 5.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W1112 23:23:31.317344620 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
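To make the numbers in the OOM message above easier to read, here is a small arithmetic sketch. It only uses the figures printed in the log (all in GiB); the function name `oom_summary` and its field names are hypothetical, and the "could fit" check is an optimistic upper bound, since cached blocks may be fragmented and not usable for a single 4.98 GiB allocation.

```python
def oom_summary(free, requested, process_in_use, allocated, reserved_unallocated):
    """Summarize a PyTorch CUDA OOM message. All arguments are in GiB."""
    # How much more memory the failed allocation needed beyond what was free.
    shortfall = round(requested - free, 2)
    # Memory held by the process but not by the PyTorch allocator
    # (for example the CUDA context and library workspaces).
    non_pytorch = round(process_in_use - allocated - reserved_unallocated, 2)
    # Free memory plus PyTorch's cached-but-unused memory: an optimistic
    # upper bound on what a perfectly defragmented allocator could hand out.
    reclaimable = round(free + reserved_unallocated, 2)
    return {
        "shortfall_gib": shortfall,
        "non_pytorch_gib": non_pytorch,
        "reclaimable_gib": reclaimable,
        "could_fit_if_defragmented": requested <= reclaimable,
    }


# Values copied from the OOM message in the log above.
summary = oom_summary(
    free=2.24,
    requested=4.98,
    process_in_use=77.08,
    allocated=69.85,
    reserved_unallocated=5.44,
)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Under these numbers the request exceeds free memory by 2.74 GiB while 5.44 GiB sits in the allocator cache, which is presumably why the message suggests `expandable_segments:True`. The 69.85 GiB of live allocations during VAE decoding is still the dominant cost.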
Expected behavior
Identify the likely cause of the OOM so that 10-second inference completes successfully.
Also, generating one 10-second video takes about 40 minutes. Is this a reasonable duration in your experience? Looking forward to your reply!
I also tried adding PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in inference.sh, but got a different error:
############################## Sampling setting ##############################
Sampler: VPSDEDPMPP2MSampler
Discretization: ZeroSNRDDPMDiscretization
Guider: DynamicCFG
Sampling with VPSDEDPMPP2MSampler for 51 steps: 98%|█████████▊| 50/51 [44:23<00:53, 53.27s/it]
/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/conv.py:720: UserWarning: cuDNN cannot be used for large non-batch-splittable convolutions if the V8 API is not enabled or before cuDNN version 9.3+. Consider upgrading cuDNN and/or enabling the V8 API for better efficiency. (Triggered internally at ../aten/src/ATen/native/Convolution.cpp:430.)
return F.conv3d(
0it [44:42, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/sample_video_ori.py", line 262, in <module>
[rank0]: sampling_main(args, model_cls=SATVideoDiffusionEngine)
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/sample_video_ori.py", line 236, in sampling_main
[rank0]: samples_x = model.decode_first_stage(samples_z).to(torch.float32)
[rank0]: File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/diffusion_video.py", line 198, in decode_first_stage
[rank0]: recon = self.first_stage_model.decode(
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/autoencoder.py", line 620, in decode
[rank0]: x = super().decode(z, use_cp=use_cp, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/autoencoder.py", line 214, in decode
[rank0]: x = self.decoder(z, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/cp_enc_dec.py", line 960, in forward
[rank0]: h = self.up[i_level].block[i_block](
[rank0]: File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/cp_enc_dec.py", line 676, in forward
[rank0]: h = self.norm1(h, zq, clear_fake_cp_cache=clear_fake_cp_cache, fake_cp=fake_cp)
[rank0]: File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/cp_enc_dec.py", line 482, in forward
[rank0]: new_f = norm_f * self.conv_y(zq) + self.conv_b(zq)
[rank0]: RuntimeError: CUDA driver error: invalid argument
[rank0]:[W1113 00:22:35.424437131 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
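For anyone reproducing this: the allocator option has to be in the environment before PyTorch initializes CUDA. Besides exporting it in inference.sh, it can be set at the very top of the sampling script. This is a minimal sketch; the helper name `configure_allocator` is hypothetical, and as the log above shows, the option did not resolve the problem in this case.

```python
import os


def configure_allocator(conf="expandable_segments:True"):
    """Set PYTORCH_CUDA_ALLOC_CONF unless the caller already exported it.

    Must run before torch initializes CUDA, so call it before importing torch.
    """
    # setdefault keeps any value already exported by the launch script.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", conf)
    return os.environ["PYTORCH_CUDA_ALLOC_CONF"]


active_conf = configure_allocator()
print(f"PYTORCH_CUDA_ALLOC_CONF={active_conf}")

# import torch  # import torch only after the variable is set
```

Using `setdefault` means a value exported in the shell wins over the in-script default, so the two approaches do not conflict.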
The duration is normal: generating a 10-second video does take about 40 minutes. We will look into the OOM issue shortly and release a diffusers version with significantly reduced memory usage.
@zRzRzRzRzRzRzR @DZY-irene @zigchang Any update on this? I am still getting the OOM error.
We are aware of this issue, but I have been working on the diffusers version recently. You can check the latest PR; we are in the final stage, and the diffusers version needs as little as 9 GB of GPU memory.
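As a rough illustration of what the low-memory diffusers path could look like, here is a sketch built on the existing `CogVideoXPipeline` memory options (sequential CPU offload plus VAE slicing and tiling). Several things here are assumptions rather than details from this thread: the function name `generate_low_memory`, the model ID `THUDM/CogVideoX1.5-5B`, the default of 81 frames, and a diffusers release that already includes CogVideoX-1.5 support (newer than the 0.31.0 listed above). Whether this reaches the 9 GB figure depends on the final PR.

```python
def generate_low_memory(
    prompt,
    model_id="THUDM/CogVideoX1.5-5B",  # assumed Hub ID, not taken from this thread
    num_frames=81,                     # assumed default; longer clips need more frames
    out_path="output.mp4",
):
    """Generate a video with the memory-saving options enabled."""
    # Imported lazily so that defining this function needs neither a GPU
    # nor the libraries to be installed.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    # Keep weights on the CPU and move submodules to the GPU only while they
    # run. This is slow but lowers peak GPU memory substantially.
    pipe.enable_sequential_cpu_offload()

    # Decode the latent video in slices and spatial tiles instead of all at
    # once; VAE decoding is where the OOM in this issue occurred.
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    frames = pipe(
        prompt=prompt,
        num_frames=num_frames,
        num_inference_steps=50,
        guidance_scale=6.0,
    ).frames[0]

    export_to_video(frames, out_path, fps=16)
    return out_path
```

Calling `generate_low_memory("a panda playing guitar in a bamboo forest")` downloads the weights on first use and writes `output.mp4`. The offload and tiling options trade speed for memory, so the 40-minute generation time discussed above would likely not improve.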