CogVideo: OOM while generating a 10-second video with CogVideoX-1.5 on a machine with 80 GB of GPU memory

System Info

Package                  Version
------------------------ -----------
accelerate               1.1.1
aiofiles                 23.2.1
aiohappyeyeballs         2.4.3
aiohttp                  3.10.10
aiosignal                1.3.1
annotated-types          0.7.0
antlr4-python3-runtime   4.9.3
anyio                    4.6.2.post1
async-timeout            4.0.3
attrs                    24.2.0
beartype                 0.19.0
boto3                    1.35.57
botocore                 1.35.57
braceexpand              0.1.7
certifi                  2024.8.30
charset-normalizer       3.4.0
click                    8.1.7
cpm-kernels              1.0.11
datasets                 3.1.0
decorator                4.4.2
deepspeed                0.15.4
diffusers                0.31.0
dill                     0.3.8
distro                   1.9.0
docker-pycreds           0.4.0
einops                   0.8.0
exceptiongroup           1.2.2
fastapi                  0.115.4
ffmpy                    0.4.0
filelock                 3.16.1
frozenlist               1.5.0
fsspec                   2024.9.0
gitdb                    4.0.11
GitPython                3.1.43
gradio                   5.5.0
gradio_client            1.4.2
h11                      0.14.0
hjson                    3.1.0
httpcore                 1.0.6
httpx                    0.27.2
huggingface-hub          0.26.2
idna                     3.10
imageio                  2.36.0
imageio-ffmpeg           0.5.1
importlib_metadata       8.5.0
Jinja2                   3.1.4
jiter                    0.7.0
jmespath                 1.0.1
kornia                   0.7.4
kornia_rs                0.1.7
lightning-utilities      0.11.8
markdown-it-py           3.0.0
MarkupSafe               2.1.5
mdurl                    0.1.2
moviepy                  1.0.3
mpmath                   1.3.0
msgpack                  1.1.0
multidict                6.1.0
multiprocess             0.70.16
networkx                 3.4.2
ninja                    1.11.1.1
numpy                    1.26.0
nvidia-cublas-cu12       12.4.5.8
nvidia-cuda-cupti-cu12   12.4.127
nvidia-cuda-nvrtc-cu12   12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12        9.1.0.70
nvidia-cufft-cu12        11.2.1.3
nvidia-curand-cu12       10.3.5.147
nvidia-cusolver-cu12     11.6.1.9
nvidia-cusparse-cu12     12.3.1.170
nvidia-nccl-cu12         2.21.5
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.4.127
omegaconf                2.3.0
openai                   1.54.3
orjson                   3.10.11
packaging                24.2
pandas                   2.2.3
pillow                   11.0.0
pip                      24.2
platformdirs             4.3.6
proglog                  0.1.10
propcache                0.2.0
protobuf                 5.28.3
psutil                   6.1.0
py-cpuinfo               9.0.0
pyarrow                  18.0.0
pydantic                 2.9.2
pydantic_core            2.23.4
pydub                    0.25.1
Pygments                 2.18.0
python-dateutil          2.9.0.post0
python-multipart         0.0.12
pytorch-lightning        2.4.0
pytz                     2024.2
PyYAML                   6.0.2
regex                    2024.11.6
requests                 2.32.3
rich                     13.9.4
ruff                     0.7.3
s3transfer               0.10.3
safehttpx                0.1.1
safetensors              0.4.5
scikit-video             1.1.11
scipy                    1.14.1
semantic-version         2.10.0
sentencepiece            0.2.0
sentry-sdk               2.18.0
setproctitle             1.3.3
setuptools               75.1.0
shellingham              1.5.4
six                      1.16.0
smmap                    5.0.1
sniffio                  1.3.1
starlette                0.41.2
SwissArmyTransformer     0.4.12
sympy                    1.13.1
tensorboardX             2.6.2.2
tokenizers               0.20.3
tomlkit                  0.12.0
torch                    2.5.1
torchmetrics             1.5.2
torchvision              0.20.1
tqdm                     4.67.0
transformers             4.46.2
triton                   3.1.0
typer                    0.13.0
typing_extensions        4.12.2
tzdata                   2024.2
urllib3                  2.2.3
uvicorn                  0.32.0
wandb                    0.18.6
webdataset               0.2.100
websockets               12.0
wheel                    0.44.0
xxhash                   3.5.0
yarl                     1.17.1
zipp                     3.21.0

Information

[X] The official example scripts
[X] My own modified scripts

Reproduction

Following the SAT workflow, I changed only the model path in sat/configs/cogvideox1.5_5b.yaml and sat/configs/inference.yaml, then ran inference for a 10-second video on a single 80 GB GPU.

# inference.yaml
args:
#  image2video: False  # True for image2video, False for text2video
  latent_channels: 16
  mode: inference
  load: "MY_PATH" # This is for Full model without lora adapter
  batch_size: 1
  input_type: txt
  input_file: configs/test.txt 
  sampling_image_size: [768, 1360] # remove this for I2V
  sampling_num_frames: 42  # 42 for 10 seconds and 22 for 5 seconds
  sampling_fps: 16
  bf16: True
  output_dir: cogvideox1.5
  force_inference: True

However, I got an OOM error at line 236 of sample_video_ori.py:

    samples_x = model.decode_first_stage(samples_z).to(torch.float32)

The OOM output is:

##############################  Sampling setting  ##############################
Sampler: VPSDEDPMPP2MSampler
Discretization: ZeroSNRDDPMDiscretization
Guider: DynamicCFG
Sampling with VPSDEDPMPP2MSampler for 51 steps:  98%|█████████▊| 50/51 [43:58<00:52, 52.76s/it]
/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/conv.py:720: UserWarning: cuDNN cannot be used for large non-batch-splittable convolutions if the V8 API is not enabled or before cuDNN version 9.3+. Consider upgrading cuDNN and/or enabling the V8 API for better efficiency. (Triggered internally at ../aten/src/ATen/native/Convolution.cpp:430.)
  return F.conv3d(
0it [44:19, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/sample_video_ori.py", line 262, in <module>
[rank0]:     sampling_main(args, model_cls=SATVideoDiffusionEngine)
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/sample_video_ori.py", line 236, in sampling_main
[rank0]:     samples_x = model.decode_first_stage(samples_z).to(torch.float32)
[rank0]:   File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/diffusion_video.py", line 198, in decode_first_stage
[rank0]:     recon = self.first_stage_model.decode(
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/autoencoder.py", line 620, in decode
[rank0]:     x = super().decode(z, use_cp=use_cp, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/autoencoder.py", line 214, in decode
[rank0]:     x = self.decoder(z, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/cp_enc_dec.py", line 960, in forward
[rank0]:     h = self.up[i_level].block[i_block](
[rank0]:   File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/cp_enc_dec.py", line 681, in forward
[rank0]:     h = self.conv1(h, clear_cache=clear_fake_cp_cache)
[rank0]:   File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/cp_enc_dec.py", line 391, in forward
[rank0]:     output_parallel = self.conv(input_parallel)
[rank0]:   File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/utils.py", line 88, in forward
[rank0]:     output = torch.cat(output_chunks, dim=2)
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.98 GiB. GPU 0 has a total capacity of 79.32 GiB of which 2.24 GiB is free. Including non-PyTorch memory, this process has 77.08 GiB memory in use. Of the allocated memory 69.85 GiB is allocated by PyTorch, and 5.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W1112 23:23:31.317344620 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
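
The figures in the OOM message above are easier to compare once they are put side by side. The following is only a reading aid, a minimal sketch that parses the message text as quoted; `OOM_MESSAGE`, `PATTERNS`, `parse_oom`, and `idle_gib` are illustrative names and not part of CogVideo or PyTorch.

```python
import re

# Text copied from the torch.OutOfMemoryError line in the log above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 4.98 GiB. GPU 0 has a total capacity "
    "of 79.32 GiB of which 2.24 GiB is free. Including non-PyTorch memory, this "
    "process has 77.08 GiB memory in use. Of the allocated memory 69.85 GiB is "
    "allocated by PyTorch, and 5.44 GiB is reserved by PyTorch but unallocated."
)

# One regular expression per figure reported in the message.
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) GiB",
    "total": r"total capacity of ([\d.]+) GiB",
    "free": r"of which ([\d.]+) GiB is free",
    "in_use": r"has ([\d.]+) GiB memory in use",
    "allocated": r"([\d.]+) GiB is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) GiB is reserved by PyTorch",
}


def parse_oom(message: str) -> dict[str, float]:
    """Extract the GiB figures from a PyTorch CUDA OOM message."""
    stats = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"field {name!r} not found in message")
        stats[name] = float(match.group(1))
    return stats


def idle_gib(stats: dict[str, float]) -> float:
    """Memory not holding live tensors: free on the device plus cached by PyTorch."""
    return stats["free"] + stats["reserved_unallocated"]


stats = parse_oom(OOM_MESSAGE)
print(f"requested:           {stats['requested']:.2f} GiB")
print(f"free + cached idle:  {idle_gib(stats):.2f} GiB")
print(f"live tensors:        {stats['allocated']:.2f} GiB")
```

Free memory plus the reserved-but-unallocated pool comes to more than the 4.98 GiB requested, which suggests fragmentation may be contributing and is presumably why the message points to expandable_segments. It does not show that the decode would fit: about 69.85 GiB was already held by live tensors when the allocation failed.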

Expected behavior

Identify the likely cause of the OOM so that 10-second inference completes successfully.

Nov 12 '24 15:11 DZY-irene

Also, each 10-second video takes about 40 minutes to generate. Is that a reasonable duration in your experience? Looking forward to your reply!

Nov 12 '24 15:11 DZY-irene

I also tried adding PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to inference.sh, but got the following error:

##############################  Sampling setting  ##############################
Sampler: VPSDEDPMPP2MSampler
Discretization: ZeroSNRDDPMDiscretization
Guider: DynamicCFG
Sampling with VPSDEDPMPP2MSampler for 51 steps:  98%|█████████▊| 50/51 [44:23<00:53, 53.27s/it]
/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/conv.py:720: UserWarning: cuDNN cannot be used for large non-batch-splittable convolutions if the V8 API is not enabled or before cuDNN version 9.3+. Consider upgrading cuDNN and/or enabling the V8 API for better efficiency. (Triggered internally at ../aten/src/ATen/native/Convolution.cpp:430.)
  return F.conv3d(
0it [44:42, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/sample_video_ori.py", line 262, in <module>
[rank0]:     sampling_main(args, model_cls=SATVideoDiffusionEngine)
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/sample_video_ori.py", line 236, in sampling_main
[rank0]:     samples_x = model.decode_first_stage(samples_z).to(torch.float32)
[rank0]:   File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/diffusion_video.py", line 198, in decode_first_stage
[rank0]:     recon = self.first_stage_model.decode(
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/autoencoder.py", line 620, in decode
[rank0]:     x = super().decode(z, use_cp=use_cp, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/autoencoder.py", line 214, in decode
[rank0]:     x = self.decoder(z, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/cp_enc_dec.py", line 960, in forward
[rank0]:     h = self.up[i_level].block[i_block](
[rank0]:   File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/cp_enc_dec.py", line 676, in forward
[rank0]:     h = self.norm1(h, zq, clear_fake_cp_cache=clear_fake_cp_cache, fake_cp=fake_cp)
[rank0]:   File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/petrelfs/dongziyue/video/CogVideo/sat/vae_modules/cp_enc_dec.py", line 482, in forward
[rank0]:     new_f = norm_f * self.conv_y(zq) + self.conv_b(zq)
[rank0]: RuntimeError: CUDA driver error: invalid argument
[rank0]:[W1113 00:22:35.424437131 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

Nov 12 '24 17:11 DZY-irene

I get exactly the same OOM error.

Nov 12 '24 18:11 zigchang

The duration is normal: generating a 10-second video does take about 40 minutes. We will look into the OOM issue shortly and release a diffusers version with significantly reduced memory usage.

Nov 13 '24 01:11 zRzRzRzRzRzRzR

@zRzRzRzRzRzRzR @DZY-irene @zigchang Any update on this? I am still getting the OOM error.

Nov 15 '24 21:11 Florenyci

We are aware of this issue, but I have been working on the diffusers version recently. You can check the latest PR; it is in the final stage, and the diffusers version requires a minimum of only 9 GB of GPU memory.

Nov 16 '24 02:11 zRzRzRzRzRzRzR
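
For readers who want to try the diffusers route mentioned above, the following is a minimal sketch of the memory-saving switches that diffusers exposes for CogVideoX pipelines. It assumes a diffusers release that includes CogVideoX1.5 support and that the weights are available under the Hugging Face ID `THUDM/CogVideoX1.5-5B`. Both are assumptions, as are the helper names `build_low_memory_pipeline` and `generate_video`, the frame count, and the step count. It has not been checked against the 9 GB figure quoted in the thread.

```python
def build_low_memory_pipeline(model_id: str = "THUDM/CogVideoX1.5-5B"):
    """Load a CogVideoX pipeline with diffusers' memory-saving options enabled.

    The default model_id is an assumption; point it at a local path if needed.
    """
    # Imported lazily so this module can be imported without torch/diffusers.
    import torch
    from diffusers import CogVideoXPipeline

    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    # Keep weights on the CPU and move submodules to the GPU only while they
    # run. This lowers peak GPU memory at the cost of speed.
    pipe.enable_sequential_cpu_offload()
    # Decode latents in spatial tiles, and process batch entries one at a time,
    # so the VAE decoder never sees the full-resolution video at once.
    pipe.vae.enable_tiling()
    pipe.vae.enable_slicing()
    return pipe


def generate_video(pipe, prompt: str, output_path: str, num_frames: int = 161):
    """Run text-to-video generation and write the result to output_path.

    num_frames=161 is an assumed value for roughly 10 seconds at 16 fps;
    height and width mirror sampling_image_size in the inference.yaml above.
    """
    from diffusers.utils import export_to_video

    frames = pipe(
        prompt=prompt,
        height=768,
        width=1360,
        num_frames=num_frames,
        num_inference_steps=50,  # assumed; tune as needed
    ).frames[0]
    export_to_video(frames, output_path, fps=16)
    return output_path
```

A run would then look like `generate_video(build_low_memory_pipeline(), "your prompt", "output.mp4")`. Sequential CPU offload trades speed for memory, so generation time may be longer than with the whole model resident on the GPU.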