[Bug] Support to compile CogVideoXPipeline

Open loretoparisi opened this issue 5 months ago • 1 comments

Your current environment information

Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OneFlow version: none
Nexfort version: none
OneDiff version: none
OneDiffX version: none

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 22 2023, 10:22:35)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.10.219-208.866.amzn2.x86_64-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G

Nvidia driver version: 535.183.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Byte Order:                           Little Endian
Address sizes:                        48 bits physical, 48 bits virtual
CPU(s):                               48
On-line CPU(s) list:                  0-47
Thread(s) per core:                   2
Core(s) per socket:                   24
Socket(s):                            1
NUMA node(s):                         1
Vendor ID:                            AuthenticAMD
CPU family:                           23
Model:                                49
Model name:                           AMD EPYC 7R32
Stepping:                             0
CPU MHz:                              3096.929
BogoMIPS:                             5599.34
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            768 KiB
L1i cache:                            768 KiB
L2 cache:                             12 MiB
L3 cache:                             96 MiB
NUMA node0 CPU(s):                    0-47
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow:   Mitigation; safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] onnx==1.14.1
[pip3] onnxruntime-gpu==1.14.1
[pip3] pytorch-lightning==1.9.5
[pip3] sentence-transformers==2.2.2
[pip3] torch==2.4.0
[pip3] torchao==0.3.1
[pip3] torchmetrics==1.4.1
[pip3] torchtune==0.2.1
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.0
[pip3] triton==3.0.0
[conda] Could not collect

🐛 Describe the bug

Saving a CogVideoXPipeline fails when using the official demo script, modified to add onediffx.

The dependencies were installed as follows:

pip install -r requirements.txt
# Added **experimental** support for onediff, this reduced sampling time by ~40% for me, reaching 4.23 s/it on 4090 with 49 frames. 
# This requires using Linux, torch 2.4.0, onediff and nexfort installation:
pip install --pre onediff onediffx
pip install nexfort

where the requirements are:

diffusers>=0.30.1 #git+https://github.com/huggingface/diffusers.git@main#egg=diffusers is suggested
transformers>=4.44.0  # The development team is working on version 0.44.2
accelerate>=0.33.0 #git+https://github.com/huggingface/accelerate.git@main#egg=accelerate is suggested
sentencepiece>=0.2.0 # T5 used
SwissArmyTransformer>=0.4.12
numpy
torch>=2.4.0 # Tested in 2.2 2.3 2.4 and 2.5, The development team is working on version 2.4.0.
torchvision>=0.19.0 # The development team is working on version 0.19.0.
gradio>=4.42.0 # For HF gradio demo
streamlit>=1.37.1 # For streamlit web demo
imageio==2.34.2 # For diffusers inference export video
imageio-ffmpeg==0.5.1 # For diffusers inference export video
openai>=1.42.0 # For prompt refiner
moviepy==1.0.3 # For export video
pillow==9.5.0
torchao==0.4.0
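
The environment report above lists torchao==0.3.1 while the requirements pin torchao==0.4.0, and it shows OneDiff, OneDiffX and Nexfort as "none". A minimal sketch for checking which versions the running interpreter actually sees is given below. The package list is an assumption taken from the names used in this report, and the distribution names are assumed to match the pip names.

```python
from importlib import metadata

# Packages named in this report; the list is illustrative, adjust as needed.
PACKAGES = ["torch", "torchao", "diffusers", "transformers", "onediff", "onediffx", "nexfort"]


def installed_versions(names):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this with the same Python interpreter that runs the demo script should show whether the pinned versions and the onediff packages are visible in that environment.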

Running the script below then fails with an error:

# load pipeline
pipe = CogVideoXPipeline.from_pretrained(
    model_path,
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=dtype,
).to(device)
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")

# optimizations enabled
pipe.enable_model_cpu_offload(device=device)
pipe.enable_sequential_cpu_offload(device=device)
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

# generate
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=num_videos_per_prompt,
    num_inference_steps=num_inference_steps,
    num_frames=num_frames,
    use_dynamic_cfg=True,  # used for the DPM scheduler; for the DDIM scheduler it should be False
    guidance_scale=guidance_scale,
    generator=torch.Generator(device=device).manual_seed(42),
).frames[0]

# save compiled pipeline if supported by compile_backend
if compile_backend == "onediff":
    save_pipe(pipe, dir="cached_pipe", overwrite=True)

The error was: 'T5EncoderModel' object has no attribute '_deployable_module_dpl_graph'. The folder cached_pipe was created, but it is empty.
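
One way to narrow this down is to check which pipeline components carry the attribute named in the error before calling save_pipe. The sketch below is only a diagnostic aid: the attribute name is copied from the error message, and treating it as the marker of a compiled module is an assumption, not documented onediff behaviour. The helper `split_by_attr` is hypothetical.

```python
# Attribute name copied from the error message above. Treating it as the
# marker of a compiled module is an assumption, not documented behaviour.
COMPILED_ATTR = "_deployable_module_dpl_graph"


def split_by_attr(components, attr=COMPILED_ATTR):
    """Split a {name: object} mapping into names that have `attr` and names that lack it."""
    with_attr, without_attr = [], []
    for name, obj in components.items():
        (with_attr if hasattr(obj, attr) else without_attr).append(name)
    return with_attr, without_attr
```

With the script above it could be called as `split_by_attr({"text_encoder": pipe.text_encoder, "transformer": pipe.transformer, "vae": pipe.vae})`. If text_encoder shows up in the second list, that would be consistent with the error being raised for the T5EncoderModel.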

Additionally, here are the TorchDynamo metrics (from the first run, before the error):

I0830 11:29:18.335000 140500382687232 torch/_dynamo/utils.py:335] TorchDynamo compilation metrics:
I0830 11:29:18.335000 140500382687232 torch/_dynamo/utils.py:335] Function, Runtimes (s)
V0830 11:29:18.335000 140500382687232 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats constrain_symbol_range: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V0830 11:29:18.335000 140500382687232 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats evaluate_expr: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
V0830 11:29:18.336000 140500382687232 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _simplify_floor_div: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V0830 11:29:18.336000 140500382687232 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _maybe_guard_rel: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
V0830 11:29:18.336000 140500382687232 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _find: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V0830 11:29:18.336000 140500382687232 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats has_hint: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
V0830 11:29:18.336000 140500382687232 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats size_hint: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
V0830 11:29:18.336000 140500382687232 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats simplify: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V0830 11:29:18.336000 140500382687232 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _update_divisible: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V0830 11:29:18.336000 140500382687232 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats replace: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V0830 11:29:18.336000 140500382687232 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _maybe_evaluate_static: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V0830 11:29:18.336000 140500382687232 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats get_implications: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V0830 11:29:18.336000 140500382687232 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats get_axioms: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
V0830 11:29:18.336000 140500382687232 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats safe_expand: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
V0830 11:29:18.337000 140500382687232 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats uninteresting_files: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)

loretoparisi, Aug 30 '24 09:08