
errors happened in the inference process

Open reich208github opened this issue 1 year ago • 3 comments

Hi guys,

After I run the inference command:

torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x256x256.py --ckpt-path OpenSora-v1-HQ-16x256x256.pth --prompt-path ./assets/texts/t2v_samples.txt

I get errors. The complete output is as follows:

[2024-04-27 14:23:05,034] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Config (path: configs/opensora/inference/16x256x256.py): {'num_frames': 16, 'fps': 8, 'image_size': (256, 256), 'model': {'type': 'STDiT-XL/2', 'space_scale': 0.5, 'time_scale': 1.0, 'enable_flashattn': True, 'enable_layernorm_kernel': True, 'from_pretrained': 'OpenSora-v1-HQ-16x256x256.pth'}, 'vae': {'type': 'VideoAutoencoderKL', 'from_pretrained': 'stabilityai/sd-vae-ft-ema', 'micro_batch_size': 4}, 'text_encoder': {'type': 't5', 'from_pretrained': 'DeepFloyd/t5-v1_1-xxl', 'model_max_length': 120}, 'scheduler': {'type': 'iddpm', 'num_sampling_steps': 100, 'cfg_scale': 7.0}, 'dtype': 'fp16', 'batch_size': 1, 'seed': 42, 'prompt_path': './assets/texts/t2v_samples.txt', 'save_dir': './outputs/samples/', 'multi_resolution': False}
/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: config is deprecated and will be removed soon.
  warnings.warn("config is deprecated and will be removed soon.")
[04/27/24 14:23:14] INFO colossalai - colossalai - INFO: /root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/colossalai/initialize.py:67 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:28<00:00, 14.11s/it]
Missing keys: []
Unexpected keys: []
0%| | 0/100 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/home/yilinchen/Open-Sora/scripts/inference.py", line 112, in <module>
    main()
  File "/home/yilinchen/Open-Sora/scripts/inference.py", line 93, in main
    samples = scheduler.sample(
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/__init__.py", line 72, in sample
    samples = self.p_sample_loop(
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 434, in p_sample_loop
    for sample in self.p_sample_loop_progressive(
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 485, in p_sample_loop_progressive
    out = self.p_sample(
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 388, in p_sample
    out = self.p_mean_variance(
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/respace.py", line 94, in p_mean_variance
    return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 267, in p_mean_variance
    model_output = model(x, t, **model_kwargs)
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/respace.py", line 127, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/__init__.py", line 89, in forward_with_cfg
    model_out = model.forward(combined, timestep, y, **kwargs)
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/models/stdit/stdit.py", line 267, in forward
    x = auto_grad_checkpoint(block, x, y, t0, y_lens, tpe)
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/acceleration/checkpoint.py", line 24, in auto_grad_checkpoint
    return module(*args, **kwargs)
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/models/stdit/stdit.py", line 98, in forward
    x_s = self.attn(x_s)
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/models/layers/blocks.py", line 152, in forward
    from flash_attn import flash_attn_func
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/flash_attn/__init__.py", line 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 10, in <module>
    import flash_attn_2_cuda as flash_attn_cuda
ImportError: /root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_
[2024-04-27 14:24:15,201] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 14511) of binary: /root/anaconda3/envs/env_open_sora/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/env_open_sora/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/inference.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time : 2024-04-27_14:24:15
  host : yilinchen-X10SRA
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 14511)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

My installed packages related to opensora-1.0.0 are:

apex 0.1
flash-attn 2.5.6
ninja 1.11.1.1
torch 2.1.2+cu121
torchaudio 2.1.2+cu121
torchvision 0.16.2+cu121
xformers 0.0.23.post1
packaging 24.0
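
A version list like this can also be collected from inside the environment. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_versions` is hypothetical and not part of Open-Sora.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names taken from the list in this issue.
    names = ["apex", "flash-attn", "ninja", "torch", "torchaudio",
             "torchvision", "xformers", "packaging"]
    for name, version in installed_versions(names).items():
        print(f"{name:12s} {version or 'not installed'}")
```

Run it with the same Python interpreter that launches `torchrun`, so the versions reported are the ones the inference script actually sees.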

The system CUDA version and the CUDA version PyTorch was built against are the same, both 12.1:

(env_open_sora) root@yilinchen-X10SRA:/home/yilinchen/Open-Sora# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
(env_open_sora) root@yilinchen-X10SRA:/home/yilinchen/Open-Sora# python
Python 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.version.cuda)
12.1
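
The same comparison can be scripted. This is a rough sketch that parses the release number out of `nvcc -V` text and compares it with a version string such as the value of `torch.version.cuda`; the function names `nvcc_release` and `cuda_versions_match` are hypothetical. Matching CUDA versions is a necessary check, but it does not rule out a compiled extension having been built against a different torch build, which is what an undefined-symbol ImportError usually points to.

```python
import re


def nvcc_release(nvcc_output):
    """Extract the 'release X.Y' number from `nvcc -V` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_versions_match(nvcc_output, torch_cuda):
    """True if the nvcc release equals the CUDA version reported by torch."""
    return nvcc_release(nvcc_output) == torch_cuda


# Sample line copied from the nvcc output in this issue.
sample = "Cuda compilation tools, release 12.1, V12.1.105"
print(nvcc_release(sample))
print(cuda_versions_match(sample, "12.1"))
```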

My GPU is an RTX 4090:

Product Name : NVIDIA GeForce RTX 4090
Product Brand : GeForce
Product Architecture : Ada Lovelace

The installation of opensora-1.0.0 was successful:

Successfully installed opensora-1.0.0

All files are in the Open-Sora directory. I also put the downloaded .pth files directly in Open-Sora, not in any sub-directory. These .pth files are:

OpenSora-v1-16x256x256.pth
OpenSora-v1-HQ-16x512x512.pth
OpenSora-v1-HQ-16x256x256.pth

Given the problems described above, could anyone help me fix them?

Thanks a lot~

reich208github, Apr 27 '24 07:04

Can you pip install --upgrade flash-attn --no-build-isolation?

JThh, Apr 28 '24 12:04


OK, after I ran it to upgrade flash-attn from 2.5.6 to 2.5.8, videos can be created now! Thank you so much, my friend!

reich208github, Apr 28 '24 16:04
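
In the traceback above, `flash_attn` is imported inside the attention layer's `forward`, so the failure only surfaced after the checkpoints had loaded and sampling had started. A quick way to catch this earlier is to try the import on its own before launching inference. The sketch below is one possible check; the helper name `try_import` is hypothetical.

```python
import importlib


def try_import(module_name):
    """Return (True, None) if the module imports, else (False, error message)."""
    try:
        importlib.import_module(module_name)
        return True, None
    except ImportError as exc:
        # Covers both a missing package and a compiled extension that fails
        # to load, for example because of an undefined symbol.
        return False, str(exc)


ok, error = try_import("flash_attn")
print("flash_attn import OK" if ok else f"flash_attn import failed: {error}")
```

If the import fails with an undefined-symbol message, reinstalling flash-attn against the currently installed torch, as suggested in this thread, is the first thing to try.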

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot], May 06 '24 01:05