Open-Sora
Errors during inference
Hi all,
When I run the inference command:
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x256x256.py --ckpt-path OpenSora-v1-HQ-16x256x256.pth --prompt-path ./assets/texts/t2v_samples.txt
it fails with an error. The complete output is as follows:
[2024-04-27 14:23:05,034] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Config (path: configs/opensora/inference/16x256x256.py): {'num_frames': 16, 'fps': 8, 'image_size': (256, 256), 'model': {'type': 'STDiT-XL/2', 'space_scale': 0.5, 'time_scale': 1.0, 'enable_flashattn': True, 'enable_layernorm_kernel': True, 'from_pretrained': 'OpenSora-v1-HQ-16x256x256.pth'}, 'vae': {'type': 'VideoAutoencoderKL', 'from_pretrained': 'stabilityai/sd-vae-ft-ema', 'micro_batch_size': 4}, 'text_encoder': {'type': 't5', 'from_pretrained': 'DeepFloyd/t5-v1_1-xxl', 'model_max_length': 120}, 'scheduler': {'type': 'iddpm', 'num_sampling_steps': 100, 'cfg_scale': 7.0}, 'dtype': 'fp16', 'batch_size': 1, 'seed': 42, 'prompt_path': './assets/texts/t2v_samples.txt', 'save_dir': './outputs/samples/', 'multi_resolution': False}
/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: config is deprecated and will be removed soon.
warnings.warn("config is deprecated and will be removed soon.")
[04/27/24 14:23:14] INFO colossalai - colossalai - INFO: /root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/colossalai/initialize.py:67 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:28<00:00, 14.11s/it]
Missing keys: []
Unexpected keys: []
0%| | 0/100 [00:03<?, ?it/s]
Traceback (most recent call last):
File "/home/yilinchen/Open-Sora/scripts/inference.py", line 112, in
main()
File "/home/yilinchen/Open-Sora/scripts/inference.py", line 93, in main
samples = scheduler.sample(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/init.py", line 72, in sample
samples = self.p_sample_loop(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 434, in p_sample_loop
for sample in self.p_sample_loop_progressive(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 485, in p_sample_loop_progressive
out = self.p_sample(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 388, in p_sample
out = self.p_mean_variance(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/respace.py", line 94, in p_mean_variance
return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 267, in p_mean_variance
model_output = model(x, t, **model_kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/respace.py", line 127, in call
return self.model(x, new_ts, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/init.py", line 89, in forward_with_cfg
model_out = model.forward(combined, timestep, y, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/models/stdit/stdit.py", line 267, in forward
x = auto_grad_checkpoint(block, x, y, t0, y_lens, tpe)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/acceleration/checkpoint.py", line 24, in auto_grad_checkpoint
return module(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/models/stdit/stdit.py", line 98, in forward
x_s = self.attn(x_s)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/models/layers/blocks.py", line 152, in forward
from flash_attn import flash_attn_func
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/flash_attn/init.py", line 3, in
from flash_attn.flash_attn_interface import (
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 10, in
import flash_attn_2_cuda as flash_attn_cuda
ImportError: /root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2
[2024-04-27 14:24:15,201] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 14511) of binary: /root/anaconda3/envs/env_open_sora/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/env_open_sora/bin/torchrun", line 8, in
sys.exit(main())
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
scripts/inference.py FAILED
Failures: <NO_OTHER_FAILURES>
Root Cause (first observed failure): [0]: time : 2024-04-27_14:24:15 host : yilinchen-X10SRA rank : 0 (local_rank: 0) exitcode : 1 (pid: 14511) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
The installed packages relevant to opensora-1.0.0 are:
apex 0.1
flash-attn 2.5.6
ninja 1.11.1.1
torch 2.1.2+cu121
torchaudio 2.1.2+cu121
torchvision 0.16.2+cu121
xformers 0.0.23.post1
packaging 24.0
The system CUDA version and the CUDA version PyTorch was built against are the same, both 12.1:
(env_open_sora) root@yilinchen-X10SRA:/home/yilinchen/Open-Sora# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
(env_open_sora) root@yilinchen-X10SRA:/home/yilinchen/Open-Sora# python
Python 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.version.cuda)
12.1
My GPU is an RTX 4090:
Product Name : NVIDIA GeForce RTX 4090
Product Brand : GeForce
Product Architecture : Ada Lovelace
The installation of opensora-1.0.0 succeeded:
Successfully installed opensora-1.0.0
All files are in the Open-Sora directory. I also put the downloaded .pth files directly in Open-Sora, not in any sub-directory of it. These .pth files are:
OpenSora-v1-16x256x256.pth
OpenSora-v1-HQ-16x512x512.pth
OpenSora-v1-HQ-16x256x256.pth
Could anyone help me fix the problem described above?
Thanks a lot~
Can you try running pip install --upgrade flash-attn --no-build-isolation?
OK, after I ran it to upgrade flash-attn from 2.5.6 to 2.5.8, videos can be created now! Thank you so much, my friend!
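For anyone landing here with the same traceback: an "undefined symbol" ImportError raised while loading flash_attn_2_cuda typically means the compiled flash-attn extension was built against a different PyTorch build than the one installed, which is consistent with the reinstall fixing it. Below is a minimal diagnostic sketch, not part of Open-Sora, that reports whether a module imports cleanly and which versions are installed. The helper names (installed_version, try_import, diagnose) are hypothetical and chosen only for illustration.

```python
import importlib
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def try_import(module_name):
    """Try to import a module and return (ok, error_message)."""
    try:
        importlib.import_module(module_name)
        return True, ""
    except ImportError as exc:
        # A compiled extension that fails to load (for example with an
        # "undefined symbol" message) also surfaces as ImportError.
        return False, str(exc)


def diagnose(module_name="flash_attn", dist_name="flash-attn"):
    """Collect import status and version info for one module (hypothetical helper)."""
    ok, err = try_import(module_name)
    return {
        "module": module_name,
        "importable": ok,
        "version": installed_version(dist_name),
        "torch_version": installed_version("torch"),
        # Heuristic only: this string match is an assumption, not an official check.
        "abi_mismatch_suspected": (not ok) and "undefined symbol" in err,
        "error": err,
    }


if __name__ == "__main__":
    for key, value in diagnose().items():
        print(f"{key}: {value}")
```

If abi_mismatch_suspected comes back True, reinstalling flash-attn with the command suggested above (so that it matches the installed torch) is the first thing to try.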