Open-Sora
T5LayerNorm error
Hi guys,
I'm finally able to start the training, but I'm encountering these errors:
AssertionError: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48
immediately followed by this one:
RuntimeError: Failed to replace layer_norm of type T5LayerNorm with T5LayerNorm with the exception: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48. Please check your model configuration or sharding policy, you can set up an issue for us to help you as well.
E0817 18:46:55.563000 134001257882240 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 7268) of binary: /usr/bin/python3
Would you have any idea what could be done here, by any chance?
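For reference, the assertion that fires expects the T5 encoder's norm layers to already be apex's FusedRMSNorm (which Transformers only uses when NVIDIA apex is installed from source), while the warning further down in the log suggests apex is not installed in my environment. A minimal check along these lines could confirm that (just a sanity check on my part, assuming the stock apex package layout, i.e. apex.normalization):

# Check whether apex's fused RMS norm is importable; if it is not, Hugging Face's
# T5 falls back to its own T5LayerNorm, which is what the shardformer policy trips over.
try:
    from apex.normalization import FusedRMSNorm  # only present when apex is built from source
    print("apex FusedRMSNorm is available")
except ImportError:
    print("apex not installed -> T5 keeps transformers' plain T5LayerNorm")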
Here is the command, followed by the log trace:
torchrun --standalone --nproc_per_node 1 -m scripts.train \
configs/opensora-v1-2/train/stage1.py \
--data-path {ROOT_META}/meta_clips_caption_cleaned.csv \
--ckpt-path {MODEL_OUTPUT}/my_sora.pt
/usr/local/lib/python3.10/dist-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/usr/local/lib/python3.10/dist-packages/torch/utils/_pytree.py:300: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
[2024-08-17 18:45:05] Experiment directory created at outputs/005-STDiT3-XL-2
[2024-08-17 18:45:05] Training configuration:
{'adam_eps': 1e-15,
'bucket_config': {'1024': {1: (0.05, 36)},
'1080p': {1: (0.1, 5)},
'144p': {1: (1.0, 475),
51: (1.0, 51),
102: ((1.0, 0.33), 27),
204: ((1.0, 0.1), 13),
408: ((1.0, 0.1), 6)},
'2048': {1: (0.1, 5)},
'240p': {1: (0.3, 297),
51: (0.4, 20),
102: ((0.4, 0.33), 10),
204: ((0.4, 0.1), 5),
408: ((0.4, 0.1), 2)},
'256': {1: (0.4, 297),
51: (0.5, 20),
102: ((0.5, 0.33), 10),
204: ((0.5, 0.1), 5),
408: ((0.5, 0.1), 2)},
'360p': {1: (0.2, 141),
51: (0.15, 8),
102: ((0.15, 0.33), 4),
204: ((0.15, 0.1), 2),
408: ((0.15, 0.1), 1)},
'480p': {1: (0.1, 89)},
'512': {1: (0.1, 141)},
'720p': {1: (0.05, 36)}},
'ckpt_every': 200,
'config': 'configs/opensora-v1-2/train/stage1.py',
'dataset': {'data_path': '/content/drive/MyDrive/Open-Sora/opensora/data/meta/meta_clips_caption_cleaned.csv',
'transform_name': 'resize_crop',
'type': 'VariableVideoTextDataset'},
'dtype': 'bf16',
'ema_decay': 0.99,
'epochs': 1000,
'grad_checkpoint': True,
'grad_clip': 1.0,
'load': None,
'log_every': 10,
'lr': 0.0001,
'mask_ratios': {'image_head': 0.05,
'image_head_tail': 0.025,
'image_random': 0.025,
'image_tail': 0.025,
'intepolate': 0.005,
'quarter_head': 0.005,
'quarter_head_tail': 0.005,
'quarter_random': 0.005,
'quarter_tail': 0.005,
'random': 0.05},
'model': {'enable_flash_attn': True,
'enable_layernorm_kernel': True,
'freeze_y_embedder': True,
'from_pretrained': '/content/drive/MyDrive/Open-Sora/opensora/output/my_sora.pt',
'qk_norm': True,
'type': 'STDiT3-XL/2'},
'num_bucket_build_workers': 16,
'num_workers': 8,
'outputs': 'outputs',
'plugin': 'zero2',
'record_time': False,
'scheduler': {'sample_method': 'logit-normal',
'type': 'rflow',
'use_timestep_transform': True},
'seed': 42,
'start_from_scratch': False,
'text_encoder': {'from_pretrained': 'DeepFloyd/t5-v1_1-xxl',
'model_max_length': 300,
'shardformer': True,
'type': 't5'},
'vae': {'from_pretrained': 'hpcai-tech/OpenSora-VAE-v1.2',
'micro_batch_size': 4,
'micro_frame_size': 17,
'type': 'OpenSoraVAE_V1_2'},
'wandb': False,
'warmup_steps': 1000}
2024-08-17 18:45:05.718442: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-17 18:45:05.740201: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-17 18:45:05.746831: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-17 18:45:06.875860: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2024-08-17 18:45:07] Building dataset...
[2024-08-17 18:45:07] Dataset contains 941 samples.
[2024-08-17 18:45:07] Number of buckets: 626
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-08-17 18:45:07] Building buckets...
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
[2024-08-17 18:45:08] Bucket Info:
[2024-08-17 18:45:08] Bucket [#sample, #batch] by aspect ratio:
{'0.56': [614, 13]}
[2024-08-17 18:45:08] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-08-17 18:45:08] Video Bucket [#sample, #batch] by HxWxT:
{('144p', 408): [1, 0],
('144p', 204): [10, 0],
('144p', 102): [126, 4],
('144p', 51): [477, 9]}
[2024-08-17 18:45:08] #training batch: 13, #training sample: 614, #non empty bucket: 4
[2024-08-17 18:45:08] Building models...
[2024-08-17 18:45:08] WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.4.0+cu121 with CUDA 1201 (you have 2.3.0+cu121)
Python 3.10.14 (you have 3.10.12)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_2d.py:34: FutureWarning: `Transformer2DModelOutput` is deprecated and will be removed in version 1.0.0. Importing `Transformer2DModelOutput` from `diffusers.models.transformer_2d` is deprecated and this will be removed in a future version. Please use `from diffusers.models.modeling_outputs import Transformer2DModelOutput`, instead.
deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
tokenizer_config.json: 100% 1.86k/1.86k [00:00<00:00, 12.5MB/s]
config.json: 100% 752/752 [00:00<00:00, 5.11MB/s]
spiece.model: 100% 792k/792k [00:00<00:00, 42.9MB/s]
special_tokens_map.json: 100% 1.79k/1.79k [00:00<00:00, 13.0MB/s]
pytorch_model.bin.index.json: 100% 20.0k/20.0k [00:00<00:00, 70.0MB/s]
Downloading shards: 0% 0/2 [00:00<?, ?it/s]
pytorch_model-00001-of-00002.bin: 0% 0.00/9.45G [00:00<?, ?B/s]
pytorch_model-00001-of-00002.bin: 0% 31.5M/9.45G [00:00<00:34, 270MB/s]
pytorch_model-00001-of-00002.bin: 1% 83.9M/9.45G [00:00<00:24, 389MB/s]
...
pytorch_model-00002-of-00002.bin: 99% 9.52G/9.60G [00:37<00:01, 60.6MB/s]
pytorch_model-00002-of-00002.bin: 100% 9.60G/9.60G [00:38<00:00, 252MB/s]
Downloading shards: 100% 2/2 [01:07<00:00, 33.70s/it]
Loading checkpoint shards: 100% 2/2 [00:23<00:00, 11.86s/it]
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 197, in _replace_sub_module
[rank0]: replace_layer = target_module.from_native_module(
[rank0]: File "/content/drive/MyDrive/Open-Sora/opensora/opensora/acceleration/shardformer/modeling/t5.py", line 31, in from_native_module
[rank0]: assert module.__class__.__name__ == "FusedRMSNorm", (
[rank0]: AssertionError: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48
[rank0]: During handling of the above exception, another exception occurred:
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/content/drive/MyDrive/Open-Sora/opensora/scripts/train.py", line 412, in <module>
[rank0]: main()
[rank0]: File "/content/drive/MyDrive/Open-Sora/opensora/scripts/train.py", line 118, in main
[rank0]: text_encoder = build_module(cfg.get("text_encoder", None), MODELS, device=device, dtype=dtype)
[rank0]: File "/content/drive/MyDrive/Open-Sora/opensora/opensora/registry.py", line 24, in build_module
[rank0]: return builder.build(cfg)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
[rank0]: return self.build_func(cfg, *args, **kwargs, registry=self)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank0]: obj = obj_cls(**args) # type: ignore
[rank0]: File "/content/drive/MyDrive/Open-Sora/opensora/opensora/models/text_encoder/t5.py", line 164, in __init__
[rank0]: self.shardformer_t5()
[rank0]: File "/content/drive/MyDrive/Open-Sora/opensora/opensora/models/text_encoder/t5.py", line 183, in shardformer_t5
[rank0]: optim_model, _ = shard_former.optimize(self.t5.model, policy=T5EncoderPolicy())
[rank0]: File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/shardformer.py", line 55, in optimize
[rank0]: shared_params = sharder.shard()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 43, in shard
[rank0]: self._replace_module(include=held_layers)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 67, in _replace_module
[rank0]: self._recursive_replace_layer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
[rank0]: self._recursive_replace_layer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
[rank0]: self._recursive_replace_layer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
[rank0]: self._recursive_replace_layer(
[rank0]: [Previous line repeated 2 more times]
[rank0]: File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 112, in _recursive_replace_layer
[rank0]: self._replace_sub_module(module, sub_module_replacement, include)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 201, in _replace_sub_module
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: Failed to replace layer_norm of type T5LayerNorm with T5LayerNorm with the exception: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48. Please check your model configuration or sharding policy, you can set up an issue for us to help you as well.
E0817 18:46:55.563000 134001257882240 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 7268) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts.train FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-17_18:46:55
host : 33ca7b3d91f9
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 7268)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
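In case it helps narrow things down: the config above has shardformer: True for the text encoder and enable_layernorm_kernel: True for the model, and the crash happens inside shard_former.optimize(...) on the T5 encoder. A workaround I'm considering (untested, just my guess at the relevant knob in configs/opensora-v1-2/train/stage1.py) would be to skip the shardformer optimization of the T5 encoder so the T5LayerNorm -> FusedRMSNorm recovery is never attempted:

# Hypothetical edit to configs/opensora-v1-2/train/stage1.py -- not a confirmed fix
text_encoder = dict(
    type="t5",
    from_pretrained="DeepFloyd/t5-v1_1-xxl",
    model_max_length=300,
    shardformer=False,  # was True; avoids the T5LayerNorm -> FusedRMSNorm recovery path
)

The alternative, if sharding the text encoder is actually required, would presumably be to install apex from source as the earlier warning suggests, so the fused RMS norm kernel really is in place before shardformer tries to recover it.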