T5LayerNorm error

Open lweingart opened this issue 6 months ago • 0 comments

Hi guys,

I'm finally able to start the training, but I'm encountering these errors:

AssertionError: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers

and the following as well:

RuntimeError: Failed to replace layer_norm of type T5LayerNorm with T5LayerNorm with the exception: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers Please check your model configuration or sharding policy, you can set up an issue for us to help you as well.
E0817 18:46:55.563000 134001257882240 torch/distributed/elastic/multiprocessing/] failed (exitcode: 1) local_rank: 0 (pid: 7268) of binary: /usr/bin/python3

Would you have any idea what could be done here?
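Since the assertion checks for a class named `FusedRMSNorm`, and the log contains a warning asking to install apex from source, a first diagnostic is to see whether apex's fused norm can be imported at all in this environment. The sketch below is only a check, not a fix; the helper name `fused_rms_norm_available` is made up for illustration.

```python
import importlib


def fused_rms_norm_available() -> bool:
    """Return True if apex's FusedRMSNorm can be imported in this environment."""
    try:
        # apex exposes its fused norms under apex.normalization
        module = importlib.import_module("apex.normalization")
    except Exception:
        # apex is not installed, or its import fails for another reason
        return False
    return hasattr(module, "FusedRMSNorm")


if __name__ == "__main__":
    print("apex FusedRMSNorm importable:", fused_rms_norm_available())
```

If this prints `False`, the T5 layers are presumably the plain `T5LayerNorm` from Transformers rather than apex's fused version, which would be consistent with the assertion message.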

Here is the command, followed by the log trace:

torchrun --standalone --nproc_per_node 1 -m scripts.train \
    configs/opensora-v1-2/train/ \
    --data-path {ROOT_META}/meta_clips_caption_cleaned.csv \
    --ckpt-path {MODEL_OUTPUT}/
/usr/local/lib/python3.10/dist-packages/colossalai/pipeline/schedule/ UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/usr/local/lib/python3.10/dist-packages/torch/utils/ UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/layer/ UserWarning: Please install apex from source ( to use the fused layernorm kernel
  warnings.warn("Please install apex from source ( to use the fused layernorm kernel")
[2024-08-17 18:45:05] Experiment directory created at outputs/005-STDiT3-XL-2
[2024-08-17 18:45:05] Training configuration:
 {'adam_eps': 1e-15,
 'bucket_config': {'1024': {1: (0.05, 36)},
                   '1080p': {1: (0.1, 5)},
                   '144p': {1: (1.0, 475),
                            51: (1.0, 51),
                            102: ((1.0, 0.33), 27),
                            204: ((1.0, 0.1), 13),
                            408: ((1.0, 0.1), 6)},
                   '2048': {1: (0.1, 5)},
                   '240p': {1: (0.3, 297),
                            51: (0.4, 20),
                            102: ((0.4, 0.33), 10),
                            204: ((0.4, 0.1), 5),
                            408: ((0.4, 0.1), 2)},
                   '256': {1: (0.4, 297),
                           51: (0.5, 20),
                           102: ((0.5, 0.33), 10),
                           204: ((0.5, 0.1), 5),
                           408: ((0.5, 0.1), 2)},
                   '360p': {1: (0.2, 141),
                            51: (0.15, 8),
                            102: ((0.15, 0.33), 4),
                            204: ((0.15, 0.1), 2),
                            408: ((0.15, 0.1), 1)},
                   '480p': {1: (0.1, 89)},
                   '512': {1: (0.1, 141)},
                   '720p': {1: (0.05, 36)}},
 'ckpt_every': 200,
 'config': 'configs/opensora-v1-2/train/',
 'dataset': {'data_path': '/content/drive/MyDrive/Open-Sora/opensora/data/meta/meta_clips_caption_cleaned.csv',
             'transform_name': 'resize_crop',
             'type': 'VariableVideoTextDataset'},
 'dtype': 'bf16',
 'ema_decay': 0.99,
 'epochs': 1000,
 'grad_checkpoint': True,
 'grad_clip': 1.0,
 'load': None,
 'log_every': 10,
 'lr': 0.0001,
 'mask_ratios': {'image_head': 0.05,
                 'image_head_tail': 0.025,
                 'image_random': 0.025,
                 'image_tail': 0.025,
                 'intepolate': 0.005,
                 'quarter_head': 0.005,
                 'quarter_head_tail': 0.005,
                 'quarter_random': 0.005,
                 'quarter_tail': 0.005,
                 'random': 0.05},
 'model': {'enable_flash_attn': True,
           'enable_layernorm_kernel': True,
           'freeze_y_embedder': True,
           'from_pretrained': '/content/drive/MyDrive/Open-Sora/opensora/output/',
           'qk_norm': True,
           'type': 'STDiT3-XL/2'},
 'num_bucket_build_workers': 16,
 'num_workers': 8,
 'outputs': 'outputs',
 'plugin': 'zero2',
 'record_time': False,
 'scheduler': {'sample_method': 'logit-normal',
               'type': 'rflow',
               'use_timestep_transform': True},
 'seed': 42,
 'start_from_scratch': False,
 'text_encoder': {'from_pretrained': 'DeepFloyd/t5-v1_1-xxl',
                  'model_max_length': 300,
                  'shardformer': True,
                  'type': 't5'},
 'vae': {'from_pretrained': 'hpcai-tech/OpenSora-VAE-v1.2',
         'micro_batch_size': 4,
         'micro_frame_size': 17,
         'type': 'OpenSoraVAE_V1_2'},
 'wandb': False,
 'warmup_steps': 1000}
2024-08-17 18:45:05.718442: E external/local_xla/xla/stream_executor/cuda/] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-17 18:45:05.740201: E external/local_xla/xla/stream_executor/cuda/] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-17 18:45:05.746831: E external/local_xla/xla/stream_executor/cuda/] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-17 18:45:06.875860: W tensorflow/compiler/tf2tensorrt/utils/] TF-TRT Warning: Could not find TensorRT
[2024-08-17 18:45:07] Building dataset...
[2024-08-17 18:45:07] Dataset contains 941 samples.
[2024-08-17 18:45:07] Number of buckets: 626
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-08-17 18:45:07] Building buckets...
/usr/lib/python3.10/multiprocessing/ RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock. = os.fork()
[2024-08-17 18:45:08] Bucket Info:
[2024-08-17 18:45:08] Bucket [#sample, #batch] by aspect ratio:
{'0.56': [614, 13]}
[2024-08-17 18:45:08] Image Bucket [#sample, #batch] by HxWxT:
[2024-08-17 18:45:08] Video Bucket [#sample, #batch] by HxWxT:
{('144p', 408): [1, 0],
 ('144p', 204): [10, 0],
 ('144p', 102): [126, 4],
 ('144p', 51): [477, 9]}
[2024-08-17 18:45:08] #training batch: 13, #training sample: 614, #non empty bucket: 4
[2024-08-17 18:45:08] Building models...
[2024-08-17 18:45:08] WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.4.0+cu121 with CUDA 1201 (you have 2.3.0+cu121)
    Python  3.10.14 (you have 3.10.12)
  Please reinstall xformers (see
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/ FutureWarning: `Transformer2DModelOutput` is deprecated and will be removed in version 1.0.0. Importing `Transformer2DModelOutput` from `diffusers.models.transformer_2d` is deprecated and this will be removed in a future version. Please use `from diffusers.models.modeling_outputs import Transformer2DModelOutput`, instead.
  deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/ FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
tokenizer_config.json: 100% 1.86k/1.86k [00:00<00:00, 12.5MB/s]
config.json: 100% 752/752 [00:00<00:00, 5.11MB/s]
spiece.model: 100% 792k/792k [00:00<00:00, 42.9MB/s]
special_tokens_map.json: 100% 1.79k/1.79k [00:00<00:00, 13.0MB/s]
pytorch_model.bin.index.json: 100% 20.0k/20.0k [00:00<00:00, 70.0MB/s]
Downloading shards:   0% 0/2 [00:00<?, ?it/s]
pytorch_model-00001-of-00002.bin:   0% 0.00/9.45G [00:00<?, ?B/s]
pytorch_model-00001-of-00002.bin:   0% 31.5M/9.45G [00:00<00:34, 270MB/s]
pytorch_model-00001-of-00002.bin:   1% 83.9M/9.45G [00:00<00:24, 389MB/s]
pytorch_model-00002-of-00002.bin:  99% 9.52G/9.60G [00:37<00:01, 60.6MB/s]
pytorch_model-00002-of-00002.bin: 100% 9.60G/9.60G [00:38<00:00, 252MB/s] 
Downloading shards: 100% 2/2 [01:07<00:00, 33.70s/it]
Loading checkpoint shards: 100% 2/2 [00:23<00:00, 11.86s/it]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/", line 197, in _replace_sub_module
[rank0]:     replace_layer = target_module.from_native_module(
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/acceleration/shardformer/modeling/", line 31, in from_native_module
[rank0]:     assert module.__class__.__name__ == "FusedRMSNorm", (
[rank0]: AssertionError: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/scripts/", line 412, in <module>
[rank0]:     main()
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/scripts/", line 118, in main
[rank0]:     text_encoder = build_module(cfg.get("text_encoder", None), MODELS, device=device, dtype=dtype)
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/", line 24, in build_module
[rank0]:     return
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/", line 570, in build
[rank0]:     return self.build_func(cfg, *args, **kwargs, registry=self)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/", line 121, in build_from_cfg
[rank0]:     obj = obj_cls(**args)  # type: ignore
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/models/text_encoder/", line 164, in __init__
[rank0]:     self.shardformer_t5()
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/models/text_encoder/", line 183, in shardformer_t5
[rank0]:     optim_model, _ = shard_former.optimize(self.t5.model, policy=T5EncoderPolicy())
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/", line 55, in optimize
[rank0]:     shared_params = sharder.shard()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/", line 43, in shard
[rank0]:     self._replace_module(include=held_layers)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/", line 67, in _replace_module
[rank0]:     self._recursive_replace_layer(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/", line 115, in _recursive_replace_layer
[rank0]:     self._recursive_replace_layer(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/", line 115, in _recursive_replace_layer
[rank0]:     self._recursive_replace_layer(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/", line 115, in _recursive_replace_layer
[rank0]:     self._recursive_replace_layer(
[rank0]:   [Previous line repeated 2 more times]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/", line 112, in _recursive_replace_layer
[rank0]:     self._replace_sub_module(module, sub_module_replacement, include)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/", line 201, in _replace_sub_module
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: Failed to replace layer_norm of type T5LayerNorm with T5LayerNorm with the exception: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers Please check your model configuration or sharding policy, you can set up an issue for us to help you as well.
E0817 18:46:55.563000 134001257882240 torch/distributed/elastic/multiprocessing/] failed (exitcode: 1) local_rank: 0 (pid: 7268) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/", line 879, in main
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/", line 870, in run
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/", line 263, in launch_agent
    raise ChildFailedError(
scripts.train FAILED
Root Cause (first observed failure):
  time      : 2024-08-17_18:46:55
  host      : 33ca7b3d91f9
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 7268)
  error_file: <N/A>
  traceback : To enable traceback see:
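One thing that might be worth trying, as an untested assumption: the traceback goes through `self.shardformer_t5()`, and the printed configuration has `'shardformer': True` under `text_encoder`. If that flag is what gates the call, turning it off may skip the layer replacement that fails. Here is a minimal sketch of the override on a plain dict copied from the log; the helper `without_shardformer` is hypothetical and not part of Open-Sora.

```python
from copy import deepcopy

# text_encoder section as printed in the training configuration above
text_encoder_cfg = {
    "from_pretrained": "DeepFloyd/t5-v1_1-xxl",
    "model_max_length": 300,
    "shardformer": True,
    "type": "t5",
}


def without_shardformer(cfg: dict) -> dict:
    """Return a copy of the config with the shardformer flag disabled."""
    new_cfg = deepcopy(cfg)  # leave the original dict untouched
    new_cfg["shardformer"] = False
    return new_cfg


patched = without_shardformer(text_encoder_cfg)
print(patched)
```

The other route suggested by the warning in the log is to install apex from source so that the fused RMS norm is actually present.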

lweingart, Aug 17 '24 18:08