
T5LayerNorm error

Open · lweingart opened this issue 6 months ago · 0 comments

Hi guys,

I'm finally able to start the training, but I'm encountering these errors:

AssertionError: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48

and the following as well:

RuntimeError: Failed to replace layer_norm of type T5LayerNorm with T5LayerNorm with the exception: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48. Please check your model configuration or sharding policy, you can set up an issue for us to help you as well.
E0817 18:46:55.563000 134001257882240 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 7268) of binary: /usr/bin/python3

Would you have any idea what could be done here, by chance? (A possible workaround sketch follows the log trace below.)

Here is the command, followed by the log trace:

torchrun --standalone --nproc_per_node 1 -m scripts.train \
    configs/opensora-v1-2/train/stage1.py \
    --data-path {ROOT_META}/meta_clips_caption_cleaned.csv \
    --ckpt-path {MODEL_OUTPUT}/my_sora.pt
/usr/local/lib/python3.10/dist-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/usr/local/lib/python3.10/dist-packages/torch/utils/_pytree.py:300: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
[2024-08-17 18:45:05] Experiment directory created at outputs/005-STDiT3-XL-2
[2024-08-17 18:45:05] Training configuration:
 {'adam_eps': 1e-15,
 'bucket_config': {'1024': {1: (0.05, 36)},
                   '1080p': {1: (0.1, 5)},
                   '144p': {1: (1.0, 475),
                            51: (1.0, 51),
                            102: ((1.0, 0.33), 27),
                            204: ((1.0, 0.1), 13),
                            408: ((1.0, 0.1), 6)},
                   '2048': {1: (0.1, 5)},
                   '240p': {1: (0.3, 297),
                            51: (0.4, 20),
                            102: ((0.4, 0.33), 10),
                            204: ((0.4, 0.1), 5),
                            408: ((0.4, 0.1), 2)},
                   '256': {1: (0.4, 297),
                           51: (0.5, 20),
                           102: ((0.5, 0.33), 10),
                           204: ((0.5, 0.1), 5),
                           408: ((0.5, 0.1), 2)},
                   '360p': {1: (0.2, 141),
                            51: (0.15, 8),
                            102: ((0.15, 0.33), 4),
                            204: ((0.15, 0.1), 2),
                            408: ((0.15, 0.1), 1)},
                   '480p': {1: (0.1, 89)},
                   '512': {1: (0.1, 141)},
                   '720p': {1: (0.05, 36)}},
 'ckpt_every': 200,
 'config': 'configs/opensora-v1-2/train/stage1.py',
 'dataset': {'data_path': '/content/drive/MyDrive/Open-Sora/opensora/data/meta/meta_clips_caption_cleaned.csv',
             'transform_name': 'resize_crop',
             'type': 'VariableVideoTextDataset'},
 'dtype': 'bf16',
 'ema_decay': 0.99,
 'epochs': 1000,
 'grad_checkpoint': True,
 'grad_clip': 1.0,
 'load': None,
 'log_every': 10,
 'lr': 0.0001,
 'mask_ratios': {'image_head': 0.05,
                 'image_head_tail': 0.025,
                 'image_random': 0.025,
                 'image_tail': 0.025,
                 'intepolate': 0.005,
                 'quarter_head': 0.005,
                 'quarter_head_tail': 0.005,
                 'quarter_random': 0.005,
                 'quarter_tail': 0.005,
                 'random': 0.05},
 'model': {'enable_flash_attn': True,
           'enable_layernorm_kernel': True,
           'freeze_y_embedder': True,
           'from_pretrained': '/content/drive/MyDrive/Open-Sora/opensora/output/my_sora.pt',
           'qk_norm': True,
           'type': 'STDiT3-XL/2'},
 'num_bucket_build_workers': 16,
 'num_workers': 8,
 'outputs': 'outputs',
 'plugin': 'zero2',
 'record_time': False,
 'scheduler': {'sample_method': 'logit-normal',
               'type': 'rflow',
               'use_timestep_transform': True},
 'seed': 42,
 'start_from_scratch': False,
 'text_encoder': {'from_pretrained': 'DeepFloyd/t5-v1_1-xxl',
                  'model_max_length': 300,
                  'shardformer': True,
                  'type': 't5'},
 'vae': {'from_pretrained': 'hpcai-tech/OpenSora-VAE-v1.2',
         'micro_batch_size': 4,
         'micro_frame_size': 17,
         'type': 'OpenSoraVAE_V1_2'},
 'wandb': False,
 'warmup_steps': 1000}
2024-08-17 18:45:05.718442: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-17 18:45:05.740201: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-17 18:45:05.746831: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-17 18:45:06.875860: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2024-08-17 18:45:07] Building dataset...
[2024-08-17 18:45:07] Dataset contains 941 samples.
[2024-08-17 18:45:07] Number of buckets: 626
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-08-17 18:45:07] Building buckets...
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
[2024-08-17 18:45:08] Bucket Info:
[2024-08-17 18:45:08] Bucket [#sample, #batch] by aspect ratio:
{'0.56': [614, 13]}
[2024-08-17 18:45:08] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-08-17 18:45:08] Video Bucket [#sample, #batch] by HxWxT:
{('144p', 408): [1, 0],
 ('144p', 204): [10, 0],
 ('144p', 102): [126, 4],
 ('144p', 51): [477, 9]}
[2024-08-17 18:45:08] #training batch: 13, #training sample: 614, #non empty bucket: 4
[2024-08-17 18:45:08] Building models...
[2024-08-17 18:45:08] WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.4.0+cu121 with CUDA 1201 (you have 2.3.0+cu121)
    Python  3.10.14 (you have 3.10.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_2d.py:34: FutureWarning: `Transformer2DModelOutput` is deprecated and will be removed in version 1.0.0. Importing `Transformer2DModelOutput` from `diffusers.models.transformer_2d` is deprecated and this will be removed in a future version. Please use `from diffusers.models.modeling_outputs import Transformer2DModelOutput`, instead.
  deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
tokenizer_config.json: 100% 1.86k/1.86k [00:00<00:00, 12.5MB/s]
config.json: 100% 752/752 [00:00<00:00, 5.11MB/s]
spiece.model: 100% 792k/792k [00:00<00:00, 42.9MB/s]
special_tokens_map.json: 100% 1.79k/1.79k [00:00<00:00, 13.0MB/s]
pytorch_model.bin.index.json: 100% 20.0k/20.0k [00:00<00:00, 70.0MB/s]
Downloading shards:   0% 0/2 [00:00<?, ?it/s]
pytorch_model-00001-of-00002.bin:   0% 0.00/9.45G [00:00<?, ?B/s]
pytorch_model-00001-of-00002.bin:   0% 31.5M/9.45G [00:00<00:34, 270MB/s]
pytorch_model-00001-of-00002.bin:   1% 83.9M/9.45G [00:00<00:24, 389MB/s]
...
pytorch_model-00002-of-00002.bin:  99% 9.52G/9.60G [00:37<00:01, 60.6MB/s]
pytorch_model-00002-of-00002.bin: 100% 9.60G/9.60G [00:38<00:00, 252MB/s] 
Downloading shards: 100% 2/2 [01:07<00:00, 33.70s/it]
Loading checkpoint shards: 100% 2/2 [00:23<00:00, 11.86s/it]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 197, in _replace_sub_module
[rank0]:     replace_layer = target_module.from_native_module(
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/acceleration/shardformer/modeling/t5.py", line 31, in from_native_module
[rank0]:     assert module.__class__.__name__ == "FusedRMSNorm", (
[rank0]: AssertionError: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/scripts/train.py", line 412, in <module>
[rank0]:     main()
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/scripts/train.py", line 118, in main
[rank0]:     text_encoder = build_module(cfg.get("text_encoder", None), MODELS, device=device, dtype=dtype)
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/registry.py", line 24, in build_module
[rank0]:     return builder.build(cfg)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
[rank0]:     return self.build_func(cfg, *args, **kwargs, registry=self)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank0]:     obj = obj_cls(**args)  # type: ignore
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/models/text_encoder/t5.py", line 164, in __init__
[rank0]:     self.shardformer_t5()
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/models/text_encoder/t5.py", line 183, in shardformer_t5
[rank0]:     optim_model, _ = shard_former.optimize(self.t5.model, policy=T5EncoderPolicy())
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/shardformer.py", line 55, in optimize
[rank0]:     shared_params = sharder.shard()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 43, in shard
[rank0]:     self._replace_module(include=held_layers)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 67, in _replace_module
[rank0]:     self._recursive_replace_layer(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
[rank0]:     self._recursive_replace_layer(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
[rank0]:     self._recursive_replace_layer(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
[rank0]:     self._recursive_replace_layer(
[rank0]:   [Previous line repeated 2 more times]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 112, in _recursive_replace_layer
[rank0]:     self._replace_sub_module(module, sub_module_replacement, include)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 201, in _replace_sub_module
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: Failed to replace layer_norm of type T5LayerNorm with T5LayerNorm with the exception: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48. Please check your model configuration or sharding policy, you can set up an issue for us to help you as well.
E0817 18:46:55.563000 134001257882240 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 7268) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts.train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-17_18:46:55
  host      : 33ca7b3d91f9
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 7268)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

lweingart · Aug 17 '24 18:08
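
For context: the traceback originates in opensora/acceleration/shardformer/modeling/t5.py, whose from_native_module asserts that the layer being recovered is apex's FusedRMSNorm, while the warning near the top of the log (colossalai/shardformer/layer/normalization.py:45) shows apex is not installed, so Transformers falls back to its plain T5LayerNorm. Two plausible remedies are installing apex from source (https://github.com/NVIDIA/apex) or disabling ShardFormer for the T5 text encoder. A minimal sketch of the latter, assuming configs/opensora-v1-2/train/stage1.py uses the dict-style fields shown in the logged training configuration above (field values copied from that log, not verified against the repo):

# In configs/opensora-v1-2/train/stage1.py (sketch, assumes dict-style config):
text_encoder = dict(
    type="t5",
    from_pretrained="DeepFloyd/t5-v1_1-xxl",
    model_max_length=300,
    shardformer=False,  # was True; skips shardformer_t5() and the FusedRMSNorm assertion
)

If this works, it should only forgo the ShardFormer optimization of the text encoder; training itself should proceed unchanged, just without that speedup.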