
T5LayerNorm Recovering Error

White-YC opened this issue 1 year ago • 4 comments

Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 197, in _replace_sub_module
    replace_layer = target_module.from_native_module(
  File "/mnt/volumes/perception/jinbu/yangbaihan/opensora2/Open-Sora/opensora/acceleration/shardformer/modeling/t5.py", line 31, in from_native_module
    assert module.__class__.__name__ == "FusedRMSNorm", (
AssertionError: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm. Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/volumes/perception/jinbu/yangbaihan/opensora2/Open-Sora/scripts/train.py", line 287, in <module>
    main()
  File "/mnt/volumes/perception/jinbu/yangbaihan/opensora2/Open-Sora/scripts/train.py", line 132, in main
    text_encoder = build_module(cfg.text_encoder, MODELS, device=device)  # T5 must be fp32
  File "/mnt/volumes/perception/jinbu/yangbaihan/opensora2/Open-Sora/opensora/registry.py", line 22, in build_module
    return builder.build(cfg)
  File "/root/anaconda3/lib/python3.9/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/root/anaconda3/lib/python3.9/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/mnt/volumes/perception/jinbu/yangbaihan/opensora2/Open-Sora/opensora/models/text_encoder/t5.py", line 287, in __init__
    self.shardformer_t5()
  File "/mnt/volumes/perception/jinbu/yangbaihan/opensora2/Open-Sora/opensora/models/text_encoder/t5.py", line 306, in shardformer_t5
    optim_model, _ = shard_former.optimize(self.t5.model, policy=T5EncoderPolicy())
  File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/shardformer.py", line 54, in optimize
    shared_params = sharder.shard()
  File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 43, in shard
    self._replace_module(include=held_layers)
  File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 67, in _replace_module
    self._recursive_replace_layer(
  File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
    self._recursive_replace_layer(
  File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
    self._recursive_replace_layer(
  File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
    self._recursive_replace_layer(
  [Previous line repeated 2 more times]
  File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 112, in _recursive_replace_layer
    self._replace_sub_module(module, sub_module_replacement, include)
  File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 201, in _replace_sub_module
    raise RuntimeError(
RuntimeError: Failed to replace layer_norm of type T5LayerNorm with T5LayerNorm with the exception: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm. Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48. Please check your model configuration or sharding policy, you can set up an issue for us to help you as well.

[2024-03-29 19:59:25,619] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 6407) of binary: /root/anaconda3/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
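
The assertion fires because Open-Sora's T5EncoderPolicy swaps the encoder's layer norms back and requires them to already be apex's FusedRMSNorm. A quick way to see which class transformers actually instantiated (a minimal sketch, not from the original report; the checkpoint name is only an example):

    # Sketch: if this prints "T5LayerNorm" instead of "FusedRMSNorm", apex is
    # missing or was not picked up, and the assertion above will fail.
    # "DeepFloyd/t5-v1_1-xxl" is an illustrative checkpoint, not the poster's config.
    from transformers import T5EncoderModel
    model = T5EncoderModel.from_pretrained("DeepFloyd/t5-v1_1-xxl")
    print(type(model.encoder.block[0].layer[0].layer_norm).__name__)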

White-YC avatar Mar 29 '24 20:03 White-YC

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar Apr 06 '24 01:04 github-actions[bot]

Same error, is there any way to fix it?

Lanhaoran avatar Apr 09 '24 14:04 Lanhaoran

Download the latest version of apex. DO NOT use apex 22.04-dev. Solution:
git clone https://github.com/NVIDIA/apex.git
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
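
As a quick sanity check after the build (just a suggestion, needs a GPU), confirm the fused norm and its compiled CUDA extension actually load, since transformers only replaces T5LayerNorm when this import succeeds:

    # Illustrative check: constructing FusedRMSNorm loads apex's compiled
    # fused_layer_norm_cuda extension, so this fails if the --cuda_ext build
    # above did not really succeed.
    import torch
    from apex.normalization import FusedRMSNorm
    norm = FusedRMSNorm(8).cuda()
    print(norm(torch.randn(2, 8, device="cuda")).shape)  # torch.Size([2, 8])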

COST-97 avatar Apr 11 '24 03:04 COST-97

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar Apr 20 '24 01:04 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Apr 27 '24 01:04 github-actions[bot]

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

@COST-97 This doesn't seem to work; building the latest version fails right away with:

Building wheel for apex (pyproject.toml): finished with status 'error'
  ERROR: Failed building wheel for apex
Failed to build apex
ERROR: Could not build wheels for apex, which is required to install pyproject.toml-based projects

Could you share your torch, CUDA, and nvcc versions? I'm using CUDA 12.2 and torch 2.2.2.
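
One common reason the apex wheel build fails is a mismatch between the nvcc toolkit on the machine and the CUDA version PyTorch was built against, which apex's setup checks during the build. A sketch for printing both, to compare against a working environment:

    # Illustrative: print the versions apex compares during its build. A mismatch
    # between nvcc's CUDA version and torch.version.cuda is a frequent cause of
    # "Failed building wheel for apex".
    import subprocess, torch
    print("torch:", torch.__version__)
    print("torch built with CUDA:", torch.version.cuda)
    print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)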

AlphaNext avatar Apr 28 '24 10:04 AlphaNext

Download the latest version of apex. DO NOT use apex 22.04-dev. Solution:
git clone https://github.com/NVIDIA/apex.git
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

Hello, how did you solve this in the end?

fenghe12 avatar Jun 05 '24 13:06 fenghe12

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

@COST-97 This doesn't seem to work; building the latest version fails right away with:

Building wheel for apex (pyproject.toml): finished with status 'error'
  ERROR: Failed building wheel for apex
Failed to build apex
ERROR: Could not build wheels for apex, which is required to install pyproject.toml-based projects

Could you share your torch, CUDA, and nvcc versions? I'm using CUDA 12.2 and torch 2.2.2.

Did you manage to solve this?

fenghe12 avatar Jun 05 '24 13:06 fenghe12