Traceback (most recent call last):
File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 197, in _replace_sub_module
replace_layer = target_module.from_native_module(
File "/mnt/volumes/perception/jinbu/yangbaihan/opensora2/Open-Sora/opensora/acceleration/shardformer/modeling/t5.py", line 31, in from_native_module
assert module.__class__.__name__ == "FusedRMSNorm", (
AssertionError: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm. Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mnt/volumes/perception/jinbu/yangbaihan/opensora2/Open-Sora/scripts/train.py", line 287, in
main()
File "/mnt/volumes/perception/jinbu/yangbaihan/opensora2/Open-Sora/scripts/train.py", line 132, in main
text_encoder = build_module(cfg.text_encoder, MODELS, device=device) # T5 must be fp32
File "/mnt/volumes/perception/jinbu/yangbaihan/opensora2/Open-Sora/opensora/registry.py", line 22, in build_module
return builder.build(cfg)
File "/root/anaconda3/lib/python3.9/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/root/anaconda3/lib/python3.9/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/mnt/volumes/perception/jinbu/yangbaihan/opensora2/Open-Sora/opensora/models/text_encoder/t5.py", line 287, in init
self.shardformer_t5()
File "/mnt/volumes/perception/jinbu/yangbaihan/opensora2/Open-Sora/opensora/models/text_encoder/t5.py", line 306, in shardformer_t5
optim_model, _ = shard_former.optimize(self.t5.model, policy=T5EncoderPolicy())
File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/shardformer.py", line 54, in optimize
shared_params = sharder.shard()
File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 43, in shard
self._replace_module(include=held_layers)
File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 67, in _replace_module
self._recursive_replace_layer(
File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
self._recursive_replace_layer(
File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
self._recursive_replace_layer(
File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
self._recursive_replace_layer(
[Previous line repeated 2 more times]
File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 112, in _recursive_replace_layer
self._replace_sub_module(module, sub_module_replacement, include)
File "/root/anaconda3/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 201, in _replace_sub_module
raise RuntimeError(
RuntimeError: Failed to replace layer_norm of type T5LayerNorm with T5LayerNorm with the exception: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm. Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48. Please check your model configuration or sharding policy, you can set up an issue for us to help you as well.
[2024-03-29 19:59:25,619] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 6407) of binary: /root/anaconda3/bin/python
Traceback (most recent call last):
File "/root/anaconda3/bin/torchrun", line 8, in
sys.exit(main())
File "/root/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/root/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/root/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
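For context: Hugging Face Transformers only aliases T5LayerNorm to apex's FusedRMSNorm when the apex import succeeds (see the modeling_t5.py link in the error); otherwise it falls back to its pure-PyTorch T5LayerNorm, and Open-Sora's shardformer policy then trips the assertion above. A minimal check, run in the same Python environment as training, tells you which case you are in:

import sys

try:
    # This is the same import Transformers attempts in modeling_t5.py;
    # if it succeeds, T5LayerNorm is replaced by apex's FusedRMSNorm.
    from apex.normalization import FusedRMSNorm  # noqa: F401
    print("apex FusedRMSNorm available: shardformer's T5 policy should work")
except ImportError as e:
    print(f"apex fused kernels missing ({e}): rebuild apex with --cuda_ext", file=sys.stderr)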
Same error here. Is there any way to fix it?
Download the latest version of apex. DO NOT use apex 22.04-dev.
Solution:
git clone https://github.com/NVIDIA/apex.git
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
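After the build completes, a quick smoke test (a sketch that assumes a visible CUDA GPU) confirms the compiled fused_layer_norm_cuda extension actually works, since FusedRMSNorm only imports it lazily on first use:

import torch
from apex.normalization import FusedRMSNorm

# Constructing and calling FusedRMSNorm exercises the lazy import of the
# compiled fused_layer_norm_cuda extension built by --cuda_ext.
norm = FusedRMSNorm(512).cuda()
out = norm(torch.randn(2, 16, 512, device="cuda"))
print(out.shape)  # expected: torch.Size([2, 16, 512])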
@COST-97 That doesn't seem to work; building the latest version fails straight away with:
Building wheel for apex (pyproject.toml): finished with status 'error'
ERROR: Failed building wheel for apex
Failed to build apex
ERROR: Could not build wheels for apex, which is required to install pyproject.toml-based projects
Could you share your torch, CUDA, and nvcc versions? I'm using CUDA 12.2 and torch 2.2.2.
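For reference, apex's setup.py compares the nvcc version against the CUDA version torch was built with and aborts the --cuda_ext build on a major-version mismatch, which is a common cause of this wheel failure. A small script (the output values in the comments are illustrative, not guaranteed) gathers the versions being asked about:

import subprocess
import torch

print("torch:", torch.__version__)                   # e.g. 2.2.2
print("torch built with CUDA:", torch.version.cuda)  # e.g. 12.1
# Compare against the local CUDA toolkit used to compile apex.
nvcc = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
print(nvcc.stdout.strip().splitlines()[-1])          # e.g. Build cuda_12.2...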
Hello, how did you solve this?
Did you manage to solve it?