RuntimeError: Error building extension 'fused_ema_adam'

Open · AlphaNext opened this issue 1 year ago · 7 comments

System Info

  • Code version: CogVideo commit 354c906f8160084bbdf1f1c42b3b292d509fe24b
  • CUDA 12.2, Torch 2.4.0, GCC 11.x (see the version check sketched below)
  • Environment: dependencies installed via pip install -r requirements.txt from the sat directory
  • Scenario: SFT fine-tuning under sat
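
For context, a quick way to confirm the toolchain that the JIT extension build will actually pick up (a minimal sketch, assuming gcc, nvcc and the conda python are all on PATH):

$ gcc --version && g++ --version    # host compiler used to build the extension
$ nvcc --version                    # CUDA toolkit compiler
$ python -c "import torch; print(torch.__version__, torch.version.cuda)"    # torch build and its compile-time CUDA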

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts and tasks

Reproduction

Error log:

Deleting key loss.discriminator.blocks.6.0.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_q.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_kv.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_out.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.bias from state_dict.
Deleting key loss.discriminator.to_logits.0.weight from state_dict.
Deleting key loss.discriminator.to_logits.0.bias from state_dict.
Deleting key loss.discriminator.to_logits.3.weight from state_dict.
Deleting key loss.discriminator.to_logits.3.bias from state_dict.
Missing keys:  []
Unexpected keys:  []
Restored from /home/video/models/CogVideoX-2b-sat/vae/3d-vae.pt
[2024-09-12 20:35:00,319] [INFO] [RANK 0]  > number of parameters on model parallel rank 0: 6705808355
[2024-09-12 20:35:40,928] [INFO] [RANK 0] global rank 0 is loading checkpoint /home/video/models/CogVideoX-2b-sat/transformer/1/mp_rank_00_model_states.pt
/opt/conda/lib/python3.10/site-packages/sat/training/model_io.py:286: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  sd = torch.load(checkpoint_name, map_location='cpu')
[2024-09-12 20:35:44,101] [INFO] [RANK 0] > successfully loaded /home/video/models/CogVideoX-2b-sat/transformer/1/mp_rank_00_model_states.pt
[2024-09-12 20:35:44,814] [INFO] [RANK 0] ***** Total trainable parameters: 1693783872 *****
[2024-09-12 20:35:44,815] [INFO] [RANK 0] [<class 'sat.ops.layernorm.LayerNorm'>, <class 'torch.nn.modules.normalization.LayerNorm'>, <class 'sat.ops.layernorm.RMSNorm'>] is set to no_weight_decay
[2024-09-12 20:35:44,819] [INFO] [RANK 0] Syncing initialized parameters...
[2024-09-12 20:35:44,999] [INFO] [RANK 0] Finished syncing initialized parameters.
[2024-09-12 20:35:45,000] [INFO] [RANK 0] Using optimizer sat.ops.FusedEmaAdam from sat.
[2024-09-12 20:35:45,000] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
[2024-09-12 20:35:45,001] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-09-12 20:35:45,112] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/fused_ema_adam/build.ninja...
/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module fused_ema_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF fused_ema_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_ema_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/sat/ops/csrc/includes -I/opt/conda/lib/python3.10/site-packages/sat/ops/csrc/adam -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /opt/conda/lib/python3.10/site-packages/sat/ops/csrc/adam/fused_ema_adam_frontend.cpp -o fused_ema_adam_frontend.o 
FAILED: fused_ema_adam_frontend.o 
c++ -MMD -MF fused_ema_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_ema_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/sat/ops/csrc/includes -I/opt/conda/lib/python3.10/site-packages/sat/ops/csrc/adam -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /opt/conda/lib/python3.10/site-packages/sat/ops/csrc/adam/fused_ema_adam_frontend.cpp -o fused_ema_adam_frontend.o 
In file included from /opt/conda/lib/python3.10/site-packages/torch/include/ATen/core/TensorBase.h:14,
                 from /opt/conda/lib/python3.10/site-packages/torch/include/ATen/core/TensorBody.h:38,
                 from /opt/conda/lib/python3.10/site-packages/torch/include/ATen/core/Tensor.h:3,
                 from /opt/conda/lib/python3.10/site-packages/torch/include/ATen/Tensor.h:3,
                 from /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/autograd/function_hook.h:3,
                 from /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/autograd/cpp_hook.h:2,
                 from /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/autograd/variable.h:6,
                 from /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/autograd/autograd.h:3,
                 from /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include/torch/autograd.h:3,
                 from /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include/torch/all.h:7,
                 from /opt/conda/lib/python3.10/site-packages/torch/include/torch/extension.h:5,
                 from /opt/conda/lib/python3.10/site-packages/sat/ops/csrc/adam/fused_ema_adam_frontend.cpp:6:
/opt/conda/lib/python3.10/site-packages/torch/include/c10/util/C++17.h:13:2: error: #error "You're trying to build PyTorch with a too old version of GCC. We need GCC 9 or later."
 #error \
  ^~~~~
ninja: build stopped: subcommand failed.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2105, in _run_ninja_build
[rank0]:     subprocess.run(
[rank0]:   File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
[rank0]:     raise CalledProcessError(retcode, process.args,
[rank0]: subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/video/CogVideoX-354c906/sat/train_video.py", line 223, in <module>
[rank0]:     training_main(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 116, in training_main
[rank0]:     model, optimizer = setup_model_untrainable_params_and_optimizer(args, model)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 225, in setup_model_untrainable_params_and_optimizer
[rank0]:     model, optimizer, _, _ = deepspeed.initialize(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 181, in initialize
[rank0]:     engine = DeepSpeedEngine(args=args,
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 306, in __init__
[rank0]:     self._configure_optimizer(optimizer, model_parameters)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1230, in _configure_optimizer
[rank0]:     basic_optimizer = client_optimizer(model_parameters)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/sat/ops/__init__.py", line 33, in <lambda>
[rank0]:     '__call__': lambda self, *args, **kwargs: getattr(import_module(self.path), self.name)(*args, **kwargs),
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/sat/ops/fused_ema_adam.py", line 86, in __init__
[rank0]:     fused_ema_adam_cuda = FusedEmaAdamBuilder().jit_load()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/sat/ops/ops_builder/builder.py", line 480, in jit_load
[rank0]:     op_module = load(name=self.name,
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1312, in load
[rank0]:     return _jit_compile(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1722, in _jit_compile
[rank0]:     _write_ninja_file_and_build_library(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1834, in _write_ninja_file_and_build_library
[rank0]:     _run_ninja_build(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2121, in _run_ninja_build
[rank0]:     raise RuntimeError(message) from e
[rank0]: RuntimeError: Error building extension 'fused_ema_adam'
E0912 20:36:06.406000 139966916675392 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 1026) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

Expected behavior

Solve this error!

AlphaNext · Sep 12 '24 12:09

How did you install sat? Try a plain pip install of the latest version. This error happens while running the code; is your sat version 0.4.12?

zRzRzRzRzRzRzR · Sep 12 '24 14:09

How did you install sat? Try a plain pip install of the latest version. This error happens while running the code; is your sat version 0.4.12?

@zRzRzRzRzRzRzR sat was installed as the version pinned in requirements.txt, which is 0.4.12; that should be the latest sat release. The error occurs while running the code.

AlphaNext · Sep 13 '24 02:09

It is probably that your CUDA and torch do not fully match. I just hit the same error with CUDA 11.5; although you list 12.2, make sure your torch build was compiled against your CUDA version. After I switched to CUDA 12.2 it worked fine.

zRzRzRzRzRzRzR · Sep 13 '24 07:09

It is probably that your CUDA and torch do not fully match. I just hit the same error with CUDA 11.5; although you list 12.2, make sure your torch build was compiled against your CUDA version. After I switched to CUDA 12.2 it worked fine.

CUDA and Torch do match. I also tried torch 2.4.0 with CUDA 11.8 and the problem persists. Below is the output of the environment check:

>>> import torch
>>> print(torch.__version__)
2.4.0+cu121
>>>
>>> print(torch.cuda.is_available())
True
>>> exit()

AlphaNext · Sep 13 '24 09:09
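
Side note: torch.cuda.is_available() only confirms that the runtime can see a GPU. To compare the CUDA and cuDNN versions the torch wheel was built against with the locally installed toolkit, checks along these lines could be used (a sketch relying on standard PyTorch attributes and the nvcc CLI):

$ python -c "import torch; print(torch.version.cuda, torch.backends.cudnn.version())"    # versions torch was built with
$ nvcc --version    # toolkit version installed locally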

Then the only option is to comment out everything related to EMA, because the only likely cause is a torch/CUDA/cuDNN mismatch; this problem should not occur otherwise.

Try whether you can run from torch.cuda.amp import autocast. If that fails, it is most likely an environment setup problem.

zRzRzRzRzRzRzR · Sep 13 '24 11:09

Then the only option is to comment out everything related to EMA, because the only likely cause is a torch/CUDA/cuDNN mismatch; this problem should not occur otherwise.

Try whether you can run from torch.cuda.amp import autocast. If that fails, it is most likely an environment setup problem.

Where should I add that line? Also, which commit id did you test above?

AlphaNext · Sep 13 '24 12:09

Then the only option is to comment out everything related to EMA, because the only likely cause is a torch/CUDA/cuDNN mismatch; this problem should not occur otherwise.

Try whether you can run from torch.cuda.amp import autocast. If that fails, it is most likely an environment setup problem.

Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> from torch.cuda.amp import autocast
>>> from torch.cuda.amp import autocast
>>>

AlphaNext · Sep 13 '24 12:09

Then the only option is to comment out everything related to EMA, because the only likely cause is a torch/CUDA/cuDNN mismatch; this problem should not occur otherwise. Try whether you can run from torch.cuda.amp import autocast. If that fails, it is most likely an environment setup problem.

Where should I add that line? Also, which commit id did you test above?

The current main branch, 3fb5631b7651e5cc5d83b4aad9bd008da15c6040.

zRzRzRzRzRzRzR · Sep 14 '24 04:09

Then the only option is to comment out everything related to EMA, because the only likely cause is a torch/CUDA/cuDNN mismatch; this problem should not occur otherwise. Try whether you can run from torch.cuda.amp import autocast. If that fails, it is most likely an environment setup problem.

Where should I add that line? Also, which commit id did you test above?

The current main branch, 3fb5631.

Solved. Very likely some CUDA components depend on GCC (for example cuda-compiler and cuda-nvcc). Upgrading GCC properly would require completely removing the old GCC; I only installed GCC 11 alongside it and changed the GCC symlink, but the following approach is enough to fix it:

$ export CC=/usr/bin/gcc
$ export CXX=/usr/bin/g++
$ export CUDA_ROOT=/usr/local/cuda
$ ln -s /usr/bin/gcc $CUDA_ROOT/bin/gcc
$ ln -s /usr/bin/g++ $CUDA_ROOT/bin/g++

https://github.com/NVlabs/instant-ngp/issues/119#issuecomment-1034701258
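
After setting CC/CXX and adding the symlinks, it may be worth confirming that both the host compiler and the copy placed in the CUDA bin directory report GCC 9 or newer, and optionally clearing the failed extension cache shown in the log before re-running (a sketch; the cache path is the one reported above):

$ gcc --version                     # host compiler
$ $CUDA_ROOT/bin/gcc --version      # compiler nvcc will now resolve to
$ rm -rf ~/.cache/torch_extensions/py310_cu121/fused_ema_adam    # optional: drop the failed build artifacts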

AlphaNext · Sep 14 '24 08:09