InternVL

[Bug] fused_adam problem

Open 1518630367 opened this issue 1 year ago • 3 comments

Checklist

  • [ ] 1. I have searched related issues but cannot get the expected help.
  • [ ] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

Setting up the environment exactly as described in the requirements produces the following problem:

Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/adam -isystem /root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/include -isystem /root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/include/TH -isystem /root/miniconda3/envs/internvl/lib/python3.9/site-packages/torch/include/THC -isystem /root/miniconda3/envs/internvl/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90,code=compute_90 -DBF16_AVAILABLE -std=c++17 -c /root/miniconda3/envs/internvl/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o

Reproduction

shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_res_finetune_continue_lora.sh

Environment

2.3.1+cu121

Error traceback

No response

1518630367 avatar Jul 24 '24 05:07 1518630367

I have the same problem. This error appears after enabling DeepSpeed.

bang123-box avatar Jul 31 '24 02:07 bang123-box

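Which version of DeepSpeed do you have installed, 0.10.0 or 0.13.5? This is most likely an issue with DeepSpeed itself. You could search the issues in their repository to see whether a solution has already been posted.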

czczup avatar Aug 09 '24 05:08 czczup

Just uninstall apex. It works fine after that.

shiva-vardhineedi avatar Aug 11 '24 09:08 shiva-vardhineedi
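The two suggestions above (confirm which DeepSpeed version is installed, and check whether apex is present) can be checked in one step. The following is a minimal sketch rather than anything taken from the InternVL or DeepSpeed repositories. The helper names `installed_version` and `report` are hypothetical, and it assumes the packages are registered under the distribution names `torch`, `deepspeed`, and `apex` in the active environment.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report(dist_names=("torch", "deepspeed", "apex")):
    """Map each distribution name to its installed version (None if missing).

    The default names are assumptions about how the packages are registered.
    """
    return {name: installed_version(name) for name in dist_names}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name}: {version or 'not installed'}")
```

If apex shows up as installed, removing it as suggested above and rerunning the fine-tuning script is a quick way to test whether it is involved.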

This error may also be caused by the MPFR version: older versions raise the error, while 4.1.0 does not.

qishisuren123 avatar Sep 15 '24 13:09 qishisuren123