onediff
Error after setting the environment variable ONEFLOW_CONV_ALLOW_HALF_PRECISION_ACCUMULATION
Describe the bug
Setting the following environment variables causes a crash:
os.environ['ONEFLOW_CONV_ALLOW_HALF_PRECISION_ACCUMULATION'] = '0'
os.environ['ONEFLOW_MATMUL_ALLOW_HALF_PRECISION_ACCUMULATION'] = '0'
The error message is as follows:
F20240326 18:44:23.906044 1387 fused_matmul_bias_kernel.cu:84] Check failed: cublasLtMatmul( cuda_stream->cublas_lt_handle(), matmul_cache->operation_desc, &sp_alpha, weight->dptr(), matmul_cache->cublas_a_desc, x->dptr(), matmul_cache->cublas_b_desc, &sp_beta, (_add_to_output == nullptr) ? y_ptr : _add_to_output->dptr(), matmul_cache->cublas_c_desc, y_ptr, matmul_cache->cublas_c_desc, &matmul_cache->cublas_algo, cuda_stream->cublas_workspace(), cuda_stream->cublas_workspace_size(), cuda_stream->cuda_stream()) : CUBLAS_STATUS_NOT_SUPPORTED (15)
*** Check failure stack trace: ***
@ 0x7fe5b15ef96a google::LogMessage::Fail()
@ 0x7fe5b15f28a1 google::LogMessage::SendToLog()
@ 0x7fe5b15ef499 google::LogMessage::Flush()
@ 0x7fe5b15f3189 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fe5aacfb2fb oneflow::(anonymous namespace)::FusedMatmulBiasKernel::Compute()
@ 0x7fe5acd9ec45 oneflow::one::StatefulOpKernel::Compute()
@ 0x7fe5a97d56ea _ZZN7oneflow2vm21OpCallInstructionUtil7ComputeEPNS0_23OpCallInstructionPolicyEPNS0_6StreamEbbENKUlvE_clEv
@ 0x7fe5a97d7018 oneflow::vm::OpCallInstructionUtil::Compute()
@ 0x7fe5a97d3c0d _ZZN7oneflow2vm23OpCallInstructionPolicy7ComputeEPNS0_11InstructionEENKUlPKcE_clES5_.constprop.0
@ 0x7fe5a97d4469 oneflow::vm::OpCallInstructionPolicy::Compute()
@ 0x7fe5a97cd0c8 oneflow::vm::Instruction::Compute()
@ 0x7fe5a97c9de5 oneflow::vm::EpStreamPolicyBase::Run()
@ 0x7fe5a9827529 oneflow::vm::ThreadCtx::TryReceiveAndRun()
@ 0x7fe5a982bcad oneflow::(anonymous namespace)::WorkerLoop()
@ 0x7fe5a982c438 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamTypeEmEUlS6_E3_EEEEE6_M_runEv
@ 0x7fe5b1604f20 execute_native_thread_routine
@ 0x7fe6e198d609 start_thread
@ 0x7fe6e1758133 clone
Stack trace (most recent call last) in thread 1387:
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5b1604f1f, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5a982c437, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5a982bcac, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5a9827528, in vm::ThreadCtx::TryReceiveAndRun()
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5a97c9de4, in vm::EpStreamPolicyBase::Run(vm::Instruction*) const
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5a97cd0c7, in vm::Instruction::Compute()
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5a97d4468, in vm::OpCallInstructionPolicy::Compute(vm::Instruction*)
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5a97d3c0c, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5a97d7017, in vm::OpCallInstructionUtil::Compute(vm::OpCallInstructionPolicy*, vm::Stream*, bool, bool)
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5a97d56e9, in vm::OpCallInstructionUtil::Compute(vm::OpCallInstructionPolicy*, vm::Stream*, bool, bool)::{lambda()#1}::operator()() const
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5acd9ec44, in StatefulOpKernel::Compute(eager::CallContext*, ep::Stream*, user_op::OpKernel const*, user_op::OpKernelState*, user_op::OpKernelCache const*) const
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5aacfb2fa, in (anonymous namespace)::FusedMatmulBiasKernel::Compute(user_op::KernelComputeContext*, user_op::OpKernelState*, user_op::OpKernelCache const*) const
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5b15f3188, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5b15ef498, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5b15f28a0, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5b15ef969, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-3e0702bd.so", at 0x7fe5a1821e78, in
Aborted (Signal sent by tkill() 1119 0)
Aborted (core dumped)
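For anyone trying to reproduce this, here is a minimal sketch of how the two variables from the report could be toggled and cleared in one place. The helper names are hypothetical, and it is an assumption (not confirmed in this thread) that the variables should be set before `oneflow` is imported and that `'1'` is the enabling value; the report itself only uses `'0'`.

```python
import os

# Variable names copied from the report above.
HALF_ACCUM_VARS = (
    "ONEFLOW_CONV_ALLOW_HALF_PRECISION_ACCUMULATION",
    "ONEFLOW_MATMUL_ALLOW_HALF_PRECISION_ACCUMULATION",
)


def set_half_precision_accumulation(enabled: bool) -> dict:
    """Set both variables to '1' or '0' and return what was written.

    Assumption: call this before importing oneflow so the values are visible
    when the runtime first reads them.
    """
    value = "1" if enabled else "0"
    for name in HALF_ACCUM_VARS:
        os.environ[name] = value
    return {name: os.environ[name] for name in HALF_ACCUM_VARS}


def clear_half_precision_accumulation() -> None:
    """Remove both variables so the library defaults apply."""
    for name in HALF_ACCUM_VARS:
        os.environ.pop(name, None)


# Reproduce the setting from the report: both variables set to '0'.
print(set_half_precision_accumulation(False))
```

Calling `clear_half_precision_accumulation()` returns to the "variables not set" configuration, which makes it easier to compare the two cases on the same machine.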
@chengzeyi @hjchen2 Let's take a look
Is this problem easy to fix?
This problem is being worked on.
The latest version fixes this issue. Please install the latest oneflow:
python3 -m pip install -U --pre oneflow -f https://oneflow-pro.oss-cn-beijing.aliyuncs.com/branch/community/cu121
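After upgrading, it can help to confirm which builds are actually installed before re-running. The following is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is made up for illustration, and the package names are the ones that appear in this thread.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Package names as they appear in this thread.
for name in ("oneflow", "onediff"):
    print(name, installed_version(name))
```

Pasting this output into a follow-up comment makes it clear whether the nightly build was picked up.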
I installed the latest version you mentioned, but a new error appeared.
Stack trace (most recent call last) in thread 1098:
W20240408 19:33:49.514078 994 cudnn_conv_util.cpp:105] Currently available alogrithm (algo=0, require memory=0, idx=1) meeting requirments (max_workspace_size=2147483648, determinism=0) is not fastest. Fastest algorithm (1) requires memory 2149842960
W20240408 19:33:49.514624 994 cudnn_conv_util.cpp:105] Currently available alogrithm (algo=0, require memory=0, idx=1) meeting requirments (max_workspace_size=2147483648, determinism=0) is not fastest. Fastest algorithm (1) requires memory 2148663312
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f418d635f1f, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f4185850be7, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f418585045c, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f418584bcd8, in vm::ThreadCtx::TryReceiveAndRun()
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f41857ee474, in vm::EpStreamPolicyBase::Run(vm::Instruction*) const
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f41857f1777, in vm::Instruction::Compute()
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f4185878acf, in vm::FuseInstructionPolicy::Compute(vm::Instruction*)
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f41857f1777, in vm::Instruction::Compute()
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f41857f8b58, in vm::OpCallInstructionPolicy::Compute(vm::Instruction*)
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f41857f8829, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f41857f397a, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f417cfe3d3c, in
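A note on the two `W`-prefixed lines in the trace above: they are glog warnings rather than fatal checks, and they say the fastest cuDNN convolution algorithm was skipped because it needs slightly more workspace than `max_workspace_size`. The arithmetic can be sketched as follows, using only the numbers printed in the log; the function name is hypothetical.

```python
# max_workspace_size from the warning; this is exactly 2 GiB.
MAX_WORKSPACE = 2147483648

# "requires memory" values from the two warnings.
REQUIRED = [2149842960, 2148663312]


def overshoot_bytes(required: int, limit: int = MAX_WORKSPACE) -> int:
    """Bytes by which a workspace request exceeds the limit (0 if it fits)."""
    return max(0, required - limit)


for need in REQUIRED:
    extra = overshoot_bytes(need)
    print(f"{need} bytes exceeds the limit by {extra} bytes ({extra / 2**20:.2f} MiB)")
```

In both cases the request is only a couple of MiB over the 2 GiB limit, so a slower algorithm that fits was selected instead.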
My environment is as follows:
onediff 0.13.0.dev202404080125
onediffx 0.13.0.dev0 /var/onediff/onediff_diffusers_extensions
oneflow 0.9.1.dev20240406+cu121
onefx 0.0.3
torch 2.2.2
GPU: A10 24G
There is also the following error:
F20240408 20:02:20.552623 1965 fused_matmul_bias_kernel.cu:84] Check failed: cublasLtMatmul( cuda_stream->cublas_lt_handle(), matmul_cache->operation_desc, &sp_alpha, weight->dptr(), matmul_cache->cublas_a_desc, x->dptr(), matmul_cache->cublas_b_desc, &sp_beta, (_add_to_output == nullptr) ? y_ptr : _add_to_output->dptr(), matmul_cache->cublas_c_desc, y_ptr, matmul_cache->cublas_c_desc, &matmul_cache->cublas_algo, cuda_stream->cublas_workspace(), cuda_stream->cublas_workspace_size(), cuda_stream->cuda_stream()) : CUBLAS_STATUS_NOT_SUPPORTED (15)
*** Check failure stack trace: ***
@ 0x7f225502096a google::LogMessage::Fail()
@ 0x7f22550238a1 google::LogMessage::SendToLog()
@ 0x7f2255020499 google::LogMessage::Flush()
@ 0x7f2255024189 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f224e726d5b oneflow::(anonymous namespace)::FusedMatmulBiasKernel::Compute()
@ 0x7f22507cbdf5 oneflow::one::StatefulOpKernel::Compute()
@ 0x7f224d1f9dda _ZZN7oneflow2vm21OpCallInstructionUtil7ComputeEPNS0_23OpCallInstructionPolicyEPNS0_6StreamEbbENKUlvE_clEv
@ 0x7f224d1fb708 oneflow::vm::OpCallInstructionUtil::Compute()
@ 0x7f224d1f82fd _ZZN7oneflow2vm23OpCallInstructionPolicy7ComputeEPNS0_11InstructionEENKUlPKcE_clES5_.constprop.0
@ 0x7f224d1f8b59 oneflow::vm::OpCallInstructionPolicy::Compute()
@ 0x7f224d1f1778 oneflow::vm::Instruction::Compute()
@ 0x7f224d1ee475 oneflow::vm::EpStreamPolicyBase::Run()
@ 0x7f224d24bcd9 oneflow::vm::ThreadCtx::TryReceiveAndRun()
@ 0x7f224d25045d oneflow::(anonymous namespace)::WorkerLoop()
@ 0x7f224d250be8 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamTypeEmEUlS6_E3_EEEEE6_M_runEv
@ 0x7f2255035f20 execute_native_thread_routine
@ 0x7f2397ab6609 start_thread
@ 0x7f2397881353 clone
Stack trace (most recent call last) in thread 1965:
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f2255035f1f, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f224d250be7, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f224d25045c, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f224d24bcd8, in vm::ThreadCtx::TryReceiveAndRun()
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f224d1ee474, in vm::EpStreamPolicyBase::Run(vm::Instruction*) const
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f224d1f1777, in vm::Instruction::Compute()
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f224d1f8b58, in vm::OpCallInstructionPolicy::Compute(vm::Instruction*)
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f224d1f82fc, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f224d1fb707, in vm::OpCallInstructionUtil::Compute(vm::OpCallInstructionPolicy*, vm::Stream*, bool, bool)
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f224d1f9dd9, in vm::OpCallInstructionUtil::Compute(vm::OpCallInstructionPolicy*, vm::Stream*, bool, bool)::{lambda()#1}::operator()() const
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f22507cbdf4, in StatefulOpKernel::Compute(eager::CallContext*, ep::Stream*, user_op::OpKernel const*, user_op::OpKernelState*, user_op::OpKernelCache const*) const
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f224e726d5a, in (anonymous namespace)::FusedMatmulBiasKernel::Compute(user_op::KernelComputeContext*, user_op::OpKernelState*, user_op::OpKernelCache const*) const
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f2255024188, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f2255020498, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f22550238a0, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f2255020969, in
Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-bfe31c8d.so", at 0x7f22452393f0, in
Aborted (Signal sent by tkill() 1795 0)
Aborted (core dumped)
Could this be caused by running out of GPU memory? You can monitor GPU memory usage while the job runs. Also, which model are you running? Is it SVD?
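One way to monitor GPU memory from Python is to poll `nvidia-smi`. This is only a sketch: it assumes `nvidia-smi` is on the PATH, and the function names are made up for illustration. With `nounits`, the reported values are in MiB.

```python
import subprocess

# Standard nvidia-smi query flags; output is one "used, total" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(output: str):
    """Parse nvidia-smi CSV output into a list of (used_mib, total_mib) tuples."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Run nvidia-smi once and return per-GPU (used_mib, total_mib)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory(result.stdout)
```

Calling `query_gpu_memory()` in a loop (for example once per second in a background thread) while the pipeline runs shows whether usage approaches the 24 GB limit right before the crash.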
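Are both problems caused by insufficient GPU memory?
The model is either SVD or SDXL. I have noticed that GPU memory usage is much higher when onediff acceleration is enabled. Is there a solution for this? My A10 only has 24 GB of memory.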
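First, try turning off compiled acceleration for the VAE. Also, what resolution are you using, and how much GPU memory does the run normally take without onediff acceleration?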
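With VAE compilation turned off, and without setting ONEFLOW_CONV_ALLOW_HALF_PRECISION_ACCUMULATION and ONEFLOW_MATMUL_ALLOW_HALF_PRECISION_ACCUMULATION, inference works.
The resolution is 1024x1024.
However, the same machine used to run inference normally with VAE compilation enabled, and it also crashes once ONEFLOW_CONV_ALLOW_HALF_PRECISION_ACCUMULATION and ONEFLOW_MATMUL_ALLOW_HALF_PRECISION_ACCUMULATION are set.
Do you have an official docker image I can use?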
We do not provide a docker image at the moment.
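It looks like the problem is caused by VAE compilation. You can turn off VAE compilation for now.
The VAE is where the extra GPU memory overhead is most noticeable, so enabling it is not a good fit if your GPU memory is limited. In 1.2, we will look for a way to solve the VAE memory problem.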
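For readers wondering what "turn off VAE compilation" could look like in code, here is a minimal sketch. It is written against a generic `compile_fn` so it runs without onediff installed. The assumption is that in a real setup `compile_fn` would be onediff's `oneflow_compile` (imported from `onediff.infer_compiler`) and `pipe` would be a diffusers pipeline; `compile_unet_only` itself is a hypothetical helper, not part of onediff.

```python
from types import SimpleNamespace


def compile_unet_only(pipe, compile_fn):
    """Compile only pipe.unet and leave pipe.vae as the original module.

    Skipping the VAE avoids the extra GPU memory that VAE compilation adds,
    at the cost of not accelerating the decode step.
    """
    pipe.unet = compile_fn(pipe.unet)
    return pipe


# Stand-in objects so the sketch runs without onediff or diffusers installed.
# Hypothetical real usage:
#     from onediff.infer_compiler import oneflow_compile
#     pipe = compile_unet_only(pipe, oneflow_compile)
demo_pipe = SimpleNamespace(unet="unet-module", vae="vae-module")
compile_unet_only(demo_pipe, lambda module: ("compiled", module))
print(demo_pipe.unet, demo_pipe.vae)
```

The point of the sketch is simply that no compile call touches `pipe.vae` (or its decoder), so the VAE keeps running in the original framework.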