oneflow
With the cu112 package installed, importing torch first, then oneflow, and then using CUDA raises an error:
import torch, oneflow
x = oneflow.tensor(1).cuda()
Error message:
F20220819 12:38:12.392885 860212 cuda_stream.cpp:103] Check failed: cublasSetMathMode(cublas_handle_, CUBLAS_TF32_TENSOR_OP_MATH) : CUBLAS_STATUS_INVALID_VALUE (7)
*** Check failure stack trace: ***
@ 0x7f27f35d2dfa google::LogMessage::Fail()
@ 0x7f27f35d30e2 google::LogMessage::SendToLog()
@ 0x7f27f35d2967 google::LogMessage::Flush()
@ 0x7f27f35d54d9 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f27e91eeb25 oneflow::ep::CudaStream::CudaStream()
@ 0x7f27e91e9a53 oneflow::ep::CudaDevice::CreateStream()
@ 0x7f27eb2209e6 oneflow::vm::EpStreamPolicyBase::stream()
@ 0x7f27ec9f8cae oneflow::vm::OpCallInstructionPolicy::Compute()
@ 0x7f27ec9f695f oneflow::vm::EpStreamPolicyBase::Run()
@ 0x7f27eca01f8f oneflow::vm::ThreadCtx::TryReceiveAndRun()
@ 0x7f27eca02d20 oneflow::(anonymous namespace)::WorkerLoop()
@ 0x7f27eca02f1d _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamTypeEEUlS6_E2_EEEEE6_M_runEv
@ 0x7f298da0bde4 (unknown)
@ 0x7f2994cda609 start_thread
@ 0x7f2994e14133 clone
Aborted (core dumped)
The cu112 package fails; the cu102 package does not. The failure reproduces on both stable and nightly.
Importing oneflow first and torch second does not fail.
Environment: python3.8.10, CUDA driver 515.65.01, oneflow-16
How did you come across this problem?
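Most likely the cublas.so that PyTorch links against is not the same version as the one OneFlow uses, and OneFlow's version is newer.
We could consider adding a check here that enforces the runtime cublas and cudnn versions are not lower than the compile-time versions.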
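One way to test the hypothesis above is to look at which libcublas file the process has actually mapped. The following is a minimal, Linux-only sketch that reads /proc/self/maps; the helper names `loaded_libraries` and `loaded_in_this_process` are made up for illustration and are not part of OneFlow or PyTorch.

```python
import os


def loaded_libraries(maps_text, needle):
    """Return sorted, de-duplicated file paths from a /proc/<pid>/maps dump
    whose file name contains `needle`."""
    paths = set()
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        if needle in os.path.basename(path):
            paths.add(path)
    return sorted(paths)


def loaded_in_this_process(needle="libcublas"):
    """Linux only: list matching shared libraries mapped into this process."""
    with open("/proc/self/maps") as f:
        return loaded_libraries(f.read(), needle)
```

Calling `loaded_in_this_process()` after `import torch` and again after `import oneflow` should show whether the cuBLAS in use comes from the torch installation or from elsewhere.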
> How did you come across this problem?
I was running a test and wanted a comparison, so I imported torch and oneflow together and ran into this problem.
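> We could consider adding a check here that enforces the runtime cublas and cudnn versions are not lower than the compile-time versions.
OK, I'll look into how to add it tomorrow. Would it be cublasGetVersion() >= CUBLAS_VERSION?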
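The proposed comparison can be sketched from Python with ctypes. This is an illustration of the idea only, not the check OneFlow ended up with. It uses `cublasGetProperty`, which needs no handle, instead of `cublasGetVersion`. The soname `libcublas.so.11`, the helper names, and the `built_against` tuple are all assumptions.

```python
import ctypes

# Values of the libraryPropertyType enum from CUDA's library_types.h.
MAJOR_VERSION, MINOR_VERSION, PATCH_LEVEL = 0, 1, 2


def query_cublas_version(lib_name="libcublas.so.11"):
    """Return (major, minor, patch) of the cuBLAS the loader resolves for
    `lib_name`, or None if it cannot be loaded or queried.

    `lib_name` is an assumption; adjust it to the soname on your system.
    """
    try:
        lib = ctypes.CDLL(lib_name)
        get_property = lib.cublasGetProperty
    except (OSError, AttributeError):
        return None
    parts = []
    for prop in (MAJOR_VERSION, MINOR_VERSION, PATCH_LEVEL):
        value = ctypes.c_int(0)
        # A return value of 0 is CUBLAS_STATUS_SUCCESS.
        if get_property(prop, ctypes.byref(value)) != 0:
            return None
        parts.append(value.value)
    return tuple(parts)


def runtime_is_new_enough(runtime, built_against):
    """True if the runtime version is not lower than the build-time version."""
    return tuple(runtime) >= tuple(built_against)
```

For example, `runtime_is_new_enough(query_cublas_version(), (11, 4, 1))` would report whether the resolved cuBLAS is at least a hypothetical build-time version 11.4.1; guard against `query_cublas_version()` returning None first.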
checked by https://github.com/Oneflow-Inc/oneflow/pull/9257