onediff
Check failed: invalid configuration argument when running StableVideoDiffusionPipeline at a large resolution
Describe the bug
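When running the StableVideoDiffusionPipeline example, the warmup resolution is set to 1024*576, while the input image used in the actual test is 2400*1080. During inference, Check failed: invalid configuration argument is reported repeatedly and the program hangs without continuing. Small-resolution images do not trigger this problem.
Does onediff not support large-resolution images?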
Your environment
OS
CentOS
OneDiff git commit id
OneFlow version info if you have installed oneflow
Run python -m oneflow --doctor and paste it here.
path: ['/home/local/miniforge3/envs/svd/lib/python3.10/site-packages/oneflow']
version: 0.9.1.dev20240515+cu122
git_commit: ec7b682
cmake_build_type: Release
rdma: True
mlir: True
enterprise: False
How To Reproduce
Steps to reproduce the behavior(code or script):
The complete error message
Additional context
Which example are you running?
As far as I recall, invalid configuration argument is not an error raised inside onediff. Could you provide a more complete error stack?
F20240719 16:29:14.029467 120179 cutlass_conv_tuner_impl.cpp:123] Check failed: cudaEventSynchronize(end) : an illegal memory access was encountered (700)
*** Check failure stack trace: ***
@ 0x7fa1532751ca google::LogMessage::Fail()
@ 0x7fa153278101 google::LogMessage::SendToLog()
@ 0x7fa153274cf9 google::LogMessage::Flush()
@ 0x7fa1532789e9 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa14bbc0a3e oneflow::CutlassConvTunerImpl<>::Find()
@ 0x7fa14baf4902 oneflow::CutlassConv2dEngine::Init()
@ 0x7fa14baeaf41 oneflow::Conv2dEngineMgr::GetConv2dEngine()
@ 0x7fa14a5e61ab ZZNK7oneflow12_GLOBAL__N_122Conv2dTuningWarmupPass5ApplyEPNS_3JobEPNS_10JobPassCtxEENKUlPKNS_6OpNodeEE1_clES8
@ 0x7fa14a5e7cce oneflow::(anonymous namespace)::Conv2dTuningWarmupPass::Apply()
@ 0x7fa14a415e74 _ZZN7oneflow23LazyJobBuildAndInferCtx8CompleteEvENKUlRKSsiE2_clES2_i
@ 0x7fa14a41b59d oneflow::LazyJobBuildAndInferCtx::Complete()
@ 0x7fa247125166 oneflow::CurJobBuildAndInferCtx_Complete()
@ 0x7fa247125fbb (unknown)
@ 0x7fa246e7ab48 (unknown)
@ 0x561241833fa4 cfunction_call
@ 0x5612417f65d4 _PyObject_MakeTpCall.localalias.3
@ 0x561241899d75 _PyEval_EvalFrameDefault
@ 0x561241843742 _PyEval_Vector
@ 0x561241843c9b method_vectorcall
@ 0x5612417fc03b _PyObject_Call.localalias.1
@ 0x561241897774 _PyEval_EvalFrameDefault
@ 0x561241843742 _PyEval_Vector
@ 0x561241843c9b method_vectorcall
@ 0x5612417fc03b _PyObject_Call.localalias.1
@ 0x561241897774 _PyEval_EvalFrameDefault
@ 0x561241843742 _PyEval_Vector
@ 0x561241843c9b method_vectorcall
@ 0x5612417fc03b _PyObject_Call.localalias.1
@ 0x561241897774 PyEval_EvalFrameDefault
@ 0x561241843742 PyEval_Vector
@ 0x561241843c9b method_vectorcall
@ 0x5612417fc03b PyObject_Call.localalias.1
Stack trace (most recent call last):
Object "/home/work/miniforge3/envs/svd/lib/python3.10/site-packages/oneflow/oneflow_internal.cpython-310-x86_64-linux-gnu.so", at 0x7fa246e7ab47, in
Object "/home/work/miniforge3/envs/svd/lib/python3.10/site-packages/oneflow/oneflow_internal.cpython-310-x86_64-linux-gnu.so", at 0x7fa247125fba, in
Object "/home/work/miniforge3/envs/svd/lib/python3.10/site-packages/oneflow/oneflow_internal.cpython-310-x86_64-linux-gnu.so", at 0x7fa247125165, in CurJobBuildAndInferCtx_Complete()
Object "/home/work/miniforge3/envs/svd/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-10b6a2f2.so", at 0x7fa14a41b59c, in LazyJobBuildAndInferCtx::Complete()
Object "/home/work/miniforge3/envs/svd/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-10b6a2f2.so", at 0x7fa14a415e73, in
Object "/home/work/miniforge3/envs/svd/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-10b6a2f2.so", at 0x7fa14a5e7ccd, in
Object "/home/work/miniforge3/envs/svd/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-10b6a2f2.so", at 0x7fa14a5e61aa, in
Object "/home/work/miniforge3/envs/svd/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-10b6a2f2.so", at 0x7fa14baeaf40, in Conv2dEngineMgr::GetConv2dEngine(ep::CudaStream*, Conv2dConfig const&, Conv2dArguement const&, std::string const&)
Object "/home/work/miniforge3/envs/svd/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-10b6a2f2.so", at 0x7fa14baf4901, in CutlassConv2dEngine::Init(ep::CudaStream*, Conv2dConfig const&, Conv2dArguement const&)
Object "/home/work/miniforge3/envs/svd/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-10b6a2f2.so", at 0x7fa14bbc0a3d, in CutlassConvTunerImpl<cutlass::library::Conv2dConfiguration, cutlass::library::ConvArguments>::Find(ep::CudaStream*, cutlass::library::ConvFunctionalKey, cutlass::library::Conv2dConfiguration const&, cutlass::library::ConvArguments const&, void*, unsigned long)
Object "/home/work/miniforge3/envs/svd/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-10b6a2f2.so", at 0x7fa1532789e8, in
Object "/home/work/miniforge3/envs/svd/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-10b6a2f2.so", at 0x7fa153274cf8, in
Object "/home/work/miniforge3/envs/svd/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-10b6a2f2.so", at 0x7fa153278100, in
Object "/home/work/miniforge3/envs/svd/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-10b6a2f2.so", at 0x7fa1532751c9, in
Object "/home/work/miniforge3/envs/svd/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-10b6a2f2.so", at 0x7fa1433e82fa, in
Aborted (Signal sent by tkill() 120179 10000)
The error above is reported after warmup succeeds.
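Please check GPU memory usage when the error occurs, to see whether GPU memory is full and this is actually an OOM.
Please also share the device model so we can try to reproduce it.
You could try changing the warmup order: warm up with the largest resolution first, then run the smaller resolutions afterwards. [Recommended]
Another option is to set the environment variable ONEFLOW_CONV2D_KERNEL_ENABLE_TUNING_WARMUP to 0. [Not really recommended: it may make inference somewhat more expensive when the resolution changes.]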
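The recommended warmup order (largest resolution first) can be sketched as below. This is only an illustration: `warmup_largest_first` and the `run_once` callback are hypothetical names, and `run_once` stands in for whatever call actually runs the compiled pipeline at a given width and height.

```python
from typing import Callable, Iterable, List, Tuple

Resolution = Tuple[int, int]  # (width, height)


def warmup_largest_first(
    resolutions: Iterable[Resolution],
    run_once: Callable[[int, int], None],
) -> List[Resolution]:
    """Run one warmup pass per resolution, largest pixel count first.

    `run_once` is a hypothetical callback that should invoke the compiled
    pipeline once at the given width and height.
    """
    # Drop duplicates while keeping the first occurrence of each resolution.
    unique = list(dict.fromkeys(resolutions))
    # Sort by pixel count, descending, so the biggest shape is warmed up first.
    ordered = sorted(unique, key=lambda wh: wh[0] * wh[1], reverse=True)
    for width, height in ordered:
        run_once(width, height)
    return ordered


if __name__ == "__main__":
    # The two resolutions mentioned in this issue.
    warmup_largest_first(
        [(1024, 576), (2400, 1080)],
        lambda w, h: print(f"warming up at {w}x{h}"),
    )
```

In a real script, the callback would run the pipeline on an image resized to the given resolution.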
It was run on an A800. Does onediff increase memory usage? The 2400*1080 resolution runs without onediff, but fails once onediff is used. @strint
We will verify this resolution and see.
@marigoold please arrange this.
@marigoold could you post a summary?
Can this issue be resolved?
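Hello, we have identified the cause of this behavior and are working on a fix. In the meantime, you can use export ONEFLOW_CONV2D_KERNEL_ENABLE_TUNING_WARMUP=0 as a temporary workaround and see whether the problem persists.
Also, if compiling the vae reports an error as well, you can pass ignores=["vae"] to compile_pipe.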
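A minimal sketch of the two workarounds above, done from Python rather than the shell. Two assumptions are made here: that the environment variable has to be set before oneflow is imported for it to take effect, and that `compile_pipe` is passed in by the caller (the helper `compile_without_vae` is a hypothetical name, not part of onediff).

```python
import os

# Assumption: set this before oneflow/onediff is imported, so the setting is
# visible when the conv2d tuning warmup pass would otherwise run.
os.environ["ONEFLOW_CONV2D_KERNEL_ENABLE_TUNING_WARMUP"] = "0"


def compile_without_vae(pipe, compile_pipe):
    """Compile `pipe` while skipping the vae.

    `compile_pipe` is passed in rather than imported, so this sketch does not
    hard-code an import path; `ignores=["vae"]` is the argument suggested in
    this thread.
    """
    return compile_pipe(pipe, ignores=["vae"])
```

In a real script, `compile_pipe` would be the function provided by your onediff installation and `pipe` the loaded StableVideoDiffusionPipeline.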
@marigoold Hello, I tried this method and it still does not work.
Is it still the same error?
@marigoold Yes, it is the same error.
@strint @marigoold Hello, has this issue been resolved?