converted models producing noise
Some recently-converted models are producing random noise.
This is:
- happening with pytorch 2.x, including 2.0 and 2.1
- even with `low_cpu_mem_usage` monkeypatches on most/all `UNet2DConditionModel` ctors
- even with
- only happening when `ONNX_WEB_CONVERT_EXTRACT=FALSE`, which is necessary for converting some newer models
- happening with both fp16 and fp32
- happening with SD v1.5 models
- does not appear to happen with SDXL
Running a diff between a good and a bad copy of the same model shows that most or all of the weights differ:
```
INFO:__main__:raw data differs for onnx::Mul_9546: -0.12585449
INFO:__main__:raw data differs for onnx::Add_9547: 0.10827637
INFO:__main__:raw data differs for onnx::MatMul_9548: 0.25146484
INFO:__main__:raw data differs for onnx::MatMul_9549: 0.34448242
INFO:__main__:raw data differs for onnx::MatMul_9550: 0.16455078
INFO:__main__:raw data differs for onnx::MatMul_9557: 0.19995117
INFO:__main__:raw data differs for onnx::MatMul_9558: 0.15942383
INFO:__main__:raw data differs for onnx::MatMul_9559: 0.21325684
INFO:__main__:raw data differs for onnx::MatMul_9560: 0.13708496
INFO:__main__:raw data differs for onnx::MatMul_9567: 0.23217773
INFO:__main__:raw data differs for onnx::MatMul_9568: 0.19250488
INFO:__main__:raw data differs for onnx::MatMul_9569: 0.18237305
INFO:__main__:raw data differs for onnx::Mul_9570: -0.03488159
INFO:__main__:raw data differs for onnx::Add_9571: 2.65625
WARNING:__main__:models have 686 differences
```
This is true for at least the UNet and VAEs.
When this occurs on Windows, it appears to cause a crash rather than noise:
```
[2023-12-23 20:16:06,568] ERROR: onnx-web worker: directml MainThread onnx_web.chain.pipeline: error while running stage pipeline, 1 retries left
Traceback (most recent call last):
  File "onnx_web\chain\pipeline.py", line 227, in __call__
  File "onnx_web\chain\source_txt2img.py", line 144, in run
  File "diffusers\pipelines\stable_diffusion\pipeline_onnx_stable_diffusion.py", line 433, in __call__
  File "diffusers\pipelines\stable_diffusion\pipeline_onnx_stable_diffusion.py", line 433, in <listcomp>
  File "onnx_web\diffusers\patches\vae.py", line 79, in __call__
  File "diffusers\pipelines\onnx_utils.py", line 60, in __call__
  File "onnxruntime\capi\onnxruntime_inference_collection.py", line 220, in run
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Add node. Name:'/decoder/mid_block/attentions.0/Add_1' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2759)\onnxruntime_pybind11_state.pyd!00007FF863B5DDF2: (caller: 00007FF863B5DB05) Exception(4) tid(5d98) 80070057 The parameter is incorrect.
```
I believe this is the same issue, and the different error message is due to DirectML.
I've written a new SD converter that uses the same `optimum.main_export` call that the SDXL converter uses, which seems to work on most models. Currently testing on the models included in the pre-converted set:
- Cetus
  - fails on both v4 and Whalefall
- Dreamshaper
  - works on v8
- Elegant Entropy
  - works on v1.4
- Faetastic
  - fails on v2
- Juggernaut
  - not set up yet, not tested
- ReV Animated
  - works on v1.2.2-EOL
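For reference, a minimal sketch of what an `optimum`-based converter entry point might look like; this is not the project's actual code, and the task name and argument layout here are assumptions (`main_export` is imported lazily because `optimum` pulls in torch/transformers):

```python
# Sketch of an SD-to-ONNX converter built on Optimum's main_export,
# mirroring the SDXL path described above. The task name and output
# layout here are assumptions, not the project's actual converter code.
def export_sd_to_onnx(source: str, dest: str) -> None:
    """Export a Stable Diffusion checkpoint to an ONNX pipeline directory."""
    # lazy import: optimum brings in heavy dependencies (torch, transformers)
    from optimum.exporters.onnx import main_export

    main_export(
        source,                # HF repo ID or local diffusers folder
        output=dest,           # destination directory for the ONNX pipeline
        task="text-to-image",  # assumed task name for SD checkpoints
    )
```

A call like `export_sd_to_onnx("runwayml/stable-diffusion-v1-5", "./models/sd15-onnx")` would then export the full pipeline (UNet, VAE, text encoder) in one pass.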
Because some models fail to convert with both methods, the issue may be with the model itself or somewhere upstream. All of the failing models (Cetus and Faetastic) convert correctly when using `pipeline: txt2img-legacy` and `ONNX_WEB_CONVERT_EXTRACT=TRUE` (which is the default again).
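The per-model workaround maps to an entry in the extras file; a hypothetical sketch, assuming onnx-web's `extras.json` schema (the model name and source here are placeholders, not real entries):

```json
{
  "diffusion": [
    {
      "name": "diffusion-cetus-v4",
      "source": "...",
      "pipeline": "txt2img-legacy"
    }
  ]
}
```

with `ONNX_WEB_CONVERT_EXTRACT=TRUE` set in the environment when the converter runs.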