converted models producing noise
Some recently-converted models are producing random noise.
This is:
- happening with pytorch 2.x, including 2.0 and 2.1
- even with `low_cpu_mem_usage` monkeypatches on most/all `UNet2DConditionModel` ctors
- even with
- only happening when `ONNX_WEB_CONVERT_EXTRACT=FALSE`, which is necessary for converting some newer models
- happening with both fp16 and fp32
- happening with SD v1.5 models
- does not appear to happen with SDXL
Running a diff between a good and a bad copy of the same model shows that most or all of the weights differ:
```
INFO:__main__:raw data differs for onnx::Mul_9546: -0.12585449
INFO:__main__:raw data differs for onnx::Add_9547: 0.10827637
INFO:__main__:raw data differs for onnx::MatMul_9548: 0.25146484
INFO:__main__:raw data differs for onnx::MatMul_9549: 0.34448242
INFO:__main__:raw data differs for onnx::MatMul_9550: 0.16455078
INFO:__main__:raw data differs for onnx::MatMul_9557: 0.19995117
INFO:__main__:raw data differs for onnx::MatMul_9558: 0.15942383
INFO:__main__:raw data differs for onnx::MatMul_9559: 0.21325684
INFO:__main__:raw data differs for onnx::MatMul_9560: 0.13708496
INFO:__main__:raw data differs for onnx::MatMul_9567: 0.23217773
INFO:__main__:raw data differs for onnx::MatMul_9568: 0.19250488
INFO:__main__:raw data differs for onnx::MatMul_9569: 0.18237305
INFO:__main__:raw data differs for onnx::Mul_9570: -0.03488159
INFO:__main__:raw data differs for onnx::Add_9571: 2.65625
WARNING:__main__:models have 686 differences
```
This is true for at least the UNet and VAEs.
When this occurs on Windows, it appears to cause a crash rather than noise:
```
[2023-12-23 20:16:06,568] ERROR: onnx-web worker: directml MainThread onnx_web.chain.pipeline: error while running stage pipeline, 1 retries left
Traceback (most recent call last):
  File "onnx_web\chain\pipeline.py", line 227, in __call__
  File "onnx_web\chain\source_txt2img.py", line 144, in run
  File "diffusers\pipelines\stable_diffusion\pipeline_onnx_stable_diffusion.py", line 433, in __call__
  File "diffusers\pipelines\stable_diffusion\pipeline_onnx_stable_diffusion.py", line 433, in <listcomp>
  File "onnx_web\diffusers\patches\vae.py", line 79, in __call__
  File "diffusers\pipelines\onnx_utils.py", line 60, in __call__
  File "onnxruntime\capi\onnxruntime_inference_collection.py", line 220, in run
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Add node. Name:'/decoder/mid_block/attentions.0/Add_1' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2759)\onnxruntime_pybind11_state.pyd!00007FF863B5DDF2: (caller: 00007FF863B5DB05) Exception(4) tid(5d98) 80070057 The parameter is incorrect.
```
I believe this is the same issue, and the different error message is due to DirectML.
I've written a new SD converter that uses the same `optimum.main_export` call that the SDXL converter uses, which seems to work on most models. Currently testing on the models included in the pre-converted set:
- Cetus
  - fails on both v4 and Whalefall
- Dreamshaper
  - works on v8
- Elegant Entropy
  - works on v1.4
- Faetastic
  - fails on v2
- Juggernaut
  - not set up yet, not tested
- ReV Animated
  - works on v1.2.2-EOL
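For reference, a minimal sketch of what an `optimum`-based converter entry point might look like; this is not the project's actual code, and the task name and argument layout here are assumptions (`main_export` is imported lazily because `optimum` pulls in torch/transformers):

```python
# Sketch of an SD-to-ONNX converter built on Optimum's main_export,
# mirroring the SDXL path described above. The task name and output
# layout here are assumptions, not the project's actual converter code.
def export_sd_to_onnx(source: str, dest: str) -> None:
    """Export a Stable Diffusion checkpoint to an ONNX pipeline directory."""
    # lazy import: optimum brings in heavy dependencies (torch, transformers)
    from optimum.exporters.onnx import main_export

    main_export(
        source,                # HF repo ID or local diffusers folder
        output=dest,           # destination directory for the ONNX pipeline
        task="text-to-image",  # assumed task name for SD checkpoints
    )
```

A call like `export_sd_to_onnx("runwayml/stable-diffusion-v1-5", "./models/sd15-onnx")` would then export the full pipeline (UNet, VAE, text encoder) in one pass.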
Because some models fail to convert with both methods, the issue may be with the model itself or somewhere upstream. All of the failing models (Cetus and Faetastic) convert correctly when using `pipeline: txt2img-legacy` and `ONNX_WEB_CONVERT_EXTRACT=TRUE` (which is the default again).
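The per-model workaround maps to an entry in the extras file; a hypothetical sketch, assuming onnx-web's `extras.json` schema (the model name and source here are placeholders, not real entries):

```json
{
  "diffusion": [
    {
      "name": "diffusion-cetus-v4",
      "source": "...",
      "pipeline": "txt2img-legacy"
    }
  ]
}
```

with `ONNX_WEB_CONVERT_EXTRACT=TRUE` set in the environment when the converter runs.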