wgpu-py icon indicating copy to clipboard operation
wgpu-py copied to clipboard

How not to panic

Open marcdownie opened this issue 3 years ago • 3 comments

I've asked a very similar question before — here #108 . This is rightfully closed: error messages are vastly better than they were a year ago. However, stepping back a bit, I see the wgpu still happily delivers these messages while issuing a Rust-level panic — that is, printing something and swiftly killing the whole process.

Has this become the long term plan? or is there a way to recover from these panics? is the intention that our wgpu-py renderers are separate, easily re-initializable disposable processes? Is this what triangle_subprocess.py is aiming towards? Is wgpu[-py] for cattle rather than pets?

I'm asking this question here and not upstream because, while this behavior very Rust, it's really not very Python at all — imagine writing wgpu-py from the REPL, or live-coding something built on wgpu-py from Jupyter kernel etc. Now that we have a WSL --- a shader language that we can pass along to wgpu as strings (without recourse to an external shader compiler) --- we dare not really use it dynamically, because a typo in a shader quits our process with:

Failed to parse WGSL code for Some(""): unknown type: `vec`
thread '<unnamed>' panicked at 'Parsing', src/device.rs:112:5

It speaks to what long-term applications of wgpu-py apps that are viable inside the python ecosystem. And it's also completely different set of rules from, say both OpenGL and WebGL --- which are perfectly recoverable from an error state.

marcdownie avatar Jun 14 '21 19:06 marcdownie

Good point! For the time being, things should be pretty safe from the moment that all your shaders are correct (and you don't dun out of GPU memory). For dynamic use these panics are indeed not productive at all.

Eventually wgpu will be used from the browser, which is also very dynamic - and I'm pretty sure they won't kill the tab whenever there is a shader error :) So in the long run I think there won't be any panics ever. For the time being we'll have to live with them, and during that time wgpu-py will be less suited for interactively writing shaders.

almarklein avatar Jun 15 '21 09:06 almarklein

I've asked upstream to get some more info: https://github.com/gfx-rs/wgpu-native/issues/113

almarklein avatar Jun 15 '21 09:06 almarklein

So to clearly answer this question in particular:

Has this become the long term plan?

It seems very very unlikely!

However, full disclosure, we, the maintainers of wgpu-py, have to delegate your question upstream, so please keep an eye on the thread @almarklein mentioned.

Korijn avatar Jun 15 '21 10:06 Korijn

I would love to know the progress on this. From the linked issue it looks like there has been some progress. A python try: can't catch these panics so crashes instead. But the error messages look really helpful for my use case, so if there is a way to get these into python as exception for example.

StdErr from Kernel Process thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: [Error { kind: SemanticError("Unknown function 'sdBox'"), meta: Span { start: 2051, end: 2090 } }]', src\conv.rs:461:56

Vipitis avatar Aug 15 '23 00:08 Vipitis

We plan to update wgpu-py soon(ish) so it can make use of the new wgpu-native that has implemented the push/pop error scopes. From what I know there are still few situations where wgpu can get into a state where it will panic, but these cases will get more rare over time.

almarklein avatar Aug 15 '23 05:08 almarklein

Will this be a weeks, months or years timeline? If it's unlikely to be done this year, I will have to look for a different solution for my project. By any chance is there a way to preview those features, perhaps a dev branch?

Vipitis avatar Aug 15 '23 12:08 Vipitis

Taking into account that it's the holiday season my guess would be that we start working on 5-7 weeks from now. Nothing has been done yet, so no dev branch to check.

almarklein avatar Aug 15 '23 20:08 almarklein

I tried my implementation with v0.10.0 but still got into a panic. I will try to run the debugger a bit more and see how the new changes of ErrorHandler work. Since the debugger now jumps farther and gets there. this time I remain optimistic and might even find a solution

Vipitis avatar Oct 09 '23 18:10 Vipitis

Are you able to reduce your use-case into a minimal standalone example that you could post here, and that we can then run? That might help get more insight or ideas on how to resolve it.

almarklein avatar Oct 10 '23 14:10 almarklein

I have taken a look at the examples and simply took the default Shadertoy shader and made an intentional typo in line 10, writing coll instead of col

    // Output to screen
--- fragColor = vec4(col,1.0);
+++ fragColor = vec4(coll,1.0);

This is I guess the minimal script to run this in wgpu-py. (It works without the typo as execpected)

from wgpu.utils.shadertoy import Shadertoy

shader_code = """
void mainImage( out vec4 fragColor, in vec2 fragCoord )
{
    // Normalized pixel coordinates (from 0 to 1)
    vec2 uv = fragCoord/iResolution.xy;

    // Time varying pixel color
    vec3 col = 0.5 + 0.5*cos(iTime+uv.xyx+vec3(0,2,4));

    // Output to screen
    fragColor = vec4(coll,1.0);
}
"""
shader = Shadertoy(shader_code, resolution=(800, 450))

if __name__ == "__main__":
    shader.show()

instead of raising a ValidationError and getting a message telling me "unknown identifier" (as a similar typo does in the provided examples), you get this panic and the program exists

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: [Error { kind: UnknownVariable("coll"), meta: Span { start: 636, end: 640 } }]', src\conv.rs:557:58
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace`

it might be specific to GLSL support using code directly from shadertoy.com I have chased the debugger all the way to this specific line (debugger crashes when stepping into here): https://github.com/pygfx/wgpu-py/blob/f2b77095a0ee65dfc89aae7259de17b4f4b8d055/wgpu/backends/rs_helpers.py#L407 a helpful first breakpoint is here: https://github.com/pygfx/wgpu-py/blob/f2b77095a0ee65dfc89aae7259de17b4f4b8d055/wgpu/utils/shadertoy.py#L320


I really hope this helps, I just looked at this on my home PC with an Intel Arc GPU - but had the same issue using an external machine (with Intel GPUs as well). Will try this on my laptop shortly (nvidia gpu).

Vipitis avatar Oct 10 '23 18:10 Vipitis

To chime in here, "incorrect shaders" (including syntactical and semantic typos which could be caught by some SPIRV-ish toolchain, but also incorrect and mismatched layouts of buffers where the shader is correct in and of itself but wrong in context) is an extremely common condition in any creative coding applications. My shaders are likely "correct" in a professional sense only at some unspecified point before the audience walks into the hall; at my desk they spend more time being incorrect than not.

marcdownie avatar Oct 10 '23 18:10 marcdownie

the whole use case for me is as part of an evaluation benchmark for code generation language models. One idea I am following is to run the generated code and then evaluate if frames match, another will look at the rendered frames with a CLIP model to see if the rendered frame matches semantically to the input prompt. Understanding what 'kind' of error will be helpful as you can either grade a generation as wrong or simply incomplete, if it is for example calling a function that doesn't exist (which you could then ask the model etc) until it ends up with something working. It is currently not clear if I manage to get to that 2nd part during my bachelor thesis, but it's definitely something very interesting I want to look at in the next year. I have previously been told to look at naga for validation and I still might, but the lazy python developer I am first looked at various solutions to easily run shadertoy shaders directly with python, and this project had the most useful implementation.

Vipitis avatar Oct 10 '23 18:10 Vipitis

I played around a bit with a few shaders, and I think that errors in shaders written in WGSL result in Python exceptions, whereas errors in GLSL result in a panic. This is unfortunate, but there is not much we can do here. It's simply that they've not made this work correctly yet for GLSL. I presume that WGSL support is of higher priority to the Naga devs than GLSL, but I expect things it will be fixed eventually.

You could check the repo to see if there is an issue that already tracks this, and create one if not. (If you know a bit of Rust you might be able to help implement it.)

almarklein avatar Oct 10 '23 19:10 almarklein

To chime in here, "incorrect shaders" (including syntactical and semantic typos which could be caught by some SPIRV-ish toolchain, but also incorrect and mismatched layouts of buffers where the shader is correct in and of itself but wrong in context) is an extremely common condition in any creative coding applications. My shaders are likely "correct" in a professional sense only at some unspecified point before the audience walks into the hall; at my desk they spend more time being incorrect than not.

To be clear, we're not questioning your need to be able to handle exceptions. We want it too!

The simple fact is that wgpu-py is a Python library with Python wrappers for the Rust library wgpu-native. We don't develop wgpu-native. We develop wgpu-py here.

The reality is that if a Rust library panics, the Python wrappers can't "undo the panic" and raise a Python exception instead. There's just nothing we can do in that situation.

We depend on the Rust library developers to improve wgpu-native, to prevent it from panicking and provide a runtime mechanism to handle the errors that we can utilize in Python wrappers.

Korijn avatar Oct 10 '23 19:10 Korijn

You could check the repo to see if there is an issue that already tracks this, and create one if not. (If you know a bit of Rust you might be able to help implement it.)

I will dig around for the next few weeks to hopefully find a solution. But already found essentially the same issue tracking here: https://github.com/gfx-rs/wgpu/issues/2545

Vipitis avatar Oct 10 '23 23:10 Vipitis

for now I am using a janky solution that works to convince my supervisors.

with tempfile.NamedTemporaryFile(suffix=".frag", mode="w") as f:
    f.write(frag_shader_code)
    f.flush()
    try:
        subprocess.run(["naga", f.name], check=True, capture_output=True)
    except subprocess.CalledProcessError as e:
        raise e

it's pasted about here: https://github.com/pygfx/wgpu-py/blob/2ffacd933d60423537826919c3ac6a950188919c/wgpu/utils/shadertoy.py#L316

perhaps there is a better solution eventually, but it bypasses the panic for me

Vipitis avatar Oct 12 '23 16:10 Vipitis

We've had code like that in wgpu-py itself in its very early days 😝, before WGSL existsted and spirv was the only target. I hope it gets fixed upstream fast!

almarklein avatar Oct 12 '23 19:10 almarklein

Getting back to the original issue. The current situation is as follows:

  • wgpy-native has a mechanism to communicate errors via the C API.
  • In wgpu-py we make use of it, and raise Python errors (at the appropriate C call) in most cases, and log the error otherwise.
  • In wgpu-core, most errors now get communicated via this route.
  • There may be Rust panics in some specific cases (e.g. syntax errors in GLSL), but they will most certainly be communicated via the same mechanism eventually.

I think that's enough to close this issue.

almarklein avatar Oct 12 '23 19:10 almarklein

not to revive this issue. But I have run into at least one more shader that panics in a similar way. And this one validates with naga just fine(the frag_shader_code.frag mentioned above) - it panics with naga when translating to SPIR-V, not when translating to wgsl. It's also working on shadertoy. here is the source: https://www.shadertoy.com/view/NsffD2

I made a minimal script of it in this gist: https://gist.github.com/Vipitis/0981436d889deb4d330768fa5843b9c8

Haven't tried to minimize the shadercode to figure out what causes it. But I have seen at least one other panic which I will look into in the coming days.

the panic happens at the same place as above, the result = ob(**args).

backtrace looks as follows

thread '<unnamed>' panicked at C:\Users\runneradmin\.cargo\git\checkouts\naga-dbb2b19faed49210\bac2d82\src\back\spv\block.rs:2066:56:
internal error: entered unreachable code: Expression [53] is not cached!
stack backtrace:
   0:     0x7ff83a804cea - wgpu_render_pass_execute_bundles
   1:     0x7ff83a81bb1b - wgpu_render_pass_execute_bundles
   2:     0x7ff83a8020c1 - wgpu_render_pass_execute_bundles
   3:     0x7ff83a804a6a - wgpu_render_pass_execute_bundles
   4:     0x7ff83a806e1a - wgpu_render_pass_execute_bundles
   5:     0x7ff83a806a88 - wgpu_render_pass_execute_bundles
   6:     0x7ff83a8074ce - wgpu_render_pass_execute_bundles
   7:     0x7ff83a8073bd - wgpu_render_pass_execute_bundles
   8:     0x7ff83a8056d9 - wgpu_render_pass_execute_bundles
   9:     0x7ff83a8070c0 - wgpu_render_pass_execute_bundles
  10:     0x7ff83a847f15 - wgpu_render_pass_execute_bundles
  11:     0x7ff83a78e57d - wgpu_render_pass_execute_bundles
  12:     0x7ff83a79fb91 - wgpu_render_pass_execute_bundles
  13:     0x7ff83a7a958f - wgpu_render_pass_execute_bundles
  14:     0x7ff83a7ae3e2 - wgpu_render_pass_execute_bundles
  15:     0x7ff83a7afe04 - wgpu_render_pass_execute_bundles
  16:     0x7ff83a64ee19 - wgpu_render_pass_execute_bundles
  17:     0x7ff83a656957 - wgpu_render_pass_execute_bundles
  18:     0x7ff83a3c6fbf - wgpuDeviceDestroy
  19:     0x7ff83a326da7 - wgpuDeviceDestroy
  20:     0x7ff83a4cc7e8 - wgpuDeviceCreateRenderPipeline
  21:     0x7ff85bc010f3 - <unknown>
  22:     0x7ff85bc1ac90 - PyInit__cffi_backend
  23:     0x7ff85bc074af - <unknown>
  24:     0x7ff85926d44b - PyObject_Call
  25:     0x7ff85924c65f - PyEval_EvalFrameDefault
  26:     0x7ff859243963 - PyObject_GC_Del
  27:     0x7ff8592455a7 - PyFunction_Vectorcall
  28:     0x7ff85924823d - PyEval_EvalFrameDefault
  29:     0x7ff859243963 - PyObject_GC_Del
  30:     0x7ff85924b182 - PyEval_EvalFrameDefault
  31:     0x7ff85924812f - PyEval_EvalFrameDefault
  32:     0x7ff859243963 - PyObject_GC_Del
  33:     0x7ff8592455a7 - PyFunction_Vectorcall
  34:     0x7ff859298a6a - PyArg_ParseTuple_SizeT
  35:     0x7ff85929699d - PyType_GenericNew
  36:     0x7ff8592f30ac - PyObject_MakeTpCall
  37:     0x7ff85924e073 - PyEval_EvalFrameDefault
  38:     0x7ff859243963 - PyObject_GC_Del
  39:     0x7ff859308bdd - PyEval_EvalCodeWithName
  40:     0x7ff859308b1f - PyEval_EvalCodeEx
  41:     0x7ff859308a7d - PyEval_EvalCode
  42:     0x7ff85923587a - PyBytes_DecodeEscape
  43:     0x7ff8592357fa - PyBytes_DecodeEscape
  44:     0x7ff8592d8083 - Py_wfopen
  45:     0x7ff8592d6434 - Py_stat
  46:     0x7ff8592aa7ef - PyRun_SimpleFileExFlags
  47:     0x7ff8593a2e04 - PyRun_AnyFileExFlags
  48:     0x7ff8592d8dd2 - PyList_GetItem
  49:     0x7ff8592da32c - Py_RunMain
  50:     0x7ff8592da1b5 - Py_RunMain
  51:     0x7ff8592d9e85 - Py_Main
  52:     0x7ff795d81258 - <unknown>
  53:     0x7ff8db707344 - BaseThreadInitThunk
  54:     0x7ff8dbee26b1 - RtlUserThreadStart

the error originates from naga and not inside wgpu as was with the Unknown Variables. So I suspect it's actually an issue with naga - but I haven't yet understand enough or found a similar issue

Vipitis avatar Oct 19 '23 23:10 Vipitis

for now I am using a janky solution that works to convince my supervisors.

with tempfile.NamedTemporaryFile(suffix=".frag", mode="w") as f:
    f.write(frag_shader_code)
    f.flush()
    try:
        subprocess.run(["naga", f.name], check=True, capture_output=True)
    except subprocess.CalledProcessError as e:
        raise e

it's pasted about here:

https://github.com/pygfx/wgpu-py/blob/2ffacd933d60423537826919c3ac6a950188919c/wgpu/utils/shadertoy.py#L316

perhaps there is a better solution eventually, but it bypasses the panic for me

One more idea for a workaround is to run python in a subprocess with sys.executable, to see if some code will panic or not. You can avoid the dependency on the naga binary that way.

Korijn avatar Oct 20 '23 06:10 Korijn

If you manage to minimize the shader code to expose the place where the error originates, it might be worth reporting this at the Naga repo.

almarklein avatar Oct 20 '23 07:10 almarklein

for now I am using a janky solution that works to convince my supervisors.

with tempfile.NamedTemporaryFile(suffix=".frag", mode="w") as f:
    f.write(frag_shader_code)
    f.flush()
    try:
        subprocess.run(["naga", f.name], check=True, capture_output=True)
    except subprocess.CalledProcessError as e:
        raise e

it's pasted about here: https://github.com/pygfx/wgpu-py/blob/2ffacd933d60423537826919c3ac6a950188919c/wgpu/utils/shadertoy.py#L316 perhaps there is a better solution eventually, but it bypasses the panic for me

One more idea for a workaround is to run python in a subprocess with sys.executable, to see if some code will panic or not. You can avoid the dependency on the naga binary that way.

I bypass it now by not just using naga file.frag but naga file.frag file.spv since this translation does panic and hence throw an exception I can work with. However this doesn't make semantic sense in my implementation as I label those as 'code_error'. But that's all on me and trivial to solve.

If you manage to minimize the shader code to expose the place where the error originates, it might be worth reporting this at the Naga repo.

I will open an issue on naga as it's clearly not intended to have validation be successful and then translation panic. While looking a bit deeper I found one or two open issues that mention a similar error message.

Vipitis avatar Oct 20 '23 09:10 Vipitis

now tracking in https://github.com/gfx-rs/wgpu/issues/4549 I have also been able to construct a new minimal example, so the gist linked above as been updated to include that.

Vipitis avatar Oct 24 '23 00:10 Vipitis

Thanks for diving into this!

almarklein avatar Oct 24 '23 08:10 almarklein

The wgpu-native just received a fix (https://github.com/gfx-rs/wgpu-native/pull/313) that makes errors in GLSL also result in nice exceptions (in wgpu-py) instead of a panic. That's one part of the problem gone.

almarklein avatar Nov 03 '23 11:11 almarklein

great to hear. I will see how complicated it is to try wgpu-native from source and perhaps that makes everything a lot easier. As I have since discovered more panics which don't get caught by naga validation or naga translation.

Vipitis avatar Nov 03 '23 18:11 Vipitis