wgpu-py icon indicating copy to clipboard operation
wgpu-py copied to clipboard

Need to keep a reference to a bindgroup even if its used in a render pass

Open fyellin opened this issue 1 year ago • 12 comments

To understand textures and mipmaps better, I copied some code from https://math.hws.edu/graphicsbook/source/webgpu/making_mipmaps.html and converted it to Python and WebGPU.

Here is my translation. mona_lisa.txt. I had to rename it to be a .txt file because Github doesn't allow me to attach a .py file.

When I run the program, I get the following:

% RUST_BACKTRACE=1 python3 mona_lisa.py
2024-06-04 10:39:12.370 Python[27543:25035108] WARNING: Secure coding is not enabled for restorable state! Enable secure coding by implementing NSApplicationDelegate.applicationSupportsSecureRestorableState: and returning YES.
thread '<unnamed>' panicked at /Users/runner/.cargo/git/checkouts/wgpu-53e70f8674b08dd4/9f505e7/wgpu-core/src/storage.rs:113:39:
BindGroup[Id(8,1,mtl)] does not exist
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: wgpu_core::storage::Storage<T,I>::get
   3: wgpu_core::track::stateless::StatelessTracker<A,Id,T>::add_single
   4: wgpu_core::command::render::<impl wgpu_core::global::Global<G>>::command_encoder_run_render_pass_impl
   5: _wgpuRenderPassEncoderEnd
   6: <unknown>
   7: <unknown>
   8: _cdata_call
   9: __PyObject_Call
  10: __PyEval_EvalFrameDefault
  11: _PyEval_EvalCode
  12: _run_mod
  13: _pyrun_file
  14: __PyRun_SimpleFileObject
  15: __PyRun_AnyFileObject
  16: _pymain_run_file_obj
  17: _pymain_run_file
  18: _Py_RunMain
  19: _pym
[mona_lisa.txt](https://github.com/user-attachments/files/15568594/mona_lisa.txt)
ain_main
  20: _Py_BytesMain

If I make drawing each image a separate rendering pass, the code does what it's supposed to do. [Comments about how to make this change are in the code.) So I know that I have correctly created the textures. The bug seems to be about doing multiple draws inside a single render pass.

platform: macOS-14.4.1-arm64-arm-64bit python_implementation: CPython python: 3.12.0 wgpu: 0.15.2

Apple M1 Pro running Metal 3

fyellin avatar Jun 04 '24 18:06 fyellin

Based on the error log, it is evident that the internal instance object of BindGroup was automatically garbage collected before it was actually used.

BindGroup[Id(8,1,mtl)] does not exist

This is a bug in your code.

From my brief review of your code, the scope of your bind_group is incorrect. At the very least, you should ensure that the bind_group instance remains valid until pass_encoder.end() is called.

A simple solution would be to move the create_bind_group() operation outside of the loop and store the results in a list. You can then retrieve the bind_group directly from the list when initiating a DrawCall.

Additionally, I would like to mention that creating bind_group resources (or other GPU resources) within the rendering loop is not a good practice. You should prepare these GPU resources elsewhere in advance and cache them for reuse. When preparing to issue a GPU DrawCall in the rendering loop, you should use the cached resources as far as possible instead of creating new ones each time.

panxinmiao avatar Jun 05 '24 06:06 panxinmiao

To add to that, the fact that Rust panics can be considered a bug in wgpu-native/wgpu-core. Eventually all errors should be reported via the error mechanism, so they become Python exceptions. There are a few error-cases where Rust still panics, this being one of them.

almarklein avatar Jun 05 '24 07:06 almarklein

@panxinmiao , @almarklein. Thanks for the quick response. If I can make a few comments.

  1. My code was essentially a direct translation of the Java Script source I mentioned. They created a bind group in a loop, so I created a bind group in a loop. I tend to avoid that in my own code.
  2. Is it documented that render_pass.set_bind_group() only keeps a weak pointer to the object, and that the user is responsible for keeping the bind group alive? Are all the methods like this? If so, this really needs to be documented. It's not typical Python behavior. Weak pointers are very rare. The closest example I can give is https://matplotlib.org/stable/api/animation_api.html where the documentation is clear that the user is responsible for holding onto the Animation object.
  3. My program actually had two parts: Generate the mipmaps and then display them. Both halves of the code used a loop to create a bind group and then call pass_encoder.set_bind_group(...) on it. The first half never crashed, just the second. Was this just the timing of the garbage collector?

fyellin avatar Jun 05 '24 16:06 fyellin

Is it documented that render_pass.set_bind_group() only keeps a weak pointer to the object, and that the user is responsible for keeping the bind group alive? Are all the methods like this? If so, this really needs to be documented.

I don't think its currently documented. Most method should not be like this. I am actually a bit surprised by this behavior myself, because I'd think wgpu-core keeps the object alive because its being used by the render pass.

To give some context: the real objects live in wgpu-core (Rust code) and the objects in wgpu-py wrappers. When a Python object is cleared by Python's gc, we bump the refcount down for wgpu-core. When wgpu-core finds the ref-count to be zero, the object is really deleted. However, it does not seem to add a ref when adding to the render pass.

This is worth looking into, I'll rename the issue to better reflect the problem.

almarklein avatar Jun 10 '24 07:06 almarklein

@almarklein. Thanks. The behavior you describe is certainly the behavior Python programmers expect.

fyellin avatar Jun 10 '24 17:06 fyellin

Yeah. The surprise is that the underlying rust library is expected to handle this, but apparently does not.

Korijn avatar Jun 10 '24 18:06 Korijn

Just along a similar vein.

I changed some code from

      render_pass_descriptor = get_render_pass_descriptor()
      render_pass = encoder.begin_render_pass(**render_pass_descriptor)

to the seemingly equivalent:

     render_pass = encoder.begin_render_pass(**get_render_pass_descriptor())

and again had Rust fail on me. Just another manifestation of the same issue.

fyellin avatar Jun 13 '24 06:06 fyellin

Thanks! Let's collect cases where this issue happens here.

almarklein avatar Jun 13 '24 07:06 almarklein

If I recall correctly, I have encountered similar issues with other Python binding libraries, such as PyAV, OpenCV, or PyQt. It seems that this situation cannot be completely avoided? :thinking: Because the lifecycle management of the underlying C++ objects is delegated to the Python objects. When the Python objects are prematurely destroyed, the internal instances become unusable. Therefore, when using these binding libraries, what I can do is pay attention to the lifecycle management of the Python objects, ensuring their scope aligns as much as possible with the usage scope of the bound underlying object instance.

panxinmiao avatar Jun 13 '24 07:06 panxinmiao

Good point. I think this applies to descriptors at least, since these are just little structs that exists in memory. However, for things that are actually objects in wgpu-core, like BindGroups, I would expect wgpu-core to increase the refcount itself.

almarklein avatar Jun 13 '24 07:06 almarklein

One can attempt to tie object lifecycles together in the python wrappers, but then the risk flips in the other direction. Besides objects being prematurely released, their lifetimes can also be extended too much, over consuming memory. Ideally it's handled at the rust or c++ level.

Korijn avatar Jun 13 '24 07:06 Korijn

We actually already do something like this in begin_render_pass(): https://github.com/pygfx/wgpu-py/blob/b5e9f66e13fb83b8976e65ad7aea8fa35437bd3f/wgpu/backends/wgpu_native/_api.py#L2316

But yeah, preferably this is not needed, cause it comes with a risk.

almarklein avatar Jun 13 '24 07:06 almarklein

This bug has been fixed with the newest wgpu.

Fix verified by PR #628

fyellin avatar Oct 27 '24 01:10 fyellin