win-win Avoiding races between hwnd use and window destruction

In the Reddit discussion on this crate, the issue of avoiding use-after-free of a HWND came up. Currently, the crate doesn't do anything to try to prevent this - it gives you a HWND, and what you do with that afterwards is up to you.

There are several strategies for avoiding such races, each with advantages and disadvantages.

Do nothing. Windows has protection against stale hwnd we may simply define our interpretation of the soundness criterion on Windows such that calling a function (such as PostMessage) on a HWND that has recently been destroyed is not unsafe.
Defer window destruction. This is what pcwalton suggested in the thread. The idea is that actual destruction of the window wouldn't happen until after all references to the hwnd have been dropped. There are some tricky aspects, including the fact that DestroyWindow would have to be called from a window procedure handler because the drop of the last hwnd reference might happen in another thread, and that any naked use of DestroyWindow (including as the default action of WM_CLOSE) would be unsafe.
Use a weak reference to the HWND, with the expectation that the only long-lived strong reference would be dropped on window destruction. There could technically be a race if PostMessage were called with a HWND retrieved from the weak reference while the window is being destroyed, but the duration of the race would be very short, and perhaps we can count on the use-after-free protection to be completely effective in such cases.
Use Arc<Mutex<HWND>> to wrap the window handle, such that it's zeroed out in the WM_NCDESTROY handler. All uses of the HWND other than inside the window procedure (the main intended use case being cross-window or cross-thread PostWindow) would need to be done while holding the lock. The main risk here is deadlock; in particular, calling DestroyWindow would be likely to trigger deadlock. But this might be a reasonable restriction to put on clients, as most of the time you want to defer DestroyWindow anyway to reduce complications from reentrancy.

All of these can be done on top of the mechanisms provided by this crate, though some would benefit from having special lifetime-related logic, for example in the WM_NCDESTROY handler. It's not clear to me yet whether there's a "best" solution, and whether it should be provided by this crate. Discussion is welcome.

Feb 04 '20 20:02 raphlinus

How important is it to solve this problem to begin with? How easily could we get away with not destroying windows?

We have a few different kinds of windows:

Main windows and their children. Those are usually active until right before the process terminates, and we can just let the system destroy those windows as the process terminates.
System modals such as file dialogs. Application code doesn't usually interact with their HWND's. Application code calls a function that blocks while the dialog is active and then returns a result.
User modals. These come with a DialogProc, which receives HWND's that application code has to interact with, and the system will destroy the dialog window after it's dismissed. We have to solve the problem for this case, but do we also have to support passing messages to dialog windows directly from other threads? If not, we could give the dialog HWND wrapper a lifetime and call it a day.
Non-modals. Could apps just show and hide them until the process terminates? How many apps would create a large number of different ones?

I'm probably missing something, but these are the most common cases, no?

Feb 05 '20 17:02 kennethuil

My personal feeling is that not destroying windows is an unappealing resource leak, and out of proportion to the actual safety problem. But I'm interested to hear what other people think.

Feb 05 '20 18:02 raphlinus

I concur with @raphlinus. The key important detail in point #1 is the use of the word usually. Failing to define the resource ownership model correctly means that there are applications for which this wouldn't be suited. Same for #4 - hiding the window would still leak the resource & result in a class of applications this wouldn't be suitable for. Given the goal for this project to be the definitive rust-for-windows-on-windows crate, it may be worthwhile to nail down resource leaks.

My gut feeling is that choice 2 or 4 is probably the right balance to strike. The challenge with the third option I think is that you end up having to write a bunch of code around "did I successfully acquire a strong reference". The first is also an option but the challenge comes up that you may have bugs in your code as a result and so, while it may not crash, it may not behave correctly either and the compiler won't help you.

The challenge with the second option, I think, is defining the message you'll send to the window thread to destroy it. WM_CLOSE may not be appropriate as there's a valid use-case for the consumer of the library to handle it custom & then the question is can you distinguish the internal "close" request due to no more refs from some other close that's perfectly valid to intercept (perhaps WM_CLOSE + SendMessageCallback so you can intercept when it's your close vs some other? or get opinionated about reserving some user messages for your library & hope that users porting to Rust need to rewrite large enough portions that where the user offset is doesn't matter or redefine the WM_USER you export as real WM_USER + X).

I think the instinct is 100% correct that DestroyWindow is special & it's fine to be opinionated that it's unsafe for users of the library to be calling it manually since it's bypassing the resource ownership you're establishing.

Feb 27 '20 05:02 vlovich