winforms
winforms copied to clipboard
Massive GDI (region) leak. Help needed.
.NET version
.Net7 and .Net8
Did it work in .NET Framework?
Yes
Did it work in any of the earlier releases of .NET Core or .NET 5+?
We didn't see these problems before .Net7.
Issue description
For several month we are trying to investigate huge GDI (regions) leak in our app. This leak is critical because can reach GDI limit (10k) in one day.
- The leaking GDI objects are regions.
- This is not managed leak - we have no leaking managed objects in dumps.
- It's start leaking suddenly and most of the time continue to the the limit or app restart, but some time stopped.
- The leak is related to elements redraw because of regions of course and most of the time it's happens around rdp connect/disconnect.
- We can't repro this leak :(
- We once managed to catch this behavior using performance HUD. This is quite problematic, because HUD slows down the working PC. And what we saw was very strange, it felt like something was starting to leak that had not leak before. Unfortunately, the call stack saved in HUD (
*.hudinsight
) does not show the names of the methods when viewed (may be some one knew how to overcome this?) :( And even those copied manually as text also turned out to be cropped due to the large size. Therefore, I will present here what I managed to get (sorry for this).
-
479 leaked regions due to redrawing (almost all are system calls):
1.csv
-
84 leaked regions due to close all opened child mdi forms. Closing all windows is done through the menu, when opened, the child menu is filled with open forms (15 in our case). Big part of it is leaking in
MenuStrip
->Control.SetBoundsCore
->SetWindowPos
. Call stacksToolStripDropDownItem.OnDropDownOpened
->EtwWriteTransfer
andToolStripDropDownItem.OnDropDownClosed
->EtwWriteTransfer
are full.2.csv The logic in
tsmiWidow_DropDownOpened
is populate childeDropDownItems
with 15 (in this situation) items. The logic intsmiWidow_DropDownClosed
is clear all items previously added:
while (tmi.DropDownItems.Count > 0)
{
ToolStripItem ti = tmi.DropDownItems[tmi.DropDownItems.Count - 1];
ti.MouseDown -= ActivateW;
var img = ti.Image;
ti.Dispose(); // will remove on dispose
img?.Dispose();
}
There are no managed leaks, all objects were properly deleted (this is not always 100% true I explain below).
It is very strange that the leak occurs both when adding and removing elements. In 99% of cases everything works completely correctly.
While researching I found a small managed leak here:
This small managed leak is reproducible and can't lead to such catastrophic consequences. Can easily be fixed with
WeakReference
here (I will open a PR later):
https://github.com/dotnet/winforms/blob/7504692fbb7ba100c5b098ce5c47e161450a98e2/src/System.Windows.Forms/src/System/Windows/Forms/Input/MouseHoverTimer.cs#L10-L11
- 800+ (maybe not all of 937 are leaked) leaked regions including points 1. and 2. This is also done through the menu, and one submenu is filled and cleared with 43 items in our case, exactly the same as in point 2. These are all the regions that the performance HUD detected after closing all child windows. In this state (in normal condition) our app consume about 100 GDI objects and 10 of them are regions. And in this situation there were about 5500 GDI objects and 5400 of them are regions.
all.csv
OS: Windows 10 Pro for Workstations 22H2.
In conclusion, it seems to me that the problem is somewhere in Winforms, or even in the OS. Any assistance in further investigation is greatly appreciated. 🙏
Steps to reproduce
--
@kirsan31 I'm very interested in looking more deeply at this. Unfortunately, I'm tied up for a number of weeks on critical BinaryFormatter
work. If you have some mitigations like the WeakReference piece, I'm happy to take a look at PRs there.
When the other work is done, I can try to see what I might be able to find out.
Also, have you been able to repro the same thing with .NET 9?
@JeremyKuhne
Also, have you been able to repro the same thing with .NET 9?
I can’t reproduce this at all (no matter how hard I try). This only happens on a working machine and only during work. Moreover, leaks have never started immediately after launching the application, only after a few days. Therefore, my ability to experiment there is very limited and I cannot use .Net9 :( I will continue my experiments...
@weltkante sorry to bother you. but may be you have some ideas?
I'm tied up for a number of weeks on critical BinaryFormatter work.
By the way, I have a question on this topic that no one has answered yet.
@weltkante sorry to bother you. but may be you have some ideas?
No problem, unfortunately this is nothing I've come across in the past. So far I've always been able to rely on managed leaks and memory snapshots/dumps to compare, or being able to reproduce the problem locally and do a time travel debug trace for inspecting the unmanaged leaks. Seems like neither is an option for you.
If I had to diagnose this issue I'd probably try to isolate what effects it:
- make a dump of the leaking process from the task manager and check for any unexpected 3rd party dlls that may have injected themselves into the process
- have a second machine being setup and used. if it never happens on another machine the machine may be simply broken or the windows installation has been corrupted
- if possible consider some sort of virtualization for the second setup for easy transfer between machines. I've been using hyperv based windows sandbox scripts lately to setup isolated applications in tricky cases, but there may be alternatives that are easier
Oh, and make sure the finalizer thread is not stuck on something (look at a few dumps in a debugger after leaks started and check that the thread is idle or at least differs between dumps). Depending on your tooling finalizable objects may not show up as leaks in your managed analysis, but if the finalizer thread is hanging and can't finalize things anymore that may end up this way.
@weltkante
make a dump of the leaking process from the task manager and check for any unexpected 3rd party dlls that may have injected themselves into the process
Nice case, just checked - everything is ok here. But the probability was extremely low because... A work PC has very high restrictions on installed software and Internet use.
have a second machine being setup and used. if it never happens on another machine the machine may be simply broken or the windows installation has been corrupted if possible consider some sort of virtualization for the second setup for easy transfer between machines. I've been using hyperv based windows sandbox scripts lately to setup isolated applications in tricky cases, but there may be alternatives that are easier
Due to the specifics, there is no way for us to configure either a second physical machine or a virtual one. And users won’t approve of this, it’s easier for them to restart the application every few days :)
Oh, and make sure the finalizer thread is not stuck on something (look at a few dumps in a debugger after leaks started and check that the thread is idle or at least differs between dumps). Depending on your tooling finalizable objects may not show up as leaks in your managed analysis, but if the finalizer thread is hanging and can't finalize things anymore that may end up this way.
There's nothing wrong with that. Because other objects are finalized normally, and most regions too. Also in dumps after GC, ready for finalization objects are empty and nothing extraordinary in dead objects.
Thank you for your attention any way 🙏
A small update on what I found out.
- As a result, native functions leak, very often these are
SetWindowPos
. - When running our two applications in parallel, the leaks are not the same. In one application, every menu restructuring is leaked, while in another it is not leaked at all. It follows from this that the problem is not system wide, but begins in a certain process after certain conditions are met (and after certain conditions can stop).
As a result, native functions leak, very often these are SetWindowPos.
Just as a side note to avoid other people reading this drawing the wrong conclusions: SetWindowPos can trigger a lot messages, including callbacks into managed code, so its unlikely to be the direct cause of the leak
SetWindowPos can trigger a lot messages, including callbacks into managed code, so its unlikely to be the direct cause of the leak
Of course that's not the direct cause of the leak. And SetWindowPos really trigger a lot messages, but no one callback to managed code:
I point this method like the most common last managed method in call stack.
but no one callback to managed code
Sounds weird, all WinForms controls should, at the very least, go through the managed message handler of the control. And SetWindowPos can (and usually does) trigger resize and redraw logic, both of which can have managed event handlers that need to be dispatched too, even if they are empty.
Anyways, just meant to say that seeing this method as call root doesn't mean the problem is guaranteed to be on the native code.
@kirsan31 Do you think you could provide consistent repro for us to investigate this?
@kirsan31 Do you think you could provide consistent repro for us to investigate this?
I hope so... Currently the issue still exist (very sporadically) and I can't get the root cause :(
@kirsan31 as soon as we get actionable stuff here we can assign it to whatever the current release is.
Once again we were able to directly catch leaks and use the performance HUD. What we found out:
- Leaks are completely related to the RDP session; they occur during connection, disconnection and also when manipulating the RDP window; one of the leaks happened simply when minimizing the RDP window.
- Leaks do not depend on the RDP client application.
- Leaks do not depend on the client OS.
- Two of our applications were launched - leaks were observed simultaneously in both.
- As I already wrote, it feels like absolutely everything related to redrawing is starting to leak. Moreover, this is not a constant process - not every connection/disconnect/change in the RDP window entails a leak.
This time I copied (in several approaches) all the stacks for two applications from the performance HUD (If necessary, I will provide them all). But they don’t give anything new - all the stacks are some kind of drawing of a menu/tooltip, etc., which always end with EtwWriteTransfer
. Example:
https://github.com/user-attachments/assets/08ce1d7c-a925-45f4-991d-ba0c3105774c
From all this and the fact that before .Net7 such behavior was not observed, I have only two possible assumptions - either this is somehow a Windows bug/corruption (appeared with some kind of system update), or, after all, a regression in .Net.
Does anyone have any other ideas or tips?
P.S. why do all messages go through office component ComponentManager.Microsoft.Office.IMsoComponentManager.FPushMessageLoop
?
//cc @JeremyKuhne
P.S. why do all messages go through office component ComponentManager.Microsoft.Office.IMsoComponentManager.FPushMessageLoop?
Thats just an interface for Office/VisualStudio compatibility, the naming is historical, WinForms uses its own implementation if Office/VS is not detected to provide the interface implementation.
omponentManager.FPushMessageLoop?
Thats just an interface for Office/VisualStudio compatibility, the naming is historical, WinForms uses its own implementation if Office/VS is not detected to provide the interface implementation.
Note that this will now be turned off by default (.NET 9). It can be turned back on with the "Switch.System.Windows.Forms.EnableMsoComponentManager" switch.
Even with the stub it was a fair amount of overhead for message processing. As we had to rewrite all of our COM for ComWrappers we took the opportunity to simplify the message loop.