SkiaSharp
SkiaSharp copied to clipboard
[BUG] iOS: Crash when GC collects SKMetalView resources
Description
SkiaSharp v3.116.1
.Net 9
maui 9.0.0/9.0.100 SDK 9.0.100
maui-ios 9.0.0/9.0.100 SDK 9.0.100
IDE: JetBrains Rider
In our app we have several views, one of which contains an SKMetalView. Every time the GC runs to collect resources, the app crashes with the following stack:
at <unknown> <0xffffffff>
at SkiaSharp.SkiaApi:gr_direct_context_abandon_context <0x00007>
at SkiaSharp.GRContext:AbandonContext <0x0010f>
at SkiaSharp.GRContext:DisposeNative <0x00083>
at SkiaSharp.SKNativeObject:Dispose <0x001c3>
at SkiaSharp.SKObject:Dispose <0x0007b>
at SkiaSharp.GRContext:Dispose <0x0007b>
at SkiaSharp.SKNativeObject:Finalize <0x000a7>
at System.Object:runtime_invoke_virtual_void__this__ <0x00087>
Code
To reproduce the issue SkiaSharpSample.zip:
- Click Open Skia Screen, then go back
- Click Open Text Screen, then go back
- Repeat steps 1 and 2 and the app will crash
Expected Behavior
No response
Actual Behavior
No response
Version of SkiaSharp
3.116.0 (Current)
Last Known Good Version of SkiaSharp
2.88.9 (Previous)
IDE / Editor
Other (Please indicate in the description)
Platform / Operating System
iOS
Platform / Operating System Version
iOS 17.5 iOS 18.2
Devices
No response
Relevant Screenshots
No response
Relevant Log Output
at <unknown> <0xffffffff>
at SkiaSharp.SkiaApi:gr_direct_context_abandon_context <0x00007>
at SkiaSharp.GRContext:AbandonContext <0x0010f>
at SkiaSharp.GRContext:DisposeNative <0x00083>
at SkiaSharp.SKNativeObject:Dispose <0x001c3>
at SkiaSharp.SKObject:Dispose <0x0007b>
at SkiaSharp.GRContext:Dispose <0x0007b>
at SkiaSharp.SKNativeObject:Finalize <0x000a7>
at System.Object:runtime_invoke_virtual_void__this__ <0x00087>
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Interesting, I've also experienced an occasional crash using SKMetalView too - might be the same as this one. I notice that the backend context is created twice with the default constructor. So, related in a sense, but probably won't fix this issue:
https://github.com/mono/SkiaSharp/pull/3256
OK, what I'm seeing is definitely the same problem as described here. It occurs on both iOS and MacCatalyst. As the original stack trace shows, there's a bug disposing/finalizing GRContext with the Metal backend. SKMetalView isn't even required to repro - just creating two GRContexts and disposing them is enough:
CreateAndDisposeMetalContext();
CreateAndDisposeMetalContext();
private static void CreateAndDisposeMetalContext()
{
var device = MTLDevice.SystemDefault!;
using (GRMtlBackendContext backendContext = new()
{
Device = device,
Queue = device.CreateCommandQueue(),
})
{
using GRContext context = GRContext.CreateMetal(backendContext);
}
GC.Collect();
GC.WaitForPendingFinalizers();
}
I'll have a look into fixing it, or at least adding the above as a unit test.
So, the above code was based on what SKMetalView was doing... The actual crash is coming from here
Thread 35 Crashed:: .NET Long Running Task
0 libdispatch.dylib 0x18f9c6c7c _dispatch_semaphore_dispose.cold.1 + 40
1 libdispatch.dylib 0x18f993e0c _dispatch_semaphore_dispose + 72
2 libdispatch.dylib 0x18f9924e4 _dispatch_dispose + 208
3 libdispatch.dylib 0x18f99232c dispatch_release + 152
4 Metal 0x19ac8526c -[_MTLCommandQueue dealloc] + 172
5 IOGPU 0x1b145fa94 -[IOGPUMetalCommandQueue dealloc] + 112
6 AGXMetalG14X 0x11a34a230 -[AGXG14XFamilyCommandQueue dealloc] + 84
7 SkiaSharp.Tests.Devices 0x102c85988 xamarin_dyn_objc_msgSend + 160 (trampolines-arm64-objc_msgSend.inc.s:99)
8 SkiaSharp.Tests.Devices 0x1030278e4 do_icall + 160 (interp.c:2268)
9 SkiaSharp.Tests.Devices 0x103025f20 do_icall_wrapper + 356 (interp.c:2350)
10 SkiaSharp.Tests.Devices 0x10301b500 mono_interp_exec_method + 2832
11 SkiaSharp.Tests.Devices 0x103019114 interp_runtime_invoke + 244 (interp.c:2109)
12 SkiaSharp.Tests.Devices 0x102f67420 mono_jit_runtime_invoke + 1320 (mini-runtime.c:3683)
13 SkiaSharp.Tests.Devices 0x102f07de8 do_runtime_invoke + 60 (object.c:2576) [inlined]
14 SkiaSharp.Tests.Devices 0x102f07de8 mono_runtime_invoke_checked + 148 (object.c:2792)
15 SkiaSharp.Tests.Devices 0x102f0f3ec mono_runtime_try_invoke_byrefs + 548 (object.c:5176)
16 SkiaSharp.Tests.Devices 0x102ecc070 ves_icall_InternalInvoke + 236 (icall.c:3615)
17 SkiaSharp.Tests.Devices 0x102ed5e88 ves_icall_InternalInvoke_raw + 100 (icall-def.h:371)
18 SkiaSharp.Tests.Devices 0x10302794c do_icall + 264 (interp.c:2298)
19 SkiaSharp.Tests.Devices 0x103025f58 do_icall_wrapper + 412 (interp.c:2354)
20 SkiaSharp.Tests.Devices 0x10301b500 mono_interp_exec_method + 2832
21 SkiaSharp.Tests.Devices 0x103019114 interp_runtime_invoke + 244 (interp.c:2109)
22 SkiaSharp.Tests.Devices 0x102f67420 mono_jit_runtime_invoke + 1320 (mini-runtime.c:3683)
23 SkiaSharp.Tests.Devices 0x102f07de8 do_runtime_invoke + 60 (object.c:2576) [inlined]
24 SkiaSharp.Tests.Devices 0x102f07de8 mono_runtime_invoke_checked + 148 (object.c:2792)
25 SkiaSharp.Tests.Devices 0x102f1c560 start_wrapper_internal + 524 (threads.c:1213) [inlined]
26 SkiaSharp.Tests.Devices 0x102f1c560 start_wrapper + 592 (threads.c:1271)
27 libsystem_pthread.dylib 0x18fb49c0c _pthread_start + 136
28 libsystem_pthread.dylib 0x18fb44b80 thread_start + 8
I can't say I'm familiar with Apple APIs or Objective C (and especially not Apple interop). But it looks/feels like a double free situation (i.e. abandonContext freeing the command queue followed by the finalizer). If I manually call retain on the command queue then it at least doesn't crash, but I'm not sure if this is just masks the problem or causes a leak in its place:
NSObject o = NSObject.FromObject(commandQueue);
o.DangerousRetain();
I've put in a draft pull request with a unit test and a proof of concept for the retain. Hopefully @mattleibow can take a look/take over!
Some bits in Skia showing use of retain for the backend context. Looks like it calls through to CFRetain on the command buffer when creating a test context:
https://skia.googlesource.com/skia/+/7755e6ab0bc0/tools/gpu/mtl/MtlTestContext.mm
Randomly (1 out of 3) getting this on iPhone when navigating away from a page:
=================================================================
Managed Stacktrace:
=================================================================
at <unknown> <0xffffffff>
at SkiaSharp.SkiaApi:gr_direct_context_abandon_context <0x00042>
at SkiaSharp.GRContext:AbandonContext <0x0006a>
at SkiaSharp.GRContext:DisposeNative <0x0001e>
at SkiaSharp.SKNativeObject:Dispose <0x000cc>
at SkiaSharp.SKObject:Dispose <0x00020>
at SkiaSharp.GRContext:Dispose <0x00020>
at SkiaSharp.SKNativeObject:Finalize <0x0003a>
at System.Object:runtime_invoke_virtual_void__this__ <0x00090>
at <unknown> <0x00000>
=================================================================
INFO: Basic Fault Address Reporting
=================================================================
Memory around native instruction pointer (0x193a27aa4):0x193a27a94 00 00 00 ea 6d ff
INFO: ff 54 10 7c 5f c8 02 82 7d 92 ....m..T.|_...}.
0x193a27aa4 51 10 40 f9 f1 02 10 36 d0 03 00 36 11 fe 6c d3 [email protected]
So i can easily reproduce.. Dunno how this could help though?
A silly app-level workaround is to stop updating skglview when navigating away from page, so the grcontext is not in use after Dispose(false) is called maybe, something like that.
This is very similar to Windows double-buffered rendering, when you have queued renderings, you "think" you stopped the rendering and you call dispose/anything destroying and a queued backbuffer is still using some resources = crash. Guess here its same, as metal rendering is async too. Just that not many people use skglview on windows so we do not hear much from this: https://github.com/mono/SkiaSharp/issues/1490
To reproduce the crash: one can use an SKGLView with either HasRenderLoop set to true or other fast updating means, so it's updating non-stop when you navigate from a page with it inside. This way we can reproduce a collision between Dispose (calls AbandonContext) and a queued rendering that accesses released resources.
Works for both iOS and Windows.
EDIT: Just catched another randomly happening crash on iOS when navigating away from a page with a non-metal SKCanvas (not SKGLView) view, but it had a native metal view inside:
Thread 2 name: Finalizer
Thread 2 Crashed:
0 libobjc.A.dylib 0x193a27aa4 objc_release + 16
1 libSkiaSharp 0x10150841c 0x101208000 + 3146780
2 libSkiaSharp 0x101508480 0x101208000 + 3146880
3 libSkiaSharp 0x101405728 0x101208000 + 2086696
4 XXX.Mobile 0x100810654 do_icall + 124
5 XXX.Mobile 0x10080ece8 do_icall_wrapper + 348
6 XXX.Mobile 0x100802260 mono_interp_exec_method + 2580
7 XXX.Mobile 0x1007ff618 interp_entry_from_trampoline + 656
8 XXX.Mobile 0x1005b5970 native_to_interp_trampoline + 112
9 XXX.Mobile 0x1007ac380 mono_gc_run_finalize + 588
10 XXX.Mobile 0x1006d9144 sgen_gc_invoke_finalizers + 264
11 XXX.Mobile 0x1007ad594 finalizer_thread + 768
12 XXX.Mobile 0x10078a9c8 start_wrapper + 348
13 libsystem_pthread.dylib 0x1e55040ec _pthread_start + 115
14 libsystem_pthread.dylib 0x1e550272c thread_start + 7
mono_gc_run_finalize.. ok maybe its mono runtime GC'ing something not recognized as being still in use?
Fixed what was happening for me inside a custom handler, sharing for what it's worth. Can't say in general. What was happening for me:
inside public class SKMetalView (SkiaSharp current code in main branch) we create
_backendContext = new GRMtlBackendContext
{
Device = _device,
Queue = _device.CreateCommandQueue() // this queue will get killed by GC collect, triggered either by runtime, or by MAUI code when disconnecting handler. No idea why, this makes me thing of .net runtime issue.
};
then we use it inside void IMTKViewDelegate.Draw(MTKView view):
using var commandBuffer = _backendContext.Queue.CommandBuffer();
And it looks like commandBuffer is using native resources from Queue. So if the commandBuffer finalizer is called after the Queue was disposed we get a crash trying to dispose some already disposed resources.
Queue can be disposed for some reason after base is called inside the handler protected override void DisconnectHandler(SKMetalView platformView). Or later depending on the GC pressure, this creates random situations.
If commandBuffer is disposed (notice using) after that (randomless++) we suffer:
at <unknown> <0xffffffff>
at ObjCRuntime.Messaging:void_objc_msgSend <0x00098>
at ObjCRuntime.BaseWrapper:Release <0x00096>
at CoreFoundation.NativeObject:Dispose <0x00062>
at ObjCRuntime.DisposableObject:Finalize <0x00028>
at System.Object:runtime_invoke_virtual_void__this__ <0x00090>
at <unknown> <0x00000>
So the fix for SKMetalView was to pin the queue inside private void Initialize():
_queuePin = GCHandle.Alloc(_backendContext.Queue, GCHandleType.Pinned);
and unpin inside Dispose:
if (_queuePin.IsAllocated)
{
_queuePin.Free();
}
I hope this could help.
@taublast I don't think that's quite correct. The crash occurs on dispose - preventing the GC/finalization doesn't help with that, since it can be made to happen deterministically:
https://github.com/mono/SkiaSharp/compare/main...jeremy-visionaid:SkiaSharp:grcontext-metal-finalizer
@jeremy-visionaid sure, it's all possible, i'm not an expert. Like i said, fixed the crash for myself and wanted to share.
Looking at the apple docs: https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Cmd-Submiss/Cmd-Submiss.html#//apple_ref/doc/uid/TP40014221-CH3-SW14
Command Queue A command queue accepts an ordered list of command buffers that the GPU will execute. All command buffers sent to a single queue are guaranteed to execute in the order in which the command buffers were enqueued. In general, command queues are thread-safe and allow multiple active command buffers to be encoded simultaneously.
To create a command queue, call either the newCommandQueue method or the newCommandQueueWithMaxCommandBufferCount: method of a MTLDevice object. In general, command queues are expected to be long-lived, so they should not be repeatedly created and destroyed.
Yeah, the trouble here though appears to be that Skia itself is destroying the command queue, which I believe is the cause of the crash. In the upstream tests, they add a retain: https://skia.googlesource.com/skia/+/7755e6ab0bc0/tools/gpu/mtl/MtlTestContext.mm#48
Maybe we need to ask on the Skia forums about what's expected from the caller there...
I am asking here, so we shall see: https://groups.google.com/g/skia-discuss/c/fYp8lLue4zU
There is an incredible amount of "no retain" words in this block:
https://github.com/google/skia/blob/main/src/gpu/ganesh/mtl/GrMtlGpu.mm#L49-L75
I am wondering if we are supposed to retain?
Lookin at instruments with an app, I see the ques being created and then they disappear as well - this is with the dangerous retain. Very interesting. I wonder what the latest skia does and if this is a currect skia bug or an always skia bug. Or a feature.
Oh, is this a problem? Looks like it resets some smart pointers and then it immediately goes out of scope...
sk_sp<GrDirectContext> GrDirectContext::MakeMetal(void* device, void* queue,
const GrContextOptions& options) {
sk_sp<GrDirectContext> direct(new GrDirectContext(GrBackendApi::kMetal, options));
GrMtlBackendContext backendContext = {};
backendContext.fDevice.reset(device);
backendContext.fQueue.reset(queue);
return GrDirectContext::MakeMetal(backendContext, options);
}
It's a deprecated function, and doesn't look like the first GrDirectContext is even used...
Ooh, that is what we are using and it looks so weird.
This is the one we are suppsoed to be using and it looks like somehting got lost in the migration:
sk_sp<GrDirectContext> GrDirectContext::MakeMetal(const GrMtlBackendContext& backendContext,
const GrContextOptions& options) {
sk_sp<GrDirectContext> direct(new GrDirectContext(GrBackendApi::kMetal, options));
direct->fGpu = GrMtlTrampoline::MakeGpu(backendContext, options, direct.get());
if (!direct->init()) {
return nullptr;
}
return direct;
}
EDIT
I see the deprecated calls this one eventually. And options is null/the default object, so a duplicate is probably not a problem.
The "reset" is just the way the pointers work to "set" the value. It is effectively constructing a GrMtlBackendContext object and then passing that to this function.
Yeah, I can't say whether it matters or not without poking at it, but the deprecated named constructor looks sketchy to me. I haven't compiled the native bits on mac before, but otherwise looks simple enough to at least do a quick hack to try out calling the other one... I imagine you'd beat me to it š
Other places do things like this:
externals/skia/tools/gpu/mtl/MtlTestContext.mm:
backendContext.fDevice.retain((GrMTLHandle)device.get());
sk_cfp<id<MTLCommandQueue>> queue([*device newCommandQueue]);
backendContext.fQueue.retain((GrMTLHandle)queue.get());
externals/skia/tools/window/MetalWindowContext.mm:
GrMtlBackendContext backendContext = {};
backendContext.fDevice.retain((GrMTLHandle)fDevice.get());
backendContext.fQueue.retain((GrMTLHandle)fQueue.get());
So I think we are supposed to?
I reviewed your PR and I tihnk for now, since the only place we own the queue is the SKMetalView, we should keep the retain there.
As soon as we have more data we can change.
I don't want to inadvertantly retain and then behaviour happens as a result and we can't revert. Maybe Google will say something exciting and we will be "oh yeah, that was smart! Good call Google" or something.
We also need to test this on an actual iOS device.
I'm wondering if the retains there are also to workaround the local scoping issues of GrMtlBackendContext š¤ As an interrim thing, I don't think retaining in SKMetalView is too bad. It would at least fix the crashes that SkiaSharp users are currently getting, even if GRContext disposal were still broken without others needing to do an explicit retain. But if you've got a few more days then we can wait to see what the Skia guys say and/or I could try another approach.
@mattleibow Do you know which version of XCode the native deps are expected to build against so I can build the native deps locally (it doesn't look like XCode 16.4/macOS 15.5 SDK works)? I can only see the container is macos14...
Ah, nvm, #3058 is sufficient to fix the local build
Now that I have a local native build, here's a quick and dirty hack to show the sketchy constructor in Skia is the cause of the problem:
sk_sp<GrDirectContext> GrDirectContext::MakeMetal(void* device, void* queue,
const GrContextOptions& options) {
GrMtlBackendContext *backendContext = new GrMtlBackendContext();
backendContext->fDevice.reset(device);
backendContext->fQueue.reset(queue);
return GrDirectContext::MakeMetal(*backendContext, options);
}
Cherry-picking the unit test commit on top, the test passes. So, I'd say that my proposed fix is near enough equivalent for the queue at least but without the leak (though it would still leave a potential issue for removable metal devices).
An alternative approach might be to tie GrMtlBackendContext to the GrDirectContext lifetime on the managed side. WDYT @mattleibow ?
Are you saying removing the dodgy line is all it takes to fix it?
Iām saying that it might be either the scope of the context or that they called reset instead of retain: https://github.com/mono/SkiaSharp/pull/3258#discussion_r2286464628