native icon indicating copy to clipboard operation
native copied to clipboard

App crashes after hot restart due to dangling Java callback references to disposed Dart objects

Open dickermoshe opened this issue 3 months ago • 21 comments

App crashes after hot restart when using Android audio player with Dart callbacks. The crash occurs because Java-side callback references to Dart objects become invalid after hot restart, but Java continues to hold and attempt to use these stale references.

There must be some way to run some cleanup code before a hot reload. It would be nice if this was not a Flutter specific API so we would not need to introduce a flutter dependency.

dickermoshe avatar Sep 12 '25 14:09 dickermoshe

This problem is not specific to hot-restart but anytime that Java holds a reference to a Java interface implemented within Dart that lives longer than the underlying isolate.

Using the normal Finalizer won't work as they don't guarantee that they will be run. NativeFinalizer does have this guarantee but it's tricky to use as you'd have to write this function in C. Also attach method gets a single Pointer<Void> token object that is passed as a parameter to a native callback, so you might have to pack multiple arguments in an array and pass the array as token and later unpack in the C function.

NativeCallable.isolateGroupShared could be used once the shared memory multithreading proposal lands. This creates the function pointer required by NativeFinalizer and since JObjects will be trivially sharable, it's easy to write a function that does such clean up.

One question is: Will the isolate group stay alive at hot-restart until all the native finalizers' callbacks fire or not? If not, then the function pointer we get from NativeCallable.isolateGroupShared will also be invalid.

cc @mraleph, @aam

HosseinYousefi avatar Sep 12 '25 19:09 HosseinYousefi

NativeCallable.isolateGroupShared can't be used as a native finalizer because you can't enter Dart from native finalizer. It's forbidden. These callbacks fire directly from GC and all Dart mutator threads are paused at that point. Only VM runtime code and native code which does not touch Dart can execute.

mraleph avatar Sep 12 '25 19:09 mraleph

I don't think it's unreasonable to require than the developer clean up any java code before shutting down the isolate. However, I if your goal is to make jni "just work" for every random developer then I guess you'd want a way to solve this.

dickermoshe avatar Sep 12 '25 20:09 dickermoshe

Will the isolate group stay alive at hot-restart until all the native finalizers' callbacks fire or not?

Also note that given https://github.com/flutter/flutter/issues/75528 I think in common case flutter hot restart kills all isolates in an isolate group therefore running corresponding native finalizers, but there are caveats to that, where it might not work.

aam avatar Sep 12 '25 20:09 aam

NativeCallable.isolateGroupShared can't be used as a native finalizer because you can't enter Dart from native finalizer.

OK, so we can't use that.

I'm trying to think how else we can simplify this then. One idea is that we could make it a bit simpler to create the cleanup C function using codegen then, something like:

@jniFinalizer
void cleanUp(JFoo foo, JBar bar) {
  // do clean up here
  foo.removeListener(bar);
  bar.close();
}

However the function must not use any variable from outside the function and should only really be calling simple Java methods and no other Dart operations, it has to return void and has to only accept arguments that extend JObject. JNIgen can then codegen an equivalent C function that does the clean up and expose helpers for the user to attach it to a NativeFinalizer.

Because we don't want to write a full Dart to C compiler of course something like this won't work:

@jniFinalizer
void cleanUp(JFoo foo, JBar bar) {
  // do clean up here
  final a = 2 + 2; // ERROR: the function should only have method calls on the arguments
  foo.removeListener(bar);
  bar.close();
}

HosseinYousefi avatar Sep 13 '25 16:09 HosseinYousefi

It would be nice if this was not a Flutter specific API

Hot restart is a Flutter concept, not a Dart concept. Dart only has hot-reload. And hot reload will likely not invalidate any native pointers (or have unpredictable behavior if you were to remove some fields and add some fields that might have contained pointers).

Indeed, the Flutter-specific request is tracked in:

  • https://github.com/flutter/flutter/issues/75528

so we would not need to introduce a flutter dependency.

But yes, it would be ~~nice~~ required to avoid a Flutter dependency. (Especially when we start migrating core libraries such as HTTP or IO to native interop.)

  • https://github.com/dart-lang/pub/issues/4669

dcharkes avatar Sep 15 '25 07:09 dcharkes

I think there are two parts here:

  • We need to make sure that NativeFinalizer behaves as documented in situations when Flutter is shutting down isolates for hot restart. If it does not - we need to fix it. I assume that it does - what are the caveats you are referring to @aam?
  • jnigen should make use of NativeFinalizer.

If we for some reason believe that NativeFinalizer can't be used - we need to come up with a mechanism that can be used. But I think this should not be necessary.

@HosseinYousefi

I'm trying to think how else we can simplify this then. One idea is that we could make it a bit simpler to create the cleanup C function using codegen then, something like:

tldr is that our current philosophy here is that if you need some C code - just write C code and include it using code assets.

Originally (4 years ago! see https://github.com/dart-lang/sdk/issues/47778) we looked at defining a C-like subset of Dart which when compiled can run more-or-less like native code without runtime system. @dcharkes have done some initial prototyping and we had a lot of discussions about it - but in the end we decided that if you want to write C you should just write C and not a weird dialect of Dart. So we prioritized code assets instead.

mraleph avatar Sep 15 '25 07:09 mraleph

  • We need to make sure that NativeFinalizer behaves as documented in situations when Flutter is shutting down isolates for hot restart. If it does not - we need to fix it. I assume that it does - what are the caveats you are referring to @aam

IIRC, the way that the Flutter engine shuts down isolate groups does not guarantee the isolates are shut down before the new isolate group is started:

  • https://github.com/flutter/flutter/issues/75528#issuecomment-1771621574

Agreed, if we fix that issue, we should be able to rely on NativeFinalizers, and have no need for knowing about a specific Flutter API.

dcharkes avatar Sep 15 '25 08:09 dcharkes

I think we currently have discrepancy between what we guarantee in NativeFinalizer documentation and what we implement. Documentation says:

Callbacks will happen as early as possible, when the object becomes inaccessible to the program, and may happen at any moment during execution of the program. At the latest, when an isolate group shuts down, this callback is guaranteed to be called for each object in that isolate group that the finalizer is still attached to.

Compared to the Finalizer from dart:core, which makes no promises to ever call an attached callback, this native finalizer promises that all attached finalizers are definitely called at least once before the isolate group shuts down, and the callbacks are called as soon as possible after an object is recognized as inaccessible.

But implementation scopes finalizers to individual isolates providing a stronger guarantee that NativeFinalizer fires when isolate which created it is shutting down.

I understand that the documentation is kinda future looking - because we were discussing possibility to share NativeFinalizer between isolates (I think this will become more important in context of https://github.com/dart-lang/sdk/issues/56841) - which means that finalizer is not necessarily owned by any single isolate.

I think current implementation provides very good invariant: finalizers are fired before we fully shutdown the isolate, so you can use them to cleanup isolate specific state on the native side. I am concerned however that there is (seemingly) no safe way to forcefully shutdown an isolate which uses NativeCallable.listener even if you rely on this invariant.

Consider the following situation: you have an isolate I which creates cb using NativeCallable.listener and gives it to native code, i.e. producer->set_callback(cb). We assume that producer is calling cb on some worker threads.

Now we are shutting I down forcefully. How do you prevent producer's worker thread from crashing when it tries to call cb?

If you just go purely by documentation there is no way to do that at all. NativeFinalizer does not provide necessary strong guarantee.

Now consider that you know implementation details, you know that NativeFinalizer actually provides stronger guarantee than its documentation promises so actually try to use NativeFinalizer to do the job. Now you hit another problem: I think you can't really write a safe finalizer without causing deadlocks. The gist of the problem is the following: isolate shutdown can only safely proceed when it knows that cb is unregistered from the producer and that there is no cb invocation inflight. This means your code has to look something like this:

void Producer::WorkerThread() {
   // ...
   std::scoped_lock cb_lock(cb_mutex_);
   if (cb_ != nullptr) {
     cb_(...);
   }
   // ...
}

void Producer::set_callback(CallbackType cb) {
   std::scoped_lock cb_lock(cb_mutex_);
   cb_ = cb;
}

void UnregisterCallbackOnIsolateShutdown(Producer* producer) {
  producer->set_callback(nullptr);
  // We know that cb_ can be used when set_callback returns
  // because WorkerThread is holding lock as long as it is using
  // cb_.
}

Hower this code creates situation similar to one we encountered in https://github.com/dart-lang/sdk/issues/61272 and https://github.com/dart-lang/sdk/issues/61372 (at least given the current implementation of RunAndCleanupFinalizersOnShutdown): if one thread tries to enter IG safepoint, while another thread is trying to perform callback in WorkerThread (holding cb_mutex_) then UnregisterCallbackOnIsolateShutdown called from isolate shutdown will cause deadlock.

I think we need to find a way to rework this code which makes this sort of pattern actually work. /cc @mkustermann @liamappelbe

[^1]: At least that's my reading of the code based on what we experienced in

mraleph avatar Sep 15 '25 08:09 mraleph

tldr is that our current philosophy here is that if you need some C code - just write C code and include it using code assets.

The problem is we bring the user out of "oh everything is type safe and generated for you" into "learn how to use JNI". Another idea is get user to write Java/Kotlin static method for the clean up instead (and include all the machinery that's necessary to get this to work). This way at least they can still write type safe code in another language.

HosseinYousefi avatar Sep 15 '25 09:09 HosseinYousefi

The problem is we bring the user out of "oh everything is type safe and generated for you" into "learn how to use JNI".

I did not mean user should write this. We should include necessary (C) bits into jni package itself.

mraleph avatar Sep 15 '25 09:09 mraleph

Callbacks will happen as early as possible, when the object becomes inaccessible to the program, and may happen at any moment during execution of the program. At the latest, when an isolate group shuts down, this callback is guaranteed to be called for each object in that isolate group that the finalizer is still attached to.

The issue with Flutter is that our isolate group shutdown API is not synchronous. So Flutter doesn't wait until the isolate group has fully shut down before hot restarting. (So, while the Dart VM delivers on its promises, the Flutter engine isn't currently able to use that.)

Note that the isolate group shutdown is important in our use case. Many usages of native code happen on helper isolates (e.g. doing queries to a database). So, before Flutter restarts the main function from an app, the helper isolates must have also shut down. I presume the same is true for JNI use cases. And the JObjects are already vm:shared.

dcharkes avatar Sep 15 '25 10:09 dcharkes

I think we need to find a way to rework this code which makes this sort of pattern actually work.

To summarize the issue as I understand it:

A. Calling the listener requires entering the isolate group. B. Entering the isolate group will block if the group is currently shutting down. C. The user needs to do some sort of locking that will block the finalizer if the callback is currently being invoked, like this UnregisterCallbackOnIsolateShutdown example.

So we have a deadlock.

We can't really loosen C. We could break the deadlock using atomics, but then we'd open up the potential for use-after-free issues, which will (usually) FATAL once this lands (eg Producer::WorkerThread loads cb, UnregisterCallbackOnIsolateShutdown sets it to null, isolate group shuts down and deletes cb, Producer::WorkerThread invokes cb). The whole point of the lock is to block the isolate group from shutting down until the cb invocation is complete (or rather, until the listener message is sent; it will not be received).

We also can't do much about A either. Not without a major rewrite of the argument marshaling code to use the Dart_PostCObject API. IIRC we considered this approach when we were designing listeners, but decided against it because it would limit what types of object can be sent.

Maybe we can loosen B? IsolateGroup::EnterTemporaryIsolate currently isn't allowed to fail (there are asserts that the temp isolate isolate isn't null). It also blocks if the group is currently shutting down. Maybe we can alter it to fail gracefully and return null if the group is shutting down?

liamappelbe avatar Sep 16 '25 00:09 liamappelbe

To summarize the issue as I understand it:

A. Calling the listener requires entering the isolate group. B. Entering the isolate group will block if the group is currently shutting down

Corrections: Isolate group is not shutting down. It is one of the isolates that is shutting down and is currently executing finalizers. All finalizers are currently executed in NoSafepointScope if another isolate wants to enter safepoint it has to wait until the isolate which is shutting down reaches the point where we can enter safepoint. So fundamentally the picture is like this:

  • Thread1 is waiting inside Isolate::Shutdown -> (NoSafepointScope) Isolate::RunAndCleanupFinalizersOnShutdown -> (Finalizer) -> UnregisterCallbackOnIsolateShutdown -> Producer::set_callback -> Lock(Producer::cb_mutex_). Lock is waiting because Thread3 owns cb_mutex_. Thread1 is a
  • Thread2 is trying to do an operation which requires all other threads to safepoint (e.g. GC), so it is waiting inside WaitUntilThreadsReachedSafepointLevel.
  • Thread3 has invoked Producer::cb_ (it acquired cb_mutex_ before invocation). It is now inside Producer::cb_ -> FfiTrampoline -> DLRT_GetFfiCallbackMetadata -> EnterTemporaryIsolate -> Thread::EnterIsolate -> AddActiveThread -> while(AnySafepointInProgressLocked()) threads_lock.Wait() and waiting for Thread2 to finish what it started.

So we are deadlocked: Thread3 waits for Thread2 to release safepoint, Thread2 waits for Thread1 to reach safepoint, Thread1 is in the middle of NoSafepointScope and blocked waiting on Thread3 to release cb_mutex_.

I think the only way to brake the deadlock is to allow safepoint during finalizer invocations. I think that should be possible - though it requires detaching them first (so that GC does not interfere with us). Do you sense any problems here @mkustermann @dcharkes?

We also can't do much about A either. Not without a major rewrite of the argument marshaling code to use the Dart_PostCObject API. IIRC we considered this approach when we were designing listeners, but decided against it because it would limit what types of object can be sent.

I don't think rewrite of .listener implementation would solve general problem anyway. We would hit the same thing with NativeCallable.isolateGroupBound (and .listener is kinda just a special case of that anyway).

mraleph avatar Sep 17 '25 08:09 mraleph

The gist of the problem is the following: isolate shutdown can only safely proceed when it knows that cb is unregistered from the producer and that there is no cb invocation inflight.

I think we need to find a way to rework this code which makes this sort of pattern actually work.

My understanding of the problem:

  • We have a thread in Dart that is wanting to shut down an isolate.
  • We have a thread in native that is trying to do an async callback to that isolate.
  • We don't expose isolate-lifecycles explicitly to the native code. So we use native finalizers to signal to the native code that it cannot perform callbacks anymore.
  • Because we fundamentally have two concurrent threads that might race to do these things, we need a lock.

However, I wonder if doing this with a native finalizer is the right place. Consider we have a callback in flight, trying to get the lock inside the native finalizer which is inside RunAndCleanupFinalizersOnShutdown is too late. The callback is in flight, but we're already concurrently running Isolate::Shutdown and running Isolate::UnMarkIsolateReady. So, I feel the locking should be tied to the isolate (and isolate group) lifecycle instead. An isolate or isolate group should really only start shutting down once all native resources have been cleaned up.

That blocking until all the native code has released resources should probably not happen in a safepoint scope indeed.

Basically, if native code expects to be able to run callbacks synchronously, it should know about the isolate life cycle. And if native code expects to be able to fire off async callbacks, it needs to know about the isolate life cycle in order to queue the callback.

Probably we want to expose the isolate life-cycle and isolate-group life cycle as synchronous callbacks in a Dart API (probably the same signature as native finalizers). The users could then write some C code like they would do for native finalizers. (Using a Dart API is likely more ergonomic than exposing isolate lifecycle functionality in dart_api_dl.h.)

What do with both native finalizers and isolate shutdown sync callbacks? We don't want to double-call the native-release function. I'm thinking it's unlikely that this will be an issue. If you were to unregister some callback on the native side with a native finalizer, an async callback could already be in flight and racing with an object being GCed. More likely, the callback itself holds on to all Dart objects it needs. Hence, one should use the isolate-shutdown callback instead of a native finalizer.

(Edit: Zooming out to the original issue: I wonder if a synchronous isolate group shutdown callback and isolate shutdown callback is even the right place. E.g. is early enough in the context of a hot restart in Flutter? Or are there some things happening in the Flutter engine before it calls the shutdown of the Dart isolate / isolate group that we need the lock around with the callback coming from native.)

dcharkes avatar Sep 17 '25 08:09 dcharkes

However, I wonder if doing this with a native finalizer is the right place.

Now that you said it - it made me realize that the same dead lock can occur from GC as well. And that that problem is not limited to callbacks lifetimes themselves - but its a general issues with any sort of locking.

If your finalizer takes a lock to manage some resource and the same lock can be taken while calling into Dart from another thread - you have a potential deadlock situation[^1].

[^1]: It's a classical lock order inversion, if you think about safepoint as a lock. One thread does Request Safepoint -> GC -> Finalizer -> Take Lock, another thread does Take Lock -> Call into Dart -> Pause on Safepoint.

mraleph avatar Sep 17 '25 10:09 mraleph

@dickermoshe

App crashes after hot restart when using Android audio player with Dart callbacks. The crash occurs because Java-side callback references to Dart objects become invalid after hot restart, but Java continues to hold and attempt to use these stale references.

As a work-around for the original issue: if you make the callbacks fire-and-forget by passing methodName$async: true when implementing the interface and remove all listeners at the start of your program before adding new ones then you won't get crashes anymore.

HosseinYousefi avatar Sep 17 '25 11:09 HosseinYousefi

The issue with Flutter is that our isolate group shutdown API is not synchronous. So Flutter doesn't wait until the isolate group has fully shut down before hot restarting.

The fact that flutter’s restart may stop one isolate group concurrently with starting a new one is an issue, but isn’t related to this issue being discussed here.

I think we currently have discrepancy between what we guarantee in NativeFinalizer documentation and what we implement

I view this as a bug in the documentation. In the end I think our APIs should provide symmetry between resources and cleanup:

FFI callbacks owned by a particular isolate <-> NativeFinalizers (attached to any object) running (at the latest) on isolate shutdown FFI callbacks owned by isolate group <-> shared NativeFinalizers (attached to sharable objects) running (at the latest) on isolate group shutdown

Currently we only have the former.

I think the only way to brake the deadlock is to allow safepoint during finalizer invocations. I think that should be possible - though it requires detaching them first (so that GC does not interfere with us). Do you sense any problems here @mkustermann @dcharkes?

NativeFinalizers are C finalizers and aren’t allowed to execute Dart code atm. We may want to stick with that. It has the nice property that running a NativeFinalizer cannot create more finalizers or attach to an existing finalizer.

An isolate group shutdown will wait for all isolates to first shutdown. Now the NativeCallable.listener callbacks that are invoked on 3rd party threads do not need any resource from the isolate itself, they only require isolate group to be alive (they create/enter temporary isolate, do stuff and send message to a port - nothing involves the original isolate (it could be dead already) - modulo sending message to port). Now we have opted to enforce more strict checking on the callback invocation, meaning that at least the ffi callback metadata still has to be alive (and therefore the isolate).

Isolate shutdown happens in phases which are currently roughly

  • RunShutdownCallbacks(those are allowed to run dart code)
  • Disallow running dart code (after which no new native finalizers, no new attaches to them can happen)
  • Run native finalizers
  • Post bequest (if Isolate.exit() usage)
  • LowLevelShutdown (forcefully close all ports, delete ffi callback metadata, …)
  • LowLevelCleanup unregister isolate, exit isolate group, delete Embedder cleanup callback

Now we can modify this shutdown procedure the way we want - but we have a few constraints:

  • Deleting ffi callback metadata has to happen after running native finalizers (xx)
  • Closing the ports has to happen after running native finalizers (xx) - ensuring that any pending ffi C callback invocations can also successfully send the message. (xx) Native finalizer will sync with 3rd party native thread to wait for any pending ffi C callback invocations and disable invocation of any more ffi C callback invocations

We do not have to hold any locks during invocation of the native finalizers. We can grab them while we have entered the isolate, then exit the isolate, run them, then re-enter the isolate.

=> Basically what Slava said \o/

The GC can do the same thing, it can accumulate all native finalizers it wants to run, then exit safepoint operation, exit isolate group, run them, re-enter isolate group.

I don't think rewrite of .listener implementation would solve general problem anyway. We would hit the same thing with NativeCallable.isolateGroupBound (and .listener is kinda just a special case of that anyway).

Yes, so for shared isolate resources / isolate group level resources, one may have NativeCallable.isoalteGroupBound callbacks and has to ensure that on isolate group shutdown we run the NativeFinalizer.shared finalizers, which may sync with 3rd party threads (wait for pending isolate group callbacks to be done).

One would think, the procedure could be very similar with some exceptions

  • Prevent normal isolates from being created
  • Kill existing isolates (after which no normal dart isolates can execute code)
  • Grab all shared native finalizers, exit isolate group, run them, re-enter isolate group, shutdown

Though there's something interesting happening: A shared native finalizer, when invoked, may sync with a 3rd party thread, which may invoke a shared ffi callback, which will "create" temporary/shared isolate and run Dart code. Such Dart code can do arbitrary things a shared isolate can do: Access shared isolate state, spawn shared isolates, create new shared native finalizers, attach objects to shared native finalizers.

This puts us in a though spot: When we invoked native finalizer 1 it may free native resources, when invoking native finalizer 2 we sync with 3rd party thread which runs shared ffi callback, which runs shared isolate code, which can access the (previously) freed resource (lets say the thing native finalizer 1 freed is stored as Pointer<> in a shared global variable), creating a use-after-free situation.

mkustermann avatar Sep 17 '25 11:09 mkustermann

The GC can do the same thing, it can accumulate all native finalizers it wants to run, then exit safepoint operation, exit isolate group, run them, re-enter isolate group.

There's actually one issue with this: If the GC was triggered in a gc safepoint state where deopt isn't allowed, then it can lead to a deadlock: we'd exit the isolate (but it would go to gc safepoint not gc-and-deopt safepoint), run native finalizers, which sync with 3rd party thread, which create isolate, which runs Dart code, which wants to deopt. It creates deopt safepoint operation but will be deadlocked as the other isolate will never reach such safepoint.

So one would need to delay running the native finalizers to a point where deopt is possible.

A similar situation could occur with reloads: We run native finalizers on a thread where isolate is exited (and therefore at-safepoint, but not at-reloadable-safepoint), we run native finalizer, sync with 3rd party thread, which runs dart code, which goes to runtime to reload.

To avoid this situation, we need to let the thread (which may be an isolate, shared isolate, bg compiler, ...) that triggered GC continue running. We could create a task to run the native finalizers and post that task to a thread pool, letting them run in the background.

mkustermann avatar Sep 17 '25 11:09 mkustermann

This puts us in a though spot: When we invoked native finalizer 1 it may free native resources, when invoking native finalizer 2 we sync with 3rd party thread which runs shared ffi callback, which runs shared isolate code, which can access the (previously) freed resource (lets say the thing native finalizer 1 freed is stored as Pointer<> in a shared global variable), creating a use-after-free situation.

This scenario seems like user error to me. Is there a way this could happen in a reasonable use case, or accidentally?

liamappelbe avatar Sep 17 '25 22:09 liamappelbe

If we want to support deleting native callables from native finalizers via a native API (https://github.com/dart-lang/sdk/issues/61887) and deleting persistent handles from native finalizers, it would also be beneficial to delay running the native finalizers. If we run them after the GC global safepoint, we can go into the VM state (not a safe point) to delete the persistent handle hanging on to the closure.

The GC can do the same thing, it can accumulate all native finalizers it wants to run, then exit safepoint operation, exit isolate group, run them, re-enter isolate group.

If we want to support deleting native callables and persistent handles from native finalizers, we should stay in the isolate group, and only exit the safepoint operation.

dcharkes avatar Nov 24 '25 11:11 dcharkes