ndk icon indicating copy to clipboard operation
ndk copied to clipboard

ndk-glue: Move native activity behind lock and remove it in `onDestroy`

Open MarijnS95 opened this issue 4 years ago • 33 comments

Fixes #154

~When an application returns its control flow back from main() it is generally assumed to be done polling on the looper and wishes to terminate. Android is made aware of this by calling ANativeActivity_finish, otherwise it'll continue sending input events and eventually ANR as the app didn't consume them anymore.~

NativeActivity is now behind a lock, and will be removed as soon as onDestroy is called. Due to the async nature of events, make sure to hold on to the lock received from native_activity() beforehand if you wish to use it during handling of Event::Destroy.


~TODO: I assume https://developer.android.com/ndk/reference/group/native-activity#anativeactivity_finish says that ANativeActivity is invalid after this call? If so we should make fn finish(&self) take ANativeActivty by value instead (to force use of .take()) - we could even have a drop handler?~ Done!

MarijnS95 avatar Jul 24 '21 22:07 MarijnS95

TODO: I assume https://developer.android.com/ndk/reference/group/native-activity#anativeactivity_finish says that ANativeActivity is invalid after this call? If so we should make fn finish(&self) take ANativeActivty by value instead (to force use of .take())

That makes sense

we could even have a drop handler?

probably not worth overdoing this

dvc94ch avatar Jul 24 '21 22:07 dvc94ch

we could even have a drop handler?

probably not worth overdoing this

It would be clean. Not everyone might use ndk_glue (though it is unlikely) and it would lead to a dangling pointer (though that's the least of our concerns) and ANR with no way to recover if it's lost.

MarijnS95 avatar Jul 24 '21 22:07 MarijnS95

Sure if you want to do it

dvc94ch avatar Jul 24 '21 22:07 dvc94ch

@dvc94ch Done :)

MarijnS95 avatar Jul 24 '21 22:07 MarijnS95

So is the drop handler guaranteed to be executed before the mutex is released?

dvc94ch avatar Jul 24 '21 22:07 dvc94ch

@dvc94ch NATIVE_ACTIVITY is not protected by any mutex, otherwise we'd have to lock this for writing.

https://github.com/rust-windowing/android-ndk-rs/blob/49e8ed51220567dccac9fd1e8baa526e9cdac163/ndk-glue/src/lib.rs#L56

MarijnS95 avatar Jul 24 '21 22:07 MarijnS95

wait didn't this used to be a RwLock?

dvc94ch avatar Jul 24 '21 22:07 dvc94ch

I guess not, so ignore my comments :)

dvc94ch avatar Jul 24 '21 22:07 dvc94ch

Only the other bits:

https://github.com/rust-windowing/android-ndk-rs/blob/49e8ed51220567dccac9fd1e8baa526e9cdac163/ndk-glue/src/lib.rs#L49-L54

MarijnS95 avatar Jul 24 '21 22:07 MarijnS95

but it could panic right?

https://github.com/rust-windowing/android-ndk-rs/blob/49e8ed51220567dccac9fd1e8baa526e9cdac163/ndk-glue/src/lib.rs#L59

dvc94ch avatar Jul 24 '21 22:07 dvc94ch

It's probably not behind a lock because it's always "statically" set during activity execution and doesn't change (well, except when someone returns from fn main() while still having a thread alive that might access it), but all the other variables can reasonably change whenever Android feels like calling one of the callbacks.

MarijnS95 avatar Jul 24 '21 22:07 MarijnS95

I think the static mut is only safe because it is initialized at startup and never written to

dvc94ch avatar Jul 24 '21 22:07 dvc94ch

but it could panic right?

https://github.com/rust-windowing/android-ndk-rs/blob/49e8ed51220567dccac9fd1e8baa526e9cdac163/ndk-glue/src/lib.rs#L59

Yes, but only if the user calls native_activity() after returning from main(), which we can simply deem invalid and incorrect.

MarijnS95 avatar Jul 24 '21 22:07 MarijnS95

I think the static mut is only safe because it is initialized at startup and never written to

Exactly.

I think we might change unwrap() to .expect("Native activity has already been finished") to make this more clear.

MarijnS95 avatar Jul 24 '21 22:07 MarijnS95

well that's the point, it's two separate instructions, so if there are multiple threads there could be a race. But I'm probably overanalyzing this.

dvc94ch avatar Jul 24 '21 22:07 dvc94ch

So what I'm saying is the program does this:

  1. make static mut NATIVE_ACTIVITY None
  2. anything could happen including calling native_activity
  3. the drop handler is run

while 1 and 2 could actually happen at the same time meaning it's UB.

dvc94ch avatar Jul 24 '21 22:07 dvc94ch

Reading or writing the pointers is only a single instruction. The point is to prevent other threads doing so while the thread holding the lock is doing different things with it. No such case with ANativeActivity as far as I'm aware.

MarijnS95 avatar Jul 24 '21 22:07 MarijnS95

native_activity is public so it could happen. and I think winit does actually call native_activity and it runs in a separate thread.

dvc94ch avatar Jul 24 '21 22:07 dvc94ch

So what I'm saying is the program does this:

  1. make static mut NATIVE_ACTIVITY None
  2. anything could happen including calling native_activity
  3. the drop handler is run

while 1 and 2 could actually happen at the same time meaning it's UB.

The only one making NATIVE_ACTIVITY = None is the .take() here. Theoretically nothing else should run that calls .native_activity(). We don't do that in any of the ANativeActivityCallbacks (that could perhaps run after _finish() is called, I don't know), and if the user does that, they shouldn't have a looper listening to events in the first place as they have already returned from fn main().

MarijnS95 avatar Jul 24 '21 22:07 MarijnS95

native_activity is public so it could happen. and I think winit does actually call native_activity and it runs in a separate thread.

Yeah I think that's the point we're trying to make - or the convention we're trying to set - here. The crate shouldn't have any threads live after returning from fn main(), as that makes us unable to call .finish(). Since non-daemon threads are a thing we might instead require the end user to manually drop NativeActivity or run .finish()? Then we have the choice to take self by value, or keep the drop handler and recommend users to call drop(native_activity) instead. They should be able to .take() it from the Option either way (as I don't think having it around is a good idea after _finish() was called), and we should document this requirement re. #154.

MarijnS95 avatar Jul 24 '21 22:07 MarijnS95

We can possibly drop/.finish() in an atexit handler as well?

EDIT: I doubt the atexit handler runs unless all threads are gone (and the entire process terminates), which is likely not the case here otherwise the app would have vanished instead of becoming unresponsive. We might have something sticking around somewhere.

MarijnS95 avatar Jul 24 '21 22:07 MarijnS95

I think taking self by reference would avoid writing to a static mut which is usually not a good idea. Presumably after calling finish the jvm stops the process and kills all threads? The alternative would be to just put the native activity in a RwLock and just hold the lock while calling finish. Or we can leave it as is and see if there are any panics/segfaults in the wild, which probably is pretty rare in practice.

dvc94ch avatar Jul 24 '21 22:07 dvc94ch

It's funny how the smaller the PR the larger the discussion hahha

dvc94ch avatar Jul 24 '21 23:07 dvc94ch

It's funny how the smaller the PR the larger the discussion hahha

Yup the bigger churn is never an issue, but smaller controversial lines like these need to be thoroughly thought out and discussed. It's good that it's a single, isolated change in a PR anyway, and not hidden in some mega-PR :)

I think taking self by reference would avoid writing to a static mut which is usually not a good idea.

If we decide to expose drop/.finish() to the user we should definitely use by-value (move) semantics and turn the static into a lock. Allowing the user to use the NativeActivity pointer after calling ANativeActivity_finish() is probably just as bad if not worse than the race conditions that might follow.

Presumably after calling finish the jvm stops the process and kills all threads?

That's possible. Just like normal Android Activity instances the app is kept alive until explicitly calling .finish() even if no user (non-daemon) threads are spawned. That's the nature of being asynchronous and using callbacks.

In this instance automatically calling .finish() after returning from fn main() isn't the best idea, see first bulletpoint below.

The alternative would be to just put the native activity in a RwLock and just hold the lock while calling finish. Or we can leave it as is and see if there are any panics/segfaults in the wild, which probably is pretty rare in practice.

Pretty sure we'd see segfaults or other UB if the user were to continue using NativeActivity and friends after calling ANativeActivity_finish - even if Android kills all threads there's still a possibility for race conditions before that happens.

After a ~good~ night sleep, these are the main takeaways:

  • ndk_glue::main shouldn't really be called main, it's only part of the ANativeActivity_onCreate "callback". The special "finish-on-return" semantics I'm introducing here don't make much sense in that regard. Can we rename it to ndk_glue::onCreate so that it is clear that returning from this function doesn't have any shutdown guarantees?
    • Existing Rust apps might already rely on this behaviour (spawn their own threads and return).
  • #154 Is not really an issue at all, we should just document more properly that .finish() has to be called whenever the users wishes to terminate.
  • NativeActivity should either have a drop handler, or take self by value in fn finish(self) to prevent leaking dead state/pointer back to the user.

MarijnS95 avatar Jul 25 '21 10:07 MarijnS95

ndk_glue::main shouldn't really be called main, it's only part of the ANativeActivity_onCreate "callback". The special "finish-on-return" semantics I'm introducing here don't make much sense in that regard. Can we rename it to ndk_glue::onCreate so that it is clear that returning from this function doesn't have any shutdown guarantees?

That sounds good yes.

NativeActivity should either have a drop handler, or take self by value in fn finish(self) to prevent leaking dead state/pointer back to the user.

I guess it won't be possible to use it with the glue, as we only expose a fn native_activity() at the moment. we can wrap it in a RwLock and expose a fn finish_activity(), but that would be a breaking change and would probably require a PR to winit too.

dvc94ch avatar Jul 25 '21 11:07 dvc94ch

In hindsight I'm not too certain about the proposed changes above. The looper is explicitly associated with the current thread, and returning from fn main() means the looper must have been given up. It is apparently possible to pass a looper to a different thread using ALooper_acquire() and it's probably fine then to call the various poll*() functions from a different thread too.

I guess it won't be possible to use it with the glue, as we only expose a fn native_activity() at the moment. we can wrap it in a RwLock and expose a fn finish_activity(), but that would be a breaking change and would probably require a PR to winit too.

Yes a separate read and write lock function have to be exposed, or what I was experimenting with today: fn finish_native_activity() that takes and finishes the object, without ever exposing a public function with &mut access to the Option<NativeActivity>.

Perhaps it is a good idea to collect some more opinions on this complicated and controversial change. Maybe @msiglreith and @francesca64 have some suggestions?

MarijnS95 avatar Jul 25 '21 12:07 MarijnS95

hmm not too sure about these changes. I would imagine in winit we would call finish when the user passes ControlFlow::Exit and then forward the Destroy event to LoopDestroyed.

msiglreith avatar Jul 26 '21 17:07 msiglreith

@msiglreith Checking out winit it never calls .finish() for the user anywhere, probably leading to a stuck application. Crates using winit are responsible for using the ndk_glue::main macro directly themselves, making them responsible for knowing about returning from main() finishing the activity if we were to push this PR through in its current form.

What do you think about the takeaway steps listed in https://github.com/rust-windowing/android-ndk-rs/pull/160#issuecomment-886179020? In short:

  • Rename ndk_glue::main to ndk_glue::onCreate;

  • Use locks around NativeActivity and pass it by-value in .finish() or use a drop handler;

  • Document the need to call .finish() on application exit, most likely through an fn native_activity_finish() so that we don't have to move the NativeActivity to the user anywhere;

  • Use finish()/drop() in winit once this lands, whenever the user sets ControlFlow::Exit and forward Event::Destroyed per your description. Note that the documentation for onDestroy says:

    NativeActivity is being destroyed.

    At that point it's probably still a live, valid pointer (passed as sole argument to the callback) but fn native_activity() would already be None if we were to implement proper move semantics, so we have to carefully consider what happens. If ANativeActivity_finish blocks while those callbacks are being fired we can safely set it to None after it returns and keep by-reference semantics (but &mut so that the user can't call it when retrieving an immutable handle through the glue, only when building an app without glue). However, we're processing all these callbacks asynchronously when we should most likely wait inside onDestroy until the application (thread polling the event pipe) says: "okay, I'm done with the native activity now, go ahead!", just like the lock scenario in #117 / #134. That should be easy enough to solve with an RWLock as well, and advising any application that needs to use NativeActivity in Event::Destroy to get and store the read-lock beforehand...

MarijnS95 avatar Jul 27 '21 10:07 MarijnS95

@MarijnS95 I'm fine with these. Regarding finish vs drop I would prefer calling finish as it might be easier to discover/read than having the explicitly drop a static.

ANativeActivity_finish blocks while those callbacks are being fired we can safely set it to None after it returns and keep by-reference semantics

IIRC it just post some event to another thread/process and returns.

That should be easy enough to solve with an RWLock as well, and advising any application that needs to use NativeActivity in Event::Destroy to get and store the read-lock beforehand...

yeah. this seems to be the best way to move forward (even though a bit hacky).

msiglreith avatar Jul 29 '21 17:07 msiglreith

@msiglreith Thanks for your view, and apologies for letting this sit unattended for a while.

@MarijnS95 I'm fine with these. Regarding finish vs drop I would prefer calling finish as it might be easier to discover/read than having the explicitly drop a static.

~Assuming we never want to give the user ownership of the NativeActivity instance while still using the appropriate move semantics this is going to be a global fn finish_native_activity() anyway (the user won't be able to call fn finish(self) if it can't pull it out of the lock in the first place). Internally we can do it either way, but since finish is "magical" and not just cleaning up some object (it shuts down the application) it's probably better to keep it an explicit .take().finish(); indeed.~

EDIT: Realizing now that this doesn't make much sense, given what I wrote below. .finish() should be safe to call on an (immutable) reference to NativeActivity without assuming ownership, since the onDestroy callback will only be called after that. that's - just like onNativeWindowDestroyed - probably the only sensible place to set our global handle to None.

ANativeActivity_finish blocks while those callbacks are being fired we can safely set it to None after it returns and keep by-reference semantics

IIRC it just post some event to another thread/process and returns.

So the pointer to ANativeActivity stays alive after this, and the user is still allowed to call functions on it? Seeing as onDestroy receives this pointer it should still be valid, but Android might not allow to call certain functions anymore.

That should be easy enough to solve with an RWLock as well, and advising any application that needs to use NativeActivity in Event::Destroy to get and store the read-lock beforehand...

yeah. this seems to be the best way to move forward (even though a bit hacky).

It's already a pattern for NATIVE_WINDOW so it shouldn't hurt too much. If a developer relies on this they'll quickly run into a panic if the thing is already None. Either way I'll add this to the enum documentation on Event :)

MarijnS95 avatar Aug 07 '21 15:08 MarijnS95