ndk Runtime problems when including multiple versions of ndk-glue

Having multiple versions of ndk-glue in the dependency tree leads to crashes because of uninitialized NATIVE_ACTIVITY.

Example: Including ndk-glue 0.5. When entering main(), NATIVE_ACTIVITY 0.5 gets initialized. Including Oboe crate that uses ndk-glue 0.4.0. When enumerating audio devices, i get:

thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', /Users/ric/.cargo/registry/src/github.com-1ecc6299db9ec823/ndk-glue-0.4.0/src/lib.rs:59:39

I tried using [patch.crates-io] but that does not seem to work.

This bug might be related to using resolver = "2" as required by wgpu.

Dec 24 '21 15:12 zmerp

That's not really a bug, Just How Things Work™.

That said, winit added a match table to prevent this (and we should have a comment or issue around here documenting the exact same crash), so this is a very real problem.

[patch.crates-io] only works when the versions are the same, allowing you to make small local modifications. It's impossible to use this to replace ndk-glue 0.4.0 with ndk-glue 0.5.0, since they're considered different crates in the same tree.

What we can however do is the semver trick: we release one final patch-release of ndk-glue 0.4.x, reexporting the functions from ndk-glue 0.5.x (and removing the statics): that causes duplication on the crate graph but makes sure that - assuming everyone sets the minimum patch version correctly in their toml or lock file - both are interoperable.

Dec 24 '21 16:12 MarijnS95

I thought of another hack: make NATIVE_ACTIVITY public (or create a unsafe setter). This way we can import multiple ndk-glue versions and for each additional one set the the native activity manually.

Dec 24 '21 16:12 zmerp

I was thinking about something like this:

Cargo.toml:

ndk-glue = "0.5"
ndk-glue-4 = { package = "ndk-glue", version = "0.4" }

main:

let native_activity_ptr = ndk_glue::native_activity().ptr().as_ptr();
unsafe { ndk_glue_4::set_native_activity(native_activity_ptr as _) };

We can add a patch release for v0.4 and v0.5 with this new function, I think it's cleaner. I can make a PR if you want.

Dec 24 '21 16:12 zmerp

I don't really like that approach as it's quite unsafe (ie. allows users to break things at will), requires careful consideration by the caller on what ndk-glue version the static is initialized versus what version(s) use it, and will also need to be updated.

While NATIVE_ACTIVITY itself won't change during the runtime of the program (except that we should drop it at the end, see #160), there are (at the time of writing) four more statics that need all of the above including updates during the runtime of the program while Android fires callbacks.

As such I don't think the manual approach scales.

Re-exporting pub accessor functions is tricky too, as we'd need to perform those pointer casts to wrapper-structs from different ndk versions while still holding on to the same lock guard. Since the layout between all the versions is the same we could transmute the lock guards to make them hold a different ndk version's newtype/wrapper struct. However, this approach removes all the thinking for the user - it'll work out of the box if we implement it correctly.

In the end it's probably best to just update oboe to a newer ndk-glue?

Jan 05 '22 14:01 MarijnS95

In the end it's probably best to just update oboe to a newer ndk-glue?

I already sent a PR for oboe some days ago, but your last release made it already deprecated. openxrs also breaks if I update ndk-glue to 0.6 in my project. This is not sustainable. If we can't introduce cross-version initialization functions we should at least make so ndk-glue does not need to be updated so often. The ndk crate dependency should be removed from ndk-glue, and maybe move some functionality inside the ndk crate.

Maybe we need a ndk-ptrs crate for storing the static pointers. The crate should have no dependencies (apart for lazy_static), and the pointers can be untyped (c_void).

Jan 05 '22 16:01 zmerp

In the end it's probably best to just update oboe to a newer ndk-glue?

@MarijnS95 In the end, is the best approach is to set ndk = "*" and ndk-glue = "*" in libraries that use android-ndk? It's such a pain to bump every library to use the newest version of ndk-glue.

Jan 13 '22 22:01 enfipy

In the end, is the best approach is to set ndk = "" and ndk-glue = "" in libraries that use android-ndk? It's such a pain to bump every library to use the newest version of ndk-glue.

Don't think you can publish a crate with wildcard version? Believe it only works for local crates

Jan 14 '22 11:01 repi

but your last release made it already deprecated. openxrs also breaks if I update ndk-glue to 0.6 in my project. This is not sustainable.

Indeed, and increased use of ndk-glue lately shows.

The ndk crate dependency should be removed from ndk-glue

Not easily possible since ndk-glue has to know of and call/use some of the functions/types from the ndk. We have better solutions.

and maybe move some functionality inside the ndk crate.

Nah, I think the ndk crate should purely hold the (safe) abstractions to the NDK. ndk-glue is "just" a helper (like Android's "official" glue layer) that users can leverage to simplify their Android setup. Either way it'll suffer from breaking changes to the ndk there too... Pick your poison.

Maybe we need a ndk-ptrs crate for storing the static pointers. The crate should have no dependencies (apart for lazy_static), and the pointers can be untyped (c_void).

I've been thinking about the exact same, we can even apply this retroactively to previous crate-releases as an additional patch. It'll just need a dependency on one of the crates that provides a Mutex/RWLock with a mapable inner value on the *LockGuard. That way we can convert from the pointer to the typed handle, and give the user what they asked for while held inside the *LockGuard.

@msiglreith @francesca64 @dvc94ch What are your thoughts on this? I/we can start working on a little proof of concept for this, having to make sure that we don't need to release any breaking changes to it for a good long while.

Alternatively we could do away with statics entirely and work with a heap allocation to which a pointer is stored in ANativeActivity::instance. Users would then have to pass this object around to get access to the NativeActivity and NativeWindow, and code would simply not compile if differing versions of ndk/ndk-glue are combined. Safer and more predictable, but the same dependency management hell for all the library crates involved.

Jan 17 '22 16:01 MarijnS95

@MarijnS95 In the end, is the best approach is to set ndk = "*" and ndk-glue = "*" in libraries that use android-ndk? It's such a pain to bump every library to use the newest version of ndk-glue.

@enfipy As @repi mentioned such crates cannot be published... But it should be possible to publish them with a range of compatible crate revisions. Perhaps ndk-glue could set something (imaginary, I didn't compatibility-test this) like ndk = ">= 0.3, <= 0.6". Then we only need to bump these (usually only the upper bound) in a patch release. (We can probably do this regardless of introducing an ndk-ptrs/statics crate mentioned above?)

Note that an upper bound is mandatory in my eyes to prevent a situation like https://github.com/gwihlidal/vk-mem-rs/issues/54. It's much safer for us to validate a setup and push a patch release enlarging the available range of versions than to retroactively publish a new patch release with the upper bound fixed to the last working/compatible version, and yanking everything without that upper bound from crates.io.

The only thing I am not sure about is crate selection. I assume, but have not tested this, that if a crate were to select ndk = "0.5" any crate with ndk-glue with the imaginary ndk = ">= 0.3, <= 0.6" dependency from above would get just ndk 0.5.0? Though perhaps if cargo were to try and select the highest or the lowest (-Zminimal-versions) you could end up with duplicates on different versions after all. Then only Embark's cargo-deny could save you (or the aforementioned suggestion to remove statics from the crate altogether).

Jan 17 '22 16:01 MarijnS95

I'm currently leaning towards the separate ptr crate approach as it seems to minimize the risk of breaking changes compared to just using a version range for ndk.

Regarding the ANativeActivity::instance I'm not totally sure if I understand it in detail. In particular who is owning the memory storing the pointers? The ANativeActivity::instance is then only for internal usage?

Jan 17 '22 17:01 msiglreith

why do you have two versions in your dependency tree? ndk-glue only needs to be a dependency of the windowing crate you use (either winit, or your own), and you can reexport anything you need from the windowing crate. not sure why you have it more than once in your dependency graph.

Jan 17 '22 21:01 dvc94ch

Here is a list of dependencies that we use that today all depend on ndk-glue for their Android version and as such run into this problem:

app_dirs2
cpal
oboe
webbrowser
winit

actually cpal (v0.13.4) is extra bad at this as it depends on both ndk-glue v0.4.0 directly and then through oboe depends on and brings in ndk-glue v0.6.0.

That's probably a cause of recent Android crashes we've seen from this issue with multiple ndk crate dependencies (cc @hrydgard ). Probably easy enough to pin within cpal to a single version, but this overall issue is very error prone and dangerous.

Jan 17 '22 21:01 repi

those libraries should probably expose an init function that takes a jnienv. although I can see that this is annoying yes.

Jan 17 '22 21:01 dvc94ch

so how about a crate that looks like this:

pub struct AndroidContext {
    jni_env: std::ffi::c_void,
   context: std::ffi::c_void,
}
pub fn android_context() -> AndroidContext;

#[doc(hidden)]
pub unsafe intialize_android_context(ctx: AndroidContext);

ndk-glue can initialize it and all library crates can get the stuff they need from the ndk-context crate? If you spawn a new thread you can use the get_java_vm method and then attach to the new thread. this would mean that downstream crates don't depend on ndk-glue (which they shouldn't).

Jan 17 '22 22:01 dvc94ch

Ups, I guess the approach was already mentioned, not very good at reading long threads. I claimed that crates shouldn't depend on ndk-glue without mentioning why. Risking pointing out the obvious, if you want to embed your library in a kotlin app that doesn't use native activity, the kotlin app will invoke initialize_android_context, so we should make it extern "C"

Jan 17 '22 23:01 dvc94ch

@msiglreith

I'm currently leaning towards the separate ptr crate approach as it seems to minimize the risk of breaking changes compared to just using a version range for ndk.

We can do both, the dependency range should result in less minor (breaking) updates for at least ndk-glue because we're not forcing the update of another breaking-change crate dependency anymore. But then again we'd have to validate how cargo selects the dependency and also make sure at least min and max versions are tested in the CI? If we do this for multiple crates possible combinations may explode exponentially though.

Regarding the ANativeActivity::instance I'm not totally sure if I understand it in detail. In particular who is owning the memory storing the pointers? The ANativeActivity::instance is then only for internal usage?

It seems to be the typical userdata pointer that is untouched by the NDK, you're free to assign any data you want but of course have to drop it when the ANativeActivity is cleaned up, presumably in onDestroy.

@dvc94ch Thanks for those insights! It indeed seems that many crates are erroneously using ndk-glue to get hold of the ANativeActivity for JNI purposes (or in one of my test-crates: the internal_data_path()).

Hence going back to my comment and your suggestion, also taking Kotlin into account: those crates should consume the necessary objects through an init function or constructor. We could pass it the needed JNI or NDK types, and get rid of static state and public getters altogether.

On the ndk-glue side we will then pass an ANativeActivity into fn main() from fn init(), and internally hold a pointer to our own state in instance so that there's no static globals anymore. We'll still have to wrap things in locks to be able to access them concurrently from ANativeActivityCallbacks.

If there's an NDK version mismatch things will simply refuse to compile (instead of the hard-to-debug failures we're observing right now), and we can use the existing ptr/unsafe from_ptr helpers to turn one into the other (leveraging the more stable ndk-sys crate types, and if all else a raw pointer can easily be cast too).

Jan 19 '22 09:01 MarijnS95

Hence going back to my comment and your suggestion, also taking Kotlin into account: those crates should consume the necessary objects through an init function or constructor. We could pass it the needed JNI or NDK types, and get rid of static state and public getters altogether.

ndk-ctx is likely still useful. The reason being that if a cross platform crate wants to add android support they'll likely not be very happy with having to pass around the android context, as evidence by the use of ndk-glue. A simple solution is likely going to have to be provided that doesn't force users to have to add new_with_extra_android_context to all their apis. This recursively infects everything, so the results are not pretty and could lead to substandard support for android through the eco system. I think my proposal is likely the best way forward to provide the convenience required to see a larger number of crates add some android logic.

Jan 19 '22 13:01 dvc94ch

Between @MarijnS95 and @dvc94ch ideas I prefer the latter. While @MarijnS95 idea is cleaner, @dvc94ch idea is the more practical IMO. Just one nit, I think "ndk-context" is clearer than "ndk-ctx". The ndk-context pointers can be RwLock<Option<NonNull<c_void>>>, and the ndk crate can use some constructors like from_context() that return RwLock<Type>. But the raw pointers could also also be used unsafely directly from ndk-context (this would be the case for openxrs). Do you need me to make a PR?

Jan 25 '22 13:01 zmerp

It turns out that splitting off ndk-context from ndk-glue in not clean at all. ndk-context gets contaminated for example with the Rect and Event structs. I think we should introduce a NdkApp, whose public API lives inside the ndk crate, but its logic can still live inside ndk-glue. ndk-context will contain only a static instance to this object (as a pointer). NdkApp could be initialized by crates with the ndk-context pointer + ndk crate; ndk-glue will do the same internally. Please give some feedback.

Jan 25 '22 15:01 zmerp

so the idea behind the ndk-context is that it only provides stuff required for interacting with android.jar. that's essentially these vm and activity pointers. this should be sufficient to access any java apis. this means that some things you can access directly from the ndk when you have an ANativeActivity will need to be accessed using the jni. The advantage is that the code should work even if it's not called from a NativeActivity. The reason why I haven't PR'ed this yet is because it's a bit of work to test all the cases (test that it works in a multi threaded context, works from kotlin, etc).

Jan 25 '22 17:01 dvc94ch

FYI: the vm and activity pointers should be valid after main is called from ndk-glue so there is no need for a RwLock.

Jan 25 '22 17:01 dvc94ch

@dvc94ch ok, maybe it's better for you to push this forward. But where should the poll_events() live? should it be moved inside winit?

Jan 25 '22 19:01 zmerp

I don't think ndk-glue needs to be changed at all except initializing ndk-context.

Jan 25 '22 20:01 dvc94ch

anyway, don't think I'll have time to look into this before next week

Jan 25 '22 22:01 dvc94ch

Opened a PR #223

Feb 07 '22 10:02 dvc94ch

ndk-glue 0.6.1 and ndk-context 0.1.0 are released.

Feb 14 '22 17:02 dvc94ch

I think we should close this issue as soon as most popular crates migrated to ndk-context:

https://crates.io/crates/ndk-glue/reverse_dependencies

https://github.com/app-dirs-rs/app_dirs2/pull/19
https://github.com/Ralith/openxrs/pull/108
https://github.com/RustAudio/cpal/pull/641
https://github.com/katyo/oboe-rs/pull/42
webbrowser
winit (needs more than just the VM/Context, https://github.com/MarijnS95/android-ndk-rs/compare/anativeactivity-instance?)

@zarik5 When making these PRs at least link back to the original issue and PR :)

Feb 15 '22 13:02 MarijnS95

think we can close now

Dec 06 '22 20:12 dvc94ch