MAVSDK-Java
MAVSDK-Java copied to clipboard
MAVSDK 2.1.0 Seg Fault Crashes When Tearing Down Server
Environment
- Herelink Controller (Android 7.1.2)
- Gazebo Headless Simulator (https://github.com/JonasVautherin/px4-gazebo-headless)
- MAVSDK-Java 2.1.0 and Server 2.1.1
(Does not occur with 1.3.1)
Repro
- Install example app (attached as ReproMavsdkCrash.zip)
- Start the server
- Connect the Simulator to the app (on port 14540)
- Wait until aircraft attitude is shown in Logcat
- Click "stop server"
- Crash
Stack trace
2024-09-19 16:36:17.585 14170-14336 MAVLinkService com.example.repromavcrash E Got attitude from aircraft: -1.2388214 -1.231313 93.332565
2024-09-19 16:36:17.595 14170-14336 MAVLinkService com.example.repromavcrash E Got attitude from aircraft: -1.2398055 -1.2350508 93.33456
2024-09-19 16:36:17.601 14170-14336 MAVLinkService com.example.repromavcrash E Got attitude from aircraft: -1.2399287 -1.2368649 93.33481
2024-09-19 16:36:17.631 273-13156 MVE_OMX_IF media.codec E Buffer abandoned. buffer_id 6f9. (mve_omx_common_component.c:1987)
2024-09-19 16:36:17.700 14170-14208 MAVLinkService com.example.repromavcrash E Server 14551 no longer active.
2024-09-19 16:36:17.700 14170-14215 libc com.example.repromavcrash A Fatal signal 11 (SIGSEGV), code 1, fault addr 0x16a in tid 14215 (DefaultDispatch)
2024-09-19 16:36:17.701 230-230 <no-tag> debuggerd64 W debuggerd: handling request: pid=14170 uid=10064 gid=10064 tid=14215
2024-09-19 16:36:17.739 14347-14347 DEBUG pid-14347 A *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
2024-09-19 16:36:17.739 14347-14347 DEBUG pid-14347 A Build fingerprint: 'CubePilot/song/song:7.1.2/N2G47J/RU01240522:user/test-keys'
2024-09-19 16:36:17.739 14347-14347 DEBUG pid-14347 A Revision: '0'
2024-09-19 16:36:17.739 14347-14347 DEBUG pid-14347 A ABI: 'arm64'
2024-09-19 16:36:17.740 14347-14347 DEBUG pid-14347 A pid: 14170, tid: 14215, name: DefaultDispatch >>> com.example.repromavcrash <<<
2024-09-19 16:36:17.740 14347-14347 DEBUG pid-14347 A signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x16a
2024-09-19 16:36:17.740 14347-14347 DEBUG pid-14347 A x0 000000000000016a x1 00000074d905f398 x2 0000000000000000 x3 00000074d8f5d4e8
2024-09-19 16:36:17.740 14347-14347 DEBUG pid-14347 A x4 00000074d8f5d460 x5 00000075146bef4c x6 00000074d8f5d450 x7 ffffffffffffffff
2024-09-19 16:36:17.740 14347-14347 DEBUG pid-14347 A x8 00000074d905f3d0 x9 fc83543f6d416706 x10 ffffffffc4653600 x11 0000000000000000
2024-09-19 16:36:17.740 14347-14347 DEBUG pid-14347 A x12 0000000000000001 x13 00000074d905ef88 x14 0000000000000041 x15 0000000000000000
2024-09-19 16:36:17.740 14347-14347 DEBUG pid-14347 A x16 00000074db55b220 x17 00000075146bf968 x18 000000750fbf07e0 x19 000000000000016a
2024-09-19 16:36:17.740 14347-14347 DEBUG pid-14347 A x20 00000074d905f3d0 x21 0000000000000022 x22 0000000000000000 x23 0000000000000000
2024-09-19 16:36:17.740 14347-14347 DEBUG pid-14347 A x24 00000074db2d5928 x25 00000074d905f3e8 x26 00000074db2d1258 x27 00000074db2d1398
2024-09-19 16:36:17.740 14347-14347 DEBUG pid-14347 A x28 00000074db2d1450 x29 00000074d905f380 x30 00000074db526ee0
2024-09-19 16:36:17.740 14347-14347 DEBUG pid-14347 A sp 00000074d905f380 pc 00000075146bf968 pstate 0000000060000000
2024-09-19 16:36:17.743 14347-14347 DEBUG pid-14347 A
backtrace:
2024-09-19 16:36:17.743 14347-14347 DEBUG pid-14347 A #00 pc 0000000000068968 /system/lib64/libc.so (pthread_mutex_lock)
2024-09-19 16:36:17.743 14347-14347 DEBUG pid-14347 A #01 pc 00000000000c5edc /data/app/com.example.repromavcrash-1/base.apk (offset 0x8000)
Hm, any chance you can get a backtrace with more info? It looks like it happens in some lock but I'm not sure which one and when.
So I nailed down the commit that introduced this particular crash, went through the different versions on the v2.12 branch on MAVSDK and saw that this: https://github.com/mavlink/MAVSDK/commit/71126ac267a10d1c5a7fcb7a3c963688b2b32277
Reverting this change while on the 2.12.6 tag made this particular crash go away.
But I do see the intent of the change was to improve connecting to systems, so we probably don't want to remove this but instead fix this somehow...
Great find, sorry for being slow here.
Given you seem to be able to build the server and try out different versions, any chance you can build debug mode and get more output where it crashes?
Or if you have hints how I can reproduce it with the example apps that would work too.
Problem is I'm not familiar with the Java stuff but I do want to help fix this bug.
Hi @julianoes - I attached a custom example app (https://github.com/user-attachments/files/17066260/ReproMavsdkCrash.zip) that can easily repro the issue.
It can be done by,
- Start example app
- Click "start server"
- Connect the Gazeo-Headless-PX4 sim to the app
- wait for attitude to start showing up
- Hit the "stop server" button
- Reproduced crash
Thanks for that. I can reproduce it, but I'm not sure what to make of it yet.
It looks like it crashes here: https://github.com/mavlink/MAVSDK/blob/27c5c30749076d205f8f5e46622171051b255386/src/mavsdk/core/log.h#L70-L83
But that doesn't look right if you ask me as the thombstone suggests pthread_mutex_lock.
Ok, so now I realized that you're starting 3 servers, one on each UDP port. Are you actually connecting 3 vehicles? Or is this just to connect three ports? If it is to do 3 ports but one vehicle, you might want to add some mavlink-router or MAVSDK based forwarder in-between.
In any case, the crash happens for the one where nothing is connected! I'll have a closer look into this.
I've finally managed to copy in my debug build mavsdk_server.so and added some printfs.
It turns out the _should_exit flag here is not set to true, for whatever mystery reason:
https://github.com/mavlink/MAVSDK/blob/v2.12/src/mavsdk_server/src/connection_initiator.h#L35
After plenty of head scratching I figured out what was going on, here is the fix: https://github.com/mavlink/MAVSDK/pull/2417.
Fix coming with https://github.com/mavlink/MAVSDK/releases/tag/v2.12.9.
@JonasVautherin do you mind making a Java release once the artifacts are ready? Thanks!
I guess I should wait for 2.12.10 with https://github.com/mavlink/MAVSDK/pull/2421?
@JonasVautherin no https://github.com/mavlink/MAVSDK/pull/2421 is only on main, not v2.12.
That's done in mavsdk_server:2.1.3 :+1:
I see some crashes with "ReproMavsdkCrash" app even after bumping mavsdk-server to 2.1.3. Crashes happen after some consecutive server start/stop calls, or if app is relaunched after a successful drone connection without stopping the server first.
@daniel-sales ok I will have to try to reproduce that again.
Ok, so now I realized that you're starting 3 servers, one on each UDP port. Are you actually connecting 3 vehicles? Or is this just to connect three ports? If it is to do 3 ports but one vehicle, you might want to add some mavlink-router or MAVSDK based forwarder in-between.
I'm still curious on this @rayw-dronesense and @daniel-sales.
@julianoes On my end I'm starting only one server on port (14551).
Regarding the connection issue, it is reproductible with the @rayw-dronesense's example app if you:
- launch the app (I left only 14551 on
portslist atMAVLinkService.kt) - start server
- wait for drone connection and attitude logs
- close the app without hitting "stop server"
- relaunch the app
Here's the stack trace:
2024-10-08 09:22:36.406 5635-5718 Mavsdk com.example.repromavcrash I MAVSDK version: v2.12.9
2024-10-08 09:22:36.414 5635-5718 MAVSDK-Server com.example.repromavcrash D Running mavsdk_server with connection url: udp://:14551
2024-10-08 09:22:36.418 5635-5991 Mavsdk com.example.repromavcrash I New system on: 127.0.0.1:47005 (with system ID: 1)
2024-10-08 09:22:36.418 5635-5991 Mavsdk com.example.repromavcrash D New system ID: 1 Comp ID: 1
2024-10-08 09:22:36.425 5635-5991 Mavsdk com.example.repromavcrash D Component Autopilot (1) added.
2024-10-08 09:22:36.444 5635-5991 Mavsdk com.example.repromavcrash D New system ID: 42 Comp ID: 190
2024-10-08 09:22:36.449 5635-5991 Mavsdk com.example.repromavcrash D Component Ground station (190) added.
2024-10-08 09:22:36.449 5635-5991 Mavsdk com.example.repromavcrash D Discovered 1 component(s)
2024-10-08 09:22:36.510 5635-5991 Mavsdk com.example.repromavcrash D Component Camera 1 (100) added.
2024-10-08 09:22:36.516 5635-5718 Mavsdk com.example.repromavcrash I System discovered
2024-10-08 09:22:36.529 5635-5718 Mavsdk com.example.repromavcrash I Server started
2024-10-08 09:22:36.529 5635-5718 Mavsdk com.example.repromavcrash I Server set to listen on 0.0.0.0:42723
2024-10-08 09:22:36.529 5635-5718 MAVSDK-Server com.example.repromavcrash D mavsdk_server is now running, listening on port 42723
2024-10-08 09:22:36.530 5635-5718 MAVLinkService com.example.repromavcrash E Got drone connection io.mavsdk.System@e7d61f5 on port 42723
2024-10-08 09:22:36.552 5635-5748 MAVLinkService com.example.repromavcrash E Connection state changed: false
2024-10-08 09:22:36.554 5635-5748 MAVLinkService com.example.repromavcrash E Connection state changed: true
2024-10-08 09:22:36.557 5635-5748 MAVLinkService com.example.repromavcrash E Connection state changed: false
2024-10-08 09:22:36.558 5635-5748 MAVLinkService com.example.repromavcrash E Connection state changed: true
2024-10-08 09:22:36.580 5635-6011 libc com.example.repromavcrash A Fatal signal 6 (SIGABRT), code -6 in tid 6011 (DefaultDispatch)
Ok, so this is a SIGABORT and happens when something is connected, not disconnected like before.
I can't reproduce this just yet. When you say "close the app", how do you do that? I'm using the emulator and hit stop, then I can just relaunch.
My guess is that you run into a bind issue with port already bound by the previous run.
I can't reproduce this just yet. When you say "close the app", how do you do that?
Hitting AndroidStudio "stop" button or using Android device UI to close the app. App crashes on next "start server" attempt.
My guess is that you run into a bind issue with port already bound by the previous run.
That makes sense. The crashes go away when I revert PR#2388, so maybe the new autopilot discovery method is interfering on the automatic port release by system when app is destroyed?
I honestly don't see how that would be related. And I could not reproduce it that way I'm afraid.
I just made more tests here and indeed, issue is not reproductible with px4-gazebo-headless simulator. However, it is easily reproductible on my real drone setup. So here is the detailed info for more reliable repro steps:
- Herelink Controller (Android 7.1.2)
- Devkit drone (CubeOrange+ with Ardupilot 4.5.3, Herelink v1.1 and Siyi A8 mini gimbal)
- ReproMavsdkCrash app (opening only port 14551), mavsdk 2.1.0 and mavsdk-server 2.1.3
Crash happening on "start server" command after a few start/stop loops, or on next "start server" after closing the app without manual server stop.
Stack trace
2024-10-11 10:52:57.492 8080-8267 Mavsdk com.example.repromavcrash I MAVSDK version: v2.12.9
2024-10-11 10:52:57.493 8080-8267 MAVSDK-Server com.example.repromavcrash D Running mavsdk_server with connection url: udp://:14551
2024-10-11 10:52:57.502 8080-8272 Mavsdk com.example.repromavcrash I New system on: 127.0.0.1:43748 (with system ID: 1)
2024-10-11 10:52:57.502 8080-8272 Mavsdk com.example.repromavcrash D New system ID: 1 Comp ID: 1
2024-10-11 10:52:57.503 8080-8272 Mavsdk com.example.repromavcrash D Component Autopilot (1) added.
2024-10-11 10:52:57.529 8080-8272 Mavsdk com.example.repromavcrash D New system ID: 42 Comp ID: 190
2024-10-11 10:52:57.530 8080-8272 Mavsdk com.example.repromavcrash D Component Ground station (190) added.
2024-10-11 10:52:57.530 8080-8272 Mavsdk com.example.repromavcrash D Discovered 1 component(s)
2024-10-11 10:52:57.595 8080-8267 Mavsdk com.example.repromavcrash I System discovered
2024-10-11 10:52:57.611 8080-8272 Mavsdk com.example.repromavcrash D Component Unsupported component (68) added.
2024-10-11 10:52:57.614 8080-8272 Mavsdk com.example.repromavcrash D Component Camera 1 (100) added.
2024-10-11 10:52:57.628 8080-8267 Mavsdk com.example.repromavcrash I Server started
2024-10-11 10:52:57.628 8080-8267 Mavsdk com.example.repromavcrash I Server set to listen on 0.0.0.0:34613
2024-10-11 10:52:57.628 8080-8267 MAVSDK-Server com.example.repromavcrash D mavsdk_server is now running, listening on port 34613
2024-10-11 10:52:57.631 8080-8267 MAVLinkService com.example.repromavcrash E Got drone connection io.mavsdk.System@5322373 on port 34613
2024-10-11 10:52:57.641 8080-8267 System.err com.example.repromavcrash W SLF4J(W): No SLF4J providers were found.
2024-10-11 10:52:57.641 8080-8267 System.err com.example.repromavcrash W SLF4J(W): Defaulting to no-operation (NOP) logger implementation
2024-10-11 10:52:57.641 8080-8267 System.err com.example.repromavcrash W SLF4J(W): See https://www.slf4j.org/codes.html#noProviders for further details.
2024-10-11 10:52:57.705 8080-8292 art com.example.repromavcrash W Before Android 4.1, method double java.util.concurrent.ThreadLocalRandom.internalNextDouble(double, double) would have incorrectly overridden the package-private method in java.util.Random
2024-10-11 10:52:57.705 8080-8292 art com.example.repromavcrash W Before Android 4.1, method int java.util.concurrent.ThreadLocalRandom.internalNextInt(int, int) would have incorrectly overridden the package-private method in java.util.Random
2024-10-11 10:52:57.705 8080-8292 art com.example.repromavcrash W Before Android 4.1, method long java.util.concurrent.ThreadLocalRandom.internalNextLong(long, long) would have incorrectly overridden the package-private method in java.util.Random
2024-10-11 10:52:57.785 8080-8296 MAVLinkService com.example.repromavcrash E Connection state changed: false
2024-10-11 10:52:57.787 8080-8296 MAVLinkService com.example.repromavcrash E Connection state changed: true
2024-10-11 10:52:57.795 8080-8296 MAVLinkService com.example.repromavcrash E Connection state changed: false
2024-10-11 10:52:57.796 296-321 BoardControl pid-296 E Unable to create sockfd: Connection refused
2024-10-11 10:52:57.799 8080-8296 MAVLinkService com.example.repromavcrash E Connection state changed: true
2024-10-11 10:52:57.826 8080-8299 libc com.example.repromavcrash A Fatal signal 6 (SIGABRT), code -6 in tid 8299 (DefaultDispatch)
2024-10-11 10:52:57.827 232-232 <no-tag> pid-232 W debuggerd: handling request: pid=8080 uid=10064 gid=10064 tid=8299
2024-10-11 10:52:57.913 8302-8302 DEBUG pid-8302 A *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
2024-10-11 10:52:57.914 8302-8302 DEBUG pid-8302 A Build fingerprint: 'CubePilot/song/song:7.1.2/N2G47J/RU01240301:user/test-keys'
2024-10-11 10:52:57.914 8302-8302 DEBUG pid-8302 A Revision: '0'
2024-10-11 10:52:57.914 8302-8302 DEBUG pid-8302 A ABI: 'arm64'
2024-10-11 10:52:57.915 8302-8302 DEBUG pid-8302 A pid: 8080, tid: 8299, name: DefaultDispatch >>> com.example.repromavcrash <<<
2024-10-11 10:52:57.915 8302-8302 DEBUG pid-8302 A signal 6 (SIGABRT), code -6 (SI_TKILL), fault addr --------
2024-10-11 10:52:57.915 8302-8302 DEBUG pid-8302 A x0 0000000000000000 x1 000000000000206b x2 0000000000000006 x3 0000000000000008
2024-10-11 10:52:57.915 8302-8302 DEBUG pid-8302 A x4 000000000000006e x5 0000000000000000 x6 0000007d09f70110 x7 0000000000000000
2024-10-11 10:52:57.915 8302-8302 DEBUG pid-8302 A x8 0000000000000083 x9 ffffffffffffffdf x10 0000000000000000 x11 0000000000000001
2024-10-11 10:52:57.915 8302-8302 DEBUG pid-8302 A x12 ffffffffffffffff x13 0000000000000020 x14 000000000000000c x15 0000000000000000
2024-10-11 10:52:57.915 8302-8302 DEBUG pid-8302 A x16 0000007d44dd3ee0 x17 0000007d44d7d364 x18 0000007d423f07e0 x19 0000007d09f704f8
2024-10-11 10:52:57.915 8302-8302 DEBUG pid-8302 A x20 0000000000000006 x21 0000007d09f70450 x22 0000000000000000 x23 0000007d2526e8c0
2024-10-11 10:52:57.916 8302-8302 DEBUG pid-8302 A x24 0000007d2526e8a0 x25 0000007d39b00fa0 x26 0000007d39a252d0 x27 0000000000000000
2024-10-11 10:52:57.916 8302-8302 DEBUG pid-8302 A x28 0000000000000000 x29 0000007d09f6fc60 x30 0000007d44d7a80c
2024-10-11 10:52:57.916 8302-8302 DEBUG pid-8302 A sp 0000007d09f6fc40 pc 0000007d44d7d36c pstate 0000000060000000
2024-10-11 10:52:57.930 8302-8302 DEBUG pid-8302 A
backtrace:
2024-10-11 10:52:57.930 8302-8302 DEBUG pid-8302 A #00 pc 000000000006b36c /system/lib64/libc.so (tgkill+8)
2024-10-11 10:52:57.930 8302-8302 DEBUG pid-8302 A #01 pc 0000000000068808 /system/lib64/libc.so (pthread_kill+64)
2024-10-11 10:52:57.930 8302-8302 DEBUG pid-8302 A #02 pc 0000000000023d68 /system/lib64/libc.so (raise+24)
2024-10-11 10:52:57.930 8302-8302 DEBUG pid-8302 A #03 pc 000000000001c7ec /system/lib64/libc.so (abort+52)
2024-10-11 10:52:57.930 8302-8302 DEBUG pid-8302 A #04 pc 0000000000003e0c /data/app/com.example.repromavcrash-1/base.apk (offset 0x147d000)
---------------------------- PROCESS ENDED (8080) for package com.example.repromavcrash ----------------------------
Ok, any chance you can debug this yourself by building the .so file in debug mode and copying it in as these instructions: https://github.com/mavlink/MAVSDK-Java?tab=readme-ov-file#debugging-without-pushing-to-maven
That's essentially what I did to debug this. It should hopefully tell us which abort call is happening.
It's been inactive for a while and we've been on MAVSDK 3 for months. Let's close this.