nmos-cpp icon indicating copy to clipboard operation
nmos-cpp copied to clipboard

Segmentation fault after many IS-05 requests

Open wsneijers opened this issue 5 years ago • 9 comments

After updating receiver resources many times (overnight) a segmentation fault occours. The specific error:

>=thread-created,id="2",group-id="i1"
sThread 2 created.
>~"[New Thread 134.266]\n"
s[New Thread 134.266]
>~"[Switching to Thread 134.266]\n"
>*stopped,reason="signal-received",signal-name="SIGSEGV",signal-meaning="Segmentation fault",frame={addr="0x0f4cebb0",func="__PRETTY_FUNCTION__.4855",args=[],from="/mnt/SAMSUNG-SSD-1TB/Git/ancore-sw/sdk/sysroots/ppce300c3-oe-linux/usr/lib/libavahi-client.so.3"},thread-id="2",stopped-threads="all",core="0"
dNOTE: INFERIOR SPONTANEOUS STOP
sStopped.
dState changed from InferiorRunOk(11) to InferiorStopOk(14) [master]
<826importPlainDumpers off
dHANDLING SIGNAL SIGSEGV
sStopped: Segmentation fault (Signal SIGSEGV).

Unfortunately the dependencies (ex. libraries) are compiled with -o0. So no list in GDB. I do have a callstack though:

>~"#0  0x0f4cebb0 in __PRETTY_FUNCTION__.4855 () from /mnt/SAMSUNG-SSD-1TB/Git/ancore-sw/sdk/sysroots/ppce300c3-oe-linux/usr/lib/libavahi-client.so.3\n"
>~"#1  0x0f4ceba4 in __PRETTY_FUNCTION__.5048 () from /mnt/SAMSUNG-SSD-1TB/Git/ancore-sw/sdk/sysroots/ppce300c3-oe-linux/usr/lib/libavahi-client.so.3\n"
>~"#2  0x0f48ec60 in _dbus_list_foreach (list=0x489c98f8, function=0xf4ceb64 <__PRETTY_FUNCTION__.5024+4>, data=0xfffffffe) at /usr/src/debug/dbus/1.6.18-r0/dbus-1.6.18/dbus/dbus-list.c:759\n"
>~"#3  0x0f487f50 in _dbus_timeout_list_set_functions (timeout_list=timeout_list@entry=0x489c98f8, add_function=add_function@entry=0x0, remove_function=remove_function@entry=0x0, toggled_function=toggled_function@entry=0x0, data=data@entry=0x0, free_data_function=free_data_function@entry=0x0) at /usr/src/debug/dbus/1.6.18-r0/dbus-1.6.18/dbus/dbus-timeout.c:284\n"
>~"#4  0x0f48809c in _dbus_timeout_list_free (timeout_list=0x489c98f8) at /usr/src/debug/dbus/1.6.18-r0/dbus-1.6.18/dbus/dbus-timeout.c:214\n"
>~"#5  0x0f46d470 in _dbus_connection_last_unref (connection=0x5fdd5010) at /usr/src/debug/dbus/1.6.18-r0/dbus-1.6.18/dbus/dbus-connection.c:2690\n"
>~"#6  0x0f4c6b50 in filter_func (bus=<optimized out>, message=0x5fb36dd0, userdata=0x5fcfe7a0) at /usr/src/debug/avahi/0.6.31-r11.1/avahi-0.6.31/avahi-client/client.c:219\n"
>~"#7  0x10126cf4 in std::_Destroy<std::shared_ptr<pplx::details::_Task_impl<unsigned char> >*> (__first=<error reading variable: Cannot access memory at address 0x7>, __last=<error reading variable: Cannot access memory at address 0xb>) at /mnt/SAMSUNG-SSD-1TB/Git/ancore-sw/sdk/sysroots/ppce300c3-oe-linux/usr/include/c++/bits/stl_construct.h:126\n"
>1468^done

Any help would be appreciated. Thanks!

wsneijers avatar Jul 25 '19 07:07 wsneijers

Sorry for delay in responding. The call stack looks to be related to DNS-SD tasks that use Avahi (which is the only dependency of nmos-cpp that uses D-Bus). Scheduling and processing of DNS-SD tasks are completely unrelated to processing of IS-05 requests. So it's hard to draw conclusions from this so far... If you were able to reproduce with more debug information, memory utilisation, etc. that would hopefully help determine the cause. Thanks.

garethsb avatar Jul 29 '19 09:07 garethsb

No Problem, not quick in responding myself as well :). We noticed we were using an older version of avahi (0.6.31) and tried switching to 0.7. Though I think it ran better in the end it still crashed over night. I did get a d-bus warning though:

process 83: arguments to dbus_pending_call_free_data_slot() were incorrect, assertion "*slot_p >= 0" failed in file /build/tmp/work/ppce300c3-oe-linux/dbus/1.6.18-r0/dbus-1.6.18/dbus/dbus-pending-call.c line 790.
This is normally a bug in some application using the D-Bus library.
  D-Bus not built with -rdynamic so unable to print a backtrace

I'm going to try compiling d-bus and avahi with debug symbols now. Hopefully to get some more information.

wsneijers avatar Jul 31 '19 06:07 wsneijers

Hi @wsneijers, were you able to discover any more about this issue? It'd be good to know where the problem lies. Thanks.

garethsb avatar Aug 12 '19 17:08 garethsb

@garethsb-sony, We are still trying to reproduce the error with debugging enabled. However it appears that the error disappears when use a executable with debugging.

wsneijers avatar Aug 14 '19 14:08 wsneijers

Sorry for the long wait. We are still trying to reproduce the issue with GDB attached (for some reason it doesn't crash with a debugger). It also happens while not interacting with the IS-05 API. I do have some more detailed logging though:

2036-01-03 16:28:46.605: too much info: 1303376992: Immediate activation requested for receiver: cc38271a-6b64-51a2-893f-ebbcc73a34ff
2036-01-03 16:28:46.614: too much info: 1303376992: Notifying connection thread
2036-01-03 16:28:46.625: info: 1235219552: Processing immediate activation for receiver: cc38271a-6b64-51a2-893f-ebbcc73a34ff
2036-01-03 16:28:46.627: info: 1235219552: Received callback for resource cc38271a-6b64-51a2-893f-ebbcc73a34ff of type receiver
[232132.375831] nmosd[208]: unhandled signal 11 at 000014af nip 0f4732b0 lr 0f473e30 code 30001
2036-01-03 16:28:46.627: more info: 1235219552: Resource found index=1
2036-01-03 16:28:46.636: too much info: 1303376992: Notifying connection API - immediate activation completed
2036-01-03 16:28:46.645: info: 1387263072: Returning 8 matching receivers
2036-01-03 16:28:46.650: more info: 1286599776: Sending response
2036-01-03 16:28:46.659: more info: 1328542816: Sending response
2036-
2036-01-01 00:16:31.178: warning: 1513092192: Did not discover a suitable Registration API via DNS-SD
2036-01-01 00:16:32.184: more info: 1445983328: Trying Registration API discovery for about 30.000 seconds
2036-01-01 00:16:32.184: info: 1445983328: Attempting discovery of a Registration API
2036-01-01 00:16:32.185: more info: 1311765600: DNSServiceBrowse for regtype: _nmos-register._tcp domain:  on interface: 0
2036-01-01 00:16:32.192: more info: 1437594720: DNSServiceBrowse for regtype: _nmos-registration._tcp domain:  on interface: 0
2036-01-01 00:16:32.253: too much[  998.012585] nmosd[213]: unhandled signal 4 at 49a46a58 nip 49a46a58 lr 0f46ef58 code 30001
 info: 1437594720: After DNSServiceBrowse, DNSServiceProcessResu[  998.026526] nmosd[198]: unhandled signal 11 at 0000010a nip 0f4732b0 lr 0f473e30 code 30001
lt succeeded

wsneijers avatar Aug 19 '19 08:08 wsneijers

I'm sorry, how frustrating for you!

Apart from that you're operating in the far future (2036!), the logs look normal until penultimate line when we get:

2036-01-01 00:16:32.253: too much info: 1437594720: After DNSServiceBrowse, DNSServiceProcessResult succeeded

... interleaved with:

[  998.012585] nmosd[213]: unhandled signal 4 at 49a46a58 nip 49a46a58 lr 0f46ef58 code 30001
[  998.026526] nmosd[198]: unhandled signal 11 at 0000010a nip 0f4732b0 lr 0f473e30 code 30001

So that's signal 4, SIGILL, illegal instruction, followed by signal 11, SIGSEGV, invalid memory reference/segmentation fault.

I'm sorry, I'm not sure how to debug that...

garethsb avatar Aug 19 '19 08:08 garethsb

FWIW, the timestamps in the signal messages appear to be 5-6 seconds after the nmos-cpp logging (which is queued and output on a separate logging thread by default), which means we're missing some context at the point that it dies. (Assuming boot time is 2036-01-01 00:00:00!)

garethsb avatar Aug 19 '19 08:08 garethsb

Just to add more to the conversation. We are seeing this behavior on a PowerPC (big endian). We are also implementing the stack now on an ARM based product (little endian). Don't know if that would make a difference. As of now we haven't seen the behavior on the ARM based system.

wsneijers avatar Aug 19 '19 09:08 wsneijers

Unfortunately I personally have no experience with nmos-cpp on PowerPC, only Intel and ARM. I guess it would be good to know for certain that the SIGILL is as a result of memory corruption or the compiler actually having emitted an instruction that isn't supported on the device.

garethsb avatar Aug 19 '19 10:08 garethsb