Segmentation fault: 11 on mac when playback paused
Hi, this is a minor issue, but one I'd like to help get resolved.
Source bundle v0.18.1 compiled on macOS 10.14.6. Running snapclient in command line (instead of daemon to capture output) with no arguments required. When music ends or is paused the following error output occurs
[Notice] No chunk received for 5000ms. Closing Audio Queue.
Segmentation fault: 11
The segfault does not always appear immediately, sometimes a few seconds or minutes after the music stops. FYI when running as a daemon the process would automatically restart, so in normal conditions this issue might not be obvious – however it's a bug that I'd like to see squashed. The segmentation fault (signal 11) indicates an attempt to write to memory that has not been allocated to the process for writing. A common reason for this is when array bounds are exceeded.
From the fact this seems to occur when waiting for more audio to arrive, I was looking at the code in the coreaudio player worker loop. Although I could see the code for initAudioQueue, I was getting a bit lost in the Stream code where it looks for chunks.
If I were to add debug code in here, to try and pin down where the fault is actually being raised, do you have any suggestion of where else to look (e,g. where in stream.cpp)? And is there an example of good debug practice in another module that you'd suggest I follow? Thanks
You could also try to run the client using Valgrind:
valgrind --leak-check=yes path/to/snapclient ... 2> output.log
It helped me to find the memory issues in https://github.com/badaix/snapcast/pull/546.
What I prefer over valgrind is the address sanitizer, just build the client with make TARGET=MACOS SANITIZE=address. It gives a detailed stack trace when the application crashes. For a segmentation fault a regular debug build and run in gdb should help: make DEBUG=1.
I doubt that it is related to the stream class, I guess it's rather related to the initAudioQueue and uninitAudioQueue stuff. You can alse enable debug logging with --debug.
If this is some hard-to-fix bug, the easy solution would be to keep the player initialized and return silence as for instance the Oboe player does:
oboe::DataCallbackResult OboePlayer::onAudioReady(oboe::AudioStream* /*oboeStream*/, void* audioData, int32_t numFrames)
{
...
if (!stream_->getPlayerChunk(audioData, delay, numFrames))
{
// LOG(INFO) << "Failed to get chunk. Playing silence.\n";
memset(audioData, 0, numFrames * stream_->getFormat().frameSize());
}
else
{
adjustVolume(static_cast<char*>(audioData), numFrames);
}
return oboe::DataCallbackResult::Continue;
}
I don't know if it is the best practice/good style to free the player resource or not. I did it since the beginning in the ALSA player and took it over for the CoreAudio player.
For something that's 'part of the infrastructure' like snapcast, and is left running, waiting for sound, I think that you have chosen Good Practice by freeing up resources rather than hogging them. I'd feel wrong in suggesting you undo that work.
You have both given me good diagnostic pointers (especially as Mac is not my traditional dev workground) so I'll work forward with those. Thanks for your support, I'll post updates as I diagnose.
Although I was excited to learn of valgrind, the main project is not yet compatible with recent macos versions, and the fork I tried errored as soon as snapclient picked up the stream. So I went for Johannes's rebuild suggestion and got the more detailed error
2020-03-10 15-51-23.411 [Notice] (playerCallback) No chunk received for 5000ms. Closing Audio Queue.
AddressSanitizer:DEADLYSIGNAL
=================================================================
==27258==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000018 (pc 0x00010084c155 bp 0x700009584b30 sp 0x700009584600 T2)
==27258==The signal is caused by a WRITE memory access.
==27258==Hint: address points to the zero page.
#0 0x10084c154 (snapclient:x86_64+0x100118154)
#1 0x10084a7ea (snapclient:x86_64+0x1001167ea)
#2 0x10080c6f5 in void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (Player::*)(), Player*> >(void*) (snapclient:x86_64+0x1000d86f5)
#3 0x7fff68dd22ea in _pthread_body (libsystem_pthread.dylib:x86_64+0x32ea)
#4 0x7fff68dd5248 in _pthread_start (libsystem_pthread.dylib:x86_64+0x6248)
#5 0x7fff68dd140c in thread_start (libsystem_pthread.dylib:x86_64+0x240c)
==27258==Register values:
rax = 0x00000000000044e8 rbx = 0x00007000095849e0 rcx = 0x0000100000000000 rdx = 0x0000000000000008
rdi = 0x0000000000000018 rsi = 0x0000100000000000 rbp = 0x0000700009584b30 rsp = 0x0000700009584600
r8 = 0x000060f000007840 r9 = 0x0000700009584801 r10 = 0x000000000000007d r11 = 0x0000000000000002
r12 = 0x0000700009584600 r13 = 0x00001e00012b08ce r14 = 0x0000700009584828 r15 = 0x00001e00012b08c0
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (snapclient:x86_64+0x100118154)
Thread T2 created by T1 here:
#0 0x1009d102d in wrap_pthread_create (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x5402d)
#1 0x10080c54e (snapclient:x86_64+0x1000d854e)
#2 0x100808a86 (snapclient:x86_64+0x1000d4a86)
#3 0x1008251d6 (snapclient:x86_64+0x1000f11d6)
#4 0x1007cd40a (snapclient:x86_64+0x10009940a)
#5 0x1007ce25b (snapclient:x86_64+0x10009a25b)
#6 0x100801505 in void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (ClientConnection::*)(), ClientConnection*> >(void*) (snapclient:x86_64+0x1000cd505)
#7 0x7fff68dd22ea in _pthread_body (libsystem_pthread.dylib:x86_64+0x32ea)
#8 0x7fff68dd5248 in _pthread_start (libsystem_pthread.dylib:x86_64+0x6248)
#9 0x7fff68dd140c in thread_start (libsystem_pthread.dylib:x86_64+0x240c)
Thread T1 created by T0 here:
#0 0x1009d102d in wrap_pthread_create (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x5402d)
#1 0x10080135e (snapclient:x86_64+0x1000cd35e)
#2 0x1007c2f69 (snapclient:x86_64+0x10008ef69)
#3 0x10082a8da (snapclient:x86_64+0x1000f68da)
#4 0x100747764 (snapclient:x86_64+0x100013764)
#5 0x7fff68bde3d4 in start (libdyld.dylib:x86_64+0x163d4)
==27258==ABORTING
Abort trap: 6
Unfortunately, to my untrained eye, that gives little indication as to the source of the issue. So I guess I will have to come back to this later with the debug make and gdb.
After staring at the error message for a while, I came to the conclusion that my eyes must be also untrained rookies :) Seems that the code is inspired from this: https://gist.github.com/andormade/1360885/6b239efc7bc59a7dd038b9543b98bdd6afd40559
I have two guesses: I don't know what type AudioQueueRef is (pointer, reference, normal stack variable):
void CoreAudioPlayer::initAudioQueue()
{
...
AudioQueueRef queue;
AudioQueueNewOutput(&format, callback, this, CFRunLoopGetCurrent(), kCFRunLoopCommonModes, 0, &queue);
...
CFRunLoopRun();
}
Maybe this "reference?" get's invalid when leaving the scope/run-loop. It could be worth trying to make it a class member. My other guess: CoreAudio expects data to be enqueued within the callback, which doesn't happen in the "No chunks" situation:
void CoreAudioPlayer::playerCallback(AudioQueueRef queue, AudioQueueBufferRef bufferRef)
{
...
char* buffer = (char*)bufferRef->mAudioData;
if (!pubStream_->getPlayerChunk(buffer, delay, frames_))
{
if (chronos::getTickCount() - lastChunkTick > 5000)
{
LOG(NOTICE) << "No chunk received for 5000ms. Closing Audio Queue.\n";
// Maybe we need a last "AudioQueueEnqueueBuffer(queue, bufferRef, 0, NULL);" here
uninitAudioQueue(queue);
return;
}
memset(buffer, 0, buff_size_);
}
else
{
lastChunkTick = chronos::getTickCount();
adjustVolume(buffer, frames_);
}
AudioQueueEnqueueBuffer(queue, bufferRef, 0, NULL);
if (!active_)
{
uninitAudioQueue(queue);
}
}
@artmg do you have any new findings regarding this issue?
Sorry Johannes, this I have not used this system since your last post, so not had the opportunity to try out the proposed code. I will warm this one up again, and let you know. Art.
I have recompiled from your latest develop branch, and will proceed with testing and debugging, and if needed try your proposed code mods.
I have not been able to reproduce this issue after compiling from develop branch at commit https://github.com/badaix/snapcast/commit/860c0e5c8f57c3b04fe641cc168a3a3330305fe2 - Closing as resolved
This issue seems to have come back, so I tried to catch exceptions in the uninitAudioQueue with the commit https://github.com/artmg/snapcast/commit/8afe7e52f5ba0859f02c293fcc79157086eab756
2020-06-28 15-31-31.053 [Debug] (Stats) Chunk: -5 -1 -2 -2 29 136 0
2020-06-28 15-31-32.051 [Debug] (Stats) Chunk: -4 -3 -2 -2 39 138 0
2020-06-28 15-31-41.276 [Info] (Stream) abs(age > 500): 8799969
2020-06-28 15-31-41.276 [Debug] (Stats) Chunk: 87999 -2 -2 -2 43 126 0
2020-06-28 15-31-41.276 [Notice] (CoreAudioPlayer) No chunk received for 5000ms. Closing Audio Queue.
2020-06-28 15-31-41.276 [Debug] (CoreAudioPlayer) Freeing up audio queue
2020-06-28 15-31-41.276 [Debug] (Stream) age > 0: 8700ms
2020-06-28 15-31-41.276 [Notice] (CoreAudioPlayer) No chunk received for 5000ms. Closing Audio Queue.
2020-06-28 15-31-41.276 [Debug] (CoreAudioPlayer) Freeing up audio queue
libc++abi.dylib: terminating with uncaught exception of type CAException
Abort trap: 6
Because the exception is still uncaught (despite my try-catch overkill) I assume that the issue is not in that code but somewhere else instead.
Because the fault is happening in a different part of the code I guess I will now try making with debug and running it with gdb
soooooo, having finally gotten it running in gdb, this output doesn't mean so much to me :(
libc++abi.dylib: terminating with uncaught exception of type CAException
[New Thread 0x1a2b of process 846]
[New Thread 0x1b03 of process 846]
[New Thread 0x2703 of process 846]
[New Thread 0x2803 of process 846]
Thread 3 received signal SIGABRT, Aborted.
[Switching to Thread 0xd1f of process 846]
0x00007fff6019e2c2 in ?? () from /usr/lib/system/libsystem_kernel.dylib
(gdb)
So please do let me know how I can get some kind of indication of what is calling that library and why it's falling over when trying. I appreciate your support.
I realised that gdb must not be getting all the symbols from the system objects, and I came across lldb which gave me more useful information...
$ lldb -- snapclient --logfilter *:debug
(lldb) run
<snip>
2020-06-28 17-53-08.246 [Notice] (CoreAudioPlayer) No chunk received for 5000ms. Closing Audio Queue.
2020-06-28 17-53-08.246 [Debug] (CoreAudioPlayer) Freeing up audio queue
2020-06-28 17:53:08.246309+0100 snapclient[2510:46800] [aqsrv] 69: Exception caught in (null) - error -66671
2020-06-28 17:53:08.246355+0100 snapclient[2510:46800] [aqsrv] 69: Exception caught in (null) - error -66671
2020-06-28 17:53:08.246469+0100 snapclient[2510:46800] CAMutex.cpp:101:Lock: CAMutex::Lock: Could not lock the mutex
libc++abi.dylib: terminating with uncaught exception of type CAException
Process 2510 stopped
* thread #2, stop reason = signal SIGABRT
frame #0: 0x00007fff6019e2c2 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
-> 0x7fff6019e2c2 <+10>: jae 0x7fff6019e2cc ; <+20>
0x7fff6019e2c4 <+12>: movq %rax, %rdi
0x7fff6019e2c7 <+15>: jmp 0x7fff60198453 ; cerror_nocancel
0x7fff6019e2cc <+20>: retq
Target 0: (snapclient) stopped.
Does that perhaps give a better indication of which part of the player codebase is responsible for calling but not handling this exception? I'm happy to slap on some more try-catches if I know where.
Maybe it helps to stop the CFRunLoop first, to avoid that the play callback is being called during uninit:
void CoreAudioPlayer::uninitAudioQueue(AudioQueueRef queue)
{
CFRunLoopStop(CFRunLoopGetCurrent());
AudioQueueStop(queue, false);
AudioQueueDispose(queue, false);
pubStream_->clearChunks();
}
Closing due to inactivity