snapcast Segmentation fault: 11 on mac when playback paused

Hi, this is a minor issue, but one I'd like to help get resolved.

Source bundle v0.18.1 compiled on macOS 10.14.6. Running snapclient in command line (instead of daemon to capture output) with no arguments required. When music ends or is paused the following error output occurs

[Notice] No chunk received for 5000ms. Closing Audio Queue.
Segmentation fault: 11

The segfault does not always appear immediately, sometimes a few seconds or minutes after the music stops. FYI when running as a daemon the process would automatically restart, so in normal conditions this issue might not be obvious – however it's a bug that I'd like to see squashed. The segmentation fault (signal 11) indicates an attempt to write to memory that has not been allocated to the process for writing. A common reason for this is when array bounds are exceeded.

From the fact this seems to occur when waiting for more audio to arrive, I was looking at the code in the coreaudio player worker loop. Although I could see the code for initAudioQueue, I was getting a bit lost in the Stream code where it looks for chunks.

If I were to add debug code in here, to try and pin down where the fault is actually being raised, do you have any suggestion of where else to look (e,g. where in stream.cpp)? And is there an example of good debug practice in another module that you'd suggest I follow? Thanks

Mar 10 '20 10:03 artmg

You could also try to run the client using Valgrind:

valgrind --leak-check=yes path/to/snapclient ... 2> output.log

It helped me to find the memory issues in https://github.com/badaix/snapcast/pull/546.

Mar 10 '20 11:03 cmfcmf

What I prefer over valgrind is the address sanitizer, just build the client with make TARGET=MACOS SANITIZE=address. It gives a detailed stack trace when the application crashes. For a segmentation fault a regular debug build and run in gdb should help: make DEBUG=1. I doubt that it is related to the stream class, I guess it's rather related to the initAudioQueue and uninitAudioQueue stuff. You can alse enable debug logging with --debug.

Mar 10 '20 12:03 badaix

If this is some hard-to-fix bug, the easy solution would be to keep the player initialized and return silence as for instance the Oboe player does:

oboe::DataCallbackResult OboePlayer::onAudioReady(oboe::AudioStream* /*oboeStream*/, void* audioData, int32_t numFrames)
{
...
    if (!stream_->getPlayerChunk(audioData, delay, numFrames))
    {
        // LOG(INFO) << "Failed to get chunk. Playing silence.\n";
        memset(audioData, 0, numFrames * stream_->getFormat().frameSize());
    }
    else
    {
        adjustVolume(static_cast<char*>(audioData), numFrames);
    }

    return oboe::DataCallbackResult::Continue;
}

I don't know if it is the best practice/good style to free the player resource or not. I did it since the beginning in the ALSA player and took it over for the CoreAudio player.

Mar 10 '20 12:03 badaix

For something that's 'part of the infrastructure' like snapcast, and is left running, waiting for sound, I think that you have chosen Good Practice by freeing up resources rather than hogging them. I'd feel wrong in suggesting you undo that work.

You have both given me good diagnostic pointers (especially as Mac is not my traditional dev workground) so I'll work forward with those. Thanks for your support, I'll post updates as I diagnose.

Mar 10 '20 14:03 artmg

Although I was excited to learn of valgrind, the main project is not yet compatible with recent macos versions, and the fork I tried errored as soon as snapclient picked up the stream. So I went for Johannes's rebuild suggestion and got the more detailed error

2020-03-10 15-51-23.411 [Notice] (playerCallback) No chunk received for 5000ms. Closing Audio Queue.
AddressSanitizer:DEADLYSIGNAL
=================================================================
==27258==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000018 (pc 0x00010084c155 bp 0x700009584b30 sp 0x700009584600 T2)
==27258==The signal is caused by a WRITE memory access.
==27258==Hint: address points to the zero page.
    #0 0x10084c154  (snapclient:x86_64+0x100118154)
    #1 0x10084a7ea  (snapclient:x86_64+0x1001167ea)
    #2 0x10080c6f5 in void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (Player::*)(), Player*> >(void*) (snapclient:x86_64+0x1000d86f5)
    #3 0x7fff68dd22ea in _pthread_body (libsystem_pthread.dylib:x86_64+0x32ea)
    #4 0x7fff68dd5248 in _pthread_start (libsystem_pthread.dylib:x86_64+0x6248)
    #5 0x7fff68dd140c in thread_start (libsystem_pthread.dylib:x86_64+0x240c)

==27258==Register values:
rax = 0x00000000000044e8  rbx = 0x00007000095849e0  rcx = 0x0000100000000000  rdx = 0x0000000000000008
rdi = 0x0000000000000018  rsi = 0x0000100000000000  rbp = 0x0000700009584b30  rsp = 0x0000700009584600
 r8 = 0x000060f000007840   r9 = 0x0000700009584801  r10 = 0x000000000000007d  r11 = 0x0000000000000002
r12 = 0x0000700009584600  r13 = 0x00001e00012b08ce  r14 = 0x0000700009584828  r15 = 0x00001e00012b08c0
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (snapclient:x86_64+0x100118154)
Thread T2 created by T1 here:
    #0 0x1009d102d in wrap_pthread_create (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x5402d)
    #1 0x10080c54e  (snapclient:x86_64+0x1000d854e)
    #2 0x100808a86  (snapclient:x86_64+0x1000d4a86)
    #3 0x1008251d6  (snapclient:x86_64+0x1000f11d6)
    #4 0x1007cd40a  (snapclient:x86_64+0x10009940a)
    #5 0x1007ce25b  (snapclient:x86_64+0x10009a25b)
    #6 0x100801505 in void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (ClientConnection::*)(), ClientConnection*> >(void*) (snapclient:x86_64+0x1000cd505)
    #7 0x7fff68dd22ea in _pthread_body (libsystem_pthread.dylib:x86_64+0x32ea)
    #8 0x7fff68dd5248 in _pthread_start (libsystem_pthread.dylib:x86_64+0x6248)
    #9 0x7fff68dd140c in thread_start (libsystem_pthread.dylib:x86_64+0x240c)

Thread T1 created by T0 here:
    #0 0x1009d102d in wrap_pthread_create (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x5402d)
    #1 0x10080135e  (snapclient:x86_64+0x1000cd35e)
    #2 0x1007c2f69  (snapclient:x86_64+0x10008ef69)
    #3 0x10082a8da  (snapclient:x86_64+0x1000f68da)
    #4 0x100747764  (snapclient:x86_64+0x100013764)
    #5 0x7fff68bde3d4 in start (libdyld.dylib:x86_64+0x163d4)

==27258==ABORTING
Abort trap: 6

Unfortunately, to my untrained eye, that gives little indication as to the source of the issue. So I guess I will have to come back to this later with the debug make and gdb.

Mar 10 '20 16:03 artmg

After staring at the error message for a while, I came to the conclusion that my eyes must be also untrained rookies :) Seems that the code is inspired from this: https://gist.github.com/andormade/1360885/6b239efc7bc59a7dd038b9543b98bdd6afd40559

I have two guesses: I don't know what type AudioQueueRef is (pointer, reference, normal stack variable):

void CoreAudioPlayer::initAudioQueue()
{
...
    AudioQueueRef queue;
    AudioQueueNewOutput(&format, callback, this, CFRunLoopGetCurrent(), kCFRunLoopCommonModes, 0, &queue);
...
    CFRunLoopRun();
}

Maybe this "reference?" get's invalid when leaving the scope/run-loop. It could be worth trying to make it a class member. My other guess: CoreAudio expects data to be enqueued within the callback, which doesn't happen in the "No chunks" situation:

void CoreAudioPlayer::playerCallback(AudioQueueRef queue, AudioQueueBufferRef bufferRef)
{
...
    char* buffer = (char*)bufferRef->mAudioData;
    if (!pubStream_->getPlayerChunk(buffer, delay, frames_))
    {
        if (chronos::getTickCount() - lastChunkTick > 5000)
        {
            LOG(NOTICE) << "No chunk received for 5000ms. Closing Audio Queue.\n";

// Maybe we need a last "AudioQueueEnqueueBuffer(queue, bufferRef, 0, NULL);" here

            uninitAudioQueue(queue);
            return;
        }
        memset(buffer, 0, buff_size_);
    }
    else
    {
        lastChunkTick = chronos::getTickCount();
        adjustVolume(buffer, frames_);
    }

    AudioQueueEnqueueBuffer(queue, bufferRef, 0, NULL);

    if (!active_)
    {
        uninitAudioQueue(queue);
    }
}

Mar 10 '20 17:03 badaix

@artmg do you have any new findings regarding this issue?

May 08 '20 11:05 badaix

Sorry Johannes, this I have not used this system since your last post, so not had the opportunity to try out the proposed code. I will warm this one up again, and let you know. Art.

May 08 '20 11:05 artmg

I have recompiled from your latest develop branch, and will proceed with testing and debugging, and if needed try your proposed code mods.

May 08 '20 19:05 artmg

I have not been able to reproduce this issue after compiling from develop branch at commit https://github.com/badaix/snapcast/commit/860c0e5c8f57c3b04fe641cc168a3a3330305fe2 - Closing as resolved

May 10 '20 13:05 artmg

This issue seems to have come back, so I tried to catch exceptions in the uninitAudioQueue with the commit https://github.com/artmg/snapcast/commit/8afe7e52f5ba0859f02c293fcc79157086eab756

2020-06-28 15-31-31.053 [Debug] (Stats) Chunk: -5	-1	-2	-2	29	136	0
2020-06-28 15-31-32.051 [Debug] (Stats) Chunk: -4	-3	-2	-2	39	138	0
2020-06-28 15-31-41.276 [Info] (Stream) abs(age > 500): 8799969
2020-06-28 15-31-41.276 [Debug] (Stats) Chunk: 87999	-2	-2	-2	43	126	0
2020-06-28 15-31-41.276 [Notice] (CoreAudioPlayer) No chunk received for 5000ms. Closing Audio Queue.
2020-06-28 15-31-41.276 [Debug] (CoreAudioPlayer) Freeing up audio queue
2020-06-28 15-31-41.276 [Debug] (Stream) age > 0: 8700ms
2020-06-28 15-31-41.276 [Notice] (CoreAudioPlayer) No chunk received for 5000ms. Closing Audio Queue.
2020-06-28 15-31-41.276 [Debug] (CoreAudioPlayer) Freeing up audio queue
libc++abi.dylib: terminating with uncaught exception of type CAException
Abort trap: 6

Because the exception is still uncaught (despite my try-catch overkill) I assume that the issue is not in that code but somewhere else instead.

Because the fault is happening in a different part of the code I guess I will now try making with debug and running it with gdb

Jun 28 '20 14:06 artmg

soooooo, having finally gotten it running in gdb, this output doesn't mean so much to me :(

libc++abi.dylib: terminating with uncaught exception of type CAException
[New Thread 0x1a2b of process 846]
[New Thread 0x1b03 of process 846]
[New Thread 0x2703 of process 846]
[New Thread 0x2803 of process 846]

Thread 3 received signal SIGABRT, Aborted.
[Switching to Thread 0xd1f of process 846]
0x00007fff6019e2c2 in ?? () from /usr/lib/system/libsystem_kernel.dylib
(gdb)

So please do let me know how I can get some kind of indication of what is calling that library and why it's falling over when trying. I appreciate your support.

Jun 28 '20 15:06 artmg

I realised that gdb must not be getting all the symbols from the system objects, and I came across lldb which gave me more useful information...

$ lldb -- snapclient --logfilter *:debug
(lldb) run
<snip>
2020-06-28 17-53-08.246 [Notice] (CoreAudioPlayer) No chunk received for 5000ms. Closing Audio Queue.
2020-06-28 17-53-08.246 [Debug] (CoreAudioPlayer) Freeing up audio queue
2020-06-28 17:53:08.246309+0100 snapclient[2510:46800] [aqsrv] 69: Exception caught in (null) - error -66671
2020-06-28 17:53:08.246355+0100 snapclient[2510:46800] [aqsrv] 69: Exception caught in (null) - error -66671
2020-06-28 17:53:08.246469+0100 snapclient[2510:46800] CAMutex.cpp:101:Lock:  CAMutex::Lock: Could not lock the mutex
libc++abi.dylib: terminating with uncaught exception of type CAException
Process 2510 stopped
* thread #2, stop reason = signal SIGABRT
    frame #0: 0x00007fff6019e2c2 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff6019e2c2 <+10>: jae    0x7fff6019e2cc            ; <+20>
    0x7fff6019e2c4 <+12>: movq   %rax, %rdi
    0x7fff6019e2c7 <+15>: jmp    0x7fff60198453            ; cerror_nocancel
    0x7fff6019e2cc <+20>: retq
Target 0: (snapclient) stopped.

Does that perhaps give a better indication of which part of the player codebase is responsible for calling but not handling this exception? I'm happy to slap on some more try-catches if I know where.

Jun 28 '20 17:06 artmg

Maybe it helps to stop the CFRunLoop first, to avoid that the play callback is being called during uninit:

void CoreAudioPlayer::uninitAudioQueue(AudioQueueRef queue)
{
    CFRunLoopStop(CFRunLoopGetCurrent());
    AudioQueueStop(queue, false);
    AudioQueueDispose(queue, false);
    pubStream_->clearChunks();
}

Sep 07 '20 07:09 badaix

Closing due to inactivity

Mar 25 '24 09:03 badaix