BackgroundMusic
BackgroundMusic copied to clipboard
StopIO in BGMDriver is not triggered
Hi, thank you so much for the amazing work!
I found an issue with BGM_Device's StopIO: sometimes it's not called when apps stop using the BGM device, and I cannot find the reason why that happens. Therefore, the IO is left always running, consuming CPU and battery.
I tried to insert a LogWarning
on the first line of BGM_StopIO
and it does not even get called when this problem happens.
Any thoughts on why this might be happening? Thanks in advance.
Was able to reproduce the bug. The same problem happens with StartIO too which causes an app to not be connected to the Background Music at all.
Steps to reproduce:
- Add
LogWarning()
s into BGM_StartIO and BGM_StopIO. - Open some app (make sure that only that no other app is connected except this one though) that allows you to select audio devices (such as Skype) and switch devices back and forth (BGM Device <-> Built-in speaker) while playing test sound. Monitor the Console app's log.
- At some point you will see that the BGM_StartIO and BGM_StopIO are not called (happens once in ~5-7 tries).
Tested this on Big Sur 11.1.
@kyleneideck Do you have any ideas? As far as I understand the AudioDriverPlugin's StartIO and StopIO are both called directly by HAL. What could be affecting those calls? Could WillDoIOOperation()
be the cause?
Thanks for the report. I think this has probably happened for me, too. I'd guess it might have started happening when Big Sur came out. It also seems like it might happen when the system wakes up from sleep, or at least happen more often when it does.
I actually had a similar bug in some related work where the HAL wasn't calling StopIO
for every client when the driver deactivated/unpublished the device. It only happened on Big Sur. The fix was just to reset the client state (mClients
) when the driver reactivated the device.
I'll look into applying the same changes to BGMDriver when I get a chance, since they might be helpful for this bug. The extra debug logging should help, at least. It logs more details about which clients are starting and stopping IO, so we'll be able to tell from the logs which clients BGMDriver thinks are still doing IO when they've actually stopped.
Hopefully the HAL is giving our driver some other indication that all clients have stopped IO or something like that. It might help to try logging every notification the driver gets from the HAL to see if we can use one of those to tell when BGMDriver should stop IO even though the HAL might not call StopIO
for every client.
If the HAL really doesn't give us any indication, I'm not sure if there's anything we can do other than filing Radars about it. The only workaround I can think of is that we might be able to use kAudioServerPlugInIOOperationThread
. If the HAL passes it to our BGM_EndIOOperation
, but doesn't call BGM_StopIO
for the same client, we can probably assume that it should have. And similarly for BGM_BeginIOOperation
and BGMStartIO
.
Actually, I just had a look at the code for that and it looks like it might already do that. So maybe that means the HAL isn't calling BGM_EndIOOperation
either.
@alexmil23 Thanks, having repro steps is really helpful. I'll try them out when I get a chance and let you know what I find out.
It's interesting that you need to switch between devices to get the bug to happen. Presumably that would be making the app remove and re-add itself as a CoreAudio client each time. Does that mean that you can't reproduce the bug just by making an app stop and start IO repeatedly?
I think I might have had the HAL fail to call BGM_StartIO
as well. At least, I've got a spindump from a time when audio wasn't playing and should have been. Although, it might have been that the audio had been playing and then stopped for no reason, even though the app playing the audio wouldn't have stopped and restarted IO. I guess I should have taken some notes with it.
The spindump doesn't show too much, unfortunately. Just that BGMDevice wasn't running IO. But that does help narrow down the problem and indicates that the bug might have been caused by StartIO
not being called when it was supposed to be.
One other thing I should mention is that there's a bug (I think it's a bug) in BGMDriver where it doesn't allow for the possibility of one client starting multiple IO runs at the same time. That is, if the HAL calls BGM_StartIO
twice for the same client and then calls BGM_StopIO
once for that client, BGMDriver will assume that client has stopped IO completely. I haven't actually tested it yet, so I'm not sure if clients even can do that, but I think BGMDriver should still handle that case correctly to be safe.
I don't think that would explain why this bug is occurring, though, since it wouldn't keep the HAL from calling BGM_StopIO
.
Could WillDoIOOperation() be the cause?
I can't see anything in our WillDoIOOperation
that looks immediately suspicious to me. It more-or-less always returns the same values for each IO operation. Is there something about it that looks suspicious to you? It might help to have a look at some other driver's implementations of WillDoIOOperation
to see what they do differently.
@kyleneideck Thanks for the quick reply!
Presumably that would be making the app remove and re-add itself as a CoreAudio client each time. Does that mean that you can't reproduce the bug just by making an app stop and start IO repeatedly?
I tried to add LogWarning()
s to the BGM_AddDeviceClient
and BGM_RemoveDeviceClient
and it looked like device clients were only added when an audio app (e.g. Skype or music player) was launched or terminated on the latest Big Sur. In other words, device clients are not added or removed when you select the BGM device in audio settings on Big Sur (they are added only on startup or termination of a music app process), at least on my system.
Switching between devices in the audio app's settings seems to only invoke BGM_StartIO
and BGM_StopIO
which means that you can actually reproduce the bug by making an app stop and start IO repeatedly.
I also forgot to mention an additional step that I took to reproduce the bug: I disabled the logic that sets the BGM Device as system-wide default. Otherwise, it's hard to keep track of apps that interact/do IO with it.
Is there something about it that looks suspicious to you?
It doesn't look suspicious to me either, now that I took a look. Also, looked at similar implementations in other projects and they are all pretty much the same.
The fix was just to reset the client state (mClients) when the driver reactivated the device.
Please, could you elaborate on this a little more?
Thank you.
Man, looks like a nasty bug. Thanks for looking into this @kyleneideck and describing in detail @alexmil23. So I tried tweaking the code based on your comments above to test some of the assumptions.
Here are my findings:
-
Changed the sync StartIO calls from QueueSync to QueueAsync. It didn't help.
-
When I commented out all logic in
BGM_Device::StartIO
,StopIO
,BeginIOOperation
,EndIOOperation
the problem disappeared (start and stop io calls are not getting skipped anymore). That makes me think that something inside those methods causes this! Maybe a deadlock somewhere there? -
Another notable thing is that when I tried to restart CoreAudio Daemon and followed the steps to reproduce the bug, I saw
IOWorkLoop overload
fromHALC_IOContext
in the log saying that it skipped cycles due to an overload.
So now, the two best guesses on what might be happening here that I have: Deadlock somewhere in BGM_TaskQueue OR Methods taking too much time to execute which causes the overload. I timed them and they sometimes take 5-7 milliseconds to execute which might be slow according to #367.
@neilm-code Do you still have the code where you commented out the logic in BGM_Device::StartIO
, StopIO
, BeginIOOperation
, EndIOOperation
by any chance? I finally got around to trying it out, but the HAL still didn't always send the kAudioServerPlugInIOOperationThread
IO operation when it was supposed to.
If you do still have it, could you post a patch or push it to a branch or something?
The way I'm reproducing the bug is by starting IO in two different apps and then stopping IO in one of them (without BGMApp open). As far as I can tell, the HAL doesn't give BGMDriver any indication that one of the apps has stopped IO, except that it presumably stops doing IO operations for that client.
So if the app that's still running IO was BGMApp, BGMDriver wouldn't know that it should update its kAudioDeviceCustomPropertyDeviceIsRunningSomewhereOtherThanBGMApp
property and that would mean BGMApp wouldn't know to stop running IO as well.
The fix was just to reset the client state (mClients) when the driver reactivated the device.
Please, could you elaborate on this a little more?
I haven't gotten around to merging it yet (or figuring out whether it's actually useful to merge), but you can see the commit here: https://github.com/anteo/BackgroundMusic/commit/7d13ea0bb0f688a075c3fe2c4a6fe8146e6c86d0