fluidsynth
RFC: Support for auto-suspend / idle-handling
This is an RFC pull-request that adds experimental support for auto-suspend / idle-handling to FluidSynth.
Motivation behind this change is the fact that quite a few people have already expressed interest in a FluidSynth server process that uses as little CPU time as possible when not "in use", i.e. not actually rendering MIDI. It would also help myself with my musical instrument, as it would help to reduce CPU load and therefore improve battery life if the instrument is switched on but not actually in use at the moment.
The basic approach is based on the assumption that the renderer state only changes with events in the rvoice event queue. So as soon as there are no active voices in the renderer and no events in the queue, we consider it an "idle" run and add the current number of blocks towards the idle_block_threshold. If the threshold is reached, the synth is marked as idle. As soon as another event gets added to the queue, the renderer is marked busy again and any drivers waiting on the idle state get notified.
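The idle detection described above can be sketched roughly as follows. This is only an illustration of the counting logic, not the code from this PR: the type `idle_tracker_t` and the functions `idle_tracker_init`, `idle_tracker_update` and `idle_tracker_on_event` are hypothetical names, and only `idle_block_threshold` is taken from the description.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical idle tracker. All names except idle_block_threshold are
 * made up for this sketch and are not part of the FluidSynth API. */
typedef struct {
    unsigned idle_blocks;          /* consecutive idle blocks counted so far */
    unsigned idle_block_threshold; /* idle blocks required before suspending */
    bool is_idle;                  /* true once the threshold was reached */
} idle_tracker_t;

void idle_tracker_init(idle_tracker_t *t, unsigned threshold)
{
    t->idle_blocks = 0;
    t->idle_block_threshold = threshold;
    t->is_idle = false;
}

/* Called once per render pass. A pass counts as idle only if there are
 * no active voices and no pending events in the queue. */
bool idle_tracker_update(idle_tracker_t *t, unsigned active_voices,
                         size_t queued_events, unsigned blocks_rendered)
{
    if (active_voices > 0 || queued_events > 0) {
        /* Any activity resets the counter. */
        t->idle_blocks = 0;
        t->is_idle = false;
        return false;
    }
    if (!t->is_idle) {
        t->idle_blocks += blocks_rendered;
        if (t->idle_blocks >= t->idle_block_threshold) {
            t->is_idle = true;
        }
    }
    return t->is_idle;
}

/* Called when an event is added to the queue. A real driver would also
 * wake up any thread waiting on the idle state at this point. */
void idle_tracker_on_event(idle_tracker_t *t)
{
    t->idle_blocks = 0;
    t->is_idle = false;
}
```

A driver loop would call `idle_tracker_update` after each render pass and block on some wake-up primitive once it returns true. The wake-up itself (for example a condition variable) is left out of the sketch.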
Example support for this idle handling is added to the ALSA and PulseAudio drivers. ALSA works really well, reducing the CPU time for alsa and FS to 0%. ~~PulseAudio still consumes CPU time in idle wait, probably due to the use of the "simple" PA API. But that could probably be fixed fairly easily.~~ (Edit: I've also changed the PA driver so that the connection to the server is dropped during idle times. This effectively reduces CPU consumption of PA and FS to 0).
Also, sample timers obviously don't work if no renderer calls are made during idle wait, so the player uses the system timer if idle-timeout is set. But having an idle-timeout with midi files passed on the command-line is probably not a sensible use-case anyway.
Again... this is meant as a request-for-comments. I don't think the added calls in the renderer path introduce a noticeable overhead for people not using this feature, but that remains to be tested. And maybe I've missed parts of the synth that definitely require frequent renderer calls even if there is nothing to be rendered at the moment?
Looking forward to your feedback!
This pull request introduces 4 alerts when merging 20fed814128090c6ff90f75baaebfaf01c669121 into 056e29ea595d6a47b70614b3e2ce35612af15507 - view on LGTM.com
new alerts:
- 4 for FIXME comment
Some more notes about the implementation: It is obviously not a fool-proof auto-suspend. With a very short timeout value, there is the chance that audio from long-running effects (like a very long echo or reverb) is cut off. But the user has the ability to choose a higher timeout value if such long running effects are used.
It could be made more "fool-proof" by checking each sample in each block against FLUID_NOISE_FLOOR to determine if the renderer is really outputting silence. But my thinking here is that simply checking for active_voices and the event queue is good enough. Manually checking for silent samples would probably mean too large a runtime-cost, even if we would stop on the first value above the noise floor.
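For comparison, the per-sample check that was decided against could look roughly like the sketch below. The function name `block_is_silent` is hypothetical, and the noise floor is passed in as a parameter instead of assuming a particular value for `FLUID_NOISE_FLOOR`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of FluidSynth: returns true if every
 * sample in the block lies within [-noise_floor, +noise_floor].
 * Stops at the first sample above the noise floor. */
bool block_is_silent(const float *samples, size_t count, float noise_floor)
{
    size_t i;
    for (i = 0; i < count; i++) {
        if (samples[i] > noise_floor || samples[i] < -noise_floor) {
            return false; /* early exit on the first audible sample */
        }
    }
    return true;
}
```

The early exit only helps while there is audible output. A truly silent block still requires reading every sample, which is exactly the case the idle handling cares about.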
I would actually like to make one step backwards and ask: What causes fluidsynth to consume CPU time when rendering silence?
I just executed `fluidsynth -a jack /usr/share/sounds/sf2/FluidR3_GM.sf2`, ran it for about 1 minute and watched it in VTune (I used jack because alsa goes down all the way to pulse audio which causes heavy overhead in my case):
Reverb and Chorus are quite expensive. When disabling them it looks like that:
The `[Outside any known module]` seems to be deep in kernel space where all calls from fread, malloc, and syscall end up.
So I would be interested: Can you confirm that turning off reverb and chorus significantly reduces CPU usage for the fluidsynth-server-background process? And if this is the case, I would prefer to start optimizing the chorus / reverb handling when rendering silence (possibly by also using FLUID_NOISE_FLOOR, yes).
> So I would be interested: Can you confirm that turning off reverb and chorus significantly reduces CPU usage for the fluidsynth-server-background process?
Of course it does. But that's not really the problem this idle-timeout is fixing. It really comes down to reducing the overhead of the sound server and the audio driver thread. For example: if I comment out the call to `fluid_synth_write_float` in the PulseAudio driver (and memset the audio buffer to 0 once) and then let the audio driver render silence, the PulseAudio server still consumes between 15-25% CPU time, the audio driver thread about 3-7% CPU time on my machine. And that is due to the very low period-size default of 64 we have on Linux. While that is great for low-latency, the short period itself causes a significant load.
FluidSynth's CPU consumption is negligible compared to this overhead. So in my opinion, optimizing the reverb and chorus effects for lower CPU-consumption would be nice, but does not really help here.
For plain ALSA things look a bit better. "Normal" idle (i.e. without auto-suspend) uses around 6% CPU time. If I comment out the rendering function from the alsa driver and simply render silence, then CPU consumption for the audio driver thread is around 4% CPU time. So 2/3 of the CPU consumption happens in the driver loop itself. Mind you that is on my fairly old dual-core i5 and with a period-size of 940 (my soundcard or alsa setup doesn't want to go lower it seems). I haven't checked on my embedded ARM CPU yet, but I suspect the number will be much higher, making this idle-suspend a very interesting feature. And here I definitely want (and get) a small period size of 64, making the render loop much more costly.
> if I comment out the call to fluid_synth_write_float in the PulseAudio driver, the PulseAudio server still consumes between 15-25% CPU time, the audio driver thread about 3-7% CPU time on my machine. If I comment out the rendering function from the alsa driver and simply render silence, then CPU consumption for the audio driver thread is around 4% CPU time
Ok, so Pulse is way too high. Let's concentrate on alsa for now. And here it would be nice to know where the CPU usage comes from: does it come from overhead in the alsa world, or is it something we could optimize in fluidsynth?
And what's the CPU usage in jack driver? That would be interesting to know because of the callback based approach.
(I'm just a bit cautious, trying to understand things step by step, this feels like a potential premature optimization issue to me.)
Most of the usage seems to originate from snd_pcm_hw_writei()
fluidsynth -a alsa -R0 -C0 -o audio.alsa.device=hw:1 -r 48000 FluidR3_GM.sf2
As expected, increasing the bufsize decreases the CPU usage:
fluidsynth -a alsa -R0 -C0 -o audio.alsa.device=hw:1 -O float -r 48000 -z8192 FluidR3_GM.sf2
So if you need the small bufsize of 64 then yes, the only way is to either reduce calls to the driver API or use a different driver (jack? will test later).
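To put rough numbers on why the buffer size matters so much: the driver loop runs once per period, so the number of blocking write calls per second is simply the sample rate divided by the period size. A small sketch of that arithmetic, with hypothetical helper names:

```c
/* Hypothetical helpers illustrating the period-size arithmetic;
 * they are not part of FluidSynth or ALSA. */

/* Number of driver loop iterations (write calls) per second. */
double periods_per_second(unsigned sample_rate, unsigned period_frames)
{
    return (double)sample_rate / (double)period_frames;
}

/* Duration of one period in milliseconds. */
double period_duration_ms(unsigned sample_rate, unsigned period_frames)
{
    return (double)period_frames * 1000.0 / (double)sample_rate;
}
```

At 48000 Hz, a period of 64 frames means 750 loop iterations per second, each lasting about 1.3 ms. With `-z8192` the loop runs fewer than 6 times per second.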
Also note that `fluid_synth_write_s16_channels` is quite expensive:
It looks like reading the `rand_table` is expensive. Perhaps it could help to transpose the indices for better cache usage. Just mentioning it, not sure if that's an issue in your case at all.
> I'm just a bit cautious, trying to understand things step by step, this feels like a potential premature optimization issue to me.
Yes, very valid point. What really triggered me to start looking into auto-suspend was the fact that multiple people complained about the high CPU usage of PulseAudio, especially when rendering silence. The fact that it might help me with battery consumption as well is only an added bonus, a slight optimization. So I'm all for looking at other optimization areas first. However... I still think suspending audio output completely if there is no activity in FS is a good idea in general, regardless of any optimizations we can make in other areas. And the required changes to make this happen were much smaller and add much less extra complexity than I anticipated (unless I missed something major where the missing renderer calls cause problems).
So I will go ahead and experiment with two things: using the memory-mapped API variant for ALSA and implementing the asynchronous API for PulseAudio. I don't expect huge improvements from the memory-mapped ALSA variant, even though it will save copying the audio buffer, as it will require more API-calls and therefore more kernel/userspace context switches. But will see.
And as for PulseAudio: the high CPU usage here is really bugging me. And I think it's quite a relevant problem for us, as I think PulseAudio is the most used sound-server on Linux desktops at the moment. It seems like Pulse has a huge overhead via the simple API which becomes very noticeable with small period-sizes. I'll try to experiment with the asynchronous API to see if that is more efficient. If it's not... then we could consider raising the default period-size for Pulse. After all... people with low-latency requirements will most likely not choose PulseAudio as their sound server.
In 38e6fff is a very simple mmap-based ALSA driver format and handler. First analysis shows no improvement with regard to CPU consumption, quite the contrary (as expected).
> In 38e6fff is a very simple mmap-based ALSA driver format and handler. First analysis shows no improvement with regard to CPU consumption, quite the contrary (as expected).
And now the experimental ALSA mmap driver should work with lower period sizes as well. Still not "complete", as it lacks quite a bit of error handling. But should be enough to determine if mmap-ALSA drivers would offer us any benefits.
Ok... I think I understand why PulseAudio has such high CPU load with default FluidSynth settings. Default settings are sample-rate 44100, period-size 64 and enabled adjust-latency mode. This will configure the PA buffer attributes `tlength` and `minreq` as follows:
bufattr.tlength = 512; // (64 * sizeof(float) * 2)
bufattr.minreq = -1;
If `minreq` is -1, then PA will automatically set it to `tlength / 4`. So that means 128 bytes.
Internally PA reasons about latency in time, not buffer sizes. With our `samplespec` setup, 512 bytes `tlength` mean 1.45ms and the 128 bytes `minreq` mean 0.36ms.
And now PA's adjust-latency handling comes into play. It tries to configure the sink (i.e. ALSA hardware) to such a latency that the overall latency of the whole stream gets close to the latency requested in `tlength`. To do that it subtracts a safety margin of `2 * minreq` from `tlength` and then divides that by 2:
sink_usecs = (tlength_usecs - minreq_usecs * 2) / 2;
The result is that PulseAudio tries to set the ALSA latency to 0.36 ms. Well, it doesn't actually, because the minimum latency in PA is 0.5ms. So it tries to use that. And fails, obviously, because that is a ridiculously low latency. But it notices the underruns during playback and dynamically adjusts the latency up to the lowest value where no underruns happen anymore.
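The numbers above can be reproduced with a few lines of integer arithmetic. This is only a sketch of the calculation as described in this comment, not PulseAudio's actual code. The function names `bytes_to_usec` and `sink_latency_usec` are made up for illustration, and the sample format is assumed to be stereo 32-bit float at 44100 Hz.

```c
#include <stdint.h>

/* Hypothetical helper: converts a buffer size in bytes to microseconds
 * for the given sample spec (integer arithmetic, rounds down). */
uint64_t bytes_to_usec(uint64_t bytes, unsigned rate,
                       unsigned channels, unsigned bytes_per_sample)
{
    uint64_t frame_size = (uint64_t)channels * bytes_per_sample;
    uint64_t frames = bytes / frame_size;
    return frames * 1000000ULL / rate;
}

/* Hypothetical helper mirroring the formula quoted above:
 * sink_usecs = (tlength_usecs - minreq_usecs * 2) / 2 */
uint64_t sink_latency_usec(uint64_t tlength_usec, uint64_t minreq_usec)
{
    if (2 * minreq_usec >= tlength_usec) {
        return 0; /* guard against unsigned underflow */
    }
    return (tlength_usec - 2 * minreq_usec) / 2;
}
```

With `tlength` of 512 bytes and `minreq` of 128 bytes, `bytes_to_usec` gives 1451 and 362 microseconds, and `sink_latency_usec(1451, 362)` gives 363 microseconds, which matches the 0.36 ms requested sink latency.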
The result is that FluidSynth pushes PulseAudio to its limits. And it not only affects the PulseAudio server and FluidSynth, but potentially other sources that use the same sink as FluidSynth.
Long story short: In my opinion, we should raise the default period-size for PulseAudio to a much higher value, at least 512 or probably even 1024. I don't think anybody requiring a latency below 20ms is going to use PulseAudio as an audio backend anyway.
And even with a period-size of 1024, the PA server still consumes a significant amount of CPU time in addition to the rendering thread(s) in FluidSynth. So in my opinion, having an idle timeout here makes a lot of sense.
Additional note: with a period size above 4096 (~ 100ms latency) I start to see that the PA server actually uses less CPU time than FluidSynth does to render my test MIDI file. Below 4096 it consumes significantly more CPU time than FS.
Interesting. So if `audio.pulseaudio.adjust-latency` is TRUE, you would go for a default period size of 4096 for pulse? That would be ok for me.
Sorry for this late answer.
> Motivation behind this change is the fact that quite a few people have already expressed interest in a FluidSynth server process that uses as little CPU time as possible when not "in use", i.e. not actually rendering MIDI.
Marcus's experimentation seems to demonstrate that the CPU time is wasted mainly by PA. In this case the problem should be fixed in the PA code (and not by a workaround in the FS audio rendering code).
> Interesting. So if audio.pulseaudio.adjust-latency is TRUE, you would go for a default period size of 4096 for pulse? That would be ok for me.
Indeed, that could be an adequate behaviour as the PA audio driver is the interface between FS audio rendering and the PA server.
> It would also help myself with my musical instrument, as it would help to reduce CPU load and therefore improve battery life if the instrument is switched on but not actually in use at the moment.
Ok. But what happens if the musician starts playing a new song while FS is in its "sleeping / waiting" state? Can we be sure that the first note of this new song will always be played correctly in time?
Sorry for leaving this PR hanging for so long. I still think the approach is quite valid and solves the problem if people use FluidSynth as a server process, running always in the background. But I do wonder if it is worth investing more time here. What do you think?
Well, given that this has been open for 2 years and, other than a single thumbs up, nobody has shown interest or provided feedback, it seems like it's not worth it. Sorry.
Ok, then I'll close this PR.