libsidplayfp
Make filter::clock functions branchless for a 15-20% performance improvement
In my branch, the writeRES_FILT and writeMODE_VOL functions pre-calculate a single 8-bit mask that holds both the filter type and the filter mix, so that the two filter::clock functions have zero branches in them.
void Filter::writeRES_FILT ( uint8_t res_filt )
{
    filterModeRouting = ( filterModeRouting & 0xF0 ) | ( res_filt & 0x0F );
    currentResonance = resonance[ res_filt >> 4 ];
    updateMixing ();
}
//-----------------------------------------------------------------------------
void Filter::writeMODE_VOL ( uint8_t mode_vol )
{
    filterModeRouting = ( filterModeRouting & 0x0F ) | ( mode_vol & 0xF0 );
    currentVolume = volume[ mode_vol & 0x0F ];
    updateMixing ();
}
//-----------------------------------------------------------------------------
Here is the clock function for the 8580:
inline uint16_t clock ( float voice1, float voice2, float voice3 ) override
{
    // index 0 = unfiltered, index 1 = filtered
    int Vsum[ 2 ] = { 0, 0 };

    // Mix the voices according to the filter mode
    {
        const auto fltMd = filterModeRouting & 0xF;
        Vsum[ fltMd & 1 ] += fmc.getNormalizedVoice ( voice1 );
        Vsum[ ( fltMd >> 1 ) & 1 ] += fmc.getNormalizedVoice ( voice2 );
        Vsum[ ( fltMd >> 2 ) & 1 ] += fmc.getNormalizedVoice ( voice3 ) & voice3Mask;
        Vsum[ fltMd >> 3 ] += Ve;
    }

    // Apply filter
    {
        Vhp = currentSummer[ currentResonance[ Vbp ] + Vlp + Vsum[ 1 ] ];
        Vbp = hpIntegrator.solve ( Vhp );
        Vlp = bpIntegrator.solve ( Vbp );
    }

    // Mix filter outputs
    {
        const auto fltMd = ( ( filterModeRouting >> 4 ) & 7 ) ^ 7;
        Vsum[ fltMd & 1 ] += Vlp;
        Vsum[ ( fltMd >> 1 ) & 1 ] += Vbp;
        Vsum[ fltMd >> 2 ] += Vhp;
    }

    return currentVolume[ currentMixer[ Vsum[ 0 ] ] ];
}
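To show that the indexing trick routes samples exactly like the usual if/else chain, here is a standalone sketch using plain ints instead of the real filter state. The names Sums, routeBranching, routeBranchless, mixOutputsBranching, mixOutputsBranchless and variantsAgree are made up for this example and are not part of libsidplayfp. Only the index expressions are taken from the clock function above.

```cpp
#include <cstdint>

// Standalone illustration with plain ints; hypothetical names, not libsidplayfp code.
struct Sums
{
    int direct;    // bypasses the filter, corresponds to Vsum[ 0 ]
    int filtered;  // filter input, corresponds to Vsum[ 1 ]
};

// Reference version: one branch per input
inline Sums routeBranching ( std::uint8_t routing, int v1, int v2, int v3, int ve )
{
    Sums s { 0, 0 };
    if ( routing & 0x01 ) s.filtered += v1; else s.direct += v1;
    if ( routing & 0x02 ) s.filtered += v2; else s.direct += v2;
    if ( routing & 0x04 ) s.filtered += v3; else s.direct += v3;
    if ( routing & 0x08 ) s.filtered += ve; else s.direct += ve;
    return s;
}

// Branchless version: each routing bit is used directly as an array index
inline Sums routeBranchless ( std::uint8_t routing, int v1, int v2, int v3, int ve )
{
    int Vsum[ 2 ] = { 0, 0 };
    const auto fltMd = routing & 0xF;
    Vsum[ fltMd & 1 ] += v1;
    Vsum[ ( fltMd >> 1 ) & 1 ] += v2;
    Vsum[ ( fltMd >> 2 ) & 1 ] += v3;
    Vsum[ fltMd >> 3 ] += ve;
    return Sums { Vsum[ 0 ], Vsum[ 1 ] };
}

// Reference output stage: add each enabled filter output to the direct sum
inline int mixOutputsBranching ( std::uint8_t routing, int direct, int lp, int bp, int hp )
{
    int out = direct;
    if ( routing & 0x10 ) out += lp;
    if ( routing & 0x20 ) out += bp;
    if ( routing & 0x40 ) out += hp;
    return out;
}

// Branchless output stage: XOR with 7 inverts the mode bits, so an enabled
// output lands in Vsum[ 0 ] and a disabled one in the discarded Vsum[ 1 ]
inline int mixOutputsBranchless ( std::uint8_t routing, int direct, int lp, int bp, int hp )
{
    int Vsum[ 2 ] = { direct, 0 };
    const auto fltMd = ( ( routing >> 4 ) & 7 ) ^ 7;
    Vsum[ fltMd & 1 ] += lp;
    Vsum[ ( fltMd >> 1 ) & 1 ] += bp;
    Vsum[ fltMd >> 2 ] += hp;
    return Vsum[ 0 ];
}

// Exhaustive comparison over all 256 possible routing bytes
inline bool variantsAgree ()
{
    for ( int r = 0; r < 256; ++r )
    {
        const auto routing = static_cast<std::uint8_t> ( r );
        const Sums a = routeBranching ( routing, 1, 10, 100, 1000 );
        const Sums b = routeBranchless ( routing, 1, 10, 100, 1000 );
        if ( a.direct != b.direct || a.filtered != b.filtered )
            return false;
        if ( mixOutputsBranching ( routing, 5, 1, 10, 100 ) != mixOutputsBranchless ( routing, 5, 1, 10, 100 ) )
            return false;
    }
    return true;
}
```

Calling variantsAgree () from a unit test is a cheap way to guard the index expressions against typos, since there are only 256 routing bytes to check.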
So filterModeRouting is a uint8_t value containing the filter mix in bits 0-3 and the filter mode in bits 4-6. Bit 7 is the voice 3 off bit, carried over from the mode/volume register.
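As a rough illustration of that layout, the sketch below packs the two register writes into one byte and names the individual bits. The bit names follow the usual SID register description (FILT1-3 and FILTEX in the low nibble of RES/FILT, LP/BP/HP and 3OFF in the high nibble of MODE/VOL). The RoutingByte struct and its members are hypothetical and only mirror the masking done in writeRES_FILT and writeMODE_VOL.

```cpp
#include <cstdint>

// Bit positions inside the combined routing byte (names are illustrative)
enum : std::uint8_t
{
    FILT1      = 0x01,  // voice 1 routed through the filter
    FILT2      = 0x02,  // voice 2 routed through the filter
    FILT3      = 0x04,  // voice 3 routed through the filter
    FILTEX     = 0x08,  // external input routed through the filter
    MODE_LP    = 0x10,  // low-pass output enabled
    MODE_BP    = 0x20,  // band-pass output enabled
    MODE_HP    = 0x40,  // high-pass output enabled
    VOICE3_OFF = 0x80   // 3OFF bit
};

// Hypothetical stand-in for the relevant piece of Filter state
struct RoutingByte
{
    std::uint8_t value = 0;

    // Low nibble of the RES/FILT register replaces bits 0-3
    void writeRES_FILT ( std::uint8_t res_filt )
    {
        value = static_cast<std::uint8_t> ( ( value & 0xF0 ) | ( res_filt & 0x0F ) );
    }

    // High nibble of the MODE/VOL register replaces bits 4-7
    void writeMODE_VOL ( std::uint8_t mode_vol )
    {
        value = static_cast<std::uint8_t> ( ( value & 0x0F ) | ( mode_vol & 0xF0 ) );
    }

    bool isSet ( std::uint8_t bit ) const { return ( value & bit ) != 0; }
};
```

The resonance and volume nibbles never end up in the routing byte, so a write that only changes resonance or volume leaves the routing untouched.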
The int voice3Mask is initialized as INT_MAX and gets updated in updateMixing like so:
voice3Mask = ( filterModeRouting & 0x84 ) == 0x80 ? 0 : -1;
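The 0x84 test combines bit 7 (voice 3 off) with bit 2 (voice 3 routed through the filter), so voice 3 is only silenced when it is switched off and not fed into the filter. A small sketch of the four cases follows; voice3MaskFor and applyMask are hypothetical helper names. One thing worth noting: -1 passes any int through unchanged, while the initial INT_MAX only does so for non-negative samples. I'm assuming the normalized voice values are non-negative, in which case both behave the same.

```cpp
#include <cstdint>

// Same expression as in updateMixing, wrapped in a hypothetical helper
constexpr int voice3MaskFor ( std::uint8_t filterModeRouting )
{
    return ( filterModeRouting & 0x84 ) == 0x80 ? 0 : -1;
}

// AND-ing with 0 silences the sample, AND-ing with -1 (all bits set) keeps it
constexpr int applyMask ( int sample, int mask )
{
    return sample & mask;
}

// 3OFF set, voice 3 not filtered: muted
static_assert ( voice3MaskFor ( 0x80 ) == 0, "3OFF without FILT3 mutes voice 3" );
// 3OFF set, voice 3 filtered: still audible through the filter
static_assert ( voice3MaskFor ( 0x84 ) == -1, "3OFF with FILT3 keeps voice 3" );
// 3OFF clear: always audible
static_assert ( voice3MaskFor ( 0x04 ) == -1, "FILT3 alone keeps voice 3" );
static_assert ( voice3MaskFor ( 0x00 ) == -1, "no bits set keeps voice 3" );
```

The ternary only runs on register writes, so it does not matter whether the compiler turns it into a branch; the clock function only ever sees the finished mask.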
I hope you'll consider this for inclusion. It also makes the code shorter, although a bit harder to understand.
If you want to keep the ability to mute individual voices, you would need two more voice-masks. Turning the filter off would be trivial, as you would only need to apply a mask to the filterModeRouting in writeRES_FILT. That function gets called a lot less often than clock.
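A minimal sketch of how those user overrides could look, assuming one mask per voice plus a mask for the routing nibble. UserOverrides, muteVoice, enableFilter and mergeRES_FILT are invented names for illustration, not existing libsidplayfp API.

```cpp
#include <cstdint>

// Hypothetical user-facing overrides, kept outside the hot clock path
struct UserOverrides
{
    // -1 keeps a voice, 0 mutes it; applied with '&' to the normalized voice
    int voiceMask[ 3 ] = { -1, -1, -1 };

    // 0x0F passes the routing bits through, 0x00 forces every input past the filter
    std::uint8_t filtMask = 0x0F;

    void muteVoice ( int voice, bool mute ) { voiceMask[ voice ] = mute ? 0 : -1; }

    void enableFilter ( bool enable ) { filtMask = enable ? 0x0F : 0x00; }

    // Same merge as in writeRES_FILT, with the user mask applied to the low nibble
    std::uint8_t mergeRES_FILT ( std::uint8_t filterModeRouting, std::uint8_t res_filt ) const
    {
        return static_cast<std::uint8_t> ( ( filterModeRouting & 0xF0 ) | ( res_filt & filtMask ) );
    }
};
```

For voice 3, the user mask could simply be AND-ed with the existing voice3Mask in updateMixing, so clock would still apply a single mask per voice. If the filter is toggled while a tune is playing, the last written RES/FILT value would have to be merged again for the change to take effect.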
You could also add another specialization for when the user turns the filter completely off that doesn't do any filter mixing or filter calculations at all. It could be reduced to:
virtual inline uint16_t clock ( float voice1, float voice2, float voice3 )
{
    const auto Vsum = fmc.getNormalizedVoice ( voice1 )
                    + fmc.getNormalizedVoice ( voice2 )
                    + ( fmc.getNormalizedVoice ( voice3 ) & voice3Mask )
                    + Ve;
    return currentVolume[ currentMixer[ Vsum ] ];
}
I plan to do that when a tune doesn't use the filter. I have to store one bit per sub-tune in a database, but then the performance improvement is much bigger (around 50% faster).
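The per-sub-tune flag could be as small as the sketch below. FilterUsageFlags and its methods are hypothetical, and the storage format of the actual database is left open. The flags default to "uses the filter", so an unknown or out-of-range sub-tune always falls back to the full clock function.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-tune record: one bit per sub-tune
class FilterUsageFlags
{
public:
    // Conservative default: assume every sub-tune uses the filter
    explicit FilterUsageFlags ( std::size_t numSubTunes )
        : usesFilter ( numSubTunes, true )
    {
    }

    // Record the analysis result for one sub-tune (throws std::out_of_range on a bad index)
    void set ( std::size_t subTune, bool used ) { usesFilter.at ( subTune ) = used; }

    // True only when the sub-tune is known and known not to use the filter
    bool canSkipFilter ( std::size_t subTune ) const
    {
        return subTune < usesFilter.size () && ! usesFilter[ subTune ];
    }

private:
    std::vector<bool> usesFilter;  // packed, so roughly one bit per sub-tune
};
```

The player would then select the filter-free clock specialization only when canSkipFilter returns true for the sub-tune being started.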