libsidplayfp Make filter::clock functions branchless for a 15-20% performance improvement


Open reFX-Mike opened this issue 1 year ago • 1 comments

In my branch, the writeRES_FILT and writeMODE_VOL functions pre-calculate a single 8-bit mask that holds both the filter mix and the filter type, so that the two filter::clock functions have zero branches in them.

void Filter::writeRES_FILT ( uint8_t res_filt )
{
	filterModeRouting = ( filterModeRouting & 0xF0 ) | ( res_filt & 0x0F );

	currentResonance = resonance[ res_filt >> 4 ];

	updateMixing ();
}
//-----------------------------------------------------------------------------

void Filter::writeMODE_VOL ( uint8_t mode_vol )
{
	filterModeRouting = ( filterModeRouting & 0x0F ) | ( mode_vol & 0xF0 );

	currentVolume = volume[ mode_vol & 0x0F ];

	updateMixing ();
}
//-----------------------------------------------------------------------------

Here is the clock function for the 8580:

inline uint16_t clock ( float voice1, float voice2, float voice3 ) override
{
	// index 0 = unfiltered, index 1 = filtered
	int		Vsum[ 2 ] = { 0, 0 };

	// Mix the voices according to the filter mode
	{
		const auto	fltMd = filterModeRouting & 0xF;

		Vsum[ fltMd & 1 ]		+= fmc.getNormalizedVoice ( voice1 );
		Vsum[ ( fltMd >> 1 ) & 1 ]	+= fmc.getNormalizedVoice ( voice2 );
		Vsum[ ( fltMd >> 2 ) & 1 ]	+= fmc.getNormalizedVoice ( voice3 ) & voice3Mask;
		Vsum[ fltMd >> 3 ]		+= Ve;
	}

	// Apply filter
	{
		Vhp = currentSummer[ currentResonance[ Vbp ] + Vlp + Vsum[ 1 ] ];
		Vbp = hpIntegrator.solve ( Vhp );
		Vlp = bpIntegrator.solve ( Vbp );
	}

	// Mix filter outputs
	{
		const auto	fltMd = ( ( filterModeRouting >> 4 ) & 7 ) ^ 7;

		Vsum[ fltMd & 1 ]		+= Vlp;
		Vsum[ ( fltMd >> 1 ) & 1 ]	+= Vbp;
		Vsum[ fltMd >> 2 ]		+= Vhp;
	}

	return currentVolume[ currentMixer[ Vsum[ 0 ] ] ];
}
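To make the indexing trick in the first block of clock easier to check, here is a minimal stand-alone sketch comparing it with a conventional branchy version. The function names mixBranchy and mixBranchless are made up for illustration and are not part of libsidplayfp; the sketch only assumes that a set routing bit means "send this input into the filter".

```cpp
#include <cstdint>

// Hypothetical reference version: one branch per input.
// Vo collects the unfiltered inputs, Vi collects the filter inputs.
inline void mixBranchy(uint8_t filt, int v1, int v2, int v3, int ve, int& Vo, int& Vi)
{
    Vo = 0;
    Vi = 0;
    if (filt & 0x1) Vi += v1; else Vo += v1;
    if (filt & 0x2) Vi += v2; else Vo += v2;
    if (filt & 0x4) Vi += v3; else Vo += v3;
    if (filt & 0x8) Vi += ve; else Vo += ve;
}

// Hypothetical branchless version: each routing bit is used directly as
// an index into a two-element accumulator (0 = unfiltered, 1 = filtered).
inline void mixBranchless(uint8_t filt, int v1, int v2, int v3, int ve, int& Vo, int& Vi)
{
    int Vsum[2] = { 0, 0 };
    Vsum[filt & 1]        += v1;
    Vsum[(filt >> 1) & 1] += v2;
    Vsum[(filt >> 2) & 1] += v3;
    Vsum[(filt >> 3) & 1] += ve;
    Vo = Vsum[0];
    Vi = Vsum[1];
}
```

Both versions should give identical sums for all 16 routing values. Whether the indexed form is actually faster depends on the compiler and target, so it is worth measuring.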

So filterModeRouting is a uint8_t value containing the filter mix in bits 0-3 and the filter mode in bits 4-6. The int voice3Mask is initialized as INT_MAX and gets updated in updateMixing like so:

voice3Mask = ( filterModeRouting & 0x84 ) == 0x80 ? 0 : -1;
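The mask expression can be read as a small truth table. Because filterModeRouting takes its upper nibble from the mode/volume register, bit 7 is the SID "3OFF" bit, and bit 2 routes voice 3 into the filter. The sketch below wraps the same expression in a hypothetical free function (voice3MaskFor is not a libsidplayfp name) so that the four cases can be checked at compile time.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the expression above.
// Returns 0 (mute voice 3) only when 3OFF (0x80) is set and voice 3
// is not routed through the filter (0x04 clear); otherwise returns
// -1, i.e. all bits set, which leaves a value unchanged under '&'.
constexpr int voice3MaskFor(uint8_t filterModeRouting)
{
    return (filterModeRouting & 0x84) == 0x80 ? 0 : -1;
}

static_assert(voice3MaskFor(0x00) == -1, "3OFF clear, unfiltered: audible");
static_assert(voice3MaskFor(0x04) == -1, "3OFF clear, filtered: audible");
static_assert(voice3MaskFor(0x80) ==  0, "3OFF set, unfiltered: muted");
static_assert(voice3MaskFor(0x84) == -1, "3OFF set, filtered: still audible");
```

ANDing a sample with the result either passes it through or zeroes it, which is what lets clock skip the comparison.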

I hope you consider this for inclusion. It also makes the code shorter but a bit harder to understand.

If you want to keep the ability to mute individual voices, you would need two more voice-masks. Turning the filter off would be trivial, as you would only need to apply a mask to the filterModeRouting in writeRES_FILT. That function gets called a lot less often than clock.
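A rough sketch of how those two ideas might look, under the assumption that muting is a simple AND with a per-voice mask and that "filter off" means clearing the routing bits on write. RoutingSketch, filterEnableMask, voiceMask, muteVoice and maskedVoice are all hypothetical names invented for this example; only writeRES_FILT and filterModeRouting come from the code above.

```cpp
#include <cstdint>

// Hypothetical stand-alone sketch, not libsidplayfp code.
struct RoutingSketch
{
    uint8_t filterModeRouting = 0;
    uint8_t filterEnableMask  = 0x0F;          // 0x0F = filter usable, 0x00 = forced off
    int     voiceMask[3]      = { -1, -1, -1 }; // -1 keeps a voice, 0 mutes it

    // Same shape as writeRES_FILT above, with the routing nibble masked
    // so that no input is sent into the filter when it is disabled.
    void writeRES_FILT(uint8_t res_filt)
    {
        filterModeRouting = static_cast<uint8_t>(
            (filterModeRouting & 0xF0) | (res_filt & filterEnableMask));
    }

    // Called rarely (user action), so a branch here costs nothing in clock.
    void muteVoice(int voice, bool muted)
    {
        voiceMask[voice] = muted ? 0 : -1;
    }

    // What clock would apply to each normalized voice sample.
    int maskedVoice(int voice, int sample) const
    {
        return sample & voiceMask[voice];
    }
};
```

In a real implementation the voice 3 mask would also have to be combined with the existing voice3Mask, for example by ANDing the two together in updateMixing.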

Aug 27 '24 20:08 reFX-Mike

You could also add another specialization for when the user turns the filter completely off that doesn't do any filter mixing or filter calculations at all. It could be reduced to:

virtual inline uint16_t clock ( float voice1, float voice2, float voice3 )
{
	const auto	Vsum	= fmc.getNormalizedVoice ( voice1 )
				+ fmc.getNormalizedVoice ( voice2 )
				+ ( fmc.getNormalizedVoice ( voice3 ) & voice3Mask )
				+ Ve;

	return currentVolume[ currentMixer[ Vsum ] ];
}

I plan to do that when a tune doesn't use the filter. I have to store one bit per sub-tune in a database, but then the performance improvement is much bigger (around 50% faster).
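As an illustration only, the "one bit per sub-tune" record could be as small as a packed bit array. The class name FilterUsage and its methods are hypothetical, and how the bits get populated (for example by an offline scan of each tune) is left open here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-tune record: bit n set means sub-tune n uses the filter.
class FilterUsage
{
public:
    explicit FilterUsage(std::size_t subTunes)
        : bits((subTunes + 7) / 8, 0)
    {
    }

    void setUsesFilter(std::size_t subTune)
    {
        bits[subTune / 8] |= static_cast<uint8_t>(1u << (subTune % 8));
    }

    bool usesFilter(std::size_t subTune) const
    {
        return ((bits[subTune / 8] >> (subTune % 8)) & 1u) != 0;
    }

private:
    std::vector<uint8_t> bits; // eight sub-tunes per byte
};
```

A player could query usesFilter once when a sub-tune starts and pick the reduced clock variant for the whole playback, so the choice never appears inside the per-sample loop.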

Aug 27 '24 20:08 reFX-Mike
