
fm64.cpp OSC_CYCLE, modulation block idea

Open aminixduser opened this issue 4 years ago • 15 comments

Most operators don't have the feedback flag set, so change the comparison to test for not ALG_FBK_MASK and fall into the modulation block, with feedback handled in the else branch. This might save a few cycles per operator.

Inside that, switch on the value of i: case 0: nothing; case 1: only the MOD6 mask; case 2: MOD6 and MOD5; case 3: MOD6, MOD5, MOD4; case 4: (#ifdef OPSIX) MOD6, MOD5, MOD4 (#endif) plus MOD3; case 5: (#ifdef OPSIX) MOD6 (#endif) plus MOD5, MOD4, MOD3, MOD2. For non-OPSIX builds this saves many compares and conditional load/adds, because the DX7 algorithms only modulate operator 2 from operator 3.
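A sketch of the switch idea in plain C (the mask values, s_algorithm/s_opval layout, and the MODn to s_opval mapping are stand-ins modeled on the snippets in this thread, not the actual fm64.cpp definitions):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in masks and state; the real fm64.cpp bit layout may differ. */
#define ALG_MOD6_MASK 0x20u
#define ALG_MOD5_MASK 0x10u
#define ALG_MOD4_MASK 0x08u
#define ALG_MOD3_MASK 0x04u
#define ALG_MOD2_MASK 0x02u

static uint8_t s_algorithm[6];
static int32_t s_opval[6];

/* Per-operator switch: only test the modulator bits that can actually
   apply to operator i, instead of testing all five masks every time. */
static int32_t modulation_switch(int i) {
  int32_t modw0 = 0;
  switch (i) {
  case 0:
    break;
  case 1:
    if (s_algorithm[i] & ALG_MOD6_MASK) modw0 += s_opval[0];
    break;
  case 2:
    if (s_algorithm[i] & ALG_MOD6_MASK) modw0 += s_opval[0];
    if (s_algorithm[i] & ALG_MOD5_MASK) modw0 += s_opval[1];
    break;
  case 3:
    if (s_algorithm[i] & ALG_MOD6_MASK) modw0 += s_opval[0];
    if (s_algorithm[i] & ALG_MOD5_MASK) modw0 += s_opval[1];
    if (s_algorithm[i] & ALG_MOD4_MASK) modw0 += s_opval[2];
    break;
  case 4:
#ifdef OPSIX
    if (s_algorithm[i] & ALG_MOD6_MASK) modw0 += s_opval[0];
    if (s_algorithm[i] & ALG_MOD5_MASK) modw0 += s_opval[1];
    if (s_algorithm[i] & ALG_MOD4_MASK) modw0 += s_opval[2];
#endif
    if (s_algorithm[i] & ALG_MOD3_MASK) modw0 += s_opval[3];
    break;
  case 5:
#ifdef OPSIX
    if (s_algorithm[i] & ALG_MOD6_MASK) modw0 += s_opval[0];
#endif
    if (s_algorithm[i] & ALG_MOD5_MASK) modw0 += s_opval[1];
    if (s_algorithm[i] & ALG_MOD4_MASK) modw0 += s_opval[2];
    if (s_algorithm[i] & ALG_MOD3_MASK) modw0 += s_opval[3];
    if (s_algorithm[i] & ALG_MOD2_MASK) modw0 += s_opval[4];
    break;
  }
  return modw0;
}
```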

aminixduser avatar Nov 22 '20 20:11 aminixduser

I tried a slightly different approach: first check & (ALG_FBK_MASK - 1), since we have nothing to do if there are no modulators; then check for not ALG_FBK_MASK, else process feedback. I made a typo, though, and reverted the change - there was no big benefit.

Actually, moving the feedback check order does not make much sense. With ARM's 2-bit branch prediction, the 1-in-6 operator with feedback will never be predicted correctly, regardless of its position in the if-then-else construct.

A switch will generate a lot of jumps; such spaghetti won't run faster than the current if-then-else construct.

dukesrg avatar Nov 23 '20 19:11 dukesrg

@aminixduser Different operator-modulator relations could speed up the whole process if the operator loop is unrolled. But this would cost about 1.5K of code size :(

dukesrg avatar Nov 23 '20 20:11 dukesrg

That's quite a large code growth. Would it be possible to put most of the process into functions to reduce the growth? That would probably hurt speed with the extra loads and stores, though :(

aminixduser avatar Nov 24 '20 05:11 aminixduser

Yes, a non-inline function means instruction-pointer changes and will slow everything down. One more option is to unroll the matrix per algorithm. But again, that would be a 32-way compare-branch construct per operator, or an indexed table lookup with a function call. I don't think either of these will be faster than the current test-and-conditionally-execute approach. A debugger is needed to check this for sure, but my gear is again stuck somewhere in transit :(

dukesrg avatar Nov 24 '20 06:11 dukesrg

Feedback could be moved from if-then-else to a simple if-then, so it becomes the same bit test plus two conditional adds, with no branching. But the main matrix still needs two checks:

  1. that the operator is modulated at all
  2. that it is not modulated by feedback (which is precalculated)

I can't find a single condition check for that case without introducing one more precalculated algorithm value, which would use one more register.
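A minimal sketch of the plain if-then feedback, assuming a DX-style two-sample feedback average (the mask value, arithmetic, and names are assumptions, not the exact fm64.cpp code):

```c
#include <assert.h>
#include <stdint.h>

#define ALG_FBK_MASK 0x40u  /* assumed bit position for the feedback flag */

/* Plain if-then with no else: on Cortex-M this can compile to a bit test
   plus an IT block of conditionally executed adds, i.e. no branch at all. */
static int32_t apply_feedback(uint8_t alg, int32_t modw0,
                              int32_t fb0, int32_t fb1, int fb_shift) {
  if (alg & ALG_FBK_MASK)
    modw0 += (fb0 + fb1) >> fb_shift;  /* averaged feedback (assumed form) */
  return modw0;
}
```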

dukesrg avatar Nov 24 '20 07:11 dukesrg

If you are looking for a spare bit in the algorithm bit mask, you can use the ALG_MOD1_MASK bit, because it isn't used: operator 0 is never a modulation or feedback source.

aminixduser avatar Nov 24 '20 14:11 aminixduser

@aminixduser Looks like I found a way: if ((int8_t)(algorithm[i]<<1) > 0) { }

upd: nope, since the shift does not update the flags needed for a signed test, an additional cmp is required - so it's the same number of instructions.
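For reference, the trick written out in plain C (bit positions assumed: feedback at 0x40, modulator bits below it): the left shift discards bit 7 and moves the feedback flag into the sign position, so a signed greater-than-zero test reads as "some modulator bit set and no feedback".

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: feedback flag at bit 6, modulator bits at bits 1..5. */
#define ALG_FBK_MASK 0x40u

/* True iff any modulator bit below the feedback bit is set and the
   feedback bit is clear: <<1 shifts the feedback flag into the sign bit
   of the int8_t, so the result is positive only in the wanted case. */
static int modulated_not_feedback(uint8_t alg) {
  return (int8_t)(alg << 1) > 0;
}
```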

dukesrg avatar Nov 24 '20 15:11 dukesrg

How about this... compile it and take a look. More branches, but I think it could save cycles depending on how good branch prediction is. This uses a goto and a label after the else { feedback add } block:

if (!feedback) {
  if (s_algorithm[i] & ALG_MOD6_MASK) modw0 += s_opval[0];
  if (i <= 1) goto post_fb;
  if (s_algorithm[i] & ALG_MOD5_MASK) modw0 += s_opval[1];
  if (i == 2) goto post_fb;
  if (s_algorithm[i] & ALG_MOD4_MASK) modw0 += s_opval[2];
  if (i == 3) goto post_fb;
  if (s_algorithm[i] & ALG_MOD3_MASK) modw0 += s_opval[3];
  if (i == 4) goto post_fb;
  if (s_algorithm[i] & ALG_MOD2_MASK) modw0 += s_opval[4];
} else {
  /* feedback add block */
}
post_fb:

aminixduser avatar Nov 25 '20 04:11 aminixduser

@aminixduser I was just thinking about this, but in a different way. Since by design the carrier always has a greater number, we actually have 15 possible modulations per sample instead of the 30 that are checked now. Adding a loop that iterates up to the current operator would technically add the same 15 conditional checks, plus explicit branching for the loop with additional register use and an increment. Your example is effectively that same loop unrolled: no increment, but still the same 30 condition checks for the mod matrix calculation per sample, and it has the same explicit branches. ARM branch prediction is 2-bit, so with 6 operators and 5 modulations it does not help at all. But there is a trick to check, though it is most likely only technically possible in pure asm; just metacode:

jmp [l6 - (l2-l1) * i]
l1:
if (s_algorithm[i] & ALG_MOD2_MASK) modw0 += s_opval[4];
l2:
if (s_algorithm[i] & ALG_MOD3_MASK) modw0 += s_opval[3];
l3:
if (s_algorithm[i] & ALG_MOD4_MASK) modw0 += s_opval[2];
l4:
if (s_algorithm[i] & ALG_MOD5_MASK) modw0 += s_opval[1];
l5:
if (s_algorithm[i] & ALG_MOD6_MASK) modw0 += s_opval[0];
l6:

Overhead is:

  • 1 additional reg
  • 6 * 2 shifted subs (for the 10 bytes of one operator's check-modulate)
  • 6 adds (for the l6 displacement)
  • 6 unconditional dynamic-address branches

Total: +24 ops per sample and roughly -15 conditional modulations (*4 for total ops skipped, or *2 if with IT the conditionally executed ops are skipped with no penalty) = 6 or 36 ops saved. How many cycles this will be - IDK :(

upd: found it: https://www.keil.com/support/man/docs/armasm/armasm_dom1359731162809.htm - with conditional execution, skipped ops consume 1 cycle. So assuming load ops consume 6 cycles, we're currently saving 5 cycles on each skipped load in the mod matrix, i.e. 65 cycles are saved by the current IT construct. On the other hand, the jump construct will consume at least 6 * (2 * 1 + 1 + 3) but save at least 15 * (1 + 1 + 6 + 1), i.e. about 100 cycles vs 65 cycles saved. The goto version still needs counting; at first view it should be on par with the current one in cycles. I've finally found the Cortex-M4 timings! https://developer.arm.com/documentation/100166/0001/Programmers-Model/Instruction-set-summary/Table-of-processor-instructions

dukesrg avatar Nov 25 '20 06:11 dukesrg

With the new timings, this is (best case with no contention):

1c lsls	r1, r0, #30
1c itt	mi
2c ldrmi.w	r1, [r4, #680]	; 0x2a8
1c addmi	r3, r3, r1

  1. current implementation: 5 * 5 (maximum modulation applied) + 4 * 10 (minimum non-modulated matrix) + 4 * 15 (not applied when mod > carrier) = 125 cycles

  2. goto case: 5 * 5 + 4 * 10 + 1 * 15 (compare) + 1 * 15 (not-taken branch) + (1+3) * 5 (branch, worst case) = 115 cycles

  3. matrix jump case: 5 * 5 + 4 * 10 + 1 * 5 (multiply) + 1 * 5 (add) + (1+3) * 5 (branch) = 95 cycles

  4. matrix with table branch (TBB): 5 * 5 + 4 * 10 + (2+3) * 5 (branch) = 90 cycles
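As a sanity check on the arithmetic above, the four totals can be recomputed mechanically (the per-instruction weights are this thread's own estimates, not measured values):

```c
#include <assert.h>

/* Cycle estimates from the discussion: 5c per applied modulation,
   4c per not-taken matrix check, plus per-variant control overhead. */
static int cycles_current(void) { return 5*5 + 4*10 + 4*15; }
static int cycles_goto(void)    { return 5*5 + 4*10 + 1*15 + 1*15 + (1+3)*5; }
static int cycles_jump(void)    { return 5*5 + 4*10 + 1*5 + 1*5 + (1+3)*5; }
static int cycles_tbb(void)     { return 5*5 + 4*10 + (2+3)*5; }
```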

dukesrg avatar Nov 25 '20 09:11 dukesrg

Nice work, that is a good improvement! I had another idea pop into my head. Why not use the MOD1_MASK bit and change its meaning to "NO_MOD and NO_FB"? That way, for cases like algorithm 32, you could branch over the whole thing for many operators.
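A sketch of the repurposed bit (ALG_NOMOD_MASK is a hypothetical name, and the flag would presumably be precalculated when the algorithm is loaded; masks and state are stand-ins):

```c
#include <assert.h>
#include <stdint.h>

#define ALG_NOMOD_MASK 0x01u  /* repurposed ALG_MOD1_MASK: "no mod, no feedback" */
#define ALG_MOD6_MASK  0x20u

static uint8_t s_algorithm[6];
static int32_t s_opval[6];

static int32_t modw_for(int i) {
  int32_t modw0 = 0;
  if (s_algorithm[i] & ALG_NOMOD_MASK)
    return modw0;  /* e.g. algorithm 32: every operator takes this early exit */
  if (s_algorithm[i] & ALG_MOD6_MASK) modw0 += s_opval[0];
  /* ... remaining modulator checks and feedback would follow ... */
  return modw0;
}
```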

aminixduser avatar Nov 25 '20 13:11 aminixduser

That would just mean that at least one OP's modulation can be skipped in the worst case. I currently implemented variant 4, and it will do the same when that "at least one OP" happens to be OP6 =)

dukesrg avatar Nov 25 '20 16:11 dukesrg

@aminixduser Implemented in 63a25f4, though since not all of the loop is in asm, the compiler introduces some overhead:

  • 1 or 2 movs (modw0 preserved/restored for an unknown reason)
  • 1 ldr (the s_opval displacement is loaded explicitly, while originally one of the registers was used as a base offset for such loads)

Approximately 4 extra cycles per operator, so about 115 cycles - not much.

The GCC indexed-goto construct did not compile into code like that.
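For reference, the GCC labels-as-values form of the indexed goto looks like this (stand-in names and masks; whether the compiler actually emits the intended single indirect branch is exactly what did not work out here):

```c
#include <assert.h>
#include <stdint.h>

#define ALG_MOD6_MASK 0x20u
#define ALG_MOD5_MASK 0x10u
#define ALG_MOD4_MASK 0x08u
#define ALG_MOD3_MASK 0x04u
#define ALG_MOD2_MASK 0x02u

static uint8_t s_algorithm[6];
static int32_t s_opval[6];

/* GCC extension (labels as values): jump straight past the checks that
   cannot apply to operator i, entering the chain at the right depth. */
static int32_t modulation_indexed(int i) {
  static void *const entry[6] = { &&l6, &&l5, &&l4, &&l3, &&l2, &&l1 };
  int32_t modw0 = 0;
  goto *entry[i];
l1: if (s_algorithm[i] & ALG_MOD2_MASK) modw0 += s_opval[4];
l2: if (s_algorithm[i] & ALG_MOD3_MASK) modw0 += s_opval[3];
l3: if (s_algorithm[i] & ALG_MOD4_MASK) modw0 += s_opval[2];
l4: if (s_algorithm[i] & ALG_MOD5_MASK) modw0 += s_opval[1];
l5: if (s_algorithm[i] & ALG_MOD6_MASK) modw0 += s_opval[0];
l6: return modw0;
}
```

Note this is a GCC/Clang extension, not standard C.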

dukesrg avatar Nov 25 '20 18:11 dukesrg

Nice work, that is a good improvement! I had another idea pop into my head. Why not use the MOD1_MASK bit and change its meaning to "NO_MOD and NO_FB"? That way, for cases like algorithm 32, you could branch over the whole thing for many operators.

Implemented in 70464d0; at least 2 to 4 cycles saved. Also changed the feedback operator source representation, which reduces the precalculation slightly (in the next commit).

dukesrg avatar Nov 26 '20 21:11 dukesrg

@aminixduser I have a better idea. Imagine we have all the opval and modw values in registers. It will be just:

5c tbb
1c lsls
1c it
1c addmi
...

  1. matrix with table branch (TBB) and everything in registers: 3 * 5 + 3 * 10 + (2+3) * 5 (branch) = 70 cycles

But currently it will produce str/ldr...str/ldr or push/ldr...str/pop, which is (2 * 2 * (1+6)) = up to 28 cycles. So it will make sense only if the whole loop is in pure assembly; otherwise handing off to C will incur the register save/restore penalty.
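The same back-of-envelope check for the all-in-registers numbers (again using the thread's own estimated weights):

```c
#include <assert.h>

/* TBB variant with opval/modw kept in registers, vs the spill cost of
   the current str/ldr pairs around the C portion of the loop. */
static int cycles_tbb_regs(void) { return 3*5 + 3*10 + (2+3)*5; }
static int cycles_spill(void)    { return 2 * 2 * (1 + 6); }
```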

dukesrg avatar Dec 02 '20 08:12 dukesrg