atk-sc3
Investigate porting ATK-FOA to SC from UGen code
As observed by @timblechmann, ATK-FOA could get a speed boost by porting to SC to take advantage of optimizations in the dsp graph, where there are sse/avx vectorizations, and multiplications with 0/1 are optimized out.
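To make the "multiplications with 0/1 are optimized out" point concrete, here is a minimal sketch of the idea, not the actual SC graph optimizer. It assumes a 4x4 FOA transform matrix with channel order W, X, Y, Z, and uses a rotation about the z-axis as the example; the helper names (`plan_matrix_ops`, `count_multiplies`, `apply_plan`, `rotate_matrix`) are hypothetical.

```python
import math

def plan_matrix_ops(matrix):
    """For each output row, keep only (input_index, coeff) terms that are non-zero."""
    return [[(j, c) for j, c in enumerate(row) if c != 0] for row in matrix]

def count_multiplies(plan):
    """Coefficients equal to 1 are pass-throughs and cost no multiply."""
    return sum(1 for terms in plan for _, c in terms if c != 1)

def apply_plan(plan, frame):
    """Apply the pruned plan to one frame of input samples."""
    out = []
    for terms in plan:
        acc = 0.0
        for j, c in terms:
            acc += frame[j] if c == 1 else c * frame[j]
        out.append(acc)
    return out

def rotate_matrix(angle):
    """Rotation about z (assumed channel order W, X, Y, Z)."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        [1, 0, 0, 0],
        [0, c, -s, 0],
        [0, s, c, 0],
        [0, 0, 0, 1],
    ]

plan = plan_matrix_ops(rotate_matrix(0.3))
frame = [0.5, 0.25, -0.75, 0.1]  # one arbitrary B-format frame
print(count_multiplies(plan), "multiplies instead of 16")
print(apply_plan(plan, frame))
```

For this matrix only the four sin/cos terms need a multiply; W and Z pass straight through. A brute-force 4x4 multiply would perform all 16.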
Initial benchmarks show improvements:
Conditions:
- Continuously running synths
  - One instance of FoaEncode.ar( PinkNoise.ar(1), FoaEncoderMatrix.newOmni );
  - 3 instances of CtkControl.lfo (for param mod)
  - 1 instance of In.ar(inbus1, 4) - In.ar(inbus2, 4); (for crude testing that the outputs are identical)
  - The above synths are negligible: <1%
- A/B Tested synths
  - 20 x FoaPush.ar( In.ar(inbus, 4), angle, theta, phi, amp);
  - 20 x MokaFOA.push( In.ar(inbus, 4) * amp, angle, theta, phi);
- Distro Release of SC
Results:

ATK
- Non-modulated: avg 10-13%, pk 24%
- Modulated: avg 26-33%, pk 54%

Moka
- Non-modulated: avg ~2.4-2.9%, pk 5%
- Modulated: avg 2.6-3.7%, pk 6.4%
Recent tests show improvements of between 2x and 4.7x, depending on the UGen.
There are issues to resolve before deciding if it's worth porting:
- It looks like when modulation is at AR, the signals are very nearly identical, with error related to the difference in precision of the operators used by SC (sin, cos, *, /, -, +, etc.) vs. the C++ code in the UGens (the UGens look to be higher precision). Need to look into this... Here are images of the output of the transforms.
(ATK-UGen)
(SC)
- At control rate, there are differences that appear to be related to the interpolation method used: the difference shows up as a ripple with a periodicity of 64 samples, departing from and converging on 0. The UGens may be using cubic interpolation while SC may be using linear, or something similar. Here is an image of the difference between the outputs of the two implementations of the transform.
- In a couple of the UGens, e.g. the Focus(N) UGens, there is a difference that shows up at apparent zero-crossings of the modulated "angle" parameter. This could have to do with the wrapping of negative to positive changes, or some other wrapping issue. I recall @joshpar had to take special care to handle this...
Here is an image of the difference between the outputs of the two implementations of the transform (Focus).
(64 samples shown)
- TBD: If output is (nearly) identical at AR modulation rates, how does CPU time compare? (the above benchmarks are at KR)
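The 64-sample ripple described in the list above is what you would expect whenever two interpolators agree at block boundaries but differ in between. The sketch below is only an illustration of that hypothesis, not the ATK or SC code: it assumes a 64-sample control block, a made-up slow sine as the control signal, and Catmull-Rom as a stand-in for "cubic".

```python
import math

BLOCK = 64  # assumed control period in samples

def linear(p1, p2, t):
    """Straight-line interpolation between two control points."""
    return p1 + (p2 - p1) * t

def catmull_rom(p0, p1, p2, p3, t):
    """Cubic interpolation that passes through p1 (t=0) and p2 (t=1)."""
    return 0.5 * (
        2 * p1
        + (-p0 + p2) * t
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t * t
        + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3
    )

def upsample(points, interp):
    """Expand control-rate points to audio rate, BLOCK samples per segment."""
    out = []
    for k in range(1, len(points) - 2):
        for n in range(BLOCK):
            out.append(interp(points, k, n / BLOCK))
    return out

# Hypothetical control signal: a slow sine sampled once per block.
points = [math.sin(2 * math.pi * k / 20) for k in range(12)]

lin = upsample(points, lambda p, k, t: linear(p[k], p[k + 1], t))
cub = upsample(points, lambda p, k, t: catmull_rom(p[k - 1], p[k], p[k + 1], p[k + 2], t))
diff = [a - b for a, b in zip(lin, cub)]

print("difference at block starts:", [diff[i] for i in range(0, len(diff), BLOCK)])
print("peak difference inside blocks:", max(abs(d) for d in diff))
```

The difference is zero at every block boundary and bulges in between, which has the same shape as the ripple reported for KR modulation. It does not establish which interpolation either implementation actually uses.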
Results show less advantage at AR, but still pretty consistently ~2x faster than current ATK implementation.
// KR modulation of all transform params (angle, theta, phi, where applicable)
// 15-sample CPU average over 4 sec

| Transform | avgCPU (SC port) | avgCPU (ATK UGen) | Ratio |
|---|---|---|---|
| directY | 3.8922178904215 | 13.096549224854 | 3.365 |
| tilt | 4.1747585932414 | 10.377758216858 | 2.486 |
| focusZ | 3.8568762143453 | 15.724580256144 | 4.077 |
| zoomY | 4.1443750699361 | 10.720642662048 | 2.587 |
| zoomZ | 4.1646455128988 | 10.79509455363 | 2.592 |
| rotate | 4.0821704228719 | 9.4572704633077 | 2.317 |
| pressZ | 4.3713499069214 | 13.369442431132 | 3.058 |
| push | 6.5535450617472 | 34.429843902588 | 5.254 |
| focus | 6.9997715314229 | 32.5939356486 | 4.656 |
| focusY | 3.9356636842092 | 15.359981282552 | 3.903 |
| pushX | 4.2960601488749 | 13.176390457153 | 3.067 |
| pressY | 4.0189873854319 | 13.472814623515 | 3.352 |
| pressX | 4.0665680885315 | 12.967267862956 | 3.189 |
| directZ | 4.1716379801432 | 12.813135019938 | 3.071 |
| pushY | 4.1591040611267 | 12.943291854858 | 3.112 |
| pushZ | 3.7096843560537 | 12.333229700724 | 3.325 |
| zoom | 6.2228276570638 | 33.447308095296 | 5.375 |
| press | 6.3306551615397 | 31.999982706706 | 5.055 |
| tumble | 4.135467561086 | 10.033804893494 | 2.426 |
| focusX | 4.0670563220978 | 13.977882067362 | 3.437 |
| directX | 4.2036932309469 | 12.095028940837 | 2.877 |
| zoomX | 4.0228819052378 | 10.213698895772 | 2.539 |
// AR modulation of all params (angle, theta, phi, where applicable)
// 15-sample CPU average over 4 sec

| Transform | avgCPU (SC port) | avgCPU (ATK UGen) | Ratio |
|---|---|---|---|
| directY | 5.817141977946 | 12.422338930766 | 2.135 |
| tilt | 5.1942962646484 | 10.160934003194 | 1.956 |
| focusZ | 6.6086461702983 | 14.85988928477 | 2.249 |
| zoomY | 5.5710364659627 | 10.856606610616 | 1.949 |
| zoomZ | 5.1938027699788 | 10.014154561361 | 1.928 |
| rotate | 5.2405638694763 | 9.9568091074626 | 1.9 |
| pressZ | 6.4533769607544 | 13.856771405538 | 2.147 |
| push | 13.29306195577 | 30.377778879801 | 2.285 |
| focus | 12.576449203491 | 32.970167414347 | 2.622 |
| focusY | 5.6360649426778 | 15.23735071818 | 2.704 |
| pushX | 6.3163091659546 | 12.025527572632 | 1.904 |
| pressY | 6.2234305699666 | 12.92509059906 | 2.077 |
| pressX | 6.2191113154093 | 13.958075332642 | 2.244 |
| directZ | 5.601010799408 | 14.337109565735 | 2.56 |
| pushY | 6.221021493276 | 10.742026869456 | 1.727 |
| pushZ | 6.4694593747457 | 13.076047960917 | 2.021 |
| zoom | 12.099684842428 | 34.168221028646 | 2.824 |
| press | 13.388120969137 | 34.977306747437 | 2.613 |
| tumble | 5.5153895378113 | 10.999972279867 | 1.994 |
| focusX | 6.5750583012899 | 15.978525161743 | 2.43 |
| directX | 6.0198050816854 | 13.980832862854 | 2.322 |
| zoomX | 5.7388275782267 | 10.184682210286 | 1.775 |
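As a quick sanity check on how the ratio figures above read, the sketch below recomputes three of the KR ratios as (Foa UGen avgCPU) / (SC port avgCPU). The numbers are copied from the KR results; the tuple layout and the `speedup` helper are just for illustration.

```python
# (transform, avgCPU of SC port, avgCPU of ATK UGen, reported ratio), KR rows
rows = [
    ("directY", 3.8922178904215, 13.096549224854, 3.365),
    ("tilt", 4.1747585932414, 10.377758216858, 2.486),
    ("push", 6.5535450617472, 34.429843902588, 5.254),
]

def speedup(sc_cpu, ugen_cpu):
    """How many times more CPU the UGen version used than the SC port."""
    return ugen_cpu / sc_cpu

for name, sc_cpu, ugen_cpu, reported in rows:
    print(f"{name}: computed {speedup(sc_cpu, ugen_cpu):.3f}, reported {reported}")
```

A ratio above 1 therefore means the SC port was cheaper for that transform.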
It may also be reasonable to optimize the UGens to avoid some of those cases as well. If the precision is better in the UGens, I can explore that a bit. At the moment, the UGens are pretty brute force, and I know how to do the code better now.
Josh
(Reply sent by email to Michael McCrea's post of Dec 14, 2016, 11:49 AM.)
/* Joshua D. Parmenter http://www.realizedsound.net/josh/
“Every composer – at all times and in all cases – gives his own interpretation of how modern society is structured: whether actively or passively, consciously or unconsciously, he makes choices in this regard. He may be conservative or he may subject himself to continual renewal; or he may strive for a revolutionary, historical or social palingenesis." - Luigi Nono */
Thanks, Josh. It would be great if you had a look to see if there were any optimizations possible. I'm still pretty green on the C++ side, but I'm happy to put in the work if you have some leads on potential optimizations. From a "purist" perspective, it would be great to retain the precision of the UGens and get an efficiency bump!
(Reply sent by email to Joshua Parmenter's message of Wed, Dec 14, 2016, 1:34 PM.)
Hi @joshpar
I'm curious what optimizations you have in mind. From a quick look at the UGen source, I see you've commented a couple of areas: the phase unwrapping, and checking for 0s in the matrix calculation. Are these the sorts of things you had in mind, or other optimizations you've learned?
I'm happy to look into it if you could point me in the right direction, but if it comes to bit shifting/masking and all that, I'm pretty much in the dark...
The efficiency gains are likely in the matrix operations, but for what it's worth, I had a look at the phase unwrapping, and there may be a way to slim it down.
In SC, it looks like: gist - phase unwrapping
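The gist itself is not reproduced here. As a generic illustration of what phase unwrapping has to do (not the ATK or gist code), the sketch below keeps an angle parameter continuous across the wrap point so that smoothing takes the short way round; `wrap_pi` and `unwrap` are hypothetical helper names.

```python
import math

def wrap_pi(x):
    """Wrap an angle in radians into [-pi, pi)."""
    return (x + math.pi) % (2 * math.pi) - math.pi

def unwrap(angles):
    """Remove 2*pi jumps so consecutive values differ by at most pi."""
    out = [angles[0]]
    for a in angles[1:]:
        out.append(out[-1] + wrap_pi(a - out[-1]))
    return out

# An angle crossing the +pi/-pi boundary between two control periods.
wrapped = [3.0, -3.0]
unwrapped = unwrap(wrapped)

naive_mid = (wrapped[0] + wrapped[1]) / 2        # interpolates through 0
short_mid = (unwrapped[0] + unwrapped[1]) / 2    # interpolates through pi

print("unwrapped:", unwrapped)
print("naive midpoint:", naive_mid, "unwrapped midpoint:", short_mid)
```

Interpolating the wrapped values sweeps the angle almost a full turn through 0, while the unwrapped values move about 0.28 rad through pi. A mismatch of this kind would show up as a discontinuity around the wrap point.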
Regarding the precision differences, James Harkins points out:
(Sent to sc-dev via lists.bham.ac.uk, in reply to a post of Mon, 15 May 2017 06:51:58 +0800.)

> It appears that there are precision differences between the UGen vs. pseudo-ugen implementations,

UGen outputs and inputs are single-precision floats. If you're using doubles in your UGen internals, that might explain it.

> and some of the parameter smoothing introduces discontinuities depending how values are wrapped (say across the 0<>2pi transition).

I've seen this in other contexts, and never came up with a good workaround.

hjh
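To get a feel for the size of the effect described in this reply, the sketch below emulates single-precision rounding with the standard `struct` module and compares one rotation term computed with double internals (rounded once at the output) against the same term with every intermediate rounded to single precision. It is an assumption-laden illustration, not a model of either implementation: the input values and the helper names are made up.

```python
import math
import struct

def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def rotate_double_internals(x, y, angle):
    """Double-precision arithmetic inside, single-precision output."""
    return f32(x * math.cos(angle) - y * math.sin(angle))

def rotate_single_internals(x, y, angle):
    """Every intermediate result rounded to single precision."""
    c = f32(math.cos(angle))
    s = f32(math.sin(angle))
    return f32(f32(x * c) - f32(y * s))

# Inputs are single-precision in both cases, as UGen inputs would be.
x, y = f32(0.3), f32(0.7)
angles = [f32(2 * math.pi * k / 1000) for k in range(1000)]

errors = [
    abs(rotate_double_internals(x, y, a) - rotate_single_internals(x, y, a))
    for a in angles
]
print("largest difference:", max(errors))
print("angles with any difference:", sum(1 for e in errors if e > 0))
```

The differences stay at the level of the last bit or so of a single-precision value, which fits the observation that the AR-modulated outputs are "very nearly identical".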
Given the new v5.0.0 HOA context, optimization of FOA UGens might be considered redundant.
It would be useful to test / review performance of equivalent HOA1 pseudo-UGens against the parallel FOA UGens.