
Analyze low batch size timing

Open ryan-summers opened this issue 2 years ago • 7 comments

Analyze the timing requirements when using the DMA sample acquisition architecture for ADC/DAC operations for low batch sizes (e.g. 1 or 2).

If possible, we may want to eliminate the peripheral-data -> RAM DMA operation, as this would remove processing overhead from the loop. Instead, for these low batch counts, the data can be transacted manually with the peripherals directly.
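As a rough sketch of what that direct path could look like (the peripheral names and register accesses below are assumptions based on the STM32H7 SPI block, not existing Stabilizer code):

```rust
use stm32h7xx_hal::stm32;

/// Hypothetical direct (non-DMA) transaction of one sample per update.
/// `adc_spi`/`dac_spi` are illustrative; Stabilizer's actual SPI routing
/// may differ.
fn transact_sample(adc_spi: &stm32::SPI2, dac_spi: &stm32::SPI4, dac_code: u16) -> u16 {
    // Write the next DAC code straight into the TX data register.
    dac_spi.txdr.write(|w| unsafe { w.bits(dac_code as u32) });
    // Busy-wait for the ADC sample instead of servicing a DMA transfer.
    while adc_spi.sr.read().rxp().bit_is_clear() {}
    adc_spi.rxdr.read().bits() as u16
}
```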

ryan-summers avatar May 16 '22 09:05 ryan-summers

[Oscilloscope capture: DS4_QuickPrint12]

The above capture was taken by toggling the USART3 RX/TX lines with a batch size of 1 (a rough sketch of the instrumentation follows the list below):

  • TX was asserted at the start of the DSP process() function call and de-asserted at the end.
  • RX was asserted immediately before getting the ADC/DAC data buffers and servicing the DBM DMA transfer. It was then de-asserted immediately inside of the closure processing said buffers.
    • The second RX pulse is caused by RX being asserted immediately before data is transferred to the ethernet livestream; it is then de-asserted immediately after the DBM DMA transfer closure completes.
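The sketch of that instrumentation (pin and buffer names are stand-ins, not the actual Stabilizer code):

```rust
// `tx_pin`/`rx_pin` stand in for the USART3 TX/RX pads driven as debug
// GPIO outputs; `adcs`/`dacs` stand in for the locked DMA buffer resources.
tx_pin.set_high(); // start of the DSP process() call
rx_pin.set_high(); // about to fetch buffers / service the DBM DMA transfer
(adcs, dacs).lock(|adcs, dacs| {
    rx_pin.set_low(); // buffers acquired; DSP runs from here
    // ... process `adcs` into `dacs` ...
    rx_pin.set_high(); // second RX pulse: prepare the ethernet livestream
    // ... queue livestream data ...
});
rx_pin.set_low(); // DBM DMA transfer closure complete
tx_pin.set_low(); // end of process()
```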

As can be seen, the whole DSP process takes approximately 1.9 µs, which corresponds to a maximum sampling rate of approximately 526 kHz. Of that, servicing the DBM DMA transfers for data takes about 420 ns.

If no DBM DMA transfer servicing were required, the existing livestream / DSP routines would take 1.48 µs, which corresponds to a maximum sampling rate of ~676 kHz. However, even without DBM DMA, some small amount of time would still be required to read/write the SPI peripheral data registers, so in reality the overhead would be slightly higher.

Rough breakdown of time requirements within DSP processing for a batch size of 1 (values in nanoseconds):

```mermaid
pie title Process time breakout (Batch size = 1)
    "DSP Routines": 900
    "Get DMA Buffers": 440
    "Prepare livestream": 400
    "Update Telemetry": 120
    "Exit": 20
    "Entry": 120
```

ryan-summers avatar May 16 '22 11:05 ryan-summers

Interesting. I seem to remember much less time for DSP. ~1000 insns is a lot. Might be worthwhile to check back against https://github.com/quartiq/stabilizer/blob/0fd442e67f9a0543894c053d2b40c7b9e7ca55e8/src/main.rs#L247-L284 (caveat: old hardware, I think). Ah. I think the big difference in DSP load is the signal generator. Also, do generally use nightly and the cortex-m/inline-asm feature. I've found DWT CYCCNT to be a nicer tool for these measurements than GPIO toggling. I think it could well be less overhead.
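For reference, a minimal CYCCNT sketch using the cortex-m crate's 0.7 API; the setup shown is illustrative rather than Stabilizer's actual instrumentation:

```rust
use cortex_m::peripheral::{Peripherals, DWT};

// One-time setup, e.g. during init:
let mut cp = Peripherals::take().unwrap();
cp.DCB.enable_trace(); // DWT only counts while tracing is enabled
cp.DWT.enable_cycle_counter();

// Around the section under test:
let start = DWT::cycle_count();
// ... code being measured ...
let cycles = DWT::cycle_count().wrapping_sub(start);
```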

jordens avatar May 16 '22 14:05 jordens

My calculations show the DSP section taking approximately 360 insns - the rest of the overhead here is from the various other things we've put into the DSP routine, such as telemetry, signal generation, DMA servicing, etc.

ryan-summers avatar May 16 '22 14:05 ryan-summers

The 1.9 µs you measure is about 760 insns for "DSP Routines". That doesn't include DMA servicing and telemetry, right?
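For reference: at Stabilizer's 400 MHz core clock, 1.9 µs × 400 MHz = 760 cycles, which is roughly 760 insns under the one-instruction-per-cycle approximation.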

jordens avatar May 16 '22 14:05 jordens

Ah. No. The 1.9 µs you call "DSP process" is not "DSP routines".

jordens avatar May 16 '22 14:05 jordens

Isn't signal generation part of "DSP Routines" in your measurement?

jordens avatar May 16 '22 14:05 jordens

"DSP Routines" is inclusive of signal generation - it's the amount of time the closure on the ADCs/DACs run:

```rust
// Start timer
(adc0, adc1, dac0, dac1).lock(|adc0, adc1, dac0, dac1| {
    // Stop & reset timer, this is "Get DMA Buffers"
    // ... signal generation and DSP on the buffers ...
});
// Stop & reset timer, this is called "DSP Routines"

telemetry.latest_adcs = [adcs[0][0], adcs[1][0]];
telemetry.latest_dacs = [dacs[0][0], dacs[1][0]];
// Stop timer, this is "Update Telemetry"
```

I'll try to get a full diff just to show things. I also want to rework this so that these measurements get reported via telemetry instead of requiring manual probing of debug pins.
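A possible shape for that report (the struct and field names are invented for illustration, not the actual telemetry schema):

```rust
use serde::Serialize;

/// Hypothetical timing telemetry; names are sketches only.
#[derive(Serialize)]
pub struct ProcessTiming {
    /// CYCCNT delta across the whole process() call.
    pub total_cycles: u32,
    /// CYCCNT delta spent acquiring the DBM DMA buffers.
    pub buffer_cycles: u32,
}
```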

ryan-summers avatar May 16 '22 15:05 ryan-summers