
Analyze low batch size timing

Open ryan-summers opened this issue 2 years ago • 7 comments

Analyze the timing requirements when using the DMA sample acquisition architecture for ADC/DAC operations for low batch sizes (e.g. 1 or 2).

If possible, we may want to eliminate the peripheral-data -> RAM DMA operation, as this would remove processing overhead from the loop. Instead, for these low batch counts, the data can be transacted manually with the peripherals directly.
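As a rough sketch of what that direct path could look like (the peripheral names and register accesses below are assumptions based on the STM32H7 SPI block, not existing Stabilizer code):

```rust
use stm32h7xx_hal::stm32;

/// Hypothetical direct (non-DMA) transaction of one sample per update.
/// `adc_spi`/`dac_spi` are illustrative; Stabilizer's actual SPI routing
/// may differ.
fn transact_sample(adc_spi: &stm32::SPI2, dac_spi: &stm32::SPI4, dac_code: u16) -> u16 {
    // Write the next DAC code straight into the TX data register.
    dac_spi.txdr.write(|w| unsafe { w.bits(dac_code as u32) });
    // Busy-wait for the ADC sample instead of servicing a DMA transfer.
    while adc_spi.sr.read().rxp().bit_is_clear() {}
    adc_spi.rxdr.read().bits() as u16
}
```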

ryan-summers avatar May 16 '22 09:05 ryan-summers

[Oscilloscope capture: DS4_QuickPrint12]

The above capture was taken by toggling the USART3 RX/TX lines with a batch size of 1 (a rough sketch of the instrumentation follows the list below):

  • TX was asserted at the start of the DSP process() function call and de-asserted at the end.
  • RX was asserted immediately before getting the ADC/DAC data buffers and servicing the DBM DMA transfer. It was then de-asserted immediately inside of the closure processing said buffers.
    • The second RX pulse is caused by RX being asserted immediately before data is transferred to the ethernet livestream; it is then de-asserted immediately after the DBM DMA transfer closure completes.
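The sketch of that instrumentation (pin and buffer names are stand-ins, not the actual Stabilizer code):

```rust
// `tx_pin`/`rx_pin` stand in for the USART3 TX/RX pads driven as debug
// GPIO outputs; `adcs`/`dacs` stand in for the locked DMA buffer resources.
tx_pin.set_high(); // start of the DSP process() call
rx_pin.set_high(); // about to fetch buffers / service the DBM DMA transfer
(adcs, dacs).lock(|adcs, dacs| {
    rx_pin.set_low(); // buffers acquired; DSP runs from here
    // ... process `adcs` into `dacs` ...
    rx_pin.set_high(); // second RX pulse: prepare the ethernet livestream
    // ... queue livestream data ...
});
rx_pin.set_low(); // DBM DMA transfer closure complete
tx_pin.set_low(); // end of process()
```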

As can be seen, the whole DSP process takes approximately 1.9 µs, which corresponds to a maximum sampling rate of approximately 526 kHz. Of that, servicing the DBM DMA transfers for data takes about 420 ns.

If no DBM DMA transfer servicing were required, the existing livestream / DSP routines would take 1.48 µs, which corresponds to a maximum sampling rate of ~676 kHz. However, even without DBM DMA, some small amount of time would still be required to read/write the SPI peripheral data registers, so in reality the overhead would be slightly higher.

Rough breakdown of time requirements within DSP processing for a batch size of 1 (values in nanoseconds):

```mermaid
pie title Process time breakout (Batch size = 1)
    "DSP Routines": 900
    "Get DMA Buffers": 440
    "Prepare livestream": 400
    "Update Telemetry": 120
    "Exit": 20
    "Entry": 120
```

ryan-summers avatar May 16 '22 11:05 ryan-summers

Interesting. I seem to remember much less time for DSP. ~1000 insns is a lot. Might be worthwhile to check back against https://github.com/quartiq/stabilizer/blob/0fd442e67f9a0543894c053d2b40c7b9e7ca55e8/src/main.rs#L247-L284 (caveat: old hardware, I think). Ah. I think the big difference in DSP load is the signal generator. Also, do generally use nightly and the cortex-m/inline-asm feature. I've found DWT CYCCNT to be a nicer tool for these measurements than GPIO toggling. I think it could well be less overhead.
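For reference, a minimal CYCCNT sketch using the cortex-m crate's 0.7 API; the setup shown is illustrative rather than Stabilizer's actual instrumentation:

```rust
use cortex_m::peripheral::{Peripherals, DWT};

// One-time setup, e.g. during init:
let mut cp = Peripherals::take().unwrap();
cp.DCB.enable_trace(); // DWT only counts while tracing is enabled
cp.DWT.enable_cycle_counter();

// Around the section under test:
let start = DWT::cycle_count();
// ... code being measured ...
let cycles = DWT::cycle_count().wrapping_sub(start);
```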

jordens avatar May 16 '22 14:05 jordens

My calculations show the DSP section taking approximately 360 insns - the rest of the overhead here is from the various other things we've put into the DSP routine, such as telemetry, signal generation, DMA servicing, etc.

ryan-summers avatar May 16 '22 14:05 ryan-summers

The 1.9 µs you measure is about 760 insns for "DSP Routines". That doesn't include DMA servicing and telemetry, right?
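For reference: at Stabilizer's 400 MHz core clock, 1.9 µs × 400 MHz = 760 cycles, which is roughly 760 insns under the one-instruction-per-cycle approximation.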

jordens avatar May 16 '22 14:05 jordens

Ah. No. The 1.9 µs you call "DSP process" is not "DSP routines".

jordens avatar May 16 '22 14:05 jordens

Isn't signal generation part of "DSP Routines" in your measurement?

jordens avatar May 16 '22 14:05 jordens

"DSP Routines" is inclusive of signal generation - it's the amount of time the closure on the ADCs/DACs run:

```rust
// Start timer
(adc0, adc1, dac0, dac1).lock(|adc0, adc1, dac0, dac1| {
    // Stop & reset timer, this is "Get DMA Buffers"
    // ... signal generation and DSP on the buffers ...
});
// Stop & reset timer, this is called "DSP Routines"

telemetry.latest_adcs = [adcs[0][0], adcs[1][0]];
telemetry.latest_dacs = [dacs[0][0], dacs[1][0]];
// Stop timer, this is "Update Telemetry"
```

I'll try to get a full diff just to show things. I also want to rework this so that these measurements get reported via telemetry instead of requiring manual probing of debug pins.
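A possible shape for that report (the struct and field names are invented for illustration, not the actual telemetry schema):

```rust
use serde::Serialize;

/// Hypothetical timing telemetry; names are sketches only.
#[derive(Serialize)]
pub struct ProcessTiming {
    /// CYCCNT delta across the whole process() call.
    pub total_cycles: u32,
    /// CYCCNT delta spent acquiring the DBM DMA buffers.
    pub buffer_cycles: u32,
}
```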

ryan-summers avatar May 16 '22 15:05 ryan-summers