mksocfpga
hm2 read(), write() functions need significant time
observed running irqtest.hal on a terasic de0, times are in ns:
halcmd: show funct
Exported Functions:
Comp Inst CodeAddr Arg Users Type Name Runtime Maxtime
66 00000000b6d3fe1d 0000000000000000 0 user delinst 0 0
75 76 00000000b53c06d9 0000000000041bc8 0 xthread hm2_de0n.0.pet_watchdog 0 0
75 76 00000000b53a9e39 0000000000041bc8 1 xthread hm2_de0n.0.read 59193 107665
75 76 00000000b53a9fc1 0000000000041bc8 1 xthread hm2_de0n.0.read_gpio 11070 34502
75 76 00000000b53c3291 0000000000041bc8 1 xthread hm2_de0n.0.waitirq 861718 999784
75 76 00000000b53a9ef1 0000000000041bc8 1 xthread hm2_de0n.0.write 57493 133106
75 76 00000000b53a9ff5 0000000000041bc8 1 xthread hm2_de0n.0.write_gpio 21101 53182
66 00000000b6d3fcfd 0000000000000000 0 user newinst 0 0
1099 1100 00000000b53858d1 00000000b6546820 1 xthread rate 1310 23361
1110 1111 00000000b5374891 00000000b6546848 1 xthread servo-period 970 25661
1118 1119 00000000b5363969 00000000b6546858 1 xthread sp-stats 1420 26741
1085 00000000b5396be1 00000000b65467e0 1 thread test.update 2830 24781
read() and write() are almost 60uS each - together that's about 120uS, or 12% of the 1ms cycle budget at 1kHz; the waitirq() time basically shows the budget left for motion, kins etc.
at 5kHz we have only about 50uS left for actual work:
halcmd: show funct
Exported Functions:
Comp Inst CodeAddr Arg Users Type Name Runtime Maxtime
66 00000000b6d3fe1d 0000000000000000 0 user delinst 0 0
75 76 00000000b53c06d9 0000000000041bc8 0 xthread hm2_de0n.0.pet_watchdog 0 0
75 76 00000000b53a9e39 0000000000041bc8 1 xthread hm2_de0n.0.read 55762 107665
75 76 00000000b53a9fc1 0000000000041bc8 1 xthread hm2_de0n.0.read_gpio 10461 41402
75 76 00000000b53c3291 0000000000041bc8 1 xthread hm2_de0n.0.waitirq 49953 999784
75 76 00000000b53a9ef1 0000000000041bc8 1 xthread hm2_de0n.0.write 53602 133106
75 76 00000000b53a9ff5 0000000000041bc8 1 xthread hm2_de0n.0.write_gpio 20501 56482
66 00000000b6d3fcfd 0000000000000000 0 user newinst 0 0
1099 1100 00000000b53858d1 00000000b6546820 1 xthread rate 990 27521
1110 1111 00000000b5374891 00000000b6546848 1 xthread servo-period 950 32741
1118 1119 00000000b5363969 00000000b6546858 1 xthread sp-stats 1250 26741
1085 00000000b5396be1 00000000b65467e0 1 thread test.update 2321 33982
this suggests looking into where the time is spent
hm, the above samples were from the RIP build, which is compiled with -O0 so gdb doesn't stumble over reordered code
with -O2 (as with the package build) things look a bit better (at 5kHz):
halcmd: show funct
Exported Functions:
Comp Inst CodeAddr Arg Users Type Name Runtime Maxtime
66 00000000b66eeba9 0000000000000000 0 user delinst 0 0
75 76 00000000b562f369 000000000003bbc8 0 xthread hm2_de0n.0.pet_watchdog 0 0
75 76 00000000b5620db9 000000000003bbc8 1 xthread hm2_de0n.0.read 36661 99374
75 76 00000000b5620ced 000000000003bbc8 1 xthread hm2_de0n.0.read_gpio 4301 47512
75 76 00000000b5630f41 000000000003bbc8 1 xthread hm2_de0n.0.waitirq 125696 1041507
75 76 00000000b5620d09 000000000003bbc8 1 xthread hm2_de0n.0.write 33001 114825
75 76 00000000b5620cc5 000000000003bbc8 1 xthread hm2_de0n.0.write_gpio 7861 52882
66 00000000b66eedc5 0000000000000000 0 user newinst 0 0
1099 1100 00000000b54dd651 00000000b6668820 1 xthread rate 770 42592
1110 1111 00000000b54cc651 00000000b6668848 1 xthread servo-period 850 28551
1118 1119 00000000b54bb6c1 00000000b6668858 1 xthread sp-stats 1220 43932
1085 00000000b54ee8d9 00000000b66687e0 1 thread test.update 2050 42902
played with perf top:
67.44% [kernel] [k] arch_cpu_idle
3.85% [kernel] [k] _raw_spin_unlock_irq
2.16% hm2_soc_ol.so [.] hm2_soc_read
2.07% hostmot2.so [.] hm2_tram_read
1.71% [kernel] [k] _raw_spin_unlock_irqrestore
1.29% hm2_soc_ol.so [.] hm2_soc_write
1.27% hostmot2.so [.] hm2_ioport_update
1.13% hostmot2.so [.] hm2_tram_write
0.80% hostmot2.so [.] dummy_queue_read
0.71% hostmot2.so [.] hm2_ioport_gpio_read
0.70% hostmot2.so [.] hm2_encoder_process_tram_read
0.66% hostmot2.so [.] hm2_ioport_gpio_prepare_tram_write
0.64% hostmot2.so [.] hm2_stepgen_prepare_tram_write
0.50% [kernel] [k] __apbt_read_clocksource
0.50% hostmot2.so [.] hm2_ioport_gpio_write
0.49% hostmot2.so [.] hm2_watchdog_write
0.44% hostmot2.so [.] hm2_sserial_read_pins
0.43% hostmot2.so [.] hm2_sserial_prepare_tram_write
0.37% hostmot2.so [.] hm2_ioport_gpio_process_tram_read
0.37% hostmot2.so [.] dummy_queue_write
0.31% hostmot2.so [.] hm2_stepgen_write
0.30% [kernel] [k] ktime_get_ts64
I'll have a look at the hm2_soc_read and hm2_soc_write functions tomorrow. I think this byte-wise reading is not necessary.
If you want to burst read we need to make sure the firmware can handle it. For example, I don't think the AXI lite interface I implemented can do burst reading out of the box.
At the driver level they implement a translation RAM, which I think is an attempt to avoid calling write repeatedly at the driver level. We could expose a function that reads the entire range specified by the tram instead of doing individual reads and writes, if you think the problem is function call overhead from hostmot2 down to the hm2 llio.
One other thing to note:
The AXI bus and avalon bus are clocked much slower than the processor. For instance, I can't go any faster than 100MHz on the bus without timing closure issues. That's probably what is causing the hold up. I'm not sure what the avalon bus is clocked at, but I imagine it is something similar.
batching might help, yes - not sure how invasive that would be
for now I'm concentrating on dumbing things down - replacing these supersmart loops with memcpy (which is super-optimized anyway) shaves another 10-20% off hm2_soc_read/hm2_soc_write..
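roughly, the dumbed-down version boils down to this (a sketch only, not the actual commit - hm2_soc_t, its base pointer and the board-struct lookup are assumptions about how hm2_soc_ol keeps the mapped register window):

#include <string.h>

/* sketch: replace the word-by-word copy loops with plain memcpy over the
   mmap'ed register window; hm2_soc_t / ->base are assumed names */
static int hm2_soc_read(hm2_lowlevel_io_t *this, u32 addr, void *buffer, int size)
{
    hm2_soc_t *brd = (hm2_soc_t *)this;      /* however the driver recovers its board struct */
    memcpy(buffer, brd->base + addr, size);  /* base: byte pointer to the mapped hm2 registers */
    return 1;                                /* hm2 llio convention: 1 = success */
}

static int hm2_soc_write(hm2_lowlevel_io_t *this, u32 addr, void *buffer, int size)
{
    hm2_soc_t *brd = (hm2_soc_t *)this;
    memcpy(brd->base + addr, buffer, size);
    return 1;
}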
so with that we're in the 70uS ballpark for all four functs - that's ok for now
Wow, that's a lot. With compiler optimizations turned on? The loops were a guard of some kind to protect against misaligned reads, but I don't think they're entirely necessary. I'm a fan of the memcpy replacement.
yes, -O2 (default)
perf top shows the candidate list to look at - yes, streamlining hm2_tram_* looks promising:
2.00% hostmot2.so [.] hm2_tram_read
1.32% hostmot2.so [.] hm2_ioport_update
1.24% hostmot2.so [.] hm2_tram_write
0.70% hostmot2.so [.] hm2_encoder_process_tram_read
0.70% hostmot2.so [.] hm2_ioport_gpio_read
0.65% hostmot2.so [.] hm2_stepgen_prepare_tram_write
perf is a great tool, recommended (I use linux-perf-4.6 from jessie-backports on the Altera)
Is it necessary to call both ioport_update and tram_read / write? I think that just duplicates the same access to the hm2 firmware.
This is what I'm driving at https://github.com/machinekit/machinekit/blob/master/src/hal/drivers/mesa-hostmot2/hostmot2.c#L78. The read/write functions automatically pull/push in the GPIO status.
Is this in your config?
addf hm2_de0n.0.read servo
addf hm2_de0n.0.read_gpio servo
addf test.update servo
addf rate servo
addf hm2_de0n.0.write servo
addf hm2_de0n.0.write_gpio servo
Would that be a duplicate GPIO read/write per cycle to the firmware?
Yeah, I'm willing to bet that's a duplicate interaction. I don't call the gpio directly in the configs because they are updated automatically. I think the gpio update functions are there if you don't need all of the other stuff hm2 provides and want to save the read time because of the slower bus speeds I mentioned (or even worse, bus speed across slow ethernet).
Maybe try removing those two calls. The other thing I do is remove any extra modules from the config firmware, which reduces the amount of data passed back and forth across the slower buses. E.g., instead of 10 stepgens, only instantiate 4, 5, or 6.
well, our website has the answer..
hm2_<BoardType>.<BoardNum>.read_gpio
Read the GPIO input pins. Note that the effect of this function is a subset of the effect of the .read() function described above. Normally only .read() is used. The only reason to call this function is if you want to do GPIO things in a faster-than-servo thread. (This function is not available on the 7i43 due to limitations of the EPP bus.)
hm2_<BoardType>.<BoardNum>.write_gpio
Write the GPIO control registers and output pins. Note that the effect of this function is a subset of the effect of the .write() function described above. Normally only .write() is used. The only reason to call this function is if you want to do GPIO things in a faster-than-servo thread. (This function is not available on the 7i43 due to limitations of the EPP bus.)
summary: if one calls hm2....read, hm2...read_gpio is not needed, same for write/write_gpio
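so for the config above that means dropping the two gpio functs and keeping just:
addf hm2_de0n.0.read servo
addf test.update servo
addf rate servo
addf hm2_de0n.0.write servo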
it's interesting to compare the IRQ- and non-IRQ versions of perf output
perf_4.6 top -d 10, 5kHz: current hm2 funct usage is at ca. 60uS, so doable I'd say
nevertheless a couple of things can be read out of the below stats:
- there is significant overhead through the uio driver: I think the _raw_spin_unlock_irq and _raw_spin_unlock_irqrestore stem from uio_read/uio_write functions - if one could waste a core for an RT thread, a polling solution would incur less overhead
- hm2_tram_read/write are the prime candidates - it looks like they are O(n) with the number of pins+instances so cutting down on unneeded I/O would help
- the good news is that using rtapi_get_time() in thread_task has no significant impact on the budget (_rtapi_get_time + sys_clock_gettime + ktime_get_ts64 + __apbt_read_clocksource)
IRQ version:
Samples: 41K of event 'cpu-clock', Event count (approx.): 10370343750
Overhead Shared Object Symbol
51.85% [kernel] [k] arch_cpu_idle
5.89% [kernel] [k] _raw_spin_unlock_irq
4.10% hostmot2.so [.] hm2_tram_read
3.77% [kernel] [k] _raw_spin_unlock_irqrestore
2.87% hostmot2.so [.] hm2_tram_write
1.38% hostmot2.so [.] hm2_ioport_update
1.27% hostmot2.so [.] hm2_ioport_gpio_prepare_tram_write
1.20% libc-2.19.so [.] 0x0005f0ec
1.14% hostmot2.so [.] hm2_encoder_process_tram_read
1.13% libc-2.19.so [.] 0x0005f0e4
1.08% hm2_soc_ol.so [.] hm2_soc_read
1.06% hm2_soc_ol.so [.] hm2_soc_write
0.90% hostmot2.so [.] hm2_stepgen_prepare_tram_write
0.89% [kernel] [k] __apbt_read_clocksource
0.86% hostmot2.so [.] hm2_sserial_prepare_tram_write
0.80% hostmot2.so [.] hm2_ioport_gpio_process_tram_read
0.79% libc-2.19.so [.] 0x0005f0d4
0.75% hostmot2.so [.] hm2_stepgen_write
0.67% hostmot2.so [.] hm2_sserial_read_pins
0.53% [kernel] [k] ktime_get_ts64
0.52% hostmot2.so [.] hm2_encoder_update_control_register.isra.0
0.50% hostmot2.so [.] hm2_stepgen_process_tram_read
0.47% [kernel] [k] preempt_count_sub
0.45% [kernel] [k] vector_swi
0.45% [kernel] [k] preempt_count_add
0.42% hal_lib.so [.] thread_task
0.30% libc-2.19.so [.] 0x0005f0c8
0.30% [kernel] [k] migrate_enable
0.29% [kernel] [k] sys_clock_gettime
0.29% [kernel] [k] check_preemption_disabled
0.29% hostmot2.so [.] hm2_encoder_instance_update_position.isra.1
0.28% hostmot2.so [.] hm2_write
0.27% [kernel] [k] rcu_note_context_switch
0.27% hostmot2.so [.] hm2_dpll_write
0.23% hostmot2.so [.] dummy_queue_read
0.23% rtapi.so [.] _rtapi_get_time
0.22% [kernel] [k] rt_spin_lock
0.21% [kernel] [k] schedule
0.21% libc-2.19.so [.] 0x000178e6
0.19% hostmot2.so [.] hm2_encoder_write
0.19% [uio] [k] uio_read
0.19% [kernel] [k] migrate_disable
0.19% [kernel] [k] rcu_idle_exit
For a higher level overview, try: perf top --sort comm,dso
for comparison, without IRQ - task at 5kHz:
Samples: 725K of event 'cpu-clock', Event count (approx.): 112470974030
Overhead Shared Object Symbol
84.83% [kernel] [k] arch_cpu_idle
6.69% hostmot2.so [.] hm2_encoder_process_tram_read
1.92% hostmot2.so [.] hm2_encoder_instance_update_position.isra.1
1.67% hostmot2.so [.] hm2_stepgen_process_tram_read
0.99% hostmot2.so [.] hm2_resolver_process_tram_read
0.36% hostmot2.so [.] hm2_read
0.28% hostmot2.so [.] 0x000017e4
0.21% librtapi_math.so.0 [.] rtapi_fabs
0.18% hostmot2.so [.] 0x000017e8
0.10% libc-2.19.so [.] strcmp
0.10% hostmot2.so [.] __floatdidf
0.08% hostmot2.so [.] __aeabi_dadd
0.08% libc-2.19.so [.] memset
0.05% libpthread-2.19.so [.] __pthread_rwlock_unlock
0.05% perf_4.6 [.] 0x00090708
0.05% perf_4.6 [.] 0x000b7f2c
0.04% perf_4.6 [.] 0x00093f00
0.03% perf_4.6 [.] 0x00093f02
0.03% [kernel] [k] _raw_spin_unlock_irq
0.03% perf_4.6 [.] 0x000918fc
0.03% libpthread-2.19.so [.] __pthread_rwlock_wrlock
0.03% perf_4.6 [.] 0x00091d2c
0.03% libpthread-2.19.so [.] pthread_mutex_trylock
0.03% libpthread-2.19.so [.] __pthread_rwlock_rdlock
0.03% perf_4.6 [.] 0x00096a6e
0.03% perf_4.6 [.] 0x00006a5c
0.02% [kernel] [k] stmmac_mdio_read
0.02% perf_4.6 [.] 0x00096a88
0.02% libpthread-2.19.so [.] pthread_mutex_lock
0.02% perf_4.6 [.] 0x00093efc
0.02% perf_4.6 [.] 0x00096a52
0.02% perf_4.6 [.] 0x00096a6a
0.02% perf_4.6 [.] 0x00093ef0
0.02% perf_4.6 [.] 0x00096a62
0.02% hostmot2.so [.] hm2_ioport_gpio_process_tram_read
0.02% perf_4.6 [.] 0x00096a74
0.02% perf_4.6 [.] 0x0001d35e
0.02% libslang.so.2.3.0 [.] SLsmg_write_chars
0.01% libavahi-common.so.3.5.3 [.] avahi_unescape_label
0.01% perf_4.6 [.] 0x00091d72
0.01% perf_4.6 [.] 0x00091d78
0.01% libpthread-2.19.so [.] __pthread_mutex_unlock_usercnt
0.01% perf_4.6 [.] 0x00096a50
hm2_tram_read/write are the prime candidates - it looks like they are O(n) with the number of pins+instances so cutting down on unneeded I/O would help
This aligns with what I've seen in the firmware and the driver code. It makes sense intuitively: the more pins/modules you add, the more data you need to swap back and forth, in a linearly increasing fashion.
there is significant overhead through the uio driver: I think the _raw_spin_unlock_irq and _raw_spin_unlock_irqrestore stem from uio_read/uio_write functions - if one could waste a core for an RT thread, a polling solution would incur less overhead
I wonder if this is from the driver or from some interaction with the interrupt controller. For instance, all of these interrupts are shared interrupts. If you have more than one interrupt registered, based on that document you linked from Linus, all registered interrupt handlers get notified, and they check if it is theirs to handle. Perhaps by turning on our uio interrupt, it's showing the additional overhead of that notification framework along with the overhead of the uio driver. That leads me to wonder:
What are the debug options in your kernel config? Lots of debugging turned on?
From your testing and my testing, it looks like we sacrifice some processing time for less interrupt jitter - a fair trade in my mind.
@dkhughes - that is the stock cv kernel - I think perf should work just the same on the zynq
the tradeoff is ok - I'm just exploring options
assume we were to do an alternative polling solution for external timing - let's think through if/how that would be possible. Naive approach:
- do NOT enable the IRQ in the DTBO
- IRQStatus: set Bit1 "Irq mask" to 1
- ClearIRQ: write to clear the IRQ bit
- in the wait function, loop until ClearIRQ becomes non-zero, then write to ClearIRQ to reset
does that make sense? would we cause an unclaimed IRQ making the kernel complain ".. nobody cared" ?
No, that should be totally fine, just don't register the IRQ with the controller. If the interrupt isn't registered it shouldn't cause any trouble. That being said, we probably need to go back to the polling version of the hm2_uio driver that was dkms-built, just to be safe.
Then, we need to poll the IRQ status register bit '0' (hm2's status register, not the ARM interrupt controller) in waitirq in a tight spin loop. When it goes true (or times out) we break out, and write clear-interrupt to the hm2 firmware just like we are doing now. It will eat an entire thread, but with the 2 cores I think we're fine. We might have our cake and eat it too in that case: since it would skip the uio_irq portion entirely, we just deal directly with the firmware.
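a rough sketch of what that polling waitirq could look like (purely illustrative: HM2_IRQ_STATUS_ADDR, HM2_CLEAR_IRQ_ADDR and the bit layout are placeholders for whatever the firmware's register map actually defines; llio->read/write are the usual hm2 low-level accessors):

/* polling replacement for waitirq -- spin on the firmware's IRQ status bit,
   then acknowledge it, just like the IRQ-driven path does today */
static int hm2_soc_waitirq_poll(hm2_lowlevel_io_t *llio, long long timeout_ns)
{
    u32 status = 0, ack = 1;
    long long deadline = rtapi_get_time() + timeout_ns;

    do {
        llio->read(llio, HM2_IRQ_STATUS_ADDR, &status, sizeof(status));
        if (status & 0x1)                        /* bit 0: timer IRQ pending */
            break;
    } while (rtapi_get_time() < deadline);

    llio->write(llio, HM2_CLEAR_IRQ_ADDR, &ack, sizeof(ack));  /* clear in firmware */
    return (status & 0x1) ? 0 : -1;              /* -1 = timed out */
}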
CONFIG_DEBUG_INFO=y
Might be worth turning down the debug level, and maybe turning on the quiet option.
The picture is a bit different on the zynq; not sure I see a common pattern yet which warrants action..
enable jessie-backports and install linux-perf
irqtest.hal/5kHz, perf_4.6 top --cpu=0 -d 10
Samples: 1M of event 'cycles:ppp', Event count (approx.): 27839070423
Overhead Shared Object Symbol
19.99% [kernel] [k] cpuidle_enter_state
15.57% [kernel] [k] _raw_spin_unlock_irq
2.27% hm2_soc_ol.so [.] hm2_soc_write
2.16% [kernel] [k] tick_nohz_idle_enter
1.75% [kernel] [k] _raw_spin_unlock_irqrestore
1.62% hostmot2.so [.] hm2_stepgen_instance_prepare_tram_write
1.52% hostmot2.so [.] hm2_ioport_gpio_prepare_tram_write
1.51% hostmot2.so [.] hm2_ioport_gpio_process_tram_read
1.45% hostmot2.so [.] hm2_ioport_update
1.15% hm2_soc_ol.so [.] hm2_soc_read
1.09% hostmot2.so [.] hm2_stepgen_instance_position_control
1.07% [kernel] [k] run_ksoftirqd
1.07% [kernel] [k] vector_swi
1.02% [kernel] [k] walk_stackframe
0.93% [kernel] [k] preempt_count_add
0.87% [kernel] [k] sys_clock_gettime
0.79% [kernel] [k] check_preemption_disabled
0.76% hostmot2.so [.] hm2_stepgen_process_tram_read
0.76% [kernel] [k] unwind_frame
0.75% hostmot2.so [.] hm2_stepgen_prepare_tram_write
0.72% [kernel] [k] schedule
0.71% [kernel] [k] preempt_count_sub
0.70% [kernel] [k] ktime_get_ts64
That aligns with what I mentioned before about the slow bus speed at the processor -> fpga interface. The zynq side is configured with a 100MHz axi interface. I'd have to double check, but I think the cv interface is only clocked at 50MHz, which would double your transfer time before any (if there is any) extra overhead incurred by an avalon->axi merge.
re the tight spin loop.. starvation is always an issue with an RT thread and spinning, and the cycles are lost for good
one way could be to adopt an old idea of PCW and do the following:
- after executing the last funct, calculate how much of the cycle budget is left - we do have the current timestamps already (for example's sake: leftover=300uS)
- use a safe upper limit for latency, (for example margin=100uS)
- execute a wait for (leftover - margin = 200uS) - that would be cycles gained for other threads
- when the wait returns, start spinning for IRQbit set
- if the margin is set to -1, never wait (always spin) - so we can test how effective this is
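a minimal sketch of that wait-then-spin idea, assuming a POSIX RT thread (all names are made up for illustration; hm2_soc_waitirq_poll is the spin helper sketched further up, and clock_nanosleep stands in for whatever sleep primitive the thread flavor provides):

#include <time.h>

/* sketch: sleep away most of the leftover budget, then spin for the IRQ bit;
   margin_ns == -1 means "never wait, always spin" so the effect can be measured */
static void wait_for_timer_irq(hm2_lowlevel_io_t *llio,
                               long long cycle_start_ns,
                               long long period_ns, long long margin_ns)
{
    long long elapsed  = rtapi_get_time() - cycle_start_ns;
    long long leftover = period_ns - elapsed;            /* e.g. ~300uS left */

    if (margin_ns >= 0 && leftover > margin_ns) {
        /* give (leftover - margin) back to other threads instead of spinning */
        struct timespec ts = { 0, (long)(leftover - margin_ns) };
        clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, NULL);
    }
    /* then spin on the firmware IRQ status bit until it is set (or we time out) */
    hm2_soc_waitirq_poll(llio, margin_ns >= 0 ? 2 * margin_ns : period_ns);
}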
On 8/22/2016 10:23 AM, Michael Haberler wrote:
re the tight spin loop.. starvation is always an issue whith an RT thread and spinning, and the cycles are lost for good
The way to make things fast is to avoid reads.
If it seriously becomes an issue to read the required data from the FPGA (which is not a big stretch, considering how fast processors run these days, and that the processor is pretty much stalled for the duration of the required reads), the hardware needs to initiate writes instead of the CPU doing reads:
- Periodic hardware timer tick happens
- Hardware latches all required values at this point in time
- Hardware performs bus-mastering DMA into main memory, typically via a circular buffer or similar mechanism
- Hardware generates an interrupt, letting the CPU and real-time thread know there is new work to be done
The overhead of doing bus-mastering DMA is a pain, but it provides much better performance. Writes are pipelined throughout modern interconnect fabrics, can happen in large bursts, and generally help keep system interconnect throughput high. Reads generally bring everything to a screeching halt for the full bus turn-around time (read requested to data returned).
There are some memory semantics you need to be careful with when doing DMA, but nothing much more than what we already have to do to use circular buffers in HAL on multi-core machines.
For now, switching to memcpy seems like the low-hanging fruit. :)
Charles Steinkuehler
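to make the memory-semantics point above concrete, the RT-side consumer of such a DMA ring could look roughly like this (purely illustrative: the record layout, the ring size and the assumption that the hardware writes seq last are all made up for the sketch):

#include <stdint.h>
#include <string.h>

#define RING_SLOTS 16

/* one latched snapshot per timer tick; the hardware is assumed to write
   regs[] first and seq last, so a changed seq means the slot is complete */
struct hm2_sample {
    uint32_t seq;
    uint32_t regs[64];
};

static volatile struct hm2_sample *ring;   /* coherent memory the FPGA masters into */
static uint32_t last_seen;

/* copy out the newest complete sample, seqlock-style; returns 1 if one was found */
static int take_latest(struct hm2_sample *out)
{
    int got = 0;
    for (int i = 0; i < RING_SLOTS; i++) {
        uint32_t s = ring[i].seq;
        if ((int32_t)(s - last_seen) <= 0)
            continue;                                    /* nothing newer in this slot */
        __sync_synchronize();                            /* read seq before regs[] */
        memcpy(out, (const struct hm2_sample *)&ring[i], sizeof(*out));
        __sync_synchronize();                            /* ...and re-check it afterwards */
        if (ring[i].seq == s) {                          /* slot wasn't overwritten mid-copy */
            last_seen = s;
            got = 1;
        }
    }
    return got;
}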
I don't need the extra performance yet, but I have to say, full-blown DMA between the host and the fpga sounds like fun. I'd have to read the zynq TRM a lot more first...
Maybe after some of the other wishlist things get ticked off...like allowing aux gpio when cutter comp is turned on...