
Provide some API to speed up SpinalSim hardware signal read

Dolu1990 opened this issue 2 years ago • 10 comments

I just measured the time needed to read a signal value using the Verilator backend:

bt1.toLong   => 40 ns

vs

manager.getLong(signal1) => 5 ns

Redundant code avoided:

        val manager = SimManagerContext.current.manager
        val signal1 = manager.raw.userData.asInstanceOf[ArrayBuffer[Signal]](bt1.algoInt)

So, adding some API to provide optimized signal access could really help speed things up for performance-critical testbenches.

CPU used for the test: AMD 5800X3D
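
For reference, a minimal sketch of how such a comparison could be reproduced (hedged: dut.io.a and the loop count are placeholders, and the snippet assumes it runs inside an already running SpinalSim testbench):

import scala.collection.mutable.ArrayBuffer
import spinal.core.sim._
import spinal.sim._

// Placeholder DUT signal; any BitVector of the design works
val bt1 = dut.io.a

// Resolve the raw simulator Signal once (the "redundant code" shown above)
val manager = SimManagerContext.current.manager
val signal1 = manager.raw.userData.asInstanceOf[ArrayBuffer[Signal]](bt1.algoInt)

def bench(label: String)(body: => Unit): Unit = {
  val iterations = 1000000
  val t0 = System.nanoTime()
  var i = 0
  while (i < iterations) { body; i += 1 }
  println(f"$label: ${(System.nanoTime() - t0).toDouble / iterations}%.1f ns/call")
}

bench("bt1.toLong")               { bt1.toLong }               // ~40 ns in the measurement above
bench("manager.getLong(signal1)") { manager.getLong(signal1) } // ~5 ns in the measurement above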

Dolu1990 avatar Jul 20 '23 14:07 Dolu1990

Something like:

// Once in the sim
val proxy = dut.mem.node.bus.a.address.simProxy()
..
// Many times in the sim
val value = proxy.toLong // 5 ns overhead instead of 40

SimProxy being in the spinal.core.sim package:

  implicit class SimBitVectorPimper(bt: BitVector) {
    class SimProxy(bt: BitVector) {
      // Resolve the SimManager and the raw simulator Signal once, at proxy creation time
      val manager = SimManagerContext.current.manager
      val signal = manager.raw.userData.asInstanceOf[ArrayBuffer[Signal]](bt.algoInt)
      val alwaysZero = bt.getBitsWidth == 0
      // Cheap per-call read: no ThreadLocal / ArrayBuffer lookup anymore
      def toLong = if (alwaysZero) 0 else manager.getLong(signal)
    }
    def simProxy() = new SimProxy(bt)
  }
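
A hedged usage sketch of what a performance-critical testbench loop could then look like (dut.io.rsp.payload and the cycle count are made up for illustration):

import spinal.core.sim._

// Resolve the raw Signal once, outside of the hot loop
val payloadProxy = dut.io.rsp.payload.simProxy()

var accumulator = 0L
for (_ <- 0 until 1000000) {
  dut.clockDomain.waitSampling()
  accumulator += payloadProxy.toLong // ~5 ns per read instead of ~40 ns
}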

Dolu1990 avatar Jul 20 '23 14:07 Dolu1990

I think the ClockDomain APIs such as waitSampling would speed up even more if changed like this.

xiaochuang-lxc avatar Aug 05 '23 16:08 xiaochuang-lxc

The threadful API would not really get faster, as most of the overhead there comes from the JVM thread pack/unpack and thread switching, I guess.

Dolu1990 avatar Aug 05 '23 16:08 Dolu1990

I did a very simple test by overriding waitSampling(); it speeds things up: test0: 100000 calls, 1092 ms -> 1027 ms; test1: 1000000 calls, 9868 ms -> 9112 ms.
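
The benchmark itself is not shown in the thread; a rough sketch of that kind of measurement could look like this (assuming a compiled DUT with a forked clock stimulus):

import spinal.core._
import spinal.core.sim._

// Time n calls of waitSampling() on a given clock domain
def timeWaitSampling(cd: ClockDomain, n: Int): Long = {
  val t0 = System.currentTimeMillis()
  var i = 0
  while (i < n) { cd.waitSampling(); i += 1 }
  System.currentTimeMillis() - t0
}

println(s"test0: ${timeWaitSampling(dut.clockDomain, 100000)} ms")  // reported: 1092 ms -> 1027 ms
println(s"test1: ${timeWaitSampling(dut.clockDomain, 1000000)} ms") // reported: 9868 ms -> 9112 ms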

xiaochuang-lxc avatar Aug 05 '23 17:08 xiaochuang-lxc

Ahhh, I was expecting less difference ^^

Dolu1990 avatar Aug 05 '23 17:08 Dolu1990

I just did the check by modifying forkStimulus of a ClockDomain; it gives ~10% in a testbench that does not do much apart from that clock... I guess core stuff like the Stream/Flow drivers could also benefit from using a proxy...

OT: When looking at that stuff I checked the difference between using sleep in an endless loop vs. setting up repeated calls with delayed; that performance difference was pretty astounding: sleep is ~20 times slower...
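
For illustration, the two patterns being compared are roughly these (hedged sketch; the period of 10 time units and the loop bodies are arbitrary):

import spinal.core.sim._

// Threaded variant: endless loop in a SimThread, one thread switch per period
fork {
  while (true) {
    // periodic work here
    sleep(10)
  }
}

// Callback variant: reschedule itself via delayed, no thread switch per period
def tick(): Unit = {
  // periodic work here
  delayed(10)(tick())
}
delayed(10)(tick())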

andreasWallner avatar Aug 06 '23 22:08 andreasWallner

Is it possible to modify the logic of toLong to avoid getting things too complicated?

Readon avatar Aug 17 '23 09:08 Readon

How about using the scala-inline plugin with an @inline definition to speed up the toLong method?
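
For reference, the suggestion would roughly look like this (hedged: @inline is only a hint to the Scala optimizer and needs the inliner enabled through the -opt compiler flags; whether it helps here is untested):

import scala.collection.mutable.ArrayBuffer
import spinal.core._
import spinal.sim._

class SimProxy(bt: BitVector) {
  private val manager    = SimManagerContext.current.manager
  private val signal     = manager.raw.userData.asInstanceOf[ArrayBuffer[Signal]](bt.algoInt)
  private val alwaysZero = bt.getBitsWidth == 0

  // Hot accessor marked as an inlining candidate
  @inline def toLong: Long = if (alwaysZero) 0L else manager.getLong(signal)
}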

Readon avatar Aug 17 '23 10:08 Readon

Is it possible to modify the logic of toLong to avoid getting things too complicated?

As far as I understand: it is currently as close as possible to the toLong method, but the sim functionality is provided by e.g. the SimBitVectorPimper, and a new one of them is created for each implicit conversion that happens in the code (i.e. every use of a BitVector where you call a function from the Pimper). I think there is no place to store the proxy persistently and thereby avoid the ThreadLocal that is the bottleneck here (as far as I've understood).
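
To make that concrete (hedged, dut.io.a is a placeholder): every pimped read builds a fresh wrapper and goes through the SimManagerContext ThreadLocal, which is exactly what a stored proxy would sidestep:

// What a pimped read effectively expands to, on every call:
val v1 = new SimBitVectorPimper(dut.io.a).toLong // wrapper allocation + ThreadLocal lookup

// What the proposed proxy allows instead:
val proxy = dut.io.a.simProxy() // lookups happen once, here
val v2 = proxy.toLong           // plain field accesses afterwards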

andreasWallner avatar Aug 17 '23 20:08 andreasWallner

@andreasWallner

OT: When looking at that stuff I checked the difference between using sleep in an endless loop vs. setting up repeated calls with delayed; that performance difference was pretty astounding: sleep is ~20 times slower...

Right, unfortunately, the JVM doesn't provide any support for coroutines / user-space threads, so the only way to implement that kind of feature was to use JVM threads and switch between them :/

The SpinalSim threaded API is really slow compared to the callback-based API.

How about using the scala-inline plugin with an @inline definition to speed up the toLong method?

Could be tried.

Dolu1990 avatar Aug 17 '23 20:08 Dolu1990