RV32IMC with IBusCachedPlugin

Open MarekPikula opened this issue 4 years ago • 11 comments

Hi, first of all congratulations on your impressive work. I'm currently evaluating different RISC-V cores and VexRiscv is my favourite so far.

I have a problem when trying to configure Briey for the IMC ISA. I've tried Murax before with IBusSimplePlugin(compressedGen = true) and it worked just fine, but it doesn't seem to work with IBusCachedPlugin. When I set compressedGen = true in Briey I get the following error:

[Runtime] SpinalHDL v1.3.6    git head : 9bf01e7f360e003fac1dd5ca8b8f4bffec0e52b8
[Runtime] JVM max memory : 2444.5MiB
[Runtime] Current date : 2019.10.16 15:19:49
[Progress] at 0.000 : Elaborate components
PcManagerSimplePlugin is now useless

**********************************************************************************************
[Warning] Elaboration failed (0 error).
          Spinal will restart with scala trace to help you to find the problem.
**********************************************************************************************

[Progress] at 1.023 : Elaborate components
PcManagerSimplePlugin is now useless
Exception in thread "main" java.lang.Exception: Missing inserts : INSTRUCTION_ANTICIPATED
	at vexriscv.Pipeline$class.build(Pipeline.scala:95)
	at vexriscv.VexRiscv.build(VexRiscv.scala:86)
	at vexriscv.Pipeline$$anonfun$1.apply$mcV$sp(Pipeline.scala:161)
	at vexriscv.Pipeline$$anonfun$1.apply(Pipeline.scala:161)
	at vexriscv.Pipeline$$anonfun$1.apply(Pipeline.scala:161)
	at spinal.core.ClockDomain.apply(ClockDomain.scala:306)
	at spinal.core.Component$$anonfun$prePop$1.apply(Component.scala:124)
	at spinal.core.Component$$anonfun$prePop$1.apply(Component.scala:123)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at spinal.core.Component.prePop(Component.scala:123)
	at spinal.core.Component.delayedInit(Component.scala:138)
	at vexriscv.VexRiscv.<init>(VexRiscv.scala:86)
	at vexriscv.demo.Briey$$anon$3$$anon$4.<init>(Briey.scala:400)
	at vexriscv.demo.Briey$$anon$3.delayedEndpoint$vexriscv$demo$Briey$$anon$3$1(Briey.scala:395)
	at vexriscv.demo.Briey$$anon$3$delayedInit$body.apply(Briey.scala:346)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at spinal.core.ClockingArea.delayedInit(Area.scala:84)
	at vexriscv.demo.Briey$$anon$3.<init>(Briey.scala:346)
	at vexriscv.demo.Briey.delayedEndpoint$vexriscv$demo$Briey$1(Briey.scala:346)
	at vexriscv.demo.Briey$delayedInit$body.apply(Briey.scala:270)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at spinal.core.Component.delayedInit(Component.scala:131)
	at vexriscv.demo.Briey.<init>(Briey.scala:270)
	at vexriscv.demo.Briey$$anonfun$main$1.apply(Briey.scala:497)
	at vexriscv.demo.Briey$$anonfun$main$1.apply(Briey.scala:496)
	at spinal.core.internals.PhaseCreateComponent.impl(Phase.scala:1920)
	at spinal.core.internals.PhaseContext.doPhase(Phase.scala:195)
	at spinal.core.internals.SpinalVerilogBoot$$anonfun$singleShot$10.apply(Phase.scala:2156)
	at spinal.core.internals.SpinalVerilogBoot$$anonfun$singleShot$10.apply(Phase.scala:2154)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at spinal.core.internals.SpinalVerilogBoot$.singleShot(Phase.scala:2154)
	at spinal.core.internals.SpinalVerilogBoot$.apply(Phase.scala:2090)
	at spinal.core.Spinal$.apply(Spinal.scala:311)
	at spinal.core.SpinalConfig.generateVerilog(Spinal.scala:142)
	at vexriscv.demo.Briey$.main(Briey.scala:496)
	at vexriscv.demo.Briey.main(Briey.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intellij.rt.execution.application.AppMainV2.main(AppMainV2.java:131)

I'm using the current master with the IntelliJ IDE.

MarekPikula avatar Oct 16 '19 13:10 MarekPikula

Thanks :)

It isn't really a bug, but a set of incompatible features. Basically, with the default Briey configuration, the cache uses two cycles (https://github.com/SpinalHDL/VexRiscv/blob/master/src/main/scala/vexriscv/demo/Briey.scala#L69), which delivers the instruction in the decode stage. There is also INSTRUCTION_ANTICIPATED, which provides the future value of the decode-stage instruction. This allows the register file to use a synchronous RAM read, with INSTRUCTION_ANTICIPATED as the address, to produce RS1 and RS2 in the decode stage.

The issue with RVC in that case is that the RVC decompression is only done in the decode stage, which makes it impossible to produce the INSTRUCTION_ANTICIPATED value used by the register file.

There are multiple ways to work around that. If you are using an FPGA with distributed RAM capability, I would just switch the RegFilePlugin from SYNC to ASYNC. Otherwise, you can set the IBusCachedPlugin twoCycleRam and twoCycleCache to false. Alternatively, you can set https://github.com/SpinalHDL/VexRiscv/blob/master/src/main/scala/vexriscv/plugin/IBusCachedPlugin.scala#L36 to true in the CPU config, which adds an additional stage to the fetch pipeline and allows INSTRUCTION_ANTICIPATED to be generated.
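For illustration, the first two workarounds might look roughly like this in a Briey-style plugin list. This is only a sketch, assuming the VexRiscv library is on the classpath and that the constructor parameters are named as shown; the cache geometry values are placeholders modelled on the Briey demo, so check them against your own checkout.

```scala
import vexriscv.ip.InstructionCacheConfig
import vexriscv.plugin._

// Sketch only: parameter names and cache values are assumptions,
// verify them against your VexRiscv version.
object RvcCachedSketch {
  // Workaround 2: single-cycle cache, so INSTRUCTION_ANTICIPATED
  // is no longer needed alongside RVC decompression.
  val iBus = new IBusCachedPlugin(
    resetVector = 0x80000000L,
    compressedGen = true,          // enable RVC
    config = InstructionCacheConfig(
      cacheSize = 4096,
      bytePerLine = 32,
      wayCount = 1,
      addressWidth = 32,
      cpuDataWidth = 32,
      memDataWidth = 32,
      catchIllegalAccess = true,
      catchAccessFault = true,
      asyncTagMemory = false,
      twoCycleRam = false,         // was true in the default Briey config
      twoCycleCache = false        // was true in the default Briey config
    )
  )

  // Workaround 1 (alternative): asynchronous register file read,
  // suitable for FPGAs with distributed RAM.
  val regFile = new RegFilePlugin(
    regFileReadyKind = ASYNC,      // instead of SYNC
    zeroBoot = false
  )
}
```

Only one of the two changes should be needed; the second one trades block RAM for distributed RAM in the register file.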

Dolu1990 avatar Oct 16 '19 15:10 Dolu1990

After disabling twoCycleRam and twoCycleCache it works like a charm :+1:

I'd propose adding information about this to the README, and possibly adding a cached IMC configuration to the standard configurations and the size/performance breakdown, since IMC is quite a popular configuration.

Besides, what makes IMC less performant than IM? I'm running CoreMark and it's about 5% slower. It's not a huge difference, but I'm curious, especially since for Syntacore scr1 it's the other way around (IM is about 2% less performant than IMC). VexRiscv is the first SpinalHDL project I'm looking at, so I don't have enough experience to dig through the code and figure it out myself.

MarekPikula avatar Oct 17 '19 10:10 MarekPikula

One performance hit due to RVC, which should impact most implementations (and at least VexRiscv), is that a branch/jump to an unaligned 32-bit instruction forces the fetcher to fetch two words before it can deliver that 32-bit instruction.

You can imagine other cases where it requires reading two different cache lines, or even two different MMU TLB entries.

I don't really know the Syntacore scr1 architecture. Could the RVC performance improvement for CoreMark come from op-fusion? Otherwise I don't really see any reason, except maybe less i$ thrashing, but normally CoreMark should fully fit into a 4 KB i$/d$. But scr1 is cacheless, right?
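A toy model of that penalty, not VexRiscv code: it assumes a 32-bit fetch width and a 32-byte cache line, and the names FetchModel, wordsTouched and crossesLine are made up for this sketch.

```scala
// Toy model, not taken from VexRiscv: counts how many aligned fetch
// words an instruction spans. Fetch width and line size are assumptions.
object FetchModel {
  val WordBytes = 4 // assumed 32-bit fetch word

  /** Number of aligned fetch words covered by an instruction at `pc`. */
  def wordsTouched(pc: Long, lenBytes: Int): Int = {
    val first = pc / WordBytes
    val last  = (pc + lenBytes - 1) / WordBytes
    (last - first + 1).toInt
  }

  /** True when the instruction straddles two cache lines. */
  def crossesLine(pc: Long, lenBytes: Int, lineBytes: Int = 32): Boolean =
    pc / lineBytes != (pc + lenBytes - 1) / lineBytes
}
```

With this model, a 32-bit instruction at 0x100 spans one word, while the same instruction at the half-word-aligned target 0x102 spans two, so a jump landing there needs two fetches before the instruction is complete. At 0x11E it also straddles two 32-byte lines.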

Dolu1990 avatar Oct 17 '19 11:10 Dolu1990

To be honest, I didn't go too deep into the scr1 architecture, but from what I can see there is no caching.

FYI, I'm using scr1 under PULPissimo with their logarithmic interconnect for memory, which is very fast (so caching is not really required). I'll check whether there is any performance difference for Ibex, which I was also evaluating.

MarekPikula avatar Oct 17 '19 12:10 MarekPikula

Same story for Ibex. IM is about 1.5% less performant than IMC. If I find some time, maybe I'll try to fit a cacheless VexRiscv into PULPissimo and run some benchmarks to see what the difference is.

MarekPikula avatar Oct 17 '19 12:10 MarekPikula

> logarithmic interconnect for memory

What is that :D ?

> IM is about 1.5% less performant than IMC

That's weird, I'm curious to understand why XD I mean, technically speaking, RVC is a subset of RV. So maybe that's because of the compiler's creativity?

Dolu1990 avatar Oct 17 '19 12:10 Dolu1990

You can read more about the log interconnect in the article "Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices" (here is a copy I've found if you don't have access to IEEE).

Compiler creativity has nothing to do with it in this particular case, because on all platforms I'm testing, I'm executing the exact same code compiled with the exact same compiler and the exact same flags, so the only difference is the core and SoC platform.

MarekPikula avatar Oct 17 '19 12:10 MarekPikula

I noticed a similar case on the picorv32: when RVC is enabled, Dhrystone performance is slightly lower when running an identical IM32 binary.

tomverbeure avatar Oct 17 '19 12:10 tomverbeure

For me it's exactly the same core (so full IMC), running either IMC or IM code, so the only variable is the compiled code.

MarekPikula avatar Oct 17 '19 13:10 MarekPikula

@Dolu1990 for your reference, here is the log interconnect code: pulp-platform/L2_tcdm_hybrid_interco.

MarekPikula avatar Oct 17 '19 14:10 MarekPikula

And more about log interconnect: https://iis-people.ee.ethz.ch/~arahimi/papers/DATE11.pdf

MarekPikula avatar Oct 18 '19 09:10 MarekPikula