
How to port the SIMTight project to PYNQ-Z2 FPGA board?

Open Honourable-A opened this issue 1 year ago • 9 comments

Hi, as you know, the DE10-Pro FPGA is a very costly board. I have a PYNQ-Z2 FPGA board (board details are here: https://www.tulembedded.com/FPGA/ProductsPYNQ-Z2.html). I want to do some research on SIMTight. Please tell me how I can port the project to this low-end board. I am asking for the exact steps because I am not an expert in porting FPGA designs.

Honourable-A avatar Oct 19 '22 11:10 Honourable-A

Hi, unfortunately we don't currently have the resources to port or maintain a port to Xilinx devices. Perhaps the ability to run in simulation is still useful to you? I foresee three issues in porting:

  1. Quad-port BRAMs available on Stratix 10 may have like-for-like replacements in modern Xilinx devices but possibly not PYNQ. This is not a synthesis issue (there are pure Verilog versions of these components available for any FPGA) but an efficiency one (the pure Verilog components may map down to registers rather than BRAMs).

  2. The DRAM bandwidth on the PYNQ is lower, so one would probably halve the DRAM bus width and the number of vector lanes in Config.h.

  3. We use an Intel clock-crossing primitive to put the CPU and SIMT cores in different clock domains, but this isn't really necessary and could simply be removed.

mn416 avatar Oct 20 '22 08:10 mn416

Hi, thanks for your suggestions. I have a question about simulation, since you mentioned it. I understand that there is a SIMTight simulator, but it probably simulates only the CPU and not the SIMT cores. Please correct me if my understanding is wrong.

Honourable-A avatar Oct 20 '22 09:10 Honourable-A

It simulates the entire SoC including CPU, SIMT core, memory subsystem, UART and DRAM.

The drawback is that simulation is (of course) slow compared to FPGA. Therefore, in simulation the benchmarks are run only for small data-set sizes. This can lead to underloading of the system, and a dip in IPC. So it may be desirable to increase the data-set sizes slightly in simulation until the point where the benchmarks are performing at an IPC level close to the following level obtained from FPGA:

Samples/VecAdd (build): ok
Samples/VecAdd (run): ok [IPC=29.26,Instrs=9126880,Cycles=311871,DRAMAccs=189100,Retries=23227,Susps=0]
Samples/Histogram (build): ok
Samples/Histogram (run): ok [IPC=31.14,Instrs=7153216,Cycles=229718,DRAMAccs=32994,Retries=4702,Susps=0]
Samples/Reduce (build): ok
Samples/Reduce (run): ok [IPC=31.56,Instrs=6358334,Cycles=201496,DRAMAccs=64101,Retries=733,Susps=0]
Samples/Scan (build): ok
Samples/Scan (run): ok [IPC=30.33,Instrs=222357876,Cycles=7330080,DRAMAccs=162304,Retries=45776,Susps=0]
Samples/Transpose (build): ok
Samples/Transpose (run): ok [IPC=31.28,Instrs=5648320,Cycles=180567,DRAMAccs=50240,Retries=2481,Susps=0]
Samples/MatVecMul (build): ok
Samples/MatVecMul (run): ok [IPC=28.88,Instrs=10864608,Cycles=376171,DRAMAccs=139968,Retries=5969,Susps=0]
Samples/MatMul (build): ok
Samples/MatMul (run): ok [IPC=31.40,Instrs=144054240,Cycles=4588073,DRAMAccs=89472,Retries=82750,Susps=0]
InHouse/BlockedStencil (build): ok
InHouse/BlockedStencil (run): ok [IPC=27.01,Instrs=48971680,Cycles=1812934,DRAMAccs=212416,Retries=10757,Susps=0]
InHouse/StripedStencil (build): ok
InHouse/StripedStencil (run): ok [IPC=31.45,Instrs=35541920,Cycles=1129937,DRAMAccs=175360,Retries=2345,Susps=0]
InHouse/VecGCD (build): ok
InHouse/VecGCD (run): ok [IPC=4.23,Instrs=10955517,Cycles=2591078,DRAMAccs=20350,Retries=892,Susps=0]
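
As a rough aid for that comparison, here is a minimal Python sketch that parses the `(run)` lines above and recomputes IPC as Instrs divided by Cycles. It is not part of SIMTight: the function names and the `tolerance` threshold are hypothetical, and it assumes the simulator prints run lines in the same format as the FPGA log.

```python
import re

# Matches lines such as:
#   Samples/VecAdd (run): ok [IPC=29.26,Instrs=9126880,Cycles=311871,...]
RUN_LINE = re.compile(r"^(?P<name>\S+) \(run\): ok \[(?P<stats>[^\]]*)\]")


def parse_run_line(line):
    """Return (benchmark name, dict of counters), or None for non-run lines."""
    m = RUN_LINE.match(line.strip())
    if m is None:
        return None
    stats = {}
    for field in m.group("stats").split(","):
        key, value = field.split("=")
        stats[key] = float(value)
    return m.group("name"), stats


def ipc(stats):
    """Instructions per cycle, recomputed from the raw counters."""
    return stats["Instrs"] / stats["Cycles"]


def underloaded(sim_stats, fpga_stats, tolerance=0.9):
    """True if simulated IPC falls below tolerance * FPGA IPC.

    The 0.9 default is an arbitrary illustrative threshold, not a
    value recommended by the SIMTight authors.
    """
    return ipc(sim_stats) < tolerance * ipc(fpga_stats)
```

One possible use: parse the FPGA log once as a reference, parse each simulation run, and grow the data-set size for any benchmark where `underloaded` still returns `True`.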

mn416 avatar Oct 20 '22 10:10 mn416

Thanks again for the clarification. Is there any documentation or user guide for this simulator? My intention is to develop an OS for SIMTight, but I am not sure whether I can use the simulator for that, or how to use it.

Honourable-A avatar Oct 20 '22 11:10 Honourable-A

These are the only docs at the moment:

  • https://github.com/CTSRD-CHERI/SIMTight/blob/master/README.md
  • https://github.com/CTSRD-CHERI/SIMTight/blob/master/doc/NoCL.md

The first one does explain how to use the simulator. The second one discusses software interfaces.

mn416 avatar Oct 20 '22 13:10 mn416

Can you please tell me what the Mailbox and the ITCM are in the SoC diagram? Also, please tell me how the CPU and the SIMT core are connected. Is it possible to run applications on the CPU and the SIMT core at the same time? There is a UART (USB) connection to the CPU. What is the purpose of this connection? Thanks

Honourable-A avatar Nov 17 '22 16:11 Honourable-A

We hope to improve SIMTight's documentation over the next year. Hopefully, I will be able to address such questions as part of that process.

mn416 avatar Nov 20 '22 04:11 mn416

Thanks for your answer. I have another question, about scalarisation. How do you implement dynamic scalarisation in hardware? Is it detected in the simple (host) core or in the SIMT core? Is there any existing literature on dynamic scalarisation that you can direct me to? As per the description, the entire warp is executed on a single execution unit in a single cycle because of scalarisation. But please tell me what a single execution unit means. Is it a single hardware thread inside a block? Also, according to the description, it operates in parallel with the main vector pipeline. Please tell me how this is done, because currently only synchronous kernel invocation is available. That means the host can run only one kernel and waits until it finishes, and a scalar-optimised kernel must finish before any other kernel can run. Also, what is the main vector pipeline? I know that I have asked too many questions, but if you can kindly shed some light, that will be very helpful.

Honourable-A avatar Nov 21 '22 10:11 Honourable-A

Again, I'll try to address these questions in the upcoming documentation process. Briefly:

  • Regarding existing work on scalarisation, there is lots. To mention two: this GPGPU architecture book and this ISCA'13 paper.
  • SIMTight's SIMT core contains a scalar pipeline and a vector pipeline, both independent of the host CPU which is not part of the SIMT core in any way.

mn416 avatar Nov 21 '22 11:11 mn416