learn-fpga
Congratulations and Ideas
Dear Bruno,
congratulations on squeezing an RV32I core into the Icestick !
I read your Verilog files with joy and I wish to share an idea on how to save a few more LUTs for more peripherals: try a "one-hot" IO address decoder. You have only a few IO registers, so you can reserve one address line for each of your peripheral registers and save the LUTs spent on comparisons with the full IO address. This also allows setting multiple IO registers at once.
You can also insert a hardware random number generator by using a ring oscillator.
Maybe you wish to check out Mecrisp-Ice from mecrisp.sourceforge.net in file mecrisp-ice-1.8/hx1k/icestorm/j1a.v for my peripheral set in use on the Icestick. Mecrisp-Ice is a Forth compiler running on a stack processor, which is a descendant of Swapforth and the J1a CPU by James Bowman. I think you can borrow a few of the ideas !
If you manage to map the SPI flash into the memory bus within the available LUTs, similar to the memory interface in Picosoc, I would be happy to officially port Mecrisp-Quintus (a RISC-V Forth which needs about 24 kB flash and 4 kB RAM) to your FemtoRV32 on the Icestick.
Hats off and best wishes from Germany, Matthias
PS: Completely removing the rdRAM wire in your memory design somehow saved 20 LUTs.
Dear Matthias,
Thank you very much for your comments and ideas, I'm very glad to have some feedback.
- I'll try your "one-hot" IO address : I have up to ten peripherals and 8 IO address lines only, but I can probably reorganize them, or change a bit the memory layout so that everything fits in there.
- Thank you very much for the link to Mecrisp: I was aware of J1 (it was one of my starting points ! great source of inspiration, showing that CPU on IceStick is possible. I also borrowed their UART). I'm fascinated by the designs that pack so much functionality into so tiny devices.
- Yes, mapping the SPI flash would be a must ! I still need to learn a lot (I was not aware that there was so much available space in it).
- About rdRAM, I thought that unused signals were optimized out, good to know, thanks !
Best wishes, -- Bruno
Hi Matthias, I just switched to "one-hot" addressing mode, and yes, it saved a lot of LUTs ! Thank you very much for your comment. I'm now working on squeezing the IO space a bit by merging things (so that the largest offset remains <= 1024, to be able to write to an IO using a single SW instr.).
Dear Bruno,
you are welcome, I am glad these ideas were useful for you.
If you do loads and stores with an offset relative to the zero-register x0, you get quick access to two "zero pages", which split into the very low addresses (positive offset) and the very high addresses (negative offset). A nice place in the memory map for RAM and IO.
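To make the two "zero pages" concrete, here is a small simulation-only sketch. It assumes the standard RV32I 12-bit sign-extended load/store offset; the module and function names are invented for illustration. With x0 as the base register, the effective address is simply the sign-extended offset:

```verilog
// Simulation-only sketch; zero_page_demo and effective_address are hypothetical names.
module zero_page_demo;
  // With base register x0 (always 0), the address is the sign-extended 12-bit offset.
  function [31:0] effective_address;
    input [11:0] imm;
    begin
      effective_address = {{20{imm[11]}}, imm};
    end
  endfunction

  initial begin
    $display("%h", effective_address(12'h000)); // 00000000  lowest address of the low page
    $display("%h", effective_address(12'h7ff)); // 000007ff  highest address of the low page
    $display("%h", effective_address(12'h800)); // fffff800  lowest address of the high page
    $display("%h", effective_address(12'hfff)); // ffffffff  highest address of the high page
    $finish;
  end
endmodule
```

So each page is 2 KiB: one at the very bottom and one at the very top of the 32-bit address space.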
The Icestick is a rewarding target, squeezing designs into it feels like the FPGA equivalent of a sizecoding contest. You are really pushing the envelope here !
Matthias
"zero-page" sounds very 6502-ish to me :-) To me, there is still a lot of mystery about what eats-up LUTs / what saves LUTs, sometimes the behavior is very counter-intuitive, any hint / general rules for that ?
Same observation here, it's difficult to predict, and details sometimes change when updating to a newer Yosys release.
Try to imagine how you would implement something in TTL logic gates with a soldering iron. The "one hot" address decoder is just "io-write and address_line[x]", one gate, one LUT. HX1K has LUTs with 4 inputs, which is a good fit. If you have comparisons with multiple bits, you will likely need more LUTs. Greater/less comparisons require a carry chain, which usually requires more logic than equal/unequal.
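A minimal sketch of such a one-hot decoder, for illustration only. The register names, widths and address-bit assignments are assumptions, not the FemtoRV32 or Mecrisp-Ice layout:

```verilog
// Hypothetical one-hot IO write decoder: one address line per register.
module onehot_io_demo (
  input  wire       clk,
  input  wire       io_wr,        // high for one cycle on an IO write
  input  wire [3:0] io_addr,      // one bit per peripheral register
  input  wire [7:0] wdata,
  output reg  [7:0] leds,
  output reg  [7:0] uart_tx_data
);
  // Each select is a single AND gate instead of a multi-bit address comparison.
  wire leds_wr = io_wr & io_addr[0];
  wire uart_wr = io_wr & io_addr[1];

  always @(posedge clk) begin
    if (leds_wr) leds         <= wdata;
    if (uart_wr) uart_tx_data <= wdata;
  end
endmodule
```

A write with both io_addr[0] and io_addr[1] set updates both registers in the same cycle, which is the "multiple IO registers at once" effect mentioned earlier in the thread.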
The document "iCE 2017-08 Technology Library" (www.latticesemi.com/view_document?document_id=52206) will give you an overview of the functions directly available on the FPGA. If you can map your logic directly to these primitives, you'll get a good resource mileage.
Additionally, always specify the widths of everything. If not specified, Verilog mandates using the widest width possibly involved, which for example results in a logic operation carried out with full 32 bits internally, consuming LUTs, with the result truncated afterwards to the desired output width.
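A small sketch of the width point, with invented names. It assumes nothing beyond standard Verilog sizing rules, under which an unsized literal such as 1 is at least 32 bits wide. Both outputs compute the same value; whether the synthesis tool trims the wide version back down is tool-dependent:

```verilog
// Hypothetical example: same result, different internal expression width.
module width_demo (
  input  wire [7:0] a,
  input  wire [7:0] b,
  output wire [7:0] y_loose,
  output wire [7:0] y_sized
);
  // The unsized 1 widens the whole expression to at least 32 bits;
  // the sum is truncated to 8 bits only on assignment.
  assign y_loose = a + b + 1;

  // With a sized literal every operand is 8 bits wide.
  assign y_sized = a + b + 8'd1;
endmodule
```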
Yosys usually removes unused parts of the logic, but I assume the memory read wire somehow was pattern matched into a standard block RAM implementation and constant folding optimisation failed therefore. Completely unused blocks are optimised away; but try removing unused logic which is connected at one end to logic which is in use.
Reordering of source lines, especially in CASEZ constructs, sometimes yields mysterious results in terms of LUT usage. I think this is because the internal ordering affects further optimisation steps during synthesis. Specify "don't care" values with "?".
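For reference, a sketch of a casez with "?" don't-care bits. The opcode patterns and signal names here are made up and do not come from either project:

```verilog
// Hypothetical decoder: '?' marks bits the match should ignore.
module casez_demo (
  input  wire [3:0] op,
  output reg  [1:0] sel
);
  always @(*) begin
    casez (op)
      4'b1???: sel = 2'd3;   // only the top bit matters
      4'b01??: sel = 2'd2;
      4'b001?: sel = 2'd1;
      default: sel = 2'd0;
    endcase
  end
endmodule
```

The first matching item wins, so reordering items with overlapping patterns changes the behaviour, not only the LUT count.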
Matthias
PS: Yes, it's the same trick as on 6502.
@Mecrisp , do you know where I can find some documentation about the SPI flash used in the IceStick ? (I tried interfacing a design from: https://github.com/smunaut/ice40-playground/blob/master/cores/spi_flash/rtl/spi_flash_reader.v without success, but I must admit I do not understand what I'm doing)
Thank you in advance, -- Bruno.
Yes, it is a vanilla SPI flash chip, part number N25Q032A.
https://www.micron.com/-/media/client/global/Documents/Products/Data%20Sheet/NOR%20Flash/Serial%20NOR/N25Q/n25q_32mb_1_8v_65nm.pdf
If you are fluent in Forth, have a look at mecrisp-ice-1.8/hx1k/nucleus.fs for how to read a sector from this chip.
Thanks ! I have seen the Forth functions in mecrisp, but I do not understand Forth ! (but I'll try, looks like my Hewlett Packard calculator, stack based, push operands then operation), Any reference with a good introduction to Forth ? (I have a feeling that it will be easier than digging in micron's datasheet :-)
A small intro to give you an idea and the classic introductory text:
https://jeelabs.org/article/1612b/ https://www.forth.com/starting-forth/
But I think it would be much easier for you to search for Arduino code to interface vanilla SPI flash memory chips, as they have a standard interface.
Here is a better datasheet. You need the "read data" command 03 (and usually the "release power-down" command AB, which you can omit on Icestick).
https://www.winbond.com/resource-files/w25q128jv%20spi%20revc%2011162016.pdf
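As a rough illustration of the "read data" command, here is a sketch that sends 03 followed by a 24-bit address and then clocks in one byte. It assumes SPI mode 0 and MSB-first transfers; the module and port names are invented, the timing is not tuned against either datasheet, and it has not been tried on an Icestick:

```verilog
// Hypothetical sketch: read one byte from an SPI flash with command 03h.
// One SCK period takes two clk cycles.
module spi_read_byte (
  input  wire        clk,
  input  wire        rst,      // synchronous reset
  input  wire        start,    // pulse to begin a read while idle
  input  wire [23:0] addr,     // 24-bit byte address in the flash
  input  wire        miso,
  output wire        mosi,
  output reg         sck,
  output wire        cs_n,
  output reg         busy,
  output reg  [7:0]  data      // valid once busy returns to 0
);
  reg [31:0] shift_out;        // command byte followed by the address
  reg [5:0]  bits_left;        // 8 + 24 bits sent, then 8 bits received

  assign mosi = shift_out[31]; // MSB first
  assign cs_n = ~busy;         // chip select is active low

  always @(posedge clk) begin
    if (rst) begin
      busy <= 1'b0;
      sck  <= 1'b0;
    end else if (!busy) begin
      if (start) begin
        shift_out <= {8'h03, addr};
        bits_left <= 6'd40;
        busy      <= 1'b1;
      end
    end else if (!sck) begin
      // Rising SCK edge: the flash samples mosi, we sample miso.
      sck  <= 1'b1;
      data <= {data[6:0], miso};
    end else begin
      // Falling SCK edge: present the next outgoing bit.
      sck       <= 1'b0;
      shift_out <= {shift_out[30:0], 1'b0};
      bits_left <= bits_left - 6'd1;
      if (bits_left == 6'd1) busy <= 1'b0;
    end
  end
endmodule
```

The bits sampled while the command and address go out are simply shifted through data and discarded; only the last eight samples remain when busy drops.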
Thank you very much for all these links, it helped a lot ! Now we have mapped IO to read the SPI flash. Coming next: memory interface with address valid<->RAM ready handshaking, to be able to directly execute code from there (hope it won't eat up too many LUTs...)
I found another place to save a few LUTs:
3'b100: out = ($signed(in1) < $signed(in2)); // BLT
3'b101: out = ($signed(in1) >= $signed(in2)); // BGE
3'b110: out = (in1 < in2); // BLTU
3'b111: out = (in1 >= in2); // BGEU
Each of these comparisons requires a 32/33 bit subtraction, but all conditions can be generated by using one subtraction only:
wire [16:0] minus = {1'b1, ~st0} + st1 + 1;
wire signedless = st0[15] ^ st1[15] ? st1[15] : minus[16];
wire unsignedless = minus[16];
wire zeroflag = minus[15:0] == 0;
9'b0_011_00111: st0N = {16{zeroflag}}; // =
9'b0_011_01000: st0N = {16{signedless}}; // <
9'b0_011_01100: st0N = minus[15:0]; // -
9'b0_011_01111: st0N = {16{unsignedless}}; // u<
You get the idea :-)
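For the 32-bit case, the same idea might look like the sketch below. It is an adaptation of the 16-bit lines above, not code from either project; only in1 and in2 are taken from the branch code quoted earlier, the other names are invented:

```verilog
// Hypothetical 32-bit version: one 33-bit subtraction feeds all branch predicates.
module branch_compare (
  input  wire [31:0] in1,
  input  wire [31:0] in2,
  output wire        eq,    // in1 == in2            (BEQ; invert for BNE)
  output wire        lt,    // signed   in1 < in2    (BLT; invert for BGE)
  output wire        ltu    // unsigned in1 < in2    (BLTU; invert for BGEU)
);
  // 33-bit in1 - in2: bit 32 is set exactly when the subtraction borrows.
  wire [32:0] minus = {1'b0, in1} + {1'b1, ~in2} + 33'd1;

  assign ltu = minus[32];
  // Different signs: the negative operand is the smaller one.
  // Same signs: the signed and unsigned orders agree.
  assign lt  = (in1[31] ^ in2[31]) ? in1[31] : minus[32];
  assign eq  = (minus[31:0] == 32'd0);
endmodule
```

minus[31:0] is also the result of SUB, which is why the same wires are candidates for sharing with the ALU.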
Let's try that ! (smart and crazy at the same time, I love it !). I'm pretty sure I won't get it right the first time though... (I'm always confused when handling signed quantities)
Hey, thanks ! Have fun !
I usually need a few tries for properly handling signed values, too. But there are maps into these mostly uncharted territories:
If you like tricks like these, I wish to recommend you this
https://graphics.stanford.edu/~seander/bithacks.html
and the book "Hacker's Delight" by Warren. It's full of small tricks which are very useful for compiler writers and processor designers.
https://en.wikipedia.org/wiki/Hacker's_Delight
Just tried your elegant trick for the branch predicates, and made it work, however it uses 37 more LUTs (???). LUT golfing is something between art and sorcery it seems ! Still looking for subtracts that I could "factor" in the design, it seems that there is a couple of them...
(BTW, thank you for the two links, excellent !!)
I also want to say congratulations on making a RISC-V that fits on the icestick! That is super impressive. If you would like to port your work to run on the Fomu (https://fomu.im) which has an iCE40UP5K, send me an email to [email protected] and I'll send you some!
Have you tried playing with Yosys settings -- there are a lot of options you can tell Yosys to give to ABC to change the area versus frequency trade-offs.
Have you looked at serv from @olofk -- it is a bit serial based RISC-V implementation and @olofk has been slowly trimming the core down -- see http://corescore.store/
It might also be interesting to integrate your RISC-V core into the LiteX environment. It already supports quite a number of different RISC-V (and other architecture) cores. See

@mecrisp, I am completely confused with what takes LUT and what does not: Cleaning up a bit the implementation, I wanted to use parameters and 'generate' statement instead of macros, and just adding this parameter to the ALU and without changing anything else eats up 100 LUTs !?!??
module NrvALU #( parameter [0:0] TWOSTAGE_SHIFTER = 0 ) ( input clk, input [31:0] in1, input [31: ...
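For readers unfamiliar with the construct: a parameter combined with 'generate' selects between implementations at elaboration time, where a macro does so in the preprocessor. The sketch below is hypothetical. It is a small combinational shifter with invented names, not the NrvALU code, and says nothing about the LUT counts discussed here:

```verilog
// Hypothetical example of parameter + generate replacing an `ifdef.
module shifter_demo #(
  parameter [0:0] TWOSTAGE = 0
) (
  input  wire [31:0] in,
  input  wire [4:0]  shamt,
  output wire [31:0] out
);
  generate
    if (TWOSTAGE) begin : two_stage
      // Coarse shift by a multiple of 4, then a fine shift by 0..3.
      wire [31:0] coarse = in << {shamt[4:2], 2'b00};
      assign out = coarse << shamt[1:0];
    end else begin : one_stage
      assign out = in << shamt;
    end
  endgenerate
endmodule
```

An instance would select the variant with, for example, shifter_demo #(.TWOSTAGE(1)) u_shift (...).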
@mithro
"lot of options you can tell Yosys to give to ABC"
I tried adding "-abc2 -relut" to the synth_ice40 command, according to this documentation:
http://www.clifford.at/yosys/cmd_synth_ice40.html
On my project, Mecrisp-Ice 1.8, it improved from 1273 to 1224 ICESTORM_LC for HX1K.
But this is not "lot of options". I am surely missing something. Could you please point me to a configuration for synthesis with aggressive optimisation for size ?
Hi Bruno,
I am sorry I cannot give more guidance. It's a quite erratic random walk for me also. Could you please add your code with the changed branch predicates for me to try a few things ?
Matthias
Hi Matthias,
I've pushed the code. You can activate it by uncommenting the following line in femtosoc.v:
`define NRV_TRY_COMPACT_PREDICATES
If this works, we can probably play the same trick in the ALU, which computes in1-in2, the signed comparison and the unsigned comparison. (On my side, I'll try playing with abc flags, thank you for the link !)
-- B
The original version as-is weighs in at 1332 LUTs here and doesn't fit on Icestick. Then I activated the NRV_TRY_COMPACT_PREDICATES and LUT usage dropped to 1262. Further adding "-abc2 -relut" to Yosys gave 1265 LUTs.
I am currently using
yosys --version
Yosys 0.9+2406 (git sha1 UNKNOWN, clang 7.0.1-8 -fPIC -Os)
It seems as if this varies a lot with Yosys revisions.
Can you use the same COMPACT_PREDICATES wires for sub, slt and sltu opcodes, too ? I am not sure what the aluInSel1 and aluInSel2 wires do when executing branches.
Weird, on my side it always increases LUT count, and I have the same version of YOSYS (but compiled with a different CLANG):
yosys --version
Yosys 0.9+2406 (git sha1 UNKNOWN, clang 9.0.1-12 -fPIC -Os)
Note: sometimes the order of things / name of things change the LUT count, and different compilers may order things differently (C++ std::map, std::set etc...), we observed that already. I'm keeping the option for now, and will add a similar option for the ALU.
-- B
P.S. Which devices did you activate ? Did you activate NRV_TWOSTAGE_SHIFTER as well ?
I took your code as-is and activated NRV_TRY_COMPACT_PREDICATES only.
With NRV_TWOSTAGE_SHIFTER activated as well along with NRV_TRY_COMPACT_PREDICATES, I get 1246 LUTs (with -abc2 -relut) or 1261 LUTs (without).
On my side here is what it gives (so much difference !)
.------------------- NRV_TRY_COMPACT_PREDICATES
|    .-------------- NRV_TWOSTAGE_SHIFTER
|    |
OFF  OFF : 1227
OFF  ON  : 1285
ON   OFF : 1264
ON   ON  : 1339
Oh no ! I hope "mithro" can comment on this.
One more idea to try might be to merge branches & alu opcodes and use a dedicated adder for PC instead. This would save three subtractions for sub, slt and sltu and add one addition for PC handling. But I have no clue on the total effect on LUT usage, as multiplexers need gates, too.
I am quite sure we are using different minor revisions of Yosys, even though they report the same 0.9+2406 version string. Using a different clang to compile Yosys should not alter its algorithms.
@mithro I am missing a human-readable output which tells how resources are distributed across the design, to give better feedback for manual optimisation.
"Using a different clang to compile YOSYS should not alter its algorithms"
It should not, but I'm pretty sure it does ! Some explanations: if you use a C++ std::set, depending on the version of the compiler and libc++, the order of the elements in the set may be different, and it seems that YOSYS is quite sensitive to that. I observed that changing the names of some regs and wires gives a completely different LUT count !