Multiplexer Stress Test
Personally I think it is brilliant that GateMate actually has multiplexers. Using 31 traditional LUT-4s to multiplex 32 bits seems quite nuts to me. 11 CPEs working as MUX-4s seems much better. But then with GateMate, I get lost because of the lack of detailed information. Take this image for example.
There is some hidden magic in the “Logic Lut Tree”. If it were just a LUT tree, it would make sense to me. But sometimes I think it is used as a multiplexer. And sometimes parts of the LUT Tree are used for the adders and multipliers. I suspect that there is some brilliance going on here, where the same circuits can be used in vastly different ways. But I am stupid. I cannot quite see it. If you could do a diagram of the entire CPE, and then highlight the parts that are used in each case that would be brilliant. Who knows, maybe some of us would come up with other ways to use the pieces together.
Same thing holds for networking nodes. It is not clear to me where the bit flips occur, and I am not able to reason about why you cannot do 6-input muxes. Maybe I just need to read that section more carefully.
And then with the CCL2T5, I am again confused. Is it really as good as a regular LUT-4? Please persuade me. Is it different? If so, please tell me how.
From a sales and marketing perspective it is really important to be clear about this stuff. Many developers will put up with lots of small irritating problems if we are convinced that you have a great idea here. I suspect that there is some brilliance in this design, but I am not entirely convinced.
Please convince me.
Let's try to, with CologneChip help, prove the brilliance point, and construct a test that would bring to life GateMate's hidden treasures!
Such a test could be fully customized for GateMate, looking to squeeze the last drop out of its unique advantages. It would be designed and fine-tuned to let GateMate show its best face. It could tap into the 8-input logic capability, the muxing riches, the math functions, or whatever else @pu-cc and we collectively believe would make a difference.
We would then expose a couple of the mainstream FPGAs in the 20K LUT category to that exact same test that was custom-constructed for GateMate...
I.e., stack the deck? Sorry, had to giggle.
Rather than stacking the deck, it is more about identifying the market niche for this interesting device.
Please everyone contribute your ideas to this project. Often one person's crazy idea will trigger a very reasonable suggestion from another person. I will do mine for Math.
To save energy, biological systems use logarithmic representation, and many senses operate on a logarithmic scale. We can hear differences of 1 decibel in amplitude, which corresponds to a sound about 1.26 times the power. In frequency, an octave is divided into 12 notes, and each note into 100 cents. For frequencies above 5 kHz, we can distinguish 10 cents, or 1/120 of an octave, a little less than a 1% change in frequency. For light we can distinguish a change of 10% in amplitude. Of course color uses 3 different sensors and so works differently. When the just-noticeable difference is proportional to the signal size, that implies a logarithmic number system.
Without going into detail about how to use Gaussian logarithms (also called fixed-point logarithms) for multiplication, division, addition and subtraction, I think this device is great for them. Basically one needs to do a table lookup, with some interpolation. Say 12-bit logarithmic math, just a little larger than the hard-core 9-bit adders on other devices.
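The table-lookup-with-interpolation idea can be sketched in a few lines of Python. The base-2 formulation, table size and linear interpolation below are my own illustrative choices, not a GateMate mapping:

```python
import math

# Gaussian-logarithm addition: given a = log2(x) and b = log2(y),
# log2(x + y) = max(a, b) + log2(1 + 2^-|a - b|).
# In hardware the correction term is a small lookup table on d = |a - b|,
# modeled here with linear interpolation between table entries.

TABLE_BITS = 5                       # 32 intervals over d in [0, 8)
D_MAX = 8.0                          # beyond this the correction is ~0
STEP = D_MAX / (1 << TABLE_BITS)
TABLE = [math.log2(1 + 2 ** -(i * STEP)) for i in range((1 << TABLE_BITS) + 1)]

def log_add(a, b):
    """Approximate log2(2^a + 2^b) via table lookup plus interpolation."""
    hi, lo = max(a, b), min(a, b)
    d = hi - lo
    if d >= D_MAX:
        return hi
    i = int(d / STEP)
    frac = d / STEP - i
    corr = TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])
    return hi + corr

def log_mul(a, b):
    """Multiplication in the log domain is just addition."""
    return a + b

# Example: 3 * 5 and 3 + 5, carried out on log2 values
a, b = math.log2(3), math.log2(5)
product = 2 ** log_mul(a, b)   # close to 15, exact up to float rounding
total   = 2 ** log_add(a, b)   # close to 8, within interpolation error
```

Multiplication is exact (it is just an add); addition carries a small interpolation error that shrinks as the table grows, which is where the tuning between table size and CPE count would happen.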
Any other suggestions? I am particularly curious about how one would use 8 bit LUT trees.
What about vector math in a 3D pipeline? I ported Pingo to the Agon, in C, of course. It ran okay, but with no real hardware help the frame rates were minimal. And that board only has 64 colors; GateMate has 4096. Much better possibilities.
... it would also be very interesting and super-cool to try @JulianKemmerer's PipelineC HLS with GateMate for some 3D animations. For that, CologneChip needs to provide proper timing models. @pu-cc may have updates on that front.
@PythonLinks how do you feel about stacking some logarithms into a GateMate-optimized test deck, written in either Verilog or VHDL?!
BTW, we too are curious about putting GateMate's 8-input LUT trees to good use. As @tarik-ibrahimovic noted, Yosys might be in the way of getting the most out of them...
RISC-V Stress Tests
From my work on very small soft-core stack machines, I learned that RISC-V is grossly inefficient on 4-input LUT devices. You can watch the video here (starts at 2:40): https://www.youtube.com/embed/ZokT2tEiXSA
To multiplex 32 bits on a 4-input LUT device requires 31 4-LUTs. To multiplex 32 x 32-bit registers requires 992 (32 x 31) 4-LUTs, and doing it for 2 arguments requires 1,984. In contrast, on GateMate multiplexing 32 bits would be 11 CPEs, or 352 CPEs for the 32 x 32 registers. If they can get the 8:1 mux to work, that would fall to 5 CPEs per bit, or 160 total. Of course, to that one has to add the cost of the 32-bit multiplies; I estimate 160 CPEs. I could calculate it exactly. In the video I did calculate it for 8x8 multiplies. But overall it is a fraction, about 1/4, of what is required for RISC-V on 4-LUTs. Plus this device is less expensive on Digikey.
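These counts can be reproduced with a small script. The mux-tree model below (one LUT-4 per 2:1 mux, one CPE per 4:1 or 8:1 mux) is my own simplification; helper names are mine:

```python
def luts4_per_mux(n):
    # an n:1 mux built from 2:1 muxes (one LUT-4 each) needs n - 1 of them
    return n - 1

def cpes_per_mux(n, radix):
    # an n:1 mux built as a tree of radix:1 muxes, one CPE per mux
    count = 0
    while n > 1:
        n = -(-n // radix)   # ceil division: muxes at this tree level
        count += n
    return count

assert luts4_per_mux(32) == 31       # 32:1 mux on a LUT-4 device
assert cpes_per_mux(32, 4) == 11     # same mux from MUX-4 CPEs
assert cpes_per_mux(32, 8) == 5      # from 8:1 muxes, if they can be used
```

With this model, a 32 x 32-bit register read port comes to 32 x 31 = 992 LUT-4s versus 32 x 11 = 352 CPEs.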
Instead, what many RISC-V designs do is store the registers in BRAMs, but that requires more clock cycles to access, or a larger pipelined design. In either case RISC-V is a killer app for GateMate.
I actually built an "Awesome RISC-V Soft Cores" page. The next step is to figure out which cores use proper registers and which use BRAMs for registers; one person said that many have a switch to choose. That will make great marketing material for GateMate: "Awesome RISC-V Soft Cores is a directory of the best soft-core RISC-V machines." And when reading it, people would find out that they all run best on GateMate. Based on the Digikey cost of the part and the percentage occupied by the RISC-V, one could give the cost of the different soft cores on the different chips. A very powerful marketing message.
"... how do you feel about stacking some logarithms into a GateMate-optimized test deck, written in Verilog..."
Generally that is what I will be doing. It is where my strength lies. (Math on FPGAs). But I think it is a terrible market niche. Almost nobody thinks in terms of logarithms nowadays. (I used to even have log and log-log graph paper in my filing cabinet). There is no money to be made there.
Where is the money?
What I should be doing is porting RISC-V soft cores to GateMate, and comparing the cost of each soft core (percentage of resources * chip cost) on GateMate versus on other devices. There is a real business case to be made for GateMate. Now that Trump is ditching Ukraine, the risk of him ditching Taiwan is far greater. There is money to be made selling EU sourced FPGAs.
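The cost metric described here, percentage of resources times chip cost, is easy to pin down. The function name and the figures in the example are placeholders for illustration, not real utilization data or quotes:

```python
def core_cost(device_price, luts_used, luts_total):
    """Effective cost of a soft core: the fraction of the device it
    occupies times the device's unit price (e.g. Digikey 1K pricing)."""
    return device_price * (luts_used / luts_total)

# Placeholder figures, for illustration only (not real utilization data):
# a hypothetical core using 2,000 of 20,480 logic elements on a $13.75 part
example = core_cost(device_price=13.75, luts_used=2000, luts_total=20480)
print(f"${example:.2f} of silicon per core instance")
```

The same function applied across GateMate, Lattice and Gowin parts would produce exactly the comparison column proposed above.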
But sadly no one wants to hire me to do this either. Back to Gaussian logarithms.
"... EU sourced FPGA..."
is made in a UAE-owned fab, and is also eyeing TSMC 😉
"... a fraction, about 1/4, of what is required for RISC-V on 4-LUTs..."
@tarik-ibrahimovic tests have yielded substantially different results:
By all means, please go for the challenge to prove Tarik wrong. The more eyes, brains and perspectives on the problem, the better the final conclusion.
"... I could calculate it exactly..."
Please, don't calculate. Just do it (Nike style). Push it through the tools, then compare the numbers that the tools have calculated.
"... ditching Ukraine, the risk of him ditching Taiwan is far greater..."
The USA seems to be ditching the EU as a whole, so that it can focus on the Far East (that Taiwan is a rather small part of).
I do not think that CoreScore is a good measure. SERV is a bit-serial processor, and it is not at all clear to me how it is implemented. I suspect that it initially targeted iCE40 LUT-4 devices. Reportedly it takes 32 or 64 clock cycles to execute one instruction, plenty of time to load registers from BRAMs. What is needed is a regular RISC-V that tries to multiplex 32 x 32-bit registers.
Checking "Awesome GateMate", there are three RISC-Vs on GateMate. When I started looking into these issues, on Dec 28th, 2024, one of the two FemtoRV authors said that "Without pipelines given the memory model, two clock cycles per instruction is possible when using quark-bicycle." Looking closer, it has a full register set:
reg [31:0] registerFile [31:0];
So that would be a good comparison. Probably not the one that runs on GateMate now.
The second RISC-V on GateMate is the NEORV32. From the documentation
The data register file contains the general purpose architecture registers x0 to x31. For the rv32e ISA only the lower 16 registers are implemented.
A web search says:
"RV32E is a reduced version of RV32I for embedded systems, with 16 integer registers and soft-float calling convention. It uses the same instruction-set encoding as RV32I and can be combined with standard extensions."
Here are the notes on porting NEORV32 to GateMate:
https://github.com/stnolting/neorv32/discussions/983
and
https://github.com/stnolting/neorv32-setups/tree/main/cologne_chip/GateMateA1-EVB
Sadly, no resource utilizations are published. If it is indeed the I version, and not the E version, that would be another good comparison.
There is also the eduBOS5, but sadly that page does not quote the resources for GateMate, and all the quoted examples store the register file in SSRAM. But there is the option to store the register file in LUT RAM, so that might make a good third comparison. Really, all three comparisons should be done.
Better yet do a 64 bit RISC-V.
If we want to showcase the 4-input multiplexers, we need to use examples that are actually based on lots of multiplexers.
"... The USA seems to be ditching the EU as a whole, so that it can focus on the Far East (that Taiwan is a rather small part of)..."
Do we believe anything that the White House says? I will remind you that Musk has significant factories in China, and is thus subject to pressure from them. IIRC Trump also has business interests in China. For security, Taiwan needs to give him a nice beachfront casino! One that is not transferable!
... good point about the need to expand the portfolio of RISC-V cores used for GateMate Stress Testing👍
Towards that, Tarik is trying to bring up Veerwolf on GateMate, thus far unsuccessfully 🤒
@stnolting has also reported a number of problems with NEORV32 on GateMate. Here is one related to the 8-input muxes.
There are a bunch of other issues. The entire text is a good read.
While Tarik shall add femtorv32_quark_bicycle and eduBOS5 metrics to his GateMate test suite, it would also be very interesting to hear @BrunoLevy, @trabucayre, @matthias thoughts on this FPGA architecture benchmarking effort, including about where to take it in the next step.
Tarik shall add femtorv32_quark_bicycle and eduBOS5 metrics to his GateMate test suite.
Great. Thanks for listening. From a marketing perspective, what is needed is to also get the statistics for those soft cores on the $60 ECP5 MuseLabs IceSugar Pro, and the $38 Pico Ice with ICE40 UP5K.
This data will be a big help for my talks about the advantages of stack machines vs RISC-V soft cores on FPGAs with LUT4s and no MUXs.
Board pricing is not a reliable indicator of the cost of the FPGA device that a particular board carries. FPGAs are generally categorized by their LUT count, so the LUT count is the primary entry point into all comparisons.
We can also try to attach the Cost Factor, but only as the last/informal column, and with caveats.
To that end, any recommendations from your side on the source of cost info? Should we use Digikey pricing for 1K quantities??
I agree that board pricing is not the best indicator.
Digikey pricing for 1000 parts seems like a more reasonable metric.
The larger point is not just to port those soft cores to GateMate, but also to price-similar FPGAs.
While Tarik shall add femtorv32_quark_bicycle and eduBOS5 metrics to his GateMate test suite, it would also be very interesting to hear @BrunoLevy, @trabucayre, @matthias thoughts on this FPGA architecture benchmarking effort, including about where to take it in the next step.
The FemtoRV-Matthias here :-) Huh, the GateMate FPGAs look nice, but these are quite large FPGAs. The FemtoRV32-Quark (and Quark-Bicycle), implementing RV32I only, were designed to run in tiny Lattice HX1K FPGAs. I think for the quite large GateMate FPGAs, FemtoRV32-Gracilis, implementing RV32IMC with interrupts, is a better match. I have, however, no idea about FPGA benchmarks.
By the way, here is a small RISC-V demo project featuring the FemtoRV32-Quark: https://codeberg.org/Mecrisp/Nandland-RISC-V The Quark-Bicycle is a drop-in replacement if you like.
Here is a larger port of the RISC-V playground for Lattice UP5K, featuring FemtoRV32-Gracilis:
https://badge.team/docs/badges/mch2022/software-development/risc-v/ https://github.com/badgeteam/mch2022-firmware-ice40/tree/master/projects/RISCV-Playground
Maybe also some of the other examples for the MCH2022 badge could be interesting.
... this discussion seems to be converging on a configurable CoreScore benchmark, one that would allow plug-and-play drop-in replacement of RISC-V cores within that multi-core SoC.
That is, in addition to the incumbent SERV, the testing would encompass additional cores, such as the ones that @Mecrisp and @PythonLinks have brought up.
I changed the title from "Documentation" to "Multiplexer Stress Test". I want to keep everyone's eyes firmly on the advantages of the GateMate multiplexers over LUT-4 multiplexers.
Bicycle
The FemtoRV Bicycle has that full multiplexer, plus another large multiplexer for the 32-bit shifter, and "is quite small", meaning that it does not have much else. So it is a perfect demo of the strength of the GateMate multiplexers. My only issue with Bicycle is that it only has 1 shift multiplexer, not 2.
Use the same shifter both for left and right shifts by applying bit reversal
https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_quark_bicycle.v#L128
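The trick in the linked line, one right-shifter serving both directions, can be modeled in a few lines of Python. This is my behavioral sketch of the idea (logical shifts only, no sign handling), not the actual Verilog:

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1

def bit_reverse(x):
    # reverse the bit order of a WIDTH-bit word
    return int(format(x & MASK, f"0{WIDTH}b")[::-1], 2)

def shift(x, amount, left):
    # a single right-shift datapath; left shifts go through
    # bit reversal before and after (two muxes, no second shifter)
    if left:
        x = bit_reverse(x)
    x = (x >> amount) & MASK
    if left:
        x = bit_reverse(x)
    return x

assert shift(0x0000_0001, 3, left=True) == 0x0000_0008   # behaves like <<
assert shift(0x0000_00F0, 4, left=False) == 0x0000_000F  # behaves like >>
```

In hardware the bit reversal is free wiring, so the cost of supporting both directions is just the two selection muxes, which is exactly why this design is mux-heavy.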
SERV
As I suspected, SERV is not a good stress test for multiplexers. From the documentation it is clear that it stores the register file in RAM.
VeerWolf
VeerWolf is probably also not a good stress test / marketing pitch. It is a big CPU with lots of stuff besides the register and shifter multiplexers.
Gracilis
I suspect that the same is true for the Gracilis core recommended by @Mecrisp, although it probably has two shifters. My goal here is not to find the best RISC-V for GateMate, nor to show what is wrong with GateMate, but to show what is right with it: to show which low-cost FPGA is best when an application has a high percentage of multiplexers.
And do not be shy about putting price front and center as the first column! I hate to say it, but real world customers put price front and center. They want that comparison. Technical details, well they leave that to people like us who focus on technical details. Still we need to keep in mind what those managers have in their minds.
Hello, thank you for the very interesting discussion! If I understand correctly, the goal of the discussion is to find a reasonably simple RISC-V core with a large number of muxes to demonstrate GateMate? Here are a couple of ideas:
- Maybe we could start from the Bicycle (that has a single shifter) and add a right shifter to it (but I'm unsure this would add a mux the way you want to demonstrate, since Bicycle and the other FemtoRVs already have muxes to select between the direct and bit-reversed outputs of the shifter, and I have only a partial understanding of how this maps to muxes...).
- Another (more complicated) possibility is a pipelined core with register forwarding (see figure). This involves two three-way muxes for selecting the inputs of the execute stage. An example is here, with associated design notes. My "draft" is not as clean and self-contained as the Femto series though (and lacks I$ and D$ caches). If GateMate has four-way muxes, then we could imagine a 6-stage pipeline (for instance splitting the memory stage in two, to align and sign-extend the read values). It is also possible to create a bypass for accelerating memcpy (which involves an additional mux). If you have even wider muxes, it may be possible to create a larger number of smaller stages for a higher Fmax (but then you need efficient branch prediction).
- One more point: if the goal is to create a tech demo, then I think it is interesting to have an interrupts-and-traps mechanism (like in FemtoRV-Gracilis), since it makes it possible to have a very small core and emulate instruction sets in traps. This can be used to run Linux-noMMU (not very interesting in itself, but it can have a symbolic/emotional echo in the user base). Gracilis may need a couple of (simple) updates to do that (and I am very unsure that running Linux is the way to go, except for the emotional target...).
- Another idea: starting from a Gracilis base, a minimal RV32F core (with a floating-point register bank), implementation of MAC (multiply-add) in hardware, and software traps for all other RV32F instructions. It would both add a number of muxes and create an interesting tech demo. We could also imagine adding support for the POSIT floating-point format (which has some similarities with the log representation mentioned in this discussion).
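The two three-way muxes for the execute-stage inputs mentioned above can be modeled behaviorally. The function below is my own sketch of a standard forwarding select, not code from the linked design:

```python
def forward(rs, regfile_val, ex_rd, ex_val, mem_rd, mem_val):
    """Three-way operand select for a pipelined core with forwarding:
    prefer the newest in-flight result, fall back to the register file.
    x0 is hardwired to zero and never forwarded."""
    if rs != 0 and rs == ex_rd:
        return ex_val       # result still in the execute stage
    if rs != 0 and rs == mem_rd:
        return mem_val      # result in the memory stage
    return regfile_val      # committed value from the register file

assert forward(5, 111, 5, 222, 5, 333) == 222   # EX result wins
assert forward(5, 111, 7, 222, 5, 333) == 333   # then MEM
assert forward(5, 111, 7, 222, 9, 333) == 111   # else register file
assert forward(0, 0, 0, 999, 0, 888) == 0       # x0 never forwarded
```

One such mux per source operand, so a classic 5-stage pipeline carries two of them, which is why forwarding makes a good multiplexer stress case.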
... while it is a bit funny to have to work this hard to find a design that CologneChip FPGA would excel in:
- knowing that GateMate is severely lacking on the LUTRAM front
- and looking to offset this shortcoming through muxing and math functions
a minimal RV32F core seems to be a good option.
- How long would it take to put one together to benchmark GateMate against GowinSemi, Xilinx and Lattice?!
I have a basic microprogrammed RV32F unit here, see also design notes. It would be mostly a matter of:
- removing FDIV and FSQRT from it (they eat up a lot of resources and could be emulated in a trap)
- grafting it on a gracilis
- making sure Gracilis generates traps for the missing instructions. If @Mecrisp and I can find some time to work on this, I'd say it is a couple of weeks full time, but since we do that in our free time, I'd say something around 3 months. It also depends on how easy it is to talk to CologneChip FPGAs (if it speaks Yosys then it is easy) -> I've seen in the doc that it uses Yosys (+ an in-house PnR) and OpenFPGALoader, so it seems we'll feel right "at home", great!
Note 1: this is a quite inefficient multicycle FPU. It would also be possible to create a pipelined version, but this would require much more work / more time (I have been wanting to do this for more than a year without finding the time...)
Note 2: there is also the Zfinx instruction set, which uses a shared register file for FP and integer instructions and may be easier to implement. @Mecrisp, what do you think?
... the good news is that we can also count on the direct help from Cologne Chip engineering staff, represented on this discussion thread by @pu-cc
I've taken a quick look at the Cologne Chip whitepaper; it seems its CPE also has an interesting modular multiplier function block. If this one is fast, we could imagine creating an RV32D (double-precision) multiply-adder.
Other ideas for a technical demo: if it is easy to create a large number of multipliers, having a GPU-like design would be interesting, either computing pixels in parallel and generating an image, or doing some AI/inference (and the low power consumption of the Cologne Chip could be an interesting argument).
We seem to have at this point assembled a stack of ideas for the benchmarking tangents that would stack the deck for GateMate👍. It shall be both interesting and revealing to stack up:
--1--
- GateMate math resources built into their unique LUT-Tree structure
- to the DSP hard macros + generic LUTs found in the classic FPGA devices
--2--
- GateMate 8-input logic/muxing capability
- to the generic LUT with MUXF7 components routinely found in the mainstream architectures
In the process, we can also compare GateMate power consumption, Fmax and pricing structure to the others.
@pu-cc, if you have any further guidance on this attack plan from the Cologne Chip insider perspective, this is the time to chime in.
Dumped some notes here, @Mecrisp do not hesitate to tell us what you think !
I am thinking about what could make a nice demo... Do we have analogs on the upcoming board? If we have ADCs/DACs, the pipelined FFT by Dan Gisselquist would be a nice benchmark:
https://zipcpu.com/dsp/2018/10/02/fft.html https://github.com/ZipCPU/dblclockfft
Olimex is selling 100 MHz ADCs/DACs addons for their FPGA boards, and I would love to have BNC or SMA connectors and fast analogs on board as-is. Complement this with a few jumpers to select voltage ranges. With two DAC channels, we can also do vector graphics, which is very cool.
It probably depends on whether the board shall be a vanilla FPGA eval board, more of an artistic badge with cool but maybe odd hardware features like the MCH2022, or a lab instrument in the footsteps of a Red Pitaya or (lower-spec) ThunderScope. Another possibility would be to look at the ideas of the Apertus Labs AXIOM Micro, and create an FPGA eval board that is also a fancy and fast black-and-white webcam, ready to be used for infrared and ultraviolet (astro-)photography, with user-selectable readout algorithms and triggers.
Implementing an improved floating point unit for FemtoRV is a large project indeed. I am not sure which niche we want to grow into; this is nicely catered for by VexRiscv, which already comes with single and double floating point implementations out of the box, optimised for performance. Bruno, if we choose to do it, I would go for Zfinx as you suggest, but my favourite instructions to implement next would be the Bitmanip ones. The special sauce we got known for is having small, well-readable and nicely understandable CPU cores, and we would be expected to find a very elegant way to implement new features.
I tried to fire up the FemtoRV Quark and then Bicycle on Mac OS X, and ran into the following issue with Quark: basically, it does not work with the newest versions of Yosys. There is a pull request which fixes the problem for the ICESTICK, but I expect that Bicycle will not fit on the ICESTICK. It only has ~1260 LUTs.
https://github.com/BrunoLevy/learn-fpga/issues/125
I am curious about the savings when using multiplexers over 4-input LUTs, but am not sufficiently motivated to debug the problems. Really it is an issue for Cologne Chip to resolve, if they want to demonstrate the advantage of their FPGA over the competing Lattice devices.
I will try it again in about a month. Maybe the problems will be fixed by then.
"... it is an issue for Cologne Chips to resolve, if they want to demonstrate the advantage of their FPGA over the competing Lattice devices..."
@PythonLinks, could you please open another ticket here for that issue, so that we can keep track of it.
Tuning in with some numbers to this discussion. Benchmarking Gowin GW2AR-18C C8/I7 and CCGM1A1 with @aimamovic6's Sigma-Delta DAC yielded detailed insights which you can check out in depth here.
The takeaway
Purely in terms of logic capability, LUT-trees are on par with LUTs when it comes to complex arithmetic (-28% logic elements used for the same design), but the mentioned lack of DSP hard macros really hurts the end results (+37% logic elements used for the same design). However, when normalizing the number of logic elements by the number of configuration bits, it seems that the total area on the chip is still larger in the CCGM1A1 than in the traditional LUT4 GW2AR-18C, with or without DSP HMs.
That is a great technical analysis, but a very unfair market comparison.
You are comparing a $13.75 GateMate (quantity 119) with a €69.10 Gowin FPGA (quantity 168).
It would be much fairer to compare it with the $7.95 Lattice iCE40 UP5K (quantity 100).
It would be great if someone figured out what the competing FPGAs are, and then I could write my piece about the advantages of the GateMate compared to those other FPGAs.
For example, the Lattice iCE40 UP5K has a similar amount of RAM, but most of it is single-port, whereas the GateMate RAM is dual-port.
Christopher Lozinski