clash-compiler icon indicating copy to clipboard operation
clash-compiler copied to clipboard

BRAM initializers use loops in initial block, triggering "maximum iteration" counters in synthesizers

Open gergoerdi opened this issue 5 years ago • 16 comments

The simplest possible block RAM, one that is initialized to all zeroes:

import Clash.Prelude

type Addr = Unsigned 13
type Value = Unsigned 8

topEntity
    :: Clock System Source
    -> Reset System Asynchronous
    -> Signal System Addr
    -> Signal System (Maybe (Addr, Value))
    -> Signal System Value
topEntity = exposeClockReset $ \addr write ->
  blockRam (pure 0x00 :: Vec 0x2000 Value) addr write

From this, CLaSH 9a466bd9c08b8ff8fc215bfbca318873e6d57e4a generates the following Verilog:

/* AUTOMATICALLY GENERATED VERILOG-2001 SOURCE CODE.
** GENERATED BY CLASH 0.99. DO NOT MODIFY.
*/
module Main_topEntity
    ( // Inputs
      input  clk // clock
    , input  rst // asynchronous reset: active high
    , input [12:0] addr
    , input [21:0] write

      // Outputs
    , output reg [7:0] result
    );
  wire  c$app_arg;
  wire [20:0] tup;
  wire signed [63:0] wild;
  wire signed [63:0] wild_0;
  wire [12:0] x;
  wire [20:0] x_0;

  // blockRam begin
  reg [7:0] result_RAM [0:8192-1];

  reg [65535:0] ram_init;
  integer i;
  initial begin
    ram_init = {8'd0, 8'd0, 8'd0, ... };
    for (i=0; i < 8192; i = i + 1) begin
      result_RAM[8192-1-i] = ram_init[i*8+:8];
    end
  end

  always @(posedge clk) begin : result_blockRam
    if (c$app_arg) begin
      result_RAM[(wild)] <= tup[7:0];
    end
    result <= result_RAM[(wild_0)];
  end
  // blockRam end

  assign c$app_arg = write[21:21] ? 1'b1 : 1'b0;

  assign tup = write[21:21] ? x_0 : ({21 {1'bx}});

  assign wild = $signed(($signed({{(64-13) {1'b0}},x})));

  assign wild_0 = $signed(($signed({{(64-13) {1'b0}},addr})));

  assign x = tup[20:8];

  assign x_0 = write[20:0];
endmodule

Two problems with this:

  1. Surely ram_init is superfluous if the RAM contents were initialized by pure. And isn't that the most common case? Would a non-pure initializer be even synthesizable on an FPGA?

  2. FPGA synthesis tools trip up on the looping initalizer. Intel Quartus gives:

Error (10106): Verilog HDL Loop error at SpaceInvaders.v(26447): loop must terminate within 5000 iterations

Xilinx ISE gives a similar error (the max iteration count is different but similarly limited).

gergoerdi avatar Jun 15 '19 10:06 gergoerdi

Surely ram_init is superfluous if the RAM contents were initialized by pure.

Probably. We'd need to be cleverer during blackbox generation.

And isn't that the most common case?

Yes! Related: https://github.com/clash-lang/clash-compiler/issues/621.

Would a non-pure initializer be even synthesizable on an FPGA?

It is. The tricky part is resetting it, which does need to happen word-by-word.

FPGA synthesis tools trip up on the looping initalizer. Intel Quartus gives:

We probably need to initialize using blockram files.


In conclusion, I think we need to rethink/restructure our blockram code. I believe we need to offer something like the following API. Normal blockram will not reset ram on active reset, and will retain current API:

blockRam
  :: [..]
  => Vec n a
  -> Signal domain addr
  -> Signal domain (Maybe (addr, a))
  -> Signal domain a

Two extra functions will be added:

data ResetStrategy (r :: 'Bool) where
  ResetBlockRam :: ResetStrategy 'True
  NoResetBlockRam :: ResetStrategy 'False

blockRamInitF
  :: [..]
  => (SNat n -> a)
  -> ResetStrategy r
  -> Signal domain addr
  -> Signal domain (Maybe (addr, a))
  -> Signal domain a


blockRamInit1
  :: [..]
  => ResetStrategy r
  -> a
  -> Signal domain addr
  -> Signal domain (Maybe (addr, a))
  -> Signal domain a
blockRamInit1 c = 
  blockRamIinitF (const c)

martijnbastiaan avatar Jun 16 '19 09:06 martijnbastiaan

Is there some short-term workaround I can apply to remove these initializers completely for now? I am currently removing the initializer blocks manually after CLaSH generates them, and the resulting Verilog works on real hardware (I am not depending on the initial value of 0x00 at all).

gergoerdi avatar Jun 17 '19 03:06 gergoerdi

@gergoerdi work-around is to use blockRamFile, using pack/unpack to do the BitVector conversion, and TemplateHaskell + compiletime file I/O to fill the vector with the desired values (see also: http://hackage.haskell.org/package/clash-prelude-0.99.3/docs/Clash-Sized-Fixed.html#creatingdatafiles scroll down for the TH section)

christiaanb avatar Jun 17 '19 05:06 christiaanb

OK I've been making progress writing the following horrible, horrible code:

{-# LANGUAGE TemplateHaskell, QuasiQuotes #-}
module Cactus.Clash.TH.InitRAM where

import Clash.Prelude hiding (Exp)
import Language.Haskell.TH
import Data.List as L
import System.FilePath
import Control.Monad.IO.Class

writeRAMImage :: Int -> Int -> FilePath -> IO ()
writeRAMImage n width fn = writeFile fn $ unlines bvs
  where
    bvs = L.replicate n $ L.replicate width '0'

createRAMImage :: Int -> Int -> Q Exp
createRAMImage n width = do
    liftIO $ mapM_ (writeRAMImage n width)
      [ "_build" </> fn
      , "_build/verilog/SpaceInvaders/SpaceInvaders" </> fn
      ]
    litE $ stringL fn
  where
    fn = "null-" <> show n <> "x" <> show width <.> "hex"

blockRam_
  :: Int -> Int -> Q Exp
blockRam_ n w =
    [e|
     let ram :: forall addr n value. (Enum addr, BitPack value, KnownNat (BitSize value))
             => forall domain gated. (HiddenClock domain gated)
             => Signal domain addr
             -> Signal domain (Maybe (addr, value))
             -> Signal domain value
         ram r w = unpack <$> blockRamFile $sizeNat $img r (fmap (fmap pack) <$> w)
     in ram
    |]
  where
    sizeNat = appTypeE (conE 'SNat) (litT . numTyLit . fromIntegral $ n)
    img = createRAMImage n w

but now it seems CLaSH has an extra trick up its sleeve to make my life harder: although this generates (deliberately!) RAM image file names that only depend on the size and width, the generated $readmemb macro invocations are made unique, i.e. I have

    $readmemb("null-7168x8.hex",RAM_0);

at one point, and then

    $readmemb("null-7168x8.hex",RAM_1);

at the next, which of course points to a non-existent file.

gergoerdi avatar Jun 22 '19 12:06 gergoerdi

On the plus side, using this blockRam_ speeds up CLaSH considerably: my Space Invaders implementation now synthesizes in 1m30s instead of 3m50s! So hopefully blockRamInit1 will be a huge improvement overall.

gergoerdi avatar Jun 22 '19 12:06 gergoerdi

@gergoerdi I've added blockRam1 in https://github.com/clash-lang/clash-compiler/pull/645. Hopefully it'll be merged soon.

And btw, I saw your Space Invaders demo on Reddit. Wicked cool!

martijnbastiaan avatar Jun 24 '19 08:06 martijnbastiaan

Using blockRam1 doesn't quite solve this issue.

While true that it omits the large literal array for the initial values, it still produces a loop that looks like this:

  // blockRam1 begin
  reg [7:0] result_25_RAM [0:7168-1];
  integer i_0;
  initial begin
      for (i_0=0;i_0<7168;i_0=i_0+1) begin
          result_25_RAM[i_0] = 8'd0;
      end
  end

and this still results in

ERROR:HDLCompiler:569 - "_build/verilog/SpaceInvaders/SpaceInvaders/SpaceInvaders.v" 
Line 9589: Loop count limit exceeded. Condition is never false.

gergoerdi avatar Jun 25 '19 13:06 gergoerdi

Ah darn, you're right. It doesn't produce a loop in SystemVerilog and VHDL anymore, but it still does for Verilog. The real solution is to have the blackbox implementation generate a memory file and include it in the final design.

While "fixing" this bug I did find some resources mentioning that you can simply increase the maximum loop counter. At least for some synthesis tools. What are you using?

martijnbastiaan avatar Jun 25 '19 13:06 martijnbastiaan

Currently I am using Xilinx ISE 14.7 but the less customization I have to do to it the better.

gergoerdi avatar Jun 25 '19 13:06 gergoerdi

Yeah it's annoying..

https://forums.xilinx.com/t5/Synthesis/Loop-count-limit-exceeded-Condition-is-never-false/m-p/591941/highlight/true#M14436

martijnbastiaan avatar Jun 25 '19 13:06 martijnbastiaan

So are you going to have a loop-less blockram implementation for Verilog target? I am using Verilog because I remember seeing a CLaSH doc page saying VHDL is not supported for memory for Altera; and SystemVerilog I don't even know how to use it to blink an LED.

gergoerdi avatar Jun 26 '19 10:06 gergoerdi

@gergoerdi the only thing not supported is blockRamFile when you use the Quartus+VHDL combo. The following all work: Quartus + (System)Verilog, Vivado/ISE + VHDL, Vivado/ISE + (System)Verilog. blockRam works on all combinations.

Note that this is a Quartus "limitation" in that it doesn't want to do file I/O at synthesis time for VHDL. If you want file-based instantiation for BlockRAMs on Quartus you have to instantiate their altsyncram primitive.

christiaanb avatar Jun 26 '19 10:06 christiaanb

So are you going to have a loop-less blockram implementation for Verilog target?

Either that or multiple loops (if that's supported at all). My original plan was to have a single, general blockram primitive taking a function Index n -> a, which would be completely evaluated by the primitive implementation, written to a file, and used by every HDLs equivalent of Verilog's readmemh. However, I've just learned - like @christiaanb mentioned - that this is impossible for Quartus+VHDL. What we could do is add a specific Quartus codepath that generates an altsyncram instead though.

I won't have time to fix this for it for at least another two weeks.. :disappointed:

martijnbastiaan avatar Jun 26 '19 11:06 martijnbastiaan

JFTR I've finally checked, and Xilinx ISE doesn't support SystemVerilog so I can't use that for Spartan-6 based projects.

gergoerdi avatar Jul 03 '19 12:07 gergoerdi

Would it be possible as a temporary workaround to have a blockRam0 which does no initialization at all?

gergoerdi avatar Aug 05 '19 09:08 gergoerdi

Bumping to 1.1 as there are work-arounds in the respective synthesis tools:

  • ISE has a set -loop_iteration_limit XYZ which you can add to the .xst file
  • Quartus you can add set_global_assignment -name VERILOG_NON_CONSTANT_LOOP_LIMIT XYZ to the .qsf file.

An initial prototype attempting to write the constants to file in a format that either the $readmemb or $readmemh system tasks understand proved more involved than initially expected.

christiaanb avatar Aug 18 '19 13:08 christiaanb