clash-compiler
clash-compiler copied to clipboard
BRAM initializers use loops in initial block, triggering "maximum iteration" counters in synthesizers
The simplest possible block RAM, one that is initialized to all zeroes:
import Clash.Prelude
type Addr = Unsigned 13
type Value = Unsigned 8
topEntity
:: Clock System Source
-> Reset System Asynchronous
-> Signal System Addr
-> Signal System (Maybe (Addr, Value))
-> Signal System Value
topEntity = exposeClockReset $ \addr write ->
blockRam (pure 0x00 :: Vec 0x2000 Value) addr write
From this, CLaSH 9a466bd9c08b8ff8fc215bfbca318873e6d57e4a generates the following Verilog:
/* AUTOMATICALLY GENERATED VERILOG-2001 SOURCE CODE.
** GENERATED BY CLASH 0.99. DO NOT MODIFY.
*/
module Main_topEntity
( // Inputs
input clk // clock
, input rst // asynchronous reset: active high
, input [12:0] addr
, input [21:0] write
// Outputs
, output reg [7:0] result
);
wire c$app_arg;
wire [20:0] tup;
wire signed [63:0] wild;
wire signed [63:0] wild_0;
wire [12:0] x;
wire [20:0] x_0;
// blockRam begin
reg [7:0] result_RAM [0:8192-1];
reg [65535:0] ram_init;
integer i;
initial begin
ram_init = {8'd0, 8'd0, 8'd0, ... };
for (i=0; i < 8192; i = i + 1) begin
result_RAM[8192-1-i] = ram_init[i*8+:8];
end
end
always @(posedge clk) begin : result_blockRam
if (c$app_arg) begin
result_RAM[(wild)] <= tup[7:0];
end
result <= result_RAM[(wild_0)];
end
// blockRam end
assign c$app_arg = write[21:21] ? 1'b1 : 1'b0;
assign tup = write[21:21] ? x_0 : ({21 {1'bx}});
assign wild = $signed(($signed({{(64-13) {1'b0}},x})));
assign wild_0 = $signed(($signed({{(64-13) {1'b0}},addr})));
assign x = tup[20:8];
assign x_0 = write[20:0];
endmodule
Two problems with this:
-
Surely
ram_init
is superfluous if the RAM contents were initialized bypure
. And isn't that the most common case? Would a non-pure
initializer be even synthesizable on an FPGA? -
FPGA synthesis tools trip up on the looping initalizer. Intel Quartus gives:
Error (10106): Verilog HDL Loop error at SpaceInvaders.v(26447): loop must terminate within 5000 iterations
Xilinx ISE gives a similar error (the max iteration count is different but similarly limited).
Surely ram_init is superfluous if the RAM contents were initialized by pure.
Probably. We'd need to be cleverer during blackbox generation.
And isn't that the most common case?
Yes! Related: https://github.com/clash-lang/clash-compiler/issues/621.
Would a non-pure initializer be even synthesizable on an FPGA?
It is. The tricky part is resetting it, which does need to happen word-by-word.
FPGA synthesis tools trip up on the looping initalizer. Intel Quartus gives:
We probably need to initialize using blockram files.
In conclusion, I think we need to rethink/restructure our blockram code. I believe we need to offer something like the following API. Normal blockram will not reset ram on active reset, and will retain current API:
blockRam
:: [..]
=> Vec n a
-> Signal domain addr
-> Signal domain (Maybe (addr, a))
-> Signal domain a
Two extra functions will be added:
data ResetStrategy (r :: 'Bool) where
ResetBlockRam :: ResetStrategy 'True
NoResetBlockRam :: ResetStrategy 'False
blockRamInitF
:: [..]
=> (SNat n -> a)
-> ResetStrategy r
-> Signal domain addr
-> Signal domain (Maybe (addr, a))
-> Signal domain a
blockRamInit1
:: [..]
=> ResetStrategy r
-> a
-> Signal domain addr
-> Signal domain (Maybe (addr, a))
-> Signal domain a
blockRamInit1 c =
blockRamIinitF (const c)
Is there some short-term workaround I can apply to remove these initializers completely for now? I am currently removing the initializer blocks manually after CLaSH generates them, and the resulting Verilog works on real hardware (I am not depending on the initial value of 0x00 at all).
@gergoerdi work-around is to use blockRamFile
, using pack
/unpack
to do the BitVector conversion, and TemplateHaskell + compiletime file I/O to fill the vector with the desired values (see also: http://hackage.haskell.org/package/clash-prelude-0.99.3/docs/Clash-Sized-Fixed.html#creatingdatafiles scroll down for the TH section)
OK I've been making progress writing the following horrible, horrible code:
{-# LANGUAGE TemplateHaskell, QuasiQuotes #-}
module Cactus.Clash.TH.InitRAM where
import Clash.Prelude hiding (Exp)
import Language.Haskell.TH
import Data.List as L
import System.FilePath
import Control.Monad.IO.Class
writeRAMImage :: Int -> Int -> FilePath -> IO ()
writeRAMImage n width fn = writeFile fn $ unlines bvs
where
bvs = L.replicate n $ L.replicate width '0'
createRAMImage :: Int -> Int -> Q Exp
createRAMImage n width = do
liftIO $ mapM_ (writeRAMImage n width)
[ "_build" </> fn
, "_build/verilog/SpaceInvaders/SpaceInvaders" </> fn
]
litE $ stringL fn
where
fn = "null-" <> show n <> "x" <> show width <.> "hex"
blockRam_
:: Int -> Int -> Q Exp
blockRam_ n w =
[e|
let ram :: forall addr n value. (Enum addr, BitPack value, KnownNat (BitSize value))
=> forall domain gated. (HiddenClock domain gated)
=> Signal domain addr
-> Signal domain (Maybe (addr, value))
-> Signal domain value
ram r w = unpack <$> blockRamFile $sizeNat $img r (fmap (fmap pack) <$> w)
in ram
|]
where
sizeNat = appTypeE (conE 'SNat) (litT . numTyLit . fromIntegral $ n)
img = createRAMImage n w
but now it seems CLaSH has an extra trick up its sleeve to make my life harder: although this generates (deliberately!) RAM image file names that only depend on the size and width, the generated $readmemb
macro invocations are made unique, i.e. I have
$readmemb("null-7168x8.hex",RAM_0);
at one point, and then
$readmemb("null-7168x8.hex",RAM_1);
at the next, which of course points to a non-existent file.
On the plus side, using this blockRam_
speeds up CLaSH considerably: my Space Invaders implementation now synthesizes in 1m30s instead of 3m50s! So hopefully blockRamInit1
will be a huge improvement overall.
@gergoerdi I've added blockRam1
in https://github.com/clash-lang/clash-compiler/pull/645. Hopefully it'll be merged soon.
And btw, I saw your Space Invaders demo on Reddit. Wicked cool!
Using blockRam1
doesn't quite solve this issue.
While true that it omits the large literal array for the initial values, it still produces a loop that looks like this:
// blockRam1 begin
reg [7:0] result_25_RAM [0:7168-1];
integer i_0;
initial begin
for (i_0=0;i_0<7168;i_0=i_0+1) begin
result_25_RAM[i_0] = 8'd0;
end
end
and this still results in
ERROR:HDLCompiler:569 - "_build/verilog/SpaceInvaders/SpaceInvaders/SpaceInvaders.v" Line 9589: Loop count limit exceeded. Condition is never false.
Ah darn, you're right. It doesn't produce a loop in SystemVerilog and VHDL anymore, but it still does for Verilog. The real solution is to have the blackbox implementation generate a memory file and include it in the final design.
While "fixing" this bug I did find some resources mentioning that you can simply increase the maximum loop counter. At least for some synthesis tools. What are you using?
Currently I am using Xilinx ISE 14.7 but the less customization I have to do to it the better.
Yeah it's annoying..
https://forums.xilinx.com/t5/Synthesis/Loop-count-limit-exceeded-Condition-is-never-false/m-p/591941/highlight/true#M14436
So are you going to have a loop-less blockram implementation for Verilog target? I am using Verilog because I remember seeing a CLaSH doc page saying VHDL is not supported for memory for Altera; and SystemVerilog I don't even know how to use it to blink an LED.
@gergoerdi the only thing not supported is blockRamFile
when you use the Quartus+VHDL combo. The following all work: Quartus + (System)Verilog, Vivado/ISE + VHDL, Vivado/ISE + (System)Verilog. blockRam
works on all combinations.
Note that this is a Quartus "limitation" in that it doesn't want to do file I/O at synthesis time for VHDL. If you want file-based instantiation for BlockRAMs on Quartus you have to instantiate their altsyncram
primitive.
So are you going to have a loop-less blockram implementation for Verilog target?
Either that or multiple loops (if that's supported at all). My original plan was to have a single, general blockram primitive taking a function Index n -> a
, which would be completely evaluated by the primitive implementation, written to a file, and used by every HDLs equivalent of Verilog's readmemh
. However, I've just learned - like @christiaanb mentioned - that this is impossible for Quartus+VHDL. What we could do is add a specific Quartus codepath that generates an altsyncram
instead though.
I won't have time to fix this for it for at least another two weeks.. :disappointed:
JFTR I've finally checked, and Xilinx ISE doesn't support SystemVerilog so I can't use that for Spartan-6 based projects.
Would it be possible as a temporary workaround to have a blockRam0
which does no initialization at all?
Bumping to 1.1 as there are work-arounds in the respective synthesis tools:
- ISE has a
set -loop_iteration_limit XYZ
which you can add to the.xst
file - Quartus you can add
set_global_assignment -name VERILOG_NON_CONSTANT_LOOP_LIMIT XYZ
to the.qsf
file.
An initial prototype attempting to write the constants to file in a format that either the $readmemb
or $readmemh
system tasks understand proved more involved than initially expected.