If you are interested in what went into writing this blog post, you can view a replay of the livestream here.
As I’ve been working on the logic design for
moss, I have been regularly investigating
translates the Verilog RTL (Register Transfer Level) source into Basic
Elements of Logic (BELs), a process known as synthesis. BELs represent the
physical components on an FPGA that can be used to implement a design. One of
the larger elements that you will find on most FPGAs is some type of Block RAM
(BRAM). BRAM can be useful when there is a large memory with attributes that
match the physical component, as well as instances where other resources are
being used for implementation of the rest of the design.
I was curious about when Vivado would infer BRAM and when it would not, so I put together a minimal design meant to exercise the synthesis process. The design consists of three modules.
top is the “top” module, which serves to take inputs on package pins for the
FPGA and connect them to child modules. Every design has a top module that
represents its entrypoint.
module top( input clk, input [1:0] sw, output [1:0] led ); wire [7:0] w_data; wire [7:0] r_data; wire [7:0] w_addr; wire [7:0] r_addr; counter counter(.clk(clk), .w_data(w_data), .w_addr(w_addr), .r_addr(r_addr)); maybe_bram maybe_bram(.clk(clk), .i_addr(w_addr), .i_data(w_data), .o_addr(r_addr), .write(sw), .read(sw), .o_read(r_data)); // Do something with the data read from bram so that it does not get // eliminated in synthesis. assign led = r_data > 100 ? 1 : 0; endmodule
counter module is only serving to manipulate the values of the signals
that we will be using in our RAM module. It increments each of the passed
signals on a positive clock edge (
posedge) until they reach
255, at which
point they are reset to
module counter( input clk, output reg [7:0] w_data, output reg [7:0] w_addr, output reg [7:0] r_addr ); always @(posedge clk) begin w_data <= w_data == 255 ? 0 : w_data + 1; w_addr <= w_addr == 255 ? 0 : w_addr + 1; r_addr <= r_addr == 255 ? 0 : r_addr + 1; end endmodule
Vivado also specifies attributes that can be placed in RTL to indicate that a signal should not be removed during optimization.
KEEP_HIERARCHYprovide variations of this functionality.
The final module,
maybe_bram, consists of a 256-register array with 8-bit
values and can be read from and written to.
module maybe_bram( input clk, input [7:0] i_addr, input [7:0] i_data, input [7:0] o_addr, input write, input read, output reg [7:0] o_read ); reg [7:0] ram [255:0]; always @(posedge clk) begin if (write == 1'b1) begin ram[i_addr] <= i_data; end if (read == 1'b1) begin o_read <= ram[o_addr]; end end endmodule
After implementation, which consists of taking the synthesized netlist and mapping the primitives to physical elements on the FPGA, we can see the specific BRAM, LUTs, and other elements used on the device.
Implementation is commonly referred to as “place and route”.
We are primarily interested in the
RAMB18E1 element, which is an 18K-bit
Configurable Synchronous Block RAM. The Xilinx documentation describes this
element as a good choice for FIFO queues, automatic error correction RAM, or
general purpose memory. The
B18 refers to the 18K-bit size, while it appears
E1 suffix refers to the “edition” of the primitive, though I was unable to
confirm that from any official sources.
Esuffix is appended to a variety of Xilinx primitives and does not appear to refer directly to any specific functionality. However, newer products seem to have similar primitives with a larger
Ex. For example, the Versal equivalent to the
RAMB18E5, which is still 18K-bit, but offers a variety of other features, such as Block RAM Cascading. If you are aware of a more exact definition of the
Esuffix feel free to reach out!
Will Green has put together a great guide on FPGA memory types over at Project F if you’re interested in all the variations that different vendors provide. Typically, BRAM and Distributed RAM are compared as options for implementing memories for logic designs on an FPGA.
Distributed RAM makes use of Look-Up Table (LUT) BELs, which are the primary
primitives used to implement custom logic in an FPGA. LUTs typically consist of
static random access memory
multiplexers, resulting in an
element that can be configured as any logic function with the specified number
of inputs (e.g. 6 inputs on a
LUT6 BEL). BRAM, on the other hand, only acts as
a memory on its own, making it less flexible than LUTs. For this reason,
synthesis tools will frequently opt to use BRAM elements for logic that
implements a memory in order to preserve LUTs for custom logic when possible.
However, there are cases in which the latter is more appropriate than the
former. The Xilinx documentation provides guidance for when to use BRAM or
which focuses on read and write behavior.
|Action||Distributed RAM||Block RAM|
From this information, we can see that there are a number of attributes of our design that would make it a strong candidate to use BRAM.
Reading and writing on the
maybe_bram module are clocked.
Link to heading
As shown in the table above, BRAM reads and writes are both synchronous, while
only distributed RAM writes are synchronous. In our
maybe_bram module, both
reads and writes occur on the positive edge of the clock (
clk), meaning they
are both synchronous. We can test this by converting the read operations to
asynchronous and re-running synthesis and implementation.
module maybe_bram( input clk, input [7:0] i_addr, input [7:0] i_data, input [7:0] o_addr, input write, input read, output [7:0] o_read ); reg [7:0] ram [255:0]; assign o_read = ram[o_addr]; // asynchronous always @(posedge clk) begin if (write == 1'b1) begin ram[i_addr] <= i_data; end end endmodule
Though only a portion of the schematic is shown in the image below, we can see
maybe_bram module is now being implemented using multiple
elements, which are actually made up of multiple
The absence of BRAM usage can also be seen on the device.
Our register array is relatively large. Link to heading
Another reason why BRAM may be selected, or not selected, is due to the size of
the memory. While there are certainly much larger memories that may be
instantiated in a design, ours is at least large enough that it could not be
implemented with a single LUT element. However, if we shrink the size of the
array down to a depth of two (i.e.
[1:0]), the initial synchronous read /
synchronous write design will infer distributed RAM using
RAM32X1D elements, which are made up of LUTs.
module maybe_bram( input clk, input [7:0] i_addr, input [7:0] i_data, input [7:0] o_addr, input write, input read, output reg [7:0] o_read ); reg [7:0] ram [1:0]; // depth = 2 always @(posedge clk) begin if (write == 1'b1) begin ram[i_addr] <= i_data; end if (read == 1'b1) begin o_read <= ram[o_addr]; end end endmodule
The implementation is also, expectedly, noticeably smaller than the asynchronous read / synchronous write example.
It is likely that inference behavior could vary based on the usage of elements by other modules in a larger design. Unlike the asynchronous read / synchronous write example, this smaller memory could still be implemented with BRAM. In a future post, I’d like to explore how Vivado responds in more constrained scenarios.
We didn’t explicitly specify a RAM style. Link to heading
We briefly mentioned some of the attributes supported by Vivado earlier, but one
that we didn’t touch on is
This allows us to guide synthesis in what type of RAM to infer, and if we apply
distributed value to our synchronous read / synchronous write example, it
will use distributed RAM instead of the BRAM it selected without an attribute.
module maybe_bram( input clk, input [7:0] i_addr, input [7:0] i_data, input [7:0] o_addr, input write, input read, output reg [7:0] o_read ); (* ram_style = "distributed" *) reg [7:0] ram [255:0]; always @(posedge clk) begin if (write == 1'b1) begin ram[i_addr] <= i_data; end if (read == 1'b1) begin o_read <= ram[o_addr]; end end endmodule
The schematic view looks mostly similar to the asynchronous read / synchronous write example.
And the same is true for the device.
Closing Thoughts Link to heading
One of the things I consistently hear from folks working on chip design or FPGAs is to study how the HDL you write is being translated into actual circuits. While I am still in the earlier stages of my journey to develop expertise in chip design, I want to make sure that my learnings (and mistakes) in this area are being captured for others who come after me. If you’re interested in keeping up with this content, subscribe to the RSS feed for this blog, and tune in for my ~weekly livestreams.
As always, if you have feedback, questions, or just want to chat, feel free to
reach out to
@hasheddan on any of the platforms listed on the home