The DNPCIe_80G_A10_LL is a PCIe-based FPGA board designed to minimize input-to-output processing latency on 10-Gbit or 40-Gbit Ethernet packets. The primary application is the acceleration of data center tasks and algorithms, including low-cost, low-latency, high-throughput trading without CPU intervention and search engine acceleration. Every variable that affects input-to-output latency has been analyzed and minimized. Raw 10 GbE or 40 GbE packets can be analyzed and acted upon without a MAC, interrupts, or an operating system adding delay to the process. This configurable hardware computing platform can achieve the theoretical minimum Ethernet packet processing latency.
The FPGA - Intel/Altera Arria 10
We use a single FPGA from the Intel/Altera Arria 10 family in the F34 package. Translating the historically inane Altera-now-Intel package naming conventions into English, this means '1152-pin FBGA'. This package supports 504 I/Os, the majority of which are used. Most are dedicated to off-chip memory peripherals, including 3 separate banks of DDR4 memory and a single QDR II+ SSRAM. The GT version of the Arria 10 FPGA is not applicable to this product. Of the 24 total transceivers, eight are used for an 8-lane GEN3 PCIe interface and another eight are connected to two QSFP+ sockets for 40 GbE (or 4 channels of 10 GbE per socket).
You can choose one of twelve Arria 10 FPGAs from the following list:
- GX: Lower cost, more size variations
  - GX1150, GX900, GX660, GX570, GX480, GX320, GX270
- SX: Adds a dual-core ARM Cortex-A9
  - SX660, SX570, SX480, SX320, SX270
These FPGAs come in three speed grades (-1, -2, -3), with -1 the fastest. Table 1 depicts the resources of the FPGA with the Intel/Altera marketing exaggerations excised. The GX1150 is capable of handling nearly 10M ASIC gates of logic. These are large FPGAs, and power and cooling could be the constraining variable for resource utilization. This gate count does not include the internal FPGA memory and multiplier blocks, which are free whether you use them or not. Features of the Arria 10 FPGAs include efficient, dual-register 6-input look-up table (LUT) logic, 20Kb block RAMs, and abundant 18x19 multipliers. Direct support for many floating-point operations is a key advantage of the Arria 10 FPGA fabric. OpenCL will come just as soon as we figure out how to get this board into the program.
Table 1 -- FPGA Resources
Low Latency Network Interface
Dual QSFP+ for 40 GbE or 10 GbE
Four transceiver lanes are connected to each QSFP+ socket, enabling 40 GbE and 10 GbE interfaces. This product is not capable of 100 GbE. Raw Ethernet packets (UDP) can be accessed directly by bypassing the MAC.
DDR4 - 3 banks of 4GB memory
PC4-2400 DDR4 chips are mounted on the card, providing 12GB of DDR4 memory. The memory is configured as 3 separate banks, each 1024M x 32 (4 GB). One bank is lost when the GX/SX320 or GX/SX270 is stuffed.
To minimize data synchronization across clock boundaries, it probably makes sense to clock this DDR4 interface at a 7x multiple of the base Ethernet frequency of 156.25 MHz, which is 1093.75 MHz. A 7x phase-synchronous clock can be easily generated internal to the FPGA, allowing zero-latency synchronous data transfers between the Ethernet packet receiving logic and the DDR4 memory controller. The DDR4 controller can be optimized in any way you choose. We, of course, provide several Verilog examples for no charge that you are welcome to use. All functions of the DDR4 DRAM can be exploited and optimized. Up to 8 banks can be open at once. Timing variables such as CAS latency and precharge can be tailored to the minimum given your operating frequency and the timing specification of the exact DDR4 memory utilized.
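The DDR4 figures quoted in this section can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes each bank moves 32 bits on both clock edges (ordinary DDR behavior) and reports theoretical peak numbers, not what any particular controller will sustain. All names are made up for this example.

```python
# Back-of-the-envelope DDR4 figures for this board (illustrative only).
ETH_BASE_MHZ = 156.25            # base Ethernet frequency quoted in the text
CLOCK_MULTIPLE = 7               # suggested phase-synchronous multiple
BANKS = 3                        # 2 when a GX/SX320 or GX/SX270 is stuffed
WORDS_PER_BANK = 1024 * 2**20    # "1024M" locations per bank
BUS_WIDTH_BITS = 32              # width of each bank's data bus


def ddr4_clock_mhz(multiple=CLOCK_MULTIPLE, base_mhz=ETH_BASE_MHZ):
    """DDR4 interface clock as an integer multiple of the Ethernet clock."""
    return multiple * base_mhz


def bank_capacity_gib(words=WORDS_PER_BANK, width_bits=BUS_WIDTH_BITS):
    """Capacity of one bank in GiB."""
    return words * width_bits / 8 / 2**30


def peak_bandwidth_gbytes_per_s(clock_mhz, width_bits=BUS_WIDTH_BITS):
    """Theoretical peak per bank: two transfers per clock (assumed DDR)."""
    return clock_mhz * 1e6 * 2 * width_bits / 8 / 1e9


if __name__ == "__main__":
    clk = ddr4_clock_mhz()
    print(f"DDR4 clock:         {clk} MHz")
    print(f"Capacity per bank:  {bank_capacity_gib()} GiB")
    print(f"Total capacity:     {BANKS * bank_capacity_gib()} GiB")
    print(f"Peak per bank:      {peak_bandwidth_gbytes_per_s(clk)} GB/s")
```

At the 7x clock this works out to a theoretical 8.75 GB/s per bank; real throughput depends on the access pattern and on how the controller is tuned.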
QDR II+ SSRAM - Memory with the lowest latency
We use a single quad data rate static RAM (QDR II+ SSRAM) in the 4M x 18 size (72Mbit). This type of memory has separate input and output data paths, enabling maximum read/write data bandwidth with minimum latency. The maximum tested frequency of this memory is 633 MHz. To minimize processing latency, we suspect it will be best to clock this QDR II+ SSRAM at 625 MHz, exactly four times the internal Ethernet controller frequency of 156.25 MHz. The Arria 10 FPGAs are capable of generating internal 4x clocks that are phase synchronous, eliminating the latencies associated with the tricky re-synchronization of data moving between different clock frequencies. The internal controller can be optimized in any way you choose. We, of course, provide several Verilog examples for no charge that you are welcome to use. All functions of the QDR II+ SSRAM can be exploited, including concurrent read and write operations and four-tick bursts. The only real limitation is the amount of time and effort spent in customizing the individual memory controllers.
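The choice of 4x for the QDR II+ SSRAM (and 7x for the DDR4) follows from picking the largest integer multiple of the 156.25 MHz Ethernet clock that stays within the memory's rated frequency. A minimal sketch of that reasoning is below. The function name is hypothetical, and the 1200 MHz ceiling for the DDR4 is an assumption derived from the PC4-2400 rating (2400 MT/s at two transfers per clock); the actual limit depends on the parts and the FPGA speed grade.

```python
import math

ETH_BASE_MHZ = 156.25  # internal Ethernet controller frequency from the text


def largest_sync_multiple(max_clock_mhz, base_mhz=ETH_BASE_MHZ):
    """Return (n, n * base_mhz) for the largest integer n such that
    n * base_mhz does not exceed max_clock_mhz."""
    n = math.floor(max_clock_mhz / base_mhz)
    return n, n * base_mhz


# QDR II+ SSRAM: maximum tested frequency of 633 MHz (from the text).
print(largest_sync_multiple(633.0))    # (4, 625.0)

# DDR4: 1200 MHz assumed from the PC4-2400 rating.
print(largest_sync_multiple(1200.0))   # (7, 1093.75)
```

Both results match the multiples suggested in the text: 625 MHz for the QDR II+ SSRAM and 1093.75 MHz for the DDR4.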
PCIe - Customizable 8-lane, GEN3 PCI Express
PCIe is connected directly to the FPGA via 8 lanes of transceivers. Note that the board has a 16-lane mechanical finger for stability. The interface is fully GEN2 and GEN3 capable. We ship GEN3 PCIe IP that is a full-function, fixed, 8-lane master/target. To gain access to the PCIe interface, this IP must be integrated with your application. The Dini Group PCIe IP provides a flexible interface that allows the user access to multiple DMA engines, scratchpad memories, interrupts, and other endpoint-related functions to maximize performance while utilizing minimal FPGA resources. Drivers (required), supplied as 'C' source for several operating systems, are included at no charge.
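For sizing DMA traffic, it can help to know the raw ceiling of the 8-lane link. The sketch below uses the standard PCIe per-lane signaling rates and line encodings (8b/10b for GEN1 and GEN2, 128b/130b for GEN3). It is an upper bound only: packet headers, flow control, and the behavior of the host all reduce what is achievable in practice. The names are illustrative and are not part of the Dini Group IP.

```python
# (transfer rate in GT/s per lane, line-encoding efficiency) per generation
PCIE_GEN = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
}


def pcie_gbytes_per_s(gen, lanes=8):
    """Raw one-direction link bandwidth in GB/s, before protocol overhead."""
    rate_gt_s, efficiency = PCIE_GEN[gen]
    return rate_gt_s * efficiency * lanes / 8  # 8 bits per byte


if __name__ == "__main__":
    for gen in sorted(PCIE_GEN):
        print(f"GEN{gen} x8: {pcie_gbytes_per_s(gen):.2f} GB/s per direction")
```

For GEN3 x8 this gives roughly 7.88 GB/s in each direction.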
- Dual QSFP+ sockets. Each socket:
  - 1 port 40 GbE or
  - 4 ports 10 GbE
- Hosted in an 8-lane GEN1/GEN2/GEN3 PCIe slot
  - 16-lane mechanical
  - Low profile, short length form factor
- Fully compatible with our TCP Offload Engine (TOE/TOE128/TOE_IoT)
- FIX board support package (DN_FBSP). Functioning reference design with:
  - 40 GbE MAC/10 GbE MAC
  - TCP/IP Offload Engine (TOE/TOE128)
    - Up to 128 sessions
  - FIX protocol parser
  - PCIe Interface (8-lane, GEN3)
    - DDR4 Controller
    - QDRII+ Controller
- Intel/Altera Arria 10 FPGA (F34/1152 FBGA)
  - GX
    - GX1150, GX900, GX660, GX570, GX480, GX320, GX270
  - SX (adds dual-core ARM Cortex-A9 processor)
    - SX660, SX570, SX480, SX320, SX270
- 9.8M ASIC gates (ASIC measure) when stuffed with GX1150
  - 854k flip-flop/6-input LUTs (1.7 million total FFs)
  - 53 Mbits FPGA block memory (2,713 20-Kbit blocks)
  - 3,036 multipliers each 18x19
  - OpenCL (consult factory for availability)
- DDR4 Memory, 12GB total, PC4-2400
  - 3 independent banks, each organized as 1024M x 32
    - 2 banks with GX/SX320, GX/SX270 (8 GB)
- QDRII+ SRAM memory, 4M x 18 (72Mb)
  - Separate 18-bit read and write ports
  - 633 MHz bus operation, DDR (double data rate)
  - Fast enough to be clocked at 625 MHz
    - Eliminates clock synchronization delays between memory and Ethernet clock
- Full support for embedded logic analyzers via JTAG interface
- Eight FPGA-controlled LEDs
  - Enough debug LEDs to make popcorn.