Reconfigurable computing (RC) is one of the most flexible, adaptable computing technologies available, yet it is only slowly making its way into the blades and desktops of mainstream computing. RC creates an unprecedented opportunity for orders-of-magnitude improvement in GFlops-per-dollar, GFlops-per-watt, and raw GFlops.
It's no silver bullet, though. Realizing the potential of RC requires understanding the basic technology, then making sure that it's the right vehicle for the specific application.
Reconfigurable computing fabric
RC typically depends on field programmable gate arrays (FPGAs). For now, consider a high performance FPGA to be a bag of loose computer parts. Today's largest FPGAs include 500 or more block multipliers, on-chip RAM totaling a few MB, and pools of uncommitted arithmetic, control, and connectivity resources. RAM-based switches and lookup tables control connectivity and function, allowing easy redefinition of the computation.
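A rough software analogy may help: a k-input lookup table behaves like a small truth table held in RAM, and rewriting that table redefines the logic function without touching anything around it. The C sketch below models a 4-input LUT; the function name and configuration values are illustrative, not taken from any vendor's toolkit.

    #include <stdint.h>

    /* Software model of a 4-input LUT: the 16-bit config word is the
       truth table, and changing it changes the implemented function.
       Each of a, b, c, d is a single bit (0 or 1). */
    uint8_t lut4(uint16_t config, uint8_t a, uint8_t b, uint8_t c, uint8_t d)
    {
        unsigned index = (a << 3) | (b << 2) | (c << 1) | d;   /* 0..15 */
        return (uint8_t)((config >> index) & 1u);
    }

    /* Example configurations: 0x8000 yields a 4-input AND (only index 15
       maps to 1); 0xFFFE yields a 4-input OR (every index except 0). */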
An accelerator board attaches to an existing computer's main system interconnect, such as HyperTransport in an AMD system, NUMAlink in a Silicon Graphics Altix system, or PCI in a typical workstation. The board contains one or more FPGAs for application computing, plus some amount of SRAM and/or DRAM arranged in several independently addressable banks. The block diagram shown in Fig 1 is similar to that of a graphics accelerator, with a computing engine, on-board memory, and a system interconnection.
1. FPGAs typically include low-latency on-board buffers, as well as access to system memory.
The von Neumann programming model distributes an algorithm across time: one function unit performs a sequence of operations, one at a time, to carry out a specific computation. Speed comes from performing many operations, including memory accesses, in rapid succession.
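As a concrete (if simplified) illustration, the C sketch below computes a dot product in the usual von Neumann style: one accumulator, one multiply-add per iteration, each step depending on the one before it.

    /* Von Neumann style: one function unit, one operation at a time.
       The running sum creates a loop-carried dependence, so each
       multiply-add must wait for the previous one. */
    float dot_sequential(const float *x, const float *y, int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += x[i] * y[i];
        return sum;
    }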
RC gets away from the von Neumann model; it distributes an algorithm spatially across the configurable computing fabric, as shown in Fig 2. Speed comes from performing tens to hundreds of operations in parallel, using pipelining, broadside parallelism, or a combination of both.
2. Programming an FPGA means configuring it into an application-specific processor.
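Continuing the dot-product sketch above, still in C and purely as an illustration, spatial distribution amounts to restructuring the computation so that independent operations become explicit. Each accumulator below could map to its own multiplier/adder chain on the FPGA, all running concurrently, with a small adder tree performing the final reduction. The four-way unrolling, and the assumption that n is a multiple of four, are arbitrary choices made for brevity.

    /* Spatially distributed sketch: four independent accumulators map
       naturally onto four concurrent multiplier/adder chains.
       Assumes n is a multiple of 4, purely to keep the example short. */
    float dot_spatial(const float *x, const float *y, int n)
    {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        for (int i = 0; i < n; i += 4) {
            s0 += x[i]     * y[i];
            s1 += x[i + 1] * y[i + 1];
            s2 += x[i + 2] * y[i + 2];
            s3 += x[i + 3] * y[i + 3];
        }
        return (s0 + s1) + (s2 + s3);   /* final reduction: a small adder tree */
    }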
High-level descriptions rarely exploit the full potential of an FPGA, however. The biggest reason is that C-like programming languages have sequential execution built deeply into their basic structure, making it extremely difficult to automate the extraction of FPGA-friendly parallelism.
Just as von Neumann programmers may fall back on assembler for performance-critical kernels, FPGA programmers can use hardware description languages (HDLs) such as Verilog or VHDL to expose more of the algorithm to the FPGA fabric. This step generally requires specialized programming skills: where parallelism is the exception that C-like languages leave to the programmer, HDLs are natively parallel, and sequencing of execution is largely the developer's responsibility.
When considering the "grain" of a processing element (PE), x86-compatible and similar processors traditionally stand at one end of the spectrum: coarse-grained, with a single PE that is large and complex. The continuum runs through common dual cores, multi-cores on the order of ten PEs (such as the Cell Broadband Engine and the UltraSPARC T2 from Sun Microsystems), and many-cores on the order of 10² PEs (like Intel's Polaris or products from Clearspeed and Tilera).
As the number of PEs increases, the size and power of each PE decreases. FPGAs are sometimes considered the fine-grained extreme: on the order of 10⁵ PEs of one-bit functionality, fixed at the time they are programmed. This, however, does not reflect how developers really program FPGAs.
Even in HDLs, design quanta are typically not individual bits of logic, but register arrays, RAM buffers, arithmetic units, or entire filters. As a result, any given FPGA can implement PEs of different sizes at different times, occupying a different point on the axes of PE complexity vs. number of PEs per chip. In practice, FPGAs represent variable-grain computing, typically with 10 to 10³ PEs custom tailored to the specific application.
FPGA performance factors
Applications running in FPGA accelerators typically run near 100 MHz, with higher clock speeds requiring increased design effort. And, like optimizations for memory hierarchy in typical high performance computing (HPC) programming, that effort has little to do with the application being accelerated.
Starting from a 10:1 disadvantage, or more, relative to common CPU clock speeds, FPGAs can still achieve impressive speedups for applications that take advantage of their strengths. Those include:
- Fine-grained parallelism: Large contemporary FPGAs contain hundreds of independent multiplication and arithmetic units, all of which can run concurrently.
- Low computation overhead: A typical HPC loop, a dot product for example, includes a termination test, indexing computations, sequential memory fetches, a branch, plus instruction fetch and decode, in addition to the multiply-add payload. In an FPGA, indexing and fetches can be pipelined, operands can be stored in independent memory banks, termination testing can happen in parallel with the arithmetic, and the configured control logic eliminates the notion of a von Neumann instruction stream altogether.
- Memory concurrency: Common accelerator boards feature four to eight independently addressable off-chip RAM banks. High-performance FPGAs contain 500 (or more) independently addressable RAM buffers. Assuming dual-ported access and 100 MHz operation, 400 GB/s of on-chip memory bandwidth is realistic, and much higher values are theoretically possible (see Fig 3 and the back-of-envelope calculation below).
- Fast, fine-grained communication: On-chip communication runs at full chip speed and typically with latencies of just a few cycles. Since communication is so configurable, broadcast networks, multi-bus structures, and customized communication structures are all possible.
3. Careful use of accelerator resources can reduce or eliminate memory bottlenecks.
Standard HPC practice favors minimizing communication and performing it in units large enough to amortize transfer overhead. PEs within an FPGA can transfer individual data values as often as every cycle at minimal cost. Additionally, in a typical HPC single-program-multiple-data paradigm, access across memory banks can be very expensive; the large number of memory busses and low communication costs in FPGAs simplify data exchanges and the concurrent fetch of multiple operands.
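To make the on-chip bandwidth figure concrete, the sketch below multiplies out the numbers quoted above. The 32-bit port width is an assumed value, not stated in the text, and the names are illustrative only.

    #include <stdio.h>

    int main(void)
    {
        /* Back-of-envelope on-chip bandwidth, using the figures quoted
           above plus an assumed 32-bit (4-byte) port width. */
        double buffers  = 500.0;    /* independently addressable RAM blocks */
        double ports    = 2.0;      /* dual-ported access                   */
        double width_B  = 4.0;      /* assumed port width in bytes          */
        double clock_Hz = 100e6;    /* 100 MHz application clock            */

        double bw = buffers * ports * width_B * clock_Hz;
        printf("Aggregate on-chip bandwidth: %.0f GB/s\n", bw / 1e9);  /* ~400 GB/s */
        return 0;
    }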
Picking the right application
FPGAs function much more efficiently with some applications than with others. Experienced HPC developers know that any computer technology has constraints and sweet spots, which determine the kinds of applications that can use the technology to its fullest. The WORKS criteria help identify good RC candidates: Working set, Order of complexity, Regularity of data and algorithms, Kernel size, and Speed of communications.
There are exceptions to these guidelines, of course, as well as many other constraints, but they give a good first indication of whether FPGA acceleration is a reasonable solution for the application.
- Working set: Today's high performance FPGAs have up to about 2 MB of on-chip RAM. Configurations vary, but XtremeData's XD2000f offers 36 MB of RAM on the accelerator board, while the RASC RC100 from Silicon Graphics claims 80 MB. On these devices, off-chip bandwidth can be an order of magnitude lower than on-chip, with off-board access to system memory even slower.
- O(N²) behavior or greater: Downloading data to the accelerator and uploading the results generally represent time not spent computing. Algorithms that heavily reuse data can amortize download time across many references to each value loaded. Kernels that touch each value just once (e.g., a dot product) might even slow down relative to execution on the CPU, due to transfer time and clock speeds (see the matrix-multiply sketch following this list).
- Regularity: Simple data structures and regular access patterns allow FPGAs to pipeline off-chip and possibly off-board memory references. Regular computations often allow loops to be unrolled into wide parallel arrays or into pipelines with tens or hundreds of stages running concurrently.
- Kernel size: The "program store" in an FPGA is relatively small compared to the memory a workstation has for programs, and loading a new bit file takes time. Applications may need to be restructured to isolate the loops that benefit from FPGA acceleration and to minimize the need to switch between kernels.
- Speed of communications: Three features distinguish on-chip communications in FPGAs, specifically low latency (typically a few cycles, down to one), massive bandwidth (easily totaling many TB/s), and customizability able to support most topologies an application might require.
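The data-reuse point is easiest to see by contrasting two familiar kernels. A dot product touches each downloaded value exactly once, so transfer time is paid in full; a dense matrix multiply performs O(N³) multiply-adds on O(N²) data, so every value downloaded is reused N times and the transfer cost is amortized. The naive C triple loop below is for illustration only.

    /* Dense matrix multiply: O(n^3) arithmetic on O(n^2) data, so each
       element of a and b is reused n times once it reaches the
       accelerator. Contrast with a dot product, where each element is
       used exactly once. */
    void matmul(int n, const float *a, const float *b, float *c)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                float sum = 0.0f;
                for (int k = 0; k < n; k++)
                    sum += a[i * n + k] * b[k * n + j];
                c[i * n + j] = sum;
            }
    }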
RC has existed for years, but only recently has it begun to emerge from niches such as signal and image processing. As with any performance technology, RC works better for some applications than for others.
Even though tools keep getting better, realizing the performance benefits possible with RC often requires unusual programming skills. The good news is that a wide range of RC platforms are available, tools are constantly improving, and FPGAs are advancing with the expected benefits of Moore's Law as quickly as any other type of computing device.
The end result is that, depending on the HPC application, RC can be both a cost-effective and power-effective solution.
Tom VanCourt, Ph.D., is currently a senior member of technical staff at Altera Corporation. He develops system-building tools and champions performance computing on FPGAs.
Tom has spent over 25 years in industry with DEC, HP, and other companies. He has also taught at Boston University, where he earned his Ph.D. in computer systems engineering. His interests include FPGA-based computing in finance, life science, medical imaging, and other application areas.
Tom's ongoing efforts include adapting existing FPGA tools and creating new ones to move beyond logic synthesis and into the very different world of FPGA-based performance computing.