As Field Programmable Gate Array (FPGA) density increases, so does the potential for configurable computing machines (CCMs). Unfortunately, the tools to develop applications for CCMs are not maturing as quickly as the technology. This problem manifests itself in several areas: initial application development is complex and time consuming, and migration from one generation to the next requires starting over almost from scratch. These problems leave many unwilling to migrate their applications from a software implementation to hardware. We propose an environment called the Reconfigurable Computing Application Development Environment (RCADE) for the development of CCM applications. RCADE offers higher performance than software solutions, allows non-experts in FPGA design to develop applications, simplifies the design and maintenance process for those who are experts, and eases the process of migration from one CCM platform to the next.
The primary benefit of using FPGA technology in a CCM system is the increase in performance. Rather than implementing applications in software, FPGA technology allows the execution of applications at hardware speeds without the inordinately high cost of creating custom silicon.
Gaining these hardware speeds, however, requires programming CCMs, a task as intricate as hardware design. This presents a barrier which has to date made CCMs inaccessible to many users. Our environment seeks to remove this barrier and make CCMs available to a larger user community by allowing the user to operate at a higher level of abstraction than traditional hardware design. The user enters the initial design in a platform-independent, algorithm-oriented way.
Even if a user spends the time to learn the intricacies of hardware design, it is difficult to verify and debug an application. Compared to a traditional software approach, the debugging process suddenly expands from merely testing and repairing an application implemented in C to debugging both the application and the individual operations used. Simulation helps logically debug the application, but simulation is slow and often does not deal with timing issues. The component-based approach used in our environment hides many of the timing and control issues associated with interconnecting the components. This approach serves to greatly reduce the time required to debug or to update an existing application.
The RCADE environment also addresses the problem of portability among CCM platforms. In traditional FPGA system design, each new generation of device and each new CCM board architecture required in essence throwing out old designs and starting a new design from scratch. Technology is progressing in these areas so quickly in fact, that the time from release to obsolescence of many devices is shorter than the design time for sophisticated applications! Our environment addresses this problem by allowing initial design entry in a platform independent representation, and by building implementations on a standard set of components. Migrating the components from system to system will allow all the applications based on them to migrate more easily.
The nodes in the data flow graph represent abstract operations in the design and are created by agents in the front end. A set of analysis agents activated by the user refines and enhances the data flow graph by adding parameters which constrain the final implementation. In the final step, back-end agents map the design to components.
Components in our environment are efficiently implemented blocks of logic that conform to a well-defined interface and correspond to operations in the design. In addition, our component-based approach hides many of the timing and control issues typically associated with interconnecting components. This serves to greatly reduce the time required to debug a new application or update an existing application.
To illustrate this process, we include the following example. The Probabilistic Neural Net (PNN) is an image classification algorithm for satellite telemetry [,]. The algorithm has N classes, each with P_k vectors (1 ≤ k ≤ N). An important computation in the algorithm is
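The display equation itself is not reproduced above. Purely as an illustration, the class summation of a textbook PNN can be sketched as follows; the Gaussian kernel form, the smoothing parameter `sigma`, and the names used are the standard formulation, not necessarily the paper's exact notation:

```python
import math

# Hedged sketch of a standard PNN class summation: for class k with
# P_k training vectors w, sum a Gaussian kernel over the squared
# distances between each w and the input vector x.
def pnn_class_sum(x, class_vectors, sigma):
    total = 0.0
    for w in class_vectors:
        d2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
        total += math.exp(-d2 / (2.0 * sigma * sigma))
    return total

# Two training vectors: one identical to x (distance 0), one at
# squared distance 2.
print(pnn_class_sum([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 1.0))
```

The inner distance-and-exponential computation is the kind of per-vector operation that maps naturally onto a pipelined hardware component.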
Our goal is to create an environment for creating high performance configurable computing applications which shortens the design cycle for experienced users, makes migration from one CCM platform to another simpler, and allows non-experts access to the performance of dedicated hardware, though an expert with a longer design time may produce a smaller, faster solution. To achieve this, we based RCADE on a technology independent description of an algorithm. To this, we added agents which assist the user in describing the application and enhancing that description. Agents are also provided to map the description to a component-based design approach and create the final application.
The fundamental structure of RCADE is shown in Fig. . At the core of the environment is the design, which is initially represented in an implementation-independent format. A number of different tools, known as agents, have concurrent access to the design. Agents perform a variety of functions, including aiding the user in initially describing their algorithms, guiding the selection of suitable components for the target CCM, and analyzing and optimizing the design. Agents are distinct entities that utilize a well-defined interface to the shared design, so it is easy to extend or modify RCADE by adding new agents. The key to the success of the environment is that all the agents share access to the same representation of the design. This allows the user to initially describe the algorithm with the front-end agents of the environment at a high level of abstraction. These design descriptions may be re-used by various back-end agents to re-target the applications to ever-changing CCM architectures. Rather than attempting to implement the design at a gate level for each target architecture, the back-end agents use a component-based approach to implement the applications on the FPGAs. In this way, changing from one architecture to another requires a new set of components, rather than a new design.
All of the functionality of the system is provided through a set of independent agents which interact through the shared design representation. Each agent is intended to independently transform the graph or enhance it with attributes. Front-end agents create an algorithm in ADF. Analysis Agents enhance and constrain the design with attributes describing everything from data formats to area constraints on individual components. Back-end agents generate a working application from the specification by selecting appropriate components and generating the code to connect them.
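As a minimal sketch of this organization (the class names, attribute names, and graph representation here are hypothetical, not RCADE's actual API), each agent holds a reference to the one shared design and transforms it through a single well-defined entry point:

```python
from abc import ABC, abstractmethod

class Design:
    """Hypothetical minimal form of the shared attributed graph."""
    def __init__(self):
        self.nodes = {}   # node name -> dict of attributes
        self.links = []   # (src_node, src_port, dst_node, dst_port)

class Agent(ABC):
    """All agents see the same Design instance."""
    def __init__(self, design):
        self.design = design

    @abstractmethod
    def run(self):
        """Transform the graph or enhance it with attributes."""

class AttributeAgent(Agent):
    """Example analysis agent: ensures every node carries an
    (illustrative) area-constraint attribute."""
    def run(self):
        for attrs in self.design.nodes.values():
            attrs.setdefault("area_limit", None)

design = Design()
design.nodes["mul0"] = {"op": "multiply"}
AttributeAgent(design).run()
print(design.nodes["mul0"])
```

Because agents communicate only through the shared design, adding a new front-end, analysis, or back-end agent does not require changes to any existing one.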
To create an application in RCADE, the user specifies the design with a design entry agent. We describe two ways of entering designs in this section. Users primarily interact with the environment via the ADF Editor which can be used to create new designs as well as modify existing ones. We also provide an agent which reads an ASCII file and generates a design.
The ADF Editor provides a graphical means of directly manipulating the attributed graph. In addition to the ability to view and perform normal editing functions on the graph, the ADF Editor has an import feature which allows concatenation of existing graphs. The editor operates on the same shared design that other agents see, so changes made to the graph by other agents are immediately visible in the editor. Through the View mechanism, the ``look and feel'' of the editor is completely configurable. By default, nodes are represented as circles and the links between them as lines, but this can easily be changed to suit a particular application.
The text translator uses a file format that reflects the underlying design unit format but is structured with keywords and symbols to make it readily understood by humans. For users familiar with block-structured languages (C, Ada, Pascal, etc.), the syntax is simple to understand: a design unit is a block that has a list of nodes and links. Nodes are blocks that list their input and output ports. Links connect the ports of arbitrary nodes. The syntax provides a uniform mechanism for adding attributes to design units, nodes, and ports. It also supports an expansion notation that introduces a name for a set of attributes. Thus entities that have attributes in common can simply refer to the name rather than repeat the set of attributes.
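A hypothetical fragment in this style might look like the following; the keywords and attribute names are illustrative only, and the actual ADF file syntax may differ:

```
# Named attribute set, referenced below instead of repeating the attributes
design_unit example_stage {
  attrset fixed8 { width = 8; format = fixed; }
  node mul0 { in a @fixed8; in b @fixed8; out p; }
  node add0 { in x; in y; out s; }
  link mul0.p -> add0.x;
}
```

The `attrset` block sketches the expansion notation described above: ports with attributes in common simply name the set rather than restating it.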
Like other agents in our environment, the design entry agents are software components with a well-defined interface to the rest of the system. It is relatively simple for an agent writer to add alternative design entry agents. Generating designs from other applications or high level languages involves developing a parser for the language which builds the ADF graph using methods provided by CECAAD.
Analysis agents set constraints for use in optimization, perform graph transformations, and gather statistics from the ADF graph. These agents are intended to be invoked by and interact with the user as necessary to transform the graph and set constraints to produce a viable implementation.
CCMs have not yet matured to the point where the use of standard floating-point formats is practical in general applications. Users are therefore required to design their applications using a less familiar fixed-point implementation. The precision analysis agent provides the ability to propagate precision information through the graph to determine the bit width and format of each value. Frequently, users have a priori knowledge of required or adequate precision at certain points in the algorithm they are implementing. The precision analysis agent gives the user the ability to adjust the precision of a data value on any port in the graph. The effects of each adjustment are then propagated throughout the entire system and stored as attributes on the ports of each node. The back-end agents of the environment attempt to take advantage of any bit-width savings generated by the precision analysis agent by selecting a part from the library that can be scaled to the correct bit width. Used intelligently, precision analysis can greatly increase the number of operations that can be implemented on a given reconfigurable platform. For instance, if the inputs of a particular multiplication are known to require only 8 significant bits, a pipelined 8-bit multiplier can be implemented using less than 10% of the area required when the inputs are assumed to be standard 32-bit integers.
Upon initialization of the precision analysis agent, precision from the external inputs to the graph is propagated forward to every port within the graph. When the calculated precision is manually altered at some point within the graph, the precision is recalculated from that point forward in the graph. The precision analysis agent also attempts to determine if any savings in bit width can be achieved by propagating the effect of the altered precision backwards from output to input through the graph. Automatic backwards precision calculation is an incompletely specified problem for many arithmetic operations. However, the agent contains a mode which allows the user to interactively enter constraints to aid the backwards propagation.
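Forward propagation of this kind can be sketched as follows. The widening rules used here (addition: maximum input width plus one; multiplication: sum of input widths) are standard worst-case fixed-point rules and an assumption of this sketch, not the agent's documented behavior:

```python
# Worst-case integer bit-width rules (assumed, not RCADE's exact rules).
RULES = {
    "add": lambda widths: max(widths) + 1,
    "mul": lambda widths: sum(widths),
}

def propagate(nodes, order):
    """Forward precision propagation over a small DAG.

    nodes: name -> (op, inputs). External inputs use op 'input', with
    the declared bit width in place of an input list. order must be a
    topological ordering of the node names."""
    width = {}
    for name in order:
        op, ins = nodes[name]
        if op == "input":
            width[name] = ins
        else:
            width[name] = RULES[op]([width[i] for i in ins])
    return width

graph = {
    "a": ("input", 8),
    "b": ("input", 8),
    "p": ("mul", ["a", "b"]),
    "s": ("add", ["p", "a"]),
}
w = propagate(graph, ["a", "b", "p", "s"])
print(w)  # multiplying two 8-bit values needs 16 bits; the add then needs 17
```

Recomputing from a manually altered port is the same traversal started at that port; the backwards pass is harder precisely because these rules do not invert uniquely.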
The precision analysis agent currently supports addition, multiplication, division, sine, cosine, natural log, square root, and buffer operations. Algorithms for computing the precision of additional operations can be added easily by adding a class when a new operation is added to the system. The precision analysis agent also supports hierarchical designs by computing the precision of subgraphs in the design, then attaching the precision of the external inputs and outputs of each subgraph as attributes of the corresponding node in the parent graph.
Back-end agents perform similar functions to the back end components of a compiler. They provide the platform specific details of the implementation. Back-end agents convert an algorithm specification into a working application. At minimum, this requires that a component be bound to each node and that application code be generated.
Once all of the components have been selected, the pipeline balancing agent traverses the graph and inserts additional buffers (FIFOs) in instances where the simple full-empty register interface would impede performance. These buffers are critical in two scenarios to maintain performance. First, they are needed to maintain a steady data flow into and out of a variable-speed part, which may consume or produce data in bursts. The second scenario is illustrated in Fig. 3. In this simplified illustration, each node is labelled with a latency in the center and a maximum throughput, in data items per clock cycle, on the ports. We see two branches which start and end at common points; one path has a latency four times higher than the other. Though both branches have a throughput of one data item per clock cycle, the effective throughput without pipeline optimization averages one data item per four clock cycles, cutting performance by a factor of four. This occurs because the shorter path fills its limited buffering and cannot continue to consume data, which in turn prevents the longer path from consuming data.
To address these problems, we replace the interconnect between selected components with FIFOs. For variable-rate components, the interconnects entering and leaving the node are replaced by FIFOs to assist in properly managing data flow for the component. For converging paths, FIFOs provide additional buffering to increase the number of data items stored in one branch to equal the other. For a given branch m, the number of elements E_m stored along the path can be calculated as shown below, where for each component j the latency is l_j and the throughput is t_j, L_m is the number of links in the branch, and C_m is the number of components in the branch. For a set of k parallel branches, the number of additional buffers N_i needed in branch i can then be calculated as shown in Eq. .
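Since the equations themselves are not reproduced above, the following is only a hedged sketch of the balancing arithmetic, under the assumption that a branch stores one element per cycle of component latency plus one per full-empty link register, and that each branch is padded up to the deepest branch:

```python
def elements_stored(latencies, num_links):
    """Assumed storage model: one element per latency cycle of each
    component, plus one per link register."""
    return sum(latencies) + num_links

def extra_buffers(branches):
    """branches: list of (component latencies, link count) for k
    parallel branches. Returns the extra buffer stages N_i needed to
    bring each branch up to the depth of the deepest one."""
    depths = [elements_stored(lat, links) for lat, links in branches]
    deepest = max(depths)
    return [deepest - d for d in depths]

# The Fig. 3 scenario: one single-component path with latency 4,
# one with latency 1, each with one link to the join.
print(extra_buffers([([4], 1), ([1], 1)]))  # -> [0, 3]
```

Under this model the shorter branch receives three extra stages, so both branches hold the same number of in-flight items and the join never stalls the fork.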
The placement of the FIFOs is determined by port attributes and interpreted by the code generator. In the multipath case, all FIFOs are placed at the last link in the path to minimize their collective space requirements, since co-located FIFO stages can share control logic and should therefore be kept together. If the number of extra buffers needed exceeds sixteen (a sixteen-stage FIFO can be implemented in a single block of a Xilinx 4000 series FPGA), additional buffer stages are added. The result is that the total number of data buffers along any possible path is identical. Therefore, the pipeline balancing agent keeps latency from having an undue effect on throughput.
Code generation is the final stage of the design process. A code generating agent is provided to convert the specification of a design in ADF to a VHDL implementation. This task requires instantiating the components, instantiating the interconnection buffers, and interconnecting components through the buffers. At present, RCADE only produces code for single chip designs. Partitioning issues will be addressed in the future.
The first target architecture is an Annapolis MicroSystems WildForce XL series board. Since we only provide for a single chip design, our initial model of the architecture corresponds to PE1 with a single input path, a single output path, and a single port to SRAM. We generate additional code to multiplex outputs onto a single path and demultiplex inputs from a single path. The code generated is suitable for use with the Synopsys FPGA Compiler, and an appropriate compile script is generated. Additional target architectures can be added by providing a new code generator and a new set of components. The rest of the environment remains unchanged.
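The structural part of this generation step can be sketched as follows; the entity name, signal names, and port names are purely illustrative, not the actual RCADE component library or its generated output:

```python
# Hedged sketch of structural VHDL generation: emit one component
# instantiation per node bound by the back-end agents.
def instantiate(label, component, ports):
    """ports: formal port name -> signal name."""
    port_map = ", ".join(f"{p} => {s}" for p, s in ports.items())
    return f"{label}: {component} port map ({port_map});"

bound_nodes = [
    ("u_mul0", "pipelined_mult8", {"a": "sig_a", "b": "sig_b", "p": "sig_p"}),
    ("u_fifo0", "fifo16", {"din": "sig_p", "dout": "sig_q"}),
]
for node in bound_nodes:
    print(instantiate(*node))
```

Interconnection buffers such as the balancing FIFOs are emitted the same way, with the signal names tying each component's output to the next component's input.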
Although many research projects are using FPGAs to do computation, their specific objectives vary and often their goals are not the same as ours. In this section, we present a brief overview of some related projects and describe how they address the problem described in Section 1.1.
In general, most research projects have opted for a source language similar to C [,,]. The Transmogrifier C compiler, developed at the University of Toronto, compiles a subset of C; control over the bit widths of integers is limited to pragma directives. The VT project and a project at NEC offer more control over what may be constructed in hardware but lack the interaction with the user necessary to develop high performance designs. While the syntax of C is likely to be familiar to new users, the language lacks a direct means for specifying timing, ranges of precision, or multiple storage formats. By itself, C is not a complete solution to our problem, but it may be a helpful start. Indeed, nothing in our approach prohibits adding a front-end agent to translate from C to ADF.
The languages Lola and JHDL do recognize the need for a central data format and interactions with the user to produce high performance designs. In the case of Lola, the target user is an expert and the language focuses on giving the designer as much control over the implementation as possible. The JHDL project has taken a top-down approach by first working from a host software development perspective, whereas we emphasize building the hardware specifically.
Our use of components is not unique. Others have also used component-based design approaches [,,,], and FPGA vendors such as Xilinx are acknowledging the importance of components; Xilinx ships tools with their new software releases to build components. There has been a significant movement toward using third-party components (or cores) as well as module generation tools [,,]. However, these approaches focus only on minimal area and optimal performance. While this is appealing to some users, it is not ideal for building large designs with numerous components, where some flexibility in the components results in large gains. Furthermore, these components are usually designed to be centrally controlled, whereas we use distributed control for the advantages previously described.
In the future, we plan to add agents to provide a functional simulator and debugging support. Library enhancements will consist of more operations, more implementations for each operation, and the ability to integrate cores available from vendors. Algorithm validation can be further simplified by incorporating a runtime debugger capable of executing portions of the design in hardware and single stepping the clock.
On the front end, additional agents will give the user more flexibility in the specification of the design. Initially, we intend to add manual partitioning support to utilize more of the FPGA resources on a CCM, and later to augment this with automated partitioning support. Additionally, we intend to support the use of higher-level language constructs for various elements of design entry. This would involve automatically translating a limited subset of C into ADF to specify such things as memory access patterns and algorithms for a node. The final front-end enhancement being considered is a mechanism for allowing the user to specify the host software and the interaction between the host software and the CCM.
Other analysis agents currently being considered include agents to support functional unit re-use, and hardware/software co-design. An analysis agent supporting functional unit re-use would analyze the graph and determine if cases exist which would allow a more efficient implementation by using a single, faster component rather than multiple smaller, slower components. A hardware/software co-design agent would evaluate the trade-offs of moving components between hardware and software implementations, including issues such as the communication overhead and the performance impacts.
The Planet Earth Research Lab is supported by NASA GSFC under grant number NAG 5-3053. Donations were received from Xilinx under donation number 0000CU7-1978 and Synopsys under agreement number UPS-70004334. Keith Underwood is supported under an NSF Graduate Research Fellowship.