A development environment for configurable computing

B. Boysen, N. DeBardeleben, K. M. Hazelwood, W. B. Ligon III,
R. Sass, D. C. Stanzione, Jr., K. D. Underwood
Parallel Architecture Research Lab
Holcombe Department of Electrical and Computer Engineering
Clemson University
105 Riggs Hall
Clemson, SC 29634-0915

Abstract

As FPGA density increases, so does the potential for configurable computing machines. Unfortunately, the larger designs which take advantage of the higher densities require much more effort and longer design cycles, making the technology even less likely to appeal to users outside the field of configurable computing. To combat this problem, we present the Reconfigurable Computing Application Development Environment (RCADE). The goals of RCADE are to produce high-performance applications, to make FPGA design more accessible to those who are not hardware engineers, to shorten the design lifecycle, and to ease the process of migration from one platform to another. Here, we discuss the environment architecture, the current set of agents, and other agents to be developed.

1.  INTRODUCTION

As Field Programmable Gate Array (FPGA) density increases, so does the potential for configurable computing machines (CCMs). Unfortunately, the tools to develop applications for CCMs are not maturing as quickly as the technology. This problem manifests itself in several areas: initial application development is complex and time consuming, and migration from one generation to the next requires starting over almost from scratch. These problems leave many unwilling to migrate their applications from a software implementation to hardware. We propose an environment called the Reconfigurable Computing Application Development Environment (RCADE) for the development of CCM applications which will offer higher performance than software solutions, allow non-experts in FPGA design to develop applications, simplify the design and maintenance process for those who are experts, and ease the process of migration from one CCM platform to the next.

1.1  Problem description

The primary benefit of using FPGA technology in a CCM system is the increase in performance. Rather than implementing applications in software, FPGA technology allows the execution of applications at hardware speeds without the inordinately high cost of creating custom silicon.

To gain these hardware speeds, however, users of CCMs face a problem as intricate as hardware design. This barrier has to date made CCMs inaccessible to many users. Our environment seeks to remove this barrier and make CCMs available to a larger user community by allowing the user to operate at a higher level of abstraction than traditional hardware design. The user enters the initial design in a platform-independent, algorithm-oriented way.

Even if a user spends the time to learn the intricacies of hardware design, it is difficult to verify and debug an application. Compared to a traditional software approach, the debugging process suddenly expands from merely testing and repairing an application implemented in C to debugging both the application and the individual operations used. Simulation helps logically debug the application, but simulation is slow and often does not deal with timing issues. The component-based approach used in our environment hides many of the timing and control issues associated with interconnecting the components. This approach serves to greatly reduce the time required to debug or to update an existing application.

The RCADE environment also addresses the problem of portability among CCM platforms. In traditional FPGA system design, each new generation of device and each new CCM board architecture required in essence throwing out old designs and starting a new design from scratch. Technology is progressing in these areas so quickly in fact, that the time from release to obsolescence of many devices is shorter than the design time for sophisticated applications! Our environment addresses this problem by allowing initial design entry in a platform independent representation, and by building implementations on a standard set of components. Migrating the components from system to system will allow all the applications based on them to migrate more easily.

1.2  Our approach

To address these issues, we use a technology independent description of a user design which contains all pertinent information about the algorithm in a data flow graph. RCADE is built upon the CECAAD base, which provides a format for describing the graph and a model for independent agents to interact with the graph. The agents encapsulate the expertise needed for successful design of high performance CCM applications.

The nodes in the data flow graph represent abstract operations in the design and are created by agents in the front end. A set of analysis agents activated by the user refines and enhances the data flow graph by adding parameters which constrain the final implementation. In the final step, back-end agents map the design to components.

Components in our environment are efficiently implemented blocks of logic that conform to a well-defined interface and correspond to operations in the design. In addition, our component-based approach hides many of the timing and control issues typically associated with interconnecting components. This serves to greatly reduce the time required to debug a new application or update an existing application.

To illustrate this process, we include the following example. The Probabilistic Neural Net (PNN) is an image classification algorithm for satellite telemetry [,]. The algorithm has N classes, each with P_k training vectors (1 <= k <= N). An important computation in the algorithm is

f(X \mid S_k) = \frac{K}{P_k} \sum_{i=1}^{P_k} \exp\left( \frac{-(X - W_{ki})^T (X - W_{ki})}{2\sigma^2} \right)    (1)

where K = (2\pi)^{-d/2} \sigma^{-d}, W_{ki} is the ith training vector from the kth class, and d is the dimension of each vector. (The \sigma is a smoothing parameter.) One of the N classes is assigned to every pixel by computing values from Equation 1 and selecting the class with the highest resulting score. Fig. 1 shows an initial abstract representation of the algorithm. By invoking a series of agents, this graph is transformed into a working hardware design.
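As a software reference for Equation 1, the per-class score and the final class selection can be sketched in Python. The function and variable names here are our own; the paper implements this computation in hardware, so this is only a behavioral model.

```python
import numpy as np

def pnn_scores(x, classes, sigma):
    """Score a feature vector x against each class per Equation 1.

    `classes` is a list of (P_k x d) arrays of training vectors W_ki.
    """
    d = len(x)
    k_const = (2 * np.pi) ** (-d / 2) * sigma ** (-d)   # K = (2*pi)^(-d/2) * sigma^(-d)
    scores = []
    for w in classes:                                   # one array of training vectors per class
        diff = w - x                                    # (X - W_ki) for every i at once
        sq_dist = np.sum(diff * diff, axis=1)           # (X - W_ki)^T (X - W_ki)
        scores.append(k_const / len(w) * np.sum(np.exp(-sq_dist / (2 * sigma ** 2))))
    return scores

def classify(x, classes, sigma):
    """Assign the class with the highest score, as done per pixel."""
    return int(np.argmax(pnn_scores(x, classes, sigma)))
```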


Figure 1: The PNN algorithm as represented in ADF and displayed by the editor.

Our goal is an environment for creating high-performance configurable computing applications which shortens the design cycle for experienced users, makes migration from one CCM platform to another simpler, and allows non-experts access to the performance of dedicated hardware, though an expert with a longer design time may produce a smaller, faster solution. To achieve this, we based RCADE on a technology-independent description of an algorithm. To this, we added agents which assist the user in describing the application and enhancing that description. Agents are also provided to map the description to a component-based design and create the final application.

1.3  Overview

The remainder of this paper is organized as follows. In Section 2, we discuss the infrastructure, specifically the organization of the system and our component-based approach. Several important agents are described in Section 3. Many researchers are actively pursuing ways of using FPGAs to do computation; in Section 4, we show how our project's goals and work relate to other projects in the area. Finally, we conclude with future work and a summary of our work to date in Section 5.

2.  System architecture

The fundamental structure of RCADE is shown in Fig. 2. At the core of the environment is the design, which is initially represented in an implementation-independent format. A number of different tools, known as agents, have concurrent access to the design. Agents perform a variety of functions, including aiding the user in initially describing an algorithm, guiding the selection of suitable components for the target CCM, and analyzing and optimizing the design. Agents are distinct entities that adhere to a well-defined interface to the shared design, so it is easy to extend or modify RCADE by adding new agents.

The key to the success of the environment is that all the agents share access to the same representation of the design. This allows the user to initially describe the algorithm with the front-end agents of the environment at a high level of abstraction. These design descriptions may be re-used by various back-end agents to re-target the applications to ever-changing CCM architectures. Rather than attempting to implement the design at a gate level for each target architecture, the back-end agents use a component-based approach to implement the applications on the FPGAs. In this way, changing from one architecture to another requires a new set of components, rather than a new design.


Figure 2: The basic structure of RCADE

3.  AGENTS

All of the functionality of the system is provided through a set of independent agents which interact through the shared design representation. Each agent is intended to independently transform the graph or enhance it with attributes. Front-end agents create an algorithm in ADF. Analysis agents enhance and constrain the design with attributes describing everything from data formats to area constraints on individual components. Back-end agents generate a working application from the specification by selecting appropriate components and generating the code to connect them.

3.1  Front-end agents

To create an application in RCADE, the user specifies the design with a design entry agent. We describe two ways of entering designs in this section. Users primarily interact with the environment via the ADF Editor which can be used to create new designs as well as modify existing ones. We also provide an agent which reads an ASCII file and generates a design.

The ADF Editor provides a graphical means of directly manipulating the attributed graph. In addition to the ability to view and perform normal editing functions on the graph, the ADF Editor has an import feature which allows concatenation of existing graphs. The editor operates on the same shared design that other agents see, so changes made to the graph by other agents are immediately visible in the editor. Through the View mechanism, the ``look and feel'' of the editor is completely configurable. By default, nodes are represented as circles and the links between them as lines, but this can easily be changed to suit a particular application.

The text translator uses a file format that reflects the underlying design unit format but is structured with keywords and symbols to make it readily understood by humans. For users familiar with block-structured languages (C, Ada, Pascal, etc.), the syntax is simple to understand: a design unit is a block that has a list of nodes and links. Nodes are blocks that list their input and output ports. Links connect the ports of arbitrary nodes. The syntax provides a uniform mechanism for adding attributes to design units, nodes, and ports. It also supports an expansion notation that introduces a name for a set of attributes. Thus entities that have attributes in common can simply refer to the name rather than repeat the set of attributes.
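The paper specifies the structure of this format (design units containing nodes and links, nodes listing their ports, attributes on any entity, and named attribute sets) but not the concrete keywords. Purely as a hypothetical illustration of that structure, a design unit file might look like:

```
# Hypothetical sketch only -- the keyword spellings below are invented.
design pnn_stage {
    attrset fixed16 { format = fixed; width = 16; }   # named attribute set

    node sub  { in a, b; out diff : fixed16; }        # ports refer to the set by name
    node mult { in x, y; out prod : fixed16; }

    link sub.diff -> mult.x;                          # links connect ports of arbitrary nodes
    link sub.diff -> mult.y;
}
```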

Like other agents in our environment, the design entry agents are software components with a well-defined interface to the rest of the system. It is relatively simple for an agent writer to add alternative design entry agents. Generating designs from other applications or high level languages involves developing a parser for the language which builds the ADF graph using methods provided by CECAAD.

3.2  Analysis agents

Analysis agents set constraints for use in optimization, perform graph transformations, and gather statistics from the ADF graph. These agents are intended to be invoked by and interact with the user as necessary to transform the graph and set constraints to produce a viable implementation.

3.2.1 Precision analysis agent

CCMs have not yet matured to the point where the use of standard floating-point formats is practical in general applications []. Users are required to design their applications using a less familiar fixed-point implementation. The precision analysis agent provides the ability to propagate precision information through the graph to determine the bit width and format of each value. Frequently, users may have a priori knowledge of required or adequate precision at certain points in the algorithm they are implementing. The precision analysis agent gives the user the ability to adjust the precision of a data value on any port in the graph. The effects of each adjustment are then propagated throughout the entire system, and stored as attributes on the ports of each node. The back-end agents of the environment attempt to take advantage of any bit-width savings generated by the precision analysis agent by selecting a part from the library that can be scaled to the correct bit width. Precision analysis used intelligently can greatly increase the number of operations that can be implemented using a given reconfigurable platform. For instance, if the inputs of a particular multiplication are known to only require 8 significant bits, a pipelined 8-bit multiplier can be implemented using less than 10% of the area required if the inputs were assumed to be standard 32-bit integers.

Upon initialization of the precision analysis agent, precision from the external inputs to the graph is propagated forward to every port within the graph. When the calculated precision is manually altered at some point within the graph, the precision is recalculated from that point forward in the graph. The precision analysis agent also attempts to determine if any savings in bit width can be achieved by propagating the effect of the altered precision backwards from output to input through the graph. Automatic backwards precision calculation is an incompletely specified problem for many arithmetic operations. However, the agent contains a mode which allows the user to interactively enter constraints to aid the backwards propagation.

Operations which the precision analysis agent currently supports include addition, multiplication, division, sine, cosine, natural log, square root, and buffer. Algorithms for computing the precision of additional operations can be added easily by adding a class when a new operation is added to the system. The precision analysis agent also supports hierarchical designs by computing the precision of subgraphs in the design, then attaching the precision of the external inputs and outputs of that subgraph to attributes of the corresponding node in the parent graph.
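A minimal sketch of the forward propagation pass follows. The width rules and graph encoding below are our own simplifications for illustration, not RCADE's actual implementation, which stores per-port format attributes.

```python
def out_width(op, in_widths):
    """Worst-case result width for a few integer/fixed-point operations."""
    if op == "add":            # a sum may carry one extra bit
        return max(in_widths) + 1
    if op == "mult":           # a product's width is the sum of the operand widths
        return sum(in_widths)
    if op == "buffer":         # buffers pass data through unchanged
        return in_widths[0]
    raise ValueError(f"no width rule for {op}")

def propagate(nodes, edges, input_widths):
    """Propagate bit widths from external inputs through the graph.

    nodes: {name: op}, listed in topological order.
    edges: list of (src, dst) port connections.
    input_widths: {name: width} for the external inputs.
    Returns the computed output width of every node.
    """
    widths = dict(input_widths)
    fanin = {}
    for src, dst in edges:
        fanin.setdefault(dst, []).append(src)
    for name, op in nodes.items():
        widths[name] = out_width(op, [widths[s] for s in fanin[name]])
    return widths
```

For example, an 8-bit addition feeding a multiplication by a 4-bit value yields a 13-bit product, far narrower than a default 32-bit assumption.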

3.2.2 Throughput analysis agent

Another way of achieving space savings in FPGA designs is through throughput analysis: estimating the rate at which each piece of the design will consume and produce data, and eliminating any excess capacity by replacing large, fast parts with smaller, slower ones. The throughput analysis agent traverses the graph and estimates the maximum throughput for each input and output of every node. Because a node cannot, and need not, consume data faster than previous nodes produce it, the agent then propagates any throughput limitations created by slower nodes throughout the graph. This information is stored as an attribute which is later used by the component selector discussed in Section 3.3.1. For instance, in the PNN example of Section 1, a subtraction feeds a square, then a summation, followed by a multiplication. Assuming that the subtraction and square are fully pipelined and the summation sums 16 elements, optimal application performance can still be achieved if the multiplication produces a result only once every 16 cycles. Such a component is a factor of 10 smaller than its fully pipelined counterpart.
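The propagation step can be sketched as follows. The single rate per node and the dictionary-based graph encoding are simplifications of the per-port attributes the agent actually stores.

```python
def propagate_throughput(nodes, edges, source_rate=1.0):
    """Cap each node's effective rate by its slowest producer.

    nodes: {name: native_rate in items/cycle}, listed in topological order.
    edges: list of (src, dst) pairs.
    Returns the effective output rate of every node.
    """
    eff = {}
    producers = {}
    for src, dst in edges:
        producers.setdefault(dst, []).append(src)
    for name, native in nodes.items():
        # A node cannot consume faster than its upstream neighbors produce.
        upstream = min((eff[p] for p in producers.get(name, [])),
                       default=source_rate)
        eff[name] = min(native, upstream)
    return eff
```

Applied to the PNN fragment above, the 16-element summation caps the multiplier at one result per 16 cycles, so a component selector can pick the smaller serial multiplier without losing performance.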

3.3  Back-end agents

Back-end agents perform similar functions to the back end components of a compiler. They provide the platform specific details of the implementation. Back-end agents convert an algorithm specification into a working application. At minimum, this requires that a component be bound to each node and that application code be generated.

3.3.1 Component selection agent

The front-end tools express an algorithm in an ADF graph in terms of abstract operations. The analysis agents place constraints on the implementations of these abstract operations. The purpose of the component selector agent is to traverse the graph and select a specific implementation for each node, based on the abstract operation, the constraints, and the target platform. For each node, the component selector queries the database for all functions of that type. It then selects the implementation that best meets the constraints.
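A hedged sketch of this query-and-select step, with an invented component-record shape (operation, area, rate) standing in for the real database schema and whatever cost function RCADE actually uses:

```python
def select_component(library, op, max_area, min_rate):
    """Pick the smallest library entry implementing `op` that still meets
    the node's throughput and area constraints.

    library: list of dicts like {"op", "name", "area", "rate"} -- an
    illustrative stand-in for the component database.
    """
    candidates = [c for c in library
                  if c["op"] == op
                  and c["rate"] >= min_rate
                  and c["area"] <= max_area]
    if not candidates:
        raise LookupError(f"no {op} component satisfies the constraints")
    return min(candidates, key=lambda c: c["area"])   # smallest viable part
```

With the throughput constraint from the PNN example (one result per 16 cycles), such a selector would prefer a small serial multiplier over a fully pipelined one.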

3.3.2 Pipeline balancing agent

Figure 3: A simple example to illustrate pipeline balancing issues.

Once all of the components have been selected, the pipeline balancing agent traverses the graph and inserts additional buffers (FIFOs) where the simple full-empty register interface would impede performance. These buffers are critical to maintaining performance in two scenarios. First, they are needed to maintain a steady data flow into and out of a variable-speed part, which may consume or produce data in bursts. The second scenario is illustrated in Fig. 3. In this oversimplified illustration, each node is labelled with a latency in the center and a maximum throughput, in data items per clock cycle, on the ports. We see two branches which start and end at common points; one path has a latency four times higher than the other. Though both branches have a throughput of one data item per clock cycle, the effective throughput without pipeline optimization will average one data item per four clock cycles, cutting performance by a factor of four. This occurs because the shorter path cannot continue to consume data, which in turn prevents the longer path from consuming data.

To address these problems, we replace the interconnect between selected components with FIFOs. For variable-rate components, the interconnects entering and leaving the node are replaced by FIFOs to assist in properly managing data flow for the component. For converging paths, FIFOs are used to provide additional buffering to increase the number of data items stored in one branch to equal the other. For a given branch m, the number of elements E_m stored along the path can be calculated as shown in Eq. 2, where for each component j the latency is l_j and the throughput is t_j, L_m is the number of links in the branch, and C_m is the number of components in the branch. For a set of k parallel branches, the number of additional buffers N_i needed in branch i can then be calculated as shown in Eq. 3.

E_m = \sum_{j=1}^{C_m} \frac{l_j}{t_j} + L_m    (2)

N_i = \max_{k} \left\{ E_k - E_i \right\}    (3)

The placement of the FIFOs is determined by port attributes and interpreted by the code generator. In the multipath case, all FIFOs are placed at the last link in the path to minimize their collective space requirements, since the buffers share control logic and should therefore be kept together. If the number of extra buffers needed exceeds sixteen (a sixteen-stage FIFO can be implemented in a single block of a Xilinx 4000 series FPGA), additional buffer stages are added. The result is that the total number of data buffers along any possible path is identical. Therefore, the pipeline balancing agent keeps latency from having an undue effect on throughput.
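Equations 2 and 3 translate directly into code. This sketch uses (latency, throughput) pairs per component and is only a software illustration of the agent's calculation, not its actual implementation:

```python
def elements_in_branch(components, num_links):
    """E_m: data items stored along a branch (Equation 2).

    components: list of (latency, throughput) pairs, one per component;
    num_links: L_m, the number of links in the branch.
    """
    return sum(latency / throughput for latency, throughput in components) + num_links

def extra_buffers(branches):
    """N_i: additional buffers per branch (Equation 3).

    branches: list of (components, num_links) tuples for a set of
    parallel branches that converge at a common point.
    """
    e = [elements_in_branch(comps, links) for comps, links in branches]
    deepest = max(e)
    return [deepest - e_i for e_i in e]   # balance every branch to the deepest one
```

For the two-branch case of Fig. 3, the shorter branch receives the extra buffering so that both paths hold the same number of in-flight data items.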

3.3.3 Code generation

Code generation is the final stage of the design process. A code generating agent is provided to convert the specification of a design in ADF to a VHDL implementation. This task requires instantiating the components, instantiating the interconnection buffers, and interconnecting components through the buffers. At present, RCADE only produces code for single chip designs. Partitioning issues will be addressed in the future.

The first target architecture is an Annapolis MicroSystems WildForce XL series board. Since we only provide for a single chip design, our initial model of the architecture corresponds to PE1 with a single input path, a single output path, and a single port to SRAM. We generate additional code to multiplex outputs onto a single path and demultiplex inputs from a single path. The code generated is suitable for use with the Synopsys FPGA Compiler, and an appropriate compile script is generated. Additional target architectures can be added by providing a new code generator and a new set of components. The rest of the environment remains unchanged.

4.  Related Work

Although many research projects are using FPGAs to do computation, their specific objectives vary and often their goals are not the same as ours. In this section, we present a brief overview of some related projects and describe how they address the problem described in Section 1.1.

In general, most research projects have opted for a source language similar to C [,,]. Transmogrifier C [], developed at the University of Toronto, compiles a subset of C and is limited to pragma directives to control the bit widths of integers. The VT project [] and a project at NEC [] offer more control over what may be constructed in hardware but lack the interaction with the user necessary to develop high performance designs. While the syntax of C is likely to be familiar to new users, the language lacks a direct means for specifying timing, ranges of precisions, or multiple storage formats. By itself, C is not a complete solution to our problem, but it may be a helpful start. Indeed, nothing in our approach prohibits adding a front-end agent to translate from C to ADF.

The languages Lola[] and JHDL[] do recognize the need for a central data format and interactions with the user to produce high performance designs. In the case of Lola, the target user is an expert and the language focuses on giving the designer as much control over the implementation as possible. The JHDL project has taken a top-down approach by first working from a host software development perspective, whereas we emphasize building the hardware specifically.

Our use of components is not unique. Others have also used component-based design approaches [,,,], and FPGA vendors, such as Xilinx, are acknowledging the importance of components. Xilinx ships tools with their new software releases to build components. There has been a significant movement toward using third-party components (or cores) as well as module generation tools [,,]. However, these approaches focus only on minimal area and optimal performance. While this is appealing to some users, it is not ideal for building large designs with numerous components, where some flexibility in the component results in large gains. Furthermore, these components are usually designed to be centrally controlled, whereas we use distributed control for the advantages previously described.

5.  CONCLUSIONS AND FUTURE WORK

RCADE provides a configurable computing application development environment which is accessible to a broader range of users than traditional FPGA design techniques, shortens the design life cycle, eases migration of applications to new platforms, and yields high-performance designs. RCADE achieves this by employing a platform-independent algorithm description process. The use of independent agents and a common design representation gives the environment the flexibility to change with the technology and the needs of the user. The current set of agents allows the user to specify the design, enhance the design with precision information, constrain the design to improve performance and limit area usage, map the specification to a set of components, and produce binaries for a current architecture.

In the future, we plan to add agents to provide a functional simulator and debugging support. Library enhancements will consist of more operations, more implementations for each operation, and the ability to integrate cores available from vendors. Algorithm validation can be further simplified by incorporating a runtime debugger capable of executing portions of the design in hardware and single stepping the clock.

On the front end, additional agents will give the user more flexibility in the specification of the design. Initially, we intend to add manual partitioning support to utilize more of the FPGA resources on a CCM, and later augment this with some automated partitioning support. Additionally, we intend to support the use of higher-level language constructs for various elements of design entry. This would involve automatically translating a limited subset of C into ADF, to specify such things as memory access patterns and algorithms for a node. The final front-end enhancement being considered is a mechanism for allowing the user to specify the host software and the interaction between the host software and the CCM.

Other analysis agents currently being considered include agents to support functional unit re-use, and hardware/software co-design. An analysis agent supporting functional unit re-use would analyze the graph and determine if cases exist which would allow a more efficient implementation by using a single, faster component rather than multiple smaller, slower components. A hardware/software co-design agent would evaluate the trade-offs of moving components between hardware and software implementations, including issues such as the communication overhead and the performance impacts.

The Planet Earth Research Lab is supported by NASA GSFC under grant number NAG 5-3053. Donations were received from Xilinx under donation number 0000CU7-1978 and Synopsys under agreement number UPS-70004334. Keith Underwood is supported under an NSF Graduate Research Fellowship.
