Adaptable Computing Cluster

General Description

Engineering computer systems almost always involves some cost-performance analysis.[1] For high-performance computing systems, a cluster of workstations with a high-performance network has emerged as a cost-effective solution for many users. One class of these clusters of workstations, distinguished by its exclusive use of COTS (commodity, off-the-shelf) parts, is called a Beowulf machine.

Our Adaptable Computing Cluster (ACC) project is exploring the novel use of an off-the-shelf component called an ACE2card. This PCI-bus card has reconfigurable computing (RC) resources and, in our system, a gigabit Ethernet network interface on the mezzanine adapter. Since the RC is on the critical path to the network switch, the ACE2card and its RC are an integral part of the machine. Although the ACE2card is not a commodity part, we are using it to explore its potential in a Beowulf-class machine. Specifically, we are investigating questions such as how to balance the use of RC between computation and communication and how best to use the system for specific application and problem domains.

The current system consists of eight processors (four dual-processor workstations) connected by a Foundry Net BigIron gigabit switch. An NSF award has enabled us to expand the system to 16 nodes, and the ACE2cards are slated to be delivered in mid-September 2000. Graduate student work is underway to develop Linux device drivers for the ACE2card and the network interface mezzanine card. (Preliminary results show 96 MB/s, roughly 75% of the theoretical bandwidth, through the line to the switch.)

Several investigations have been organized as part of the ACC project. The first is a general evaluation of the performance of the machine. This includes the development of network protocols implemented in the RC, user-level network interfaces, and RC components to support general parallel processing applications via MPI or another message-passing system.
Closely related is an investigation that proposes to port PVFS (a parallel file system developed in the PARL lab) to the ACC machine. PVFS on ACC would implement the PVFS control messages in the RC and thus greatly reduce latency. A third investigation is a collaboration with the Clemson University Genomics Institute (CUGI). It is well established that the performance of genomics applications can be greatly improved by using RC to speed up individual jobs, and the nature of the applications allows them to be run in parallel on Beowulf machines. We therefore expect the ACC machine to show exceptional performance gains, but several important questions remain. For example, how should system resources be balanced between communication and computation to give optimal performance?
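
As a quick check on the preliminary driver throughput quoted above, the following sketch computes what fraction of a gigabit link 96 MB/s represents. It assumes that the theoretical bandwidth is the raw 1 Gb/s line rate (125 MB/s, taking 1 MB as 10^6 bytes) and that Ethernet framing overhead is ignored. The function name is hypothetical and is not part of the ACC software.

```python
# Raw line rate of gigabit Ethernet, in bits per second.
# Assumption: no framing or protocol overhead is subtracted.
GIGABIT_LINE_RATE_BPS = 1_000_000_000


def link_utilization(measured_mb_per_s: float,
                     line_rate_bps: int = GIGABIT_LINE_RATE_BPS) -> float:
    """Return measured throughput as a fraction of the raw line rate.

    Hypothetical helper for illustration; assumes 1 MB = 10**6 bytes.
    """
    peak_mb_per_s = line_rate_bps / 8 / 1_000_000  # 125.0 MB/s for 1 Gb/s
    return measured_mb_per_s / peak_mb_per_s


if __name__ == "__main__":
    fraction = link_utilization(96.0)
    print(f"{fraction:.1%} of the raw gigabit line rate")
```

Under these assumptions the result is about 77%, in line with the rough figure given in the text. If MB were instead taken as 2^20 bytes, the fraction would come out somewhat higher.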
[1] The exceptions are the relatively few `performance at any cost' systems.