
The links in the index below lead to more detailed information about a tech report, such as a download of the report, a homepage, a contact, and an abstract (all if available). These reports are also available by anonymous ftp from ftp://ftp.parl.clemson.edu:/pub/techreports
The quality of the job scheduling system has a high impact on the overall performance of a cluster. To design and evaluate scheduling strategies for clusters or parallel computers, simulations are often used. Because scheduling quality depends heavily on the workload served by the cluster, job scheduling simulations need a representative workload. Workload traces logged at various supercomputing facilities are the most realistic data source, but they lack the flexibility needed for evaluating performance under different loading conditions. Synthetic workload traces, created using mathematical or statistical modeling, do provide the required flexibility, but most of the models researchers have used are too simplistic and do not represent the workload served on in-production parallel computer systems. Hence, using these models in simulations yields misleading performance evaluation results.
In this thesis we discuss a synthetic but realistic workload model
based on the statistical modeling of three main parameters of a parallel
computer workload, namely job inter-arrival time, job size, and job
run-time. The model not only reproduces the distribution of these
parameters in the workload traces but also captures the correlation
among the parameters. This research also studies the impact of the
workload parameters on the performance of job scheduling strategies by
comparing the performance evaluations obtained using this realistic
workload with those obtained using a simplistic synthetic workload.
Finally, it is demonstrated that use of a realistic workload results in
a realistic evaluation of scheduling strategies without sacrificing the
workload flexibility needed for such studies.
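A minimal sketch of how a synthetic workload with correlated parameters might be generated is given here for illustration. The distributions (exponential inter-arrival times, log-normal job sizes and run-times), the function name synthetic_workload, and every parameter value are assumptions made for this example; they are not the model developed in the thesis. Correlation between job size and run-time is induced by letting both depend on a shared Gaussian component.

```python
import math
import random


def synthetic_workload(n_jobs, mean_interarrival=60.0, rho=0.5, seed=0):
    """Return a list of (arrival_time, job_size, run_time) tuples.

    All distributions and constants are illustrative assumptions:
    - inter-arrival times are exponential with the given mean (seconds)
    - job size and run-time are log-normal, with correlation `rho`
      between their underlying Gaussian variables
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    rng = random.Random(seed)
    jobs = []
    clock = 0.0
    for _ in range(n_jobs):
        # Advance the clock by one exponential inter-arrival gap.
        clock += rng.expovariate(1.0 / mean_interarrival)
        # Two standard normals with correlation rho.
        z_size = rng.gauss(0.0, 1.0)
        z_run = rho * z_size + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Map the normals to positive quantities (hypothetical parameters).
        size = max(1, round(math.exp(2.0 + 1.0 * z_size)))  # processors
        run_time = math.exp(6.0 + 1.2 * z_run)              # seconds
        jobs.append((clock, size, run_time))
    return jobs


if __name__ == "__main__":
    for arrival, size, run_time in synthetic_workload(5):
        print(f"t={arrival:9.1f}s  size={size:4d}  run={run_time:10.1f}s")
```

Varying mean_interarrival changes the offered load without touching the size and run-time distributions, which is the kind of flexibility that logged traces lack.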
In this paper, we demonstrate that these multi-site scheduling techniques
can be successfully integrated with fairness policies to ensure
that participation in the multi-cluster is beneficial under extremely
disparate workload intensities. Furthermore, we demonstrate that the
trade-off between fairness and performance is relatively small.
In this research, we focus on scheduling strategies that make use of
available information such as network link utilization, per-processor
bandwidths, and job communication topology in order to make intelligent
decisions regarding application partition sizes and job placement. We
provide results that help to establish the relationship between
the quantity of information available a priori to the scheduler and
its ability to improve overall system performance. Additionally,
we demonstrate the dramatic impact that salient workload characteristics
can have on the effectiveness of co-allocation.
This research results in several contributions. Firstly, it identifies
several of the parameters necessary to make intelligent scheduling
decisions with respect to job co-allocation in a multi-cluster. It
demonstrates that by making use of these parameters, a job scheduler
can significantly improve job throughput by carefully managing node
resources as well as shared inter-cluster network bandwidth. This
work establishes the relationship between the amount of specific job
and network attributes that are available to the scheduler and the
performance that can be achieved by making use of that information.
This research also demonstrates the dramatic impact that salient workload
characteristics can have on the effectiveness of parallel job scheduling
in multi-clusters. Finally, it demonstrates that these optimizations
can be successfully integrated with policies that ensure fairness among
participating clusters.
We also present several bandwidth-aware co-allocating meta-schedulers.
These schedulers take inter-cluster network utilization into account as
a means by which to mitigate degraded job run-time performance. We make
use of a bandwidth-centric parallel job communication model that captures
the time-varying utilization of shared inter-cluster network resources.
By doing so, we are able to evaluate the performance of multi-cluster
scheduling algorithms that focus not only on node resource allocation,
but also on shared inter-cluster network bandwidth.
We have identified five key obstacles that limit the ability of parallel file systems to scale to systems with thousands of processors: efficiency, complexity, management, consistency, and fault tolerance. In order to address these obstacles we present the techniques of intelligent servers and collective communication for parallel I/O. These techniques are used to offload work from client processes, optimize high level file system operations, and limit the overhead of network communication in order to provide a comprehensive framework for building scalable file systems. These techniques not only improve file system scalability, but also help to broaden the applicability of parallel file systems to problem domains beyond scientific computing. Intelligent servers are an original concept in which servers transparently take control of optimization decisions and communicate with each other in order to service individual operations. Collective communication is a well known optimization in the fields of message passing and distributed shared memory which we have applied in a novel manner to the parallel file server environment.
In this work we present the Parallel Virtual File System 2 (PVFS2),
along with several key extensions, as an experimental platform for this
study. We then develop an analytical modeling framework for comparing a
variety of file system algorithms in order to predict file system
performance at scale and compare potential optimizations. These models
are verified against a real world implementation with hundreds of
processors and multiple network environments. Next we evaluate the
implementation of intelligent servers and collective communication in
PVFS2 with regard to the five previously listed obstacles to
scalability. We show that throughput for metadata operations can be
doubled for moderately sized systems and project an order of magnitude
improvement for systems with thousands of servers. We simultaneously
reduce client code complexity and decrease CPU overhead by 90%. We show
that management is improved through intelligent server load balancing
and performance monitoring. We also evaluate consistency improvements
with case study analysis and demonstrate improved fault tolerance when
compared to conventional design alternatives. This study concludes with
a summary of how the research goals have been met and how previously
intractable avenues of future work have been enabled.
This work details a model of computation for parallel computing called the Coven model and the Coven PSE which implements the model. The Coven model makes it possible for the PSE to optimize applications implemented in it by understanding the application's computational structure, including data and control flow. The design and implementation of the Coven PSE is detailed and includes a suite of features that ease parallel programming by exploiting properties of the Coven model.
Three optimizations (multi-threading, dynamic load balancing, and
checkpoint/recovery) are detailed and analyzed to study the performance
benefits they provide to Coven applications. Using real applications
such as the FFT, fractal generation, N-Body simulation, and a heat
transfer simulation, this work examines both the performance improvement
gained by employing these optimizations and the complexity of
implementing them outside of Coven. We conclude that Coven can provide
these optimizations with almost no programming cost to the user and can
be used to achieve performance gains. Most importantly, this work makes
these optimizations accessible to users who do not possess parallel
computing expertise.
In this paper, we describe Arches, an object-oriented framework for building
domain-specific PSEs. The framework was designed to support a wide
range of problem domains and to be extendable in a way that allows it
to target very different high-performance computing models. To
demonstrate this flexibility we describe two PSEs that have been
developed from the same framework yet solve different problems
and target very different computing platforms. The Coven PSE supports
parallel applications that need the large-scale parallelism that is
found in cost-effective Beowulf clusters. In contrast, the RCADE PSE
targets reconfigurable computing (FPGA-based) platforms with
fine-grain parallelism. RCADE was designed
to aid NASA Earth Scientists interested in studying satellite
instrument data and who are unlikely to be schooled in low-level
hardware design.
This paper discusses our work in this area as of November 2002.
An important component of a parallel file system is its file system interface, whose requirements differ from those of the normal UNIX interface, particularly for I/O. A parallel I/O interface must support non-contiguous access patterns, collective I/O, and large file sizes in order to achieve good performance with parallel applications. Because it supports significantly different functionality, the interface exposed by a parallel file system is especially important. The file system therefore needs either to provide a parallel I/O interface directly or, at the least, to support the implementation of such an interface on top of it.
The PVFS2 System Interface is the native file system interface for PVFS2 - the
next generation of PVFS. The System Interface provides support for multiple
interfaces such as a POSIX interface or a parallel I/O interface like MPI-IO to
access PVFS2 while also allowing the benefits of abstraction by decoupling the
System Interface from the actual file system implementation. This document
discusses the design and implementation of the System Interface for PVFS2.
Parallel I/O remains a critical problem for cluster computing. A significant number of important applications need high performance parallel I/O and most cluster systems provide enough hardware to deliver the required performance. System software for achieving the desired goals remains in the research and development stage. A number of parallel file systems have achieved remarkable goals in one or more of several key areas related to parallel I/O, but there is still great reluctance to commit to any file system currently available. This is mostly due to the fact that these file systems do not address enough issues at once in a package that is robust enough for widespread use.
Critical goals in the development of an operational parallel file system for clusters include:
These issues give rise to problems with interfaces and semantics, in addition to specific technical problems such as distributed locking, caching, and redundancy. The next generation of parallel file systems must look beyond traditional interfaces, semantics, and implementation methods in order to achieve the desired goals. Of equal importance is the issue of knowing to what extent a given file system achieves these goals. Given that no file system is likely to address ALL of these goals equally well, it is important to be able to measure a given file system's utility in these areas through benchmarking or other evaluation methods.
In this talk we will explore a few of these issues. Included are specific
examples and a case study of the PVFS V2 team's approach to these issues. This
talk is intended to point out areas where we feel parallel file systems must
grow beyond the confines of traditional file systems or traditional
implementation methods, in hopes of generating discussion of these views. While
we will present details of PVFS V2 design and implementation, this talk focuses
more on future development directions that might apply to any parallel file
system project.
In this work, a problem solving environment for parallel computing
named the Component-based
Environment for Remote Sensing (CERSe) is proposed. CERSe's runtime
engine and graphical user interface are presented. Performance results
with and without use of a parallel file system are given which show near
linear speedup can be achieved through I/O tuning.
In this work, a problem solving environment for parallel computing named the Component-based Environment for Remote Sensing (CERSe) is described. CERSe is targeted at producing applications for Linux Beowulf clusters. CERSe uses out-of-core computation techniques to handle extremely large datasets. A multi-threaded, multi-queue runtime engine executes user-supplied modules in parallel over partitioned datasets. The modules supplied by the user can be purely sequential code. Parallelism comes from the partitioning of the dataset by the system and from the ability to execute multiple modules simultaneously. Parallel I/O is used to take advantage of the I/O resources throughout the cluster and provide high performance. A GUI is provided which allows users to specify a dataflow for the application by creating and connecting pluggable modules. The GUI also provides facilities for launching jobs, selecting parameters and data, and analyzing job performance and results.
CERSe's runtime engine and graphical user interface are presented.
Performance results with and without use of a parallel file system are
given which show near linear speedup can be achieved through I/O tuning.
With multiple resources distributed across a system, an effective way to combine their processing power must be examined. To combine the resources into a larger computer, a way is needed to transfer the ownership of remote processors on one cluster to a separate cluster on the Grid. This paper explores the different aspects of scheduling and allocating resources in a Grid of Beowulfs. It also describes the design of a specialized node allocation mechanism for such a cluster. This mechanism integrates with current Beowulf software, which allows the user to see a single computer instead of a distributed system.
The techniques used by this new mechanism to borrow and return nodes between
separate clusters are discussed, as well as methods for order of contact and
ownership of nodes. This paper also surveys a few different allocation and
scheduling tools that are currently used in parallel computers. By using these
ideas and the new mechanism, the organization and the computational power of a
Grid of Beowulfs will be improved.
Users and potential users of this data require an environment which is easy to use, provides ways for code to be reused and extended, and above all makes it possible to handle the growing dataset sizes and algorithm complexity through the use of high performance supercomputers. Existing problem solving environments handle one or two of these problems, but none yet address all three.
In this work, a problem solving environment named the Component-based
Environment for Remote Sensing (CERSe) is proposed. Implementation of CERSe is
discussed, case studies are presented, and the performance of these studies
is given.
In this work we present reactive scheduling (RS), which combines
adaptive methods with scheduling algorithms to optimize service order for
the workload and system state.
First, the Parallel Virtual File System (PVFS), a parallel file
system developed for researching issues in parallel I/O, is described.
Second, a selection of additional scheduling algorithms is added to PVFS
and a study of four workloads is performed on the system.
Third, a model is developed which matches PVFS performance under the
workloads studied. Fourth, the model and scheduling techniques are
combined to create an implementation of RS for PVFS.
This implementation is tested and compared against the best case
performance seen in our workload study, showing as much as 60%
improvement in task service time over the original PVFS system.
Finally, conclusions are drawn on the viability of the RS technique and
directions of future study are described.
In this paper, we describe the design and implementation of PVFS and present
performance results on the Chiba City cluster at
Argonne. We provide performance results for a workload of concurrent reads and
writes for various numbers of compute nodes, I/O
nodes, and I/O request sizes. We also present performance results for MPI-IO on
PVFS, both for a concurrent read/write workload
and for the BTIO benchmark. We compare the I/O performance when using a Myrinet
network versus a Fast Ethernet network for
I/O-related communication in PVFS. We obtained read and write bandwidths as
high as 700 Mbytes/sec with Myrinet and
225 Mbytes/sec with Fast Ethernet.
The core of our design tool is a general Algorithm Description Format
(ADF) which represents an algorithm as an attributed graph, and a library
of components, which are tailored to a particular FPGA device. In this
paper we present a set of tools which operate on the ADF representation
to provide a number of functions to aid in the creation of applications
for configurable computing platforms. These tools include front-end tools
for specifying a design, a suite of analysis tools to aid the user in
refining and constraining the design, and a set of back-end tools which
select components from the library based on the design constraints and
place and route these components on the configurable computing platform.
In this paper, particular emphasis is placed on the component model and
the analysis tools.
This paper discusses expanding this library of components by implementing
the following single precision floating point operations: sine, cosine,
arctangent, arcsine, arccosine, square root, and natural logarithm.
Each component is designed around the CORDIC shift-and-add algorithms.
CORDIC is a set of hardware-efficient algorithms which use only shifts
and adds/subtracts to compute a number of trigonometric and logarithmic
functions. The CORDIC algorithms provide the core for each component.
Floating-point extensions are added before and after the CORDIC core
to handle IEEE single precision floating-point values. Each component
is fully pipelined to achieve high performance. Beyond discussing how
each operation is implemented, the space requirements of each component
on current and future Xilinx FPGAs are analyzed. Finally, performance
results are compared for each component individually and for several
equations against several current workstations.
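The shift-and-add idea behind CORDIC can be illustrated with a short software sketch of the rotation mode for sine and cosine. This is a generic fixed-point version written for illustration only: the word width (FRAC_BITS), the iteration count, and the function name cordic_sin_cos are assumptions, and the code does not reproduce the pipelined floating-point FPGA components described in the paper. Inside the loop only shifts, additions, and subtractions are used; floating point appears only when building the constant tables and converting the result.

```python
import math

FRAC_BITS = 30     # fixed-point fractional bits (assumed for this sketch)
ITERATIONS = 30    # one micro-rotation per bit of precision

# Table of atan(2^-i) in fixed point; in hardware this would be a small ROM.
ANGLES = [round(math.atan(2.0 ** -i) * (1 << FRAC_BITS))
          for i in range(ITERATIONS)]

# The micro-rotations stretch the vector by a constant gain; starting from
# 1/gain instead of 1 cancels it without any multiplication in the loop.
_gain = 1.0
for _i in range(ITERATIONS):
    _gain *= math.sqrt(1.0 + 2.0 ** (-2 * _i))
K = round((1 << FRAC_BITS) / _gain)


def cordic_sin_cos(theta):
    """Return (sin(theta), cos(theta)) for theta in [-pi/2, pi/2]."""
    if abs(theta) > math.pi / 2:
        raise ValueError("theta must lie in [-pi/2, pi/2]")
    x, y = K, 0
    z = round(theta * (1 << FRAC_BITS))   # residual angle still to rotate
    for i in range(ITERATIONS):
        if z >= 0:   # rotate counter-clockwise by atan(2^-i)
            x, y, z = x - (y >> i), y + (x >> i), z - ANGLES[i]
        else:        # rotate clockwise by atan(2^-i)
            x, y, z = x + (y >> i), y - (x >> i), z + ANGLES[i]
    scale = float(1 << FRAC_BITS)
    return y / scale, x / scale


if __name__ == "__main__":
    s, c = cordic_sin_cos(0.5)
    print(s, math.sin(0.5))
    print(c, math.cos(0.5))
```

Each loop iteration corresponds to one stage that could be pipelined in hardware; angles outside the stated range would need a separate range-reduction step, which is omitted here.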
The Parallel Virtual File System (PVFS) Project is an effort to provide a
parallel file system for PC clusters. As a parallel file system,
PVFS provides a global name space, striping of data across multiple I/O nodes,
and multiple user interfaces. The system is
implemented at the user level, so no kernel modifications are necessary to
install or run the system. All communication is performed
using TCP/IP, so no additional message passing libraries are needed, and
support is included for using existing binaries on PVFS files.
This paper describes the key aspects of the PVFS system and presents recent
performance results on a 64-node Beowulf
workstation. Conclusions are drawn and areas of future work are discussed.
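To illustrate what striping data across multiple I/O nodes means, the sketch below maps a byte range of a logical file onto I/O nodes under a simple round-robin layout. The layout rule, the function names locate and split_request, and the stripe size used in the example are assumptions chosen for illustration; they are not taken from the PVFS implementation.

```python
def locate(offset, stripe_size, n_io_nodes):
    """Map a logical file offset to (io_node, offset_within_that_node).

    Assumes a plain round-robin layout: stripe 0 on node 0, stripe 1 on
    node 1, and so on, wrapping around after n_io_nodes stripes.
    """
    stripe_index = offset // stripe_size
    node = stripe_index % n_io_nodes
    # Each full round places one stripe on every node.
    local_offset = (stripe_index // n_io_nodes) * stripe_size + offset % stripe_size
    return node, local_offset


def split_request(offset, length, stripe_size, n_io_nodes):
    """Split a contiguous request into per-node (node, local_offset, size) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        node, local_offset = locate(offset, stripe_size, n_io_nodes)
        # A piece never crosses a stripe boundary.
        chunk = min(end - offset, stripe_size - offset % stripe_size)
        pieces.append((node, local_offset, chunk))
        offset += chunk
    return pieces


if __name__ == "__main__":
    # A 10000-byte request that straddles the first stripe boundary.
    for piece in split_request(60000, 10000, stripe_size=65536, n_io_nodes=4):
        print(piece)
```

Because consecutive stripes land on different nodes, a large request is served by several I/O nodes at once, which is where the aggregate bandwidth of a striped file system comes from.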