SIGCSE 2013
(08-MAR-2013)
"Using FPGAs as a Reconfigurable Teaching Tool Throughout CS Systems Curriculum",
(Paper: PDF,
Local: PDF)
(Presentation: KEY)
ACMSE 2012
(29-MAR-2012)
"Application Monitoring and Checkpointing in HPC:
Looking Towards Exascale Systems",
(Paper: PDF,
Local: PDF)
(Presentation: PDF)
HPDC 2010 (23-JUN-2010)
"Impact of Sub-optimal Checkpoint Intervals on Application Efficiency in Computational Clusters",
(Paper: PDF,
Local Copy: PDF)
CCPE Journal:
Special Issue (10-SEP-2009)
"Network-aware Selective Job Checkpoint and Migration to Enhance Co-allocation In Multi-cluster Systems",
(Paper: PDF,
Local Copy: PDF)
National HPC Workshop on Resilience: (12-AUG-2009)
"Failure Test Harness and the Impact of Sub-optimal Checkpoint Intervals on Application Efficiency",
(Poster only: PDF)
LANL Resilience Seminar: Invited Talk (07-JUL-2009)
"Impact of Non-optimal Checkpoint Intervals on Overall
Application Efficiency in Cluster Computing: A Simulation-Based Study",
(Presentation: PDF)
CCGrid 2008:
Resilience 2008
"Application Resilience: Making Progress in Spite of Failure",
(Paper: PDF,
Local: PDF)
(Presentation: PPT)
Middleware 2007:
MGC 2007
"Using Checkpointing to Recover from Poor Multi-site Parallel Job Scheduling Decisions",
(Paper: PDF)
(Presentation: PDF)
PDCS 2007
"The Impact of Error in User-Provided Bandwidth Estimates on Multi-site Parallel Job Scheduling Performance",
(Paper: PDF)
(Presentation: PDF)
ICPADS 2006: SRMPDS
"Ensuring Fairness Among Participating Clusters During Multi-site Parallel
Job Scheduling",
(Paper: PDF)
(Presentation: PDF)
ICPADS 2006
"The Impact of Information Availability and Workload Characteristics on the
Performance of Job Co-allocation in Multi-clusters",
(Paper: PDF)
(Presentation: PDF)
Ph.D. Dissertation, Dec. 2005
"Improving Parallel Job Scheduling Performance in Multi-clusters Through Selective Job Co-allocation",
(Manuscript: PDF, PS)
(Presentation: PDF)
CAEFF: Site Visit 2005
"Parallel Job Scheduling in a Mini-Grid",
(Poster: PPT,
PDF)
Journal of Supercomputing: Vol. 34
"Characterization of Bandwidth-aware Meta-schedulers for Co-allocating Jobs Across Multiple Clusters",
(Paper: PDF) (Local draft: PDF, PS)
SC 2004: NASA Booth
CLUSTER 2004
"Bandwidth-aware Co-allocating Meta-schedulers for Mini-grid Architectures",
(Paper: PDF, PS)
(Presentation: PDF)
SURE:
2004 Program
"Java Based Visualizer for BeoSim",
(Presentation: PPT)
(Poster: PPT)
IPDPS 2004: PMEO-PDS
"Job Communication Characterization and its Impact on Meta-scheduling Co-allocated Jobs in a Mini-grid",
(Paper: PDF, PS)
(Presentation: PDF)
CAEFF: Site Visit 2004
"Parallel Job Scheduling in a Mini-Grid",
(Poster: PPT,
PDF)
CAEFF:
Site Visit 2003
"Meta-scheduling for Mini-grids",
(Poster: PPT, PDF)
SC 2002: NASA Booth (GSFC)
"Beowulf/Mini-Grid System Software",
(Poster: PDF) (Whitepaper: PDF)
PARL TR's: PARL-2002-009
"Computational Mini-Grid Research at Clemson University",
(Report: PS, PDF)
BeoViz: BeoSim's Front-end Visualizer
Computational multi-clusters are an important emerging class of
supercomputing architectures. As multi-cluster systems become more
prevalent, techniques for efficiently exploiting these resources become
increasingly significant. A critical aspect of exploiting these resources
is the challenge of scheduling. To maximize job throughput, a
multi-cluster scheduler must simultaneously leverage the collective
computational resources of all of its participating clusters. By doing
so, jobs that would otherwise wait for nodes to become available on
a single cluster can potentially run earlier by aggregating disjoint
resources throughout the multi-cluster. This procedure can result in
dramatic reductions in queue waiting times.
The main caveat of this approach is that by mapping jobs across cluster
boundaries, inter-cluster network resources are also consumed. If
the inter-cluster network links become too saturated with traffic, any
co-allocated jobs may experience degraded runtime performance due to the
communication bottleneck present in the network infrastructure. This
degradation in runtime performance can potentially offset the benefit
of performing job co-allocation in the first place. More precisely,
the increase in job runtime due to link saturation can rapidly outweigh
the decrease in queue waiting time, thus resulting in poorer overall
system performance.
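The trade-off described above can be made concrete with a small sketch. It assumes the quantity of interest is turnaround time (queue wait plus runtime); the function names and the numbers are hypothetical and chosen only for illustration.

```python
def turnaround(wait_time, runtime):
    """Turnaround time is time spent queued plus time spent running."""
    return wait_time + runtime


def coallocation_pays_off(local_wait, local_runtime, coalloc_wait, coalloc_runtime):
    """True when the co-allocated mapping finishes the job sooner overall."""
    return turnaround(coalloc_wait, coalloc_runtime) < turnaround(local_wait, local_runtime)


# Hypothetical numbers, in seconds.
# Lightly loaded links: waiting drops by 3000 s, runtime grows by only 400 s.
print(coallocation_pays_off(3600, 2000, 600, 2400))   # True
# Saturated links: the same wait saving is swamped by a 4000 s slowdown.
print(coallocation_pays_off(3600, 2000, 600, 6000))   # False
```

In the second case the job starts an hour earlier yet finishes later, which is the situation a bandwidth-aware scheduler tries to avoid.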
Multi-cluster schedulers must make use of all available information
pertaining to job communication structure as well as network topology
and utilization in order to improve job throughput while mitigating any
negative impact to job runtime performance due to network congestion.
Additionally, these schedulers must make reasonable co-allocation
decisions in the absence of specific job and network information, as
this information is not always available.
In this research, we have developed a bandwidth-centric job communication
model that captures the interaction and impact of simultaneously
co-allocating jobs across multiple clusters. We compare our dynamic
model with previous research that utilizes a fixed execution time penalty
for co-allocated jobs. We explore the interaction of simultaneously
co-allocated jobs and the contention they often create in the network
infrastructure of a dedicated computational multi-cluster.
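To illustrate the difference between a fixed penalty and a dynamic, bandwidth-centric penalty, the sketch below contrasts the two. It is not the model developed in this research: the proportional-slowdown rule, the 25% fixed penalty, the single shared link, and all names are assumptions made for the example.

```python
def fixed_penalty_runtime(base_runtime, penalty=0.25):
    """Fixed model: every co-allocated job is slowed by the same fraction."""
    return base_runtime * (1.0 + penalty)


def bandwidth_aware_runtimes(jobs, link_capacity):
    """Dynamic model (assumed form): jobs sharing one inter-cluster link.

    jobs: list of (compute_time, comm_time, bandwidth_demand) tuples.
    When total demand exceeds link_capacity, every job's communication
    phase is stretched by the oversubscription ratio.
    """
    total_demand = sum(demand for _, _, demand in jobs)
    slowdown = max(1.0, total_demand / link_capacity)
    return [compute + comm * slowdown for compute, comm, _ in jobs]


# Hypothetical jobs: (compute seconds, communication seconds, demand in MB/s).
light = (800, 200, 40)
heavy = (500, 500, 80)

# Alone on an 80 MB/s link, the light job sees no contention.
print(bandwidth_aware_runtimes([light], 80))          # [1000.0]
# Sharing the link with the heavy job oversubscribes it by 1.5x.
print(bandwidth_aware_runtimes([light, heavy], 80))   # [1100.0, 1250.0]
# The fixed model charges the same penalty regardless of link load.
print(fixed_penalty_runtime(1000))                    # 1250.0
```

Under the fixed model the light job pays the same cost whether or not the link is contended, whereas the dynamic form makes its runtime depend on which other jobs are co-allocated at the same time.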
We have also developed several bandwidth-aware co-allocating
meta-schedulers. These schedulers take inter-cluster network
utilization into account in order to mitigate degraded job
runtime performance. By using a bandwidth-centric parallel
job communication model, we are able to evaluate the performance of
multi-cluster scheduling algorithms that focus not only on node resource
allocation, but also on shared inter-cluster network bandwidth.
BeoSim is a discrete-event simulator built to study multi-site parallel
job scheduling algorithms in the context of a multi-cluster
computational grid. It is not presently available for download; however,
several other grid simulators are. See the "Related Research" section on the right
for more information.
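For readers unfamiliar with discrete-event simulation, the following minimal sketch shows the general technique on a single cluster with first-come-first-served scheduling. It is not BeoSim's code; the function name, the job tuple layout, and the FCFS policy are assumptions chosen to keep the example short.

```python
import heapq
from collections import deque


def simulate_fcfs(jobs, total_nodes):
    """Simulate first-come-first-served scheduling on one cluster.

    jobs: list of (arrival_time, nodes_needed, runtime) tuples.
    Returns a dict mapping job index to its queue waiting time.
    A job that needs more than total_nodes never starts and is omitted.
    """
    # Event queue ordered by time; finish events (kind 0) sort before
    # arrivals (kind 1) at the same timestamp so freed nodes are visible.
    events = [(arrival, 1, idx) for idx, (arrival, _, _) in enumerate(jobs)]
    heapq.heapify(events)
    free_nodes = total_nodes
    waiting = deque()
    waits = {}

    while events:
        now, kind, idx = heapq.heappop(events)
        if kind == 1:
            waiting.append(idx)
        else:
            free_nodes += jobs[idx][1]
        # Start queued jobs strictly in arrival order while they fit.
        while waiting and jobs[waiting[0]][1] <= free_nodes:
            nxt = waiting.popleft()
            arrival, nodes, runtime = jobs[nxt]
            free_nodes -= nodes
            waits[nxt] = now - arrival
            heapq.heappush(events, (now + runtime, 0, nxt))
    return waits


# Three hypothetical jobs on a 6-node cluster.
print(simulate_fcfs([(0, 4, 10), (1, 4, 5), (2, 2, 3)], 6))
# {0: 0, 1: 9, 2: 8}
```

The third job would fit in the two idle nodes at time 2, but strict FCFS holds it behind the second job. A multi-site scheduler would instead look for idle nodes on other clusters, which is where the co-allocation decisions discussed above come into play.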
Possible Publication Venues