A Multi-cluster Computational Grid Simulator for Parallel Job Scheduling Research
Initially developed and maintained in the Parallel Architecture Research Laboratory
Continuing research in the Computer Science Department at Coastal Carolina University
Papers and Presentations      
  • ACMSE 2012 (29-MAR-2012)
    • "Application Monitoring and Checkpointing in HPC: Looking Towards Exascale Systems",
      (Paper: PDF, Local: PDF)
      (Presentation: PDF)
  • HPDC 2010 (23-JUN-2010)
    • "Impact of Sub-optimal Checkpoint Intervals on Application Efficiency in Computational Clusters",
      (Paper: PDF, Local Copy: PDF)
  • CCPE Journal: Special Issue (10-SEP-2009)
    • "Network-aware Selective Job Checkpoint and Migration to Enhance Co-allocation In Multi-cluster Systems",
      (Paper: PDF, Local Copy: PDF)
  • National HPC Workshop on Resilience: (12-AUG-2009)
    • "Failure Test Harness and the Impact of Sub-optimal Checkpoint Intervals on Application Efficiency",
      (Poster only: PDF)
  • LANL Resilience Seminar: Invited Talk (07-JUL-2009)
    • "Impact of Non-optimal Checkpoint Intervals on Overall Application Efficiency in Cluster Computing: A Simulation-Based Study",
      (Presentation: PDF)
  • CCGrid 2008: Resilience 2008
    • "Application Resilience: Making Progress in Spite of Failure",
      (Paper: PDF, Local: PDF)
      (Presentation: PPT)
  • Middleware 2007: MGC 2007
    • "Using Checkpointing to Recover from Poor Multi-site Parallel Job Scheduling Decisions",
      (Paper: PDF) (Presentation: PDF)
  • PDCS 2007
    • "The Impact of Error in User-Provided Bandwidth Estimates on Multi-site Parallel Job Scheduling Performance",
      (Paper: PDF) (Presentation: PDF)
    • "Ensuring Fairness Among Participating Clusters During Multi-site Parallel Job Scheduling",
      (Paper: PDF) (Presentation: PDF)
  • ICPADS 2006
    • "The Impact of Information Availability and Workload Characteristics on the Performance of Job Co-allocation in Multi-clusters",
      (Paper: PDF) (Presentation: PDF)
  • Ph.D. Dissertation, Dec. 2005
    • "Improving Parallel Job Scheduling Performance in Multi-clusters Through Selective Job Co-allocation",
      (Manuscript: PDF, PS) (Presentation: PDF)
  • CAEFF: Site Visit 2005
    • "Parallel Job Scheduling in a Mini-Grid",
      (Poster: PPT, PDF)
  • Journal of Supercomputing: Vol. 34
    • "Characterization of Bandwidth-aware Meta-schedulers for Co-allocating Jobs Across Multiple Clusters",
      (Paper: PDF) (Local draft: PDF, PS)
  • SC 2004: NASA Booth
  • CLUSTER 2004
    • "Bandwidth-aware Co-allocating Meta-schedulers for Mini-grid Architectures",
      (Paper: PDF, PS) (Presentation: PDF)
  • SURE: 2004 Program
    • "Java Based Visualizer for BeoSim",
      (Presentation: PPT) (Poster: PPT)
  • IPDPS 2004: PMEO-PDS
    • "Job Communication Characterization and its Impact on Meta-scheduling Co-allocated Jobs in a Mini-grid",
      (Paper: PDF, PS) (Presentation: PDF)
  • CAEFF: Site Visit 2004
    • "Parallel Job Scheduling in a Mini-Grid",
      (Poster: PPT, PDF)
  • CAEFF: Site Visit 2003
    • "Meta-scheduling for Mini-grids",
      (Poster: PPT, PDF)
  • SC 2002: NASA Booth (GSFC)
    • "Beowulf/Mini-Grid System Software",
      (Poster: PDF) (Whitepaper: PDF)
  • PARL TRs: PARL-2002-009
    • "Computational Mini-Grid Research at Clemson University",
      (Report: PS, PDF)
    BeoViz: BeoSim's Front-end Visualizer

    Computational multi-clusters are an important emerging class of supercomputing architectures. As multi-cluster systems become more prevalent, techniques for efficiently exploiting these resources become increasingly significant. A critical aspect of exploiting these resources is the challenge of scheduling. To maximize job throughput, multi-cluster schedulers must simultaneously leverage the collective computational resources of all of their participating clusters. By doing so, jobs that would otherwise wait for nodes to become available on a single cluster can potentially run earlier by aggregating disjoint resources throughout the multi-cluster. This procedure can result in dramatic reductions in queue waiting times.

    The main caveat of this approach is that by mapping jobs across cluster boundaries, inter-cluster network resources are also consumed. If the inter-cluster network links become too saturated with traffic, any co-allocated jobs may experience degraded runtime performance due to the communication bottleneck present in the network infrastructure. This degradation in runtime performance can potentially offset the benefit of performing job co-allocation in the first place. More precisely, the increase in job runtime due to link saturation can rapidly outweigh the decrease in queue waiting time, thus resulting in poorer overall system performance.
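
    The trade-off described above can be illustrated with a minimal sketch. It assumes that turnaround time is simply queue wait plus runtime, and that link saturation is summarized by a single multiplicative `slowdown` factor; both the function names and that factor are hypothetical and are not taken from BeoSim.

```python
def turnaround(wait_time, runtime):
    """Turnaround time is time spent queued plus time spent executing."""
    return wait_time + runtime


def coallocation_helps(wait_single, wait_coalloc, runtime, slowdown):
    """Return True when co-allocating yields a shorter turnaround.

    slowdown is a hypothetical multiplicative factor (>= 1.0) applied to
    the job's runtime because of inter-cluster link saturation.
    """
    single = turnaround(wait_single, runtime)
    coalloc = turnaround(wait_coalloc, runtime * slowdown)
    return coalloc < single
```

    For example, a two-hour job that would wait one hour on a single cluster but only ten minutes when co-allocated still comes out ahead with a slowdown of 1.2, but not with a slowdown of 1.5.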

    Multi-cluster schedulers must make use of all available information pertaining to job communication structure, as well as network topology and utilization, in order to improve job throughput while mitigating any negative impact on job runtime performance due to network congestion. Additionally, these schedulers must make reasonable co-allocation decisions in the absence of specific job and network information, as this information is not always available.

    In this research, we have developed a bandwidth-centric job communication model that captures the interaction and impact of simultaneously co-allocating jobs across multiple clusters. We compare our dynamic model with previous research that utilizes a fixed execution time penalty for co-allocated jobs. We explore the interaction of simultaneously co-allocated jobs and the contention they often create in the network infrastructure of a dedicated computational multi-cluster.
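
    To make the contrast between the two kinds of model concrete, the sketch below compares a fixed execution time penalty with a simple saturation-dependent one. It is an illustration only, not the model developed in this research: the 1.25 penalty, the `comm_fraction` parameter, and the rule that only the communication portion of a job stretches once the link is oversubscribed are all assumptions made for the example.

```python
def fixed_penalty_runtime(runtime, penalty=1.25):
    """Fixed model: every co-allocated job is stretched by the same factor.

    The default penalty is an arbitrary illustrative value.
    """
    return runtime * penalty


def link_saturation(demands, capacity):
    """Ratio of total offered inter-cluster traffic to link capacity."""
    return sum(demands) / capacity


def bandwidth_centric_runtime(runtime, comm_fraction, demands, capacity):
    """Dynamic model (illustrative assumption): only the communication
    portion of the job is stretched, and only when the shared link is
    oversubscribed by the jobs currently using it."""
    saturation = link_saturation(demands, capacity)
    stretch = max(1.0, saturation)
    compute_time = runtime * (1.0 - comm_fraction)
    comm_time = runtime * comm_fraction * stretch
    return compute_time + comm_time
```

    Under these assumptions, the fixed model charges the same cost regardless of what else is running, while the dynamic one charges nothing on an idle link and progressively more as co-allocated jobs contend for the same bandwidth.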

    We have also developed several bandwidth-aware co-allocating meta-schedulers. These schedulers take inter-cluster network utilization into account in order to mitigate degraded job runtime performance. By making use of a bandwidth-centric parallel job communication model, we are able to evaluate the performance of multi-cluster scheduling algorithms that focus not only on node resource allocation, but also on shared inter-cluster network bandwidth.
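
    One way a scheduler might take link utilization into account is sketched below. This is a hypothetical placement routine, not one of the meta-schedulers described here: it assumes each cluster has a single inter-cluster link with a known amount of unused bandwidth, and that each co-allocated node places a fixed demand (`bw_per_node`) on that link.

```python
def place_job(nodes_needed, bw_per_node, free_nodes, link_free_bw):
    """Return a {cluster: node_count} mapping, or None if the job must wait.

    free_nodes   -- dict of cluster name -> idle node count
    link_free_bw -- dict of cluster name -> unused inter-cluster bandwidth
    bw_per_node  -- hypothetical per-node inter-cluster bandwidth demand
    """
    # Prefer a single cluster: no inter-cluster traffic is generated.
    for cluster, free in sorted(free_nodes.items()):
        if free >= nodes_needed:
            return {cluster: nodes_needed}

    # Otherwise co-allocate, visiting the largest idle pools first.
    mapping, remaining = {}, nodes_needed
    for cluster, free in sorted(free_nodes.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        # Cap the chunk so it does not oversubscribe this cluster's link.
        if bw_per_node > 0:
            bw_limit = int(link_free_bw.get(cluster, 0.0) // bw_per_node)
        else:
            bw_limit = free
        take = min(free, remaining, bw_limit)
        if take > 0:
            mapping[cluster] = take
            remaining -= take
    return mapping if remaining == 0 else None
```

    Returning `None` leaves the job in the queue, which reflects the idea that waiting can be preferable to a placement that would saturate the network.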

    BeoSim is a discrete-event simulator that has been implemented for the purpose of studying multi-site parallel job scheduling algorithms in the context of a multi-cluster computational grid. It is not presently available for download; however, there are several other grid simulators that are. See the "Related Research" section for more information.
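
    For readers unfamiliar with the technique, the following is a generic sketch of a discrete-event scheduling simulation. It is not BeoSim's code: it assumes a single pool of nodes and a plain first-come-first-served policy, and the `simulate` function and its tuple layout are invented for the example.

```python
import heapq
import itertools


def simulate(jobs, total_nodes):
    """jobs: list of (arrival_time, nodes, runtime). Returns {job_id: start_time}.

    First-come-first-served on a single pool of nodes; a deliberately
    simplified stand-in for a real scheduling policy.
    """
    counter = itertools.count()          # tie-breaker for equal timestamps
    events = []                          # min-heap of (time, seq, kind, job_id)
    for job_id, (arrival, _, _) in enumerate(jobs):
        heapq.heappush(events, (arrival, next(counter), "arrive", job_id))

    free, queue, starts = total_nodes, [], {}
    while events:
        # Jump the clock straight to the next event; nothing happens between.
        now, _, kind, job_id = heapq.heappop(events)
        if kind == "arrive":
            queue.append(job_id)
        else:                            # "finish": release the job's nodes
            free += jobs[job_id][1]
        # Start queued jobs in order while the head of the queue fits.
        while queue and jobs[queue[0]][1] <= free:
            head = queue.pop(0)
            free -= jobs[head][1]
            starts[head] = now
            finish_time = now + jobs[head][2]
            heapq.heappush(events, (finish_time, next(counter), "finish", head))
    return starts
```

    The simulated clock advances only when an event is popped from the heap, which is what distinguishes a discrete-event simulation from one that steps through time at fixed intervals.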

    Current Team      
  • William M. Jones
    • Project Leader, Coastal Carolina University
  • Nathan A. DeBardeleben
    • Los Alamos National Laboratory
  • John T. Daly
    • Center for Exceptional Computing, DoD
    Former Members      
  • Louis W. Pang
    • Lead Programmer, Currently at OPNET
  • Michael F. Bassily
    • Visualizer Programmer, Currently at Clemson University
  • Nishant Shrivastava
    • Workload Generation, Currently at Cisco
  • Walter B. Ligon III
    • PhD Advisor, Currently at Clemson University
  • Daniel Stanzione
    • Colleague, Currently at ASU
    Related Research      
  • DARPA Exascale Report
  • Green Computer Science
  • Dynamic Virtual Clustering
  • HECIOS I/O Simulator
  • GSSIM Grid Scheduling Simulator
  • GridSim Grid Simulator
  • SimGrid Grid Simulator
  • OptorSim Grid Simulator
  • Bricks Grid Simulator
  • Parallel Job Scheduling
  • Parallel Job Scheduling Strategies Workshop
  • Parallel Workload Archive
  • Computer Security Conference 2008
  • Proposal Goals
  • Current Goals
  • Previous Goals
  • Proposal
  • Initial problem addressed
  • Visualizer specification
    Possible Publication Venues
  • ISPASS 2006 -- Due Oct. 7, 2005
  • ICPADS 2006 -- Due Jan. 15, 2006
  • ICPP 2006 -- Due Feb. 1, 2006
  • TCPP Announce -- Archives
  • Cluster and Grid Computing -- Journals and Confs
  • Cluster and Grid Computing -- Local (Modified) Copy (Fall 2006)
  • Conferences
  • Conferences - Local Copy