As the PC cluster has increased in popularity as a parallel computing platform, the demand for system software for this platform has grown as well. In today's cluster parallel computing environment we find that many of the essential system software components are in place, such as reliable operating systems, local storage systems, and message passing systems. However, one area devoid of production-level products for clusters is that of parallel I/O systems.
The Parallel Virtual File System (PVFS) Project is an effort to provide a high-performance and scalable parallel file system for PC clusters. PVFS is open source and released under the GNU General Public License. It requires no special hardware or modifications to the kernel. PVFS provides four important capabilities in one package:
For a parallel file system to be easily used, it must provide a name space that is the same across the cluster, and it must be accessible via the utilities to which we are all accustomed. PVFS file systems may be mounted on all nodes in the same directory simultaneously, allowing all nodes to see and access all files on the PVFS file system through the same directory scheme. Once mounted, PVFS files and directories can be operated on with all the familiar tools, such as ls, cp, and rm.
In order to provide high-performance access to data stored on the file system by many clients, PVFS spreads data out across multiple cluster nodes, which we call I/O nodes. By spreading data across multiple I/O nodes, applications have multiple paths to data through the network and multiple disks on which data is stored. This eliminates single bottlenecks in the I/O path and thus increases the total potential bandwidth for multiple clients, or aggregate bandwidth.
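To make the idea of spreading data across I/O nodes concrete, here is a minimal sketch of one possible layout. It assumes simple round-robin striping with a fixed stripe size, starting at the first I/O node; the function name `locate` and its parameters are hypothetical and are not part of PVFS.

```python
def locate(offset, stripe_size, num_io_nodes):
    """Map a logical file offset to (I/O node index, offset in that node's local file).

    Assumes round-robin striping with a fixed stripe size, starting at node 0.
    This is an illustrative layout, not necessarily the one PVFS uses.
    """
    stripe_index = offset // stripe_size         # which stripe unit of the whole file
    node = stripe_index % num_io_nodes           # units are dealt out round-robin
    local_stripe = stripe_index // num_io_nodes  # units this node holds before this one
    local_offset = local_stripe * stripe_size + offset % stripe_size
    return node, local_offset


if __name__ == "__main__":
    # With 4 I/O nodes and 64 KiB stripes, consecutive stripe units land on
    # consecutive nodes, so a large sequential access touches every disk.
    for off in (0, 65536, 131072, 262144 + 10):
        print(off, locate(off, 65536, 4))
```

Under these assumptions, neighboring stripe units live on different nodes, which is why a large request from one client, or many requests from different clients, can be served by several disks and network paths at once.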
While the traditional mechanism of system calls for file access is convenient and allows all applications to access files stored on many different file system types, there is overhead in accessing files through the kernel. With PVFS, clients can avoid making requests to the file system through the kernel by linking to the PVFS native API. This library implements a subset of the UNIX operations which directly contact PVFS servers rather than passing through the local kernel. This library can be utilized by applications or by libraries, such as the ROMIO MPI-IO library, for high-speed PVFS access.
The above figure shows how nodes might be assigned for use with PVFS. Nodes are divided into compute nodes, on which applications are run, a management node which handles metadata operations (described in a moment), and I/O nodes which store file data for PVFS file systems. The management and I/O nodes might be used for computation as well; it's up to the administrators. For small clusters it makes sense to overlap these in order to maintain the highest utilization of resources, while for large clusters the I/O and metadata processing is best placed on dedicated resources.
There are four major components to the PVFS system:
The first two components are daemons which run on nodes in the cluster. The metadata server, named mgr, manages all file metadata for PVFS files. Metadata is information which describes a file, such as its name, its place in the directory hierarchy, its owner, and how it is distributed across nodes in the system. By having a daemon which atomically operates on file metadata we avoid many of the shortcomings of storage area network approaches, which have to implement complex locking schemes to ensure that metadata stays consistent in the face of multiple accesses.
The second daemon is the I/O server, or iod. The I/O server
handles storing and retrieving file data stored on local disks connected
to the node. These servers actually create files on an existing file
system on the local node, and they use the traditional
As mentioned before, the PVFS native API provides user-space access to the PVFS servers. This library handles the scatter/gather operations necessary to move data between user buffers and PVFS servers, keeping these operations transparent to the user. The above figure shows data flow in the PVFS system for metadata operations (on the left) and data access (on the right). For metadata operations, applications communicate through the library with the metadata server. For data access the metadata server is eliminated from the access path and instead I/O servers are contacted directly. This is key to providing scalable aggregate performance.
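The scatter/gather step can be illustrated with a small, self-contained sketch. It assumes round-robin striping with a fixed stripe size and models each I/O server as an in-memory `bytearray`; the names `split_request`, `scatter_write`, and `gather_read` are hypothetical and do not correspond to functions in the PVFS library.

```python
from collections import defaultdict


def split_request(offset, length, stripe_size, num_io_nodes):
    """Break a contiguous logical request into per-node pieces.

    Returns {node: [(local_offset, nbytes, buffer_offset), ...]}.
    Assumes round-robin striping starting at node 0 (an illustrative layout).
    """
    pieces = defaultdict(list)
    pos, end = offset, offset + length
    while pos < end:
        stripe_index = pos // stripe_size
        within = pos % stripe_size
        nbytes = min(stripe_size - within, end - pos)  # stop at the stripe boundary
        node = stripe_index % num_io_nodes
        local_offset = (stripe_index // num_io_nodes) * stripe_size + within
        pieces[node].append((local_offset, nbytes, pos - offset))
        pos += nbytes
    return dict(pieces)


def scatter_write(servers, data, offset, stripe_size):
    """Scatter a user buffer across the simulated servers."""
    plan = split_request(offset, len(data), stripe_size, len(servers))
    for node, parts in plan.items():
        store = servers[node]
        for local_offset, nbytes, buf_off in parts:
            needed = local_offset + nbytes
            if len(store) < needed:
                store.extend(bytes(needed - len(store)))  # zero-fill any hole
            store[local_offset:needed] = data[buf_off:buf_off + nbytes]


def gather_read(servers, offset, length, stripe_size):
    """Gather pieces from the simulated servers back into one user buffer."""
    out = bytearray(length)
    plan = split_request(offset, length, stripe_size, len(servers))
    for node, parts in plan.items():
        for local_offset, nbytes, buf_off in parts:
            chunk = bytes(servers[node][local_offset:local_offset + nbytes])
            out[buf_off:buf_off + nbytes] = chunk.ljust(nbytes, b"\0")
    return bytes(out)


if __name__ == "__main__":
    servers = [bytearray() for _ in range(3)]  # three simulated I/O servers
    payload = bytes(range(50))
    scatter_write(servers, payload, offset=5, stripe_size=8)
    print(gather_read(servers, 5, 50, 8) == payload)
```

In this model each server receives only its own list of pieces, so no central node sits in the data path; a real library would issue those per-server requests over the network, possibly concurrently.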
Finally, the PVFS Linux kernel support provides the functionality necessary to mount PVFS file systems on Linux nodes. This allows existing programs to access PVFS files without any modification. This support is not necessary for PVFS use by applications, but it provides an extremely convenient means for interacting with the system. The PVFS Linux kernel support includes a loadable module, an optional kernel patch to eliminate a memory copy, and a daemon, pvfsd, that accesses the PVFS file system on behalf of applications. The daemon uses functions from libpvfs to perform these operations.
The above figure shows data flow through the kernel when the Linux kernel support is used. This technique is similar to the mechanism used by the Coda file system (we designed our system using Coda's implementation as an example). Operations are passed through system calls to the Linux VFS layer. Here they are queued for service by the pvfsd, which receives operations from the kernel through a device file. It then communicates with the PVFS servers and returns data through the kernel to the application.
In order for any file system to be usable, convenient interfaces must be available for it. This issue becomes even more important when applications run in parallel; these applications place heavy demands on a file system. To meet the needs of multiple groups, there are three interfaces through which PVFS may be accessed:
The PVFS native API provides a UNIX-like interface for accessing PVFS files. It also allows users to specify how files will be striped across the I/O nodes in the PVFS system.
The Linux kernel interface, as discussed earlier, allows applications to access PVFS file systems through the normal channels. This allows users to use all the common utilities to perform everyday data manipulation, staging of data onto and off of PVFS file systems, and so on.
ROMIO implements the MPI-2 I/O calls in a portable library. This allows parallel programmers using MPI to access PVFS files through the MPI-IO interface. Additionally, ROMIO implements two optimizations, data sieving and two-phase collective I/O, which can be of great performance benefit. More information on the ROMIO package and these optimizations may be found on the ROMIO web page.
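As a rough illustration of the idea behind data sieving, the sketch below serves several small, noncontiguous reads with one large contiguous read and then extracts the requested pieces in memory. It is a simplification under stated assumptions: the function name `sieved_read` is hypothetical, an ordinary Python file object stands in for a PVFS file, and the whole covering extent is read at once, whereas a real implementation would bound the size of the intermediate buffer.

```python
import io


def sieved_read(f, requests):
    """Serve many small (offset, length) reads with a single contiguous read.

    `f` is any seekable binary file object; `requests` is a list of
    (offset, length) pairs. Returns the requested byte strings in order.
    """
    if not requests:
        return []
    start = min(off for off, _ in requests)
    end = max(off + n for off, n in requests)
    f.seek(start)
    block = f.read(end - start)  # one large access instead of many small ones
    # Pick the wanted pieces out of the block; the bytes in between are discarded.
    return [block[off - start:off - start + n] for off, n in requests]


if __name__ == "__main__":
    f = io.BytesIO(bytes(range(100)))
    print(sieved_read(f, [(10, 3), (50, 2), (90, 4)]))
```

The trade-off is that extra bytes are transferred in exchange for far fewer requests, which tends to pay off when the gaps between the wanted pieces are small relative to the cost of issuing each request.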
From the start, two of the most important goals of PVFS have been high performance and scalability. As cluster sizes grow, high-performance parallel I/O becomes ever more important, and we are working to ensure that PVFS will continue to scale to meet the needs of these new clusters.
Here we outline the performance of the PVFS native API on the Chiba City cluster at Argonne. This is a 256-node dual-processor Pentium III cluster. There are two interconnects in the system, fast ethernet and Myrinet. We will show performance across both of these networks.
In the above graphs we see aggregate performance across the fast ethernet. It is evident that we are not scaling well beyond 24 I/O servers in this case. This is likely due in part to TCP over fast ethernet and its overhead, but the architecture of the network and our choice of I/O server placement also likely played a role (this system has better connectivity between some sets of nodes than others, and we did not account for this in laying out servers and clients).
In this second set of tests we see performance across the Myrinet. Aggregate performance scales with TCP performance across the Myrinet and peaks at approximately the aggregate performance of the disks on our I/O servers (which are slower than the Myrinet network). PVFS scales very well in this environment for the tested sizes.
For more information on these tests and more performance numbers from Chiba City, see XXX.