(Old) Parallel Virtual File System Frequently Asked Questions List

NOTE: this FAQ is for the older "v1" version of PVFS. If you are looking for the FAQ for the newer PVFS v2, please go to pvfs.org.

NOTE: for the quickest answers to previously asked questions that are not listed here, search the PVFS mailing list archives.
  1. What is PVFS?
  2. What architectures does PVFS support?
  3. How do I install PVFS?
  4. What are these enablemgr and enableiod scripts all about?
  5. How can I store PVFS data on multiple disks on a single node?
  6. How can I run multiple I/O daemons on the same node?
  7. I ran Bonnie and the performance is terrible. Why? Is there anything I can do?
  8. Why is program XXX so slow?
  9. Does PVFS support redundancy? What if a node fails?
  10. Why do my modification dates change on PVFS files that I am reading from?
  11. How do I get MPI-IO for PVFS?
  12. When I try to compile ROMIO (MPI-IO) with pvfs support it fails with a list of "undefined reference" errors. How do I fix this?
  13. Can I directly manipulate PVFS files on the manager or I/O servers without going through the client interface?
  14. How can I back up my PVFS file system?
  15. How can I contribute to the PVFS project?
  16. What are the glibc wrappers and/or where are they now?
  17. When did you add symlinks to PVFS?
  18. Can I add, remove, or change the order of the I/O daemons on an existing PVFS file system?
  19. Does PVFS work across heterogeneous architectures?
  20. How do I keep the locate cron job from scanning the PVFS directory?
  21. Why does df show less free space than I think it should? What can I do about that?
  22. When I try to compile pvfs-kernel, I get an error message that says: /usr/include/linux/modversions.h:1:2: #error Modules should never use kernel-headers system headers. What's wrong?
  23. Can I use multiple managers (mgr processes) in PVFS?
  24. I see unresolved symbols errors when I try to load the kernel module. What should I do?
  25. When I load the pvfs module I see the following message: "devfs_register(pvfsd): could not append to parent, err: -17". What does this mean?
  26. How can I use multiple disks on each of my I/O servers?
  27. Does PVFS have a maximum file system size? If so, what is it?

What is PVFS?

PVFS is a virtual parallel file system which operates on clusters of PCs running Linux. It is virtual in that file data is actually stored on multiple file systems on local disks, not by PVFS itself. By parallel we mean that data is stored on multiple independent PCs, or nodes, and that multiple clients can access this data simultaneously.


What architectures does PVFS support?

PVFS (at least as of version 1.5.4) is known to compile and work properly on Alpha-, x86-, and IA64-based Linux systems. If you have had success with any other platforms, please let us know.


How do I install PVFS?

If you want to get your system up and running quickly, you should probably start off by reading the Quick Start guide.


What are these enablemgr and enableiod scripts all about?

They are simple scripts that set up the links in the rc.d directories on Red Hat machines so that the iod or mgr is started at boot time. For example, if you want the manager to start on a machine at boot, run the enablemgr script once on that machine. These scripts (if used) only need to be run once per machine; the daemons will start on boot from then on.
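
For example, a minimal sketch (where the scripts live depends on how you installed PVFS, so the paths here are assumptions; run them from wherever your PVFS distribution put them):

# on the machine that will run the metadata manager (mgr)
./enablemgr

# on each machine that will run an I/O daemon (iod)
./enableiod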


How can I store PVFS data on multiple disks on a single node?

You have two options. One is to use the md driver or something similar to create a disk array or RAID, create a file system on that, and use one I/O daemon to perform accesses to the new file system.

The alternative is to run multiple I/O daemons on the same node, one per file system you wish to use.


How can I run multiple I/O daemons on the same node?

This is easy; you just give them separate ports on which to communicate. This involves both the .iodtab file and some iod.conf files for configuration.

Remember, the .iodtab file exists in the root of the metadata directory tree and is used by the manager to determine the locations of iods. Examples are in the User's Guide.

For example, let's assume that you are going to run PVFS with two nodes used for I/O, but you want to use two disks on each node. Your .iodtab file might look like:

192.168.0.1:7000 
192.168.0.1:7001 
192.168.0.2:7000
192.168.0.2:7001

This would specify that four iods will be used for the file system. Two are running on the machine 192.168.0.1. One is listening on port 7000, the other on port 7001. Likewise there are two iods running on 192.168.0.2 listening on ports 7000 and 7001.

The iod.conf file tells a given iod about its configuration. We'll continue the example. Let's assume that the disks are mounted on the nodes at /pvfs_disk0 and /pvfs_disk1 (on both nodes). So we'll build a couple of configuration files:

Config file 1, "/etc/iod0.conf":

port 7000 
user nobody 
group nobody 
rootdir / 
datadir /pvfs_disk0 
logdir /tmp

Config file 2, "/etc/iod1.conf":

port 7001 
user nobody 
group nobody 
rootdir / 
datadir /pvfs_disk1 
logdir /tmp 

Ok. So we copy these two files out to the two nodes. Then we start the iods (on each of the two nodes) with "iod /etc/iod0.conf" and "iod /etc/iod1.conf". The iods will read their respective configuration files and prepare themselves to service requests.
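
As a minimal sketch of the startup on each I/O node (the netstat check is optional and merely confirms that both daemons are listening on their ports):

iod /etc/iod0.conf
iod /etc/iod1.conf

# optional: verify that both iods are listening
netstat -tln | grep ':700'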


I ran Bonnie and the performance is terrible. Why? Is there anything I can do?

Bonnie is a file system benchmark written by Tim Bray (see http://www.textuality.com/bonnie/). With PVFS v1.4.2 and later Bonnie will run fine, but the performance numbers are likely to be very low. Bonnie uses a 16 Kbyte buffer for accessing the file it is writing, which is a particularly small access size for PVFS; performance suffers because TCP overhead is very apparent at requests this small.

There really isn't much to be done about this at the moment. Future versions of PVFS using different network transfer protocols will hopefully have better small-access performance. In the meantime, if you want to see better numbers, you can hack Bonnie to use larger accesses (the value to change is "Chunk") and see what difference larger accesses make.


Why is program XXX so slow?

Many applications use rather small buffers by default, and this can cause poor PVFS performance. Applications such as "tar" and "dd" are good examples. In cases such as this, if there is an option to set a block size, use it (smile)! Try something around 16-64K; it will almost certainly help things out.

As an example, the program "cpio" uses a 512 byte block by default. The --block-size option can be used to set the block size to some multiple of 512 bytes, so "cpio --block-size=128" would use a 64K buffer, which should perform much better.
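
For instance, dd's "bs" option and tar's blocking factor ("-b", in units of 512 bytes) serve the same purpose. The commands below are illustrative only; the file and directory names are placeholders:

dd if=/dev/zero of=/pvfs/testfile bs=64k count=1024
tar -b 128 -cf /pvfs/archive.tar mydata/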


Does PVFS support redundancy? What if a node fails?

Nope! Sure doesn't. We've talked about it, we have some ideas, but we haven't implemented any redundancy. So, if an I/O node fails, PVFS accesses that need that node will also fail. Generally though, barring disk destruction, restarting the node (and sometimes restarting the other PVFS daemons) will get you right back where you were, no data lost.

PVFS will run on top of RAID file systems, however. This can provide at least some measure of redundancy at the disk level. It does not protect against more catastrophic hardware failures such as IDE controller failure or spontaneous combustion.


Why do my modification dates change on PVFS files that I am reading from?

The PVFS manager is not involved in I/O operations, so it has no direct way of knowing whether a file has been modified. It updates the modification time any time a file is closed. Really it should check whether the file was opened for writing, but it doesn't at the moment.

Note that this behavior has been fixed as of release 1.5.3.


How do I get MPI-IO for PVFS?

See the ROMIO web pages (http://www.mcs.anl.gov/romio). ROMIO is an MPI-IO implementation that is included with MPICH, but you will generally need to recompile it to enable PVFS support. This is discussed in the ROMIO documentation.


When I try to compile ROMIO (MPI-IO) with pvfs support it fails with a list of "undefined reference" errors. How do I fix this?

The problem is that the PVFS library is not being linked in during compilation. It must be specified when you run the configure script. Here is an example of the command line needed to build the full MPICH distribution with ROMIO and PVFS support:

./configure -opt=-O -device=ch_p4 --with-romio="-file_system=pvfs" -lib="-L/usr/lib -lpvfs" -cflags="-I/usr/include/pvfs"
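
Once MPICH is built this way, MPI-IO programs are compiled and run with the resulting wrappers as usual. A brief sketch (the install prefix is an assumption):

/usr/local/mpich/bin/mpicc -o io_test io_test.c
/usr/local/mpich/bin/mpirun -np 4 ./io_test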


Can I directly manipulate PVFS files on the manager or I/O servers without going through the client interface?

The short answer is no. The metadata and file data are not meant to be modified directly by users. Doing so may cause corrupt data, lost storage space, etc. If you wish to delete or move files, always do so through a PVFS client interface, whether that is the kernel module or the native PVFS library.


How can I back up my PVFS file system?

This isn't as easy as it should be, but it can be done. First, I'll give some specifics about why this is troublesome, then I'll discuss solutions and a suggestion for making this easier.

The problem with backing up PVFS comes from a design decision made by me (Rob) with respect to handing out unique handles (inode numbers) for files. The manager has to pick these numbers, and at the time it seemed like a good idea to just use the inode number from the actual metadata file. This was convenient because the data was already stored (as part of the file) and was guaranteed to be unique by the file system.

This is great, but it becomes a real problem for backups -- if you go to recreate the metadata directory, it is next to impossible with standard tools (e.g. tar) to restore the same inode numbers, especially since tools like tar don't save them anyway. This is all that keeps one from simply tar'ing up all the local directories that make up PVFS and backing it up that way.

There are two solutions. The first is to use tar or some similar tool, through the PVFS interfaces (either the kernel or the library one) to pull all data off of PVFS. This will result in an archive that could be restored to a newly built PVFS file system with no problems. However, you have to have a storage device big enough to hold all the data, and pulling all the data off in this manner will likely take a long time for a large PVFS file system.
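
A sketch of that first approach (the PVFS mount point and archive location are assumptions):

tar -czf /archive/pvfs-full-backup.tar.gz /mnt/pvfs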

The second solution works by backing up the local directories individually, avoiding the need for a single large device to archive to (you could still use one device if you like) and allowing the archiving to take place in parallel across the machines. If you are not familiar with disk partitioning, "dd", and writing raw partitions back to disk, just don't try this.

In order for this scheme to work, the metadata directory should be stored on its own file system, preferably one that isn't too large. Remember that the metadata files are quite small, so a file system of 50 Mbytes is probably enough to hold all the metadata files you will ever create. So use a little partition to store the metadata.

Then the backup is simple. Use tar to archive the data on the I/O nodes. There's nothing special about those directory structures that tar won't keep up with. Then use dd to grab the entire partition that the metadata is stored on. Stuff it in a file, gzip or bzip2 it, and keep it with your I/O archives.

Then if there are problems and you need to restore, dd the partition back into place, untar the I/O directories onto the right machines, and away you go.
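
A sketch of the whole cycle; the device name, paths, and compression choice below are assumptions:

# backup, on each I/O node:
tar -czf /backup/pvfs-iod-data.tar.gz /pvfs_data

# backup, on the metadata node (assuming /dev/sda5 holds the metadata partition):
dd if=/dev/sda5 bs=1M | gzip > /backup/pvfs-meta.img.gz

# restore:
gunzip -c /backup/pvfs-meta.img.gz | dd of=/dev/sda5 bs=1M
tar -xzf /backup/pvfs-iod-data.tar.gz -C /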


How can I contribute to the PVFS project?

We are always looking for help with implementing new features, testing, or simply commenting on what we are doing. If you are interested, have a look at the developers page for more information.


What are the glibc wrappers and/or where are they now?

The glibc wrappers are no longer supported as of PVFS version 1.5.0. They were discontinued because their functionality has been subsumed by the pvfs-kernel package.

There may occasionally still be references to these wrappers in the documentation, but this will be corrected over time. For curious readers, the wrappers were a mechanism for providing compatibility with existing applications by wrapping libc I/O function calls and trapping the ones that dealt with PVFS. As you may imagine, it was rather difficult to maintain software that depended on very specific versions of libc to operate correctly. We now provide a much higher level of compatibility through a client-side kernel module implementation.


When did you add symlinks to PVFS?

We added symbolic link support in PVFS 1.6.1. Hard links are not supported.


Can I add, remove, or change the order of the I/O daemons on an existing PVFS file system?

No. If you need to add, remove, or swap I/O servers in the existing list (which can usually be found in the /pvfs-meta/.iodtab file), we recommend that you rebuild your file system. The safest thing to do is to copy all of your data to another location, delete all of the files on the existing PVFS file system, make your changes, restart PVFS, and then copy your data back onto the file system. All of the PVFS components rely on the ordering of I/O servers listed in the .iodtab file, and altering it will result in file corruption. We realize that this is really inconvenient, but we don't have a better solution at this time. Future releases will hopefully address this issue better.
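
A sketch of that copy-off/copy-back procedure through a PVFS client mount (all paths below are placeholders):

cp -a /mnt/pvfs /scratch/pvfs-copy
# remove the old files, edit /pvfs-meta/.iodtab, rebuild the file system, restart PVFS, remount
cp -a /scratch/pvfs-copy/. /mnt/pvfs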


Does PVFS work across heterogeneous architectures?

Somewhat. Currently PVFS only works on mixed x86 and IA64 clusters, or on Alpha-only clusters.


How do I keep the locate cron job from scanning the PVFS directory?

Most Linux distributions allow you to control this through existing configuration files. On SuSE, you can edit the UPDATEDB_PRUNEPATHS setting in /etc/rc.config. On Red Hat, you can edit /etc/cron.daily/slocate.cron.
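
For example, add your PVFS mount point to the excluded paths. The lines below are illustrative; the existing contents of these files differ between releases, and /pvfs is a placeholder for your mount point:

# SuSE, in /etc/rc.config:
UPDATEDB_PRUNEPATHS="/mnt /cdrom /tmp /pvfs"

# Red Hat, in /etc/cron.daily/slocate.cron, append the mount point to updatedb's -e (exclude) list:
updatedb -f "nfs,smbfs,ncpfs,proc,devpts" -e "/tmp,/var/tmp,/usr/tmp,/afs,/net,/pvfs"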


Why does df show less free space than I think it should? What can I do about that?

PVFS calculates free space by multiplying the minimum amount free on any one iod by the number of iods in use. It does this because once the disk on one iod fills up, PVFS is (in general) no longer able to write files out with the default stripe. PVFS doesn't try to modify the default stripe to adjust to full disks.
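
As a worked example, suppose four iods have 10 GB, 10 GB, 10 GB, and 2 GB free respectively; df on the PVFS file system would then report:

min(10, 10, 10, 2) GB * 4 iods = 8 GB free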

If the I/O servers' local file systems are used for things other than PVFS, or if large numbers of small files are being stored on PVFS, then the free space available on the I/O servers might not be roughly equal. In the case of non-PVFS files on an I/O server's file system, you should move them :). If lots of small files are being stored on PVFS, you might want to consider using the u2p utility to redistribute the files and/or using the random base (-r) option on the manager in order to place new files on other servers. Note that using the random base option will likely mean that new I/O servers cannot be added to the existing file system later, so keep that in mind.
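
A minimal sketch of the latter, assuming the manager is started by hand (your init scripts may start it differently; check the mgr documentation for the exact invocation on your installation):

# stop the running manager, then restart it with random base placement
mgr -r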


When I try to compile pvfs-kernel, I get an error message that says: /usr/include/linux/modversions.h:1:2: #error Modules should never use kernel-headers system headers. What's wrong?

Make sure that you have the proper kernel source/headers installed and configured on your system. Check the INSTALL file included in pvfs-kernel for more details.


Can I use multiple managers (mgr processes) in PVFS?

No. PVFS only allows one mgr process per file system. You can run as many iods as you like, however. Fortunately, the constraint on the number of mgr processes is not as much of a bottleneck as most people expect. The mgr is _not_ involved in I/O operations (reads/writes) at all; these are handled directly between clients and iods. The mgr is only used for handling metadata, and is therefore contacted only for operations such as directory listings, opening and closing files, and changing permissions. A single mgr process is sufficient for these types of operations in most environments.


I see unresolved symbols errors when I try to load the kernel module. What should I do?

Please check the INSTALL file in pvfs-kernel for more information. Most likely you need to verify that you have the correct kernel headers installed and configured on your system.


When I load the pvfs module I see the following message: "devfs_register(pvfsd): could not append to parent, err: -17". What does this mean?

Everything is fine. This is a warning that occurs on later 2.4.x kernels (at least 2.4.18 and above) if you are using devfs. It happens because of the way we create device entries for backwards compatibility with kernels that do not use devfs.


How can I use multiple disks on each of my I/O servers?

Currently the best way to take advantage of multiple disks in a single I/O server is to use some form of RAID or disk array to create a local file system that spans the disks, and then tell the iod to use a directory on that file system to store its data.

You should search online for "Linux software RAID" for more information on setting up such a local disk configuration with commodity hardware.
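
One possible sketch using mdadm. The device names, RAID level, and mount point are assumptions, and older systems may use the raidtools package (/etc/raidtab and mkraid) instead:

# build a striped array from two disks and put a file system on it
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/hdb1 /dev/hdc1
mke2fs /dev/md0
mkdir -p /pvfs_data
mount /dev/md0 /pvfs_data

# then point the iod at it in /etc/iod.conf:
datadir /pvfs_data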


Does PVFS have a maximum file system size? If so, what is it?

No, PVFS doesn't have an inherent maximum file system size.

For a long time Linux had a maximum addressable block device size of 2TB, which meant that any file system residing on a single block device was limited to 2TB. At the time of writing, patches existed to work around this, but they were not yet part of most kernel distributions. The practical consequence is that each iod's local file system, and therefore its local storage region, is limited to 2TB.

PVFS itself doesn't deal with block devices, and neither does the pvfs kernel code, so this 2TB limit has no impact on PVFS beyond the above-mentioned limit on the size of each iod's local storage region.
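
As a rough worked example under that constraint: with 16 iods, each limited to a 2TB local file system, the file system as a whole could hold on the order of 16 * 2TB = 32TB of data.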

 


Contact: the PVFS mailing lists.