My thesis is the development of a file system data redundancy mechanism appropriate for parallel file systems. Simply stated, if a node in the cluster becomes unavailable, no data will be lost, because the data has been simultaneously stored and updated on another node. From a practical reliability standpoint, this is a very important modification for cluster computers. As we increase the number of independent resources available in a cluster computer (nodes, disks, etc.), the probability that at least one resource fails increases roughly in proportion to the number of resources, as long as each individual failure is unlikely.
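As a rough illustration of that last point, the sketch below computes the chance that at least one resource fails, and compares it with the chance of losing data when every node has a mirror. It is only a back-of-the-envelope model, not the mechanism developed in the thesis: it assumes failures are independent, that every resource fails with the same probability p over some fixed window, and that mirrored data is lost only when both members of a pair fail in that window. The function names and the value of p are hypothetical.

```python
def prob_any_failure(n: int, p: float) -> float:
    """Probability that at least one of n independent resources fails,
    when each fails with probability p (assumed identical for all)."""
    return 1.0 - (1.0 - p) ** n


def prob_loss_mirrored(n_pairs: int, p: float) -> float:
    """Probability that both members of at least one mirrored pair fail.
    Assumes data is lost only when a node and its mirror fail together."""
    return 1.0 - (1.0 - p * p) ** n_pairs


P_FAIL = 0.01  # hypothetical per-resource failure probability

for nodes in (2, 8, 64, 512):
    unmirrored = prob_any_failure(nodes, P_FAIL)
    mirrored = prob_loss_mirrored(nodes // 2, P_FAIL)
    print(f"{nodes:4d} nodes: any failure {unmirrored:.4f}, "
          f"loss with mirroring {mirrored:.6f}")
```

For small p, the first quantity is close to n times p, which is the proportional growth described above. It can never exceed n times p, and it flattens out toward 1 as n grows.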
Until my thesis is completed, I'll avoid going into specific details, but I would like to note several interesting problems that arise during the design of such a system. First, traditional file replication techniques do not generally provide the scalable performance required in a parallel file system. Second, for the reliability of large systems to be increased appreciably, the replication of file data must occur synchronously. This is particularly important if the performance impacts of replication are non-zero. It is possible to move bandwidth costs around within the system; however, a cost in bandwidth must eventually be paid in any replication scheme. Finally, the synchronization of replicated data is not impossible; it does, however, require a failover scheme, since operation atomicity cannot be guaranteed across multiple nodes. For more information, you can email me at bradles at parl.clemson.edu.