High performance computing is fundamentally approached from two directions. In the first, researchers try to find cleverer and more efficient ways of solving computational problems through better algorithms (e.g. Cannon's algorithm for matrix multiplication). In the second, researchers try to build faster computers on which to perform the computation.
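To give the algorithmic direction some concrete shape, here is a minimal sketch of Cannon's algorithm, simulated serially in Python. It assumes a square matrix whose size is divisible by the grid dimension q, and it models the q x q processor grid as nested lists; the shifts that a real implementation would perform with message passing are simulated by reindexing. All function names (`cannon`, `split_blocks`, `join_blocks`, `matmul`, `add_into`) are my own for illustration.

```python
def matmul(X, Y):
    """Plain triple-loop product of two small dense blocks (lists of lists)."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]


def add_into(C, P):
    """Accumulate block P into block C in place."""
    for i, row in enumerate(P):
        for j, value in enumerate(row):
            C[i][j] += value


def split_blocks(M, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    b = len(M) // q
    return [[[row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]
             for j in range(q)]
            for i in range(q)]


def join_blocks(G):
    """Reassemble a q x q grid of blocks into one matrix."""
    q, b = len(G), len(G[0][0])
    return [sum((G[i][j][r] for j in range(q)), [])
            for i in range(q) for r in range(b)]


def cannon(A, B, q):
    """Multiply A and B on a simulated q x q grid of processors."""
    n = len(A)
    assert n % q == 0, "matrix size must be divisible by the grid dimension"
    b = n // q
    A_blocks, B_blocks = split_blocks(A, q), split_blocks(B, q)

    # Initial skew: row i of A shifts left by i, column j of B shifts up by j.
    a = [[A_blocks[i][(j + i) % q] for j in range(q)] for i in range(q)]
    bl = [[B_blocks[(i + j) % q][j] for j in range(q)] for i in range(q)]
    c = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]

    for _ in range(q):
        # Each simulated processor multiplies the two blocks it currently holds.
        for i in range(q):
            for j in range(q):
                add_into(c[i][j], matmul(a[i][j], bl[i][j]))
        # Shift every A block one step left and every B block one step up.
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        bl = [[bl[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    return join_blocks(c)
```

Calling `cannon(A, B, 2)` on two 4 x 4 matrices should give the same result as `matmul(A, B)`. The point of the algorithm is that each processor only ever holds one block of A and one block of B, and only talks to its nearest neighbours.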
Parallel system software blurs the boundary between the two approaches: it tries to improve the performance of parallel computers by increasing the utilization and effectiveness of high performance architectural features. It makes the computational resources easier for parallel algorithms to exploit, acting as a sort of glue between the pure parallel software and the pure high performance hardware. However, the nature of the system software depends on the hardware and algorithms we are trying to combine. Below I describe the two major hardware platforms and how system software research integrates with each.
Supercomputers have traditionally consisted of many processors packed into a single chassis with some sort of bus-like interconnect. Modern commercial examples include the offerings from Cray (X1, XD1, etc.). Research in this area has recently produced MIT's Raw processor and Stanford's Merrimac. These research projects are fundamentally driven by the desire to build a faster processor architecture, usually by combining many processors into one with a versatile bus and cache design. Lately, microprocessor research has become preoccupied with power consumption (future chips are expected to have power densities comparable to nuclear reactors). Supercomputer architects are unfortunately forced to worry about these constraints as well, but they are generally more concerned with building scalable processors. This includes designing smarter CPU interconnects, different processor types, and the system software needed to use these new computation paradigms (e.g. stream processing). To my knowledge, modern commercial supercomputers are not really leveraging this technology yet. I suspect this is at least partly because paradigm shifts in computation are difficult to predict and difficult to achieve profitably. High performance I/O in the supercomputer world is almost entirely limited to building scalable, low latency buses for message passing.
System software for supercomputers has traditionally focused on building compilers that can exploit code-level parallelism without explicit directives from the programmer. That approach has not achieved widespread success, and to my knowledge it is no longer an area of interesting work. Recently, much of the system software work for supercomputers has consisted of adapting system software originally developed for clusters.
Cluster computing approaches the problem of building high performance machines a little differently. The idea is still to achieve speedup and scalability via parallelism, but rather than building a single machine containing many CPUs connected in parallel via a bus, we connect many computers together using a switched network. The processing power is then governed by the number of machines present and the speed of the switched network. Cluster computers are generally only able to achieve medium- and coarse-grained parallelism because of the higher message latency of the switched network interconnect. However, high speed interconnects (such as Myrinet) and better commodity buses (PCI Express) have done a great deal to lower the latency and increase the bandwidth available for inter-computer messaging. This is the crux of the argument for the simple approach of building a Linux cluster. A commodity Linux cluster is able to leverage the real advancements coming out of architecture research, and it does so at a much lower cost, especially considering that the cluster parallel paradigm (usually consisting of MPI-based messaging) is so well researched and understood. The latest cluster architectures, such as IBM's Blue Gene and NEC's Earth Simulator, use supercomputers as the component computers in the cluster, thus building a cluster of supercomputers. The complex interactions between the fast bus interconnects and the slower network interconnects further increase the difficulty of leveraging the computational resources of these machines.
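The link between message latency and the grain of parallelism can be made concrete with a simple cost model. The sketch below assumes the common simplification that sending a message costs a fixed latency plus its size divided by the bandwidth, and that a task performs one such exchange per chunk of computation. The function names and the parameter values in the example are my own illustrative assumptions, not measurements of any real interconnect.

```python
def message_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Assumed cost of one message: fixed latency plus transfer time."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def efficiency(compute_s, nbytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of wall time spent computing when each chunk of
    compute_s seconds of work is followed by one message."""
    comm_s = message_time(nbytes, latency_s, bandwidth_bytes_per_s)
    return compute_s / (compute_s + comm_s)


def min_grain(target, nbytes, latency_s, bandwidth_bytes_per_s):
    """Smallest chunk of computation (in seconds) per message that still
    reaches the target efficiency (0 < target < 1) under this model."""
    comm_s = message_time(nbytes, latency_s, bandwidth_bytes_per_s)
    # Solve compute / (compute + comm) >= target for compute.
    return target * comm_s / (1.0 - target)


if __name__ == "__main__":
    # Purely illustrative values: a low latency bus versus a higher
    # latency switched network, both moving 1 KB messages at 1 GB/s.
    for name, latency_s in [("bus-like", 1e-6), ("switched", 50e-6)]:
        grain = min_grain(0.9, 1024, latency_s, 1e9)
        print(f"{name}: need about {grain * 1e6:.1f} us of work per message")
```

Under this model, the amount of work needed per message grows in direct proportion to the message cost, which is why a higher latency network pushes applications toward coarser-grained decompositions.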
System software for clusters (particularly Linux clusters) is a very active field. There are currently active efforts to develop job scheduling software and parallel filesystems, among other things.