N.B. I also write for the Communications of the ACM (CACM). The following essay recently appeared on the CACM blog.
Petascale high-performance computing (HPC) is here, with multiple machines achieving more than ten petaflops on the Linpack (HPL) Top500 benchmark. These achievements have not been without teething problems, as the scale and complexity of the systems have made debugging, acceptance testing and application scaling ever more challenging. Nevertheless, these systems are now operational, and they are being used for scientific research and national security applications.
In many ways, it is amazing that a decade ago we celebrated crossing the terascale threshold. When I christened the NSF Distributed Terascale Facility (DTF) as the TeraGrid in 2002, Ian Foster asked me if we should worry about embedding a performance level in a facility name. I responded that I was confident the capabilities and the name would evolve, and they have.
Planning is now underway for exascale computing, though both the depth of the technical challenges and the straitened economics of research funding have slowed progress. For detailed background on the technical challenges, I heartily recommend reading the 2008-2009 DARPA exascale hardware and software studies, chaired by Peter Kogge and Vivek Sarkar, respectively. Although some of the details have changed in the interim, the key findings are still relevant. (In full disclosure, I was one of several co-authors of the exascale software study.)
Among a plethora of design challenges highlighted by these two reports, three are especially relevant when considered with respect to petascale systems:
- Substantially reduced memory per floating point operation (i.e., reduced memory per processor core due to energy constraints)
- Dramatically higher energy efficiency per floating point operation with minimal data movement, given the high time and energy cost of off-chip data accesses (a rough energy budget is sketched after this list)
- Frequent component failures, given the sheer number of chips required to reach the exascale performance target
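To make the energy point concrete, here is a back-of-the-envelope sketch, assuming the oft-cited 20 MW exascale design point (discussed below) and a sustained rate of 10^18 floating point operations per second. The numbers are illustrative, not a design specification.

```python
# Rough energy budget per floating point operation, assuming (as discussed
# below) a ~20 MW system power envelope and a sustained exaflop/s.
system_power_watts = 20e6   # assumed 20 MW design point
sustained_flops = 1e18      # 1 exaflop/s

joules_per_flop = system_power_watts / sustained_flops
print(f"Budget: {joules_per_flop * 1e12:.0f} pJ per flop")  # ~20 pJ

# Off-chip DRAM accesses are commonly estimated at a nanojoule or more per
# 64-bit word, far above this budget, which is why minimizing data movement
# dominates exascale energy planning.
```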
Just a Few Orders of Magnitude
All currently envisioned exascale systems would require parallelism at unprecedented scale; barring new, energy-efficient memory technologies, they would be memory starved relative to current systems, even under a 20 MW system design point; and multilevel fault tolerance would be required to achieve an acceptable systemic mean time to failure (MTBF). Extraordinary parallelism, unprecedented data locality and adaptive resilience: these are daunting architecture, system software and application challenges for exascale computing.
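The resilience point can also be made with simple arithmetic. The sketch below uses assumed, illustrative numbers (a five-year per-node MTBF and node counts up to 100,000) and the standard approximation for independent, exponentially distributed failures; it is not a claim about any particular machine.

```python
# How system MTBF shrinks with component count, assuming independent,
# exponentially distributed node failures (system MTBF ~ node MTBF / N).
HOURS_PER_YEAR = 8766          # average year length in hours

node_mtbf_years = 5.0          # assumed per-node MTBF (illustrative)
for node_count in (1_000, 10_000, 100_000):
    system_mtbf_hours = node_mtbf_years * HOURS_PER_YEAR / node_count
    print(f"{node_count:>7} nodes -> system MTBF ~ {system_mtbf_hours:5.1f} hours")

# At 100,000 nodes the system MTBF falls below an hour, which is the
# motivation for multilevel, adaptive fault tolerance rather than relying
# on global checkpoint/restart alone.
```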
If we have learned anything in sixty years of software and hardware development, it is that orders of magnitude matter, whether in latencies and access times, bandwidths and capacities, software scale and complexity, or levels of parallelism. From file system metadata bottlenecks when opening thousands of files to application performance losses from operating system jitter due to daemon activity, every order of magnitude brings new challenges. Only the naïve or inexperienced believe one can scale any computer system design by factors of ten without exposing unexpected issues.
Knowns and Unknowns
What can we expect at exascale? As always, there are the known knowns, the known unknowns and the unknown unknowns, to use a Rumsfeld phrase. The knowns, of both kinds, include the ever-present issues of scale and locality. Will variants of current scheduling and resource management techniques be effective and usable by application developers? Will the complexity of multilevel memory management, heterogeneous multicore and dark silicon shrink the cadre of ultra-high performance application developers even further, perhaps below a technically and politically viable threshold?
The unknowns are deeper and more subtle. How can energy optimization be elevated to parity with performance optimization, both statically and dynamically? The dynamic aspect is crucial, as the hysteresis of thermal dissipation in dark silicon affects chip lifetimes. Equally important, how can energy usage be related to code in ways that highlight optimization choices? This is the energy analog of performance measurement and guidance for optimized code, where measurements must be related to the original code in ways that are meaningful and amenable to change.
Finally, there are important open questions about the future of operating system structures themselves. The fundamental lesson of cloud computing – the nearest equivalent in scale – is the importance of weak consistency and loose coordination. Given projected exascale communication costs (energy and time), and frequent component failures, might federated rather than synchronized operation be preferable? Is it time to revisit some of our most cherished HPC assumptions and imagine operating system structures and programming models not based on Linux variants, MPI and OpenMP?
Evolution or Revolution
The exascale hardware and software challenges are real. Do we pursue incremental extensions of current practices or step back and explore more radical and fundamental options? Each has different advantages and disadvantages, which suggests we should probably pursue both, recognizing the costs. To be sustainable, an exascale research and development program must lead to cost-effective and usable systems that are an integral part of the mainstream of the semiconductor and software industries.