It seems as if it were just yesterday when I was at NCSA and we deployed a one-teraflop Linux cluster as a national resource. We were as excited as proud parents by the configuration: 512 dual-processor nodes (1 GHz Intel Pentium III processors), a Myrinet interconnect and (gasp) a stunning 5 terabytes of RAID storage. It achieved a then-astonishing 594 gigaflops on the High-Performance Linpack (HPL) benchmark, and it was ranked 41st on the Top500 list.
The world has changed since then. We hit the microprocessor power (and clock rate) wall, birthing the multicore era; vector processing returned incognito, renamed as graphics processing units (GPUs); terabyte disks are available for a pittance at your favorite consumer electronics store; and the top-ranked system on the Top500 list broke the petaflop barrier last year, built from a combination of multicore processors and gaming engines. The last is interesting for several reasons, both sociological and technological.
On the sociological front, I remember participating in the first petascale workshop at Caltech in the 1990s. Seymour Cray, Burton Smith and others were debating future petascale hardware and architectures, a second group debated device technologies, a third discussed application futures, and a final group of us were down the hall debating future software architectures. (I distinctly remember talking to Seymour about his "parity is for farmers" comment regarding memory ECC.) All this was prelude to an extended series of architecture, system software, programming models, algorithms and applications workshops that spanned several years and multiple retreats.
By the way, you can read the original report here; it is fascinating to look back. Paul Messina, Thomas Sterling and others deserve our thanks for launching the seminal activity.
At the time, most of us were convinced that achieving petascale performance within a decade would require some new architectural approaches and custom designs, along with radically new system software and programming tools. We were wrong, or at least so it superficially seems. We broke the petascale barrier in 2008 using commodity x86 microprocessors and GPUs, InfiniBand interconnects, minimally modified Linux and the same message-based programming model we have been using for the past twenty years.
However, as peak system performance has risen, the number of users has declined. Programming massively parallel systems is not easy, and even terascale computing is not routine. Horst Simon explained this with an interesting analogy, which I have taken the liberty of elaborating slightly. The ascent of Mt. Everest by Edmund Hillary and Tenzing Norgay in 1953 was heroic. Today, amateurs still die each year attempting to replicate the feat. We may have scaled Mt. Petascale, but we are far from making it a pleasant, or even routine, weekend hike.
This raises the real question: were we wrong in believing different hardware and software approaches were needed to make petascale computing a reality? I think we were absolutely right that new approaches were needed. However, our recommendations for a new research and development agenda were not realized. At least in part, I believe this is because we have been loath to mount the integrated research and development needed to change our current hardware/software ecosystem and procurement models.
I recently participated in the International Exascale Software Project Workshop (IESP), the first in a series of meetings designed to explore organizational and technical approaches to exascale system design and construction. The workshop built on several earlier meetings and studies, including the DARPA exascale hardware study and the forthcoming exascale software study (in which I participated), as well as the DOE exascale applications study. Complementary analyses are underway in the European Union and in Asia.
Evolution or revolution, it's the persistent question. Can we build reliable exascale systems from extrapolations of current technology, or will new approaches be required? There is no definitive answer, as almost any approach might be made to work at some level with enough heroic effort. The bigger question is what design would enable the most breakthrough scientific research in a reliable and cost-effective way?
My personal opinion is that we need to rethink some of our dearly held beliefs and take a different approach. The degree of parallelism required at exascale, even with future manycore designs, will challenge even our most heroic application developers, and the number of components will raise new reliability and resilience challenges. Then there are interesting questions about manycore memory bandwidth, achievable system bisection bandwidth and I/O capability and capacity. There are just a few programmability issues as well!
I believe it is time for us to move from our deus ex machina model of explicitly managed resources to a fully distributed, asynchronous model that embraces component failure as a standard occurrence. To draw a biological analogy, we must reason about systemic, organism-level health and behavior rather than cellular signaling and death, and not allow cell death (component failure) to trigger organism death (system failure). Such a shift in world view has profound implications for how we structure the future of international high-performance computing research, academic-government-industrial collaborations and system procurements.
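To make the biological analogy a little more concrete, here is a minimal, purely illustrative sketch of the "organism over cell" idea: a supervisor that treats individual task (cell) failures as routine and simply reschedules the work, so the overall computation (organism) survives. The function names, the simulated failure model and the retry policy are all my own assumptions for illustration, not drawn from any real exascale runtime.

```python
# Illustrative sketch only: component failure is treated as an expected
# event to be absorbed, not a fatal error that aborts the whole job.
import random

random.seed(42)  # fixed seed so the simulated failures are reproducible

def run_on_node(task, fail_prob=0.3):
    """Simulate executing one task on a node that may fail (hypothetical)."""
    if random.random() < fail_prob:
        raise RuntimeError(f"node running task {task} failed")
    return task * task  # stand-in for the task's real result

def resilient_map(tasks, max_attempts=10):
    """Reschedule failed tasks instead of letting one failure kill the run."""
    results = {}
    pending = list(tasks)
    for _ in range(max_attempts):
        still_pending = []
        for t in pending:
            try:
                results[t] = run_on_node(t)
            except RuntimeError:
                # "Cell death": note the failure and resubmit the task.
                still_pending.append(t)
        pending = still_pending
        if not pending:  # "organism" is healthy: all work completed
            break
    return results, pending

results, unfinished = resilient_map(range(8))
```

The point of the sketch is the control structure, not the simulation: failure handling lives in the runtime's scheduling loop rather than in a checkpoint-restart of the entire application, which is one way to read the shift from explicitly managed resources to an asynchronous, failure-embracing model.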