As I’ve gotten older (note the photo of a bald guy above), I’ve come to realize why my grandfather and then my father often commented on some event by saying, “That reminds me of when …” As a boy, I rolled my eyes at these things, not seeing their relevance. Now, I recognize that age and experience do bring some ability to recognize current events as somewhat similar to previous ones. What’s it called? Ah, yes, they call it wisdom, I believe. Such is the case with multicore designs and our lamentable lack of parallel software and tools.
We have been predicting the end of “free” CMOS performance increases via Moore’s law for over twenty years. Of course, Moore’s law is not a law; it is a technology trend enabled by a semiconductor roadmap, investment, and hard work by many people. More to the point, it really isn’t about doubling clock speed but about rising transistor density. There are physical limits on transistor density, but we are not there yet; look at the SIA roadmap to see the future.
We are, however, facing a power crisis. At high clock rates, chip power rises roughly as the cube of clock frequency (i.e., doubling the frequency increases power consumption by a factor of eight). The desire for long battery life and small form factors, together with the insatiable demand for more computing power, has led to the current crisis. We have largely exhausted the architectural techniques for extracting parallelism from a sequential instruction stream (pipelining, scoreboarding, superscalar issue, vectorization, out-of-order completion), and we cannot increase clock frequency further.
Multicore (multiple processors per chip) is our collective engineering response. By operating chips at lower clock frequencies and executing multiple instruction streams concurrently, one can deliver higher performance at lower power. Today, we have dual- and quad-core chips, and Intel’s Justin Rattner has demonstrated a teraflop (peak) 80-core chip. We can expect chips with hundreds (if not thousands) of heterogeneous cores (processing, graphics, and signal processing) within a few years.
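The arithmetic behind this tradeoff is easy to sketch. Using the simplified cubic model above (dynamic power scales as C·V²·f, and supply voltage scales roughly with frequency, so power goes as f³), here is a hypothetical, normalized comparison of one fast core against four slower ones. The numbers and function names are illustrative assumptions, not measurements of any real chip:

```python
# Illustrative sketch of the frequency/power tradeoff described above.
# Assumes the simplified cubic model P ~ f**3 (dynamic power C*V**2*f,
# with supply voltage V scaling roughly linearly with frequency f).
# All quantities are normalized, hypothetical units -- not measured data.

def relative_power(freq):
    """Power of one core, relative to a baseline core at freq = 1.0."""
    return freq ** 3

def chip(cores, freq):
    """Aggregate throughput and power for `cores` cores at clock `freq`,
    assuming the workload is perfectly parallel (the optimistic case)."""
    return {"throughput": cores * freq, "power": cores * relative_power(freq)}

single = chip(cores=1, freq=1.0)   # one core at full clock
quad   = chip(cores=4, freq=0.5)   # four cores at half clock

# Four half-speed cores: twice the throughput at half the power --
# but only if the software actually exposes four-way parallelism.
print(single)  # {'throughput': 1.0, 'power': 1.0}
print(quad)    # {'throughput': 2.0, 'power': 0.5}
```

The catch, of course, is the assumption buried in the comment: the win exists only if the software supplies the concurrent instruction streams.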
However, there is no free lunch. Multicore designs force us to face an ugly truth: we cannot continue to hide parallelism from the software designer. No compiler or runtime system will transform sequential software, be it Windows or Linux operating system code or research or commercial application code, to execute efficiently with hundred-way parallelism. This ugly truth must be faced squarely.
Those of us in high-end computing have been peering over the edge of this precipice for many years. Inexpensive clusters are widely deployed in research laboratories and are programmed with low-level message passing libraries such as MPI. Truly high-end systems now contain tens of thousands, and soon hundreds of thousands, of processors that are also programmed using message passing. This is a laborious, painful process that has limited the uptake of parallel computing in industry (see the Council on Competitiveness’ HPC study) and hampered the U.S. response to national security challenges.
In short, high-end computing has seen the future, and it is neither pretty nor attractive for a consumer-dominated market. We need new, higher-level approaches that recognize our extant software base and the rapidly rising cost of software development. Global address space (GAS) languages such as Co-Array Fortran and UPC are promising approaches, but much more work is needed.
How did this happen, and how does it relate to my grandfather? We began to address the parallel software problem a decade ago, with research projects in automatic parallelization and data parallel languages, driven by high-end computing. These approaches offered ways to express large-scale parallelism while hiding many of the low-level details of message passing. They were immature and incomplete, but promising. (N.B. I realize that automatic parallelization research goes back forty years. By “a decade ago,” I am alluding to the work on data parallel languages for high-latency communication networks, the work on HPF, for example. My point is that we have had little success automatically parallelizing large codes, and that funding for more expressive alternatives, and for the underlying technology, largely disappeared.)
However, we abandoned these research directions when they did not quickly yield commercial-quality solutions. We forgot that it took over a decade to develop effective programming idioms and vectorizing compilers for vectorization, a much simpler and more restricted special case of parallel computing. Simply put, had we stayed the high-end research course, we might now be confidently exploiting data parallelism, even for irregular problems, on today’s consumer multicore designs.
My grandfather never heard of Santayana, but he would recognize the wisdom of the dictum, “Those who cannot remember the past are condemned to repeat it.” We need a coordinated, long-term research and development strategy and roadmap for computing (as PCAST has recommended) that spans the entire spectrum: advanced architectures, large-scale data management, software reliability and programmability, usability and human factors, sensors and networks.