A substantial fraction of my professional life has been devoted to analyzing and optimizing the dynamic behavior of some of the world's largest high-performance computing (HPC) systems. During the past thirty years, spanning my time at Illinois, North Carolina, Microsoft and now Iowa, my students, post-doctoral associates, research staff and I have used and developed many performance, energy consumption, and reliability instrumentation and measurement tools and techniques, embodied them in scalable data capture and reduction software, and created correlation tools that connect measurements to potential remediation opportunities. Multiple instantiations of these tools have been released as part of the Pablo performance analysis toolkit and Autopilot toolkit for adaptive performance control.
Just a Couple of Weeks
Most of this work originated in a challenge from my Ph.D. thesis supervisor, Herb Schwetman, issued while we were both at Purdue. After listening to me expound excitedly about the virtues of yet another interconnection network for parallel systems, he challenged me to prove that the object of my excitement would deliver higher performance than other alternatives. I dutifully promised him that I would spend a few weeks on the performance analysis problem, then return to the core parallel systems design question.
What ensued changed my thesis direction dramatically, as the two-week project took three years to complete. Over thirty years later, I find myself still working to answer generalizations of my advisor's basic question: "How fast is system A compared to system B, and what are the bottlenecks in each?" These are beguilingly simple questions, but they cut to the heart of system design and optimization. They are also especially pertinent as we consider the design of putative exascale systems with billion-way parallelism and millions of components that must operate reliably, all while consuming tens of megawatts of power.
It Seems Simple, But It Is Not
Beneath all this complexity is a simple, yet subtle law that relates performance (throughput), delay (response time), and the number of interacting units (customers). It has been my friend and nemesis throughout my research career.
Generally known as Little's Law, it relates throughput (X), response time (R) and customers (n) in a stable system in a linear way: n=XR. This general relationship holds for computers, airline reservation systems, store checkout lines, or any other system where entities (customers or computer jobs) compete for service from a resource-limited system.
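The relationship is easy to check empirically. The following sketch, not part of the original essay, simulates a single-server FIFO queue with exponential interarrival and service times (an illustrative assumption; the law itself holds far more generally) and compares the product X*R against an independently sampled measurement of n, the average number of jobs in the system:

```python
import random
from bisect import bisect_right

random.seed(1)

def simulate_queue(arrival_rate, service_rate, num_jobs):
    """Single-server FIFO queue with exponential interarrival/service times."""
    t, server_free = 0.0, 0.0
    arrivals, departures = [], []
    for _ in range(num_jobs):
        t += random.expovariate(arrival_rate)   # next customer arrives
        start = max(t, server_free)             # wait if the server is busy
        server_free = start + random.expovariate(service_rate)
        arrivals.append(t)
        departures.append(server_free)
    return arrivals, departures                 # both nondecreasing for FIFO

arrivals, departures = simulate_queue(0.8, 1.0, 100_000)
span = departures[-1] - arrivals[0]
X = len(arrivals) / span                                # throughput (jobs per unit time)
R = (sum(departures) - sum(arrivals)) / len(arrivals)   # mean response time per job

# Independent estimate of n: sample the number of in-system jobs on a time grid.
grid = [arrivals[0] + span * i / 10_000 for i in range(10_000)]
n = sum(bisect_right(arrivals, g) - bisect_right(departures, g) for g in grid) / len(grid)

print(f"X*R = {X * R:.2f}, measured n = {n:.2f}")  # the two estimates agree, per Little's Law
```

At 80% utilization the measured n and the product X*R track each other closely, as the law predicts, even though the two are computed from entirely different measurements of the same run.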
Little's Law also says there is no free lunch. Optimizing for high system utilization leads to large wait times for customers, as anyone who has waited on a busy telephone reservation line knows all too well. Conversely, ensuring short waiting lines requires system utilization to be low. Simply put, one must choose between high system efficiency and short customer wait times. One cannot (in general) have both.
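The severity of that tradeoff can be made concrete with the textbook M/M/1 result (an assumption I am adding for illustration, not a claim from the essay): mean response time is R = 1/(mu*(1 - rho)), where rho is server utilization and 1/mu is the mean service time. Response time grows without bound as utilization approaches one:

```python
# Mean response time in an M/M/1 queue: R = 1 / (mu * (1 - rho)),
# where rho = lambda/mu is the server utilization.
service_rate = 1.0  # mu, jobs per unit time (illustrative value)

for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    R = 1.0 / (service_rate * (1.0 - rho))
    # At mu = 1, R is expressed directly in multiples of the mean service time.
    print(f"utilization {rho:.0%} -> mean response time {R:6.1f}x the service time")
```

Doubling utilization from 50% to 99% inflates the mean response time fifty-fold, which is exactly the efficiency-versus-wait-time choice described above.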
As simple as this may seem, in reality, the best optimization strategy depends on many other constraints and the broader environment. For many systems, the relative costs of the components providing and receiving service determine the appropriate optimization axis. When humans are involved, their psychology and expectations must be managed appropriately. That is why many telephone response lines provide waiting time estimates and periodic feedback during the wait. Those of us who have managed batch-scheduled HPC systems also know that users and system operators constantly balance system utilization against the product of queue wait time and parallelism.
Coming Full Circle
Exascale computing brings many of these issues into focus in new ways. When a multiple megawatt energy spike is imposed on the electrical grid, utility operators notice. Thus, exascale system energy requirements may constrain when parallel jobs execute, as jobs with high floating-point computation intensity draw more energy than those dominated by pointer chasing. Likewise, component reliability may limit maximum parallelism within exascale jobs; the mean-time-to-failure (MTBF) of processors, memory and storage systems, and interconnect elements really matters. Although more job parallelism normally reduces execution time, component reliability bounds effective parallelism and maximum computation time without either resilience or recovery mechanisms.
Performance, delay and parallelism at large scale: Little's Law speaks to all of these issues and more. When performance optimization, reliability requirements, and energy management are convolved with component costs, device physics, system software services and application characteristics, the constraint-based optimization problems become dauntingly complex. Their solution is a bit more than a two-week project, especially at exascale.