Multicolored lights are among the iconic and ubiquitous manifestations of the end-of-year holidays. They also conjure festive memories of my childhood, albeit with one major exception. That rather less than pleasant memory involved much consternation, veiled (and not-so-veiled) mutterings, and ritual questioning of all things derived from Maxwell and Edison.
I speak, without hesitation or ambiguity, of the curse (sometimes literal) of serially wired holiday lights. My father and mother kept a healthy supply of spare bulbs for laborious, sequential replacement and testing whenever a single bulb failed and a strand of thirty bulbs went dark. I still remember my excitement and delight when we first purchased strands with parallel circuit wiring. What a difference it made when a bulb failure dimmed only one light rather than an entire strand. As a child, I quickly gained a visceral appreciation for Kirchhoff's laws, long before I ever knew the name Gustav Kirchhoff.
If there is an Aesop-like moral in this tale from my childhood, it concerns designing whole systems for resilience rather than relying on component resilience alone. Parallel circuit resilience trumps serial circuit resilience, and the extra cost is repaid in greater systemic reliability. Alas, I fear we have not learned this lesson in the design of parallel computer systems, parallel programming models, and applications.
The standard domain decomposition models commonly used to solve the partial differential equations that underlie so much of computational science, and that are embodied in parallel MPI implementations of the solvers, implicitly presume that all of the parallel processes (tasks) and the supporting hardware operate without error or failure. Periodic checkpointing is our primary concession to failure and the vagaries of job scheduling.
This is analogous to trusting that series circuit holiday lights will not fail until we turn them off each evening. It can be effective for small numbers of lights (processes) but is increasingly problematic for larger systems. All this suggests we need to rethink our models of hardware and software resilience and embrace redundancy as the necessary cost of systemic reliability, particularly in the trans-petascale and exascale regimes. It may also mean that we can never use all of the hardware for non-redundant computation. After all, would one rather fail quickly at exascale rates or compute reliably at sustained trans-petascale rates?
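To make the checkpointing concession concrete, here is a minimal sketch of application-level periodic checkpointing. All names (the checkpoint path, the stand-in solver step) are hypothetical illustrations, not any particular MPI application's code; the point is only that a failure between checkpoints loses at most one checkpoint interval of work.

```python
import os
import pickle
import tempfile

CHECKPOINT = os.path.join(tempfile.gettempdir(), "solver_state.pkl")

def save_checkpoint(step, state):
    # Write atomically: dump to a temp file, then rename over the old
    # checkpoint, so a crash mid-write never corrupts the last good copy.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump((step, state), f)
    os.replace(tmp, CHECKPOINT)

def load_checkpoint():
    # Resume from the last saved (step, state), or start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return 0, 0.0

def run(total_steps=100, interval=10):
    step, state = load_checkpoint()
    while step < total_steps:
        state += 1.0          # stand-in for one solver iteration
        step += 1
        if step % interval == 0:
            save_checkpoint(step, state)
    return state

print(run())
```

If the process dies at step 57, a restart reloads the step-50 checkpoint and recomputes only steps 51 through 57. The cost of that recomputation, plus the I/O cost of writing checkpoints at extreme scale, is exactly why checkpointing alone becomes problematic as system size and failure rates grow.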
Changing the Game
Cloud services now operate on the largest computing systems we have ever built on this planet, with service reliability expectations far higher than what we demand from scientific applications. Thus, I also believe there are lessons from cloud computing that are potentially applicable to computational science applications. Arguably the most important is exploring how to exploit eventual consistency rather than sequential consistency.
To understand this potential shift in perspective, I heartily recommend Werner Vogels' analysis of the power of eventual consistency for large-scale web services at Amazon. Eric Brewer's thoughts on the CAP theorem, drawn from his Inktomi experiences, have also shaped theoretical and empirical assessments of large-scale system reliability. For those not familiar with the CAP theorem, it postulates that a distributed system can guarantee at most two of consistency, availability, and partition tolerance. More generally, it offers a framework for reasoning about conflicting objectives.
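For readers who have not seen eventual consistency in the small, a toy sketch may help. This is not Amazon's design; it is a grow-only counter of the kind used in conflict-free replicated data types, chosen here only to show replicas accepting writes independently (availability) and converging once they exchange state (eventual consistency).

```python
class GCounter:
    """Grow-only counter: one slot per replica; merge takes the
    elementwise max, which is commutative, associative, and idempotent,
    so the order in which replicas exchange state does not matter."""

    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.id] += 1   # local write, no coordination needed

    def merge(self, other):
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)

# Two replicas accept writes concurrently while "partitioned"...
a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()      # replica A sees two local writes
b.increment()                     # replica B sees one local write
# ...then exchange state and converge to the same total.
a.merge(b); b.merge(a)
print(a.value(), b.value())       # → 3 3
```

During the partition, the two replicas disagree (2 versus 1), yet both remain available for writes; after the merge, both report 3. That trade, temporary disagreement in exchange for availability, is the essence of choosing eventual over sequential consistency.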
Lest everyone shudder in numerical and scientific horror at this prospect, remember that parallel numerical applications are themselves samples drawn from the space of "possible answers." They are not "the answer," and they may not even be an unbiased sample. (See On Getting the "Right" Answer.) Are these weak consistency approaches transferable to scientific applications? I do not know. I am only sure that we need to find out – quickly. If successful, this would be a profound change, with deep implications for computational science algorithms, software, and application reliability.