The Sapir-Whorf hypothesis holds that the language we speak shapes how we think and behave. The idea has a long, rich and sometimes controversial history, though recent cognitive linguistic research, together with brain imaging, has lent it new credence. Lately, I have been reflecting on the nature of linguistic relativity in scientific research, particularly in high-performance and cloud computing. How has our focus on floating-point operations per second (FLOPS) shaped our discourse, our science politics and our outcomes?
Look, It's Terascale!
If I were to remark, "That is a terascale system," you would likely interpret my comment as an assessment of the system's computing performance, more specifically its floating-point capabilities. For the cognoscenti, the sentence likely engenders visions of multicore processors with out-of-order instruction issue and deep pipelining, small clusters with fast interconnects and (perhaps) some accelerators such as GPUs. You might even reflect on the historical evolution of high-performance computing, back to the days when a terascale system topped the Top500 list, roughly a decade ago.
Given my remark, you would be unlikely to eye your desktop computer with its terascale (terabyte) commodity disk drive. Yet, based on its storage capacity, that desktop is just as surely a terascale platform as the compute cluster is. Ah, you say, I have mixed capacity (bytes) with capability (operations); terabyte-per-second storage systems are rare. Absolutely true, yet as this bit of third-rate linguistic legerdemain illustrates, language and, perhaps more importantly, connotation really do trump denotation in our discourse.
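The capacity/capability gap is easy to make concrete with a back-of-envelope calculation. The figures below are illustrative assumptions (a 1 TFLOP/s machine and a commodity disk sustaining roughly 100 MB/s), not measurements of any particular system:

```python
# Back-of-envelope: capability (operations/second) vs. capacity (bytes).
# All figures are assumed, round numbers for illustration.
TERA = 10**12

flops = 1 * TERA               # a "terascale" compute rate: 1 TFLOP/s
disk_capacity = 1 * TERA       # a "terascale" commodity disk: 1 TB
disk_bandwidth = 100 * 10**6   # assumed sustained read rate: ~100 MB/s

compute_seconds = TERA / flops                  # one tera-operation: 1 second
read_seconds = disk_capacity / disk_bandwidth   # one full disk scan: 10,000 s

print(f"One tera-op at 1 TFLOP/s: {compute_seconds:.0f} s")
print(f"Reading 1 TB at 100 MB/s: {read_seconds:.0f} s (~{read_seconds/3600:.1f} h)")
```

The machine can retire a trillion operations in the time it takes the disk to deliver a ten-thousandth of its contents, which is the imbalance the wordplay above trades on.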
Whither Petascale and Exascale?
Today, the world's fastest computing platforms have broken the one-petaflop mark on the high-performance Linpack (HPL) benchmark, and planning groups across the United States, Europe and Asia are debating political and technical approaches for the construction of exascale computing systems. I cannot help but wonder, though, why we in technical computing talk so little about petascale and exascale data platforms. Why is there no international initiative to build a low-latency, high-bandwidth (many terabytes/second to even a petabyte/second) data analysis engine with multiple exabytes of storage capacity?
I suspect the reason lies in our background and linguistic heritage. Most of us in high-performance computing came from mathematical backgrounds, where success was defined by a proof and an equation and their embodiment in a computational model, rather than by insight gleaned from large-scale data analysis. I believe it is time we found a lingua franca that bridges FLOPS and bytes and reclaimed the full connotation of exascale.
I Want My Exabytes
It is worth remembering that the explosive growth of observational data is itself largely a product of inexpensive CMOS sensors, based on the same semiconductor technologies that begot microprocessors. The examples of high-resolution instruments are legion, from the Large Hadron Collider (LHC) and its international hierarchy of data archives to the proposed Square Kilometer Array (SKA) radio telescope, which may produce as much as an exabyte of data every few days. This data tsunami is not limited to the physical sciences; the biological and social sciences are being inundated as well.
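It is worth pausing on what "an exabyte every few days" implies for sustained bandwidth. A quick sketch, taking "a few days" to mean three (an assumption for illustration, not an SKA specification):

```python
# Back-of-envelope: sustained ingest rate implied by
# "an exabyte of data every few days" (three days assumed).
EXA = 10**18
TERA = 10**12

seconds = 3 * 24 * 3600            # three days = 259,200 seconds
bytes_per_second = EXA / seconds   # sustained ingest rate in bytes/second

print(f"Sustained ingest rate: {bytes_per_second / TERA:.1f} TB/s")
```

Even on this generous three-day reading, the instrument must sustain several terabytes per second, which is exactly the class of data platform the previous section asks why no one is planning.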
One of the major lessons from web search and cloud data centers is the power of truly massive-scale, near-real-time data analysis. When anyone with a cheap cell phone and a web browser can extract data and insights from a non-trivial fraction of the human knowledge base, behavior and culture are transformed. I would like to believe that we can bring to scientific research the same data-driven analytics that are routinely exploited to find a good lasagna recipe. We need a balanced exascale initiative, in every sense of the word, lest we become a technical confirmation of the Sapir-Whorf hypothesis.