N.B. I also write for the Communications of the ACM (CACM). The following essay recently appeared on the CACM blog.
Computing research and advanced computing infrastructure – each is dependent on the other in a myriad of subtle and complex ways, yet each is profoundly different in culture, process, skills and metrics. As a computer scientist who has often conducted research on, toward and with large-scale infrastructure, both for computational science and for big data analytics in clouds, and as someone who has overseen large-scale, production infrastructure, both as director of a national facility and as a university CIO, I have seen this creative tension and intellectual dichotomy firsthand.
Convolving the two inappropriately can be frustrating at best and disastrous at worst. Conversely, when each of the constituent communities acts appropriately, there are enormous benefits to both. As we think about next-generation capabilities and the balance of investment in basic research and deployed infrastructure for data analytics and computational science, it is salutary to consider some insights from our experiences, reflected, sometimes unthinkingly, in the language we use. The prepositions, such simple words, connecting the two, make all the difference. About, on, before, beyond, for, toward, with: they are all different, and we confuse them at our peril.
Metrics and Culture {before and beyond}
First and perhaps most obviously, the metrics of success in research and infrastructure differ markedly, though both operate in a highly competitive world. The research culture rewards novelty and new ideas, intellectual risk taking, rapid prototyping and experimentation, and early publication. At its best, research pursues basic questions and long-term outcomes. Students and post-doctoral associates learn as they are doing, and the resulting prototypes are often buggy and usable only by the researchers themselves. Often, though not always, the experimental scale is small, limited by budgets and staff, with localized consequences for failure and little expectation that others will depend on the outcomes.
Conversely, the production infrastructure culture, whether for big data analytics or computational science, rewards operational stability; quality of service, both technical and human; and long-term compatibility, continuity and interoperability. It also prizes consultation with others, operational experience, planning, process, testing and documentation. Operationally, the production culture is inherently conservative, for failure at scale can affect thousands to millions of users. Thus, change for change's sake is rare. Rather, adoption and deployment are driven by shifting costs and technological capabilities, by user expectations and needs, and by competitor behavior. These competitors can be in the business world or in the world of international scientific instrumentation and facilities.
Collaborations {on and with}
Computing researchers rarely build and successfully operate production infrastructure; they lack the skills and production mindset. Likewise, infrastructure operators rarely ask the fundamental questions; they rightly focus on operational priorities and their demanding user community. Yet each can benefit from the other. Lest these characterizations seem unnecessarily pedantic or abstract, consider again the prepositions and their implications.
Conducting research on or with production infrastructure requires collaboration and deep partnership with the infrastructure operators. Moreover, under no circumstances can it compromise infrastructure security, reliability or quality of the service. Thus, there must be a clear value proposition for the infrastructure operator and its user community, other than just the goodwill of the computing research community. The payoff might be optimized production infrastructure based on experimental insights, data to shape next-generation vendor capabilities, or better algorithms for data analysis, but it must be weighed against the risks.
As a personal example, my research group spent years exploring performance measurement and analysis techniques for large-scale HPC systems, using systems at the NSF and DOE supercomputing centers for testing and measurement. Because these centers supported a national base of computational scientists, it was critical that my team's prototype tools not adversely affect either the HPC system or its users. Because these tools typically captured data at many levels, from hardware, system software, runtime libraries and applications, this was no small task. It required careful planning, testing and scheduling. The rewards, however, were substantial – data and insights at a scale simply unattainable in the laboratory and influence on next-generation capabilities.
When I became director of the National Center for Supercomputing Applications (NCSA), one of those NSF-funded national facilities, I found myself on the other side of the equation, responsible for deploying and operating a portion of the national research infrastructure. This sharpened my awareness of fiduciary responsibility and the risk asymmetry. Make no mistake; in research collaborations, the infrastructure operator bears the majority of the risk. The risks span a wide range, but they include compromised functionality from faulty experiments and security or privacy breaches due to data analysis.
There is a reason firms with web search logs and other social network data rarely release even sanitized and anonymized data sets for academic research. Even inside a company, production infrastructure and data are accessible to only a few.
Ideas and Implementation {for and toward}
What are the best ways to transfer innovative ideas and form mutually beneficial partnerships? It is neither easy nor simple. (See Technology Transfer: A Contact Sport for a few thoughts on this topic.) Crossing the "valley of death" from idea to realization requires understanding the entire set of optimization constraints and considerable patience. Indeed, repeated U.S. National Academies studies have shown that the time lag to translate basic research ideas in computing into a billion-dollar industry is over a decade. (See the famous "tire tracks" diagram.) From supply chains and availability through competitive pressures and time to market to ecosystem economics and labor force, these constraints may seem prosaic or mundane to the research community, but they are very real.
For a researcher, this often leads to a difficult lesson, that of idea valuation. Simply put, an intellectually interesting idea is not necessarily a helpful idea or even an economically valuable one. This is an often-painful realization. I have seen many cases where a new and improved algorithm or hardware design was rejected because it was not backward compatible with existing infrastructure; it solved a problem but not the most pressing one; it was too costly to implement; it would have required retraining and a cultural shift; or it was simply the right idea at the wrong time.
As the scale, scope and influence of computing grow ever larger, the stakes are high, for both computing researchers and those who build and deploy large-scale computing infrastructure. Each needs to understand the constraints and needs of the other to find common ground.
Lessons for HPC and Big Data
What does all of this mean for today's world of HPC and big data? First, we must continue to support the basic research needed to advance the state of the art. There are deep and unresolved research questions regarding resilience and reliability, energy efficiency and performance, programmability and expressivity, privacy and security, among many others, which span algorithms, software, hardware and architecture.
Second, advanced HPC and big data infrastructure enable discovery across all scientific and engineering domains. There are important problems in biology and medicine, physics and materials sciences, weather prediction and climate modeling, and business and consumer services that depend on continued advances in our large-scale infrastructure. It is crucial that we continue to fund and deploy such capabilities at the largest scales possible.
Third and critically, we should not confuse the two. The two communities, computing research and research infrastructure for science and engineering, can and should learn from each other, but they are different, with divergent constituencies and cultures; disjoint timescales and needs; and complementary outcomes and rewards. One cannot be sacrificed for the other.
Remember the prepositions. They really matter.