This is a personal blog updated regularly by Professor Daniel Reed at the University of Utah.
These musings on the current and future state of technology, scientific research, innovation, and personal history are my own and don’t necessarily represent the University of Utah's positions, strategies, or opinions.
SC17, the annual gathering of the supercomputing community, was held in mid-November, the week before U.S. Thanksgiving. As I mentioned in a previous post (Come to the Supercomputing Big Data Edge), we presented the results of the Big Data and Extreme-scale Computing (BDEC) workshops on the technical and social convergence of big data, machine learning, high-performance computing, and computing hardware and software. You can find the report here. At SC17, we had a lively discussion about edge computing, global competitiveness, and the nature of scientific revolutions.
HPC Data Center Energy Efficiency
In addition to the BDEC discussion, I also chaired an SC17 panel on the energy efficiency of HPC data centers. The panel was organized with the help of the Energy Efficient High Performance Computing Working Group (EEHPCWG). The panel participants were Sadaf Alam (CSCS/ETH), Bill Gropp (NCSA/Illinois), Satoshi Matsuoka (GSIC/Tokyo Tech), and John Shalf (NERSC/LBL). You can find the slides and a video of the panel here, courtesy of InsideHPC.
Autonomous Vehicles
Finally, I was recently a guest on Iowa Public Radio's lunchtime conversation, River to River. We had a lively and thoughtful conversation about the potential social consequences of autonomous vehicles. You can listen to that audio here.
SC17, the annual gathering of the supercomputing community, is nigh. As always, there will be talk of next-generation technology – hardware, software, algorithms – and new applications. All of this will take place against the backdrop of the race to exascale computing – bigger and bigger, faster and faster. We will all geek out, and we will love it.
While we discuss operations/joule, quantum and neuromorphic computing, and new and exotic memory technologies to enable ever faster systems, there is another computing revolution underway, at the other extreme. As Richard Feynman famously said about nanoscale devices, "There's plenty of room at the bottom." Folks, I'm talking edge devices, the world of streaming data sensors that is changing the nature of computing and increasingly, of science as well. This is the subject of the forthcoming Big Data and Extreme-scale Computing (BDEC) report.
Computing History 101
The history of computing is one of punctuated equilibrium, with each pendulum swing, from centralized to distributed, bringing new and unexpected changes. It has also been a story of disruption from below, as successive generations brought order of magnitude shifts in price-performance ratios and sales volumes. Mainframes begat minicomputers, which begat workstations, which begat personal computers, which begat smartphones. The same has been true in scientific computing. Monolithic supercomputers (think the original Cray series) gave way to symmetric multiprocessors (SMPs) and distributed memory supercomputers. These gave way to clusters, now augmented with accelerators. Mind you, I still miss the lights on the Thinking Machines CM-5.
In both technical and "mainstream" computing, our largest systems are of unprecedented scale, whether globe spanning networks of massive cloud data centers or petascale and soon-to-be exascale HPC systems. As the Internet of Things sweeps across the computing landscape like a tsunami, one should ask about its analog in technical computing – the burgeoning networks of sensors that now capture science, engineering, environment, health and safety, and social data. (See Mobility, Transduction, and Garages)
The Sensor Edge
Today's smartphone-cloud ecosystem places most of the intelligence and data in the cloud. Indeed, all of the intelligent assistants – Siri, Cortana, and others – run in the cloud, dependent on precious wireless network bandwidth and relying on both behavioral history and deep learning algorithms. Given the limits of smartphone processing speeds, memory capacity, power budgets, and battery lifetimes, as well as human expectations and physiology, there are good economic and technical reasons for this design.
There are also disadvantages, particularly in science. First, transmitting raw, time-sensitive data to central sites places major demands on national and international networks. When the Large Hadron Collider (LHC) was under construction, this bandwidth and data processing demand was a major concern, triggering an international collaboration to build a distributed data storage hierarchy; those concerns remain as luminosity upgrades to the LHC are being planned.
The planned data pipeline for the Large Synoptic Survey Telescope (LSST) will send the raw data from the telescope in Chile to a dedicated supercomputer at NCSA, with a sixty second (maximum) delay before global alerts of transient phenomena must be distributed. Although this centralized approach can and will work when the daily data volume is measured in terabytes, the data deluge of the Square Kilometer Array (SKA) will require in situ reduction.
My friends in Chicago's Array of Things project grappled with another aspect of this local/remote tradeoff when designing urban sensors to capture and analyze human interactions. Communities were happy to support social dynamics analysis, but they did not want Big Brother video to leave the sensors. These privacy and social constraints require low-power, high-performance machine learning solutions on the sensors themselves.
Scale Challenges
Centralization also brings operational and security risks. My former Microsoft colleague, James Hamilton (now at Amazon), when asked why cloud data centers are not even bigger, said, "… the cost savings from scaling a single facility without bound are logarithmic, whereas the negative impact of blast radius is linear." We have all experienced the blast radius effect, at least metaphorically; when a cloud data center goes down, it affects both social networks and business, and it is international news. There is, perhaps, a salutary lesson here for the ultimate, maximum size of supercomputers, but that is a discussion for another day.
In addition, the stateless design of the venerable TCP/IP stack includes no notion of data geo-location or intermediate data processing. Everyone who has debugged "last mile" network performance problems knows this all too well. To remediate this, research groups and companies have designed a variety of domain-specific network overlays, data caches, publisher-subscriber systems, and content delivery networks (CDNs).
Dramatic declines in computing costs mean it is now possible to deploy truly large numbers of inexpensive sensors. How dramatic? I have posed the following question on my campus: what would you do with 10,000 disposable ~$10 sensors? I was met with surprise at the scale of my question. For many of us, even those who came of age in the smartphone era, this scaling is a difficult question to answer because it defies our worldview of the possible. It is not a science fiction fantasy; it is a present reality.
A traditional hub and spoke model of data storage and sensor data analysis is increasingly neither scalable nor reliable. Amazon's Greengrass architecture is an early example of a generic, distributed approach, while NVidia's Metropolis system targets video in smart cities. There is an even more general model, where generic services support distributed data caching and processing throughout the network. Micah Beck (Tennessee) has made a thoughtful case for the design of such an approach, which he calls an Exposed Buffer Processing Architecture. Similarly, the Array of Things leaders have argued that edge processing is essential when
The data volume overwhelms the network capability
Privacy and security encourage short-lived data
Resilient infrastructure requires local actuation/response
Latency constraints demand the response times of dedicated resources
Simply put, these requirements imply a much more complex, distributed data storage and processing architecture for data-driven scientific computing than we have developed to date.
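To make the idea concrete, here is a minimal sketch of edge-side data reduction in the spirit of the requirements above. The sensor read function, aggregation window, and upstream endpoint are hypothetical placeholders; a real deployment (Greengrass, Array of Things, or otherwise) would look quite different.

```python
# Minimal sketch of edge-side data reduction (illustrative only).
# The sensor read, window size, and upstream send are hypothetical
# placeholders, not the API of any particular edge framework.
import statistics
import time

WINDOW_SECONDS = 60          # aggregate locally over one-minute windows
ANOMALY_THRESHOLD = 3.0      # forward raw samples only for outliers (z-score)

def read_sensor():
    """Placeholder for a device-specific sensor read."""
    raise NotImplementedError

def send_upstream(summary):
    """Placeholder for a network send to the central facility."""
    print("upstream:", summary)

def edge_loop():
    window, start = [], time.time()
    while True:
        window.append(read_sensor())
        if time.time() - start >= WINDOW_SECONDS:
            mean = statistics.mean(window)
            stdev = statistics.pstdev(window) or 1.0
            outliers = [v for v in window if abs(v - mean) / stdev > ANOMALY_THRESHOLD]
            # Only the summary and anomalous samples leave the device; the raw
            # stream is discarded locally, addressing bandwidth and privacy.
            send_upstream({"count": len(window), "mean": mean,
                           "stdev": stdev, "outliers": outliers})
            window, start = [], time.time()
```

The design choice is the point: the raw data never crosses the network, only a compact summary and the rare events that warrant central attention.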
Supercomputing Edge Manifesto
This brave new world of streaming data has profound implications for the future of supercomputing. Here are just a few:
Streaming data is continuous and requires a reconceptualization of workflow and scheduling systems. Periodic, batch processing is necessary but not sufficient.
Sensor scale and complexity span many orders of magnitude, from inexpensive, disposable devices to nation-scale instruments. New orders of magnitude always expose new bottlenecks and necessitate architectural change to identify and decompose domain-specific and domain-independent elements.
Systemic resilience takes on new urgency, extending beyond central HPC to include networks and distributed devices.
Machine learning, modeling, and data assimilation apply concurrently at many sites and multiple levels, subject to energy and packaging constraints at the edge.
We in HPC are not the only players in this space. Much as we learned and adapted from the open source Linux community, we must learn from and embrace distributed data analytics from the sensor community.
As the name implies, the Big Data and Extreme-scale Computing (BDEC) project has been examining this nexus and the changing nature of scientific discovery enabled by the fusion of big data – from sensors large and small – machine learning, and computational modeling. It is a rapidly shifting environment, with new hardware and software ecosystems still evolving.
Much of it remains largely unknown to the traditional HPC community, just as software containerization and machine learning were just a few years ago. That will change very soon. If you are attending SC17, come to the Birds of a Feather (BOF) discussion, late Wednesday, November 15.
A Literary Coda
In his book, Hell's Angels, the inimitable gonzo journalist Hunter S. Thompson once wrote about living on the edge – riding flat out – in the dark – on a motorcycle:
The Edge…There is no honest way to explain it because the only people who really know where it is are ones who have gone over. The others–the living–are those who pushed their control as far as they felt they could handle it, and then pulled back, or slowed down, or did whatever they had to when it came time to choose between Now and Later.
In supercomputing, we have always felt the need for speed, exploiting every new technology in the endless search for higher performance and greater capacity. It's time for the supercomputing community to come to the edge, where sensors and streaming data meet HPC, computational models, and machine learning.
There's lots of room at the edge for innovation and discovery. Let's not crash in the dark.
When Larry Smarr, the founding director of the National Center for Supercomputing Applications (NCSA), quietly confided to me in the fall of 1999 that he was moving to UCSD, I was Head of the Department of Computer Science, in the midst of the first dot.com boom, itself enabled by NCSA Mosaic and the first web revolution. Netscape and its Illinois alumni were remaking Silicon Valley; Max Levchin had just founded PayPal while a student in the department; alumnus Tom Siebel was flying high with new CRM technology at Siebel Systems; and we had just celebrated HAL's birthday (from 2001: A Space Odyssey) with a massive Cyberfest event that attracted worldwide attention. ("I am a HAL 9000 computer, production number 3; I became operational at the HAL plant in Urbana Illinois on January 12, 1997.")
Students were beating down the doors, begging to get into computer science; we were hiring faculty at a frantic pace, unable to keep up with demand; venture capitalists were calling, and virtual reality (VR) was all the rage. In many ways, it was much like today's deep learning and VR boom, powered by cloud services and big data. As Yogi Berra might say, it's like déjà vu all over again, as we consistently underestimate the power of computing to reshape our society.
When Larry and I talked about the future and the NCSA directorship in the fall of 1999, I already had had a long association with NCSA. I had worked on joint research projects, and I had done some of the early visualization of web traffic when NCSA's web server had been the world's busiest. I was also the leader of the data management team – what we would now call big data – for the National Computational Science Alliance (the Alliance), which was anchored by NCSA as part of the NSF Partnerships for Advanced Computational Infrastructure (PACI) program. Based on that conversation with Larry, it was clear that both NCSA and my life were about to change in some profound ways. However, despite the universal sense that computing was a revolutionary force, none of us could have predicted just how profound those changes would be. I am very proud of the NCSA team and humbled by what they accomplished.
In the space of four years, we broke ground on three new buildings, deployed the first Linux clusters in national allocation by NSF, negotiated creation of a 40 gigabit per second transcontinental optical network for research, developed an early GPU cluster based on Sony PlayStation 2's, launched the NSF TeraGrid, and committed to support the Large Synoptic Survey Telescope (LSST). In between all that, I joined the President's Information Technology Advisory Committee (PITAC), there were national meetings (the High-end Computing Revitalization Task Force (HECRTF)) and Congressional hearings about the future of supercomputing, and plans for petascale systems set the stage for what would become Blue Waters. It was, in a very real way, an inflection point for the nature of computational science and scientific computing that continues today.
At the end of the 1990s, custom-built, massively parallel systems (MPPs) such as the Thinking Machines CM-5 and SGI Origin 2000 dominated high-performance computing, having displaced vector machines such as the Cray Y-MP. As commodity PCs increased in performance, NCSA and the Alliance began experimenting with home-built Linux clusters, and in 2001, we deployed the first two of these – a one teraflop IA-32 cluster and a one teraflop IA-64 Itanium cluster.
As strange as it may seem now, with Linux clusters the unquestioned standard for scientific computing, this was then viewed as a radical and highly risky decision. People often asked me, "Who do you call for support when something fails?" and "Who is responsible for the software?" The answers, of course, were that we were galvanizing a community and building a new support model, as we reinvented the very nature of scientific computing. NCSA led the way, as it has throughout its history.
What did not happen is not as well known. We had an opportunity to launch a big data initiative, in partnership with Microsoft. In 2000, we secured a contingent $40M gift from Microsoft for a Windows-based data analytics system that would complement the Linux computational infrastructure, bringing database technology to high-performance computing. Alas, we did not receive the necessary federal funding. I will always wonder if we might have kickstarted the big data revolution a decade earlier. The irony is not lost on me that a few years later I would find myself leading the eXtreme Computing Group at Microsoft to design next-generation cloud computing infrastructure in support of deep learning and cloud services, using lessons and ideas from high-performance computing.
Despite the data analytics setback, the success of the two Linux clusters provided the evidence needed for us to propose and deploy the NSF TeraGrid, connecting NCSA, SDSC, Argonne and Caltech in the world's largest open computing infrastructure for scientific discovery. Anchored by NCSA and SDSC, and in partnership with Intel and IBM, the TeraGrid brought distributed Grid services into national production, and the subsequent expansion of the Extended Terascale Facility (ETF) created what became the NSF XSEDE program.
The TeraGrid also brought a dramatic increase in national bandwidth, as Charlie Catlett and I negotiated with Qwest to create a 40 gigabit per second wide area network that connected the Illinois and California TeraGrid sites. Our goal was to catalyze a new way of thinking about big data pipes. Ironically, Charlie and I almost nixed this deal, as we had been seeking an even faster 160 gigabit per second connection, only to settle for 40. To put this in perspective, at the time, the Internet2 transcontinental backbone operated at only 2.5 gigabits per second.
While system building, we also had great fun, made possible by the truly wonderful people who were and are the NCSA faculty, staff and students. We built (display) walls and played with smart badges for the SC02 conference. Donna Cox and her team created incredible videos of severe storms and galactic evolution that illuminated the beauty of scientific discovery. Rob Pennington and his staff bought PlayStation 2 systems on eBay and built a GPU-accelerated Linux cluster that caught the attention of the NY Times and helped usher in today's GPU clusters.
In the midst of all this, we ran out of space – offices and computing facilities. The University of Illinois committed to an expansion of the Advanced Computation Building (ACB) that allowed us to deploy the terascale Linux clusters and the TeraGrid. ACB had originally been built to house ILLIAC IV and served as the machine room for NCSA, but the new clusters required more power, more cooling and more space. We designed the ACB expansion with plans for yet another expansion, one rendered unnecessary by the construction of the National Petascale Computing Facility.
I must give enormous credit to the university leadership team, because they built the ACB expansion before we had secured the NSF funding for new machines. I made a handshake $100M deal with then Provost Richard Herman: if he built the ACB expansion, NCSA would fill it with hardware. Both of us took a huge political risk, but we honored that commitment. It was just one event in a long history that reflects the repeated willingness of the University of Illinois to invest in the future.
Collectively, the team also secured funding for the NCSA office building and the new Siebel Center for Computer Science, creating an informatics quadrangle that continues to support student, faculty and staff discovery. During this time, I distinctly remember talking to Tom Siebel about the old Digital Computer Laboratory (DCL), which he described as looking like a prison. I remarked that he could help fix that, which began a long series of campus conversations that culminated in his remarkable $32M gift. Likewise, I was pleased that the NCSA building finally came to fruition, after many years of delays. It is not well known, but the two buildings were designed to allow them to be connected via an extension to the NCSA building and an arch extending from the northwest corner of Siebel Center.
It was an exhilarating and exciting time. None of it would have happened without an incredible team. We built on Larry Smarr's groundbreaking vision for a supercomputing center. We created machines, software and tools; we engaged researchers, scholars and artists; we invented the future. That is – and always will be – NCSA's mission. Congratulations on 30 great years!
What's a cloud? The word means many things to many people. In true Airplane! movie parody style, one is tempted to say, "It's a visible mass of condensed water vapor floating in the atmosphere. But, that's not important right now."
Some might say that our cloud confusion is indicative of a fog, a low hanging cloud. Indeed, in public perception, the cloud is a vaguely understood and amorphous capability that lurks behind our smartphones, enabling electronic communications, web searches, social networks, e-commerce, and 24x7 news and streaming media. It is a mysterious "something, somewhere" that is only noticed when it is inaccessible.
For those of us in computing, the cloud is more than a buzzword or meme; it is a set of technologies that are transforming the provisioning and delivery of computing cycles, storage and services, with profound implications for government services, research discovery and innovation, and businesses large and small. Given the public confusion and sometimes overhyped marketing, it seems worthwhile to dispel the fog and illuminate a few compelling features of clouds.
In a very real sense, public clouds – those hosted by cloud providers for on-demand access – have much in common with the timesharing systems of the 1960s and 1970s, albeit at unprecedented scale. They shift the capital costs of provisioning, along with the expertise required for operation, from users to providers, allowing users to focus on their core expertise (e.g., research, business, government) and pay for only what computing they need, when they need it.
In turn, cloud elasticity allows servers to be provisioned and de-provisioned dynamically in response to changing interest and demand. One of the few things worse than having your brilliant idea ignored is having it fail publicly from too much attention. Many an organization has first been thrilled to see their product or service be reported in the press and go viral, only to be horrified when on-premise servers collapse under the load of success; this is known as the slashdot effect, named after one of the popular "news for nerds" web sites. Without cloud elasticity, the global phenomenon of Pokémon Go would not be possible.
Although not required, many clouds also virtualize the underlying computing, allowing multiple services and software stacks to be co-resident on the same physical hardware, further reducing costs. With Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS) and containerization, developers can choose the level of abstraction, from low level to high level, needed to support existing and new applications.
By allowing innovators to focus on their ideas rather than computing infrastructure, the pay-as-you-go cloud model has made it easier, faster and cheaper to bring ideas to life, unleashing a second dot.com boom of cloud-based services, unprecedented citizen access to government data, and new approaches to scientific discovery via big data.
The latter has sometimes been called the fourth paradigm, where new sensors, data analytics and deep learning have shifted the research approach from "what experiment should I conduct" to "what insights can I extract from available data." It has also stimulated interest in convergence architectures for high-performance computing (HPC) and data analytics that combine the best elements of cloud and HPC capabilities. (See my essay, Exascale Computing and Big Data: Time to Reunite.)
Implicit in this client-cloud model are ubiquitous, reliable and inexpensive wired and wireless broadband services. Without broadband access, clouds are simply isolated sets of computers and data, unconnected to the world of mobile devices and the nascent Internet of Things. To further ensure reliability, most public cloud providers geo-distribute their pools of servers. Even private clouds (i.e., those operated by governments and companies for their own use) often geo-distribute their servers at multiple data centers. Should disaster strike at one site, services can be shifted automatically to another site, perhaps even in another country or continent.
Finally, the unprecedented scale of cloud deployments, with commercial data centers each costing over $1B, has stimulated new approaches to energy efficiency and cooling, component failures and reliability, customized and open hardware, software defined networks (SDNs) and software defined storage (SDS), and system provisioning and management. In short, we are in a time of radical change in the computing ecosystem, all against a backdrop of Moore's law limitations.
What's a cloud? It's an elastic, on-demand, often geo-distributed, frequently virtualized, flexible hardware, software and services infrastructure that powers the 21st century knowledge economy. And that is really important right now.
N.B. An abridged version of this essay was submitted as a white paper to the BDEC Frankfurt workshop.
I am going to be a bit radical and suggest we must reason in new ways about the need to integrate high-performance computing (HPC) and big data analytics. That must begin with our thinking about how we label and discuss them. Any expert in journalism, communications or psychology will immediately and unhesitatingly affirm that names have extraordinary power. If you doubt that, look at the philosophy of proper names and the cultural history associated with true names. And if that's not convincing, ask any child who was taunted because his or her name rhymed with something unfortunate.
Names shape our thinking, define our discourse and reinforce our biases and perspectives. We use names and descriptors to categorize people, organizations and objects; we use names to define the positive and the negative; and we use names to market, reward and punish. Brand equity has real value. Sadly, as many studies have shown, names also play on stereotypes and our implicit bias. Take Harvard's Project Implicit bias test; you will be humbled and chastened as it exposes your own bias.
High-performance computing (HPC) and big data/machine learning are not exempt from the power of names and a "they could learn from us if they'd only listen" dichotomy. Done any low performance computing (LPC) lately? Analyzed any small data? Of course, you did, when you checked your email and text messages on your smartphone a few minutes ago, but you did not think about it in those terms. In fact, you ran a trans-petascale computation and staged data across a global set of fiber optically connected data caches (a content delivery network) when you searched for a coffee shop yesterday. But, none of us call it that – it's just a quick web search in the ill-defined and amorphous cloud to which we are connected via a global broadband (wired and wireless) infrastructure.
Semantic Cul-de-Sacs
I suspect we are trapped in a semantic cul-de-sac when we talk about high-performance computing and computational science. They are code words for physics-inspired numerical solution of partial differential equations; all too frequently, everything else is potentially suspect and inferior. The denotation of the word exascale says nothing about computing at all, but the connotation of FLOPS and batch-oriented HPC platforms is there in all of our minds. In turn, deep learning and big data conjure images of computer science types tuning neural nets and recommender systems in cloud data centers for targeted product marketing.
It's time to stop categorizing and name calling, time to end cultural and disciplinary silos. There is no hierarchy of needs or intellectual purity. There are just ideas and people who can learn from one another. Dare I say it – death to HPC; death to big data. It's time to stop the religious wars over HPC and big data cultures and technologies and focus on informatics-mediated discovery and innovation. The goal of an HPC center should not be self-perpetuation, either of infrastructure or organization. Nor should a cloud service be restricted to commercial domains. That means starting with intellectual outcomes and deriving approaches rather than defining approaches and searching for feasible intellectual outcomes.
From False Dichotomy to Unification
There are many technical ways to envision and realize the continuum and fusion of traditional HPC and computational science with big data and machine learning. These include integration of stream-based workflows and just-in-time schedulers, containerization and software stack packaging for application-specific configuration, fine-grained parallelization of learning packages, learning for algorithm and software adaptation and tuning, custom ASIC designs (see Google's recent TPU announcement), software defined networks and storage (SDN and SDS), and renewable and energy efficient hardware design and configuration. There are an equally large number of social and economic approaches, including denominating informatics costs in currency to illuminate investment priorities, spot pricing for priority access, economics-driven infrastructure selection and deployment, and institutional and funding agency policies for data management.
What we cannot do is allow ourselves to be bound by names, labels and cultures. Richard Hamming wisely noted, "The purpose of computing is insight, not numbers." HPC and machine learning are not goals, they are enablers. And that's the major consensus narrative.
The Raspberry Pi, a credit card sized, $35 computer, is a runaway success, exceeding even the wildest expectations of anyone in the hobbyist or educational communities. The latest version contains a 900 MHz quad-core ARM processor, 1 GB of DRAM, multiple USB ports, a 10/100 Ethernet port, an HDMI interface and 40 GPIO pins (for external device interfaces). Permanent storage is via a Micro SD card slot.
Assuming one has either an HDMI-capable digital television or desktop computer display, the only additional requirements are a USB keyboard/mouse and a Micro SD card of at least 8 GB capacity. Chances are most of us have all of these; if not, another $30 suffices for all but the display. Thus configured, the Raspberry Pi can run several variants of Linux, as well as the forthcoming embedded version of Windows 10.
All of this takes me back to a time in the early 1980s, when I walked to graduate school every day (in the snow, uphill both ways), Pac-Man was a state of the art video game, and the Rubik's Cube was bringing recreational group theory to the masses (though they didn't know it). A DEC VAX 11/780 with 2-8 MB of DRAM running the Berkeley Software Distribution (BSD) of UNIX was the de facto standard research environment for academic computer science.
At a time when batch computing and timeshared university mainframes defined academic research, these BSD VAX systems, along with the National Science Foundation (NSF) Coordinated Experimental Research (CER) program and the CSNET academic network, were transformative. It is no exaggeration to say that modern, experimental computer science can be traced directly to this environment. All of which is why Bill Joy once called the 1 MIPS of the VAX 11/780 the "electron volt of computing."
The Raspberry Pi is a more capable system in almost every way than that departmental VAX – faster, with a larger memory and bigger secondary storage system – as well as having a richer base of open source software and access to the global Internet. It is a very tangible and inexpensive manifestation of Moore's Law. However, to quote from the movie Airplane!, that's not important right now. (I am serious. And don't call me Shirley.)
What is important is the transformative effect of the Raspberry Pi and related, inexpensive computing devices on both education and innovation. (The Kickstarter-funded Adapteva Parallela multicore processor is another.) Raspberry Pi's now serve as robot controllers, research sensors, web servers, and software development platforms; they are the darling of hobbyists and the maker movement.
However, the creation nearest and dearest to my heart has been the widespread development of Raspberry Pi Beowulf clusters for parallel computing education. These systems use inexpensive Ethernet switches to create systems similar to the Caltech Cosmic Cube, a seminal design for message-based parallel computing created by Chuck Seitz and Geoffrey Fox in the early 1980s.
Although not fast by today's parallel computing standards, these Raspberry Pi clusters allow students to construct complete parallel systems, ranging from small, development clusters (4-8 nodes) to larger (40-120+ nodes) research systems. More importantly, they expose all aspects of hardware and software configuration, bringing experiential parallel computing to a broad new class of students. (The construction recipe for ORNL's Tiny Titan is one of many such examples.) Even more importantly, the clusters let students run real-world parallel applications.
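As a small illustration of the kind of exercise these clusters make possible, here is a minimal MPI sketch written against the mpi4py bindings (one of several MPI options that run comfortably on a Raspberry Pi cluster); the program and its parameters are illustrative, not part of any particular curriculum.

```python
# Estimate pi by numerical integration of 4/(1+x^2) on [0,1],
# with the work split round-robin across all MPI ranks.
# Run on a cluster with, for example: mpiexec -n 8 python pi_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

N = 10_000_000                  # total number of rectangles
h = 1.0 / N
local_sum = 0.0
for i in range(rank, N, size):  # each rank takes every size-th rectangle
    x = h * (i + 0.5)
    local_sum += 4.0 / (1.0 + x * x)

# Sum the partial results on rank 0 and report the estimate.
pi = comm.reduce(local_sum * h, op=MPI.SUM, root=0)
if rank == 0:
    print(f"pi ~= {pi:.8f} using {size} processes")
```

Trivial as it is, timing this program on 1, 4, and 40 nodes teaches students more about scaling, communication cost, and load balance than any lecture slide.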
I would have dearly loved to have one of these clusters for parallel computing research back in the early 1980s. At that time, Richard Fujimoto and I were writing Ph.D. theses at UC-Berkeley and Purdue, respectively, about various aspects of message passing machines. We then collaborated on a book about what we then called multicomputer networks. We have come a long, long way since then.
Yes, I will have some Raspberry Pi. In fact, I'll have a cluster.
N.B. I also write for the Communications of the ACM (CACM). The following essay recently appeared on the CACM blog. In addition, you can find the video of an interview and the full text of the CACM article, co-authored with Jack Dongarra.
In other contexts, I have written at length about the cultural and technical divergence of the data analytics (aka machine learning and big data) and high-performance computing (aka big iron) communities. I have euphemistically called them "twins separated at birth." (See HPC, Big Data and the Peloponnesian War and Scientific Clouds: Blowin' in the Wind.) Like all twins, they share technical DNA and innate behaviors, despite their superficial differences. After all, in a time long, long ago, they were once united by their use of BSD UNIX and Sun workstations for software development.
Since then, both communities have successfully built scalable infrastructures using high-performance, low cost x86 hardware and a rich suite of (mostly) open source software tools. Both have addressed ecosystem deficiencies by developing special-purpose software libraries and tools (e.g., SLURM and Zookeeper for resource management and MPI and Hadoop for parallelism), and both have optimized hardware for their problem domains (e.g., Open Compute hardware building block standardization, FPGAs for search and machine learning, and GPU accelerators for computational science).
Like many of you, I have seen this evolution firsthand, as a card-carrying geek in both the HPC and cloud computing worlds. One of the reasons I went to Microsoft was to bring HPC ideas and applications to the nascent world of cloud computing. While at Microsoft, I led a research team to explore energy-efficient cloud hardware designs and new programming models, and I launched a public-private partnership between Microsoft and the National Science Foundation on cloud applications. Now that I am back in academia, I am seeking to bring cloud computing ideas back to HPC.
In that spirit, Jack Dongarra and I recently co-authored an article for the Communications of the ACM on the twin ecosystems of HPC and big data and the challenges facing both. Entitled, Exascale Computing and Big Data, the article examines the commonalities and differences, and discusses many of the unresolved issues associated with resilience, programmability, scalability, and post-Dennard hardware futures. Most importantly, the article makes an impassioned plea for hardware and software integration and cultural convergence.
The possibilities for this convergence are legion. The algorithms underlying deep machine learning would benefit from the parallelization and data movement minimization techniques commonly used in HPC applications and libraries. Similarly, the approaches to failure tolerance and systemic resilience common in cloud software have broad applicability to high-performance computing. Both domains face growing energy constraints on the maximum size of feasible systems, necessitating shared focus on domain-specific architectural optimizations that maximize operations per joule.
Perhaps most important of all, there is increasing overlap of application domains. New generations of scientific instruments and sensors are producing unprecedented volumes of observational data, and intelligent, in situ algorithms are increasingly required to reduce raw data and identify important phenomena in real time. To see this, one need look no further than applications of machine learning to astronomy, which now include automated object identification. Conversely, client plus cloud services are increasingly model-based, with rich physics, image processing and context that depend on parallel algorithms to meet real-time needs; augmented reality applications are one such exemplar.
The explosive growth of Docker and containerized software management speaks to the need for lightweight, flexible software configuration management for increasingly complex and rich software environments. My hope is that we can develop a unified hardware/software ecosystem that leverages the technical and social strengths of each community. Each would benefit from the experiences and insights of the other. It is past time for the twins to have a family reunion.
I am seeking a post-doctoral research associate to join a National Science Foundation (NSF) project on high-performance computing system reliability and energy efficiency modeling. If you are interested, or you know someone who might be interested, the details of the position are below.
Post-Doctoral Research Associate
Computer System Reliability and Energy Efficiency Modeling Research Project Description
As node counts for high-performance computing systems grow to tens of thousands and with proposed exascale systems likely to contain hundreds of thousands of nodes, overall system reliability and energy consumption are increasingly critical issues. New approaches to balance hardware, software and support costs are needed to address systemic resilience. Likewise, the rising energy requirements of ever-larger high-performance computing systems now pose limits on the practicality of their deployment, due to both energy availability and cost.
To make larger systems usable and cost effective, we must develop and adopt new design and operational models that embody two important realities of large-scale systems: (a) frequent hardware component failures are part of normal operation and (b) system and application optimization must be multivariate, including energy cost and efficiency as complements to performance and scalability. New design ideas drawn from commercial cloud computing, including adaptable designs for hardware failure and energy efficiency, are needed if proposed exascale designs are to be feasible, much less practical.
The project focuses on development of (a) scalable, analytic and simulation models for hardware performability (performance plus reliability) based on the principle of near-complete decomposability, (b) assessment and sizing of zero-touch field replaceable hardware modules (FRMs) to reduce hardware repair errors and total cost of ownership (TCO) models, (c) energy-aware batch scheduling models that incorporate bounds on energy availability and energy costs and (d) user resource allocation cost models with energy as a cost proxy.
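To give a back-of-the-envelope sense of why such models matter, the sketch below computes system mean time to failure and job completion probability under the common simplifying assumption of independent, exponentially distributed node failures; the numbers are purely illustrative and are not the project's models.

```python
# Illustrative failure-scaling arithmetic, assuming independent,
# exponentially distributed node failures (a simplification; the project's
# performability models are far richer than this).
import math

def system_mttf(node_mttf_hours: float, nodes: int) -> float:
    """System MTTF when any single node failure interrupts a job."""
    return node_mttf_hours / nodes

def p_job_completes(job_hours: float, node_mttf_hours: float, nodes: int) -> float:
    """Probability a job of the given duration sees no node failure."""
    return math.exp(-nodes * job_hours / node_mttf_hours)

# Example: 100,000 nodes, each with a 5-year (~43,800 hour) MTTF.
nodes, node_mttf = 100_000, 43_800.0
print(f"system MTTF: {system_mttf(node_mttf, nodes):.2f} hours")         # ~0.44 hours
print(f"P(24-hour job completes): {p_job_completes(24, node_mttf, nodes):.1e}")
```

Even with very reliable individual nodes, a long-running job at this scale is almost certain to be interrupted, which is precisely why resilience and checkpointing must be treated as first-class design variables alongside energy and performance.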
Desired Skills
Candidates should have a PhD in computer science, electrical and computer engineering or an allied discipline. Experience in computer system simulation, analytic modeling, high-performance computing systems and parallel applications is highly desirable. The successful candidate will be expected to work independently on original research problems related to the project and help coordinate the activities of PhD students.
In addition to this blog, I also write for the Communications of the ACM (CACM). I recently posted an essay on the future of exascale computing there. As I noted in the CACM blog, we need a catastrophe – in the mathematical sense – a discontinuity triggered by a sustained research and development program that combines academic, industry and government expertise to leap the chasm of technical challenges we now face.
You can read the hearing charter and my extended, written testimony on the hearing web site and watch an archived video of the hearing. In my written and oral testimony, I made four points, along with a specific set of recommendations. Many of these points and recommendations are echoes of my previous testimony, along with recommendations from many previous high-performance computing studies.
With that backdrop, here is what I said during the hearing.
Oral Testimony
First, high-performance computing (HPC) is unique among scientific instruments, distinguished by its universality as an intellectual amplifier.
New, more powerful supercomputers and computational models yield insights across all scientific and engineering disciplines. Advanced computing is also essential to analyzing the torrent of experimental data produced by scientific instruments and sensors. However, it is about more than just science. With advanced computing, real-time data fusion and powerful numerical models, we have the potential to predict the tracks of devastating tornadoes such as the recent one in Oklahoma, saving lives and ensuring our children's futures.
Second, the future of U.S. computing and HPC leadership is uncertain.
Today, HPC systems from DOE's Oak Ridge, Lawrence Livermore and Argonne National Laboratories occupy the first, second and fourth places on the list of the world's fastest computers. One might surmise that all is well. Yet U.S. leadership in both deployed HPC capability and in the technologies needed to create future HPC systems is under challenge.
Other nations are investing strategically in HPC to advance national priorities. The U.S. research community has repeatedly warned of the eroding U.S. leadership in computing and HPC and the need for sustained, strategic investment. I have chaired many of those studies as a member of PITAC, PCAST, and National Academies boards. Yet these warnings have largely gone unheeded.
Third, there is a deep interdependence among basic research, a vibrant U.S. computing industry and HPC capability.
It has long been axiomatic that the U.S. is the world's leader in computing and HPC. However, global leadership is not a U.S. birthright. As Andrew Grove, the former CEO of Intel, noted in his famous aphorism, "only the paranoid survive." U.S. leadership has been repeatedly earned and hard fought, based on a continued Federal government commitment to basic research, translation of research into technological innovations, and the creation of new products.
Fourth, computing is in deep transition to a new era, with profound implications for the future of U.S. industry and HPC.
U.S. consumers and businesses are an increasingly small minority of the global market for mobile devices and cloud services. We live in a "post-PC" world where U.S. companies compete in a global device ecosystem. Unless we are vigilant, these economic and technical changes could further shift the center of enabling technology R&D away from the U.S.
Recommendations for the Future
First, and most importantly, we must change our model for HPC research and deployment if the U.S. is to sustain its leadership. This must include much deeper and sustained interagency collaborations, defined by a regularly updated strategic R&D plan and associated verifiable metrics, and commensurate budget allocations and accountability to realize the plan's goals. DOE, NSF, DOD, NIST and NIH must be active and engaged partners in complementary roles, along with long-term industry engagement.
Second, advanced HPC system deployments are crucial, but the computing R&D journey is more important than any single system deployment by a pre-determined date. A vibrant U.S. ecosystem of talented and trained people and technical innovation is the true lifeblood of sustainable exascale computing.
Finally, we must embrace balanced, "dual use" technology R&D, supporting both HPC and ensuring the competitiveness of the U.S. computing industry. Neither HPC nor big data R&D can be sacrificed to advance the other, nor can hardware R&D dominate investments in algorithms, software and applications.