Twenty years ago, while director of the National Center for Supercomputing Applications (NCSA) at the University of Illinois, I gave a series of presentations on the future of cyberinfrastructure. Looking through my presentation archives, I found this diagram, which emphasized building up (i.e., larger and more capable computing systems and scientific instruments) and building out (i.e., ubiquitous access to computing systems and instruments of all sizes, including distributed sensor/actuator networks).
At the time, we did not understand that the fungal disease chytridiomycosis was in part responsible for widespread death and extinction among amphibian species. I suggested with some seriousness that we instrument selected biological environments with MEMS smart dust and track flora and fauna movements in situ. Or, as I put it at the time, “Why not give every frog an IP address?” Of course, understanding ecosystem evolution eventually became the raison d'être for the National Ecological Observatory Network (NEON) – more on that later. I also had several conversations with then National Science Foundation (NSF) Director Rita Colwell about the prospects for HPC-enabled in silico computational modeling of biological processes, something only now becoming possible. (See In Silico Whole Cell Modeling.)
With the perspective afforded by twenty years, how have we fared in realizing this integrated vision? Building up, we have done well in most contexts, but we now face the twin realities of escalating costs and technology challenges at very large scales, both for computing platforms and for scientific instruments. It is unclear how many multibillion-dollar instruments or computing platforms we have either the political or economic will to construct. As for building out, it is a mixed story of broader access to midrange commodity Linux clusters and new instruments, but we are far from having created a ubiquitous scientific infosphere.
Building Up
In 2002, the first terascale computing systems had just become operational, including the NSF TeraGrid, which connected NCSA, SDSC, Argonne National Laboratory, and Caltech via a then extraordinarily fast 40 Gb/s transcontinental network. (See NCSA@30: The Revolution Continues.) At the time, we were still dreaming of petascale data storage systems, and research discussions about the path to petascale computing were well underway. The 1996 and 1999 Caltech Petaflops workshops at Bodega Bay and Santa Barbara were seminal in shaping the future that became petascale computing. As this photograph shows, we have all aged a bit in the intervening twenty-five years, and sadly, a few of our colleagues are no longer with us.
Today, thanks to GPU accelerators, terascale and petascale computing systems are widely available, spanning desktops to laboratory and university cluster environments. In May 2022, the latest incarnation of the TOP500 list of (some of) the world’s fastest computers was revealed at the ISC High Performance conference in Hamburg, Germany. There was much hype and chest thumping as the Oak Ridge National Laboratory (ORNL) Frontier system broke the exascale barrier based on the high-performance LINPACK (HPL) benchmark for dense matrix factorization.
I say “some of” the world’s fastest computers because submission to the TOP500 list is entirely voluntary, and some organizations chose not to submit entries, for a variety of reasons, some political and economic, some technical, and some related to national security. In particular, China has at least two exascale systems (OceanLight and Tianhe-3), but has not submitted either of them to the TOP500 list, likely to avoid further inflaming tensions with the United States.
Meanwhile, the very definition of exascale is itself now subject to debate, both because the dense linear algebra of the HPL benchmark is widely acknowledged as no longer a good predictor of multidisciplinary application performance and because mixed precision arithmetic is now common in many applications, particularly those with AI-accelerated components. When convolved with system and chip energy constraints, semiconductor fabrication challenges, and ecosystem shifts, a sense of quo vadis now pervades the field. (In a recent essay and associated article, my colleagues Jack Dongarra, Dennis Gannon, and I offered a few thoughts on strategic directions.)
In the world of computing, the locus of innovation has shifted to the small and many (think smartphones and the Internet of Things) and the few and large (think commercial cloud infrastructure and leading edge HPC systems); scientific instruments are dominated by a similar dichotomy, though the large and few receive most of the attention and visibility. As for large-scale scientific instruments, one need look no further than the aging Hubble Space Telescope, the recently upgraded Large Hadron Collider (LHC), the International Thermonuclear Experimental Reactor (ITER) being built by a global consortium, the soon-to-be-operational Vera Rubin Observatory (née Large Synoptic Survey Telescope) and James Webb Space Telescope (JWST), or the three Extremely Large Telescopes now proposed or under construction. Biology has joined the physical sciences in large-scale instrumentation, spanning the Human Genome Project to the National Ecological Observatory Network (NEON).
Like leading edge supercomputers, these big instruments are the shiny objects featured in press releases and popular science stories, and well they should be, for they produce scientific results achievable via no other mechanism. Each involves hundreds, thousands, or sometimes tens of thousands of scientists and engineers; each requires billions of dollars, euros, or yuan to design, build, and operate; and each produces torrents of data that demand petascale data archives and global content distribution networks. Thanks to these new instruments, the associated explosion of scientific data has been hugely democratizing, allowing many researchers to focus on the key intellectual questions rather than the mechanics of data acquisition.
However, this brave new world and the shift from data paucity to plethora have also raised new questions about data preservation and triage, shared and individual responsibilities, FAIR access, and the costs of data retention, as mandated by government sponsors. (See Research Data Sustainability and Access and My Big Scientific Data Are Lonely.) As the late Nobel Laureate Herbert Simon once noted,
What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.
In my humble opinion, we have not yet fully internalized the social, political, economic, and technical implications of the shift from data scarcity to data ubiquity. What data we keep and for how long, how we manage access, and how we distribute the very real costs of data management and preservation are serious science policy questions that remain unresolved.
On Being the Right Size
Although our intellectual curiosity is rightfully unbounded, our financial resources are not. What are the right sizes and relative numbers for our computing platforms and our scientific instruments? Is it all about faster and bigger, which in almost any realistic budget scenario, also means fewer? In the limiting case, as illustrated by ITER and the LHC, fewer may well mean only one per planet, financially sustained only via a global scientific partnership.
Nor does the fewer and bigger choice affect only the high end; it squeezes investment in the long tail of campus and laboratory infrastructure as well. These are important science policy questions, especially when one realizes that the day-to-day, “bread and butter” work of science is conducted with many instruments of much smaller size, instruments that also educate students in scientific processes. (See On Being the Right Size: Science Scaling and Power Laws.)
Make no mistake; form and scale absolutely follow function, as the British geneticist J. B. S. Haldane first wrote in a now-famous essay entitled On Being the Right Size. (You can read the essay as originally published in Harper’s Magazine, skipping the political pontification at the end.) In it, Haldane offered cogent arguments about surface area to volume ratios, structures, respiration, and energy, noting:
The most obvious differences between different animals are differences of size … it is easy to show that a hare could not be as large as a hippopotamus, or a whale as small as a herring. For every type of animal, there is a most convenient size, and a large change in size inevitably carries with it a change of form.
Computing systems and scientific instruments are no different. A simple refractor telescope design cannot be expanded unchanged to create a 30-meter optical observatory; the materials properties do not scale linearly, and the control systems differ markedly. A child’s horseshoe magnet cannot be scaled to create a 100-tesla pulsed research magnet. Nor is an exascale computer just a big deskside cluster, despite some comparable elements. The lack of self-similarity at scale means new design approaches are needed in different operating regimes, each with associated design costs.
Building Out
Returning to my twenty-year-old diagram, we have made progress in building out, with midrange computing clusters now more common and new classes of midrange instruments (e.g., cryo-EM) changing the face of science. Cyberinfrastructure training materials are now widespread, and we even host student cluster construction competitions; building a Linux cluster is no longer an arcane art. Indeed, one can now build an inexpensive Raspberry Pi cluster with almost all the same software tools found on leading edge supercomputers.
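As a small illustration of that software commonality, here is a minimal sketch, assuming only a working MPI installation and the mpi4py package, of an MPI "hello" that runs unchanged on a Raspberry Pi cluster or a leadership-class machine; only the launcher arguments and node count differ.

```python
# A minimal mpi4py "hello" that runs unchanged on a Raspberry Pi cluster or a
# leadership-class machine; only the launcher and node count differ.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's ID within the communicator
size = comm.Get_size()   # total number of MPI processes

# Each rank reports its host name; rank 0 gathers and prints the roster.
names = comm.gather(MPI.Get_processor_name(), root=0)
if rank == 0:
    print(f"{size} ranks running on: {sorted(set(names))}")
```

Launched with something like mpiexec -n 4 python3 hello.py, the same file exercises the same message passing interface on four Raspberry Pis or on thousands of supercomputer cores.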
Despite this progress, our intellectual fascination with a small number of large instruments and computing platforms often blinds us to the power of distributed discovery and innovation with large numbers of instruments at smaller scales. Meanwhile, we have made enormous progress in sensor miniaturization, perhaps even more than Feynman envisioned in his seminal nanotechnology lecture, “There’s Plenty of Room at the Bottom.” (See Come to the Supercomputing Big Data Edge.) Even so, we have a long way to go to realize the vision of a ubiquitous infosphere. Let me illustrate with a specific example.
SAGE: A Software Defined Sensor Network
Over the past two years, I have greatly enjoyed playing a small part in the NSF-funded SAGE project, which is seeking to build a flexible and extensible hardware and software cyberinfrastructure for AI at the edge. Led by Pete Beckman, SAGE builds on many ideas from Charlie Catlett’s urban and environmental science Array of Things (AoT) project, which deployed sensors on utility poles across the City of Chicago and surrounding areas.
AoT engaged community residents in defining acceptable data use policies, explored data-driven social science by direct measurement, and stimulated citizen science and K-12 hands-on education. For a review of the AoT project and the motivating lessons for SAGE, I recommend this IEEE Xplore article, which several of us wrote recently.
SAGE is based on the notion of software-defined, configurable sensors that include edge computing, standard sensor interfaces, a set of baseline sensors, and extension interfaces for environment-specific sensors, along with backend data storage and analysis facilities.
A standard “wild SAGE” node (named for its hardened packaging and fault tolerant design to run unattended in harsh conditions, or “in the wild”) includes the following:
- One or two NVIDIA Xavier NX GPUs with Wi-Fi, Ethernet, and a full Linux stack
- Power over Ethernet (POE) sky and ground facing cameras
- An optical rain sensor and a microphone
- Relative humidity, barometric pressure, and temperature sensors in a Stevenson shield (that’s the box on the left in the photograph, which also contains a Raspberry Pi to manage the Bosch BME 680 sensor)
- POE and USB connectors for additional, domain-specific sensors
Why deploy AI at the sensor edge? First, the volume of streaming data may preclude storage, either due to bandwidth limitations or storage capacity. For these same reasons, the LHC’s detectors rely on real-time triggers to detect and record only a small subset of particle collisions. Second, urban and environmental monitoring bring strong security and privacy concerns, particularly involving video. By extracting events and features entirely at the edge, the raw video need never leave the sensor, obviating many privacy concerns (e.g., the timing and frequency of pedestrian interactions may suffice, without saving any identifying video). Third, AI at the edge means the sensor can adapt dynamically to changing conditions (e.g., increasing measurement frequency based on a detected event or even triggering alternative measurements at other locations). Fourth, latency constraints may demand a response time possible only when decisions are made at or near the edge.
In SAGE, the key to this flexibility is an end-to-end architecture – Waggle AI@Edge – that allows domain scientists to focus on their science, rather than building generic infrastructure. One need only connect new sensors, develop domain-specific AI plugins using widely available AI tools (e.g., OpenCV, PyTorch, and TensorFlowLite) to process sensor data in situ on SAGE nodes, and use tools such as Jupyter notebooks to analyze and correlate data.
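To make that plugin model concrete, here is a minimal sketch of what a domain plugin might look like. It assumes pywaggle's Plugin publishing interface (roughly as shown in the project's examples) and substitutes a stub for a real detector; the measurement name and camera index are illustrative. Note that only the derived count is published, so raw video never leaves the node.

```python
# Minimal sketch of a SAGE/Waggle-style edge plugin: grab a frame, run an
# in-situ detector, and publish only the derived measurement. The detector is
# a stub; the measurement name and camera index are illustrative assumptions.
import time
import cv2
from waggle.plugin import Plugin   # pywaggle; interface per project examples

def count_pedestrians(frame):
    # Stand-in for a real model (e.g., a PyTorch or TensorFlow Lite detector).
    return 0

def main():
    cap = cv2.VideoCapture(0)                    # deployment-specific camera
    with Plugin() as plugin:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            count = count_pedestrians(frame)     # inference happens on the node
            plugin.publish("env.pedestrian.count", count)
            time.sleep(30)                       # frame is discarded, never stored

if __name__ == "__main__":
    main()
```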
To demonstrate this integration capability, the SAGE project has been working with NEON to integrate inexpensive SAGE nodes for augmented environmental monitoring and with California and Oregon wildfire researchers to detect and respond to the growing prevalence of wildfires. There are now SAGE plugins for vehicular traffic detection, bird song classification, cloud cover, smoke, and other flora and fauna events.
At Utah, we have deployed one SAGE node at the Natural History Museum of Utah (NHMU), which allows us to track environmental conditions and local flora and fauna. Additional SAGE nodes are being deployed in downtown Salt Lake City to monitor PM 2.5 particulate air pollution and to study advanced wireless technologies in collaboration with the NSF POWDER project.
Much More Room at the Bottom
At ~$15,000 each, SAGE sensors are inexpensive compared to multibillion-dollar mega-instruments. Yet even this price is prohibitive for many research contexts, limiting their deployment in science and community education projects where resources are even more bounded and precious. Fortunately, there is much more room at the bottom. In this spirit, members of the SAGE education team are working on lower-cost SAGE versions based on the Jetson Nano and the Raspberry Pi.
However, imagine a cyberinfrastructure world containing even lower-cost sensors – tens of thousands of them at ~$50 each! Leveraging a wealth of inexpensive and open source hardware and software, it is now possible to create and deploy a powerful sensor infrastructure for just a few hundred dollars.
Let’s start with server infrastructure. A Raspberry Pi with a complete Linux software stack, including Docker and Kubernetes virtualization, a high-capacity SD card for system software and data storage, Wi-Fi, Ethernet, and a LoRaWAN long-distance communication transceiver can be assembled for approximately $100.
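As a rough illustration of the base station role, the sketch below subscribes to readings relayed through a local MQTT broker (for example, by a LoRaWAN gateway bridge) and appends them to a file on the SD card. It assumes the paho-mqtt client with 1.x-style callbacks; the broker location, topic names, and payload format are all assumptions.

```python
# Sketch of a Raspberry Pi base station: subscribe to sensor readings relayed
# by a local MQTT broker and append them to a CSV file on the SD card.
# Topic names and payload format are assumptions for illustration.
import csv, json, time
import paho.mqtt.client as mqtt

LOGFILE = "/home/pi/data/readings.csv"

def on_message(client, userdata, msg):
    reading = json.loads(msg.payload)            # e.g. {"node": "esp32-07", "temp_c": 21.4}
    with open(LOGFILE, "a", newline="") as f:
        csv.writer(f).writerow([time.time(), msg.topic, json.dumps(reading)])

client = mqtt.Client()                           # paho-mqtt, 1.x-style callbacks
client.on_message = on_message
client.connect("localhost", 1883)                # broker runs on the Pi itself
client.subscribe("sensors/#")                    # nodes publish under sensors/<node-id>
client.loop_forever()
```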
Such a Raspberry Pi can readily serve as a base station for hundreds of microcontroller nodes, each supporting a wide variety of ~$5 sensors. As an example, one of my former colleagues at Iowa fabricated the sensor node shown at right based on my specifications. This node includes the following (a minimal firmware sketch follows the list):
- Powerful ESP32 Tensilica microcontroller with Wi-Fi and Bluetooth LE
- Bosch BME 280 pressure, temperature, and humidity sensor
- LoRaWAN low bandwidth, long distance communication
- I2C, analog, and digital Grove solderless sensor connectors
- Battery and/or solar power interface for long-term episodic operation
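For the node side, here is a minimal MicroPython firmware sketch under stated assumptions: a community bme280 driver module is flashed alongside it, the I2C pins match the board, and the LoRa transmission is left as a placeholder because radio driver APIs vary by board. The node wakes, reads the sensor, transmits one reading, and returns to deep sleep; on wake-up the script simply runs again from the top.

```python
# MicroPython sketch for the ESP32 node: wake, read the BME280 over I2C,
# transmit one reading, then deep sleep to conserve battery/solar power.
# Assumes a third-party bme280 driver module; the LoRa send is a placeholder.
import machine
import bme280                                  # community driver, flashed with this script

SLEEP_MS = 5 * 60 * 1000                       # report every five minutes

def send_lora(payload):
    # Placeholder: hand the payload to the board's LoRa driver here.
    print("TX:", payload)

i2c = machine.I2C(0, scl=machine.Pin(22), sda=machine.Pin(21))   # typical ESP32 pins
sensor = bme280.BME280(i2c=i2c)
temperature, pressure, humidity = sensor.values                  # formatted strings
send_lora(f"{temperature},{pressure},{humidity}".encode())
machine.deepsleep(SLEEP_MS)                    # RTC timer wakes the board; script reruns
```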
Such sensor nodes can be readily designed using open source hardware tools such as KiCad and fabricated in quantity for less than $100 each, or similar ones can be purchased commercially (e.g., the Heltec Wi-Fi LoRa 32 V2 or the Adafruit HUZZAH32 Feather with FeatherWing add-on sensors). Equally importantly, they and the Raspberry Pi base stations can be integrated directly with environments such as SAGE. Just as commodity clusters opened computing to a wide range of practicing scientists, inexpensive sensors can be the missing and catalyzing element to realize the vision of an immersive sensor/actuator infrastructure.
Over the past two years, as a research proof-of-concept, I have developed a generic software infrastructure for Raspberry Pis and ESP32 sensors, while Charlie Catlett has been exploring complementary hardware and software models based on inexpensive home automation tools. We are now drafting a “how to” book, targeting K-12 STEM education and citizen science.
Cyberinfrastructure Futures: The Missing Elements
We have come far in building out a 21st century cyberinfrastructure, but there is more to do if we are to fully exploit the opportunities now afforded by scale – both small numbers of large computing systems and instruments and large numbers of small computing systems and instruments. As we imagine a future world of software-defined sensors, I believe we must address the following research challenges, and I encourage all of us to think about how we might mount national research programs to explore each of them.
An Extensible Cyberinfrastructure Baseline. At present, we lack a generic sensor toolkit that is broadly applicable to diverse measurement environments. The SAGE project is an early attempt at designing and deploying such a generic infrastructure, drawing on lessons from the earlier Array of Things (AoT) project, but much more experimentation and targeted development is required to separate systemic needs from domain-specific science. Even more importantly, these longitudinal sensor data surveys may span years or even decades, placing an even greater premium on organizational and infrastructure stability.
Intentional Programming at Scale. Task- or node-level specification is central to most of our programming models, yet so many of the questions that matter with distributed sensors are systemic ones involving identification of either data outliers (e.g., vibrations on this bridge exceed norms) or broad trends (e.g., rainfall along the severe storm front is approaching critical levels). Equally importantly, such data should trigger certain global actions (e.g., if viral loads in sewage rise, increase global sampling rates). These global, “if this happens, then do this” intentional specifications, with some tolerance for sampling errors and uncertainty, are not typical of our programming models, but are ideal for specifying system behaviors and responses. We need new research on systemic, intentional programming models and specification systems (e.g., based on AI qualitative reasoning and zero-shot planners).
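As a purely hypothetical sketch of what such an intentional specification might look like, consider a rule expressed as data, a systemic condition over many sensors with tolerance for missing reports, mapped to a global action, plus a tiny evaluator. The stream names, thresholds, and actions are illustrative and not part of any existing system.

```python
# Hypothetical sketch of an "intentional" rule: a systemic condition over many
# sensors, with tolerance for missing data, mapped to a global action. Names
# and thresholds are illustrative only.
from statistics import median

rule = {
    "when": {"stream": "sewage.viral_load", "aggregate": "median",
             "exceeds": 200.0, "min_reporting_fraction": 0.6},
    "then": {"action": "set_sampling_interval", "stream": "sewage.viral_load",
             "minutes": 15},
}

def evaluate(rule, readings, expected_nodes):
    """readings: {node_id: latest value}; returns the action dict or None."""
    cond = rule["when"]
    if len(readings) < cond["min_reporting_fraction"] * expected_nodes:
        return None                 # too few nodes reporting to act confidently
    # This toy evaluator hard-codes the median aggregate for brevity.
    if median(readings.values()) > cond["exceeds"]:
        return rule["then"]         # trigger the global response
    return None

print(evaluate(rule, {"n1": 250.0, "n2": 310.0, "n3": 180.0}, expected_nodes=4))
```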
Run Forever Heterogeneity. Those of us in computing are intimately acquainted with forklift hardware replacement, when old hardware gives way to new systems. Depending on facility capacity, there may be some period of system overlap, but inevitably the forklifts remove the old hardware, leaving only a fully integrated, new system in their wake. Alas, the real world of distributed sensors bears much more resemblance to that of embedded real-time systems, where replacing sensors can be both costly and impractical. Sometimes this is because sensors are deployed in critical infrastructure with hard real-time constraints and uptime quality of service expectations. In other cases, the cost of replacement may simply be prohibitive.
For example, replacing the Chicago AoT nodes required access to City of Chicago utility bucket trucks, something possible only rarely and as a low priority relative to operational utility needs. Similarly, replacing sensors in remote field sites may require extensive and difficult travel. One consequence of such difficulties is extensive sensor heterogeneity, with new sensors and technology operating in concert with older, less capable sensors. This has profound implications for both programming models and software configuration and deployment, as well as the types and resolution of data being captured and kinds of in situ analysis possible.
Bags of Interdisciplinary Data. In addition to disciplinary access, much of the value of longitudinal sensor data accrues from cross-disciplinary data fusion, asking questions never envisioned by any of the original collectors of the data. The latter necessarily requires integration across diverse data schemas, none of which were designed with cross-disciplinary questions in mind. We need to embrace data schema heterogeneity as an inevitability rather than as an impediment, encouraging development of new techniques that support “come as you are” data analysis. None of the giant search firms have a master schema for web pages; rather, they index data and build search engines capable of responding to a wide range of queries. We need to draw insights and lessons from their experiences.
A Data Economy. Finally, government mandates for data preservation and access are creating new challenges for researchers and research institutions alike. What should be saved and for how long? Can such answers be determined solely by the constituent disciplines, or should there be a combination of discipline-specific and university specifications? When can and should data be discarded? Are those choices based solely on usage frequency, uniqueness, or cost of replication relative to cost of retention? Research librarians long ago adopted deaccession policies, triaging holdings and negotiating interlibrary loan policies for rare or unique artifacts. We need a coherent data economy that balances data utility against costs, driven by scientific needs.
Coda
We have come far in developing a coherent national and international cyberinfrastructure for research and innovation. However, there are still unmet needs and unresolved infrastructure and research questions surrounding distributed instruments and large numbers of inexpensive sensors. It is time to reason together and build the missing pieces.