N.B. A more formal version of these arguments is posted on arXiv (https://arxiv.org/abs/2203.02544), co-authored with Dennis Gannon and Jack Dongarra.
This is a long post, containing perspectives that may be controversial to some. That's okay; it is why we have lively intellectual debates.
Charles Dickens’ A Tale of Two Cities opens with this line:
IT was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair.
It’s an apt metaphor for the state of information technology and the state of U.S. competitiveness in computing and science, technology, engineering, and mathematics (STEM). In some ways, the hegemony of U.S. IT has never been greater, with Apple and the cloud companies (Google, Microsoft, Amazon, and Facebook) dominating the Nasdaq with trillion dollar (and more) market capitalizations and reshaping the very definitions of computing with global services, deep learning, and advanced AI.
Conversely, the semiconductor shortage has highlighted the economic and national security risks from over-dependence on off-shore fabrication, and the relative influence of traditional computing companies on the IT ecosystem has waned. Meanwhile, the future of U.S. leadership in bleeding edge scientific computing (aka supercomputing) is at a critical crossroads.
Finally, while the global competition for STEM talent in general and IT talent in particular has never been more intense, the U.S. is substantially underperforming relative to other countries in STEM education. Let’s take a look at how all of this has happened. Remember, follow the money and the talent.
Remembering the Past
Let’s start by playing a parlor game. Name the first computer company that comes to mind. If you are an old timer, you might be inclined to mention IBM, Amdahl, and the BUNCH (Burroughs, UNIVAC, NCR, CDC, and Honeywell). Although IBM once bestrode the computing world like a colossus, today even its relevance has faded. Indeed, few now remember that IBM was once locked in a bitter antitrust struggle with the U.S. Department of Justice over alleged monopolistic practices. As for the BUNCH, they are either gone from the computer business or niche players at best. Later, others – Compaq, Digital Equipment Corporation (DEC), and SGI – came and went.
As an aside, I learned to program on an IBM System 360/50, wrote lots of code on a DEC PDP-11/45 for my first paid software development job, ported code from a GE 600 timesharing system, and used a CDC 6500 and 6600 during graduate school. I programmed in FORTRAN, COBOL, PASCAL, LISP, and PL/I (in addition to many more recent languages), and I’ve toggled absolute binary into the front panel. I’m also old enough to remember when one could do Mead and Conway VLSI design with colored pencils. I also used a slide rule, while walking barefoot in the snow to school, uphill (both ways), but I digress. (See You Might Be a Computing Old Timer If …)
There’s a sad, parallel tale (pun intended) in the land of high-performance computing, where the cemetery is filled with departed denizens, some large, some small, but all now permanent members of the Dead Computer Society. Whether by misjudging markets or technologies, each in its own way proved the truth of the old adage that the best way to make a small fortune in supercomputing is by starting with a large fortune, then shipping one or two generations of product. Sic transit gloria mundi SGI, Cray, and a host of others. (See HPC: Making a Small Fortune.)
More to the point, the list of high-end supercomputing players keeps getting shorter. Cray is gone, now part of HPE; and both IBM and Intel are struggling to right themselves, each for different reasons. This leaves HPE as the lone high-end HPC integrator in the United States. Time will tell if Nvidia, with its recent acquisitions, steps into this space, though its proposed acquisition of ARM seems unlikely to be completed.
Meanwhile, the information technology ecosystem and the locus of innovation and global leadership have shifted dramatically. There are deep and important implications for the U.S.
Contemplating the Present
If you are Gen Z or Gen Alpha, the world begins and ends with Apple, Google, and Samsung smartphones and the cloud services behind them. All of which brings us to the FAANG companies (Meta née Facebook, Amazon, Apple, Netflix, and Alphabet née Google) plus Microsoft and their Asian BAT (Baidu, Alibaba, and Tencent) counterparts. These companies are economically exothermic in ways perhaps unprecedented since the days of the Standard Oil monopoly. Doubt me? Look at their market capitalizations and free cash flow relative to traditional computer companies. (As I write this, Microsoft just made a $69B all-cash offer for Activision and reported over $50B in quarterly revenue.)
I first watched this power dynamic begin to shift while at Microsoft. For years, the Wintel duopoly locked Microsoft and Intel in a collaborative/competitive partnership as rough peers. I sat in more than one “frank and spirited” exchange between leaders of the two companies, where each side pushed the other to accommodate its market needs. As server chips, clouds, and infrastructure grew in importance relative to desktops and laptops, the power dynamic began to shift. Today, Intel is the far weaker player, in many cases forced to respond to cloud vendor demands. In hindsight, it is all obvious.
Technological and Economic Shifts
Today, Apple and the other cloud service companies dominate the computing hardware and software ecosystem, both in scale and in technical approaches. Initially, they purchased standard hardware for deployment in traditional colocation centers (colos). Then they began designing purpose-built data centers, optimized for power usage effectiveness (PUE), deployed at sites selected via multifactor optimization – inexpensive energy availability, tax breaks and political subsidies, political and geological stability, network access, and customer demand.
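As a reminder for those who have not chased the metric, PUE is simply the ratio of total facility energy to the energy actually delivered to the computing equipment; the closer to 1.0, the less is lost to cooling, power conversion, and other overhead:

```latex
\mathrm{PUE} = \frac{\text{total facility energy}}{\text{IT equipment energy}}
```

A facility operating near a PUE of 1.1 spends roughly ten cents on overhead for every dollar of energy that reaches the servers, whereas traditional colos have typically run considerably higher.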
As scale, complexity, and operational experience grew, new optimization and leverage opportunities emerged: software defined networking, protocol offloads, and custom network architectures (greatly reducing dependence on traditional network hardware vendors); quantitative analysis of processor, memory, and disk failure modes, with consequent redesign for reliability and lower cost (dictating specifications to vendors via consortia like Open Compute); custom processor SKUs; custom accelerators (FPGAs and ASICs); and finally, complete processor designs (e.g., Apple silicon, Google TPUs, and AWS Gravitons). In between, they deployed their own global fiber networks.
This virtuous cycle of insatiable consumer demand for rich services, business outsourcing to the cloud, expanding data center capacity, and infrastructure cost optimization has had several effects. First, it has dramatically lessened – and in many cases totally eliminated – FAANG/Microsoft/BAT dependence on traditional computing vendors and ameliorated the risks of wholesale transfer pricing. Put another way, Amazon’s AWS can now, when necessary, negotiate pricing terms with Intel and AMD in the same way Walmart does with Procter and Gamble (i.e., from a position of strength as a dominant customer).
Second, the absolute scale of infrastructure investment, denominated in billions of dollars per year for each cloud service vendor, means these companies are shaping computing technology research and development priorities in ways traditional computing vendors rarely can, driving design and fabrication priorities in every element of the ecosystem.
Bigger is not simply bigger; bigger is different, fueling investment in new technologies at a phenomenal rate. (See The Zeros Matter: Bigger Is Different.) Relatedly, I cannot overemphasize the massive scale of these cloud data center deployments, dwarfing anything seen in academia or national laboratories, with the already large gap widening each year. The CAGR is simply astounding.
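For concreteness, compound annual growth rate (CAGR) is just the geometric mean of year-over-year growth; the figures in the example below are purely illustrative, not vendor data:

```latex
\mathrm{CAGR} = \left(\frac{V_{\text{end}}}{V_{\text{start}}}\right)^{1/n} - 1,
\qquad\text{e.g.,}\quad \left(\frac{8}{1}\right)^{1/3} - 1 = 1.0 = 100\%\ \text{per year}
```

Even a far more modest 30 percent CAGR doubles deployed capacity roughly every two and a half years, which is how an already large gap keeps widening.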
Third, custom infrastructure is about more than economic leverage; it’s about market differentiation and competitive advantage. Anything not differentiating is standardized and commoditized, driving unnecessary cost out of the system, and concomitantly reducing profits for commodity purveyors. Conversely, anything differentiating is the focus of intense internal optimization and aggressive deployment.
Equally importantly, client plus cloud services vendors predominantly make their money from the services offered atop their infrastructure, much less from the infrastructure itself. Unlike Intel, AMD, and Nvidia, which sell silicon, Apple is not selling A15 SoCs; it is selling iPhones, just as Google and Amazon are selling services atop TPUs and Gravitons.
Fourth, all of this is occurring against a backdrop of big data and machine learning. Large-scale AI is itself fueling a computing revolution, with insatiable hardware demand to train ever larger machine learning models. As an example, modern natural language models now have hundreds of billions of parameters, and multiple groups have announced models with over one trillion parameters. Similarly, work on deep reinforcement learning has led to impressive advances in game playing and other tasks, and generative adversarial networks (GANs) have transformed image and media synthesis. Nor is deep learning limited to “AI workloads;” there are impressive results on solving differential equations as well. (See Dennis Gannon’s excellent summary of Karniadakis’ work here.)
Finally, all of these trends are convolved with the end of Dennard scaling, dark silicon, functional specialization, chiplets, semiconductor fabrication challenges, rising fabrication facility costs, and an uncertain political future for Taiwan and TSMC. Moreover, the actual cost of transistors (dollars/transistor) is increasing as we move closer to one nanometer scaling. In fact, the transistor cost minimum occurred at roughly the time (~28 nanometers) we switched from planar CMOS to FinFETs. It will likely only get worse as gate-all-around FETs predominate, given increased fabrication complexity.
In the midst of all this, Intel stumbled, losing its process technology lead just as EUV finally and belatedly matured, allowing TSMC to surge ahead. Time will tell if Pat Gelsinger can turn things around.
The Future
Okay, let’s put all the technology and economics puzzle pieces together:
- Advanced computing requires non-recurring engineering (NRE) investment to develop new technologies and systems.
- The FAANG/Microsoft/BAT companies are cash rich (i.e., exothermic), and they are designing, building, and deploying their own hardware and software infrastructure at unprecedented scale.
- The traditional computing vendors are now small – in some cases, tiny – economic players in the computing ecosystem, and many are dependent on federal investment (i.e., endothermic) for NRE to advance the bleeding edge.
- AI is fueling a revolution in how we think about problems and their numerical solution.
- Dennard scaling is over and performance advances increasingly depend on functional specialization via custom ASICs and chiplet-integrated packages.
- Transistor costs are likely to continue to increase as we push to smaller feature sizes.
- Nimble hardware startups are exploring new ideas, driven by the AI frenzy.
- Talent is following the money and the opportunities, which are increasingly in industry.
With this backdrop, what is the future? I believe some of it is relatively obvious and secure, given the current dominance of U.S. cloud service providers. However, broader U.S. leadership in semiconductors, information technology, and bleeding edge HPC is less certain. I believe any viable solutions will require rethinking our approaches and assumptions, including how, where, and when we spend money and what will both grow and retain talent. Let’s start with the broad issues around the U.S. information technology ecosystem.
Information Technology Futures
The big changes outlined above make it obvious that a computing revolution is underway, one that is reshaping global leadership in critical areas. All revolutions create opportunity, but only for those prepared to accept the risks and act accordingly. In Far Side cartoon parlance, we must be tall enough to attack the city.
First, we need to be honest about the social, political, economic, and national security risks of losing U.S. leadership and control of the silicon fabrication ecosystem. Fabless semiconductor firms are important, but we also need onshore, state-of-the-art fabrication facilities, and we need to invest and act accordingly. That is why the U.S. CHIPS Act and its successors are so important, and why the recent announcements by Intel, TSMC, and GlobalFoundries that they are building new chip fabrication facilities in the U.S. (each for different reasons) are so welcome.
Second, the seismic business and technology shifts have also led to equally dramatic talent shifts. People gravitate to opportunities, and that giant sucking sound is people leaving academia, national laboratories, and traditional computing companies to pursue those opportunities. Seymour Cray once said, “I wanted to make a contribution,” when asked why he had used so many high risk technologies in the Cray-1. The desire to make a difference is no less powerful today.
The academic brain drain among artificial intelligence researchers is well documented. After all, in the commercial world one can develop and test ideas at a scale simply not possible in academia, using truly big data and uniquely scaled hardware and software infrastructure. The same is true for chip designers and system software developers.
Making matters worse, the U.S. talent pipeline is uncertain. As the 2022 Science and Engineering Indicators report shows, the U.S. is losing ground to the rest of the world, trailing China in both patents and publications, as summarized in this Science story about the report.
Our K-12 STEM education performance lags most of Asia and Europe – we rank between Hungary and Turkey in mathematics performance by 15-year-olds. We also have deep and systemic inequalities in STEM education, and we have sadly created an environment that is less attractive to international graduate students and scholars than it once was. The latter is especially worrisome given that over 60 percent of computer science and mathematics Ph.D. holders working in the United States were born overseas.
Practically, this means we need to recognize the need for determined actions:
- Raise the levels of federal investment in ongoing information technology R&D. This means billions, not millions of dollars.
- Invest in the talent pipelines needed to make academic and laboratory R&D attractive.
- Partner deeply with the U.S. private sector in new and creative ways.
Now, what about advanced scientific computing futures?
Leading Edge HPC Futures
It now seems self-evident that supercomputing, at least at the highest levels, is endothermic, requiring regular infusions of non-revenue capital to fund the NRE costs to develop and deploy new technologies and successive generations of integrated systems. In turn, that capital can come from either other, more profitable divisions of the business or from external sources (e.g., government investment).
While at Microsoft, I once pitched a technical computing idea to CEO Steve Ballmer. He listened patiently to my spiel, then simply said, “I’m not interested.” Being nothing if not persistent, I asked why. To which he replied, “It’s not that I think you are wrong. I actually think you are right – we could drive a few hundred million dollars in revenue, but the opportunity cost is too high, and the return on investment (ROI) is too low. Come back when you have an idea that starts with B (billions), not M (millions).”
I was chastened, but it was a valuable lesson that I fear too many in high-end scientific and engineering computing have not learned. A business leader must always look at the opportunity cost (time and cost of money) for any NRE investments. The core business question is always how to make the most money with the money you have, absent some other marketing or cultural reason to spend money on loss leaders, bragging rights, or political positioning. The key phrase here is “the most money;” simply being profitable is not enough.
Therein lies the historical “small fortune” problem for HPC. The NRE costs for bleeding edge supercomputing are now large relative to the revenues and market capitalization of those entities we call “computer companies,” and they are increasingly out of reach for U.S. government agencies, at least under current funding envelopes. The days when a few million dollars could buy a Cray-1/X-MP/Y-MP/2 or a commodity cluster and land in the top ten of the Top500 list are long gone. Today’s game requires hundreds of millions of dollars to deploy a machine at the high end and at least similar, if not larger investments in NRE.
What does this brave new world mean for those of us in the big iron, bleeding edge HPC crowd? First, as I mentioned over a decade ago, we are really the fast iron crowd, as our biggest systems are now dwarfed by the overall scale of commercial cloud infrastructure deployments. Second, we may still be invited to dinner, but economically, we are increasingly relegated to eating SpaghettiOs and drinking Kool-Aid at the kids’ table, while the adults are dining on lobster and drinking limited release, estate-bottled wine.
Put another way, the golden rule applies – buying a $500M supercomputer every five years is not much financial leverage against a multibillion dollar spend each year. Equally importantly, most of the traditional supercomputing vendors are sitting at the kids’ table with us. In turn, federal investment (e.g., DOE’s exascale PathForward program) is quite small compared to the scale of commercial cloud investments and their leverage with the same vendors.
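To put rough numbers on that leverage (the annual hyperscaler infrastructure spend below is a purely illustrative assumption, not a reported figure):

```latex
\frac{\$500\text{M} / 5\ \text{years}}{\$20\text{B} / \text{year}}
= \frac{\$100\text{M/year}}{\$20\text{B/year}}
= 0.5\%
```

A customer whose amortized spend is a fraction of a percent of a vendor’s annual volume does not get to set that vendor’s roadmap.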
So what should we do? First, we must acknowledge that we in bleeding edge HPC have limited leverage; we cannot influence vendors in the same ways we did in the past. The delays and cost overruns for Argonne National Laboratory’s Aurora system and the (resolving) deployment pains for Oak Ridge National Laboratory’s Frontier system speak to these challenges. Meanwhile, China has already deployed two exascale systems, both built using Chinese components.
We also need to recognize that bleeding edge HPC is about more than solving time-dependent partial differential equations on complex meshes. Those problems will always matter, but other areas of advanced computing are also of high and growing importance. After all, the Science 2021 Breakthrough of the Year was AI-enabled protein structure prediction, with transformative implications for biology and biomedicine. This is a cultural shift that will require us to loosen our hold on some cherished beliefs and acknowledge that the world has changed.
Second, it’s time we got serious – really serious – about end-to-end co-design, from device physics to applications, following the lead of the cloud ecosystem operators, and where possible and appropriate, partnering with them. This means more than encouraging tweaks of existing products or product plans. Rather, it means looking holistically at the problem space, then envisioning, designing, testing, and fabricating appropriate solutions. (See Advanced Computing: Integrative Thinking for the Future.) As a contact sport, it means substantially expanded investment in basic research and development, something that unfortunately suffered as the U.S. Exascale Computing Project (ECP) diverted limited funds to timeline-driven deployments.
Third, we will need to embrace the chiplet ecosystem and work collaboratively with the commercial players to establish interoperability standards (e.g., the Open Domain-Specific Architecture (ODSA) project within the Open Compute Project). Chiplets are more than a way to mix and match capabilities from multiple sources; they are an economic and engineering reaction to the interplay of chip defect rates, the cadence of feature size reductions, and semiconductor manufacturing costs (CAPEX and OPEX). Concomitantly, we must work with the new chiplet design and integration companies while also designing our own chiplets, both efforts driven by the specific needs of quantified scientific workloads.
There ARE interesting things happening in the hardware space, but many of them are in the startup world. One need look no further than companies like Graphcore, Groq, and Cerebras for inspiration, and at the ferment in edge devices and AI. I will not name other companies, as several are still in stealth mode. Part of this requires some flexibility and imagination. (Full disclosure: I started NCSA down the GPU accelerator path by building a cluster of Sony PlayStations and formed the first partnership with NSF on cloud computing.)
Fourth, we will need to build real hardware and software prototypes at scale, not just incremental ones, but ones that truly test new ideas using custom silicon and associated software. In turn, that means accepting the risk of failure at substantial scale (e.g., that $50M prototype didn’t work, but we learned some important things, and we are going to build another one). (See Luck Is a Fickle Friend.)
To do these things, we will need to invest in recruiting and sustaining research teams – chip designers, system software developers, and packaging engineers – in an integrated end-to-end way AND create opportunities that make it intellectually attractive to work on science and engineering problems. Simply put, we will need to create integrated teams and build integrated prototypes at scale, both at national laboratories and in academia.
Finally, and I realize this may be heretical to some, we need to ask when and where we would be best served by embracing commercial cloud services for our HPC needs. The performance gaps between cloud services and HPC have lessened substantially over the past decade. See, for example, a recent comparative analysis by the UC-Berkeley group. HPC as a service is now both real and effective.
I realize some in the academic and national laboratory communities will immediately say, “But, we can do it cheaper, and our systems are bigger!” Perhaps, though proving such claims means being dispassionately honest about technological innovation, NRE costs, opportunity costs, staff career goals, and the true total cost of ownership (TCO). In turn, this requires a mix of economic and cultural realism in making build versus use decisions and asking what cloud service vendors would be willing to do for academic and government consortia at large scale (e.g., a billion dollar annual partnership).
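As a thought experiment, here is a minimal sketch of the kind of dispassionate build-versus-use comparison I mean. Every number is a placeholder assumption, not real pricing or a vendor quote; the point is simply that an honest comparison must amortize capital, energy, facility, and staff costs over a system’s life rather than quoting hardware price alone:

```python
# Minimal build-vs-use sketch. All figures are hypothetical placeholders,
# not vendor quotes; substitute your own audited numbers.

def on_prem_cost_per_node_hour(
    capex_per_node=250_000.0,           # assumed purchase price per node (USD)
    lifetime_years=5.0,                 # assumed depreciation period
    power_kw_per_node=3.0,              # assumed average draw per node (kW)
    pue=1.3,                            # assumed facility overhead factor
    electricity_per_kwh=0.07,           # assumed energy price (USD/kWh)
    staff_cost_per_node_year=10_000.0,  # assumed ops/admin cost per node per year
    utilization=0.85,                   # fraction of hours doing useful work
):
    hours_per_year = 8760.0
    amortized_capex = capex_per_node / (lifetime_years * hours_per_year)
    energy = power_kw_per_node * pue * electricity_per_kwh
    staff = staff_cost_per_node_year / hours_per_year
    # Divide by utilization: idle hours still cost money.
    return (amortized_capex + energy + staff) / utilization


def cloud_cost_per_node_hour(list_price=35.0, negotiated_discount=0.6):
    # Assumed list price for a comparable instance and a hypothetical
    # committed-use discount for a large academic/government consortium.
    return list_price * (1.0 - negotiated_discount)


if __name__ == "__main__":
    print(f"on-prem : ${on_prem_cost_per_node_hour():.2f} per node-hour")
    print(f"cloud   : ${cloud_cost_per_node_hour():.2f} per node-hour")
```

The “right” answer depends entirely on the inputs, which is precisely the point: the comparison means something only if the inputs are honest and if NRE and opportunity costs are counted alongside them.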
In summary, it is increasingly unlikely that future high-end HPC systems will be procured and assembled solely by commercial integrators from commodity components. Rather, they are likely to be (a) collaborative partnerships with cloud providers, (b) increasingly bespoke, designed and built collaboratively to support key scientific and engineering workload needs, or (c) a combination of these two.
Put another way, bleeding edge, exascale class computing systems are no different than the Vera Rubin Observatory, the LIGO gravity wave detector, or the Advanced Photon Source. Each contains commercially designed and constructed technology, but they also contain large numbers of custom elements for which there is no sustainable business model. Instead, we build these instruments because we want them to explore open scientific questions, and we recognize that their design and construction requires government investment.
Coda
As our U.S. political divisions illustrate, investing in the future will not be easy, but it is critical if the U.S. is to remain globally competitive. Let’s be clear. The price of innovation keeps rising, the talent is following the money, and many of the traditional players – companies and countries – are struggling to keep up. Andy Grove was only partially right. It is true that only the paranoid survive, but they’d best have deep pockets and enticing intellectual opportunities as well. That needs to be us.
N.B. My thanks to Jack Dongarra, Doug Burger, Dennis Gannon, Tren Griffin, Tony Hey, and Jim Larus for thoughtful feedback on drafts of this post. Any errors of fact and all opinions are my own. In full disclosure, I am a former Microsoft corporate vice president and a current stockholder.