Disclaimer

  • The postings on this site are my own and don’t necessarily represent Microsoft's positions, strategies or opinions.

    April 11, 2009

    When Petascale Is Just Too Slow

    N.B. I also write for the Communications of the ACM (CACM). The following essay recently appeared on the CACM blog.

    It seems as if it were just yesterday when I was at NCSA and we deployed a one-teraflop Linux cluster as a national resource. We were as excited as proud parents by the configuration: 512 dual-processor nodes (1 GHz Intel Pentium III processors), a Myrinet interconnect and (gasp) a stunning 5 terabytes of RAID storage. It achieved a then-astonishing 594 gigaflops on the High-Performance Linpack (HPL) benchmark, and it was ranked 41st on the Top500 list.
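    As a back-of-the-envelope check on those numbers, here is a minimal sketch that compares the quoted HPL result with the cluster's theoretical peak. It assumes one floating-point operation per processor per clock cycle, an assumption of mine that is consistent with the "one teraflop" description; the function name is hypothetical.

```python
def peak_gflops(nodes, procs_per_node, clock_ghz, flops_per_cycle):
    """Theoretical peak in gigaflops: every processor retires
    flops_per_cycle operations on every clock cycle."""
    return nodes * procs_per_node * clock_ghz * flops_per_cycle


# Figures quoted above: 512 dual-processor nodes at 1 GHz, 594 gigaflops on HPL.
# ASSUMPTION: one floating-point operation per cycle per processor.
peak = peak_gflops(nodes=512, procs_per_node=2, clock_ghz=1.0, flops_per_cycle=1)
hpl_gflops = 594.0
efficiency = hpl_gflops / peak

print(f"peak: {peak:.0f} gigaflops, HPL efficiency: {efficiency:.1%}")
```

    Under that assumption, the benchmark result is a bit under 60 percent of peak.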

    The world has changed since then. We hit the microprocessor power (and clock rate) wall, birthing the multicore era; vector processing returned incognito, renamed as graphics processing units (GPUs); terabyte disks are available for a pittance at your favorite consumer electronics store; and the top-ranked system on the Top500 list broke the petaflop barrier last year, built from a combination of multicore processors and gaming engines. The last is interesting for several reasons, both sociological and technological.

    Petascale Retrospective

    On the sociological front, I remember participating in the first petascale workshop at Caltech in the 1990s. Seymour Cray, Burton Smith and others were debating future petascale hardware and architectures, a second group debated device technologies, a third discussed application futures, and a final group of us were down the hall debating future software architectures. (I distinctly remember talking to Seymour about his "parity is for farmers" comment regarding memory ECC.) All this was prelude to an extended series of architecture, system software, programming models, algorithms and applications workshops that spanned several years and multiple retreats.

    By the way, you can read the original report here; it is fascinating to look back. Paul Messina, Thomas Sterling and others deserve our thanks for launching the seminal activity.

    At the time, most of us were convinced that achieving petascale performance within a decade would require some new architectural approaches and custom designs, along with radically new system software and programming tools. We were wrong, or at least so it superficially seems. We broke the petascale barrier in 2008 using commodity x86 microprocessors and GPUs, Infiniband interconnects, minimally modified Linux and the same message-based programming model we have been using for the past twenty years.

    However, as peak system performance has risen, the number of users has declined. Programming massively parallel systems is not easy, and even terascale computing is not routine. Horst Simon explained this with an interesting analogy, which I have taken the liberty of elaborating slightly. The ascent of Mt. Everest by Edmund Hillary and Tenzing Norgay in 1953 was heroic. Today, amateurs still die each year attempting to replicate the feat. We may have scaled Mt. Petascale, but we are far from making it a pleasant or even routine weekend hike.

    This raises the real question: were we wrong in believing different hardware and software approaches were needed to make petascale computing a reality? I think we were absolutely right that new approaches were needed. However, our recommendations for a new research and development agenda were not realized. At least in part, I believe this is because we have been loath to mount the integrated research and development needed to change our current hardware/software ecosystem and procurement models.

    Exascale Futures

    I recently participated in the International Exascale Software Project Workshop (IESP), the first in a series of meetings designed to explore organizational and technical approaches to exascale system design and construction. The workshop built on several earlier meetings and studies, including the DARPA exascale hardware study and the forthcoming exascale software study (in which I participated), as well as the DOE exascale applications study. Complementary analyses are underway in the European Union and in Asia.

    Evolution or revolution: it's the persistent question. Can we build reliable exascale systems from extrapolations of current technology or will new approaches be required? There is no definitive answer, as almost any approach might be made to work at some level with enough heroic effort. The bigger question is what design would enable the most breakthrough scientific research in a reliable and cost-effective way?

    My personal opinion is that we need to rethink some of our dearly held beliefs and take a different approach. The degree of parallelism required at exascale, even with future manycore designs, will challenge even our most heroic application developers, and the number of components will raise new reliability and resilience challenges. Then there are interesting questions about manycore memory bandwidth, achievable system bisection bandwidth and I/O capability and capacity. There are just a few programmability issues as well!

    I believe it is time for us to move from our deus ex machina model of explicitly managed resources to a fully distributed, asynchronous model that embraces component failure as a standard occurrence. To draw a biological analogy, we must reason about systemic, organism health and behavior rather than cellular signaling and death, and not allow cell death (component failure) to trigger organism death (system failure). Such a shift in world view has profound implications for how we structure the future of international high-performance computing research, academic-government-industrial collaborations and system procurements.
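    To see why component failure has to be treated as routine at this scale, consider a rough sketch. It assumes that components fail independently at a constant rate, so the mean time between failures (MTBF) somewhere in the system is the component MTBF divided by the component count. The five-year component MTBF and the component counts are hypothetical values chosen only for illustration; they are not measurements from any real system.

```python
HOURS_PER_YEAR = 24 * 365


def system_mtbf_hours(component_mtbf_hours, n_components):
    """Mean time between failures anywhere in the system.

    ASSUMPTION: components fail independently at a constant rate, so the
    failure rates add and the system MTBF shrinks as 1 / n_components.
    """
    return component_mtbf_hours / n_components


# HYPOTHETICAL: each component fails, on average, once every five years.
component_mtbf = 5 * HOURS_PER_YEAR

for n in (1_000, 100_000, 1_000_000):
    hours = system_mtbf_hours(component_mtbf, n)
    print(f"{n:>9,} components: a failure roughly every {hours:.3f} hours")
```

    Under these assumptions, a system with a million components sees a failure somewhere every few minutes, which is why a design that halts on any single failure cannot make progress.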

    March 22, 2009

    Twitter Is Three

    Yesterday (March 21, 2009), Twitter, the microblogging service, turned three years old. I've been twittering for two years now, watching the evolution of social networking and the nature of the participants. (Yeah, I had a Blackberry back in 2000-2001, long before mobile email became an international addiction, but then I'm a gadget geek.) Twitter is now growing exponentially according to Nielsen Wire, with unique visitors up over 1000 percent from one year ago.

    Twitter would not have been possible in the U.S. a decade ago, as we lacked the ubiquitous smartphones and broadband coverage needed for mobile access. However, if one recalls the popularity of the short message service (SMS) in Europe and Asia, Twitter could easily have appeared as a value-added service long ago. After all, GSM has been around since the 1980s. It remains an open question, however, whether Twitter can be a profitable business.

    Extended Friends

    Like its big brother, Facebook, Twitter has gone mainstream, with attributions and social discussions in the popular press. Indeed, some have argued that Twitter is the new Facebook, favored by the digerati. All of this reminds me of Yogi Berra's famous remark about a restaurant, "Nobody goes there anymore; it's too crowded."

    In both the Facebook and Twitter worlds, it does seem crowded. I am now being "friended" by some people I barely know, and by others I do not know at all. (I am sure many of you have had the same experience.) I am not complaining; rather I have realized that friend in this context is a very elastic thing, ranging from long-time personal friend to an extended network of business acquaintances. Who would have thought one would need information visualization tools like Friend Wheel to track one's "friends"?

    Interacting Networks

    Like web search engines and the social insights one can glean from tracking queries (e.g., Google's tracking of flu trends), analyzing and visualizing tweets can illuminate social dynamics and behavior. Watching Twittervision, Twitter StreamGraphs or TwittEarth can at times be fascinating but can also overwhelm one with the minutiae of daily life. (A trip to the gas station is not an event of historic proportions.) Such is the nature of a global social dynamic.

    I also find it interesting that these extended social networking sites are themselves increasingly interconnected. For example, my Twitter updates appear on my blog and my Facebook page. Conversely, excerpts and links from my blog posts appear on Twitter via TwitterFeed and are also replicated on Facebook, LinkedIn and FriendFeed. In between, there's microblogging (a tumblelog) with Tumblr (hpcdan.tumblr.com).

    Perhaps in the limiting case, all of my social networks can simply chat about me among themselves. They are probably more interesting than I am.

    March 12, 2009

    Scientific Clouds: Blowin’ in the Wind

    N.B. I recently responded to some questions from John West (HPCWire) regarding the Microsoft Cloud Computing Futures (CCF) research project. In that Q&A, I also commented on the relevance of cloud computing to computational science. What follows is an augmented subset of the Q&A, but focused on just the relevance of clouds to technical computing.

    Cirrus, stratus, altostratus, cumulus: they are the scientific names of the common clouds. They drift across the sky, reflecting the changing wind and weather. A new front is blowing into computational science, and cloud computing will soon advance scientific and engineering discovery.

    That is one of the reasons I am excited about cloud services. I believe we are at a technological transition point, just as profound as that engendered by the "attack of the killer micros." This is true whether you are enamored of Microsoft's Azure, Amazon's AWS or Google's Apps.

    Learning from History

    Let's step back and gain some perspective, starting with the "Branscomb pyramid" ("From Desktop to TeraFlop: Exploiting the U.S. Lead in High Performance Computing," Lewis Branscomb et al.) and the diverse types of technical computing that now exist. We tend to focus on the apex of the computing pyramid, now exemplified by petascale systems intended to support only a handful of applications and users. However, most science is conducted at lower levels of the pyramid, using desktop computers, laboratory clusters and university-scale computing infrastructure. By analogy, it's exciting to talk about international hypersonic transport, but most people care more about efficiently and painlessly commuting to work each day.

    Over the past decade, we successfully leveraged commodity hardware to create large clusters. What was nearly heretical when we first deployed clusters at NCSA is now commonplace. However, this scaling has not been without cost. Cluster programming remains difficult at scale, we have turned a generation of researchers into parallel programmers and system administrators, institutions are struggling with rising demands for machine space, power and cooling, and duplicated facilities make sharing expertise and data difficult. We are heavily focused on computing at a time when data analysis now dominates much of science and engineering. Like many of you, I contributed to this state of affairs, and I feel some responsibility to help us find a new path.

    Hype and Reality

    Let's separate the hype from the reality. Clouds won't magically restore your 401(k) retirement fund, cure halitosis or even help you drop twenty pounds before your upcoming high school reunion. Like all new technologies, however, they challenge some conventional computing wisdom and change some of our operating assumptions.

    Personal computing was a non sequitur when computers filled rooms. Internet search was nonsensical when there were only a handful of research web sites. Social networking services depend on inexpensive, ubiquitous broadband access and mobile devices. Hosted cloud services and software are now possible given the confluence of inexpensive but powerful multicore processors, high-capacity storage, broadband networks and the economies of scale that consolidation in cloud data centers make possible.

    Five Reasons Clouds Matter

    First, the economies of scale from mega-data center provisioning mean capital and operating costs can be lower. When buying servers in 500,000 unit lots and designing facilities at scale, the provider does have some financial and technical leverage. This would allow universities, laboratories and federal agencies to devote a larger fraction of precious funding to research rather than infrastructure. Remember Dan's computational science corollary; it's the science, not the infrastructure, which matters.

    Second, truly large-scale data analysis, particularly multidisciplinary data fusion, can become routine. In the scientific community, we have worked hard to build workflows for access to distributed data. Consolidation and co-location enable new approaches, and we tend to forget that cloud data centers have many, many petabytes of disk storage. It really is possible to query multiple petabytes of data using intuitive, easy-to-use desktop tools – the business community does it all the time. Jim Gray proved the power of database tools on several scientific data analysis projects, including the SkyServer.

    Third, clouds facilitate time-space tradeoffs. It is just as cost-effective to run 100,000 individual jobs simultaneously as sequentially (e.g., for a parameter study), something that our batch queuing strategies strongly discourage on high-performance computing systems. In geek terms, the area is the same, whether one uses tall, skinny rectangles (lots of resources for a small interval) or short, wide rectangles (a few resources for a long interval). The elasticity of clouds, a consequence of multiplexing many users and workloads, means that the resources are always available without waiting.

    Fourth, I also believe that the cloud will offer HPC services at increasing scale, beginning with that typified by today's laboratory clusters. This is already happening, and as I/O device virtualization continues to improve, communication latencies will decrease and tightly coupled computations will be attractive at ever larger scale.

    Finally, clouds can provide seamless extension of familiar desktop tools and interfaces, allowing computing and analysis to scale within the same environment that researchers use every day. We can leverage consumer software, just as we have leveraged consumer hardware. There is no reason our computational science tools and our "every day" tools need be different.

    Shameless Microsoft Plug

    To this point, I've written about clouds in a vendor-neutral sense. With a nod to the company name now on my paycheck, if you haven't already, I encourage you to take a look at Windows Azure and its cloud computing and storage services. In addition to rich web services, there is both open source and Visual Studio programming support. In a future post, I will describe an Azure example application for computational science. Here endeth the marketing pitch.

    Insight, Not Infrastructure

    As Richard Hamming famously noted, "The purpose of computing is insight, not numbers." Dan's computational science corollary is simple, "The purpose of computational science infrastructure is scientific discovery, not big iron bragging rights." It's time to focus on what matters and embrace the future. Our graduate students and post-docs will thank us.

    February 24, 2009

    Seeding The Clouds

    Since I joined Microsoft in late 2007, I have written about science policy, Federal government interactions, and national competitiveness studies, in my role as a member of PCAST and chair of the Computing Research Association (CRA). Throughout, I have emphasized the need for strategic investment in long-term, basic research, especially as part of the economic stimulus package.

    I have also discussed the rise of multicore computing, the consequent software crisis and the need for innovation in both architecture and software, including Microsoft's support for the Microsoft/Intel-funded Universal Parallel Computing Research Centers (UPCRCs) at Illinois and UC-Berkeley. I have also mused on the future of high-performance computing and its role as an enabler of scientific discovery. I have even written about my family, my rural childhood and my life experiences.

    What I have not done is write about why I came to Microsoft and what I am doing – until now. Yes, my team manages the UPCRCs in partnership with Intel. Yes, I devote time and energy to research policy, both for the community and on behalf of Microsoft. Yes, I am involved in the future of high-performance computing, both politically and technically. However, that's not the entire story.

    It's time to talk infrastructure so large it makes petascale systems seem small. It's time to talk about why I can't remember the last time I had this much fun. It's time to pull back the curtain and talk about the future of clouds. No, I'm not talking about weather forecasting, though I really enjoyed my past collaboration with the LEAD partnership.

    I came to Microsoft to lead a new research initiative in cloud computing, one that complements our production data center infrastructure and our nascent Azure cloud software platform. You can read the press release and the web site for the official story. What follows is my personal perspective.

    The Infrastructure of Our Lives

    We all know the cloud premise – Internet delivery of software and services to distributed clients, from mobile devices to desktops. We tend not to think about how dependent we now are on those delivered services, though we are, just as we depend on the telephone and our water and electrical utilities.

    Imagine a day without the web, without search engines, without social networks, without online games, without electronic commerce, without streaming audio and video. Our world has changed, and government, business, education, recreation and social interaction are now critically dependent on reliable Internet services and the hardware and software infrastructure behind them. However, more research and technology evaluation are needed to make them as trustworthy as the telephone network.

    Building Internet services infrastructure using standard, off-the-shelf technology made sense during the 1990s Internet boom. (And yes, I remember how cool Mosaic was, when I first saw it at Illinois.) The facilities were small by today's standards, and the infrastructure could be deployed quickly. Today, however, the scale is vastly larger, our social and economic dependence is much greater and the consequences of failure are profound. Web service outages are now international news, and a cyberattack is considered an act of war.

    For background on some of the challenges and problems in scaling, you might want to follow the Data Center Knowledge and High Scalability web sites. If you are new to this space, they and other reading will redefine your notions of large and reliable. You might not think 100 megawatts could be a data center design constraint, but it is. More importantly, you should fear – yea, verily, be absolutely terrified by – the wrath of 100 million unhappy customers should your Internet service fail. Every nightmare that has ever awakened a CIO in a cold sweat at 2am is real, but magnified a thousandfold. If it were easy, though, it would be neither exciting nor fun.

    Cloud Infrastructure Challenges

    Microsoft's business, like that of other cloud service providers – Amazon, Google, Yahoo and others – depends on an ever-expanding network of massive data centers: hundreds of thousands of servers, many, many petabytes of data, hundreds of megawatts of power, and billions of dollars in capital and operational expenses. This enormous scale – far larger than even the largest high-performance computing facilities – brings new design, deployment and management challenges, including energy efficiency, rapid deployment, resilience, geo-distribution, composability, and graceful recovery.

    I have been a "big iron" guy for a long time, and Internet and cloud services infrastructures do have analogs with petascale and exascale computing, but the workloads and optimization axes are different. Like today's HPC systems, cloud computing facilities are being built with hardware and software technologies not originally designed for deployment at such massive scale. Consequently, they are less efficient and less flexible than they either can or should be. If we built utility power plants the same way we build cloud infrastructure, we would start by visiting The Home Depot and buying millions of gasoline-powered generators. This must change.

    Imagine a world where heterogeneous multicore processors are designed and optimized for diverse workloads, where solid state storage changes our historical notions of latency and bandwidth, where on-chip optics, system interconnects and LAN/WAN networking simplify data movement, where scalable systems are resilient to component failures, where programming abstractions facilitate functional dispersion across devices and facilities, where new applications are developed more quickly and efficiently. This can be.

    Cloud Computing Futures

    Over the past fourteen months, I have been quietly building the Cloud Computing Futures (CCF) team, starting with a key concept. We must treat cloud service infrastructure as an integrated system—a holistic entity—and optimize all aspects of hardware and software. I have recruited hardware and software researchers, software developers and industry partners to pursue this vision. It's been a blast.

    The CCF agenda spans next-generation storage devices and memories, new processors and processor architectures, networks, system packaging, programming models and software tools. We are a research and technology transfer team, whose roles are to explore radical new alternatives – "blank sheet of paper" approaches to cloud hardware and software infrastructure – and to drive those ideas into implementation and practice.

    Effective research in this space requires changes to both hardware and software, and the resulting prototypes must be constructed and tested at a scale difficult for small teams. This type of research and technology transfer is difficult in academia, because the efforts often cross many research disciplines.

    For this reason, the CCF team is taking an integrated approach, drawing insights and lessons from Microsoft's production services and data center operations, and partnering with researchers, vendors and product teams worldwide. Our work builds on technical partnerships and collaborations across Microsoft, including Microsoft Research, Debra Chrapaty's Global Foundation Services (GFS) data center construction, operations and delivery team, and Ray Ozzie's Azure cloud services group. We are also partnering with an array of hardware-technology providers and companies as we build prototypes.

    Now You Know

    For me, CCF has been an opportunity to apply research experiences and ideas gleaned over the past twenty-five years of my academic career. Equally important, it is a chance to build prototypes at scale to test those ideas, and then help drive the promising technologies into practice. The past year has been great fun, and I have been privileged to attract some wonderful people to this adventure and to partner with them, including Jim Larus and Dennis Gannon.

    Now you know why I came to Microsoft. It was a chance to practice what I've been preaching. It was a chance to help design the biggest of big iron. It was a chance to help invent the future. It's a pretty cool gig for a balding old geezer like me!

    October 28, 2008

    Beyond The Azure Blue

    From the first day I arrived at Microsoft, my academic colleagues have been asking me about Microsoft's strategy for cloud computing and when (or if) there would be public announcements. Those questions rose to a crescendo as academic groups prepared responses to the NSF eXtreme Digital (XD) TeraGrid solicitation. All I could say was that we were working on a plan, and it would become clear soon.

    I don't normally pitch Microsoft products in the blog, preferring to discuss science policy, technology research and development and global competitiveness. However, something big just happened at Microsoft, something I think will affect all of us. Moreover, as I write this, the Pacific Northwest sky is clear and azure blue, and that doesn't happen often this time of year. An omen, perhaps?

    Microsoft Azure Cloud Services

    At our Professional Developers Conference (PDC), Microsoft announced Azure, our cloud computing platform, with on-demand compute and storage to host, scale and manage Internet or cloud applications. The press release has additional business perspective and a link to the presentation. Azure is one element of the vision Ray Ozzie (See "Mind to Mind: Building Innovation") described in his 2005 Internet Services Disruption memorandum.

    The simplest description of Azure is that the initial release allows you to develop hosted Windows applications using .NET Services, though future releases will support unmanaged code and open source tools as well (Eclipse, Ruby, PHP, and Python). Within Azure, a fabric controller manages application instances and access to storage via SQL Data Services (SDS), and it hosts applications atop virtualized multicore hardware. Finally, Microsoft's Live Services offerings will be layered atop the Azure framework.

    You can read the white paper for details on the Azure design and usage approach, and the software development kit (SDK) is available for download. Beyond the Azure SDK itself, there are SDKs for Visual Studio, .NET and SDS Services, as well as Java and Ruby SDKs for .NET Services. This is a Community Technology Preview (CTP), meaning Microsoft welcomes feedback on these early capabilities and will continue to expand the capabilities of Azure over the coming months.

    Science and Technology Implications

    Earlier in the year, I wrote both on my blog and in HPCWire ("Dan's Cloudy Crystal Ball") about the possibility of outsourcing research computing services and infrastructure to the cloud. I noted then that the explosive growth of computing as an enabler of scientific discovery had strained university capabilities and Federal research budgets. Given our current economic crisis, university operating budgets and Federal research expenditures will be under even greater strain and there will be increased scrutiny on the need for each investment.

    In a world of (at best) modest research budget increases, we must ask hard questions about the best use of limited funds. Cloud computing offers a potential mechanism to increase the efficiency of current research, ensure continuity of critical data and enable new kinds of research not now feasible.

    In this model, researchers focus on the higher levels of the software stack -- applications and innovation, not low-level infrastructure. University and Federal research agency administrators, in turn, procure services from the providers based on capabilities and pricing. Finally, the cloud service providers deliver economies of scale and capabilities driven by a large market base and energy efficient infrastructure. Remember, computing infrastructure exists to enable discovery, not as monuments to technological prowess.

    In addition to efficiency, the scalability of cloud services and infrastructure opens new research possibilities. Not only is it possible to federate multidisciplinary research data at far larger scales than possible in a university environment (think tens to hundreds of petabytes of low latency storage), we can also escape the pernicious cycle of transitory research infrastructure.

    How often have we created data repositories as part of research projects, only to find few mechanisms to ensure their long-term sustainability and access by the broader research community? How often have we faced a miasma of distributed data sources with unknown provenance and non-compatible metadata, each supported pro bono on a best effort basis? (See my recent comments on digital document preservation.) Instead, imagine multidisciplinary data fusion and mining, where students can pose queries against integrated but diverse data sources using robust tools.

    Finally, by leveraging "pay as you go" models, we can trade time and scale on a continuous basis. Imagine applying 50,000 processors for one hour at the same cost as 50 processors for one thousand hours. In the cloud, the integral under the curve is the same and the costs are comparable, but the research effects are qualitatively different.
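    The arithmetic behind that tradeoff can be sketched in a few lines. This is a minimal illustration, assuming a flat price per processor-hour with no volume discounts or startup overheads; the price and the function names are hypothetical and do not come from any provider's rate card.

```python
def processor_hours(processors, hours):
    """Area of the rectangle: resources multiplied by time."""
    return processors * hours


def cost(processors, hours, price_per_processor_hour):
    """Pay-as-you-go cost under a flat per-processor-hour price (an assumption)."""
    return processor_hours(processors, hours) * price_per_processor_hour


PRICE = 0.10  # HYPOTHETICAL price per processor-hour, for illustration only

wide = (50_000, 1)     # many processors for a short interval
narrow = (50, 1_000)   # few processors for a long interval

for label, (procs, hours) in (("wide", wide), ("narrow", narrow)):
    print(f"{label:>6}: {processor_hours(procs, hours):,} processor-hours, "
          f"cost {cost(procs, hours, PRICE):,.2f}, "
          f"result after {hours:,} hour(s)")
```

    Both configurations consume the same number of processor-hours and so cost the same under this pricing assumption, but one returns an answer in an hour while the other takes roughly six weeks.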

    The Standard Questions

    Standard questions always arise about new approaches to computing, and cloud services and data storage inevitably raise them.

    • Is it reliable and will my data persist?
    • Is it safe, private and secure?
    • Will I be captured and become captive?
    • What does it cost and what if I can't continue paying?

    We tend to forget that there are complementary issues about local infrastructure because we have already internalized and accepted the implications and risks. Moreover, local failures are rarely publicized.

    • What happens if my disks crash?
    • What if I can't pay for backups or maintenance or physical plant or …?
    • What if my network is penetrated?

    These are the standard cost/benefit/risk tradeoffs. One must make them based on statistics, economics and practical constraints. Remember that we debated the same issues when we shifted research computing from vendor-backed HPC designs to predominantly commodity components.

    Let's Reason Together

    I welcome discussion of how we can exploit cloud services and infrastructure effectively – all cloud infrastructure, not just Microsoft's Azure. To do this, the cloud service providers, hardware vendors, universities and Federal government must work together to outline an agenda, conduct experiments at scale and speak with a united voice on the opportunities.

    It's a sunny day, but my head is in the clouds.