If you have ever been sent to the henhouse to fetch eggs, you know the practical truth of the old aphorism, "Don't put all your eggs in one basket." With just one stumble – wait for it – you can have egg on your face. (Sorry, I just could not avoid the pun.) All of which is to say, distributing one's assets is normally a successful risk mitigation strategy. At least that is what we all thought about our retirement mutual funds until the world financial markets collectively declined.
I hear rustling among my faithful readers, collectively wondering how these colorful, though meandering, introductory paragraphs will segue to more salient topics, such as clouds or high-performance computing. Fear not, the answer is here, manifest as a few observations on massive data centers, wide area networking, and techniques for geo-distribution and resilience.
Protecting the Big Asset
Every operator of massive cloud data centers chooses the locations of those facilities after careful, qualitative and quantitative analysis of several factors, including the tax climate, availability and cost of electricity, the natural stability of the site (i.e., probability of earthquakes, tsunamis, hurricanes and other natural disasters), the region's political stability and physical security, access to optical fiber and broadband networks and a host of other issues. Given an objective scoring of these criteria, it is not surprising that multiple operators and competitors often site their facilities in the same region.
Regardless of location, a substantial fraction of the infrastructure associated with a data center has historically been devoted to electrical power and resilience. This has typically included multiple diesel generators, collectively capable of generating many megawatts of electricity, and large lead-acid battery banks to ensure an uninterruptible power supply (UPS). Not only are these generators and battery banks expensive, but their deployment and acceptance are also time-consuming and often difficult. Indeed, before the economic downturn, demand for generators was so high that sites competed to secure a place in line for delivery, further delaying data center construction.
The reason we strive to protect data centers is obvious: their failure often has catastrophic business implications. Without doubt, though, some workloads are less affected by failure than others. If your web search for basket weaving supplies times out, you are probably content to retry the query a few seconds later. If your attempt to file a tax return electronically fails, that is more problematic.
This focus on resilience raises one rather obvious question: is there a better way? Returning to my initial metaphor, rather than protecting the one big basket of eggs, might one distribute the eggs among several smaller baskets?
Geo-dispersion: The Other Alternative
If it were possible to replicate data and computation across multiple, geographically distributed data centers, one could reduce or eliminate UPS costs, and the failure of a single data center would not disrupt the cloud service or unduly affect its customers. Rather, requests to the service would simply be handled by one of the service replicas at another data center, perhaps with slightly greater latency due to time-of-flight delays. This is, of course, more easily imagined than implemented, but its viability can be assessed on both economic and technical grounds.
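To put a rough number on that time-of-flight penalty, here is a back-of-the-envelope sketch in Python. It assumes a refractive index of about 1.47 for silica fiber, so light travels at roughly two-thirds of its vacuum speed, and the path lengths are purely illustrative. Real routes are longer than straight-line distances and add switching and queuing delays that this ignores.

```python
# Rough time-of-flight estimate between data centers.
# Assumption: refractive index of ~1.47 for silica fiber (not from the text).
# Only propagation delay is modeled; equipment and queuing delays are ignored.

SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_REFRACTIVE_INDEX = 1.47  # assumed typical value


def one_way_delay_ms(fiber_km: float) -> float:
    """Propagation delay in milliseconds over fiber_km of fiber."""
    speed_in_fiber = SPEED_OF_LIGHT_KM_PER_S / FIBER_REFRACTIVE_INDEX
    return fiber_km / speed_in_fiber * 1000.0


def round_trip_ms(fiber_km: float) -> float:
    """Round-trip propagation delay in milliseconds."""
    return 2.0 * one_way_delay_ms(fiber_km)


if __name__ == "__main__":
    for km in (100, 1_000, 4_000):  # hypothetical fiber path lengths
        print(f"{km:>5} km: one way {one_way_delay_ms(km):6.2f} ms, "
              f"round trip {round_trip_ms(km):6.2f} ms")
```

Under these assumptions, each 1,000 km of fiber adds a little under 5 ms in each direction, which is the "slightly greater latency" a failed-over request would see.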
In this spirit, let me begin by suggesting that we may need to rethink our definition of broadband WANs. Today, we happily talk of deploying 10 Gb/s lambdas, and some of our fastest transcontinental and international networks provision a small number of lambdas (each carrying 10, 40 or 100 Gb/s). However, a single-mode optical fiber has much higher total capacity with current dense wavelength division multiplexing (DWDM) technology, and typical multistrand cables contain many fibers. Thus, the cable has an aggregate bandwidth of many terabits per second, even with current DWDM.
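The arithmetic behind that claim is simple multiplication, sketched below in Python. The fiber count and the number of lambdas per fiber are hypothetical values chosen for illustration. Only the per-lambda rates of 10, 40 and 100 Gb/s come from the discussion above.

```python
# Aggregate capacity of a multistrand cable if every lambda were lit.
# The fiber count and lambdas-per-fiber values used below are hypothetical.


def cable_capacity_tbps(fibers: int, lambdas_per_fiber: int,
                        gbps_per_lambda: float) -> float:
    """Total cable capacity in Tb/s."""
    return fibers * lambdas_per_fiber * gbps_per_lambda / 1000.0


if __name__ == "__main__":
    fibers = 96             # hypothetical strand count
    lambdas_per_fiber = 80  # hypothetical DWDM channel count
    for rate in (10, 40, 100):  # Gb/s per lambda, as in the text
        total = cable_capacity_tbps(fibers, lambdas_per_fiber, rate)
        print(f"{rate:>3} Gb/s lambdas: {total:8.1f} Tb/s aggregate")
```

Compare those totals with a network that lights only a handful of lambdas, and the gap between what is provisioned and what the glass could carry becomes plain.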
Despite the aggregate potential bandwidth of the cables, we are really provisioning many narrowband WANs across a single fiber. Rarely, if ever, do we consider bonding all of those lambdas to provision a single logical network. What might one do with terabits of bandwidth between data centers? If one holds an indefeasible right of use (IRU) or owns the dark fiber, one need only provision the equipment to exploit multiple fibers for a single purpose.
Of course, exploiting this WAN bandwidth would necessitate a dramatic change in the bipartite separation of local area networks (LANs) and WANs in cloud data centers. Melding these would also expose the full bisection bandwidth of the cloud data center to the WAN and its interfaces, simplifying data and workload replication and moving us closer to true geo-dispersion and geo-resilience. Making this a reality means resolving deep technical issues related to on-chip photonics, VCSELs and ROADMs, among others.
In the end, these technical questions devolve to risk assessment and economics. First, the cost of replicated, smaller data centers without UPS must be less than that of a larger, non-replicated data center with UPS. Second, the wide area network (WAN) bandwidth, its fusion with data center LANs and their cost must be included in the economic assessment.
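The two conditions above can be written down as a toy comparison, sketched here in Python. Every figure is invented and in arbitrary cost units. The `DeploymentOption` class and its fields are hypothetical names for illustration, and a real assessment would also have to price the risk of outages, which this sketch leaves out.

```python
# Toy cost comparison: one large data center with UPS versus several
# smaller replicas without UPS, joined by a high-bandwidth WAN.
# All names and numbers here are hypothetical.
from dataclasses import dataclass


@dataclass
class DeploymentOption:
    sites: int
    cost_per_site: float      # building and operating one site, excluding UPS
    ups_cost_per_site: float  # generators and battery banks; 0 if omitted
    wan_cost: float           # WAN bandwidth and LAN/WAN integration

    def total_cost(self) -> float:
        per_site = self.cost_per_site + self.ups_cost_per_site
        return self.sites * per_site + self.wan_cost


def replication_is_cheaper(replicated: DeploymentOption,
                           single: DeploymentOption) -> bool:
    """True when the replicated design costs less, WAN costs included."""
    return replicated.total_cost() < single.total_cost()


if __name__ == "__main__":
    single = DeploymentOption(sites=1, cost_per_site=500.0,
                              ups_cost_per_site=100.0, wan_cost=0.0)
    replicated = DeploymentOption(sites=3, cost_per_site=170.0,
                                  ups_cost_per_site=0.0, wan_cost=60.0)
    print("single:    ", single.total_cost())
    print("replicated:", replicated.total_cost())
    print("replication cheaper?", replication_is_cheaper(replicated, single))
```

With these made-up figures the replicated design comes out slightly ahead, but raising the WAN cost modestly reverses the answer, which is why the networking cost has to be part of the assessment.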
These are interesting technical and economic questions, and I invite economic analyses and risk assessments. I suspect, though, that it is time we embraced the true meaning of high-speed networking and put our eggs in multiple baskets.