Inside the Microsoft datacentres

A datacentre is just a datacentre, isn’t it? After all, isn’t it just a bigger version of the server room in the basement? But what about the huge datacentres that run cloud services? What’s it like inside the Microsoft datacentres that host Azure, Office 365, etc.?

Last week, Microsoft’s Modern Workplace webcast titled “An Inside Look at Your Secure Cloud” gave a sneak peek inside some of the Microsoft datacentres – comparing various generations and showing the improvements along the way.  And, as you might expect, these are the very definition of operating at scale…

As Doug Hauger (General Manager for National Cloud Programs at Microsoft) explained, organisations look to use a cloud datacentre for scale and professionalism.  Anyone can run a datacentre but the Microsoft Cloud is about robustness and security – whether that’s how staff are monitored or the physical and logical security models.

Each time Microsoft moves into a new region (like the two regions that opened in the UK earlier this month) there’s not just one super-scale datacentre but multiple facilities per region, providing redundancy and disaster recovery capability. Each facility has multiple power sources and multiple network ingress and egress points. Then there’s the investment Microsoft is making in physical infrastructure around the world – for example the joint project with Facebook for a new Europe-North America undersea cable (MAREA).

Each time Microsoft considers expanding into a new market they perform a business case analysis on the potential opportunity, considering the scale that they will go in at (tens of thousands of servers). Microsoft now has more than 100 datacentres in 30 regions around the world (with four more under construction). Because of the huge range of locations covered, Microsoft is now the industry leader for compliance and certification – whether that is meeting global or local requirements. Then there is the question of meeting customer needs around data residency, compliance, etc. (for example with the German datacentres that operate under a unique data trustee model in partnership with Deutsche Telekom).

With its cloud datacentres, Microsoft is aiming to meet customer needs around digital transformation, where the question is no longer “why should I go to the cloud” but one of “how to innovate more quickly in the cloud”. That’s what drives the agenda for where to expand geographically, where to enhance scalability, etc.

Despite the question I posed in the opening paragraph of this post, a true datacentre is worlds apart from the typical server room in the basement (or wherever). The last time I got to visit a datacentre was when I was working at Fujitsu and I visited the London North facility, an Uptime Institute Tier III datacentre that won awards when it was built in 2008. Seeing the scale at which a modern datacentre operates is impressive. Then ramp it up some more for the big cloud service providers.

In the webcast, Christian Belady (General Manager Cloud Infrastructure Strategy and Architectures at Microsoft) explained that datacentres are the foundation of the Internet – they are where all the cloud services are served from (whether that is Microsoft services, or those provided by other major players).

There are several layers of physical security from the outside fence in: screening people, controlling access to parts of the buildings and even to the cabinets themselves, with critical customer data kept in locked cabinets covered by video surveillance. Used disks are destroyed, being wiped and then crushed on site! The physical security surpasses anything provided for on-premises servers and the logical security continues that defence in depth.

Each custom-built server is actually two computers, with tens of thousands of computers per room and hundreds of thousands per datacentre, and each datacentre is the size of 20-30 football fields. Look at the racks and you can see the attention to detail – keeping things orderly not only adds to operational efficiency but it looks good too! The enterprise servers that most of us run on-premises have plastic bezels to make them look pleasant. Microsoft’s servers, by contrast, are designed to eliminate anything that has no useful function…

Each iteration of datacentres becomes more industrialised – with improvements to factors such as cooling (which is one of the biggest power usage factors).

A generation 2 datacentre from around 2007 has a Power Usage Effectiveness (PUE) efficiency score of 1.4-1.6 (for comparison, the Fujitsu facility I mentioned earlier has a PUE of 1.4 but a typical enterprise datacentre from the 2000s with a normal raised floor would have a PUE of 2-3). Cool and hot aisles are used with hot air returned to coolers and recirculated. Microsoft then raised the temperature of their servers to a level that is acceptable (working with manufacturers), rather than the lower levels they used to have (reducing the cooling demands).
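For reference, PUE is the ratio of the total power drawn by a facility to the power delivered to the IT equipment, so a score of 1.0 would mean no overhead at all. The short Python sketch below is my own illustration (not something from the webcast) of what the scores above mean in terms of overhead. The 1 MW IT load and the function names are assumed for the example.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def overhead_kw(pue_score: float, it_equipment_kw: float) -> float:
    """Power spent on everything other than the IT load (cooling, losses, etc.)."""
    return (pue_score - 1.0) * it_equipment_kw


# Assumed example: a 1 MW (1,000 kW) IT load at the PUE scores quoted above.
IT_LOAD_KW = 1000.0
for label, score in [
    ("enterprise, raised floor (lower bound)", 2.0),
    ("generation 2 (upper bound)", 1.6),
    ("generation 2 (lower bound)", 1.4),
]:
    print(f"{label}: PUE {score} -> {overhead_kw(score, IT_LOAD_KW):.0f} kW overhead")
```

In other words, at a PUE of 1.4 every kilowatt of IT load needs roughly another 0.4 kW for cooling and other overheads, whereas at a PUE of 2 the overhead is as large as the IT load itself.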

Moving on to generation 4, efficiency is improved further (a PUE of 1.1-1.2), eliminating chillers by removing roofs, driving down costs and using outside air to chill. Containers use the outside cooling and a system of adiabatic cooling, spraying mist into the air to cool it down – the mist evaporates before it hits the servers. Such datacentres use a lot less water too (compared with older styles of datacentre).

With the latest (generation 5) datacentres, further improvements are made, bringing together the features of the earlier generations – learning and adapting. The PUE is now down to 1.1 (and below at certain times of year) with running costs also improved. There are still hot and cold aisles but no raised floor and, instead of outside air, the datacentres use a closed liquid loop system (no chiller – cool the water outside) – and that water doesn’t need to be potable.

The actual datacentre design changes for each facility, based on the geography and the environmental impact. Backup power generation is a key component in the design, with several days of fuel onsite and contracts to keep bringing more fuel in. Power is often sustainably sourced, be that cheap and carbon-free hydro-electric power, wind or solar. Microsoft Research is even working on a tidal-powered under-sea datacentre (Project Natick).

The inside of a Microsoft datacentre is very industrial. Whole racks are brought in (pre-tested), rather than single servers and, as previously mentioned, Microsoft design and build the servers for use at scale, stripping out enterprise features and retaining only what’s needed for the Microsoft environment.

Whilst I’ve worked with customers who have visited Microsoft datacentres in Dublin, it seems unlikely that I’ll ever get the chance. Watching the Modern Workplace webcast gave me a fascinating look at how Microsoft operates datacentres at scale though – and it truly is awe-inspiring. To find out more, visit the Microsoft website.

Microsoft’s Windows Azure datacentres: some statistics

Last week I blogged about designing a private cloud infrastructure, based on the practices employed by the major cloud service providers.

Today I got a taste of the scale of some of those cloud operations, when Microsoft gave an online presentation on Windows Azure to their International Customer Advisory Board (ICAB) for Server and Cloud (of which I’m a participant).

Remember the shipping containers that I mentioned as units of scale in a modern datacentre? Here are a few stats about Microsoft’s Azure datacentres:

  • Each datacentre runs at around 95°F (or 35°C): that’s pretty warm but, even though there is air conditioning installed, it’s rarely used, as the containers are self-cooling (using a water system).
  • Containers are stacked in units that are two high and then connected to power, water and networks. (Now that’s some appliance!)

Microsoft's Azure appliances

  • Each container unit contains around 2500 servers and a whole datacentre has 360,000 servers.

Inside one of the containers

  • The containers are normally dark – I described resource decay in my earlier post – that means that it’s rarely necessary to enter the datacentre.
  • In fact, the datacentres are so highly automated that there are just 12 staff: 9 armed security guards and 3 administrators. (I’m guessing that’s working 3 shifts, so only 3 or 4 on duty at any one time.)
  • Humans are never alone – systems exist to ensure that people can only enter in pairs, and leave in pairs too.
  • So far, Microsoft has spent $2.5bn on its six Azure data centres, with more planned (and that doesn’t include the datacentres for its other operations).
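
A quick back-of-the-envelope check on those figures, as a Python sketch. It rests on my own assumptions rather than anything stated in the presentation: that every server in the datacentre lives in a container, that all containers are stacked in pairs, and that the 12 staff are split evenly over three shifts.

```python
# Figures quoted in the list above.
SERVERS_PER_DATACENTRE = 360_000
SERVERS_PER_CONTAINER = 2_500
CONTAINERS_PER_STACK = 2
STAFF = 12

# Assumption: three equal shifts (a guess, not a published figure).
SHIFTS = 3

containers = SERVERS_PER_DATACENTRE // SERVERS_PER_CONTAINER
stacks = containers // CONTAINERS_PER_STACK
staff_per_shift = STAFF // SHIFTS
servers_per_person_on_shift = SERVERS_PER_DATACENTRE // staff_per_shift

print(f"containers: {containers}")
print(f"two-high stacks: {stacks}")
print(f"staff per shift: {staff_per_shift}")
print(f"servers per person on shift: {servers_per_person_on_shift}")
```

On those assumptions, a datacentre holds 144 containers in 72 stacks, with each person on shift nominally looking after 90,000 servers, which gives some idea of the level of automation involved.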

Designing a private cloud infrastructure

A couple of months ago, Facebook released a whole load of information about its servers and datacentres in a programme it calls the Open Compute Project. At around about the same time, I was sitting in a presentation at Microsoft, where I was introduced to some of the concepts behind their datacentres.  These are not small operations – Facebook’s platform currently serves around 600 million users and Microsoft’s various cloud properties account for a good chunk of the Internet, with the Windows Azure appliance concept under development for partners including Dell, HP, Fujitsu and eBay.

It’s been a few years since I was involved in any datacentre operations and it’s interesting to hear how times have changed. Whereas I knew about redundant uninterruptible power sources and rack-optimised servers, the model is now about containers of redundant servers and the unit of scale has shifted.  An appliance used to be a 1U (pizza box) server with a dedicated purpose but these days it’s a shipping container full of equipment!

There’s also been a shift from keeping the lights on at all costs, towards efficiency. Hardly surprising, given that the IT industry now accounts for around 3% of the world’s carbon emissions and we need to reduce the environmental impact.  Google’s datacentre design best practices are all concerned with efficiency: measuring power usage effectiveness; managing airflow; running warmer datacentres; using “free” cooling; and optimising power distribution.

So how do Microsoft (and, presumably others like Amazon too) design their datacentres? And how can we learn from them when developing our own private cloud operations?

Some of the fundamental principles include:

  1. Perception of infinite capacity.
  2. Perception of continuous availability.
  3. Drive predictability.
  4. Taking a service provider approach to delivering infrastructure.
  5. Resilience over redundancy mindset.
  6. Minimising human involvement.
  7. Optimising resource usage.
  8. Incentivising the desired resource consumption behaviour.

In addition, the following concepts need to be adopted to support the fundamental principles:

  • Cost transparency.
  • Homogenisation of physical infrastructure (aggressive standardisation).
  • Pooling compute resource.
  • Fabric management.
  • Consumption-based pricing.
  • Virtualised infrastructure.
  • Service classification.
  • Holistic approach to availability.
  • Compute resource decay.
  • Elastic infrastructure.
  • Partitioning of shared services.

In short, provisioning the private cloud is about taking the same architectural patterns that Microsoft, Amazon, et al use for the public cloud and implementing them inside your own data centre(s). That means thinking service, not server, to develop an internal infrastructure as a service (IaaS) proposition.

I won’t expand on all of the concepts here (many are self-explanatory), but some of the key ones are:

  • Create a fabric with resource pools of compute, storage and network, aggregated into logical building blocks.
  • Introduce predictability by defining units of scale and planning activity based on predictable actions (e.g. certain rates of growth).
  • Design across fault domains – understand what tends to fail first (e.g. the power in a rack) and make sure that services span these fault domains.
  • Plan upgrade domains (think about how to upgrade services and move between versions so service levels can be maintained as new infrastructure is rolled out).
  • Consider resource decay – what happens when things break?  Think about component failure in terms of service delivery and design for that. In the same way that a hard disk has a number of spare sectors that are used when others are marked bad (and eventually too many fail, so the disk is replaced), take a unit of infrastructure and leave faulty components in place (but disabled) until a threshold is crossed, after which the unit is considered faulty and is replaced or refurbished.
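
To make two of those ideas more concrete, here is a minimal Python sketch of spreading a service across fault domains and of a unit of infrastructure that decays in place until a threshold is crossed. This is purely my own illustration, not how Microsoft implements it: the names (`place_across_fault_domains`, `ScaleUnit`, the rack labels) and the 10% threshold are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import cycle


def place_across_fault_domains(instances: int, fault_domains: list[str]) -> dict[int, str]:
    """Spread service instances round-robin so no single fault domain holds them all."""
    if not fault_domains:
        raise ValueError("need at least one fault domain")
    domains = cycle(fault_domains)
    return {i: next(domains) for i in range(instances)}


@dataclass
class ScaleUnit:
    """A unit of infrastructure (e.g. a rack or container) that decays in place."""
    name: str
    total_servers: int
    decay_threshold: float  # hypothetical: failed fraction that triggers replacement
    failed: set[int] = field(default_factory=set)

    def mark_failed(self, server_id: int) -> None:
        # The faulty server is disabled but left where it is; nobody goes in to fix it.
        if not 0 <= server_id < self.total_servers:
            raise ValueError(f"unknown server id {server_id}")
        self.failed.add(server_id)

    @property
    def usable_servers(self) -> int:
        return self.total_servers - len(self.failed)

    @property
    def needs_replacement(self) -> bool:
        # Once the threshold is crossed, the whole unit is replaced or refurbished.
        return len(self.failed) / self.total_servers > self.decay_threshold


# Six instances of a service across three racks (each rack is one fault domain).
placement = place_across_fault_domains(6, ["rack-a", "rack-b", "rack-c"])
print(Counter(placement.values()))

# A 2,500-server container with an assumed 10% decay threshold.
unit = ScaleUnit(name="container-01", total_servers=2500, decay_threshold=0.10)
for server_id in range(251):
    unit.mark_failed(server_id)
print(unit.usable_servers, unit.needs_replacement)
```

The point of the design is that losing one rack only takes out a share of the service's instances, and that individual server failures never trigger a repair visit: capacity simply shrinks until the unit as a whole is worth swapping out.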

A smaller company with a small datacentre may still think in terms of server components, whereas larger organisations may be dealing with shipping containers.  Regardless of the size of the operation, the key to success is thinking in terms of services, not servers, and designing public cloud principles into private cloud implementations.