A logical view on a virtual datacentre services architecture

A couple of years ago, I wrote a post about a logical view of an End-User Computing (EUC) architecture (which provides a platform for Modern Workplace). It’s served me well and the model continues to be developed (although the changes are subtle so it’s not really worth writing a new post for the 2019 version).

Building on the original EUC/Modern Workplace framework, I started to think about what it might look like for datacentre services – and this is something I came up with last year that’s starting to take shape.

Just as for the EUC model, I’ve tried to step up a level from the technology – to get back to the logical building blocks of the solution so that I can apply them according to a specific client’s requirements. I know that it’s far from complete – just look at an Azure or AWS feature list and you can come up with many more classifications for cloud services – but I think it provides the basics and a starting point for a conversation:

Logical view of a virtual datacentre environment

Starting at the bottom left of the diagram, I’ll describe each of the main blocks in turn:

  • Whether hosted on-premises, co-located or making use of public cloud capabilities, Connectivity is a key consideration for datacentre services. This element of the solution includes the WAN connectivity between sites, site-to-site VPN connections to secure access to the datacentre, Internet breakout and network security at the endpoints – specifically the firewalls and other network security appliances in the datacentre.
  • Whilst many of the solution building blocks (SBBs) in the virtual datacentre services architecture are equally applicable for co-located or on-premises datacentres, there are some specific Cloud Considerations. Firstly, cloud solutions must be designed for failure – i.e. to design out any elements that may lead to non-availability of services (or at least to fail within agreed service levels). Depending on the organisation(s) consuming the services, there may also be considerations around data location. Finally, and most significantly, the cloud provider(s) must practise trustworthy computing and, ideally, will conform to the UK National Cyber Security Centre (NCSC)’s 14 cloud security principles (or equivalent).
  • Just as for the EUC/Modern Workplace architecture, Identity and Access is key to the provision of virtual datacentre services. A directory service is at the heart of the solution, combined with a model for limiting the scope of access to resources. Together with Role Based Access Control (RBAC), this allows for fine-grained access permissions to be defined. Some form of remote access is required – both to access services running in the datacentre and for management purposes. Meanwhile, identity integration is concerned with integrating the datacentre directory service with existing (on-premises) identity solutions and providing SSO for applications, both in the virtual datacentre and elsewhere in the cloud (i.e. SaaS applications).
  • Data Protection takes place throughout the solution – but key considerations include intrusion detection and endpoint security. Just as for end-user devices, endpoint security covers such aspects as firewalls, anti-virus/malware protection and encryption of data at rest.
  • In the centre of the diagram, the Fabric is based on the US National Institute of Standards and Technology (NIST)’s established definition of essential characteristics for cloud computing.
  • The NIST guidance referred to above also defines three service models for cloud computing: Infrastructure as a Service (IaaS); Platform as a Service (PaaS) and Software as a Service (SaaS).
  • In the case of IaaS, there are considerations around the choice of Operating System. Supported operating systems will depend on the cloud service provider.
  • Many cloud service providers will also provide one or more Marketplaces with both first and third-party (ISV) products ranging from firewalls and security appliances to pre-configured application servers.
  • Application Services are the real reason that the virtual datacentre services exist, and applications may be web, mobile or API-based. There may also be traditional hosted server applications – especially where IaaS is in use.
  • The whole stack is wrapped with a suite of Management Tools. These exist to ensure that the cloud services are effectively managed in line with expected practices and cover all of the operational tasks that would be expected for any datacentre including: licensing; resource management; billing; HA and disaster recovery/business continuity; backup and recovery; configuration management; software updates; automation; management policies and monitoring/alerting.
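
To make the Identity and Access block a little more concrete, here is a minimal sketch of how roles and a scope model might combine to give fine-grained permissions. It is only an illustration: the role names, the path-style scopes and the `is_allowed` helper are all hypothetical and don't reflect any particular cloud provider's API.

```python
from dataclasses import dataclass

# Hypothetical roles and the actions each one permits (illustrative only).
ROLE_ACTIONS = {
    "reader": {"read"},
    "contributor": {"read", "write"},
    "owner": {"read", "write", "assign"},
}


@dataclass(frozen=True)
class RoleAssignment:
    principal: str  # user or group held in the directory service
    role: str       # key into ROLE_ACTIONS
    scope: str      # path-like scope, e.g. "/prod/web" (assumed convention)


def in_scope(scope: str, resource: str) -> bool:
    """True if the resource is the scope itself or sits beneath it."""
    scope = scope.rstrip("/")
    return resource == scope or resource.startswith(scope + "/")


def is_allowed(assignments, principal: str, action: str, resource: str) -> bool:
    """Allow the action if any assignment grants it at, or above, the resource."""
    return any(
        a.principal == principal
        and action in ROLE_ACTIONS.get(a.role, set())
        and in_scope(a.scope, resource)
        for a in assignments
    )


if __name__ == "__main__":
    assignments = [
        RoleAssignment("alice", "contributor", "/prod/web"),
        RoleAssignment("bob", "reader", "/"),
    ]
    print(is_allowed(assignments, "alice", "write", "/prod/web/vm1"))
    print(is_allowed(assignments, "alice", "write", "/prod/db/vm2"))
    print(is_allowed(assignments, "bob", "read", "/prod/db/vm2"))
```

Limiting the scope of an assignment is what keeps a broad role (such as contributor) from applying across the whole datacentre.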

If you have feedback (for example, a glaring hole or suggestions for changes), please feel free to leave a comment below.

Inside the Microsoft datacentres


A datacentre is just a datacentre isn’t it? After all, isn’t it just a bigger version of the server room in the basement? But what about the huge datacentres that run cloud services? What’s it like inside the Microsoft datacentres that host Azure, Office 365, etc.?

Last week, Microsoft’s Modern Workplace webcast titled “An Inside Look at Your Secure Cloud” gave a sneak peek inside some of the Microsoft datacentres – comparing various generations and showing the improvements along the way.  And, as you might expect, these are the very definition of operating at scale…

As Doug Hauger (General Manager for National Cloud Programs at Microsoft) explained, organisations look to use a cloud datacentre for scale and professionalism.  Anyone can run a datacentre but the Microsoft Cloud is about robustness and security – whether that’s how staff are monitored or the physical and logical security models.

Each time Microsoft moves into a new region (like the two regions that opened in the UK earlier this month) there’s not just one super-scale datacentre but multiple facilities per region, providing redundancy and disaster recovery capability. Each facility has multiple power sources and multiple network ingress and egress points. Then there’s the investment Microsoft is making in physical infrastructure around the world – for example the joint project with Facebook for a new Europe-North America undersea cable (MAREA).

Each time Microsoft considers expanding into a new market they perform a business case analysis on the potential opportunity, considering the scale that they will go in at (tens of thousands of servers). Microsoft now has more than 100 datacentres in 30 regions around the world (with four more under construction). Because of the huge range of locations covered, Microsoft is now the industry leader for compliance and certification – whether that is meeting global or local requirements. Then there is the question of meeting customer needs around data residency, compliance, etc. (for example with the German datacentres that operate under a unique data trustee model in partnership with Deutsche Telekom).

With its cloud datacentres, Microsoft is aiming to meet customer needs around digital transformation, where the question is no longer “why should I go to the cloud” but one of “how to innovate more quickly in the cloud”. That’s what drives the agenda for where to expand geographically, where to enhance scalability, etc.

Despite the question I posed in the opening paragraph of this post, a true datacentre is worlds apart from the typical server room in the basement (or wherever). The last time I got to visit a datacentre was when I was working at Fujitsu and I visited the London North facility, an Uptime Institute Tier III datacentre that won awards when it was built in 2008. Seeing the scale at which a modern datacentre operates is impressive. Then ramp it up some more for the big cloud service providers.

In the webcast, Christian Belady (General Manager Cloud Infrastructure Strategy and Architectures at Microsoft) explained that datacentres are the foundation of the Internet – they are where all the cloud services are served from (whether that is Microsoft services, or those provided by other major players).

There are several layers of physical security from the outside fence in, screening people, controlling access to parts of the buildings, even to cabinets themselves with critical customer data in locked cabinets covered with video surveillance. Used disks are destroyed, being wiped and then crushed on site! The physical security surpasses anything provided for on-premises servers and the logical security continues that defence in depth.

Each custom-built server is actually two computers, with tens of thousands of computers per room and hundreds of thousands per datacentre, and each datacentre is the size of 20-30 football fields. Look at the racks and you can see the attention to detail – keeping things orderly not only adds to operational efficiency but it looks good too! The enterprise servers that most of us run on-premises have plastic bezels to make them look pleasant. Instead, Microsoft’s servers have focused on eliminating anything that has no useful function…

Each iteration of datacentres becomes more industrialised – with improvements to factors such as cooling (which is one of the biggest power usage factors).

A generation 2 datacentre from around 2007 has a Power Usage Effectiveness (PUE) efficiency score of 1.4-1.6 (for comparison, the Fujitsu facility I mentioned earlier has a PUE of 1.4 but a typical enterprise datacentre from the 2000s with a normal raised floor would have a PUE of 2-3). Cool and hot aisles are used with hot air returned to coolers and recirculated. Microsoft then raised the temperature of their servers to a level that is acceptable (working with manufacturers), rather than the lower levels they used to have (reducing the cooling demands).

Moving on to generation 4, efficiency is improved further (a PUE of 1.1-1.2), eliminating chillers by removing roofs, driving down costs and using outside air to chill. Containers use the outside cooling and a system of adiabatic cooling, spraying mist into the air to cool it down – the mist evaporates before it hits the server. Such datacentres use a lot less water too (compared with older styles of datacentre).

With the latest (generation 5) datacentres, further improvements are made, combining the best features of the earlier generations – learning and adapting. The PUE is now down to 1.1 (and below at certain times of year) with running costs also improved. There are still hot and cold aisles but no raised floor and, instead of outside air, the datacentres use a closed liquid loop system (no chiller – the water is cooled outside) – and that water doesn’t need to be potable.
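
For anyone unfamiliar with PUE, it's the ratio of the total power drawn by the facility to the power delivered to the IT equipment, so a score of 1.0 would mean no overhead at all. The sketch below turns the scores quoted above into overhead figures. The 1 MW IT load is purely an assumption for illustration (it's not a number from the webcast) and, where a range was quoted, I've picked a single value from it.

```python
def overhead_kw(pue: float, it_load_kw: float) -> float:
    """Power used by everything that is not IT equipment (cooling, losses, etc.).

    PUE = total facility power / IT equipment power,
    so overhead = (PUE - 1) * IT load.
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return (pue - 1.0) * it_load_kw


# One representative value per generation, taken from the ranges quoted in this post.
QUOTED_PUE = {
    "typical 2000s enterprise datacentre": 2.0,  # quoted as 2-3
    "generation 2 (around 2007)": 1.6,           # quoted as 1.4-1.6
    "generation 4": 1.2,                         # quoted as 1.1-1.2
    "generation 5": 1.1,
}

IT_LOAD_KW = 1000.0  # hypothetical 1 MW of IT equipment, for illustration only

if __name__ == "__main__":
    for name, pue in QUOTED_PUE.items():
        extra = overhead_kw(pue, IT_LOAD_KW)
        print(f"{name}: PUE {pue} -> about {extra:.0f} kW of overhead")
```

In other words, on the same IT load, moving from a PUE of 2.0 to 1.1 cuts the non-IT overhead by roughly a factor of ten.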

The actual datacentre design changes for each facility, based on the geography and the environmental impact. Backup power generation is a key component in the design, with several days of fuel onsite and contracts to keep bringing more fuel in. Power is often sustainably sourced, be that cheap and carbon-free hydro-electric power, wind or solar. Microsoft Research is even working on a tidal-powered under-sea datacentre (Project Natick).

The inside of a Microsoft datacentre is very industrial. Whole racks are brought in (pre-tested), rather than single servers and, as previously mentioned, Microsoft designs and builds the servers for use at scale, stripping out enterprise features and retaining only what’s needed for the Microsoft environment.

Whilst I’ve worked with customers who have visited Microsoft datacentres in Dublin, it seems unlikely that I’ll ever get the chance. Watching the Modern Workplace webcast gave me a fascinating look at how Microsoft operates datacentres at scale though – and it truly is awe-inspiring. To find out more, visit the Microsoft website.

Microsoft’s Windows Azure datacentres: some statistics


Last week I blogged about designing a private cloud infrastructure, based on the practices employed by the major cloud service providers.

Today I got a taste of the scale of some of those cloud operations, when Microsoft gave an online presentation on Windows Azure to their International Customer Advisory Board (ICAB) for Server and Cloud (of which I’m a participant).

Remember the shipping containers that I mentioned as units of scale in a modern datacentre? Here are a few stats about Microsoft’s Azure datacentres:

  • Each datacentre runs at around 95°F (or 35°C): that’s pretty warm but, even though there is air conditioning installed, it’s rarely used, as the containers are self-cooling (using a water system).
  • Containers are stacked in units that are two high and then connected to power, water and networks. (Now that’s some appliance!)

Microsoft's Azure appliances

  • Each container unit contains around 2500 servers and a whole datacentre has 360,000 servers.

Inside one of the containers

  • The containers are normally dark: the resource decay approach that I described in my earlier post means that it’s rarely necessary to enter the datacentre.
  • In fact, the datacentres are so highly automated, that there are just 12 staff: 9 armed security guards and 3 administrators. (I’m guessing that’s working 3 shifts, so only 3 or 4 on duty at any one time.)
  • Humans are never alone – systems exist to ensure that people can only enter in pairs, and leave in pairs too.
  • So far, Microsoft has spent $2.5bn on its six Azure datacentres, with more planned (and that doesn’t include the datacentres for its other operations).
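
Taking those figures at face value, a quick back-of-envelope calculation gives a feel for the scale. This is only a sketch: it assumes that every server in a datacentre lives in a container of the quoted size and that the spend is split evenly across the six datacentres, neither of which was stated in the presentation.

```python
SERVERS_PER_CONTAINER = 2500      # "around 2500 servers" per container unit
SERVERS_PER_DATACENTRE = 360_000  # quoted for a whole datacentre
CONTAINERS_PER_STACK = 2          # containers are stacked two high
TOTAL_SPEND_USD = 2.5e9           # quoted spend to date
DATACENTRES = 6                   # quoted number of Azure datacentres


def containers_needed(servers: int, servers_per_container: int) -> int:
    """Containers required to house the servers (rounded up)."""
    return -(-servers // servers_per_container)  # ceiling division


def stacks_needed(containers: int, containers_per_stack: int) -> int:
    """Stacks required to hold the containers (rounded up)."""
    return -(-containers // containers_per_stack)


def average_spend(total_usd: float, datacentres: int) -> float:
    """Average spend per datacentre, assuming an even split."""
    return total_usd / datacentres


if __name__ == "__main__":
    containers = containers_needed(SERVERS_PER_DATACENTRE, SERVERS_PER_CONTAINER)
    stacks = stacks_needed(containers, CONTAINERS_PER_STACK)
    spend = average_spend(TOTAL_SPEND_USD, DATACENTRES)
    print(f"{containers} containers in {stacks} stacks per datacentre")
    print(f"about ${spend / 1e6:.0f}m per datacentre on average")
```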

Designing a private cloud infrastructure


A couple of months ago, Facebook released a whole load of information about its servers and datacentres in a programme it calls the Open Compute Project. At around about the same time, I was sitting in a presentation at Microsoft, where I was introduced to some of the concepts behind their datacentres.  These are not small operations – Facebook’s platform currently serves around 600 million users and Microsoft’s various cloud properties account for a good chunk of the Internet, with the Windows Azure appliance concept under development for partners including Dell, HP, Fujitsu and eBay.

It’s been a few years since I was involved in any datacentre operations and it’s interesting to hear how times have changed. Whereas I knew about redundant uninterruptible power sources and rack-optimised servers, the model is now about containers of redundant servers and the unit of scale has shifted.  An appliance used to be a 1U (pizza box) server with a dedicated purpose but these days it’s a shipping container full of equipment!

There’s also been a shift from keeping the lights on at all costs, towards efficiency. Hardly surprising, given that the IT industry now accounts for around 3% of the world’s carbon emissions and we need to reduce the environmental impact.  Google’s datacentre design best practices are all concerned with efficiency: measuring power usage effectiveness; managing airflow; running warmer datacentres; using “free” cooling; and optimising power distribution.

So how do Microsoft (and, presumably others like Amazon too) design their datacentres? And how can we learn from them when developing our own private cloud operations?

Some of the fundamental principles include:

  1. Perception of infinite capacity.
  2. Perception of continuous availability.
  3. Drive predictability.
  4. Taking a service provider approach to delivering infrastructure.
  5. Resilience over redundancy mindset.
  6. Minimising human involvement.
  7. Optimising resource usage.
  8. Incentivising the desired resource consumption behaviour.

In addition, the following concepts need to be adopted to support the fundamental principles:

  • Cost transparency.
  • Homogenisation of physical infrastructure (aggressive standardisation).
  • Pooling compute resource.
  • Fabric management.
  • Consumption-based pricing.
  • Virtualised infrastructure.
  • Service classification.
  • Holistic approach to availability.
  • Compute resource decay.
  • Elastic infrastructure.
  • Partitioning of shared services.

In short, provisioning the private cloud is about taking the same architectural patterns that Microsoft, Amazon, et al. use for the public cloud and implementing them inside your own datacentre(s): thinking service, not server, to develop an internal infrastructure as a service (IaaS) proposition.

I won’t expand on all of the concepts here (many are self-explanatory), but some of the key ones are:

  • Create a fabric with resource pools of compute, storage and network, aggregated into logical building blocks.
  • Introduce predictability by defining units of scale and planning activity based on predictable actions (e.g. certain rates of growth).
  • Design across fault domains – understand what tends to fail first (e.g. the power in a rack) and make sure that services span these fault domains.
  • Plan upgrade domains (think about how to upgrade services and move between versions so service levels can be maintained as new infrastructure is rolled out).
  • Consider resource decay – what happens when things break?  Think about component failure in terms of service delivery and design for that. In the same way that a hard disk has a number of spare sectors that are used when others are marked bad (and eventually too many fail, so the disk is replaced), take a unit of infrastructure and leave faulty components in place (but disabled) until a threshold is crossed, after which the unit is considered faulty and is replaced or refurbished.
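
The resource decay idea in the last bullet can be sketched in a few lines of code. This is just an illustration of the principle, not how any provider actually implements it: the `InfrastructureUnit` class and the failure limit are hypothetical, and the 2500-server unit in the usage example simply borrows a container-sized figure.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class InfrastructureUnit:
    """A unit of scale (e.g. a rack or container) that tolerates failed servers."""

    name: str
    total_servers: int
    failure_limit: int  # failed-server count at which the whole unit is replaced
    failed: Set[int] = field(default_factory=set)

    def mark_failed(self, server_id: int) -> None:
        """Disable a faulty server but leave it in place (no repair visit)."""
        if not 0 <= server_id < self.total_servers:
            raise ValueError(f"unknown server {server_id}")
        self.failed.add(server_id)

    @property
    def healthy_capacity(self) -> int:
        """Servers still available to the fabric."""
        return self.total_servers - len(self.failed)

    @property
    def needs_replacement(self) -> bool:
        """True once enough servers have failed to cross the threshold."""
        return len(self.failed) >= self.failure_limit


if __name__ == "__main__":
    # Hypothetical container-sized unit that is swapped out at 10% failures.
    unit = InfrastructureUnit("container-01", total_servers=2500, failure_limit=250)
    for server_id in range(249):
        unit.mark_failed(server_id)
    print(unit.healthy_capacity, unit.needs_replacement)
    unit.mark_failed(249)
    print(unit.healthy_capacity, unit.needs_replacement)
```

The point of the design is that capacity planning works on `healthy_capacity` for the unit as a whole, so an individual failure never triggers a visit to the datacentre floor.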

A smaller company, with a small datacentre, may still think in terms of server components – larger organisations may be dealing with shipping containers.  Regardless of the size of the operation, the key to success is thinking in terms of services, not servers; and designing public cloud principles into private cloud implementations.