SQL Server and Hadoop – unlikely bedfellows but a powerful combination

Big Data is hard to avoid – what does Microsoft’s embrace of Hadoop mean for IT Managers?

There are two words that seem particularly difficult to avoid at the moment: big data. Infrastructure guys instinctivly shy away from data but such is its prevalence that big data is much more than just the latest IT buzzword and is becoming a major theme in our industry right now

But what does “big data” actually mean? It’s one of those phrases that, like cloud computing earlier, it is being “adopted” by vendors to mean whatever they want it to.

The McKinsey Global Institute describes big data as “the next frontier for innovation, competition and productivity” but, put simply, it’s about analysing masses of unstructured (or semi-structured) data which, until recently, was considered too expensive to do anything with.

That data comes from a variety of sources including sensors, social networks and digital media and it includes text, audio, video, click-streams, log files and more. Cynics who scoff at the description of “big” data (what’s next, “huge” data?) miss the point that it’s not just about the volume of the data (typically many petabytes) but also the variety and frequency of that data. Some even refer to it as “nano data” because what we’re actually looking at is massive sets of very small data.

Processing big data typically involves distributed computer systems and one project that has come to the fore is Apache Hadoop – a framework for development of open-source software for reliable, scalable distributed computing.

Over the last few weeks though, there have been some significant announcements from established IT players, not all of whom are known for embracing open source technology. This indicates a growing acceptance for big data solutions in general and specifically for solutions that include both open- and closed- source elements.

When Microsoft released a SQL Server-Hadoop (SQOOP) Connector,there were questions about what this would mean for CIOs and IT Managers who may previously have viewed technologies like Hadoop as a little esoteric.

The key to understanding what this would mean would be understanding the two main types of data: structured and unstructured. Structured data tends to be stored in a relational database management system (RDBMS), for example Microsoft SQL Server, IBM DB2, Oracle 11G or MySQL.

By structuring the data with a schema, tables, keys and all manner of relationships it’s possible to run queries (with a language like SQL) to analyse the data and techniques have developed over the years to optimise those queries. By contrast, unstructured data has no schema (at least not a formal one) and may be as simple as a set of files.  Structured data offers maturity, stability and efficiency but unstructured data offers flexibility.

Secondly, there needs to be an understanding of the term “NoSQL”.  Commonly misinterpreted as an instruction (no to SQL), it really means not only SQL – i.e. there are some types of data that are not worth storing in an RDBMS.  Rather than following the database model of extract, transform and load (ETL), with a NoSQL system the data arrives and the application knows how to interpret the data, providing a faster time to insight from data acquisition.

Just as there are two main types of data, there are two main types of NoSQL system: key/value stores (like MongoDB or Windows Azure Table Storage) can be thought of as NoSQL OLTP; Hadoop is more like NoSQL data warehousing and is particularly suited to storing and analysing massive data sets.

One of the key elements towards understanding Hadoop is understanding how the various Hadoop components work together. There’s a degree of complexity so perhaps it’s best to summarise  by saying that the Hadoop stack consists of a highly distributed, fault tolerant, file system (HDFS) and the MapReduce framework for writing and executing distributed, fault tolerant, algorithms. Built on top of that are query languages (live Hive and Pig) and then we have the layer where Microsoft’s SQOOP connector sits, connecting the two worlds of structured and unstructured data.

The trouble is that SQOOP is just a bridge – and not a particularly efficient one either – working on SQL data in the unstructured world involves subdivision of the SQL database so that MapReduce can work correctly.

Because most enterprises have both the structured and unstructured data, we really need tools that allow us to analyse and manage data in multiple environments – ideally without having to go back and forth. That’s why there are  so many vendors jumping on the big data bandwagon but it seems that a SQOOP connector is not the only work Microsoft is doing in the big data space:

In our increasingly cloudy world, infrastructure and platforms are rapidly becoming commoditised. We need to focus on software that allows us to derive value from data to gain some business value. Consider that Microsoft is only one vendor, then think about what Oracle, IBM, Fujitsu and others are doing. If you weren’t convinced before, maybe HP’s Autonomy purchase is starting to make sense now?

Looking specifically at Microsoft’s developments in the big data world, it therefore makes sense to see the company get closer to Hadoop. The world has spoken and the de facto solution for analysing large data sets seems to be HDFS/MapReduce/Hive (or similar).

Maybe Hadoop’s success comes down to HDFS and MapReduce being based on work from Google whilst Hive and Pig are supported by Facebook and Yahoo respectively (i.e. they are all from established Internet businesses).  But, by embracing Hadoop (together with porting its tools to competitive platforms), Microsoft is better placed to support the entire enterprise with both their structured and unstructured needs.

[This post was originally written as an article for Cloud Pro.]

Azure Connect – the missing link between on-premise and cloud

Azure Connect offers a way to connect on-premise infrastructure with Windows Azure but it’s lacking functionality that may hinder adoption.

While Microsoft is one of the most dominant players in client-server computing, until recently, its position in the cloud seemed uncertain.  More recently, we’ve seen Microsoft lay out its stall with both Software as a Service (SaaS) products including Office 365 and Platform as a Service (PaaS) offerings such as Windows Azure joining their traditional portfolio of on-premise products for consumers, small businesses and enterprise customers alike.

Whereas Amazon’s Elastic Compute Cloud (EC2) and Simple Storage Service (S3) offer virtualised Infrastructure as a Service (IaaS) and Salesforce.com is about consumption of Software as a Service (SaaS), Windows Azure fits somewhere in between. Azure offers compute and storage services, so that an organisation can take an existing application, wrap a service model around it and specify how many instances to run, how to persist data, etc.

Microsoft also provides middleware to support claims based authentication and an application fabric that allows simplified connectivity between web services endpoints, negotiating firewalls using outbound connections and standard Internet protocols. In addition, there is a relational database component (SQL Azure), which exposes relational database services for cloud consumption, in addition to the standard Azure table storage.

It all sounds great – but so far everything I’ve discussed runs on a public cloud service and not all applications can be moved in their entirety to the cloud.

Sometimes makes it makes sense to move compute operations to the cloud and keep the data on-premise (more on that in a moment). Sometimes, it’s appropriate to build a data hub with multiple business partners connecting to a data source in cloud but with applications components in a variety of locations.

For European CIOs, information security, in particular data residency, is a real issue. I should highlight that I’m not a legal expert, but CIO Magazine recently reported how the Patriot Act potentially gives the United States authorities access to data hosted with US-based service providers – and selecting a European data centre won’t help.  That might make CIOs nervous about placing certain types of data in the cloud although they might consider a hybrid cloud solution.

Azure already provides federated security, application layer connectivity (via AppFabric) and some options for SQL Azure data synchronisation (currently limited to synchronisation between Microsoft data centres, expanding later this year to include synchronisation with on-premise SQL Server) but the missing component has been the ability to connect Windows Azure with on-premise infrastructure and applications. Windows Azure Connect provides this missing piece of the jigsaw.

Azure Connect is a new component for Windows Azure that provides secure network communications between compute instances in Azure and servers on premise (ie behind the corporate firewall). Using standard IP protocols (both TCP and UDP) it’s possible to take a web front end to the cloud and leave the SQL Server data on site, communicating over a virtual private network, secured with IPSec. In another scenario, a compute instance can be joined to an on-premise Active Directory  domain so a cloud-based application can take advantage of single sign-on functionality. IT departments can also use Azure Connect for remote administration and troubleshooting of cloud-based computing instances.

Currently in pre-release form, Microsoft is planning to make Azure Connect available during the first half of 2011. Whilst setup is relatively simple and requires no coding, Azure Connect is reliant on an agent running on the connected infrastructure (ie on each server that connects to Azure resources) in order to establish IPSec connectivity (a future version of Azure Connect will be able to take advantage of other VPN solutions). Once the agent is installed, the server automatically registers itself with the Azure Connect relay in the cloud and network policies are defined to manage connectivity. All that an administrator has to do is to enable Windows Azure roles for external connectivity via the service model; enable local computers to initiate an IPSec connection by installing the Azure Connect agent; define network policies and, in some circumstances, define appropriate outbound firewall rules on servers.

The emphasis on simplicity is definitely an advantage as many Azure operations seem to require developer knowledge and this is definitely targeted at Windows Administrators. Along with automatic IPSec provisioning (so no need for certificate servers) Azure Connect makes use of DNS so that there is no requirement to change application code (the same server names can be used when roles move between the on premise infrastructure and Azure).

For some organisations though, the presence of the Azure Connect agent may be seen as a security issue – after all, how many database servers are even Internet-connected? That’s not insurmountable but it’s not the only issue with Azure Connect.

For example, connected servers need to run Windows Vista, 7, Server 2008, or Server 2008 R2 [a previous version of this story erroneously suggested that only Windows Server 2008 R2 was supported] and many organisations will be running their applications on older operating system releases. This means that there may be server upgrade costs to consider when integrating with the cloud – and it certainly rules out any heterogeneous environments.

There’s an issue with storage. Windows Azure’s basic compute and storage services can make use of table-based storage. Whilst SQL Azure is available for applications that require a relational database, not all applications have this requirement – and SQL Azure presents additional licensing costs as well as imposing additional architectural complexity.  A significant number of cloud-based applications make use of table storage or combination of table storage and SQL Server – for them, the creation of a hybrid model for customers that rely on on-premise data storage may not be possible.

For many enterprises, Azure Connect will be a useful tool in moving applications (or parts of applications) to the cloud. If Microsoft can overcome the product’s limitations, it could represent a huge step forward for Microsoft’s cloud services in that it provides a real option for development of hybrid cloud solutions on the Microsoft stack, but there still some way to go.

[This post was originally written as an article for Cloud Pro.]