Tag: Big data

Microsoft Ignite | The Tour: London Recap

One of the most valuable personal development activities in my early career was a trip to the Microsoft TechEd conference in Amsterdam. I learned a lot – not just technically but about making the most of events to gather information, make new industry contacts, and generally top up my knowledge. Indeed, even as a relatively junior consultant, I found that dipping into multiple topics for an hour or so gave me a really good grounding to discover more (or just enough to know something about the topic) – far more so than an instructor-led training course.

Over the years, I attended further “TechEd”s in Amsterdam, Barcelona and Berlin. I fought off the “oh Mark’s on another jolly” comments by sharing information – incidentally, conference attendance is no “jolly” – there may be drinks and even parties but those are after long days of serious mental cramming, often on top of broken sleep in a cheap hotel miles from the conference centre.

Microsoft TechEd is no more. Over the years, as the budgets were cut, the standard of the conference dropped and in the UK we had a local event called Future Decoded. I attended several of these – and it was at Future Decoded that I discovered risual – where I’ve been working for almost four years now.

Now, Future Decoded has also fallen by the wayside and Microsoft has focused on taking it’s principal technical conference – Microsoft Ignite – on tour, delivering global content locally.

So, a few weeks ago, I found myself at the ExCeL conference centre in London’s Docklands, looking forward to a couple of days at “Microsoft Ignite | The Tour: London”.

Conference format

Just like TechEd, and at Future Decoded (in the days before I had to use my time between keynotes on stand duty!), the event was broken up into tracks with sessions lasting around an hour. Because that was an hour of content (and Microsoft event talks are often scheduled as an hour, plus 15 minutes Q&A), it was pretty intense, and opportunities to ask questions were generally limited to trying to grab the speaker after their talk, or at the “Ask the Experts” stands in the main hall.

One difference to Microsoft conferences I’ve previously attended was the lack of “level 400” sessions: every session I saw was level 100-300 (mostly 200/300). That’s fine – that’s the level of content I would expect but there may be some who are looking for more detail. If it’s detail you’re after then Ignite doesn’t seem to be the place.

Also, I noticed that Day 2 had fewer delegates and lacked some of the “hype” from Day 1: whereas the Day 1 welcome talk was over-subscribed, the Day 2 equivalent was almost empty and light on content (not even giving airtime to the conference sponsors). Nevertheless, it was easy to get around the venue (apart from a couple of pinch points).

Personal highlights

I managed to cover 11 topics over two days (plus a fair amount of networking). The track format of the event was intended to let a delegate follow a complete learning path but, as someone who’s a generalist (that’s what Architects have to be), I spread myself around to cover:

Dealing with a massive onset of data ingestion (Jeramiah Dooley/@jdooley_clt).
Enterprise network connectivity in a cloud-first world (Paul Collinge/@pcollingemsft).
Building a world without passwords.
Discovering Azure Tooling and Utilities (Simona Cotin/@simona_cotin).
Selecting the right data storage strategy for your cloud application (Jeramiah Dooley/@jdooley_clt).
Governance in Azure (Sam Cogan/@samcogan).
Planning and implementing hybrid network connectivity (Thomas Maurer/@ThomasMaurer).
Transform device management with Windows Autopilot, Intune and OneDrive (Michael Niehaus/@mniehaus and Mizanur Rahman).
Maintaining your hybrid environment (Niel Peterson/@nepeters).
Windows Server 2019 Deep Dive (Jeff Woolsey/@wsv_guy).
Consolidating infrastructure with the Azure Kubernetes Service (Erik St Martin/@erikstmartin).

In the past, I’d have written a blog post for each topic. I was going to say that I simply don’t have the time to do that these days but by the time I’d finished writing this post, I thought maybe I could have split it up a bit more! Regardless, here are some snippets of information from my time at Microsoft Ignite | The Tour: London. There’s more information in the slide decks – which are available for download, along with the content for the many sessions I didn’t attend.

Data ingestion

Ingesting data can be broken into:

Real-time ingestion.
Real-time analysis (see trends as they happen – and make changes to create a competitive differentiator).
Producing actions as patterns emerge.
Automating reactions in external services.
Making data consumable (in whatever form people need to use it).

Azure has many services to assist with this – take a look at IoT Hub, Azure Event Hubs, Azure Databricks and more.

Enterprise network connectivity for the cloud

Cloud traffic is increasing whilst traffic that remains internal to the corporate network is in decline. Traditional management approaches are no longer fit for purpose.

Office applications use multiple persistent connections – this causes challenges for proxy servers which generally degrade the Office 365 user experience. Remediation is possible, with:

Differentiated traffic – follow Microsoft advice to manage known endpoints, including the Office 365 IP address and URL web service.
Let Microsoft route traffic (data is in a region, not a place). Use DNS resolution to egress connections close to the user (a list of all Microsoft peering locations is available). Optimise the route length and avoid hairpins.
Assess network security using application-level security, reducing IP ranges and ports and evaluating the service to see if some activities can be performed in Office 365, rather than at the network edge (e.g. DLP, AV scanning).

For Azure:

Azure ExpressRoute is a connection to the edge of the Microsoft global backbone (not to a datacentre). It offers 2 lines for resilience and two peering types at the gateway – private and public (Microsoft) peering.
Azure Virtual WAN can be used to build a hub for a region and to connect sites.
Replace branch office routers with software-defined (SDWAN) devices and break out where appropriate.

Real-world demo of local and direct network egress cf. centralised approach #Office365 #MSIgniteTheTour pic.twitter.com/1wuVNegXQs
— Mark Wilson ???? (@markwilsonit) February 26, 2019

Passwordless authentication

Basically, there are three options:

Windows Hello.
Microsoft Authenticator.
FIDO2 Keys.

Azure tooling and utilities

Useful resources include:

Azure Portal:
- Create dashboards to optimise the experience.
Visual Studio Code:
- An extensible code editor with a marketplace.
- Edit, debug and run code including native integration with source control.
- Rich coding experience including Intellisense.
- Interactive Playground to learn new skills (find out more – along with other tips and tricks in this article).
Azure Command Line Interface (CLI) and Azure PowerShell:
- CLI is cross-platform (Mac/Windows/Linux) and can be run locally or in the Cloud Shell.
- There are PowerShell Modules for Azure.
Azure Cloud Shell:
- Bash or PowerShell.
- Access via >_ icon in the Azure Portal – or go to shell.azure.com.
- Uses an underlying storage account so can access files from other locations.
- Can even open an instance of Visual Studio Code inside the Cloud Shell.
Azure Resource Manager (ARM) Templates:
- JSON format templates, supporting parameters, variables, resources and objects.

Selecting data storage for a cloud application

What to use? It depends! Classify data by:

Type of data:
- Structured (fits into a table)
- Semi-structured (may fit in a table but may also use outside metadata, external tables, etc.)
- Unstructured (documents, images, videos, etc.)
Properties of the data:
- Volume (how much)
- Velocity (change rate)
- Variety (sources, types, etc.)

Item	Type	Volume	Velocity	Variety
Product catalogue	Semi-structured	High	Low	Low
Product photos	Unstructured	High	Low	Low
Sales data	Semi-structured	Medium	High	High

How to match data to storage:

Storage-driven: build apps on what you have.
Cloud-driven: deploy to the storage that makes sense.
Function-driven: build what you need; storage comes with it.

Governance in Azure

It’s important to understand what’s running in an Azure subscription – consider cost, security and compliance:

Review (and set a baseline):
- Tools include: Resource Graph; Cost Management; Security Center; Secure Score.
Organise (housekeeping to create a subscription hierarchy, classify subscriptions and resources, and apply access rights consistently):
- Tools include: Management Groups; Tags; RBAC;
Audit:
- Make changes to implement governance without impacting people/work. Develop policies, apply budgets and audit the impact of the policies.
- Tools include: Cost Management; Azure Policy.
Enforce
- Change policies to enforcement, add resolution actions and enforce budgets.
- Consider what will happen for non-compliance?
- Tools include: Azure Policy; Cost Management; Azure Blueprints.
(Loop back to review)
- Have we achieved what we wanted to?
- Understand what is being spent and why.
- Know that only approved resources are deployed.
- Be sure of adhering to security practices.
- Opportunities for further improvement.

Planning and implementing hybrid network connectivity

Moving to the cloud allows for fast deployment but planning is just as important as it ever was. Meanwhile, startups can be cloud-only but most established organisations have some legacy and need to keep some workloads on-premises, with secure and reliable hybrid communication.

Considerations include:

Extension of the internal protected network:
- Should workloads in Azure only be accessible from the Internal network?
- Are Azure-hosted workloads restricted from accessing the Internet?
- Should Azure have a single entry and egress point?
- Can the connection traverse the public Internet (compliance/regulation)?
IP addressing:
- Existing addresses on-premises; public IP addresses.
- Namespaces and name resolution.
Multiple regions:
- Where are the users (multiple on-premises sites); where are the workloads (multiple Azure regions); how will connectivity work (should each site have its own connectivity)?
Azure virtual networks:
- Form an isolated boundary with secure communications.
- Azure-assigned IP addresses (no need for a DHCP server).
- Segmented with subnets.
- Network Security Groups (NSGs) create boundaries around subnets.
Connectivity:
- Site to site (S2S) VPNs at up to 1Gbps
  - Encrypted traffic over the public Internet to the GatewaySubnet in Azure, which hosts VPN Gateway VMs.
  - 99.9% SLA on the Gateway in Azure (not the connection).
  - Don’t deploy production workloads on the GatewaySubnet; /26, /27 or /28 subnets recommended; don’t apply NSGs to the GatewaySubnet – i.e. let Azure manage it.
- Dedicated connections (Azure ExpressRoute): private connection at up to 10Gbps to Azure with:
  - Private peering (to access Azure).
  - Microsoft peering (for Office 365, Dynamics 365 and Azure public IPs).
  - 99.9% SLA on the entire connection.
- Other connectivity services:
  - Azure ExpressRoute Direct: a 100Gbps direct connection to Azure.
  - Azure ExpressRoute Global Reach: using the Microsoft network to connect multiple local on-premises locations.
  - Azure Virtual WAN: branch to branch and branch to Azure connectivity with software-defined networks.
Hybrid networking technologies:
- Azure network adapter for Windows Server 2019 is a point to site (P2S) VPN solution to connect a Windows Server to an Azure virtual network.
- Azure File Sync offers tiered storage for Windows Server with Azure file storage and rapid disaster recovery. Can be thought of as “OneDrive for servers”.

Modern Device Management (Autopilot, Intune and OneDrive)

The old way of managing PC builds:

Build an image with customisations and drivers
Deploy to a new computer, overwriting what was on it
Expensive – and the device has a perfectly good OS – time-consuming

Instead, how about:

Unbox PC
Transform with minimal user interaction
Device is ready for productive use

The transformation is:

Take OEM-optimised Windows 10:
- Windows 10 Pro and drivers.
- Clean OS.
Plus software, settings, updates, features, user data (with OneDrive for Business).
Ready for productive use.

The goal is to reduce the overall cost of deploying devices. Ship to a user with half a page of instructions…

Autopilot deployment is cloud driven and will eventually be centralised through Intune:

Register device:
- From OEM or Channel (manufacturer, model and serial number).
- Automatically (existing Intune-managed devices).
- Manually using a PowerShell script to generate a CSV file with serial number and hardware hash, which is then uploaded to the Intune portal.
Assign Autopilot profile:
- Use Azure AD Groups to assign/target.
- The profile includes settings such as deployment mode, BitLocker encryption, device naming, out of box experience (OOBE).
- An Azure AD device object is created for each imported Autopilot device.
Deploy:
- Needs Azure AD Premium P1/P2
- Scenarios include:
  - User-driven with Azure AD:
    - Boot to OOBE, choose language, locale, keyboard and provide credentials.
    - The device is joined to Azure AD, enrolled to Intune and policies are applied.
    - User signs on and user-assigned items from Intune policy are applied.
    - Once the desktop loads, everything is present, including file links in OneDrive) – time depends on the software being pushed.
  - Self-deploying (e.g. kiosk, digital signage):
    - No credentials required; device authenticates with Azure AD using TPM 2.0.
  - User-driven with hybrid Azure AD join:
    - Requires Offline Domain Join Connector to create AD DS computer account.
    - Device connected to the corporate network (in order to access AD DS), registered with Autopilot, then as before.
    - Sign on to Azure AD and then to AD DS during deployment. If they use the same UPN then it makes things simple for users!
  - Autopilot for existing devices (Windows 7 to 10 upgrades):
    - Backup data in advance (e.g. with OneDrive)
    - Deploy generic Windows 10.
    - Run Autopilot user-driven mode (can’t harvest hardware hashes in Windows 7 so use a JSON config file in the image – the offline equivalent of a profile. Intune will ignore unknown device and Autopilot will use the file instead; after deployment of Windows 10, Intune will notice a PC in the group and apply the profile so it will work if the PC is reset in future).

Autopilot roadmap (1903) includes:

“White glove” pre-provisioning for end users: QR code to track, print welcome letter and shipping label!
Enrolment status page (ESP) improvements.
Cortana voiceover disabled on OOBE.
Self-updating Autopilot (update Autopilot without waiting to update Windows).

Windows #Autopilot roadmap #MSIgniteTheTour – some great enhancements coming for #Windows10 1903 – “white glove” deployments sound particularly useful pic.twitter.com/iKlsrSJFZk
— Mark Wilson ???? (@markwilsonit) February 27, 2019

Maintaining your hybrid environment

Common requirements in an IaaS environment include wanting to use a policy-based configuration with a single management and monitoring solution and auto-remediation.

Azure Automation allows configuration and inventory; monitoring and insights; and response and automation. The Azure Portal provides a single pane of glass for hybrid management (Windows or Linux; any cloud or on-premises).

For configuration and state management, use Azure Automation State Configuration (built on PowerShell Desired State Configuration).

Inventory can be managed with Log Analytics extensions for Windows or Linux. An Azure Monitoring Agent is available for on-premises or other clouds. Inventory is not instant though – can take 3-10 minutes for Log Analytics to ingest the data. Changes can be visualised (for state tracking purposes) in the Azure Portal.

Azure Monitor and Log Analytics can be used for data-driven insights, unified monitoring and workflow integration.

Responding to alerts can be achieved with Azure Automation Runbooks, which store scripts in Azure and run them in Azure. Scripts can use PowerShell or Python so support both Windows and Linux). A webhook can be triggered with and HTTP POST request. A Hybrid runbook worker can be used to run on-premises or in another cloud.

It’s possible to use the Azure VM agent to run a command on a VM from Azure portal without logging in!

Windows Server 2019

Windows Server strategy starts with Azure. Windows Server 2019 is focused on:

Hybrid:
- Backup/connect/replicate VMs.
- Storage Migration Service to migrate unstructured data into Azure IaaS or another on-premises location (from 2003+ to 2016/19).
  1. Inventory (interrogate storage, network security, SMB shares and data).
  2. Transfer (pairings of source and destination), including ACLs, users and groups. Details are logged in a CSV file.
  3. Cutover (make the new server look like the old one – same name and IP address). Validate before cutover – ensure everything will be OK. Read-only process (except change of name and IP at the end for the old server).
- Azure File Sync: centralise file storage in Azure and transform existing file servers into hot caches of data.
- Azure Network Adapter to connect servers directly to Azure networks (see above).
Hyper-converged infrastructure (HCI):
- The server market is still growing and is increasingly SSD-based.
- Traditional rack looked like SAN, storage fabric, hypervisors, appliances (e.g. load balancer) and top of rack Ethernet switches.
- Now we use standard x86 servers with local drives and software-defined everything. Manage with Admin Center in Windows Server (see below).
- Windows Server now has support for persistent memory: DIMM-based; still there after a power-cycle.
- The Windows Server Software Defined (WSSD) programme is the Microsoft approach to software-defined infrastructure.
Security: shielded VMs for Linux (VM as a black box, even for an administrator); integrated Windows Defender ATP; Exploit Guard; System Guard Runtime.
Application innovation: semi-annual updates are designed for containers. Windows Server 2019 is the latest LTSC channel so it has the 1709/1803 additions:
- Enable developers and IT Pros to create cloud-native apps and modernise traditional apps using containers and micro services.
- Linux containers on Windows host.
- Service Fabric and Kubernetes for container orchestration.
- Windows subsystem for Linux.
- Optimised images for server core and nano server.

Windows Admin Center is core to the future of Windows Server management and, because it’s based on remote management, servers can be core or full installations – even containers (logs and console). Download from http://aka.ms/WACDownload

50MB download, no need for a server. Runs in a browser and is included in Windows/Windows Server licence
Runs on a layer of PowerShell. Use the >_ icon to see the raw PowerShell used by Admin Center (copy and paste to use elsewhere).
Extensible platform.

What’s next?

More cloud integration
Update cadence is:
- Insider builds every 2 weeks.
- Semi-annual channel every 6 months (specifically for containers):
  - 1709/1803/1809/19xx.
- Long-term servicing channel
  - Every 2-3 years.
  - 2016, 2019 (in September 2018), etc.

Windows Server 2008 and 2008 R2 reach the end of support in January 2020 but customers can move Windows Server 2008/2008 R2 servers to Azure and get 3 years of security updates for free (on-premises support is chargeable).

Further reading: What’s New in Windows Server 2019.

Containers/Azure Kubernetes Service

Containers:

Are fully-packaged applications that use a standard image format for better resource isolation and utilisation.
Are ready to deploy via an API call.
Are not Virtual machines (for Linux).
Do not use hardware virtualisation.
Offer no hard security boundary (for Linux).
Can be more cost effective/reliable.
Have no GUI.

Kubernetes is:

An open source system for auto-deployment, scaling and management of containerized apps.
Container Orchestrator to manage scheduling; affinity/anti-affinity; health monitoring; failover; scaling; networking; service discovery.
Modular and pluggable.
Self-healing.
Designed by Google based on a system they use to run billions of containers per week.
Described in “Phippy goes to the zoo”.

Azure container offers include:

Azure Container Instances (ACI): containers on demand (Linux or Windows) with no need to provision VMs or clusters; per-second billing; integration with other Azure services; a public IP; persistent storage.
Azure App Service for Linux: a fully-managed PaaS for containers including workflows and advanced features for web applications.
Azure Kubernetes Service (AKS): a managed Kubernetes offering.

Wrap-up

So, there you have it. An extremely long blog post with some highlights from my attendance at Microsoft Ignite | The Tour: London. It’s taken a while to write up so I hope the notes are useful to someone else!

Fantastic couple of days at #MSIgniteTheTour (although quieter/smaller than I expected). Thanks to all the speakers – it’s been great to dip into such a wide variety of topics. Now, back to the day job (and normal tweeting levels!) pic.twitter.com/NtTDXOG22h
— Mark Wilson ???? (@markwilsonit) February 27, 2019

Tuesday 19 March 2019

Seven technology trends to watch 2017-2020

Just over a week ago, risual held its bi-annual summit at the risual HQ in Stafford – the whole company back in the office for a day of learning with a new format: a mini-conference called risual:NXT.

I was given the task of running the technical track – with 6 speakers presenting on a variety of topics covering all of our technical practices: Cloud Infrastructure; Dynamics; Data Platform; Unified Intelligent Communications and Messaging; Business Productivity; and DevOps – but I was also privileged to be asked to present a keynote session on technology trends. Unfortunately, my 35-40 minutes of content had to be squeezed into 22 minutes… so this blog post summarises some of the points I wanted to get across but really didn’t have the time.

1. The cloud was the future once

For all but a very small number of organisations, not using the cloud means falling behind. Customers may argue that they can’t use cloud service because of regulatory or other reasons but that’s rarely the case – even the UK Police have recently been given the green light (the blue light?) to store information in Microsoft’s UK data centres.

Don’t get me wrong – hybrid cloud is more than tactical. It will remain part of the landscape for a while to come… that’s why Microsoft now has Azure Stack to provide a means for customers to run a true private cloud that looks and works like Azure in their own datacentres.

Thankfully, there are fewer and fewer CIOs who don’t see the cloud forming part of their landscape – even if it’s just commodity services like email in Office 365. But we need to think beyond lifting and shifting virtual machines to IaaS and running email in Office 365.

Organisations need to transform their cloud operations because that’s where the benefits are – embrace the productivity tools in Office 365 (no longer just cloud versions of Exchange/Lync/SharePoint but a full collaboration stack) and look to build new solutions around advanced workloads in Azure. Microsoft is way ahead in the PaaS space – machine learning (ML), advanced analytics, the Internet of Things (IoT) – there are so many scenarios for exploiting cloud services that simply wouldn’t be possible on-premises without massive investment.

And for those who still think they can compete with the scale that Microsoft (Amazon and Google) operate at, this video might provide some food for thought…

(and for a similar video from a security perspective…)

2. Data: the fuel of the future

I hate referring to data as “the new oil”. Oil is a finite resource. Data is anything but finite! It is a fuel though…

Data is what provides an economic advantage – there are businesses without data and those with. Data is the business currency of the future. Think about it: Facebook and Google are entirely based on data that’s freely given up by users (remember, if you’re not paying for a service – you are the service). Amazon wouldn’t be where it is without data.

So, thinking about what we do with that data: the 1st wave of the Internet was about connecting computers, 2nd was about people, the 3rd is devices.

Despite what you might read, IoT is not about connected kettles/fridges. It’s not even really about home automation with smart lightbulbs, thermostats and door locks. It’s about gathering information from billions of sensors out there. Then, we take that data and use it to make intelligent decisions and apply them in the real world. Artificial intelligence and machine learning feed on data – they are ying and yang to each other. We use data to train algorithms, then we use the algorithms to process more data.

The Microsoft Data Platform is about analytics and data driving a new wave of insights and opening up possibilities for new ways of working.

James Watt’s 18th Century steam engine led to an industrial revolution. The intelligent cloud is today’s version – moving us to the intelligence revolution.

3 Blockchain

Bitcoin is just one implementation of something known as the Blockchain. In this case as a digital currency.

But Blockchain is not just for monetary transactions – it’s more than that. It can be used for anything transactional. Blockchain is about a distributed ledger. Effectively, it allows parties to trust one another without knowing each other. The ledger is a record of every transaction, signed and tamper-proof.

The magic about Blockchain is that as the chain gets longer so does the entropy and the encryption level – effectively, the more the chain is used, the more secure it gets. That means infinite integrity.

(Read more in Jamie Skella’s “A blockchain explaination your parents could understand”.)

Blockchain is seen as strategic by Microsoft and by the UK government and it’s early days but we will see where people want to talk about integrity and data resilience with integrity. Databases – anything transactional – can be signed with blockchain.

A group of livestock farmers in Arkansas is using blockchain technology so customers can tell where their dinner comes from. They are applying blockchain technology to trace products from ‘farm to fork’ aiming to provide consumers with information about the origin and quality of the meat they buy.

Blockchain is finding new applications in the enterprise and Microsoft has announced the CoCo Framework to improve performance, confidentiality and governance characteristics of enterprise blockchain networks (read more in Simon Bisson’s article for InfoWorld). There’s also Blockchain as a service (in Azure) – and you can find more about Microsoft’s plans by reading up on “Project Bletchley”.

(BTW, Bletchley is a town in Buckinghamshire that’s now absorbed into Milton Keynes. Bletchley Park was the primary location of the UK Government’s wartime code-cracking efforts that are said to have shortened WW2 by around 2 years. Not a bad name for a cryptographic technology, hey?)

4 Into the third dimension

So we’ve had the ability to “print” in 3 dimensions for a while but now 3D is going further.Now we’re taking physical worlds into the virtual world and augmenting with information.

Microsoft doesn’t like the term augmented reality (because it’s being used for silly faces on photos) and they have coined the term mixed reality to describe taking untethered computing devices and creating a seamless overlap between physical and virtual worlds.

To make use of this we need to be able to scan and render 3D images, then move them into a virtual world. 3D is built into next Windows 10 release (the Fall Creators update, due on 17 October 2017). This will bring Paint 3D, a 3D Gallery, View 3D for our phones – so we can scan any object and import to a virtual world. With the adoption rates of new Windows 10 releases then that puts 3D on a market of millions of PCs.

This Christmas will see lots of consumer headsets in the market. Mixed reality will really take off after that. Microsoft is way ahead in the plumbing – all whilst we didn’t notice. They held their Hololens product back to be big in business (so that it wasn’t a solution without a problem). Now it can be applied to field worker scenarios, visualising things before they are built.

To give an example, recently, I had a builder quote for a loft extension at home. He described how the stairs will work and sketched a room layout – but what if I could have visualised it in a headset? Then imagine picking the paint, sofas, furniture, wallpaper, etc.

The video below shows how Ford and Microsoft have worked together to use mixed reality to shorten and improve product development:

5 The new dawn of artificial intelligence

All of the legends of AI are set by sci-fi (Metropolis, 2001 AD, Terminator). But AI is not about killing us all! Humans vs. machines? Deep Blue beating people at Chess, Jeopardy, then Google taking on Go. Heading into the economy and displacing jobs. Automation of business process/economic activity. Mass unemployment?

Let’s take a more optimistic view! It’s not about sentient/thinking machines or giving human rights to machines. That stuff is interesting but we don’t know where consciousness comes from!

AI is a toolbox of high-value tools and techniques. We can apply these to problems and appreciate the fundamental shift from programming machines to machines that learn.

Ai is not about programming logical steps – we can’t do that when we’re recognising images, speech, etc. Instead, our inspiration is biology, neural networks, etc. – using maths to train complex layers of neural networks led to deep learning.

Image recognition was “magic” a few years ago but now it’s part of everyday life. Nvidia’s shares are growing massively due to GPU requirements for deep learning and autonomous vehicles. And Microsoft is democratising AI (in its own applications – with an intelligent cloud, intelligent agents and bots).

So, about those bots…

A bot is a web app and a conversational user interface. We use them because natural language processing (NLP) and AI are here today. And because messaging apps rule the world. With bots, we can use Human language as a new user interface; bots are the new apps – our digital assistants.

We can employ bots in several scenarios today – including customer service and productivity – and this video is just one example, with Microsoft Cortana built into a consumer product:

The device is similar to Amazon’s popular Echo smart speaker and a skills kit is used to teach Cortana about an app; Ask “skillname to do something”. The beauty of Cortana is that it’s cross-platform so the skill can show up wherever Cortana does. More recently, Amazon and Microsoft have announced Cortana-Alexa integration (meanwhile Siri continues to frustrate…)

AI is about augmentation, not replacement. It’s true that bots may replace humans for many jobs – but new jobs will emerge. And it’s already here. It’s mainstream. We use recommendations for playlists, music, etc. We’re recognising people, emotions, etc. in images. We already use AI every day…

6 From silicon to cells

Every cell has a “programme” – DNA. And researchers have found that they can write code in DNA and control proteins/chemical processes. They can compile code to DNA and execute, creating molecular circuits. Literally programming biology.

This is absolutely amazing. Back when I was an MVP, I got the chance to see Microsoft Research talk about this in Cambridge. It blew my mind. That was in 2010. Now it’s getting closer to reality and Microsoft and the University of Washington have successfully used DNA for storage:

The benefits of DNA are that it’s very dense and it lasts for thousands of years so can always be read. And we’re just storing 0s and 1s – that’s much simpler than what DNA stores in nature.

Another incredible example of DNA data storage https://t.co/xNquWAdlzN

— Satya Nadella (@satyanadella) September 29, 2017

7 Quantum computing

With massive data storage… the next step is faster computing – that’s where Quantum computing comes in.

I’m a geek and this one is tough to understand… so here’s another video:

Quantum computing is starting to gain momentum. Dominated by maths (quantum mechanics), it requires thinking in equations, not translating into physical things in your head. It has concepts like superposition (multiple states at the same time) and entanglement. Instead of gates being turned on/off it’s about controlling particles with nanotechnology.

A classical 2 bit on-off takes 2 clock cycles. One quantum bit (a Qubit) has multiple states at the same time. It can be used to solve difficult problems (the RSA 2048 challenge problem would take a billion years on a supercomputer but just 100 seconds on a 250-bit quantum computer). This can be applied to encryption and security, health and pharma, energy, biotech, environment, materials and engineering, AI and ML.

There’s a race for quantum computing hardware taking place and China sees this as a massively strategic direction. Meanwhile, the UK is already an academic centre of excellence – now looking to bring quantum computing to market. We’ll have usable devices in 2-3 years (where “usable” means that they won’t be cracking encryption, but will have initial applications in chemistry and biology).

Microsoft Research is leading a consortium called Station Q and, later this year, Microsoft will release a new quantum computing programming language, along with a quantum computing simulator. With these, developers will be able to both develop and debug quantum programs implementing quantum algorithms.

Predicting the future?

Amazon, Google and Microsoft each invest over $12bn p.a. on R&D. As demonstrated in the video above, their datacentres are not something that many organisations can afford to build but they will drive down the cost of computing. That drives down the cost for the rest of us to rent cloud services, which means more data, more AI – and the cycle continues.

I’ve shared 7 “technology bets” (and there are others, like the use of Graphene) that I haven’t covered – my list is very much influenced by my work with Microsoft technologies and services. We can’t always predict the future but all of these are real… the only bet is how big they are. Some are mainstream, some are up and coming – and some will literally change the world.

Credit: Thanks to Rob Fraser at Microsoft for the initial inspiration – and to Alun Rogers (@AlunRogers) for helping place some of these themes into context.

Tuesday 10 October 2017
The “wheel of fortune”

Last week, I wrote about the White Book of Big Data – a publication I co-authored last year at Fujitsu.

One of the more interesting (for me) sections of the document was an idea from one of my colleagues, providing a model to determine next steps in forming a strategy for embracing a new approach (in this case to move forward towards gaining value from the use of a big data solution but it can be applied to other scenarios too).

The model starts with a “wheel” diagram and, at the centre is the first decision point. All organisations exist to generate profit (even non-profits work on the same principles, they just don’t return those profits to shareholders). There are two ways to increase profit: reducing cost; or increasing revenue.

For each of the reduce cost/increase revenue sectors, there are two more options: direct or indirect.

These four selections lead to a number of other opportunities and these may be prioritised to determine which areas to focus on in a particular business scenario.

With those priorities highlighted, a lookup table can be used to suggest appropriate courses of action to take next.

It’s one of those models that’s simple and, I think, quite elegant. I’ll be looking to adopt this in other scenarios in future and I thought that readers of this blog might find it useful too…

Take a look at the book if you want to see this working in practice – “the wheel” is on page 37.

Tuesday 24 September 2013
The White Book of Big Data

Almost exactly a year ago, I was part of a team at Fujitsu that wrote a short publication called the White Book of Big Data.

This was the third book in the successful “white book” series, aimed at helping CIOs to cut through vendor hype on technology and business trends, following on from the White Book of Cloud Adoption and the White Book of Cloud Security.

At the time, I was keen to shout about this work but couldn’t track down an externally-visible link (and I was asked not to publish it directly myself). Now, when big data has become such an incredibly over-hyped term (so much so that I try not to use the term myself), I’ve found that the book has been available for some time via the Cloud Solutions page on the Fujitus website!

Irrespective of the time it’s taken for me to be able to write about this (and any bias I may have as one of the authors) I still think it’s a useful resource for anyone trying to cut through the vendor hype. At no point does it try to directly sell Fujitsu products – and I’d be interested in any feedback that anyone has after reading it. If you’d like to read the book, you can download a PDF.

As I’ve changed roles since the book was published, I think it’s unlikely I’ll be involved in any future publications of this type (I always wanted to create a White Book of “Bring Your Own” Computing) – unless I can encourage any of my marketing colleagues to sponsor a White Book of Messaging!

Thursday 19 September 2013
The annotated world – the future of geospatial technology? (@EdParsons at #DigitalSurrey)
Tonight’s Digital Surrey was, as usual, a huge success with a great speaker (Google’s @EdParsons) in a fantastic venue (Farnham Castle). Ed spoke about the future of geospatial data – about annotating our world to enhance the value that we can bring from mapping tools today but, before he spoke of the future, he took a look at how we got to where we are.

What is geospatial information? And how did we get to where we are today?

Geospatial information is very visual, which makes it powerful for telling stories and one of the most famous and powerful images is that of the Earth viewed from space – the “blue marble”. This emotive image has been used many times but has only been personally witnessed by around 20 people, starting with the Apollo 8 crew, 250000 miles from home, looking at their own planet. We see this image with tools like Google Earth, which allows us to explore the planet and look at humankind’s activities. Indeed about 1 billion people use Google Maps/Google Earth every week – that’s about a third of the Internet population, roughly equivalent to Facebook and Twitter combined [just imagine how successful Google would be if they were all Google+ users…]. Using that metric, we can say that geospatial data is now pervasive – a huge shift over the last 10 years as it has become more accessible (although much of the technology has been around longer).

The annotated world is about going beyond the image and pulling out info otherwise invisible information, so, in a digital sense, it’s now possible to have map of 1:1 scale or even beyond. For example, in Google Maps we can look at StreetView and even see annotations of buildings. This can be augmented with further information (e.g restrictions in the directions in which we can drive, details about local businesses) to provide actionable insight. Google also harvests information from the web to create place pages (something that could be considered ethically dubious, as it draws people away from the websites of the businesses involved) but it can also provide additional information from image recognition – for example identifying the locations of public wastebins or adding details of parking restrictions (literally from text recognition on road signs). The key to the annotated web is collating and presenting information in a way that’s straightforward and easy to use.

Using other tools in the ecosystem, mobile applications can be used to easily review a business and post it via Google+ (so that it appears on the place page); or Google MapMaker may be used by local experts to add content to the map (subject to moderation – and the service is not currently available in the UK…).

So, that’s where we are today… we’re getting more and more content online, but what about the next 10 years?

A virtual (annotated) world

Google and others are building a virtual world in three dimensions. In the past, Google Earth pulled data from many sets (e.g. building models, terrain data, etc.) but future 3D images will be based on photographs (just as, apparently, Nokia have done for a while). We’ll also see 3D data being using to navigate inside buildings as well as outside. In one example, Google is working with John Lewis, who have recently installed Wi-Fi in their stores – to use this to determine a user’s location determination and combine this with maps to navigate the store. The system is accurate to about 2-3 metres [and sounds similar to Tesco’s “in store sat-nav” trial] and apparently it’s also available in London railway stations, the British Museum, etc.

[blackbirdpie url=”https://twitter.com/markwilsonit/status/228568510723411968″]

Ed made the point that the future is not driven by paper-based cartography, although there were plenty of issues taken with this in the Q&A later, highlighting that we still use ancient maps today, and that our digital archives are not likely to last that long.

Moving on, Ed highlighted that Google now generates map tiles on the fly (it used to take 6 weeks to rebuild the map) and new presentation technologies allow for client-side rendering of buildings – for example St Pauls Cathedral, in London. With services such as Google Now (on Android), contextual info may be provided, driven by location and personality

With Google’s Project Glass, that becomes even more immersive with augmented reality driven by the annotated world:

Although someone also mentioned to me the parody which also raises some good points:

Seriously, Project Glass makes Apple’s Siri look way behind the curve – and for those who consider the glasses to be a little uncool, I would expect them to become much more “normal” over time – built into a normal pair of shades, or even into prescription glasses… certainly no more silly than those Bluetooth earpieces the we used to use!

Of course, there are privacy implications to overcome but, consider what people share today on Facebook (or wherever) – people will share information when they see value in it.

Big data, crowdsourcing 2.0 and linked data

At this point, Ed’s presentation moved on to talk about big data. I’ve spent most of this week co-writing a book on this topic (I’ll post a link when it’s published) and nearly flipped when I heard the normal big data marketing rhetoric (the 3 Vs) being churned out. Putting aside the hype, Google should know quite a bit about big data (Google’s search engine is a great example and the company has done a lot of work in this area) and the annotated world has to address many of the big data challenges including:
- Data integration.
- Data transformation.
- Near-real-time analysis using rules to process data and take appropriate action (complex event processing).
- Semantic analysis.
- Historical analysis.
- Search.
- Data storage.
- Visualisation.
- Data access interfaces.
Moving back to Ed’s talk, what he refers to as “Crowdsourcing 2.0” is certainly an interesting concept. Citing Vint Cerf (Internet pioneer and Google employee), Ed said that there are an estimated 35bn devices connected to the Internet – and our smartphones are great examples, crammed full of sensors. These sensors can be used to provide real-time information for the annotated world: average journey times based on GPS data, for example; or even weather data if future smartphones were to contain a barometer.

Linked data is another topic worthy of note, which, at its most fundamental level is about making the web more interconnected. There’s a lot of work been done into ontologies, categorising content, etc. [Plug: I co-wrote a white paper on the topic earlier this year] but Google, Yahoo, Microsoft and others are supporting schema.org as a collection of microformats, which are tags that websites can use to mark up content in a way that’s recognised by major search providers. For example, a tag like <span itemprop="addresscountry">Spain</span> might be used to indicate that Spain is a country with further tags to show that Barcelona is a city, and that Noucamp is a place to visit.

Ed’s final thoughts

Summing up, Ed reiterated that paper maps are dead and that they will be replaced with more personalised information (of which, location is a component that provides content). However, if we want the advantages of this, we need to share information – with those organisations that we trust and where we know what will happen with that info.

Mark’s final thoughts

The annotated world is exciting and has stacks of potential if we can overcome one critical stumbing point that Ed highliughted (and I tweeted):

[blackbirdpie url=”https://twitter.com/markwilsonit/status/228571728740245504″]

Unfortunately, there are many who will not trust Google – and I find it interesting that Google is an advocate of consuming open data to add value to its products but I see very little being put back in terms of data sets for others to use. Google’s argument is that it spent a lot of money gathering and processing that data; however it could also be argued that Google gets a lot for free and maybe there is a greater benefit to society in freely sharing that information in a non-proprietary format (rather than relying on the use of Google tools). There are also ethical concerns with Google’s gathering of Wi-Fi data, scraping website content and other such issues but I expect to see a “happy medium” found, somewhere between “Don’t Be Evil” and “But we are a business after all”…

Thanks as always to everyone involved in arranging and hosting tonight’s event – and to Ed Parsons for an enlightening talk!
Friday 27 July 2012
Big data according to the Oracle
After many years of working mostly with Microsoft infrastructure products, the time came for me to increase my breadth of knowledge and, with that, comes the opportunity to take a look at what some of the other big players in our industry are up to. Last year, I was invited to attend the Oracle UK User Group Conference where I had my first experience of the world of Oracle applications; and last week I was at the Oracle Big Data and Extreme Analytics Summit in Manchester, where Fujitsu was one of the sponsors (and an extract from one of my white papers was in the conference programme).

It was a full day of presentations and I’m not sure that reproducing all of the content here makes a lot of sense, so here’s an attempt to summarise it… although even a summary could be a long post…

Big data trends, techniques and opportunities

Tim Jennings (@tjennings) from Ovum set the scene and explained some of the ways in which big data has the potential to change the way in which we work as businesses, citizens and consumers (across a variety of sectors).

Summing up his excellent overview of big data trends, techniques and opportunities, Tim’s key messages were that:
1. Big data is characterised by volume, variety and velocity [I’d add value to that list].
2. Big data represents a change in the mentality of analytics, away from precise analysis of well-bound sources to rough-cut exploratory analysis of all the data that’s practical to aggregate.
3. Enterprise should identify business cases for big data and the techniques and processes required to exploit them.
4. Enterprises should review existing business intelligence architectures and methods and plan the evolution towards a broader platform capable of handling the big data lifecycle.
And he closed by saying that “If you don’t think that big data is relevant to your organisation, then you are almost certainly missing an opportunity that others will take.”

Some other points I picked up from Tim’s presentation:
- Big data is not so much unstructured as variably-structured.
- The mean size of an analytical data set is 3TB (growing but not that huge) – don’t think you need petabytes of data for big data tools and techniques to be relevant.
- Social network analytics is probably the world’s largest (free) marketing focus group!
Big Data – Are You Ready?

Following the analyst introduction, the event moved on to the vendor pitch. This was structured around a set of videos which I’ve seen previously, in which a fictitious American organisation grapples with a big data challenge, using an over-sized actor (and an under-sized one) to prove their point. I found these videos a little tedious the first time I saw them, and this was the second viewing for me. For those who haven’t had the privilege, the videos are on YouTube and I’ve embedded the first one below (you can find the links on an Oracle’s Data Warehouse Insider blog post).

The key points I picked up from this session were:
- Oracle see big data as a process towards making better decisions based on four stages: decide, acquire, organise and analyse.
- Oracle considers that there are three core technologies for big data: Oracle NoSQL, Hadoop, and R; brought together by Oracle Engineered Systems (AKA the “buy our stuff” pitch).
Cloudera

Had I been at the London event I would have been extremely privileged to see Doug Cutting, Hadoop creator and now Chief Architect at Cloudera speak about his work in this field. Doug wasn’t available to speak at the Manchester event so Oracle showed us a pre-recorded interview.

For those who aren’t familiar with Cloudera (I wasn’t), it’s effectively a packaged open source big data solution (based on Hadoop and related technologies) providing an enterprise big data solution, with support.

The analogy given was that of a “big data operating system” with Cloudera doing for Hadoop what Red Hat does for Linux.

Perhaps most pertenent of Doug Cutting’s commenst was that we are at the beginning of a revolution in data processing where people can afford to save data and use it to learn, to get a “higher resolution picture of what’s going on and use it to make more informed decisions”.

Capturing the asset – acquire and organise

After a short pitch from Infosys (who have a packaged data platform, although personally, I’d be looking to the cloud…) and an especially cringeworthy spoof Lady Gaga video (JavaZone’s Lady Java) we moved on to enterprise NoSQL. In effect, Oracle has created a NoSQL database using the Berkeley key value database and a Java driver (containing much of the logic to avoid single points of failure) that they claim offers a simple data model, scalability, high availability, transparent load balancing and simple administration.

Above all, Oracle’s view is that, because it’s provided and maintained by Oracle, there is a “single throat to choke”. In effect, in the same way that we used to say no-one got fired for buying IBM, they are suggesting no-one gets fired for buying Oracle.

That may be true, but it’s my understanding that big data is fuelled by low-cost commodity hardware (infrastructure as a service) and open source software – and whilst Oracle may have a claim on the open source front, the low-cost commodity hardware angle is not one that sits well in the Oracle stable…

Through partnership with Cloudera (which leaves some wondering if that will last any longer than the Red Hat partnership did?), Oracle is positioning a Hadoop solution for their customer base:

[blackbirdpie url=”http://twitter.com/#!/debralilley/status/197285366091362304″]

Despite (or maybe in spite of) the overview of HDFS and MapReduce, I’m still not sure how Cloudera sits alongside Oracle NoSQL but their “big data appliance” includes both options. Now, when I used to install servers, appliances were typically 1U “pizza box” servers. Then they got virtualised – but now it seems they have grown to become whole racks (Oracle) or even whole containers (Microsoft).

Oracle’s view on big data is that we can:
1. Acquire data with their Big Data Appliance.
2. Organise/Analyse aggregated results with Exadata.
3. Decide at “the speed of thought” with Exalytics.
That’s a lot of Oracle hardware and software…

In an attempt not to position Oracle’s more traditional products as old hat, the next presenter suggested that big data is complementary and not really about old and new but about familiar and unfamiliar. Actually, I think he has a point: at some point “big” data just becomes “data” (and gets boring again?) but this session gave an overview of an information architecture challenge as new classes of data (videos and images, documents, social data, machine-generated data, etc.) create a divide between transactional data and big data, which is not really unstructured but better described as semi-structured and which uses sandboxes to analyse and discover new meaning from data.

Oracle has big data connectors to integrate with other (Oracle) solutions including: a HiveQL-based data integrator; a loader to move Hadoop data into Oracle 11G; a SQL-HDFS connector; and an R connector to run scripts with API access to both Hadoop and more traditional Oracle databases. There are also Oracle products such as GoldenGate to replicate data in heterogeneous data environments

[My view, for what it’s worth, is that we shouldn’t be moving big data around, duplicating (or triplicating) data – we should be linking and indexing it to bridge the divide between the various silos of “big” data and “traditional” data.]

Finding the value – analyse and decide

Speaking of a race to gain insight analytics becoming the CIO’s top priority for 2013 and business intelligence usage doubling by 2014, the next session looked at some business analytics techniques and characteristics, which can be summarised as:
- I suspect something – a data scientist or analyst needs to find proof and turn into a predictive model to deploy into business process (classification).
- I want to know if that matters – “I wish I knew” (visual exploration and discovery).
- I want to make the best decision now – decisions at the speed of thought in the context of a business process.
This led on to a presentation about the rise of the data scientist and making maths cool (except it didn’t, especially with a demo of some not-very-attractive visualisations run on an outdated Windows XP platform) and introduction of the R language for statistical analysis and visualisation.

Following this was a presentation about Oracle’s recently-acquired Endeca technology which actually sounds pretty interesting as it digests a variety of data sources and creates a data model with an information-discovery front-end that promises “the simplicity of search plus the power of BI”.

The last presentation of this segment looked at Oracle’s Exalytics in-memory database servers (a competitor to SAP Hana) bundling bsuiness intelligence software, adaptive in-memory caching (and columnar compression) with information discovery tools.

Wrap-up

I learned a lot about Oracle’s view of big data but that’s exactly what it was – one vendor’s view on this massively hyped and expanding market segment. For me, the most useful session of the day was from Ovum’s Tim Jennings and if that was all I took away, it would have been worthwhile.

In fairness, it was good to learn some more about the Oracle solutions too but I do wish vendors (including my own employer) would sometimes drop the blatant product marketing and consider the value of some vendor agnostic thought leadership. I truly believe that, by showing customers a genuine understanding of their business, the issues that they face and the directions that business and technology and heading in, the solutions will sell themselves if they truly provide value. On the other hand, by telling me that Oracle has a complete, open and integrated solution for everything and what I really need is to buy more technology from the Oracle stack and… well, I’d better have a good story to convince the CFO that it’s worthwhile…

Slidedecks and other materials from the Oracle Big Data and Extreme Analytics Summit are available on the Oracle website.
Wednesday 9 May 2012
Short takes: Flexible working and data protection for mobile devices
It’s been another busy week and I’m still struggling to get a meaningful volume of blog posts online so here are the highlights from a couple of online events I attended recently…

Work smarter, not harder… the art of flexible working

Citrix Online has been running a series of webcasts to promote its Go To Meeting platform and I’ve attended a few of them recently. The others have been oriented towards presenting but, this week, Lynne Copp from the Work Life Company (@worklifecompany) was talking about embracing flexible working. As someone who has worked primarily from home for a number of years now, it would have been great for me to get a bit more advice on how to achieve a better work/life balance (it was touched upon, but most of the session seemed to be targeted how organisations need to change to embrace flexible working practices) but some interesting resources have been made available including:
- Flexible working “paper coach”.
- The Smart Working Handbook.
- Workshifting blog (@Workshifting).
Extending enterprise data protection to mobile devices

Yesterday, I joined an IDC/Autonomy event looking at the impact of mobile devices on enterprise data protection.

IDC’s Carla Arend (@carla_arend) spoke about how IDC sees four forces of IT industry transformation in cloud, mobility, big data/analytics and social business. I was going to say “they forgot consumerisation” but then it was mentioned as an overarching topic. I was certainly surprised that the term used to describe the ease of use that many consumer services provide was that we have been “spoiled” but the principle that enterprise IT often lags behind is certainly valid!

Critically the “four forces of IT industry transformation” are being driven by business initiatives – and IT departments need to support those requirements. The view put forward was that IT organisations that embrace these initiatives will be able to get funding; whilst those who still take a technology-centric view will be forced to continue down the line of doing more with less (which seems increasingly unsustainable to me…).

This shift has implications for data management and protection – managing data on premise and in the cloud, archiving data generated outside the organisation (e.g. in social media, or other external forums), managing data on mobile devices, and deciding what to do with big data (store it all, or just some of the results?)

Looking at BYOD (which is inevitable for most organisations, with or without the CIO’s blessing!) there are concerns about: who manages the device; who protects it (IDC spoke about backup/archive but I would add encryption too); what happens to data when a device is lost/stolen, or when the device is otherwise replaced; and how can organisations ensure compliance on unmanaged devices?

Meanwhile, organisational application usage is moving outside traditional office applications too, with office apps, enterprise apps, and web apps running on increasing numbers of devices and new machine (sensor) and social media data sets being added to the mix (often originating outside the organisation). Data volumes create challenges too, as well as the variety of locations from which that data originates or resides. This leads to a requirement to carefully consider which data needs to be retained and which may be deleted.

Cloud services can provide some answers and many organisations expect to increasingly adopt cloud services for storage – whether that is to support increasing volumes of application data, or for PC backups. IDC is predicting that the next cloud wave will be around the protection of smart mobile devices.

There’s more detail in IDC’s survey results (European Software Survey 2012, European Storage Survey 2011) but I’ve certainly given the tl;dr view here…

Unfortunately I didn’t stick around for the Autonomy section… it may have been good but the first few minutes were feeling too much like a product pitch to me (and to my colleague who was also online)… sometimes I want the views, opinions and strategic view – thought leadership, rather than sales – and I did say it’s been a busy week!
Friday 27 April 2012
Linked data: connecting and exploiting big data

Earlier this year, I gave a lightning talk on Structuring Big Data at the CloudCamp London Big Data Special – the idea being that, if we’re not careful, big data will provide yet another silo of information to manage and that linked data could be useful to connect the various data sources (transactional databases, data warehouses, and now big data too).

Structuring Big Data

View more presentations from Fujitsu UK

At the time I mentioned that this was part of a white paper that I was writing with my manager, Ian Mitchell (@IanMitchell2) and our paper on using linked data to connect and exploit big data has now been published on the Fujitsu website.

This week Oracle kicks off its Big Data and Extreme Analytics Summit and Fujitsu are one of the sponsors. An except from the paper is included in the conference brochure and I’ll be at the Manchester event next Tuesday – do come along and say hello if you’re at the event and, even if you’re not, please do check out the paper – I’d love to hear your feedback.

Tuesday 24 April 2012
More on NoSQL, Hadoop and Microsoft’s entry to the world of big data

Yesterday, my article on Microsoft’s forays into the world of big data went up on Cloud Pro. It’s been fun learning a bit about the subject (far more than is in that article – because big data is a big theme in my work at the moment) and I wanted to share some more info that didn’t fit into my allotted 1000 words.

Microsoft Fellow Dr David DeWitt gave an excellent keynote on Day 3 of the SQL PASS 2011 summit last month and it’s a great overview of how Hadoop works. Of course, he has a bias towards use of RDBMS systems but the video is well worth watching for it’s introduction to NoSQL, the differences between key value stores and Hadoop-type systems, and the description of the Hadoop components and how they fit together (skip the first 18 minutes and, if the stream doesn’t work, try the download – the deck is available too). Grant Fritchey and Jen McCown have written some great notes to go with Dr DeWitt’s keynote too. For more about when you might use Hadoop, Jeremiah Peschka has a good post.

Microsoft’s SQOOP implementation is not the first – Cloudera have been integrating SQL and Hadoop for a couple of years now. Meanwhile, Buck Woody has a great overview of Microsoft’s efforts in the big data space.

I also mentioned Microsoft StreamInsight (formerly code-named “Austin”) in the post (the Complex Event Processing capability inside SQL Server 2008 R2) and Microsoft’s StreamInsight Team has posted what they call “the basics” of event processing. It seems to require coding, but is probably useful to anyone who is getting started with this stuff. For those of us who are a little less code-oriented, Andrew Fryer’s overview of StreamInsight (together with a more general post on CEP) is worth a read, together with Simon Munro’s post on where StreamInsight fits in.

Shortly after I sent my article to Cloud Pro’s Editor, I saw Mike Walsh’s “Microsoft Loves Your Big Data” post. I like this because it cuts through the press announcements and talks about what is really going on: interoperability; and becoming a player themselves. Critically:

“They aren’t copying, or borrowing or trying to redo… they are embracing”

And that is what I really think makes a refreshing change.

Tuesday 8 November 2011
SQL Server and Hadoop – unlikely bedfellows but a powerful combination
Big Data is hard to avoid – what does Microsoft’s embrace of Hadoop mean for IT Managers?

There are two words that seem particularly difficult to avoid at the moment: big data. Infrastructure guys instinctivly shy away from data but such is its prevalence that big data is much more than just the latest IT buzzword and is becoming a major theme in our industry right now

But what does “big data” actually mean? It’s one of those phrases that, like cloud computing earlier, it is being “adopted” by vendors to mean whatever they want it to.

The McKinsey Global Institute describes big data as “the next frontier for innovation, competition and productivity” but, put simply, it’s about analysing masses of unstructured (or semi-structured) data which, until recently, was considered too expensive to do anything with.

That data comes from a variety of sources including sensors, social networks and digital media and it includes text, audio, video, click-streams, log files and more. Cynics who scoff at the description of “big” data (what’s next, “huge” data?) miss the point that it’s not just about the volume of the data (typically many petabytes) but also the variety and frequency of that data. Some even refer to it as “nano data” because what we’re actually looking at is massive sets of very small data.

Processing big data typically involves distributed computer systems and one project that has come to the fore is Apache Hadoop – a framework for development of open-source software for reliable, scalable distributed computing.

Over the last few weeks though, there have been some significant announcements from established IT players, not all of whom are known for embracing open source technology. This indicates a growing acceptance for big data solutions in general and specifically for solutions that include both open- and closed- source elements.

When Microsoft released a SQL Server-Hadoop (SQOOP) Connector,there were questions about what this would mean for CIOs and IT Managers who may previously have viewed technologies like Hadoop as a little esoteric.

The key to understanding what this would mean would be understanding the two main types of data: structured and unstructured. Structured data tends to be stored in a relational database management system (RDBMS), for example Microsoft SQL Server, IBM DB2, Oracle 11G or MySQL.

By structuring the data with a schema, tables, keys and all manner of relationships it’s possible to run queries (with a language like SQL) to analyse the data and techniques have developed over the years to optimise those queries. By contrast, unstructured data has no schema (at least not a formal one) and may be as simple as a set of files. Structured data offers maturity, stability and efficiency but unstructured data offers flexibility.

Secondly, there needs to be an understanding of the term “NoSQL”. Commonly misinterpreted as an instruction (no to SQL), it really means not only SQL – i.e. there are some types of data that are not worth storing in an RDBMS. Rather than following the database model of extract, transform and load (ETL), with a NoSQL system the data arrives and the application knows how to interpret the data, providing a faster time to insight from data acquisition.

Just as there are two main types of data, there are two main types of NoSQL system: key/value stores (like MongoDB or Windows Azure Table Storage) can be thought of as NoSQL OLTP; Hadoop is more like NoSQL data warehousing and is particularly suited to storing and analysing massive data sets.

One of the key elements towards understanding Hadoop is understanding how the various Hadoop components work together. There’s a degree of complexity so perhaps it’s best to summarise by saying that the Hadoop stack consists of a highly distributed, fault tolerant, file system (HDFS) and the MapReduce framework for writing and executing distributed, fault tolerant, algorithms. Built on top of that are query languages (live Hive and Pig) and then we have the layer where Microsoft’s SQOOP connector sits, connecting the two worlds of structured and unstructured data.

The trouble is that SQOOP is just a bridge – and not a particularly efficient one either – working on SQL data in the unstructured world involves subdivision of the SQL database so that MapReduce can work correctly.

Because most enterprises have both the structured and unstructured data, we really need tools that allow us to analyse and manage data in multiple environments – ideally without having to go back and forth. That’s why there are so many vendors jumping on the big data bandwagon but it seems that a SQOOP connector is not the only work Microsoft is doing in the big data space:
- SQL Server 2008 R2 includes a complex event processing (CEP) capability called StreamInsight. The principle is that streams of data can be monitored, managed and mined for particular events (instead of running queries across data, run the data through a set of queries looking for matches) and this can help organisations to respond quickly to new opportunities – maybe even adopting a predictive business model.
- The next version of SQL Server will include a new data analysis tool called Power View which will even be supported on competitive mobile operating systems (including iOS and Android).
- Windows Azure includes table storage – a key/value pair storage solution with partitioning.
- Also on Azure, Microsoft is creating a new Data Explorer tool to create rich data sets that can be published as a service and an iterative MapReduce runtime codenamed “Daytona” for scaling data analytics across hundreds of processing cores.
- Microsoft is also creating new implementations of the Hadoop stack for Windows Azure and Windows Server (including a Hive ODBC driver and a Hive Add-in for Excel) but it also has a competing technology called LINQ to HPC (formerly codenamed Dryad) that allows a Windows High Performance Compute (HPC) cluster to not only perform parallel computing but also to integrate with Azure (the theory behind this is that big data jobs are typically I/O-bound, rather than compute-bound).
In our increasingly cloudy world, infrastructure and platforms are rapidly becoming commoditised. We need to focus on software that allows us to derive value from data to gain some business value. Consider that Microsoft is only one vendor, then think about what Oracle, IBM, Fujitsu and others are doing. If you weren’t convinced before, maybe HP’s Autonomy purchase is starting to make sense now?

Looking specifically at Microsoft’s developments in the big data world, it therefore makes sense to see the company get closer to Hadoop. The world has spoken and the de facto solution for analysing large data sets seems to be HDFS/MapReduce/Hive (or similar).

Maybe Hadoop’s success comes down to HDFS and MapReduce being based on work from Google whilst Hive and Pig are supported by Facebook and Yahoo respectively (i.e. they are all from established Internet businesses). But, by embracing Hadoop (together with porting its tools to competitive platforms), Microsoft is better placed to support the entire enterprise with both their structured and unstructured needs.

[This post was originally written as an article for Cloud Pro.]
Monday 7 November 2011