Hadoop Archives - markwilson.it

Big data according to the Oracle

Posted on Wednesday 9 May 2012 / Tuesday 18 September 2012 By Mark Wilson

After many years of working mostly with Microsoft infrastructure products, the time came for me to increase my breadth of knowledge and, with that, came the opportunity to take a look at what some of the other big players in our industry are up to. Last year, I was invited to attend the Oracle UK User Group Conference where I had my first experience of the world of Oracle applications; and last week I was at the Oracle Big Data and Extreme Analytics Summit in Manchester, where Fujitsu was one of the sponsors (and an extract from one of my white papers was in the conference programme).

It was a full day of presentations and I’m not sure that reproducing all of the content here makes a lot of sense, so here’s an attempt to summarise it… although even a summary could be a long post…

Big data trends, techniques and opportunities

Tim Jennings (@tjennings) from Ovum set the scene and explained some of the ways in which big data has the potential to change the way in which we work as businesses, citizens and consumers (across a variety of sectors).

Summing up his excellent overview of big data trends, techniques and opportunities, Tim’s key messages were that:

Big data is characterised by volume, variety and velocity [I’d add value to that list].
Big data represents a change in the mentality of analytics, away from precise analysis of well-bound sources to rough-cut exploratory analysis of all the data that’s practical to aggregate.
Enterprises should identify business cases for big data and the techniques and processes required to exploit them.
Enterprises should review existing business intelligence architectures and methods and plan the evolution towards a broader platform capable of handling the big data lifecycle.

And he closed by saying that “If you don’t think that big data is relevant to your organisation, then you are almost certainly missing an opportunity that others will take.”

Some other points I picked up from Tim’s presentation:

Big data is not so much unstructured as variably-structured.
The mean size of an analytical data set is 3TB (growing but not that huge) – don’t think you need petabytes of data for big data tools and techniques to be relevant.
Social network analytics is probably the world’s largest (free) marketing focus group!

Big Data – Are You Ready?

Following the analyst introduction, the event moved on to the vendor pitch. This was structured around a set of videos which I’ve seen previously, in which a fictitious American organisation grapples with a big data challenge, using an over-sized actor (and an under-sized one) to prove their point. I found these videos a little tedious the first time I saw them, and this was the second viewing for me. For those who haven’t had the privilege, the videos are on YouTube and I’ve embedded the first one below (you can find the links in a post on Oracle’s Data Warehouse Insider blog).

The key points I picked up from this session were:

Oracle sees big data as a process towards making better decisions, based on four stages: decide, acquire, organise and analyse.
Oracle considers that there are three core technologies for big data: Oracle NoSQL, Hadoop, and R; brought together by Oracle Engineered Systems (AKA the “buy our stuff” pitch).

Cloudera

Had I been at the London event I would have been extremely privileged to see Doug Cutting, Hadoop creator and now Chief Architect at Cloudera, speak about his work in this field. Doug wasn’t available to speak at the Manchester event so Oracle showed us a pre-recorded interview.

For those who aren’t familiar with Cloudera (I wasn’t), it effectively packages Hadoop and related open source technologies as an enterprise big data solution, with support.

The analogy given was that of a “big data operating system” with Cloudera doing for Hadoop what Red Hat does for Linux.

Perhaps the most pertinent of Doug Cutting’s comments was that we are at the beginning of a revolution in data processing where people can afford to save data and use it to learn, to get a “higher resolution picture of what’s going on and use it to make more informed decisions”.

Capturing the asset – acquire and organise

After a short pitch from Infosys (who have a packaged data platform, although personally, I’d be looking to the cloud…) and an especially cringeworthy spoof Lady Gaga video (JavaZone’s Lady Java) we moved on to enterprise NoSQL. In effect, Oracle has created a NoSQL database using the Berkeley key value database and a Java driver (containing much of the logic to avoid single points of failure) that they claim offers a simple data model, scalability, high availability, transparent load balancing and simple administration.

Above all, Oracle’s view is that, because it’s provided and maintained by Oracle, there is a “single throat to choke”. In effect, in the same way that we used to say no-one got fired for buying IBM, they are suggesting no-one gets fired for buying Oracle.

That may be true, but it’s my understanding that big data is fuelled by low-cost commodity hardware (infrastructure as a service) and open source software – and whilst Oracle may have a claim on the open source front, the low-cost commodity hardware angle is not one that sits well in the Oracle stable…

Through partnership with Cloudera (which leaves some wondering if that will last any longer than the Red Hat partnership did?), Oracle is positioning a Hadoop solution for their customer base:

(Tweet from @debralilley: http://twitter.com/#!/debralilley/status/197285366091362304)

Despite (or maybe because of) the overview of HDFS and MapReduce, I’m still not sure how Cloudera sits alongside Oracle NoSQL but their “big data appliance” includes both options. Now, when I used to install servers, appliances were typically 1U “pizza box” servers. Then they got virtualised – but now it seems they have grown to become whole racks (Oracle) or even whole containers (Microsoft).

Oracle’s view on big data is that we can:

Acquire data with their Big Data Appliance.
Organise/Analyse aggregated results with Exadata.
Decide at “the speed of thought” with Exalytics.

That’s a lot of Oracle hardware and software…

In an attempt not to position Oracle’s more traditional products as old hat, the next presenter suggested that big data is complementary and not really about old and new but about familiar and unfamiliar. Actually, I think he has a point: at some point “big” data just becomes “data” (and gets boring again?) but this session gave an overview of an information architecture challenge as new classes of data (videos and images, documents, social data, machine-generated data, etc.) create a divide between transactional data and big data, which is not really unstructured but better described as semi-structured and which uses sandboxes to analyse and discover new meaning from data.

Oracle has big data connectors to integrate with other (Oracle) solutions including: a HiveQL-based data integrator; a loader to move Hadoop data into Oracle 11g; a SQL-HDFS connector; and an R connector to run scripts with API access to both Hadoop and more traditional Oracle databases. There are also Oracle products such as GoldenGate to replicate data in heterogeneous data environments.

[My view, for what it’s worth, is that we shouldn’t be moving big data around, duplicating (or triplicating) data – we should be linking and indexing it to bridge the divide between the various silos of “big” data and “traditional” data.]

Finding the value – analyse and decide

Speaking of a race to gain insight, with analytics becoming the CIO’s top priority for 2013 and business intelligence usage doubling by 2014, the next session looked at some business analytics techniques and characteristics, which can be summarised as:

I suspect something – a data scientist or analyst needs to find proof and turn it into a predictive model to deploy into a business process (classification).
I want to know if that matters – “I wish I knew” (visual exploration and discovery).
I want to make the best decision now – decisions at the speed of thought in the context of a business process.

This led on to a presentation about the rise of the data scientist and making maths cool (except it didn’t, especially with a demo of some not-very-attractive visualisations run on an outdated Windows XP platform) and introduction of the R language for statistical analysis and visualisation.

Following this was a presentation about Oracle’s recently-acquired Endeca technology which actually sounds pretty interesting as it digests a variety of data sources and creates a data model with an information-discovery front-end that promises “the simplicity of search plus the power of BI”.

The last presentation of this segment looked at Oracle’s Exalytics in-memory database servers (a competitor to SAP Hana), which bundle business intelligence software and adaptive in-memory caching (and columnar compression) with information discovery tools.

Wrap-up

I learned a lot about Oracle’s view of big data but that’s exactly what it was – one vendor’s view on this massively hyped and expanding market segment. For me, the most useful session of the day was from Ovum’s Tim Jennings and if that was all I took away, it would have been worthwhile.

In fairness, it was good to learn some more about the Oracle solutions too but I do wish vendors (including my own employer) would sometimes drop the blatant product marketing and consider the value of some vendor-agnostic thought leadership. I truly believe that, by showing customers a genuine understanding of their business, the issues that they face and the directions that business and technology are heading in, the solutions will sell themselves if they truly provide value. On the other hand, by telling me that Oracle has a complete, open and integrated solution for everything and what I really need is to buy more technology from the Oracle stack and… well, I’d better have a good story to convince the CFO that it’s worthwhile…

Slidedecks and other materials from the Oracle Big Data and Extreme Analytics Summit are available on the Oracle website.

More on NoSQL, Hadoop and Microsoft’s entry to the world of big data

Posted on Tuesday 8 November 2011 / Monday 7 November 2011 By Mark Wilson

Yesterday, my article on Microsoft’s forays into the world of big data went up on Cloud Pro. It’s been fun learning a bit about the subject (far more than is in that article – because big data is a big theme in my work at the moment) and I wanted to share some more info that didn’t fit into my allotted 1000 words.

Microsoft Fellow Dr David DeWitt gave an excellent keynote on Day 3 of the SQL PASS 2011 summit last month and it’s a great overview of how Hadoop works. Of course, he has a bias towards use of RDBMS systems but the video is well worth watching for its introduction to NoSQL, the differences between key value stores and Hadoop-type systems, and the description of the Hadoop components and how they fit together (skip the first 18 minutes and, if the stream doesn’t work, try the download – the deck is available too). Grant Fritchey and Jen McCown have written some great notes to go with Dr DeWitt’s keynote too. For more about when you might use Hadoop, Jeremiah Peschka has a good post.

Microsoft’s SQOOP implementation is not the first – Cloudera have been integrating SQL and Hadoop for a couple of years now. Meanwhile, Buck Woody has a great overview of Microsoft’s efforts in the big data space.

In the post, I also mentioned Microsoft StreamInsight (formerly code-named “Austin”), the Complex Event Processing capability inside SQL Server 2008 R2, and Microsoft’s StreamInsight Team has posted what they call “the basics” of event processing. It seems to require coding, but is probably useful to anyone who is getting started with this stuff. For those of us who are a little less code-oriented, Andrew Fryer’s overview of StreamInsight (together with a more general post on CEP) is worth a read, as is Simon Munro’s post on where StreamInsight fits in.

Shortly after I sent my article to Cloud Pro’s Editor, I saw Mike Walsh’s “Microsoft Loves Your Big Data” post. I like this because it cuts through the press announcements and talks about what is really going on: interoperability; and becoming a player themselves. Critically:

“They aren’t copying, or borrowing or trying to redo… they are embracing”

And that is what I really think makes a refreshing change.

SQL Server and Hadoop – unlikely bedfellows but a powerful combination

Posted on Monday 7 November 2011 / Saturday 14 January 2017 By Mark Wilson

Big Data is hard to avoid – what does Microsoft’s embrace of Hadoop mean for IT Managers?

There are two words that seem particularly difficult to avoid at the moment: big data. Infrastructure guys instinctively shy away from data but such is its prevalence that big data is much more than just the latest IT buzzword and is becoming a major theme in our industry right now.

But what does “big data” actually mean? It’s one of those phrases that, like cloud computing earlier, is being “adopted” by vendors to mean whatever they want it to.

The McKinsey Global Institute describes big data as “the next frontier for innovation, competition and productivity” but, put simply, it’s about analysing masses of unstructured (or semi-structured) data which, until recently, was considered too expensive to do anything with.

That data comes from a variety of sources including sensors, social networks and digital media and it includes text, audio, video, click-streams, log files and more. Cynics who scoff at the description of “big” data (what’s next, “huge” data?) miss the point that it’s not just about the volume of the data (typically many petabytes) but also the variety and frequency of that data. Some even refer to it as “nano data” because what we’re actually looking at is massive sets of very small data.

Processing big data typically involves distributed computer systems and one project that has come to the fore is Apache Hadoop – a framework for development of open-source software for reliable, scalable distributed computing.

Over the last few weeks though, there have been some significant announcements from established IT players, not all of whom are known for embracing open source technology. This indicates a growing acceptance of big data solutions in general and specifically of solutions that include both open- and closed-source elements.

When Microsoft released a SQL Server-Hadoop (SQOOP) Connector, there were questions about what this would mean for CIOs and IT Managers who may previously have viewed technologies like Hadoop as a little esoteric.

The key to understanding what this means is, firstly, understanding the two main types of data: structured and unstructured. Structured data tends to be stored in a relational database management system (RDBMS), for example Microsoft SQL Server, IBM DB2, Oracle 11g or MySQL.

By structuring the data with a schema, tables, keys and all manner of relationships it’s possible to run queries (with a language like SQL) to analyse the data and techniques have developed over the years to optimise those queries. By contrast, unstructured data has no schema (at least not a formal one) and may be as simple as a set of files. Structured data offers maturity, stability and efficiency but unstructured data offers flexibility.
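A minimal sketch of that contrast, in Python. Everything here is an assumption made for illustration: the orders table, its columns and the "<customer> bought <amount>" line format are invented, and the standard library's sqlite3 module simply stands in for a full RDBMS. The structured half declares a schema and lets SQL do the analysis; the unstructured half is handed free text and has to decide for itself what each line means.

```python
import sqlite3
from collections import Counter


def structured_totals(rows):
    """Load rows into a table with a schema and let SQL do the analysis."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()
    conn.close()
    return result


def unstructured_totals(lines):
    """No schema: each line is free text that the application must interpret."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        # Assumed convention: "<customer> bought <amount>"; skip anything else.
        if len(parts) == 3 and parts[1] == "bought":
            try:
                totals[parts[0]] += float(parts[2])
            except ValueError:
                continue
    return sorted(totals.items())


if __name__ == "__main__":
    print(structured_totals([("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]))
    print(unstructured_totals(
        ["alice bought 10", "garbage line", "bob bought 5", "alice bought 2.5"]
    ))
```

The database rejects rows that don't fit the schema up front, whereas the text version quietly skips lines it can't parse. That is the maturity-versus-flexibility trade-off in miniature.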

Secondly, there needs to be an understanding of the term “NoSQL”. Commonly misinterpreted as an instruction (no to SQL), it really means not only SQL – i.e. there are some types of data that are not worth storing in an RDBMS. Rather than following the database model of extract, transform and load (ETL), with a NoSQL system the data arrives and the application knows how to interpret the data, providing a faster time to insight from data acquisition.

Just as there are two main types of data, there are two main types of NoSQL system: key/value stores (like MongoDB or Windows Azure Table Storage) can be thought of as NoSQL OLTP; Hadoop is more like NoSQL data warehousing and is particularly suited to storing and analysing massive data sets.
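To give a feel for the "NoSQL OLTP" end of that spectrum, here is a toy key/value store in Python. It is only a sketch under stated assumptions: the class name, the four in-memory partitions and the hash-based placement are all invented for illustration, and it does not represent how MongoDB or Windows Azure Table Storage is actually implemented. The point is simply that reads and writes are addressed by key, and the key alone decides which partition does the work.

```python
import hashlib


class PartitionedKeyValueStore:
    """Toy key/value store: each key is hashed to one of N in-memory partitions."""

    def __init__(self, partitions=4):
        if partitions < 1:
            raise ValueError("partitions must be at least 1")
        self._partitions = [dict() for _ in range(partitions)]

    def _partition_for(self, key):
        # A stable hash (unlike the built-in hash() for strings, which varies
        # between runs) so the same key always lands on the same partition.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self._partitions)

    def put(self, key, value):
        self._partitions[self._partition_for(key)][key] = value

    def get(self, key, default=None):
        return self._partitions[self._partition_for(key)].get(key, default)

    def sizes(self):
        """Number of keys held by each partition."""
        return [len(p) for p in self._partitions]


if __name__ == "__main__":
    store = PartitionedKeyValueStore()
    store.put("user:1", {"name": "Alice"})
    store.put("user:2", {"name": "Bob"})
    print(store.get("user:1"), store.sizes())
```

There are no joins and no query language here, just get and put, which is why this style suits lots of small, fast lookups rather than warehouse-style analysis.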

One of the key elements towards understanding Hadoop is understanding how the various Hadoop components work together. There’s a degree of complexity so perhaps it’s best to summarise by saying that the Hadoop stack consists of a highly distributed, fault-tolerant file system (HDFS) and the MapReduce framework for writing and executing distributed, fault-tolerant algorithms. Built on top of that are query languages (like Hive and Pig) and then we have the layer where Microsoft’s SQOOP connector sits, connecting the two worlds of structured and unstructured data.
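The MapReduce idea itself is easier to see in code than in prose. What follows is a single-process Python sketch of the classic word count, not Hadoop's own API (Hadoop jobs are normally written against its Java interfaces, and the function names here are invented). The map step emits (word, 1) pairs, a shuffle step groups the pairs by key, and the reduce step adds up each group. In a real cluster the map and reduce calls would run on many machines against blocks of files stored in HDFS.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse one key's values into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big deal"]))
```

Because each map call looks at one document and each reduce call looks at one key, the work can be spread across as many nodes as there are documents and keys, which is what makes the approach scale.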

The trouble is that SQOOP is just a bridge – and not a particularly efficient one either – working on SQL data in the unstructured world involves subdivision of the SQL database so that MapReduce can work correctly.
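The subdivision can be pictured as carving a table's key range into chunks, one per map task. This Python sketch is an assumption-laden illustration rather than the connector's actual code: it presumes a dense numeric key, and the function names, the orders table and the id column are all hypothetical.

```python
def split_key_range(min_id, max_id, num_splits):
    """Divide the inclusive range [min_id, max_id] into contiguous chunks."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if max_id < min_id:
        return []
    total = max_id - min_id + 1
    num_splits = min(num_splits, total)  # never create empty chunks
    base, extra = divmod(total, num_splits)
    ranges = []
    start = min_id
    for i in range(num_splits):
        # The first `extra` chunks take one additional key each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def split_queries(table, key, min_id, max_id, num_splits):
    """Build one bounded SELECT per chunk, each suitable for a separate task."""
    return [
        f"SELECT * FROM {table} WHERE {key} BETWEEN {lo} AND {hi}"
        for lo, hi in split_key_range(min_id, max_id, num_splits)
    ]


if __name__ == "__main__":
    for query in split_queries("orders", "id", 1, 10, 3):
        print(query)
```

Even in this toy form the inefficiency is visible: every chunk is another query against the relational database, and a skewed or sparse key would leave some tasks with far more rows than others.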

Because most enterprises have both the structured and unstructured data, we really need tools that allow us to analyse and manage data in multiple environments – ideally without having to go back and forth. That’s why there are so many vendors jumping on the big data bandwagon but it seems that a SQOOP connector is not the only work Microsoft is doing in the big data space:

SQL Server 2008 R2 includes a complex event processing (CEP) capability called StreamInsight. The principle is that streams of data can be monitored, managed and mined for particular events (instead of running queries across data, run the data through a set of queries looking for matches) and this can help organisations to respond quickly to new opportunities – maybe even adopting a predictive business model.
The next version of SQL Server will include a new data analysis tool called Power View which will even be supported on competitive mobile operating systems (including iOS and Android).
Windows Azure includes table storage – a key/value pair storage solution with partitioning.
Also on Azure, Microsoft is creating a new Data Explorer tool to create rich data sets that can be published as a service and an iterative MapReduce runtime codenamed “Daytona” for scaling data analytics across hundreds of processing cores.
Microsoft is also creating new implementations of the Hadoop stack for Windows Azure and Windows Server (including a Hive ODBC driver and a Hive Add-in for Excel) but it also has a competing technology called LINQ to HPC (formerly codenamed Dryad) that allows a Windows High Performance Compute (HPC) cluster to not only perform parallel computing but also to integrate with Azure (the theory behind this is that big data jobs are typically I/O-bound, rather than compute-bound).
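The complex event processing idea in the first item above (run the data through a set of queries, rather than running queries across the data) can be sketched in a few lines of Python. StreamInsight itself is programmed from .NET and none of its API is used here; the StandingQuery class, the event shape and the failed-login rule are all invented for illustration.

```python
from collections import deque


class StandingQuery:
    """Fires when `threshold` matching events arrive within `window` seconds."""

    def __init__(self, name, predicate, threshold, window):
        self.name = name
        self.predicate = predicate
        self.threshold = threshold
        self.window = window
        self._times = deque()  # timestamps of recent matching events

    def feed(self, timestamp, event):
        """Offer one event to the query; return True if the query fires."""
        if not self.predicate(event):
            return False
        self._times.append(timestamp)
        # Forget matches that have fallen out of the time window.
        while self._times and timestamp - self._times[0] > self.window:
            self._times.popleft()
        return len(self._times) >= self.threshold


def run(events, queries):
    """Push each (timestamp, event) through every standing query in turn."""
    alerts = []
    for timestamp, event in events:
        for query in queries:
            if query.feed(timestamp, event):
                alerts.append((timestamp, query.name))
    return alerts


if __name__ == "__main__":
    failed_logins = StandingQuery(
        "failed_logins", lambda e: e["type"] == "login_failed",
        threshold=3, window=5,
    )
    stream = [
        (0, {"type": "login_failed"}),
        (1, {"type": "ok"}),
        (2, {"type": "login_failed"}),
        (3, {"type": "login_failed"}),
        (20, {"type": "login_failed"}),
    ]
    print(run(stream, [failed_logins]))
```

The queries are long-lived and the events are transient: nothing is stored beyond the handful of timestamps each query needs, which is what lets this style of processing respond as the data arrives.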

In our increasingly cloudy world, infrastructure and platforms are rapidly becoming commoditised. We need to focus on software that allows us to derive business value from data. Consider that Microsoft is only one vendor, then think about what Oracle, IBM, Fujitsu and others are doing. If you weren’t convinced before, maybe HP’s Autonomy purchase is starting to make sense now?

Looking specifically at Microsoft’s developments in the big data world, it therefore makes sense to see the company get closer to Hadoop. The world has spoken and the de facto solution for analysing large data sets seems to be HDFS/MapReduce/Hive (or similar).

Maybe Hadoop’s success comes down to HDFS and MapReduce being based on work from Google whilst Hive and Pig are supported by Facebook and Yahoo respectively (i.e. they are all from established Internet businesses). But, by embracing Hadoop (together with porting its tools to competitive platforms), Microsoft is better placed to support the entire enterprise with both their structured and unstructured needs.

[This post was originally written as an article for Cloud Pro.]