GCN : June 2014
IT'S A FUNNY WORD. You have only a vague notion of what it is. You've heard that it takes a lot of work but is potentially beneficial. Maybe if you learned more about it, you too could enjoy its benefits. But even if you wanted to try it, you wouldn't know where to start.

If it isn't obvious, I'm talking about Hadoop. Not Zumba.

Google first described the ideas behind Hadoop 10 years ago, an eternity in technology, but it is only recently that the rest of us have begun to explore it more fully. To assess its potential, IT managers must first understand what Hadoop is.

WHY HADOOP?

Hadoop is not just one thing. It is a combination of components that work together to facilitate cloud-scale analytics. Hadoop provides an abstraction for running analytics on a cluster of commodity hardware when there is too much data for a single machine. The analytics program need not know about the cluster, how work is divided across it, or the vagaries of cluster management. If a machine fails, Hadoop handles that.

HDFS stands for the Hadoop Distributed File System. It's optimized for storing lots of data across a computing cluster. Users simply load files into HDFS, and it figures out how to distribute the data.

MapReduce is often mistaken for Hadoop itself, but in fact it is Hadoop's programming model (commonly in Java) for analytics on data in HDFS. To understand the conceptual foundations of MapReduce, imagine two relational database tables: one for bank accounts and the second for account transactions. To find the average transaction amount for each account, a user would "map" (or transform) the two original tables to a single dataset via a join. Then all the individual transaction amounts with the same account number would be "reduced" (or aggregated) to a single amount via a "GROUP BY" clause. MapReduce allows users to apply precisely these same concepts to a large data set distributed across a cluster, but the operations can be quite slow.
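The bank-account example can be sketched in a few lines of plain Python. This is only an illustration of the map, group and reduce steps on a single machine, not Hadoop's actual API; the sample records and the function names (map_phase, shuffle, reduce_phase, average_by_account) are assumptions made up for the sketch.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the two relational tables.
accounts = [
    {"account_id": "A1", "owner": "Ana"},
    {"account_id": "B2", "owner": "Ben"},
]
transactions = [
    {"account_id": "A1", "amount": 100.0},
    {"account_id": "A1", "amount": 50.0},
    {"account_id": "B2", "amount": 30.0},
    {"account_id": "Z9", "amount": 999.0},  # no matching account; the join drops it
]


def map_phase(accounts, transactions):
    """Join the two tables and emit one (account_id, amount) pair per transaction."""
    known = {a["account_id"] for a in accounts}
    for t in transactions:
        if t["account_id"] in known:
            yield t["account_id"], t["amount"]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each account's amounts to a single average (the GROUP BY step)."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


def average_by_account(accounts, transactions):
    return reduce_phase(shuffle(map_phase(accounts, transactions)))


print(average_by_account(accounts, transactions))
# prints {'A1': 75.0, 'B2': 30.0}
```

On a real cluster the same three steps would run in parallel over pieces of the data held on many machines, with the grouping step moving data between them, which is part of why the operations can be slow.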
Hive allows users to project a tabular structure on the data so they can eschew the MapReduce API in favor of a SQL-like abstraction called HiveQL. Anyone used to SQL staples like "CREATE TABLE," "SELECT," and "GROUP BY" will find HiveQL eases the transition to Hadoop.

BEYOND HADOOP

As powerful as it is, many aspects of Hadoop remain too low-level, error-prone and slow for developers who need higher levels of abstraction. Cascading enables simpler and more testable workflows for multiple MapReduce jobs. Apache Spark lets developers treat data sets like simple lists and uses cluster memory to make jobs run faster.

Apache Accumulo was originally built by the National Security Agency as a store similar to HBase but with cell-level security. Numerous projects, including the analytics and visualization tool Lumify, are built on Accumulo.

In 2010, Google wrote a paper on Dremel, which facilitates fast queries on cloud-scale data. Dremel underpins Google's BigQuery product and has never been released publicly, but Cloudera's Impala and MapR's Apache Drill are open-source implementations of its ideas.

With network data, MapReduce can be especially slow. A popular alternative programming model is Bulk Synchronous Parallel (BSP), which abandons disk I/O for messages sent along the network. Apache Hama and Apache Giraph both use BSP to support graph analytics.

While these tools excel with batch analytics, what about analytics on data streaming from a feed like a message queue or Twitter? Apache Storm and Spark Streaming can help there.

All these tools leverage existing Hadoop artifacts like HDFS files and Hive queries. For example, I have written analytics with Spark against an existing HBase table.

Hopefully you can now assess Hadoop to see if it has a place in your enterprise. As for me, I am still trying to wrap my head around Zumba.

Neil A. Chaudhuri is founder and president of Vidya and has over a decade of experience building complex software projects for commercial and government clients.

INDUSTRY INSIGHT BY NEIL CHAUDHURI
Do you speak Hadoop? What you need to know.

NOTABLE COMPONENTS IN HADOOP

HBase: The Hadoop database and an example of the so-called NoSQL databases.
Zookeeper: A centralized service for coordinating activities among the machines in the cluster.
Hadoop Streaming: A MapReduce API that lets developers use popular scripting languages (e.g. Ruby or Python).
Pig: An analytic abstraction similar to Hive but with a query syntax called Pig Latin (yes, seriously), which prefers a scripting pipeline approach to the SQL-like HiveQL.
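
To give a feel for the Hadoop Streaming entry above, here is a minimal word-count sketch in Python. It assumes the usual streaming convention of tab-separated "key<TAB>value" text lines, with the reducer receiving its input sorted by key. The names mapper, reducer and run_locally, along with the sample sentences, are hypothetical; run_locally only imitates on one machine what the cluster would do between the two steps.

```python
import itertools


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; relies on the input arriving sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Stand-in for the cluster: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["Hadoop is not Zumba", "Zumba is not Hadoop"]
    for output_line in run_locally(sample):
        print(output_line)
```

In an actual streaming job the mapper and reducer would typically be two separate scripts that read standard input and write standard output, with Hadoop handling the sort in between.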