GCN : March 2015
With the “next big thing” in IT there inevitably comes a time when user experience falls short of the hype. So it is with big data and its promise of fast and precise analysis of huge volumes of distributed data.

In the current big data universe, Hadoop is the software used to store and distribute large amounts of data, and MapReduce is the engine used to process it. The combination has proven itself in non-time-critical batch processing of data. But what about analysis of near-real-time big data?

Apache Spark, the most advanced of these next-generation, open-source technologies, sets the stage for analysis of streaming data from video, sensors and transactions. Like MapReduce, it can be used for batch processing, but for algorithms that perform a number of iterations on a dataset, Spark can store the intermediate results of those actions in cache memory. MapReduce, in contrast, has to write the result of each action to disk before it can be brought back into the system for further processing.

That rapid in-memory processing of resilient distributed datasets (RDDs) is the “core capability” of Apache Spark. “Once operations are done (on the datasets) they can be streamed and connected to each other so that transformations can be made very quickly,” said Dave Vennergrund, director of predictive analytics for Salient Federal Solutions, which is developing analytics products for government organizations using Spark.

“Couple that with the ability to do this across many machines at the same time, and you have a recipe for a very strong response,” he added.

Proponents of Spark claim both scale and speed advantages for the Apache tool compared to its competitors. It’s been shown to work well for datasets from small volumes up to those measuring in the petabytes.
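The caching advantage described above can be sketched with a toy simulation in plain Python (not actual Spark code; the `read_from_disk` function and counters are invented for illustration): an iterative job that rereads its input from "disk" on every pass, versus one that reads once and keeps the working dataset in memory.

```python
# Toy illustration of why caching intermediate results in memory
# (Spark's approach) beats rereading from disk each pass (MapReduce).
disk_reads = 0  # counts simulated trips to disk


def read_from_disk():
    """Simulate loading a dataset from disk storage."""
    global disk_reads
    disk_reads += 1
    return list(range(10))


def iterate_mapreduce_style(iterations):
    """Each pass rereads its input from disk before processing."""
    total = 0
    for _ in range(iterations):
        data = read_from_disk()           # disk hit every iteration
        total += sum(x * x for x in data)
    return total


def iterate_spark_style(iterations):
    """Read once, keep the dataset cached in memory across passes."""
    cached = read_from_disk()             # single disk hit, then cached
    total = 0
    for _ in range(iterations):
        total += sum(x * x for x in cached)
    return total


disk_reads = 0
a = iterate_mapreduce_style(5)
mapreduce_reads = disk_reads

disk_reads = 0
b = iterate_spark_style(5)
spark_reads = disk_reads

print(a == b, mapreduce_reads, spark_reads)  # True 5 1
```

Both styles compute the same answer; only the number of disk trips differs, and that gap widens with every additional iteration over the dataset.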
A November 2014 benchmark contest had Apache Spark sorting 100 terabytes of data three times faster than Hadoop MapReduce, on a machine cluster one-tenth the size of that used for the MapReduce sort.

A recent survey by Typesafe, a software developer, showed a rising level of interest among organizations in using Spark. Only 13 percent were currently using it, but more than 30 percent were evaluating it, 20 percent of respondents planned to begin using it sometime this year, and another 6 percent expected to use it in 2016 or later. However, 28 percent of those surveyed had no knowledge of Spark, which underscores its still “bleeding edge” status.

For the government space, “testing and evaluation is where it’s at right now,” said Cindy Walker, vice president of Salient’s Data Analytics Center of Excellence. Agencies that have “sandboxes and R&D budgets” are the early adopters, she said.

“Many of our customers aren’t yet signing on the bottom line to implement big data, in-memory analytics, streaming solutions,” she said. “So, at this time, we are using Spark to help guide them to what they can expect once they get to that point.”

So while Spark won’t be a replacement for MapReduce, it will eventually claim a section of the big data analytics spectrum devoted to speedy data processing, according to analysts.

‘Spark’ triggers the next stage of near-real-time big data
BY BRIAN ROBINSON

What’s in the Apache Spark ecosystem?

The Apache Spark ecosystem comprises several integrated components:

• Spark Core, the underlying execution engine for the platform, supports a range of applications, as well as Java, Scala and Python application programming interfaces.

• Spark SQL (Structured Query Language) allows users to explore data.

• Spark Streaming enables analysis of streaming data from sources such as Twitter, in addition to Spark’s ability to do batch processing.

• Machine Learning Library (MLlib), a distributed machine learning framework, delivers high-quality algorithms up to 100 times faster than MapReduce.

• GraphX helps users build and manipulate graph-based representations of text and tabular data to find various relationships within the data.

• SparkR is a package for the R statistical language with which R users can use Spark functionality from within the R shell.
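The way Spark Core lets operations be "connected to each other," as Vennergrund puts it, can be sketched as a tiny lazy pipeline in plain Python. The `ToyRDD` class below is a hypothetical stand-in, not the real PySpark API: transformations such as `map` and `filter` are merely recorded, and nothing executes until an action like `collect` is called.

```python
# Minimal stand-in for Spark's lazy transformation chaining
# (illustrative only; the real PySpark API differs).
class ToyRDD:
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []   # transformations recorded, not yet run

    def map(self, fn):
        """Record a map transformation; return a new chained dataset."""
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        """Record a filter transformation; return a new chained dataset."""
        return ToyRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):
        """Action: run the whole recorded pipeline in one pass."""
        out = list(self._data)
        for kind, fn in self._ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out


result = (ToyRDD(range(10))
          .map(lambda x: x * x)          # chained transformation
          .filter(lambda x: x % 2 == 0)  # chained transformation
          .collect())                    # nothing runs until here
print(result)  # [0, 4, 16, 36, 64]
```

Deferring execution this way is what lets Spark plan a chain of transformations as one job and keep intermediate results in memory rather than materializing each step to disk.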