in real time, "government customers are starting to merge big and fast data tools," said Rich Campbell, chief technologist of EMC Federal. Agencies are combining tools such as Pivotal HD, a commercially supported distribution of Apache Hadoop, and Pivotal GemFire for data ingestion to run traditional big data management and analytics. They are also using Pivotal HAWQ for advanced database services and data fabric services. "These solutions let government agencies perform more real-time queries, with less of a requirement to move data, allowing for better response times," Campbell said.

Other tools include IBM's InfoSphere Streams, an advanced analytic platform that allows user-developed applications to quickly ingest, analyze and correlate information from multiple data sources in real time.

3. BIG DATA PARKING LOT

By far, the most common system for storing and batch processing big data is the Hadoop Distributed File System (HDFS), said Abe Usher, chief innovation officer of the HumanGEO Group, where he works with defense and intelligence agencies. "Hadoop is a unifying element for people using big data because it is a standard to store and retrieve large data sets. It is like a big data parking lot," Usher said.

Hadoop is an open-source framework that breaks up large data sets and distributes the processing work across a cluster of servers. Once the data is loaded into the cluster, a user queries it with the MapReduce framework, which "maps" the query to the proper nodes, where it is processed, then "reduces" the results from the distributed machines to one answer.
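The pattern MapReduce describes can be illustrated in a few lines of Python. This is a minimal, single-machine sketch of the classic word-count job, not Hadoop's actual API; on a real cluster, Hadoop runs the map step on the nodes holding each block of data and shuffles the intermediate pairs by key before the reduce step.

    #!/usr/bin/env python3
    # Single-machine sketch of the MapReduce word-count pattern.
    import sys
    from itertools import groupby
    from operator import itemgetter

    def map_phase(lines):
        # "Map": emit an intermediate (key, value) pair for every word.
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def reduce_phase(pairs):
        # sorted() stands in for Hadoop's shuffle, which groups
        # intermediate pairs by key between the two phases.
        for word, group in groupby(sorted(pairs), key=itemgetter(0)):
            # "Reduce": collapse each key's values to one answer.
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        # Usage: python wordcount.py < some_text_file
        for word, count in reduce_phase(map_phase(sys.stdin)):
            print(word, count, sep="\t")

The same two functions, distributed across many machines and fed terabytes instead of a text file, are what a Hadoop job boils down to.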
Commercial versions of Hadoop are available from companies such as Cloudera, Hortonworks and IBM.

Agencies should realize that the framework comes with manpower and cost concerns, though. It is a relatively new technology that requires people with Hadoop expertise, and it needs to run on multiple servers in a Tier 1 data center with good internal bandwidth and management. And because Hadoop is a batch processing engine, it is not optimized for real-time analysis. Deployment costs can hit the $50,000 range, Usher said.

Hadoop has been used in several successful government programs, including at the National Cancer Institute's Frederick National Laboratory, which built an infrastructure capable of cross-referencing the relationships between 17,000 genes and five major cancer subtypes. In 2010, GSA revamped its USASearch, a hosted search service used by more than 550 government websites. Using HDFS, Hadoop and Apache Hive, GSA improved search results by aggregating and analyzing big data on users' search behavior.

Oracle has moved to address issues of cost and complexity with the Oracle Big Data Appliance, which incorporates Cloudera's software (including Apache Hadoop) into Oracle hardware, said Mark Johnson, senior vice president of Oracle Public Sector. The appliance comes pre-built, optimized and tuned to lower the costs of big data projects. Similarly, IBM offers InfoSphere BigInsights, a Hadoop-based analysis tool that includes visualization, advanced analytics, security and administration.

4. DATA INTEGRATION

Traditional relational databases weren't designed to cope with the variety, velocity and volume of unstructured data coming from audio devices, machines, cell phones, sensors, social media platforms and video. NoSQL databases can write data much faster than an RDBMS and deliver fast query speeds across large volumes. They are distributed tools that manage unstructured and semi-structured data that requires frequent access. Some examples include:

• MongoDB leverages in-memory computing and is built for scalability, performance and high availability, scaling from single-server deployments to large, complex multisite architectures. (A brief usage sketch appears at the end of this section.)
• Apache Cassandra handles big data workloads across multiple data centers with no single point of failure, providing enterprises with high database performance and availability.
• Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's BigTable data storage system.
• MarkLogic's enterprise-grade NoSQL platform can integrate diverse data from legacy databases, open-source technologies and Web information sources. With government-grade security, the database has been used for fraud detection, risk analysis, and vendor and bid management.

Accumulo was created in 2008 by the National Security Agency and contributed to the Apache Foundation as an incubator project in September 2011. Because it includes cell-level security, it can restrict users' access to particular fields of the database. This enables data of various security levels to be stored within the same row, and users with varying degrees of access to query the same table, while preserving data confidentiality. According to the NSA, hundreds of developers are currently using Accumulo.

Extraction, transformation and loading (ETL) processes are critical components for migrating data from one database to another or for feeding a data warehouse or business intelligence system. An ETL tool retrieves data from all operational systems and prepares it for further analysis by reformatting, cleaning, mapping and standardizing it. As ETL tools mature, they increas-
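The usage sketch promised in the MongoDB entry above: a minimal example using pymongo, MongoDB's Python driver. It assumes a MongoDB server running on the default local port; the database, collection and field names are invented for illustration.

    from pymongo import MongoClient

    # Assumes a local MongoDB instance on the default port 27017.
    client = MongoClient("localhost", 27017)
    events = client.agency_data.sensor_events  # illustrative names

    # A document store accepts semi-structured records with no fixed
    # schema, which is why writes are cheap compared with an RDBMS.
    events.insert_one({"sensor": "gate-7", "event": "entry", "badge": "A1234"})
    events.insert_one({"sensor": "gate-7", "event": "entry"})  # missing field is fine

    # Query by field; an index on "sensor" keeps this fast at volume.
    for doc in events.find({"sensor": "gate-7"}):
        print(doc)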
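Accumulo's cell-level security model is easiest to see in miniature. The Python below is only a conceptual sketch, not Accumulo's API (Accumulo exposes a Java client): every cell carries its own visibility label, so a single row can hold data at several security levels, and a scan returns only the cells a reader's authorizations satisfy. All labels and field names here are invented.

    # Conceptual sketch of cell-level visibility (not the Accumulo API).
    # Each cell in a row carries its own security label.
    row = {
        "name":     ("Jane Analyst", "unclassified"),
        "phone":    ("555-0100",     "unclassified"),
        "location": ("Site B",       "secret"),
    }

    def scan(row, authorizations):
        # Return only the cells whose labels the reader is cleared to see.
        return {field: value
                for field, (value, label) in row.items()
                if label in authorizations}

    print(scan(row, {"unclassified"}))            # name and phone only
    print(scan(row, {"unclassified", "secret"}))  # the full row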
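The extract, transform and load steps described above map directly to code. This is a minimal sketch, assuming a CSV export from an operational system and using SQLite as a stand-in for the data warehouse; the file, column and table names are invented.

    import csv
    import sqlite3

    # Extract: read rows from an operational system's CSV export.
    with open("operational_export.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: clean, standardize and reformat before loading.
    cleaned = [
        (r["agency"].strip().upper(), int(r["amount"]))
        for r in rows
        if (r.get("amount") or "").strip()  # drop records missing an amount
    ]

    # Load: write the prepared records into the target store.
    db = sqlite3.connect("warehouse.db")
    db.execute("CREATE TABLE IF NOT EXISTS spending (agency TEXT, amount INTEGER)")
    db.executemany("INSERT INTO spending VALUES (?, ?)", cleaned)
    db.commit()
    db.close()

Real ETL tools add the mapping, scheduling and monitoring layers around these same three steps.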