Open Source Software for Big Data

With a dynamic ecosystem of developers, the open source community powers big data

“We’re seeing a rapid expansion of powerful tools available at low cost”

Key Points

  • Hadoop is the leading open source platform for big data
  • Related tools include HDFS, Pig, Mahout, Hive, HBase, Sqoop, Oozie, and Cassandra
  • Hadoop has a large user and support community

Big data management and analysis includes structured and unstructured information and has many uses: enterprise data fusion, decision support, competitive analysis, fraud detection, social media and sentiment analysis, intelligent power grids, medical research, healthcare, defense, intelligence, and security.
For example, healthcare big data solutions are being used with electronic medical records, ancillary systems, and financial systems to collect, standardize, and aggregate large scale clinical data, to enable healthcare leaders to identify and measure outcomes, correlations, and trends, and to improve quality of care.
Biopharma big data solutions are being used to support bioinformatics, data mining, genomics, and translational research. For example, in genomics, big data technologies enable researchers and clinicians to mine billions of data points to discover new biomarkers.

Challenges of Big Data

Challenges of leveraging big data include:

  1. Planning – prioritizing use cases and identifying value
  2. Data variety, volume, and velocity
  3. Analytical workload complexity
  4. Cost – solutions range from free open source software to proprietary products
  5. Skills – external resources may be needed

The Information Supply Chain

The objectives of an information supply chain are to consume and integrate raw source data, analyze the data, and then deliver the analytical results to users.
A good approach to managing big data and supporting the processing involved is to deploy hardware and software solutions optimized for different types of big data workloads, and then combine these solutions with an existing enterprise data warehouse to create an integrated information supply chain.

Managing and Analyzing Big Data

Input to the information supply chain consists of the raw source data required for analysis. Traditionally, most business analytics were created using structured data extracted from operational systems and consolidated into a data warehouse. Big data dramatically increases the number of data sources, as well as the variety and volume of data available for analysis.
A high percentage of this data is semi-structured or unstructured, in contrast to the structured operational data used to populate a data warehouse. In many organizations, unstructured data is growing faster than structured data.

Analytic RDBMSs

An analytic RDBMS (ADBMS) is an integrated solution for managing structured data and generating analytics that offers improved price/performance compared with more generalized RDBMS offerings. Performance improvements are achieved with massively parallel processing, enhanced data structures, and data compression.
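As a rough illustration of one of the compression techniques mentioned above, the sketch below shows run-length encoding, which works well on columnar data where long runs of repeated values are common. This is a simplified teaching example, not any vendor's actual implementation.

```python
from itertools import groupby

def rle_encode(column):
    """Compress a column into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]

# Hypothetical status column with long runs of repeated values.
status_column = ["open"] * 4 + ["closed"] * 3 + ["open"] * 2
encoded = rle_encode(status_column)
print(encoded)  # [('open', 4), ('closed', 3), ('open', 2)]
assert rle_decode(encoded) == status_column
```

Nine stored values compress to three pairs here; on real columns with millions of rows, savings of this kind reduce both storage and I/O, which is where much of the ADBMS price/performance advantage comes from.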

Non-Relational Systems

Non-relational systems are useful for big data where most of the data is semi-structured or unstructured. These systems support data structures such as document data, graphical information, and key-value pairs.
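To make the contrast with relational tables concrete, the sketch below shows the same hypothetical patient record represented two ways: as a nested document (the model used by document stores) and as flattened key-value pairs (the model used by key-value stores). The record and key scheme are invented for illustration.

```python
import json

# Document representation: nested structure stored as a single record,
# as a document store would hold it.
document = {
    "patient_id": "p-1001",
    "name": "Jane Doe",
    "visits": [
        {"date": "2013-05-01", "diagnosis": "hypertension"},
        {"date": "2013-08-15", "diagnosis": "follow-up"},
    ],
}

# Key-value representation: the same information flattened into
# composite keys, each mapping to a single value.
key_value = {
    "patient:p-1001:name": "Jane Doe",
    "patient:p-1001:visit:2013-05-01": "hypertension",
    "patient:p-1001:visit:2013-08-15": "follow-up",
}

print(json.dumps(document, indent=2))
print(key_value["patient:p-1001:name"])  # Jane Doe
```

Neither representation requires a fixed schema up front, which is what makes these systems a natural fit for semi-structured and unstructured big data.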

Hadoop

The leading non-relational platform for big data is Hadoop, an open source project from the Apache Software Foundation. Hadoop is a framework for running applications on large clusters of commodity hardware. Hadoop includes a distributed file system (HDFS) to distribute and manage large volumes of data across the nodes of a cluster, providing high data throughput while hiding the complexity of cluster programming.
Hadoop uses the MapReduce programming model to divide application processing into small fragments of work that can be executed on multiple nodes of the cluster to provide massively parallel processing. Hadoop also includes the Pig and Hive languages for developing and generating MapReduce programs.
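The MapReduce model described above can be sketched in a few lines of single-process Python: map emits (key, value) pairs, the framework shuffles and sorts them by key, and reduce aggregates each group. Hadoop distributes these same phases across a cluster; this word-count example only illustrates the programming model.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line of input."""
    for word in line.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    """Reduce: sum the counts emitted for one word."""
    return (word, sum(counts))

def mapreduce(lines):
    # Map step over all input records.
    pairs = [pair for line in lines for pair in map_phase(line)]
    # Shuffle/sort step: group intermediate pairs by key, as Hadoop
    # does between the map and reduce phases.
    pairs.sort(key=itemgetter(0))
    return [
        reduce_phase(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    ]

result = dict(mapreduce(["big data big cluster", "data node"]))
print(result)  # {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

Because each map call sees only one record and each reduce call sees only one key's values, the framework is free to run many of them in parallel on different nodes, which is the source of MapReduce's scalability.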
Hadoop MapReduce is intended for the batch processing of large volumes of multi-structured data. It is not suitable for low-latency data processing, many small files, or the random updating of data. Instead, these capabilities are provided by database products such as HBase, which runs on top of HDFS, and Cassandra, which can integrate with Hadoop.

Focus on Value

Big data adds new high-volume, multi-structured data sources to the information supply chain. Successful strategies for big data include prioritizing use cases, identifying critical success factors, and focusing on value up front.