(Word Cloud Designed by Haoning Richter using R Language on June 7, 2018)
Big Data, Hadoop, Audit and Risk Considerations
For the past 10 years, Big Data has been one of the most discussed phenomena and business challenges in organizations around the world. However, it has not been discussed much in the internal audit profession. I recently had the opportunity to study big data in depth and thought it might be useful to apply what I’ve learned to the considerations that internal auditors or compliance professionals may face in the big data world: ethics, responsibilities, social and legal obligations, and compliance.
We cannot stop the ever-increasing complexity and volume of data and tools in the big data world, so we should embrace it by learning and trying to understand how we can become more effective auditors who help businesses and organizations solve problems and achieve their objectives.
Big Data – What Is It?
The dictionary defines it as “extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.”
Big data includes structured data, semi-structured data, and unstructured data. There are four characteristics of Big Data [1]:
- Volume: The amount of data or data intensity
- Velocity: The speed of data being produced, changed, received, and processed
- Variety: The different types and sources of data, both internal and external to an entity
- Veracity: The quality and provenance of received data
According to SAS Insights, big data has two additional dimensions [2]:
- Variability: In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Is something trending in social media? Daily, seasonal, and event-triggered peak data loads can be challenging to manage, even more so with unstructured data.
- Complexity: Today’s data comes from multiple sources, which makes it difficult to link, match, cleanse, and transform data across systems. However, it is necessary to connect and correlate relationships, hierarchies, and multiple data linkages, or your data can quickly spiral out of control.
Big Data – Why Is It Important?
Big data has already affected our lives and work in many ways. Hadoop, one of the top 10 open source tools, has dominated the big data world. Many companies have developed supporting or complementary solutions to enhance Hadoop, so the Hadoop ecosystem continues to evolve and excel.
The reason big data is so important isn’t how much of it you have but what you do with it. You can collect data (assuming compliance with data privacy and protection regulations) from any source and analyze it to find answers (or sell it to other companies), which enables companies to make smarter decisions, target the right audience, increase revenue, reduce costs and time, design new products, optimize offerings, recalculate risk portfolios in minutes, and detect fraudulent behavior or patterns before they affect the organization. At this scale, data can be messy, and traditional data warehouses or data marts can no longer handle the processing or generate meaningful analytics.
Ethics in Big Data
In 2012, Kord Davis pointed out the following four elements of big data ethics: identity, privacy, ownership, and reputation [3]. Although things have evolved and more facts have surfaced, such as the recent events at Facebook, those four elements are still relevant when assessing risks, planning an audit, and evaluating alignment between documented company values and actual practices in the methods and tools used (such as algorithms), as well as in the buying and selling of data.
The challenges of big data provide the perfect motivation and market demand for new, non-traditional, and effective applications and platforms. Hadoop is one of many open source tools developed to help manage big data and deal with its challenges.
Hadoop Background
In 2003, Doug Cutting was working on an open source project called Nutch when he had the idea of developing an open source tool for the world to share and use. In January 2006, Doug joined Yahoo and obtained funding and team resources to develop such an open source solution, which they called “Hadoop”.
By 2008, Hadoop had scaled; it won the terabyte sort benchmark that year. Hadoop changed the world by bringing the first scalable, reliable, open source distributed computing framework for dealing with big data across industries. The evolution of Hadoop is driven by its community at Apache, as Doug has pointed out. In 2009, Doug realized that for Hadoop to scale and be useful in different industries around the world, there needed to be a company providing support for it. Thus, Doug joined Cloudera as the company’s chief architect and continues to serve as Chairman of the Apache Software Foundation [4].
The name “Hadoop” came from one of Doug Cutting’s sons, who, when he was about two years old, called his toy elephant something like “Hadoop”. Doug used the name for his open source project because it was easy to pronounce and to Google. This video provides more details: https://www.youtube.com/watch?v=dImieI08lds
Hadoop Overview
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation [5].
Hortonworks predicted that by the end of 2020, 75% of Fortune 2000 companies would be running 1,000-node Hadoop clusters in production. The tiny toy elephant in the big data room has become the most popular big data solution across the globe.
Hadoop has the ability to store and analyze data spread across different machines at different locations very quickly and in a very cost-effective manner. It uses the concept of MapReduce, which enables it to divide a query into small parts and process them in parallel [5].
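As a rough illustration of that divide-and-process-in-parallel model, here is a minimal sketch of the classic word-count job written against Hadoop’s MapReduce Java API. It is a sketch, not production code: the input and output paths passed on the command line are placeholders, and running it assumes the Hadoop libraries are on the classpath and a cluster (or local mode) is configured.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each input split is processed in parallel; emit (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: after the shuffle, all counts for a word arrive together and are summed.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // placeholder input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // placeholder output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles the splitting, scheduling, shuffling, and retrying; the developer only supplies the map and reduce functions.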
The following diagram shows the different tools in the Hadoop and non-Hadoop platforms:
(Image source: jp.hortonworks.com)
Here are six characteristics and benefits of Hadoop:
- Hadoop Brings Flexibility in Data Processing:
One of the biggest challenges organizations have faced in the past was handling unstructured data. Hadoop manages data whether it is structured or unstructured, encoded or formatted, or of any other type. Hadoop brings value to the table by making unstructured data useful in the decision-making process.
- Hadoop Is Easily Scalable
- Hadoop Is Fault Tolerant
- Hadoop Is Great at High-Volume Batch Data Processing
- Hadoop Ecosystem Is Robust
- Hadoop is Very Cost Effective
Hadoop generates cost benefits by bringing massively parallel computing to commodity servers, resulting in a substantial reduction in the cost per terabyte of storage, which in turn makes it reasonable to model all your data.
Hadoop Architecture, with a Focus on HDFS and MapReduce
Hadoop is composed of modules that work together to create the Hadoop framework. The primary Hadoop framework modules are:
- Hadoop Distributed File System: Commonly known as HDFS, it is a distributed file system that provides high aggregate bandwidth and scales across the cluster (see the sketch after this list).
- MapReduce: A programming model for processing big data.
- YARN: A platform for managing and scheduling compute resources across the Hadoop infrastructure.
- Libraries: Common utilities and libraries that support the other Hadoop modules.
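To make the HDFS module a bit more concrete, here is a minimal sketch of writing and then reading a small file through the HDFS Java client. The path /user/demo/notes.txt is a made-up example, and the code assumes a reachable cluster (or a local file system in a test setup) configured through the standard core-site.xml / hdfs-site.xml files.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // handle to the configured file system
    Path file = new Path("/user/demo/notes.txt");  // hypothetical example path

    // Write: HDFS splits the file into blocks and replicates them across DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read: the NameNode resolves block locations; the data is streamed back from DataNodes.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
  }
}
```

From the application’s point of view the file looks like a single path, while the replication, block placement, and fault tolerance happen behind the scenes in the cluster.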