(Word Cloud Designed by Haoning Richter using R Language on June 7, 2018)

Big Data, Hadoop, Audit and Risk Considerations

For the past 10 years, big data has been one of the most discussed phenomena and business challenges in organizations around the world. However, it has not been discussed much in the internal audit profession. I recently had the opportunity to study big data in depth and thought it might be useful to apply what I’ve learned to the considerations that internal auditors and compliance professionals may face in the big data world: ethics, responsibilities, social and legal obligations, and compliance.

We cannot stop the ever-increasing complexity and volume of data and tools in big data, so we should embrace it. By learning what big data is and how it works, we can become more effective auditors who help businesses and organizations solve problems and achieve their objectives.

Big Data – What is it?

The dictionary says, “extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.”

Big data includes structured data, semi-structured data, and unstructured data. There are four characteristics of Big Data [1]:

  • Volume: The amount of data or data intensity
  • Velocity: The speed of data being produced, changed, received, and processed
  • Variety: The different data sources, both internal and external to an entity
  • Veracity: The quality and provenance of received data

According to SAS Insights, big data has two additional dimensions [2]:

  • Variability: In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage. Even more so with unstructured data.
  • Complexity: Today’s data comes from multiple sources, which makes it difficult to link, match, cleanse and transform data across systems. However, it’s necessary to connect and correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out of control.
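
To make the "link, match, cleanse" challenge more concrete, here is a minimal Python sketch of matching records from two systems on a cleansed key. Everything in it is an assumption made for illustration: the two sources (a CRM export and a billing export), the field names, and the helper functions normalize_email and link_records are hypothetical and do not come from SAS or any particular tool.

```python
def normalize_email(raw):
    """Cleanse an email key: trim surrounding whitespace and lowercase it."""
    return raw.strip().lower()


def link_records(crm_rows, billing_rows):
    """Match rows from two sources on a cleansed email key.

    Returns (matched, unmatched). `matched` pairs each CRM row with its
    billing row; `unmatched` lists rows that appear in only one source.
    Duplicate emails within one source are not handled in this sketch.
    """
    billing_by_email = {normalize_email(r["email"]): r for r in billing_rows}
    matched, unmatched = [], []
    seen = set()
    for row in crm_rows:
        key = normalize_email(row["email"])
        if key in billing_by_email:
            matched.append((row, billing_by_email[key]))
            seen.add(key)
        else:
            unmatched.append(row)
    # Billing rows that no CRM row pointed to are also exceptions.
    unmatched.extend(r for k, r in billing_by_email.items() if k not in seen)
    return matched, unmatched


# Hypothetical sample data with inconsistent formatting across systems.
crm = [
    {"customer": "Ann Lee", "email": " Ann.Lee@Example.com "},
    {"customer": "Bob Roy", "email": "bob.roy@example.com"},
]
billing = [
    {"invoice": "INV-1", "email": "ann.lee@example.com"},
    {"invoice": "INV-2", "email": "cara.diaz@example.com"},
]

matched, unmatched = link_records(crm, billing)
print(len(matched), "matched;", len(unmatched), "need follow-up")
```

From an audit perspective, the unmatched list is the interesting part: records that exist in one system but not the other are the kind of exceptions an auditor would typically want explained.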

Big Data – Why Is Big Data Important?

Big data has already affected our lives and work in many ways. Hadoop, one of the top 10 open source tools, has dominated the big data world. Many companies have developed supporting or complementary solutions to enhance Hadoop, so the Hadoop ecosystem continues to evolve and excel.

The importance of big data lies not in how much of it you have but in what you do with it. You can collect data from any source (assuming compliance with data privacy and protection regulations) and analyze it to find answers, or sell it to another company. Such analysis enables companies to make smarter decisions, target the right audience, increase revenue, reduce costs and time, design new products, optimize offerings, recalculate risk portfolios in minutes, and detect fraudulent behavior or patterns before they affect the organization. Because the data comes from so many sources, it can be messy, and traditional data warehouses and data marts can no longer handle the processing or generate meaningful analytics.

Ethics in Big Data

In 2012, Kord Davis pointed out four elements of big data ethics: identity, privacy, ownership, and reputation [3]. Although things have evolved and more “facts” have surfaced, such as the recent events at Facebook, those four elements remain relevant when assessing risks, planning an audit, and evaluating the alignment between documented company values and actual practices, including the methods and tools used (such as algorithms) and the buying and selling of data.

The challenges of big data provide the perfect motivation and market demand for new, non-traditional, and effective applications and platforms. Hadoop is one of many open source tools developed to help manage big data challenges.

Hadoop Background

In 2003, Mr. Doug Cutting worked at