Understanding Data Lakes

By: Phil Simon

Introduction

As I did the research for Analytics: The Agile Way, I encountered a relatively new concept in the business and tech landscape: the data lake. In this post and the next, I’ll introduce data lakes and describe why they matter.

Let’s begin by examining data lakes in contrast to data warehouses. The latter are predicated upon strictly defined schemas, typically of either the star or snowflake variety. That is, they require writing and storing data in a very structured manner or shape. Data must be transformed to fit that structure before it is stored; data warehouses do not store data in its “natural state.”
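To make the warehouse side concrete, here is a minimal sketch in Python using the standard library's sqlite3 module as a stand-in for a real warehouse. The table names (dim_product, fact_sales) and the sample rows are hypothetical, chosen only to illustrate a tiny star-style layout; the point is simply that data that does not fit the predefined shape is rejected on the way in.

```python
import sqlite3


def build_warehouse():
    """Create a tiny, hypothetical star-style schema in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
    conn.execute(
        "CREATE TABLE dim_product ("
        " product_id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE fact_sales ("
        " sale_id INTEGER PRIMARY KEY,"
        " product_id INTEGER NOT NULL REFERENCES dim_product(product_id),"
        " amount REAL NOT NULL CHECK (amount >= 0))"
    )
    conn.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
    return conn


def try_load(conn, row):
    """Attempt to write one (product_id, amount) row; return True if the schema accepts it."""
    try:
        conn.execute(
            "INSERT INTO fact_sales (product_id, amount) VALUES (?, ?)", row
        )
        return True
    except sqlite3.IntegrityError:
        # The schema is enforced at write time, so the row never lands.
        return False


conn = build_warehouse()
accepted = try_load(conn, (1, 19.99))          # fits the schema
rejected_unknown = try_load(conn, (99, 5.00))  # product 99 is not in dim_product
rejected_missing = try_load(conn, (1, None))   # amount is required
row_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```

Only the first row is stored. The other two never reach the table because they violate rules that were fixed before any data arrived.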

The tightly controlled process of data warehousing meets certain business needs, most often reporting. Still, it fails to meet others. (More on that in my next post on the subject.)

Enter the Data Lake

I’ve been saying for a while now that traditional data warehouses can’t do it all. Data lakes fill that gap: they meet a genuine business need, and software vendors have taken notice.

Yes, at a high level, both data warehouses and data lakes store data, but there’s a key difference: schema-on-write vs. schema-on-read.

Let me explain.

Data lakes still require a schema, but that schema isn’t predefined. It’s ad hoc or, if you like, on-read. A plan or schema is applied to the data as it is pulled out of a stored location, not as it goes in. Put differently, data remains in its unaltered (read: natural) state. Critically, a data lake doesn’t define requirements unless and until users query the data. As Margaret Rouse writes:

Each data element in a lake inherits a unique identifier tagged with an extended set of metadata tags. When a business question arises, users can query the data lake for relevant data. The end goal: that those users can analyze that smaller dataset to help answer the question.

Think about it. When used correctly, data lakes allow business and technical users to query smaller, more relevant, and more flexible datasets. As a result, query times can drop to a fraction of what they would have been in a data mart, data warehouse, or relational database.
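The on-read idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular data lake product works: the in-memory list named lake stands in for real storage, and the ingest and query helpers, the tag names, and the sample records are all hypothetical. Each record is stored exactly as it arrived, with a unique identifier and a set of metadata tags; a schema is supplied only when someone asks a question.

```python
import json
import uuid

lake = []  # hypothetical stand-in for the lake's storage layer


def ingest(raw, tags):
    """Store a record untouched, with a unique identifier and metadata tags."""
    entry = {"id": str(uuid.uuid4()), "tags": set(tags), "raw": raw}
    lake.append(entry)
    return entry["id"]


def query(tag, schema):
    """Select records by tag, then apply a schema (field -> converter) at read time."""
    rows = []
    for entry in lake:
        if tag not in entry["tags"]:
            continue  # only the smaller, relevant subset is parsed at all
        record = json.loads(entry["raw"])
        rows.append(
            {
                field: convert(record[field]) if field in record else None
                for field, convert in schema.items()
            }
        )
    return rows


# Records of different shapes go in as-is; nothing is rejected on write.
ingest('{"customer": "Acme", "amount": "19.99", "channel": "web"}', ["sales", "2017"])
ingest('{"customer": "Globex", "amount": 5, "note": "phone order"}', ["sales"])
ingest('{"user": "jdoe", "text": "Great product!"}', ["social"])

# The schema is chosen by the person asking the question, at query time.
sales = query("sales", {"customer": str, "amount": float})
```

Here the two sales records have different fields and even different types for amount, yet both are stored. The query decides which fields matter and how to interpret them, and a different question could read the same raw records with a different schema.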

Simon Says

I see a bright future for data lakes. Data volumes continue to increase—especially of the unstructured variety. Data storage costs keep plummeting and data is increasingly valuable. Rather than trying to retrofit useful and mature technologies to a very new environment, expect intelligent organizations to experiment with and adopt data lakes over the next few years.

About The Author

Phil Simon

Professor at ASU’s W. P. Carey School of Business

Phil Simon is a frequent keynote speaker and recognized technology authority. He is the award-winning author of eight management books, most recently Analytics: The Agile Way. He advises organizations on matters related to communications, strategy, data, and technology. His contributions have been featured in The Harvard Business Review, CNN, Wired, NBC, CNBC, Inc. Magazine, BusinessWeek, The Huffington Post, Quartz, The New York Times, Fox News, and many other outlets.
