Understanding Data Lakes
As I did the research for Analytics: The Agile Way, I encountered a relatively new concept in the business and tech landscape: the data lake. In this post and the next, I’ll broach the subject and describe why they matter.
Let’s begin by examining data lakes in contrast to data warehouses. The latter are predicated upon strictly defined schemas—typically of the star or snowflake variety. That is, they require writing and storing data in a highly structured shape. Data warehouses require the strict manipulation of data; they do not store data in its “natural state.”
The tightly controlled process of data warehousing often meets certain business needs—often reporting. Still, it fails to meet others. (More on that in my next post on the subject.)
Enter the Data Lake
Yes, at a high level, both data warehouses and data lakes store data, but there’s a key difference: schema-on-write vs. schema-on-read.
Let me explain.
Data lakes still require schema, but that schema isn’t pre-defined. It’s ad hoc or, if you like, on-read. A schema is applied to the data as it is pulled out of storage, not as it goes in. Put differently, data remains in its unaltered (read: natural) state. Critically, a data lake doesn’t impose requirements unless and until users query the data. As Margaret Rouse writes:
Each data element in a lake is assigned a unique identifier and tagged with an extended set of metadata tags. When a business question arises, users can query the data lake for relevant data. The end goal: that those users can analyze that smaller dataset to help answer the question.
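To make the on-read idea concrete, here’s a minimal sketch in Python. The records, field names, and question are all hypothetical; the point is simply that the raw data is stored as it arrived, and a shape is imposed only at query time.

```python
import json

# A toy "data lake": raw event records kept in their natural state.
# A schema-on-write system would have forced these into a fixed table
# before storing them; here, nothing is imposed up front.
raw_records = [
    '{"id": "a1", "type": "click", "page": "/home", "ts": 1700000000}',
    '{"id": "a2", "type": "purchase", "amount": 19.99, "ts": 1700000050}',
    '{"id": "a3", "type": "click", "page": "/pricing", "ts": 1700000090}',
]

def clicks_by_page(records):
    """Schema-on-read: the shape (type, page) is applied only now,
    when a business question asks for it, not when the data was stored."""
    pages = []
    for line in records:
        rec = json.loads(line)          # parse the raw record on read
        if rec.get("type") == "click":  # filter to the relevant subset
            pages.append(rec["page"])   # project the one field needed
    return pages

print(clicks_by_page(raw_records))  # ['/home', '/pricing']
```

Note that the purchase record carries an `amount` field the click records lack; because no schema was enforced on write, the mismatch costs nothing until a query actually cares about it.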
Think about it. When used correctly, data lakes allow business and technical users to query smaller, more relevant, and more flexible datasets. As a result, query times can drop to a fraction of what they would have been in a datamart, data warehouse, or relational database.
I see a bright future for data lakes. Data volumes continue to increase—especially of the unstructured variety. Data storage costs keep plummeting and data is increasingly valuable. Rather than trying to retrofit useful and mature technologies to a very new environment, expect intelligent organizations to experiment with and adopt data lakes over the next few years.