As we all prepare for the New Year, what are the top priorities on your agenda for 2017? Are data lakes part of it? Are you looking for ways to do it right? Then we might be able to help you go through the holiday break with some food for thought.
Data lakes are an important cornerstone for companies that need to manage and exploit new forms of data to fuel their digital transformation. They allow employees to do more with data, faster, in order to drive business results. Yet many IT teams are still trying to figure out how to get a return on their initial data lake investments.
Starting with the ‘Why’ instead of the ‘What’: Why do you need a data lake?
Let’s look at the case of GE. GE’s CEO Jeff Immelt once said: “All industrial companies are in the information business whether they like it or not.” GE was founded in 1892, and since that time the company has remained relevant by evolving its business models and focus in order to keep pace with the market.
In the past, companies thought they’d gain full 360-degree visibility into their enterprise information with a data warehouse. However, the advent of big data has put these systems under strain, pushing them to capacity and driving up the cost of storage. As a result, some companies have started moving some of their data (oftentimes less-utilized data) to a new set of systems, such as those running on Hadoop, NoSQL databases or the Cloud.
As a result of this migration, companies also came to realize that they could actually do more with Hadoop, NoSQL and Cloud than with enterprise data warehouses. Thus, they started adding new sources of data like sensor, mobile, social and big data to these systems, ultimately transforming their Hadoop, NoSQL and Cloud systems into data lakes.
So what is a data lake?
According to Nick Heudecker at Gartner, “Data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in its native format. The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organization.”
The data lake metaphor emerged because a ‘lake’ is a useful way to explain one of the basic principles of big data: the need to collect all data and detect exceptions, trends and patterns using analytics and machine learning. This follows from a basic tenet of data science: the more data you have, the better your data model is likely to be. With access to all the data, you can model using the entire data set rather than just a sample, which can reduce the number of false positives you get.
And what’s the difference between a data warehouse and a data lake?
Unlike a data warehouse, which requires data to be structured before it is loaded, a data lake provides the flexibility to store anything without having to worry about preformatting data. However, this flexibility has also led to a new set of challenges: because much less structure is imposed up front, the structure of the data has to be worked out when the data is read.
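To make the idea of working out structure at read time more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the sample records, the field names (`device_id`, `temp_c`, `temperature`) and the `read_with_schema` helper are all hypothetical, not part of any particular data lake product.

```python
import json

# Raw records sit in the lake exactly as the sources produced them.
# (Hypothetical sample data; the field names are illustrative.)
RAW_EVENTS = [
    '{"device_id": "t-100", "temp_c": "21.5", "ts": "2016-12-01T08:00:00"}',
    '{"device_id": "t-101", "temperature": 22, "ts": "2016-12-01T08:00:05"}',
    'not valid json',
]


def read_with_schema(raw_lines):
    """Apply a structure at read time; return (parsed rows, rejected lines)."""
    rows, rejected = [], []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        if not isinstance(record, dict) or "device_id" not in record:
            rejected.append(line)
            continue
        # Sources disagree on the field name, so the reader reconciles them.
        temp = record.get("temp_c", record.get("temperature"))
        try:
            temp_c = float(temp)
        except (TypeError, ValueError):
            rejected.append(line)
            continue
        rows.append({
            "device_id": str(record["device_id"]),
            "temp_c": temp_c,
            "ts": record.get("ts"),
        })
    return rows, rejected


rows, rejected = read_with_schema(RAW_EVENTS)
print(len(rows), "rows parsed,", len(rejected), "rejected")
```

Nothing was validated when the data was stored, so every reader has to carry this kind of logic, or rely on shared tooling that does it for them. That is the trade-off behind the flexibility described above.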
And with the overwhelming amount of data flowing into organizations today, there are concerns over what data can be accessed by employees, and what shouldn’t be shared. Due to the lack of tools, there is also confusion around what data lies where, and limited understanding of where the data came from or what has been done with it thus far.
As a result, until now only a limited number of people have been able to access the information residing in corporate data lakes. These individuals tend to be those who know how to work with data science tools in order to deal with the volume and complexity of data. The rest of the organization has simply been drowning in the data lake.
This gap between those who can use enterprise data lakes and those who can’t has led to gridlock, causing most data lakes to fail to deliver on their true promise: business ROI.
So here are five best practices to successfully unleash the power of your data lakes.
1) Accelerate Data Ingestion
Most organizations end up with a disjointed architecture, with numerous enterprise data silos coming from point solutions, plus new data from cloud, big data and IoT applications. So the prerequisites for a solid data ingestion platform should be:
► Wide connectivity – Connect all your data big and small, on premises, hybrid and in the cloud.
► Batch & streaming ubiquity – Ensure the platform can handle both historical and real-time data ingestion, and can process data as it comes in, including for advanced analytics.
► Scale with volume and variety – The platform should be able to quickly onboard new data sources, such as data from web clickstreams, social media or smart devices.
Pitfalls to watch for:
- Hand coding – This will limit the system’s ability to scale and to deliver on business needs in a timely fashion.
- Fragmented tools – Using too many of these will create even more silos.
2) Understand & govern your data
The lack of data governance is preventing many organizations from fully opening up data lakes for all employees to use because, more often than not, data lakes contain sensitive data, such as social security numbers, dates of birth and credit card numbers, that needs to be protected. Without a thorough information governance strategy, these organizations will not reap the full benefits of their data lake investment or get their full return on it. So here are some of the things to consider:
► Add context to data (provenance, semantics…) – Where is the data coming from, and what’s the relationship between various data sets?
► Optimize data with curation, stewardship, and preparation – Involve the right people to help clean and qualify data.
► Use a collaborative data governance process – Get IT and the business to work together on ensuring enterprise information can be trusted.
Pitfalls to watch for:
- Authoritative governance – A top-down approach to data governance never really works well for user engagement. Instead, you need a bottom-up approach wherein users can model the data at will, but central IT still certifies, protects and governs the data.
- Fragmented tooling – Use of fragmented tools leads to an inconsistent governance framework.
3) Remove data silos and unify data management
In order to get a single version of the truth, you need a unified framework for all data management tasks with:
► Pervasive data quality, data masking – These need to be part of the data platform.
► Consistent operationalization – To increase data trust and agility.
► Single platform for all use cases & personas – Increases productivity and collaboration across teams.
Pitfalls to watch for:
- Fragmented tools – Use of fragmented tools leads to unpredictable and exponential costs.
- Hand coding – This will prevent your system from being scalable and easy to deploy.
- Shadow IT – Employees will find workarounds to access data lakes, which creates chaos and puts enterprise information at risk.
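As a rough illustration of what data masking inside the platform can look like, the Python sketch below hides social security and credit card numbers before text is shared more widely. The two patterns and the `mask_sensitive` helper are assumptions made for this example; real detection of sensitive data needs far more care than two regular expressions.

```python
import re

# Illustrative patterns only (assumptions for this sketch):
# a US-style SSN such as 123-45-6789, and a 16-digit card number
# written with optional spaces or dashes between groups of four.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")
CARD_PATTERN = re.compile(r"\b(?:\d{4}[- ]?){3}(\d{4})\b")


def mask_sensitive(text):
    """Mask SSN-like and card-like numbers, keeping only the last four digits."""
    text = SSN_PATTERN.sub(r"***-**-\1", text)
    text = CARD_PATTERN.sub(r"****-****-****-\1", text)
    return text


masked = mask_sensitive("SSN 123-45-6789, card 4111 1111 1111 1111")
print(masked)
```

Applying the same masking step everywhere data leaves the lake, rather than in each team's own scripts, is one way to avoid the hand-coding and fragmented-tool pitfalls listed above.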
4) Deliver data to a wide audience
Your data lake will only gain its full power if you get the data into the hands of more employees.
► Make data accessible – IT needs to deploy easy-to-use tools for less tech-savvy line-of-business users, who make business decisions using data from the lake.
► Governed self-service – More general access to corporate information without chaos or risk.
► Scalable operationalization – Allows you to industrialize projects.
Pitfalls to watch for:
- Unmanaged autonomy – Use of isolated, unmanaged tools.
- Self-service tools for the tech savvy – Supplying only a handful of data-savvy users with access to the data lake.
5) Get ready for change
The pace of change keeps accelerating and data volumes are growing exponentially. So you need a modern data platform that can deliver data for real-time, more informed decision making and take your organization through its digital transformation and on to future success.
 “Beware of the Data Lake Fallacy,” July 28, 2014, http://www.biztechafrica.com/article/gartner-beware-data-lake-fallacy/8517/