Skip to main content
Healthcare Data Lake

Level up: Advancing healthcare analytics through data lakes

Across the healthcare ecosystem, payers, providers, pharmacy, and life sciences organizations are leveraging healthcare data lakes, seeking to unite disparate structured and unstructured data from multiple sources such as claims data, clinical data, social determinants of health and quality insights to name a few. But what exactly is a healthcare date lake?

This blog dives into what a data lake is, the emerging role of data lakes in healthcare, and best practices for managing data lakes.

What is a healthcare data lake?

The concept of “big data” was coined well more than a decade ago and has gained momentum as the sheer amount of data collected across all sectors began its exponential growth. How do big data and data lake intersect? And is a data lake the same as a data warehouse? James Dixon, the CTO of Pentaho who is credited with the naming of the concept of data lake, described the concept of a data lake to Forbes:

“If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

In short, a data lake serves as a centralized repository where both structured and unstructured data is stored at any scale. The healthcare industry has a rich amount of disparate patient data that needs to be structured and streamlined to help physicians, healthcare organizations, payers and scientific researchers make informed, data-driven decisions. A cloud-based healthcare data lake solution enables the combining of complex and disparate data, delivering insightful analytics to solve complex business and clinical issues.

According to a white paper from Scalable Health, the benefits and potential impact of a healthcare data lake include a comprehensive view of patient care, the ability to process huge data volumes at once, enhanced query processing, lower costs, and a faster time to greater insights.

A data lake serves as a solution for companies that have large amounts of data pocketed in different places and are managed by different groups. Having a centralized repository of data prevents the data from being obscured. Since there are no predefined schemas for the data imported into data lakes, users can ingest data in real time – and with the right governance, data lake administrators don’t have to worry about managing access for multiple databases, because data lakes have security controls in place that grant authorized users access to see, process or modify a specific asset, while keeping unauthorized users out, preventing confidential data from being compromised.

How does a data lake differ from a data warehouse?

In contrast to a data lake, a data warehouse stores data in an organized, structured manner with everything archived and ordered in a defined way – think of a computer drive where there’s a folder for pictures, videos, documents, and downloaded content and every file or media content has a designated folder on the drive. The concept is the same for a data warehouse. Unlike a data lake, a data warehouse takes more time to develop, requiring a significant amount of effort in the initial developmental stages that asks each data source to be analyzed and the business use clearly defined before the data is loaded into the warehouse. On the other hand, a data lake is centralized, meaning everything is stored in one big repository rather than stored as individual files like the data in a data warehouse. In a data lake, both structured and unstructured data can be stored at any scale, meaning the data being inputted into the data lake doesn’t have to be analyzed or evaluated for future business use.

A typical data lake would contain multiple sources of data that could be contained in a single integration layer or by integrating multiple channels through an API connected to the enterprise data lake. Many such sources are claims and enrollment data, quality and risk  insights, genomic health, linked external datasets, data coming from ACOs and health systems, national registries, and analytics coming from client and vendor entities.

A 2017 Aberdeen survey found that organizations that implemented a data lake were more likely to outperform their competitors by 9% in organic growth. An effective healthcare data lake solution provides a single source of truth for data that supports advanced reporting, analytics initiatives, and other use cases. It is this ability to transform the structured and unstructured components of the data lake into a data mart that can be configured with the latest technologies to quickly support numerous business use cases and consumption models that underscores the power of getting the foundation right.

Another key benefit of adopting a data lake is that it makes it easier for organizations to train and deploy more accurate machine learning and artificial intelligence (AI) models. AI and machine learning technology – including Python – thrives on large, diverse datasets, and a data lake serves as a powerful foundation to support the training of new algorithms for these technologies.

For an even more detailed explanation on the key benefits of a data lake and how it differs from a data warehouse, review the table below, originally published by Amazon:

Characteristics Data Warehouse Data Lake
Data Relational from transactional systems, operational databases, and line of business applications Non-relational and relational from IoT devices, websites, mobile apps, social media, and corporate applications
Schema Designed prior to the DW implementation (schema-on-write) Written at the time of analysis (schema-on-read)
Price/Performance Fastest query results using higher cost storage Query results getting faster using low-cost storage
Data Quality
Highly curated data that serves as the central version of the truth Any data that may or may not be curated (ie. raw data)
Users Business analysts Data scientists, Data developers, and Business analysts (using curated data)
Analytics Batch reporting, BI, and visualizations Machine Learning, Predictive analytics, data discovery, and profiling

By housing a data lake, healthcare enterprises have the power to not just accommodate disparate data sets but also craft strategic tools, like analytics and insight sharing with payers and providers through a well-configured data mart. Looking into the future, health systems and enterprises have tremendous potential for staying highly cost-effective while also reliably processing large chunks of data in real-time, accelerating query processing and building a population health management model on its own accord.

Avoiding the data swamp: How to maintain a healthy healthcare data lake

While one of the greatest appeals of a data lake is its ability to accept unstructured data and break down silos, the tradeoff is having large amounts of data that is of no value to its users because the data lake has become a swamp and users are unable to access the information they need. With no constraints on where data in a data lake originates and no regulation on what is done with that data once it is ingested into the data lake, how can you ensure that quality data is being ingested into the repository?

Metadata tags: Organizations should consider the use of metadata tagging to add some level of organization to what’s ingested to make it easier for data lake users to find relevant data queries and analyses.

Catalog: Providing a catalog that makes data visible and accessible to all data lake stakeholders can mitigate future challenges with the data lake.

Parameters: To avoid your data lake from transforming into a data swamp, detailed parameters around what kind of data and how much data resides in the lake should be set.

Data governance: Establishing data governance – who handles the data, where it resides and how long the data is retained – is vital to this solution’s success. This concept of data provenance will continue to grow as more disparate data sets are merged, de-duplicated and consumed for downstream activities – especially critical when treatment decisions are made on such data insights.

Here to stay

There is a great deal of buzz around data lakes – especially in the healthcare industry – but given the complexity of technology requirements and architecture necessary to support a data lake, organizations are taking their time to complete the appropriate due diligence.

For companies interested in developing a data lake solution, the first step is to adopt the necessary data architectures. This means leveraging the cloud for architectures, which means a financial investment to build a data repository. While this tends to be a large undertaking and a financial investment for most organizations, the ROI is worth it in the long term.

Data lakes will continue to add value to organizations by breaking down data silos and empowering users with actionable insights to allow them to make data-driven clinical and quality decisions that contribute to measurable impact for members and the economic performance of the health plan. To learn how Inovalon can help your organization unlock the value of data lakes, check out our data lake solution.


Inovalon and design® and Inovalon® are trademarks of Inovalon, Inc. 

About the author

Hemanth Doma, Associate Vice President, Product