Across the healthcare ecosystem, payers, providers, pharmacies and life sciences organizations are turning to data lakes to unite disparate structured and unstructured data from multiple sources, such as claims data, clinical data, social determinants of health and quality insights, to name a few. But what exactly are we talking about when we talk about a data lake?
The term “big data” was coined well over a decade ago and has gained momentum as the volume of data collected across all sectors has grown exponentially. How do big data and data lakes intersect? And is a data lake the same as a data warehouse? James Dixon, the CTO of Pentaho, who is credited with coining the term “data lake,” described the concept to Forbes:
“If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
In short, a data lake serves as a centralized repository where both structured and unstructured data are stored at any scale. The healthcare industry holds a wealth of disparate patient data that needs to be structured and streamlined to help physicians, healthcare organizations, payers and scientific researchers make informed, data-driven decisions. A cloud-based healthcare data lake solution combines this complex and disparate data and delivers analytics that help solve business and clinical problems.
According to a white paper from Scalable Health, the benefits and potential impact of a healthcare data lake include a comprehensive view of patient care, the ability to process huge data volumes at once, enhanced query processing, lower costs, and a faster time to greater insights.
A data lake serves as a solution for companies that have large amounts of data scattered across different places and managed by different groups. A centralized repository keeps that data from being hidden in silos. Since there are no predefined schemas for the data imported into data lakes, users can ingest data in real time. With the right governance, data lake administrators also don’t have to manage access across multiple databases: the data lake's security controls grant authorized users access to see, process or modify a specific asset while keeping unauthorized users out, which helps prevent confidential data from being compromised.
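The per-asset access control described above can be pictured with a minimal sketch. The user names, asset name, and the `GRANTS` table below are hypothetical and for illustration only; a real data lake would rely on its platform's own access management rather than an in-memory dictionary.

```python
from enum import Flag, auto


class Permission(Flag):
    """The three kinds of access mentioned above: see, process, modify."""
    NONE = 0
    VIEW = auto()
    PROCESS = auto()
    MODIFY = auto()


# Hypothetical grants, keyed by (user, asset). Anything not listed is denied.
GRANTS = {
    ("analyst", "claims_2023"): Permission.VIEW | Permission.PROCESS,
    ("steward", "claims_2023"): Permission.VIEW | Permission.PROCESS | Permission.MODIFY,
}


def is_allowed(user, asset, needed):
    """Return True only if every requested permission has been granted."""
    granted = GRANTS.get((user, asset), Permission.NONE)
    return (granted & needed) == needed


print(is_allowed("analyst", "claims_2023", Permission.VIEW))    # True
print(is_allowed("analyst", "claims_2023", Permission.MODIFY))  # False
print(is_allowed("guest", "claims_2023", Permission.VIEW))      # False
```

Keeping the grants in one place, rather than one permission list per database, is the point of the sketch: a single check covers every asset in the repository.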
In contrast to a data lake, a data warehouse stores data in an organized, structured manner, with everything archived and ordered in a defined way. Think of a computer drive with separate folders for pictures, videos, documents and downloaded content, where every file has a designated folder. The concept is the same for a data warehouse. A data warehouse also takes more time to develop than a data lake: it requires significant effort in the initial stages, because each data source must be analyzed and its business use clearly defined before the data is loaded into the warehouse. A data lake, on the other hand, stores everything in one large repository rather than sorting it into predefined structures as a data warehouse does. Both structured and unstructured data can be stored at any scale, so the data going into the data lake doesn’t have to be analyzed or evaluated for future business use beforehand.
A typical data lake contains multiple sources of data, brought together either in a single integration layer or by integrating multiple channels through an API connected to the enterprise data lake. Such sources include claims and enrollment data, quality and risk insights, genomic health, linked external datasets, data coming from ACOs and health systems, national registries, and analytics coming from client and vendor entities.
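The difference described above is often summarized as schema-on-write (warehouse) versus schema-on-read (lake). The following is a minimal sketch of that idea, not a real warehouse or lake implementation; the schema, field names, and sample records are all hypothetical.

```python
import json

# Hypothetical fixed schema that a warehouse table might enforce at load time.
WAREHOUSE_SCHEMA = {"member_id": str, "claim_amount": float}


def load_into_warehouse(table, record):
    """Schema-on-write: reject any record that does not match the schema."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field_name, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[field_name], expected_type):
            raise ValueError(f"{field_name} must be {expected_type.__name__}")
    table.append(record)


def ingest_into_lake(lake, raw_payload):
    """Schema-on-read: store the raw payload as-is, with no validation."""
    lake.append(raw_payload)


def read_claim_amounts(lake):
    """Apply structure only at query time, skipping payloads that do not fit."""
    amounts = []
    for raw_payload in lake:
        try:
            record = json.loads(raw_payload)
        except json.JSONDecodeError:
            continue  # e.g. free-text clinical notes
        if isinstance(record, dict) and "claim_amount" in record:
            amounts.append(float(record["claim_amount"]))
    return amounts


warehouse_table = []
lake = []

claim = {"member_id": "M001", "claim_amount": 125.50}
note = "Patient reports mild headache; follow up in two weeks."

load_into_warehouse(warehouse_table, claim)  # passes the schema check
ingest_into_lake(lake, json.dumps(claim))    # structured data
ingest_into_lake(lake, note)                 # unstructured text, accepted unchanged

print(read_claim_amounts(lake))  # prints [125.5]
```

The warehouse path would raise a `ValueError` for the free-text note, while the lake accepts it and leaves interpretation to whoever reads the data later.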
A 2017 Aberdeen survey found that organizations that implemented a data lake outperformed similar companies by 9% in organic revenue growth. An effective healthcare data lake solution provides a single source of truth for data that supports advanced reporting, analytics initiatives, and other use cases. The structured and unstructured components of the data lake can be transformed into a data mart, which can then be configured with the latest technologies to quickly support numerous business use cases and consumption models. That ability underscores the power of getting the foundation right.
Another key benefit of adopting a data lake is that it makes it easier for organizations to train and deploy more accurate machine learning and artificial intelligence (AI) models. AI and machine learning models, which are often built with languages such as Python, thrive on large, diverse datasets, and a data lake serves as a powerful foundation for training new algorithms.
For an even more detailed explanation of the key benefits of a data lake and how it differs from a data warehouse, review the table below, originally published by Amazon:
By housing a data lake, healthcare enterprises can not only accommodate disparate data sets but also build strategic tools, such as analytics and insight sharing with payers and providers, through a well-configured data mart. Looking to the future, health systems and enterprises have tremendous potential to stay cost effective while reliably processing large volumes of data in real time, accelerating query processing and building their own population health management models.
While one of the greatest appeals of a data lake is its ability to accept unstructured data and break down silos, the tradeoff is the risk of holding large amounts of data that are of no value to users, because the data lake has become a swamp and users are unable to find the information they need. With no constraints on where data in a data lake originates and no regulation of what is done with that data once it is ingested, how can you ensure that quality data is going into the repository?
Metadata tags: Organizations should consider the use of metadata tagging to add some level of organization to what’s ingested, making it easier for data lake users to find relevant data for queries and analysis.
Catalog: Providing a catalog that makes data visible and accessible to all data lake stakeholders can mitigate future challenges with the data lake.
Parameters: To keep your data lake from turning into a data swamp, set detailed parameters around what kind of data, and how much of it, resides in the lake.
Data governance: Establishing data governance (who handles the data, where the data resides and how long the data is retained) is vital to the success of a data lake. The related concept of data provenance will continue to grow in importance as more disparate data sets are merged, de-duplicated and consumed for downstream activities, which is especially critical when treatment decisions are made on such data insights.
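The four practices above can be pictured together in a small sketch of a catalog that enforces them at registration time. Everything here is hypothetical and for illustration only: the `CatalogEntry` and `Catalog` classes, the allowed sources, the size limit, and the sample asset are assumptions, not features of any particular data lake product.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Hypothetical admission parameters: what kind of data, and how much, is allowed in.
ALLOWED_SOURCES = {"claims", "clinical", "registry"}
MAX_ASSET_BYTES = 5_000_000


@dataclass
class CatalogEntry:
    name: str
    source: str
    owner: str            # governance: who handles the data
    location: str         # governance: where the data resides
    ingested_on: date
    retention_days: int   # governance: how long the data is retained
    tags: set = field(default_factory=set)  # metadata tags for discovery

    def expires_on(self):
        return self.ingested_on + timedelta(days=self.retention_days)


class Catalog:
    """Makes registered assets visible and searchable by tag."""

    def __init__(self):
        self.entries = {}

    def register(self, entry, size_bytes):
        # Enforce the parameters before anything is admitted to the lake.
        if entry.source not in ALLOWED_SOURCES:
            raise ValueError(f"source not allowed: {entry.source}")
        if size_bytes > MAX_ASSET_BYTES:
            raise ValueError(f"asset too large: {size_bytes} bytes")
        if not entry.tags:
            raise ValueError("at least one metadata tag is required")
        self.entries[entry.name] = entry

    def find_by_tag(self, tag):
        return sorted(n for n, e in self.entries.items() if tag in e.tags)

    def expired(self, today):
        return sorted(n for n, e in self.entries.items() if e.expires_on() <= today)


catalog = Catalog()
catalog.register(
    CatalogEntry(
        name="claims_2023",
        source="claims",
        owner="claims-team",
        location="lake/raw/claims/2023",
        ingested_on=date(2023, 1, 15),
        retention_days=365,
        tags={"claims", "phi"},
    ),
    size_bytes=1_200_000,
)

print(catalog.find_by_tag("phi"))         # prints ['claims_2023']
print(catalog.expired(date(2024, 6, 1)))  # prints ['claims_2023']
```

Rejecting untagged or out-of-scope data at the door is one way to keep the lake from filling with assets nobody can find or justify keeping.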
There is a great deal of buzz around data lakes, especially in the healthcare industry, but given the complexity of the technological requirements and architecture needed to support a data lake, organizations are taking their time to complete the appropriate due diligence.
For companies that are interested in developing a data lake solution, the first step is to adopt the necessary data architectures. This means leveraging the cloud for data lake architectures, which in turn requires a financial investment to build a data repository. While this tends to be a large undertaking for most organizations, the ROI is worth it in the long term.
Data lakes will continue to add value to organizations by breaking down data silos and empowering users with actionable insights to allow them to make data-driven clinical and quality decisions that contribute to measurable impact for members and the economic performance of the health plan.