What big data is, the new tools it calls for, and how it is in daily use in labs today.
By Jim Poggi
Both quantitative and qualitative data have been the backbone of the lab business and its value to the clinical community since its inception, so much so that there is a widely held perception that lab data influences 7 out of 10 medical decisions. While there is no current quantitative market data to verify that bit of wisdom, we do know that over 266,000 labs are reporting patient data, with an annual spend of over $105 billion and a growth rate of over 9% as of 2021.
Hospital and POL testing is rebounding from its decline during the COVID pandemic, and home testing is up sharply. Current estimates place the home testing market for 2022 at $1.5 billion and growing. So, lab clearly remains a leading factor in the assessment and management of patient health. What do these trends have to do with big data? Everything, really. The number and complexity of lab tests is increasing, and the number of sites performing lab tests is also growing.
Testing is becoming increasingly decentralized. Even more to the point, the very complex algorithms being used to understand the predictive value of next generation sequencing for tumor cell genotypes, and of liquid biopsy in general, have created a tremendous need for more patient data to assure the predictive models are accurate and actionable by clinicians. Further complicating the situation is the introduction of over 300 new COVID assays under EUA in under two years.
As a result, the demand to acquire, store and interpret the complex web of personal and population data, to understand which tests are most useful, which are under- or over-utilized, and which have the greatest positive impact on treatment programs, is becoming a prominent concern, and one that is attracting multiple solutions. In this column I intend to discuss what big data is, the new tools it calls for, and how it is in daily use in labs today, as well as how it may be used to even greater advantage down the road.
What is big data?
The term “big data” entered our vocabulary in the 1990s and has had many interpretations and definitions since then. The one I like the best combines two different definitions that each tell part of the story.
My big data definition: “Big data refers to any data set that challenges or exceeds an individual’s ability to manually evaluate all data points for clinical relevance. The assessment of big data uses sophisticated analytical tools that reveal otherwise unrecognized patterns.”
What I like about this definition is that it is simple and yet insightful. First, it acknowledges that big data involves so many data elements, often from a broad range of sources, and requires such sophisticated data management tools, that no single person can aggregate and analyze the data set alone.
Second, and more importantly, it emphasizes that analysis of the data provides clinical relevance and reveals otherwise unrecognized patterns. To make this a bit more concrete, think about the number of data points generated from just one assay in one year: comprehensive metabolic profiles alone generated more than 582 million data points under Medicare in 2016. The top five lab tests that year generated more than 1.45 billion Medicare data points. In any given year, tens of billions of lab data points are generated. That is a lot of data, and assessing and understanding it requires a lot of analytical power. We will see how this power can help characterize the performance of tests in development, inform the clinical utility of current tests we simply assume to be useful, and support the day-to-day work of the laboratory. Along the way, big data not only creates the need for new analytical tools, it also creates the need for new ways to standardize our test definitions and results nomenclature.
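For a sense of the arithmetic behind those figures, here is a minimal sketch. The panel volumes are illustrative only (the comprehensive metabolic profile line is back-calculated from the 582 million figure above, assuming roughly 14 reported analytes per panel); real counts would come from CMS utilization data.

```python
# Rough, illustrative arithmetic only -- panel volumes are hypothetical
# placeholders, not actual Medicare claims figures.
panels = {
    # test name: (annual test volume, reported analytes per test)
    "Comprehensive metabolic profile": (41_600_000, 14),  # ~14 analytes each
    "Lipid panel": (30_000_000, 4),
    "CBC with differential": (35_000_000, 20),
}

total_points = 0
for name, (volume, analytes) in panels.items():
    points = volume * analytes
    total_points += points
    print(f"{name}: {points:,} data points")

print(f"Total: {total_points:,} data points")
```

The point is not the exact numbers but the multiplication: every ordered panel fans out into many reportable results, and those results accumulate into the billions very quickly.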
Where is it generated and stored?
There is a somewhat surprising wealth of sites where lab tests are performed. We typically think of hospitals, reference labs and physician offices. We also consider urgent care and free-standing emergency room settings. But not all lab data is associated with the care of a specific patient: data is also collected in manufacturer R&D settings and in university and government research facilities, and the latest trend is for testing to go home.
Where is it stored? Well, that’s part of the challenge for big data and the U.S. health care system in general. Most data specifically associated with individual patient care is collected at the instrument level, reported through the laboratory information system, and passed on to the electronic medical record and, in some instances, to the hospital information system. This same data gets reported to Medicare, Medicaid and a variety of private insurance companies, each of which eventually sees a subset of the data.
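To make that hand-off a little more concrete, here is a minimal sketch of the kind of structured record that moves from instrument to LIS to EMR. The segment is a simplified, illustrative HL7 v2-style result line; the patient value, units and reference range are made up for the example.

```python
# Simplified, illustrative parse of an HL7 v2-style OBX (result) segment --
# the kind of structured record passed from instrument to LIS to EMR.
# Values are made up; real messages carry many more segments and fields.
segment = "OBX|1|NM|2345-7^Glucose^LN||95|mg/dL|70-99|N|||F"

fields = segment.split("|")
code, name, coding_system = fields[3].split("^")

result = {
    "analyte_code": code,            # 2345-7 (LOINC code for serum glucose)
    "analyte_name": name,            # Glucose
    "coding_system": coding_system,  # LN = LOINC
    "value": float(fields[5]),       # 95
    "units": fields[6],              # mg/dL
    "reference_range": fields[7],    # 70-99
    "abnormal_flag": fields[8],      # N = normal
}
print(result)
```

Each stop along the chain stores some version of this record, which is exactly why the same result can end up in several systems under slightly different labels.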
It is important to note that there is no oversight body currently empowered, or capable, of accessing all the lab data collected annually. While it is likely that analysis of the full set of lab data could have meaningful implications for population health and for establishing “best practices” on which tests to order, for whom and when, there are a number of philosophical and practical challenges to this sort of data aggregation. Not least among them are how to safeguard access to the data, the implications for personal data privacy and integrity, and the general sense of discomfort in knowing that “someone has all the lab data somewhere and I wonder what they intend to do with it.” Many, if not most, IDNs and other healthcare networks can already aggregate data across the multiple testing platforms within their networks, extract it, analyze it and draw conclusions from these masses of data.
What new tools are coming into play and why?
Given that, by definition, big data does not lend itself to individual interpretation, and that typical spreadsheet and database tools were not developed with these massive data sets in mind, where is data analysis headed? First off, there are multiple groups working to standardize the reporting of specific test results across networks. This is more difficult than you might believe, even with established CPT codes and result units of measurement.
One of the more prominent leaders in codifying these data sets is LOINC (Logical Observation Identifiers Names and Codes), whose mission is to create a universal vocabulary for identifying test results, so that any specific analyte is reported uniformly across platforms. Essentially, LOINC is a vocabulary overlaid on top of the local test result reporting of any individual reporting entity. You can think of it as the translator at the United Nations that permits any spoken language to deliver the same message to all attendees at the same time. Its adoption is growing among IDNs, individual hospitals and private reference laboratories, as well as state departments of health and manufacturers of lab and vital signs instruments. LOINC covers not only lab results but also vital signs measurements. Time will tell whether this becomes the standard or another standard comes into play. But one thing is certain: standardization has become a powerful force in the world of lab testing, and the drive to improve it will continue unabated irrespective of which standards are ultimately implemented.
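As a minimal sketch of what that translation layer looks like in code, consider a table mapping a lab’s local result codes onto LOINC. The local codes here are made up, and the LOINC codes, while believed correct, should be verified against the current LOINC database before any real use.

```python
# Minimal sketch: mapping local, lab-specific result codes to LOINC so the
# same analyte reads identically across reporting systems.
# Local codes are invented; LOINC codes are illustrative -- verify against
# the current LOINC database before relying on them.
LOCAL_TO_LOINC = {
    "HEME_HGB":   "718-7",    # Hemoglobin [Mass/volume] in Blood
    "CHEM_CREAT": "2160-0",   # Creatinine [Mass/volume] in Serum or Plasma
}

def to_loinc(local_code: str) -> str:
    """Translate a site-specific test code to its LOINC equivalent."""
    try:
        return LOCAL_TO_LOINC[local_code]
    except KeyError:
        raise ValueError(f"No LOINC mapping defined for {local_code!r}")

# Two labs with different local codes now report the same analyte identically
print(to_loinc("CHEM_CREAT"))  # -> 2160-0
```

Once every reporting site applies the same mapping, a creatinine from one hospital and a creatinine from another can finally be pooled and compared as the same analyte.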
New statistical tools, including the R statistical programming language, are also making a difference, since they were developed specifically to handle these very substantial data sets and to provide query and analytical methods that create meaning from the data and reveal previously unrecognized patterns. Genomic laboratories in particular are adopting these tools in growing numbers.
Both artificial intelligence and machine learning have been touted as emerging tools to help laboratorians and clinicians find underlying connections and cause-and-effect relationships, and to develop actionable conclusions from masses of otherwise disparate data. The sophisticated algorithms inherent in these tools, and their ability to become more “insightful” and useful over time, are key advantages, especially in next generation sequencing, where they are in wider use than in any other laboratory discipline. As their abilities and limitations become clearer over time, there is no doubt that their applications will become more widespread.
What impact does big data have day to day?
In my research, I found a surprising and exciting range of applications at the individual laboratory level. One of the most interesting involves labs accurately assessing and adjusting the “moving averages” of normal values for lab results. This is a highly valuable QC tool to assure labs are identifying truly abnormal patient results rather than a shift in the normal values themselves. Sound esoteric? Well, consider this: once upon a time the normal human temperature was considered to be 98.6 degrees Fahrenheit. That figure dates from the 19th century. Today, the average temperature of the population in developed countries is dropping, with a recent British study pegging the average of more than 25,000 patients at 97.9 degrees Fahrenheit. So, monitoring changes in the “mean of normal values” is a very big deal.
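A minimal sketch of that idea, assuming patient results arrive in chronological order: keep a rolling average of recent results and flag when it drifts away from the expected population mean. The window size, target mean and drift limit below are illustrative; a real lab would derive them from its own historical data.

```python
# Minimal sketch of patient-based moving-average QC. All parameters are
# illustrative; a real lab would set them from its own historical data.
from collections import deque
import random

WINDOW = 50            # number of recent patient results to average
TARGET_MEAN = 98.6     # expected population mean (illustrative)
ALLOWED_DRIFT = 0.5    # flag if the moving average drifts this far

def monitor(results):
    """Yield (index, moving_average, in_control) for a stream of results."""
    window = deque(maxlen=WINDOW)
    for i, value in enumerate(results):
        window.append(value)
        if len(window) == WINDOW:
            avg = sum(window) / WINDOW
            in_control = abs(avg - TARGET_MEAN) <= ALLOWED_DRIFT
            yield i, round(avg, 2), in_control

# Example: a slow downward shift in the population mean trips the flag
random.seed(1)
stream = [random.gauss(98.6, 0.4) for _ in range(200)] + \
         [random.gauss(97.9, 0.4) for _ in range(200)]
for i, avg, ok in monitor(stream):
    if not ok:
        print(f"Result #{i}: moving average {avg} outside limits")
        break
```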
One lab asked whether folate testing was still useful, given that abnormally low folate levels indicate an anemia associated with insufficient dietary folate intake, which would be uncommon in the U.S. Sure enough, in a study of 89,000 patient folate results, only 4 abnormals were found. The result? That hospital lab discontinued offering folate testing as a routine procedure.
In another study, a large group of patients with uncontrolled hypertension (a serious, ongoing healthcare system problem) were put on a strict protocol of dietary, exercise and lifestyle management changes. The result? Over 69% of the previously uncontrolled hypertensive patients achieved control of their hypertension.
Big data also has history on its side. Proficiency testing has been with us for many years and represents a well-established use of lab data to compare individual labs’ results with their peers’ and to rate their performance objectively. There is a substantial number of other applications, including analysis of an IDN’s continuous glucose monitoring data to determine treatment outcomes, comparison of local data sets for specific analytes with larger data sets, and utilization analysis across networks (which tests are most commonly used, how they impact patient treatment programs, and which need to be used more or less often). In quality control systems on LIS platforms, one application is determining which autoverification rules can safely and effectively reduce unnecessary reflex testing; a simple sketch follows below. There is also an emerging trend for “lab benefit managers” to review these larger data sets and set standards for which tests are medically necessary for which patient conditions. There is a lot going on here.
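Here is a minimal sketch of what an autoverification rule can look like, assuming a simple result record with a reference range and an optional prior result for a delta check. The analyte, limits and delta threshold are illustrative; actual rules are set by the lab’s medical director and implemented in the LIS vendor’s rule engine.

```python
# Minimal, illustrative autoverification rule -- not any vendor's actual logic.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
    analyte: str
    value: float
    low: float                        # reference range low
    high: float                       # reference range high
    previous: Optional[float] = None  # prior result for the same patient

def auto_verify(r: Result, delta_limit: float = 20.0) -> bool:
    """Release without technologist review only if the result is within the
    reference range and not wildly different from the prior result."""
    in_range = r.low <= r.value <= r.high
    delta_ok = r.previous is None or abs(r.value - r.previous) <= delta_limit
    return in_range and delta_ok

# An in-range glucose with a small delta is released automatically
print(auto_verify(Result("glucose", 92.0, 70.0, 99.0, previous=88.0)))  # True
print(auto_verify(Result("glucose", 250.0, 70.0, 99.0)))                # False
```

Every result a rule like this releases automatically is one a technologist never has to touch, which is where the workflow and economic value show up.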
How will big data help us over time?
Big data’s impact on understanding genomic data, including risk assessments based on tumor genotyping, is already being felt. These data sets are so complex and involve so many disparate elements of the patient or tumor genotype (gene translocations, substitutions, deletions and insertions) that large-scale data analysis is required to make sense of them, to interpret which genotype changes are associated with differential risk of cancer progression, and to determine which treatments are most likely to succeed. In developing new genomic and liquid biopsy tests, and in understanding which markers have the most impact on patient care, big data is essential and is making powerful contributions to diagnosis, patient monitoring and treatment plan implementation.
Longitudinal patient data assessment is a powerful application already in use in many thought-leading IDNs and tertiary care medical centers. The applications range from determining which tests are most useful for which clinical conditions, to comparing lab test utilization among clinicians to determine best practices in patient management, to reducing redundant testing patterns. Applications in quality control management are also obvious, and include understanding which tests and/or which laboratorians are performing at the highest level and how to optimize results within the laboratory and among laboratories.
Big data is here, and here to stay. Work actively with your key lab manufacturers and your key customers to understand which emerging big data elements are relevant and how you can work collaboratively to improve all three elements of value for your customer: clinical, workflow and economic value. It’s no longer just about generating and storing the data; the need now is to find previously unrecognized patterns and to harness that knowledge to improve patient care. Stay informed and stay at the cutting edge to improve your value as a lab consultant.