Medicine is generating approximately a quintillion bytes of data every day, and within those admittedly disorganized, unstructured, or uncurated data streams are transformative insights about patient care and disease treatment.
“Big Data” systems, which include advanced software algorithms like artificial intelligence and machine learning, have the potential to unlock those insights, thereby improving the predictive value of medical information, as well as the safety and efficacy of treatments and products.
The first area where Big Data is likely to affect medical practice is decision support systems such as the following:
Patient care and monitoring coordination. Already, major hospital systems are testing Big Data programs to standardize and improve the accuracy and precision of patient care, both preoperatively and postoperatively. So far, these pilot programs are improving patient care while also reducing costs.
Image processing. Big Data systems are demonstrating the ability to “read” computed tomography scans and other digital images more rapidly and accurately than humans.
Electronic health record management. Electronic health records, which are both a time and cost sink for a majority of physicians, are the target of several novel Big Data programs that can automate, standardize, and improve the process of updating patient records.
Reimbursement and Claims Management. Not only are Big Data systems reducing coding and billing error rates, they are also being employed to measure individual physician or clinic tendencies and performance.
Big Data systems are likely to have some, albeit comparatively minor, impact on drug and biological discovery and device design. More likely, Big Data will be employed by regulators to evaluate product safety and efficacy and to track postmarket outcomes and adverse events.
In the various “omics” fields, there is a land rush to apply Big Data systems to mapping genomics, proteomics, and metabolomics data to clinical diagnostic and outcome data. This is exciting but also raises the knotty issue of correlative information versus causative information.
Two central problems limit the ability of Big Data systems to improve the predictive value of medicine:
Garbage In/Garbage Out. Almost by definition, Big Data systems deal with unstructured, uncurated, and disorganized data. Therefore, a key element of Big Data systems is the associated set of software tools that clean, process, validate, cluster, and visualize data (a minimal sketch of such a cleaning step follows this list).
Data Privacy. Big Data systems have the ability to break down one of the foundations of modern medicine: patient/doctor confidentiality. Numerous hospital systems have already experienced malicious breaches of their patient data systems, and justified fears of deanonymization of patient data may limit the growth of Big Data systems.
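As a minimal illustration of the kind of cleaning and validation tooling described above, the following sketch (plain Python; the field name, the plausible range, and the toy records are illustrative assumptions, not drawn from any particular hospital system) standardizes one numeric field and discards missing or implausible entries.

# Toy incoming records; the field name and values are invented for illustration.
raw_records = [
    {"patient_id": "A1", "systolic_bp": " 128 "},
    {"patient_id": "A2", "systolic_bp": "n/a"},    # missing entry
    {"patient_id": "A3", "systolic_bp": "1200"},   # implausible value (likely entry error)
    {"patient_id": "A4", "systolic_bp": "141"},
]

def clean_systolic_bp(value, low=60, high=260):
    """Return a validated numeric reading, or None if the entry is missing or implausible."""
    try:
        reading = float(value.strip())
    except (ValueError, AttributeError):
        return None
    return reading if low <= reading <= high else None

cleaned = [
    {**record, "systolic_bp": clean_systolic_bp(record["systolic_bp"])}
    for record in raw_records
]
print(cleaned)  # downstream steps would cluster, model, or visualize the cleaned values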
Finally, all consumers of Big Data medical information should understand that Big Data systems rely on finding patterns in data and, therefore, are not a substitute for understanding biological processes. Big Data systems are tools, no more and no less.
“Big Data” is a shorthand way of referring to any data/information system that can only achieve relevant levels of predictive value by using very large data sets. A decade ago Big Data was principally a tool for insurance companies and casinos. It is now defining the future of medicine. Whether it is the digitization of images, unstructured print information, or patient records (which now exist in networked provider systems) or the explosion of digital genomic, metabolomic, or phenotype data, the fundamental reality is that medicine is generating approximately a quintillion bytes of data every day. Within those admittedly disorganized, unstructured, or uncurated data streams are transformative insights into patient care and disease treatment. Big Data systems have the potential to unlock those insights. To achieve this potential, Big Data systems require advanced mining and analytical tools, including artificial intelligence (AI) or machine learning algorithms. In this chapter we will describe how Big Data systems work, generically, how they will likely change patient care and the day-to-day life of every physician, and what the problems and limitations inherent in Big Data systems are.
Electronic health records
Genomic and proteomic data
Peer-reviewed studies for publication
U.S. Food and Drug Administration–approved tests for at-home use
Unpublished clinical trials
Smartphones and wearable devices
Medical imaging data
Social media text data
Electronic survey data
Table 173.1 presents Big Data terminology.
Big Data Neural Network | A series of algorithms that attempt to recognize relationships in datasets through a process that mimics the way the human brain operates.
Bag-of-Words (BOW) | BOW is a simplified text-representation model used in natural language processing and information retrieval. The model represents text (such as a sentence or a document) as a bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity (the number of times a word appears in the text). BOW is often used to classify documents, where the frequency of occurrence of each word serves as a feature for “training” a classifier in machine learning (ML) models (a minimal worked example follows Table 173.1).
Conditional Random Fields | A type of statistical modeling often applied to pattern recognition and ML and used for structured predictions. |
Data Robot | Operates on structured data but can also find key phrases within unstructured text. |
Dictionary-Based Systems | Software systems that key off of a dictionary of terms or phrases for special uses or purposes. |
Discovery Stage | Where useful, unexpected, or unknown information is extracted from textual data by way of text mining and artificial intelligence software. |
Electronic Health Records | More than 90% of all US hospitals employ computer-based systems to collect, store, and share patients’ medical information. Each year US hospitals generate more than 50 petabytes (50 × 10¹⁵ bytes) of data.
G∗Power | Software for computing statistical power for t-tests, F-tests, and z-tests.
Hybrid Text Mining Systems | A text mining software system that employs a combination of different classifiers, hence the term “hybrid.” Combining classifiers can increase performance, but the classifiers must be well chosen.
Loom Systems | Take as input structured and unstructured data, either as batch or as streaming data, and visualize the generated data models. |
Machine Learning Systems | Software systems, typically using artificial intelligence, that enable computers or computing “things” to learn and improve automatically from experience without being explicitly programmed.
MATLAB | A high-performance language for technical computing that integrates computation, visualization, and programming in an easy-to-use environment.
MedTime system | Temporal information extraction system using rule-based and support vector and conditional random fields ML systems to extract date, time, duration, and frequency from clinical studies. |
Moore’s Law | The principle that the speed and capability of computers will double approximately every 2 years as a result of the growing number of transistors a microchip can contain.
Named Entity Recognition | The process of tagging and extracting named entities in different chunks of textual data; for example, disease name, symptom, anatomical region, treatment, device, pharmaceutical or biologic. |
Natural Language Processing | Subfield of linguistics that employs computer science, information engineering, and artificial intelligence software and systems to “read,” process, and analyze textual or spoken language. |
NegEx | Algorithm to detect negations. |
Personal Data Anonymization | The process of removing and/or disguising personal and sensitive information during text mining. |
Preprocessing | Typically applied to text mining, where textual and unstructured data are standardized and cleaned. |
P4 Medicine | A rubric derived from the words predictive, preventive, personalized, and participatory that is an aspirational definition of medicine.
Quill (Narrative Science) | Generates, from raw numerical input data, a data model expressed in natural language (i.e., in English), using natural language generation to write a story about the most important information found in the data; it takes structured data only as input.
Rule-Based Software Systems | Software systems that rely on human-crafted or curated rule sets. |
Statcheck | Web-based algorithm that checks the validity of statistical data in papers in both .pdf and .html format and recalculates p-values and other measurements. |
Stemming | A process from linguistic morphology and information retrieval that reduces words to their word stem.
Support Vector Machine | Supervised learning model with associated learning algorithms that analyze data for classification and regression analysis.
Text Mining | Techniques for collecting high-quality structured information from unstructured text data. |
Text Representation | Where unstructured data is translated, transformed, or converted into a representational model that facilitates data analysis. |
Tokenization | Data security process that substitutes sensitive data elements with nonsensitive data equivalents referred to as tokens. |
Tree-Based Pipeline Optimization Tool | Open source tool that automatically optimizes feature selection methods and data models to improve accuracy. |
Unified Medical Language System Metathesaurus | The Unified Medical Language System is a set of files and software that aggregates many health and biomedical vocabularies and standards to enable interoperability between computer systems. |
Waikato Environment for Knowledge Analysis | A free collection of visualization tools and algorithms for data analysis and predictive modeling. |
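To make the bag-of-words model in Table 173.1 concrete, the following minimal sketch (plain Python; the two example phrases and all variable names are illustrative assumptions) converts two short clinical phrases into word-count feature vectors of the kind an ML classifier could be trained on.

from collections import Counter

# Two toy "documents"; the phrases are invented for illustration only.
docs = [
    "patient denies chest pain but reports chest tightness",
    "patient reports chest pain radiating to left arm",
]

# Tokenize: lowercase and split on whitespace (grammar and word order are discarded).
tokenized = [doc.lower().split() for doc in docs]

# Build a shared vocabulary across all documents.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Represent each document as a vector of word counts (its "bag of words").
vectors = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

for doc, vec in zip(docs, vectors):
    print(doc)
    print(dict(zip(vocabulary, vec)))

Note that word order is discarded, so the negation “denies” in the first phrase survives only as a single token, a limitation that negation detectors such as NegEx (Table 173.1) are designed to address.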
Big Data systems can only deliver predictive value by analyzing massive amounts of data and, in sum, relying on the law of large numbers. The law of large numbers states that the empirical predictability of any phenomenon will tend to converge with the theoretical predictability of that same phenomenon as the number of observations increases. For example, the probability that the number 1 will appear on a single roll of a fair, six-sided die is 16.7%. However, it is very possible, even likely, that after six rolls of the die the number 1 will not appear. After one thousand rolls of a fair, six-sided die, however, the likelihood that the number 1 appears 16.7% of the time is highly predictable and empirically demonstrable (Fig. 173.1). Given a large enough “n,” many apparently random phenomena become highly predictable. That is the law of large numbers.
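The convergence can be demonstrated with a short simulation (plain Python; the roll counts and random seed are arbitrary choices): the observed frequency of rolling a 1 can drift far from 1/6 after a handful of rolls but settles very close to 16.7% as the number of rolls grows.

import random

random.seed(0)  # fixed seed so the illustration is reproducible

# Observed frequency of rolling a 1 on a simulated fair six-sided die, for increasing n.
for n in (6, 60, 600, 6_000, 600_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    frequency = rolls.count(1) / n
    print(f"n = {n:>7}: observed frequency of 1 = {frequency:.3f} (theoretical 0.167)")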
Entire industries are based on this law. Insurance, pension, and retirement companies all rely on the law of large numbers, as does Las Vegas. For medicine, however, the quantity of data required to have clinically predictive value is orders of magnitude greater than that required for these other industries.
All medicine is an explicit or implicit prediction. And each decision has an inherent predictive value – or, as it is more commonly referred to, a P -value. Every pharmaceutical, biological, device, or, increasingly, treatment-related software comes to market by way of regulatory review, clearance, approval, or license – which establishes the base level of efficacy and safety predictability that healthcare providers rely on to treat their patients. Big Data’s value to medicine lies in its potential to improve the predictability of diagnostics and treatments while also helping suppliers to develop novel therapies and products more efficiently.
Three fairly recent developments make Big Data possible.
Dramatically improved speed, power, and capacity of data storage and processing (Moore’s law).
New software tools like artificial intelligence (AI), machine learning (ML), and infrastructure-enabling software.
Dramatically lower costs of collecting and mining data.
Coincident with the arrival of Big Data in medicine are critical concerns regarding privacy, data quality, and, finally, reliance on correlative data at the possible expense of causative data.
Big Data systems are already being employed in medicine. Here are some of the more common current applications and the more likely future applications.
One of the most advanced initiatives in applying Big Data systems is to manage and bundle the doctor–patient engagement process, follow-up care coordination, the unique needs of specific patient populations, chronic disease care, pharmaceutical dosing, contraindications, reactions and uses, and claims management. In effect, Big Data systems can be applied to the task of standardizing and, presumably, improving the accuracy and precision of patient care.
Mercy Health System, a Chesterfield, Missouri–based hospital network, is testing a novel Big Data system with AI capabilities to develop care pathways using insights from Mercy Health’s own patient data. The AI system vendor Mercy selected, Menlo Park, California–based Ayasdi, uses ML to find patterns in unstructured information. Ayasdi’s programs aggregated and then processed Mercy’s electronic health records (EHRs) and payer records to help Mercy develop new insights into its patient care patterns. One insight Mercy discovered using these Big Data systems was that knee replacement patients who received the pain drug Lyrica before surgery needed fewer narcotics after their operations and had a shorter length of stay. Ayasdi produced monthly reports for Mercy that tracked patient outcomes and tied those back to care pathways and individual physician care patterns. As a result of these programs, according to a press announcement issued by Mercy, the hospital leadership projected $15.8 million in savings with a median use rate of 56% for the new protocols arising from insights gathered from the Big Data system. Eventually, executives at Mercy hope to realize as much as $20 million in cost savings and increase staff compliance with the Big Data–informed care pathways to 70%. The savings, according to Mercy’s press announcements, came as much from reducing variations in care as from specific changes such as using a less expensive device.
In other high-profile initiatives, Google has a program for healthcare delivery networks that deploys Big Data systems with predictive modeling capabilities to warn physicians of sepsis, heart failure, or other risky conditions. A software company based in Atlanta, Jvion, has developed a system that uses AI systems to tag patients who may be most susceptible to adverse events, as well as those best suited to respond to treatment protocols.
Keying off both payer data and EHR data, Big Data systems are well suited to serve as decision support systems for physicians, including, critically, the ability of physicians to ask “why?” when the AI decision support system makes a treatment recommendation. Other companies pursuing clinical decision support applications include IBM Watson, Change Healthcare, and AllScripts.
On June 11, 2019, the U.S. Food and Drug Administration (FDA) cleared the first AI software program to “read,” triage, and flag computed tomography (CT) images of the cervical spine for suspected fractures. According to FDA documents, the system is intended to assist trained radiologists to more accurately and quickly flag suspected positive CT images of linear lucencies in the cervical spine bone in patterns that are compatible with fractures. According to the system’s manufacturer, Israeli-based Aidoc Medical, Ltd., the system’s time-to-notification was 3.9 minutes (95% confidence interval [CI]: 4.38–4.1) versus 58.4 minutes (95% CI: 45.3–71.4) for the trained human radiologist.
Aidoc’s system illustrates how deep learning techniques are being applied to data-rich digital images. Deep learning algorithms are able to “learn” image features by comparing features across millions of related images. Imaging-dependent specialties, such as oncology, neurosurgery, trauma, audiology, pathology, dermatology, or ophthalmology, are already using early forms of radiomic Big Data to increase the accuracy and speed of diagnosis and assessment of patient risk profiles.
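To indicate how such image-reading systems are typically structured, the sketch below shows a small convolutional network that scores a single CT slice as suspicious or not. This is a generic illustration only; the framework (PyTorch), input resolution, layer sizes, and the random tensors standing in for real images are all assumptions, not a description of Aidoc’s cleared product.

import torch
import torch.nn as nn

# Generic sketch of a small convolutional classifier for single-channel CT slices.
class SliceClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 32 * 32, 64), nn.ReLU(),
            nn.Linear(64, 1),  # single logit: how "suspicious" the slice looks
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SliceClassifier()
fake_slices = torch.randn(4, 1, 128, 128)              # four fake 128x128 slices stand in for real images
suspicion_scores = torch.sigmoid(model(fake_slices))   # values near 1 would be flagged for triage
print(suspicion_scores.squeeze())

In practice such a network would be trained on many thousands of labeled studies, and its output used only to prioritize cases for a radiologist’s review, as in the triage workflow described above.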
In their review of radiological AI systems, Hosny et al. wrote: “Studies in nonsmall-cell lung cancer (NSCLC) used radiomics to predict distant metastasis in lung adenocarcinoma and tumour histological subtypes as well as disease recurrence, somatic mutations, gene-expression profiles and overall survival. Such findings have motivated an exploration of the clinical utility of AI-generated biomarkers based on standard-of-care radiographic images – with the ultimate hope of better supporting radiologists in disease diagnosis, imaging quality optimization, data visualization, response assessment and report generation.”