Artificial intelligence (AI) and machine learning offer significant promise for improving the state of the art in evidence-based diagnosis and management of spinal disorders.
AI is the effort to automate intellectual tasks, and encompasses a variety of approaches, including machine learning.
Machine learning is a specific approach to AI in which machines learn directly from data. There are a large number of underlying machine learning algorithms, which are split into traditional or shallow learning methods such as logistic regression and decision trees, and more complex algorithms such as large artificial neural networks that are known as deep learning.
Clinical data, including structured data (e.g., patient demographics, comorbidities, symptoms, examination findings) and unstructured data (e.g., clinical notes or radiographic images), are currently being used to develop machine learning models to more accurately identify spinal disease and predict disease trajectory and surgical outcomes.
Machine learning will continue to play an increasingly important role in the practice of spine surgery by developing models that aid in the diagnosis of spinal disorders, identification of pain generators, and prediction of operative vs. nonoperative outcomes and complications that will empower better shared decision making between physicians and patients.
Disorders of the spine are an important cause of disability and a significant expense in our healthcare economy. The prevalence of spinal disorders and their impact on health status make evidence-based management a priority, a position reflected in the recommendations of the 1998 Priority Setting Committee of the Institute of Medicine. Management of spinal disorders nonetheless continues to be characterized by significant variability in the utilization of operative and nonoperative therapies. Multiple providers are involved in the management of patients with spinal disorders, including primary care physicians, emergency physicians, neurologists, rheumatologists, nonoperative spine specialists (including physiatrists and anesthesia pain management specialists), and operative spine specialists (including orthopedic surgeons and neurosurgeons). Priorities for research in low back pain include development of predictive models to determine optimal care pathways for patients based upon biological, structural, psychological, and social factors. To date, predictive models regarding optimal management of spinal disorders, complications, and outcomes of care have been based upon multivariate regression analysis, and their utility to, and adoption by, the community of physicians who manage spinal disorders have been limited. Deep learning and artificial intelligence (AI) may offer useful insights into the diagnosis of causes of low back pain, and into the appropriate management of patients. The purposes of this chapter are to introduce the use of deep learning in understanding the phenotype of low back pain, and to discuss the application of deep learning and AI in developing models that may be predictive of outcomes of care, cost-effectiveness of care, and complications.
Accurate predictive information regarding the expected benefits and costs of care will be valuable in developing an evidence-based approach to the management of spinal disorders.
AI has been defined as “the effort to automate intellectual tasks normally performed by humans.” This effort has been underway in earnest since at least the 1950s, when experts in computer science began to wonder whether machines could be made to “think.” AI, as indicated above, has a very broad definition, and as such has seen a wide variety of approaches ( Fig. 180.1 ). From the 1950s to the 1980s, the most popular and prominent framework was known as “symbolic AI,” in which a prescriptive set of rules composed by human experts is used to enable a computer to make choices based on its current circumstances. This had some success in simple scenarios that lent themselves well to rule-forming, such as chess programs, but struggled to perform well in more complex tasks in which humans are unable to clearly articulate why they make certain choices, such as in various tasks of perception (image recognition, natural language processing, etc.). Additionally, as these rules are hard coded by human experts without learning directly from data, these systems can at best approximate, but not exceed, those experts’ performance, and they cannot generate new knowledge. In the medical domain, a major goal of applied AI techniques is personalized medicine or, more broadly, precision medicine, targeting not just individuals but also subpopulations that may benefit from different treatments, which can be identified using AI based on differences in patient-specific aspects ranging from patient demographics to genetic differences.
Machine learning refers to a specific approach to AI that allows machines to learn directly from data, without the hard-coding of rules by human experts. There are a large number of individual algorithms contained within the field of machine learning (e.g., linear and logistic regression, support vector machines, k-nearest neighbors, decision trees, neural networks, etc.), and methods by which these algorithms “learn,” but they all share the goal of learning from data directly, crafting their own set of “rules” without requiring input from human experts. This learning is referred to as “training” and may be subdivided into a few subtypes: supervised, unsupervised, and reinforcement learning.
Supervised learning refers to training with labels. In this type of machine learning, input data are associated with labels, and the model learns to predict these labels from new input data. There are two general types of tasks that can be learned in this supervised fashion: regression and classification. In regression, the model learns to predict a continuous output, and in classification, the model predicts a categorical output. For example, a regression task would be learning to predict a patient’s body mass index (BMI) from various input variables, whereas a classification task would be predicting from these inputs whether a patient is obese. Much of the most exciting recent progress in machine learning (image recognition, natural language processing, prediction, etc.) has been achieved with supervised learning. In unsupervised learning, a model learns directly from data without labels. For example, a clustering algorithm may learn to break down patients into distinct groups based on clinical information without any prediction of their outcome. The final type of learning is reinforcement learning, in which an agent learns from receiving rewards while interacting with its environment. This has been very important in robotics.
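To make the distinction between regression and classification concrete, the following Python sketch derives both kinds of label from the same patient records: a continuous BMI value (the kind of target a regression model predicts) and a binary obesity flag (the kind of target a classifier predicts). The records, field names, and function names are hypothetical and do not come from this chapter, and the cutoff of 30 kg/m² is the commonly used definition of obesity, assumed here for illustration.

```python
# Hypothetical patient records (illustrative values only, not real data).
patients = [
    {"height_m": 1.70, "weight_kg": 65.0},
    {"height_m": 1.80, "weight_kg": 105.0},
    {"height_m": 1.60, "weight_kg": 80.0},
]

OBESITY_THRESHOLD = 30.0  # kg/m^2, assumed cutoff for the categorical label


def bmi(height_m: float, weight_kg: float) -> float:
    """Continuous quantity: the kind of target a regression model predicts."""
    return weight_kg / height_m ** 2


def is_obese(height_m: float, weight_kg: float) -> bool:
    """Categorical quantity: the kind of target a classifier predicts."""
    return bmi(height_m, weight_kg) >= OBESITY_THRESHOLD


# The same inputs yield two different kinds of supervised-learning label.
regression_targets = [bmi(p["height_m"], p["weight_kg"]) for p in patients]
classification_targets = [is_obese(p["height_m"], p["weight_kg"]) for p in patients]
```

In a real project these labels would be paired with input variables, and a model would be trained to predict them; the sketch shows only how the two label types differ.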
For any of the discussed learning types, the process of learning is often similar. The data are first divided into training and testing datasets. The training set makes up the majority of the data and is what the model is trained on, and the test set consists of a small subset of the overall data that is held out of training to be used for evaluation of the model once training is complete. In this way the general performance of the model can be more accurately computed, and choices can be made that optimize this performance rather than the model’s performance on the training data. The goal for the model is to accurately fit the dataset, giving accurate and consistent predictions for any given set of inputs. A model is said to underfit the data when it fails to perform accurately on even the training data. Additional training time or a more complex model may be needed in order for a model to learn to predict more accurately. A model is said to overfit the data when it performs well on the training set but fails to generalize this performance to the test set. In this case, the model may need additional data to train on, or may benefit from being made simpler, with fewer or more tightly constrained parameters, which can be achieved through a process known as regularization. There is often a tradeoff between underfitting and overfitting, and tuning a model to achieve the proper balance and fit the data at hand is a major challenge in every machine learning project.
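The split into training and test data, and the way it exposes underfitting and overfitting, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a recommended workflow: the data are synthetic, the 80/20 split and all function names are choices made here, and the two toy models (one that always predicts the training mean, one that memorizes the training set) were picked only because they sit at opposite extremes of model flexibility.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def mean_squared_error(model, rows):
    """Average squared difference between predictions and true outputs."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


def fit_mean(train):
    """Very simple model: always predict the training mean (tends to underfit)."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_one_nearest_neighbor(train):
    """Very flexible model: memorize the training set (tends to overfit)."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]


# Synthetic data: y = 2x plus random noise (purely illustrative).
noise = random.Random(42)
data = [(x / 10, 2 * (x / 10) + noise.gauss(0, 1)) for x in range(100)]
train, test = train_test_split(data)

for name, fit in [("mean", fit_mean), ("1-NN", fit_one_nearest_neighbor)]:
    model = fit(train)
    print(name,
          "train MSE:", round(mean_squared_error(model, train), 3),
          "test MSE:", round(mean_squared_error(model, test), 3))
```

The memorizing model reproduces every training point exactly, so its training error is zero, yet its error on the held-out test set is not; that gap is the signature of overfitting. The mean predictor has a large error even on the data it was trained on, which is the signature of underfitting.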
As mentioned above, machine learning is a broad field, with a wide variety of models and algorithms that are used. The field can be divided into traditional methods, sometimes referred to as shallow learning, and deep learning. Traditional methods depend on manual extraction of features, which are then input into one of a number of algorithms, where an output is generated. The algorithms are varied and include linear and logistic regression, support vector machines, k-nearest neighbors, decision trees, and random forests. Linear regression refers to the well-known statistical model of learning the “best fit” linear prediction of a continuous output from any number of inputs. Logistic regression takes this output and passes it through the logistic function to transform it into a binary, categorical output (thus logistic regression is poorly named because it performs classification and not regression). For example, a linear regression model might predict a patient’s BMI from his or her clinical data, and a logistic regression model might predict whether or not the patient is obese from these same data. Support vector machines similarly learn to make linear predictions, but are “large-margin” classifiers that learn the linear margin that gives the widest space between categories, and additionally can learn nonlinear associations by transforming their input data into higher-dimensional space using the kernel trick. In k-nearest neighbors, predictions are made for a given input by averaging the outputs of other data points that are most similar to the input in question. Decision trees are commonly-used decision support tools and make predictions in stepwise fashion by learning a flowchart-like pattern in which each “node” of the tree represents a test of some feature in the input data, and each branch represents the outcome of this test. 
In this way you follow the appropriate branches based on the features of your input data until you reach a terminal leaf node, which gives you the final prediction. These nodes and branches are learned with the goal of maximizing the amount of information gained at each step. Decision trees can be powerful predictors when used in isolation, but are prone to overfitting. This tendency is reduced when several decision trees are trained from the data and the results are then combined to give a final prediction, in a model known as a random forest.
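The idea of choosing each node’s test so as to maximize the information gained can be illustrated with a single split. The sketch below assumes entropy-based information gain, which is one common criterion (others, such as Gini impurity, are also used), and searches a small hypothetical dataset for the most informative threshold test. The data, feature order, and function names are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, feature, threshold):
    """Entropy reduction from the test rows[i][feature] <= threshold."""
    left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
    right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    weighted = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


def best_split(rows, labels):
    """Try every feature and observed value; keep the most informative test."""
    candidates = [
        (information_gain(rows, labels, f, row[f]), f, row[f])
        for f in range(len(rows[0]))
        for row in rows
    ]
    return max(candidates)


# Hypothetical toy data: [age in years, BMI]; label is 1 if obese, else 0.
rows = [[34, 22.0], [61, 31.5], [47, 28.0], [55, 35.2], [29, 30.4], [70, 24.1]]
labels = [0, 1, 0, 1, 1, 0]
gain, feature, threshold = best_split(rows, labels)
```

On this toy dataset the search selects the BMI feature (index 1) with a threshold of 28.0, which separates the two classes completely. A full decision tree repeats this search within each resulting branch, and a random forest trains many such trees and combines their predictions.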
Although the goal of machine learning is to learn directly from data, the performance of traditional machine learning depends greatly on the representation of the data that are fed into the model, and as such is still dependent on manual feature extraction. For example, the above-mentioned logistic regression model to predict obesity does not examine the patient directly, but rather relies on extraction of key data points that are then fed into the model. Many AI tasks can be solved via this manual feature extraction and training of simple models; however, more complex, especially perceptual, tasks are very difficult to solve in this way. For example, if we want to detect a fracture on a radiograph, the presence of a lucency within bone may be a useful feature. However, it is extremely challenging to extract this feature, with all of its complex possible arrangements and orientations, directly from the raw pixel values. One way to solve this problem is to allow the machine learning algorithm to learn to extract the features from the raw input values as well. In this scenario, known as representation learning, the machine learning algorithm takes as input raw values and transforms these into a meaningful representation via automated feature extraction; then this representation is passed through the rest of the algorithm to give the algorithm’s output.
Currently, the most influential form of representation learning is known as deep learning. In deep learning, inputs are passed through a model known as an artificial neural network ( Fig. 180.2 ). These models were inspired by the structure of the brain and therefore share some neuroscience nomenclature, even though they are architecturally quite distinct. In an artificial neural network, individual units of computation known as neurons are placed in parallel to make up a layer, and these layers are joined in series to form the network. Each layer of neurons receives as input the output of the previous layer, performs computation on that output, and passes its own output to the next layer. In this way, neural networks learn a hierarchical representation of data, with the first layer computing simple features from raw input values (the representation learning discussed previously) and each subsequent layer computing more and more complex features. For example, in the case of facial recognition, the first layer may learn to recognize edges, the second layer combines these to recognize textures, the third layer combines these to recognize individual components of a face (e.g., nose, eye, ear), and the final layer combines these to recognize the presence of an entire face. The number of layers of a network, otherwise known as the depth of the network, corresponds to the complexity of function that can be learned, and is the origin of the term deep in deep learning.
Although neural networks can appear quite complex when their architecture is viewed en masse, they are actually quite simple. The key to understanding a neural network is to understand the function of an individual neuron. In the most common form of neural network, an individual neuron can be thought of as a single unit of logistic regression ( Fig. 180.3 ). This neuron will receive as input the output of all of the neurons of the previous layer. It learns weights to associate with each of these inputs, multiplies these inputs with the respective weights, and then sums these values. Therefore, in this first linear step of the neuron’s computation, the neuron has simply performed linear regression. In the second computational step, a nonlinear “activation function” is applied to the linear output, which allows the network to learn complex, nonlinear functions. Traditionally, this activation function was simply the logistic function that transformed a linear output into values between 0 and 1; and therefore, the entire computation of a neuron can be thought of as simply consisting of logistic regression. Although modern networks may use different activation functions that allow more efficient learning, the final output neuron for classification tasks remains simply logistic regression.
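The two computational steps of a neuron translate almost directly into code. The following sketch is illustrative only: the weights are hand-picked rather than learned, the function names are assumptions made here, and a bias (intercept) term, which the description above does not mention but which is standard in practice, is included in the linear step.

```python
import math


def logistic(z: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """One neuron: weighted sum (linear step), then logistic activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return logistic(z)


def dense_layer(inputs, weight_rows, biases):
    """A layer is several neurons in parallel, all reading the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hypothetical, hand-picked weights (a trained network would learn these).
x = [0.5, -1.0, 2.0]
hidden = dense_layer(x, [[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]], [0.0, 0.1])

# The output neuron reads the hidden layer's outputs, joining layers in series.
output = neuron(hidden, [1.0, -1.0], 0.0)
```

Because the final neuron applies the logistic function, `output` always lies between 0 and 1 and can be read as a predicted probability for a binary classification task.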
Although the neural network is sometimes called the “master algorithm,” as it seems to be able to learn many different functions well, there are specialized architectures that work well for specific tasks. For example, convolutional neural networks are particularly effective for image processing, recurrent neural networks and transformers work well with text and other sequence data, and standard “densely-connected” neural networks may work well for structured clinical data. The details of these architectures are beyond the scope of this chapter, but the interested reader may find additional information in referenced works.
The value of algorithms that can learn directly from data is obvious, but how exactly do these models “learn”? Although this varies depending on the exact machine learning algorithm, many algorithms, including neural networks, rely on a learning process called gradient descent. In gradient descent, the predictions of a model are compared with the actual labels via a cost function, also called a loss function. The cost function allows the algorithm to assess the general “fit” of the model’s predictions, with the cost function approaching a minimum as the algorithm more closely matches the correct outputs with its predictions. Many of these cost functions are used broadly in statistics, including mean squared error and mean absolute error, and some have been developed specifically for machine learning tasks. The exact cost function used depends on the type of task being performed (e.g., classification vs. regression) and on what exactly the model is meant to learn.
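The cost functions named above are simple to write down. The sketch below implements mean squared error and mean absolute error for regression, and adds binary cross-entropy as one widely used cost function for classification; the chapter does not name cross-entropy, so its inclusion here is an assumption made for illustration, as are the function names and example values.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Regression cost: penalizes large errors heavily (squared)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Regression cost: penalizes errors in proportion to their size."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Classification cost for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical BMI predictions (regression).
mse = mean_squared_error([25.0, 30.0, 35.0], [24.0, 32.0, 35.0])
mae = mean_absolute_error([25.0, 30.0, 35.0], [24.0, 32.0, 35.0])

# Hypothetical obesity probabilities (classification).
good_fit = binary_cross_entropy([0, 1, 1], [0.1, 0.9, 0.8])
poor_fit = binary_cross_entropy([0, 1, 1], [0.6, 0.4, 0.5])
```

In each case a lower value indicates predictions that match the labels more closely, which is why the better-calibrated probabilities give a smaller cross-entropy than the poorer ones.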
In gradient descent, the partial derivative of the loss function (the cost function described above) with respect to each learned parameter is calculated after each step, and each parameter is then updated so as to decrease the overall loss. In effect the slope of the loss function is calculated with respect to each parameter, and the parameter is changed to “move down” the loss function toward its minimum. Once the loss function has reached its minimum, the slope of the loss curve will be zero, and therefore the parameters will stop updating. At this point the model is said to have “converged,” and the model will have learned the optimal configuration of its parameters that minimizes the loss function, and therefore will have learned to give its best predictions. Although this process is easy to conceptualize for simple models such as linear regression, it is also the basis of learning in neural networks, as the partial derivatives with respect to each layer’s weights are calculated using the next layer’s partial derivatives via the chain rule in a process known as backpropagation, and each neuron’s weights are updated appropriately to “move down” the loss curve.
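For the simple case of linear regression, the whole procedure fits in a short loop. The following sketch fits a line y = w·x + b by gradient descent on mean squared error; the learning rate, number of steps, function name, and data are assumptions chosen for illustration, and backpropagation through multiple layers is not shown.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by minimizing mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current parameters.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE loss with respect to each parameter.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Step "downhill": move each parameter against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Noise-free illustrative data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
```

As the parameters approach the values that generated the data (a slope of 2 and an intercept of 1), the gradients shrink toward zero and the updates become negligible, which is the convergence described above. A learning rate that is too large would instead cause the updates to overshoot and diverge.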
So far the methods described above dictate ways to process one mode of clinical data at a time. Clinical data are by definition multimodal, as structured data (age, sex, BMI, smoking status, etc.), unstructured text from clinical notes, and imaging data are associated with each patient. To make predictions and diagnoses with the highest levels of accuracy, it is sometimes helpful to use more than one mode of data in your predictions, just as a human physician uses multiple types of data as she formulates her assessment and plan. It is possible to do this in machine learning as well by a variety of approaches. A naive approach would be to simply combine all data into one large feature list and train a model to make predictions from this input. However, this does not work well because the model architectures that are most effective vary by data type, as described earlier. Two more successful approaches are described here.
In model ensembling, different models are built to process each type of data, and the predictions of these models are averaged together (known as “ensembling”) to get the final prediction. For example, a machine learning model designed to diagnose infection may average the probability of infection that is output from a convolutional neural network trained on clinical photos with the output of a logistic regression model trained on clinical data.
A more sophisticated approach involves using neural networks or other forms of representation learning as feature extractors for unstructured data, combining these features with structured data, and then training a model with this combined feature list. For example, suppose in the above example of predicting infection that each clinical photo is 512 × 512 pixels. These 262,144 raw pixel values are unstructured, and each one does not contain much information by itself. Now imagine the convolutional neural network you trained to predict infection ends with an eight-neuron layer that is then fed into a final neuron to make a prediction. After training this model, the outputs from the eight-neuron layer effectively represent the important features of the raw pixel values. These eight outputs serve as an encoding of the image, and you can now combine these features with the structured clinical data and train a model with these features to make your final prediction. It is important to note that, although more complicated, this second approach is not always more successful than simple ensembling, underscoring the importance of trial and error in model design and training.
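The mechanics of the two approaches can be shown side by side in a short sketch. All numbers below are hypothetical stand-ins for the outputs of trained models, the structured variables and function names are assumptions, and no actual network is trained here; the point is only how the pieces are combined.

```python
def ensemble_average(probabilities):
    """Simple ensembling: average the probabilities from separate models."""
    return sum(probabilities) / len(probabilities)


def fuse_features(image_encoding, structured_features):
    """Feature-level fusion: join learned image features with structured data."""
    return list(image_encoding) + list(structured_features)


# Approach 1: average two models' predicted probabilities of infection.
p_image_model = 0.80     # hypothetical output of a CNN on a clinical photo
p_clinical_model = 0.60  # hypothetical output of a logistic regression model
p_ensemble = ensemble_average([p_image_model, p_clinical_model])

# Approach 2: use the eight-neuron layer's outputs as an image encoding.
raw_pixel_count = 512 * 512  # 262,144 unstructured input values
image_encoding = [0.1, 0.9, 0.0, 0.3, 0.7, 0.2, 0.5, 0.4]  # hypothetical
structured = [67.0, 1.0, 31.2]  # assumed: age, smoker flag, BMI
combined = fuse_features(image_encoding, structured)
```

The combined list of 11 values (eight learned image features plus three structured variables) would then serve as the input to a final model, in place of the 262,144 raw pixel values.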
Degenerative pathology in the lumbar spine is the most common reason for spine surgery and an important expense in our healthcare system. The correlation between radiographic findings and clinical symptoms is important for accurate diagnosis and for the choice of appropriate surgical interventions. Inferring patient health status from clinical imaging is challenging because of the multifaceted etiology that often underlies degenerative conditions and the variance of clinical presentations of lumbar pathology. Boden et al. demonstrated a poor correlation between magnetic resonance imaging (MRI) findings of lumbar degenerative pathology and back pain. The imprecision of diagnostic imaging in the context of high-incidence low back pain in an aging population represents a significant public health problem and an important opportunity for application of precision health solutions. Predictive analysis of clinical imaging with machine learning methods can provide scalable, time-efficient, and patient-specific treatment tools. Proper identification and localization of morphological features causing back pain and neural symptoms are prerequisites for successful treatment. Machine and deep learning may be useful in the development of tools that can precisely correlate radiographic imaging with patient self-reported health status, with the goal of improving both efficacy of treatment plans and patient outcomes. Deep learning algorithms may improve the precision of clinical image interpretation and the diagnostic accuracy of clinicians. Applying deep learning to a large, well-characterized image set, as we propose here, may yield new diagnostic tools to more accurately manage postsurgical patients in orthopedics. Such precision diagnostics based on state-of-the-art machine learning may represent a first step in significantly improving the identification of radiographic characteristics that are associated with disability and patient health status.