Artificial Intelligence and Machine Learning in Spine Surgery


Introduction of Technology

Investigations into incorporating machine learning (ML) and artificial intelligence (AI) into medicine have been numerous and diverse, and their aggregate contributions have reshaped clinical practice. Fundamentally, ML and AI approaches seek to leverage automata for pattern recognition and extrapolation. Generally, approaches are classified as one of two predominant forms: supervised and unsupervised learning. Supervised learning describes those approaches that use a prelabeled “training” dataset to discover relationships between model features and the outcome-of-interest that can subsequently be applied to generate risk or likelihood predictions in unlabeled “test” datasets. Such methods are commonly used for modeling peri- and postoperative outcomes and for automated interpretation of written clinical reports through natural language processing (NLP). Unsupervised learning refers to the de novo identification of patterns and relationships in unlabeled datasets through discovery and reconstruction of often complex and non-linear relationships within data. Examples of unsupervised learning applications within medicine include the contouring of radiology and pathology studies. Within spine surgery specifically, we will review common approaches to three clinical themes: outcome prediction, image analysis, and NLP. We will further describe the notable current applications of ML and AI within these themes and discuss, in depth, an example of how automata may shape patient care within spine surgery.

Outcome Prediction

Perhaps the theme most frequently associated with ML in medicine has been the prediction of clinical outcomes. The early efforts to design intelligent clinical prediction systems generally revolved around semi-automated rule-based systems, such as MYCIN and PLEXXUS. However, while such models augmented real-time decision-making using stored knowledge compiled by subfield experts, rule-based approaches were limited by low flexibility and adaptability. Since the 1970s, continued advancements in computer architecture and hardware have facilitated growth, diversity, and complexity of predictive modeling approaches. Clinical predictive modeling now offers innumerable options for algorithm design and opportunities to anticipate short- and long-term patient outcomes accurately. With external validation and implementation, these approaches may help shape the care delivery framework toward a more personalized archetype enabling improvements in both healthcare efficiency and patients’ care experiences.

The practice of clinical predictive modeling can be generalized into two broad categories: shallow learning and deep learning. Shallow learning describes the approaches designed to identify relationships among pre-defined features of a dataset and an outcome-of-interest. Examples of shallow learning frameworks frequently applied to clinical medicine include regression and classification approaches such as generalized linear models (GLMs), decision trees, and support vector machines. In contrast, deep learning requires no pre-defined feature set, instead allowing the computer system to identify, arrange, and weight important contributors within the data during the process of model development and tuning. Approaches such as convolutional neural networks (CNNs) can recognize and decompose complex signals for the identification of underlying patterns and relationships. Deep learning offers numerous advantages, particularly in the setting of feature-rich datasets or intricate pattern recognition tasks; however, its disadvantages must also be highlighted, foremost among them being the reduced interpretability of model parameters and component contribution. Deep neural networks, especially, are frequently referred to as “black boxes,” a metaphor that highlights the difficulty of deconvoluting and explaining the transformation of input components into an output decision. Ongoing research aims to develop “explainers” to ameliorate this concern; nevertheless, the practical utility of deep learning approaches will largely depend on clinicians weighing the marginal importance of incremental performance boosts (and downstream implications for clinical care) against the heightened difficulty in model interpretation and increased computational resources required by such architectures.

Within the realm of spine surgery, clinical predictive modeling has been applied to outcomes ranging from postoperative complications and readmissions to long-term postoperative recovery and opioid dependence. Goyal et al. sought to develop an approach to predict non-home discharge and early readmission following cervical or lumbar vertebral fusion. The performance across modeling approaches (GLM, gradient boosting machines, Elastic Net GLM, and artificial neural networks [ANN], among others) was comparable, reaching Area Under the Receiver Operating Characteristic curve (AUROCs) of 0.87 and 0.66 for prediction of non-home discharge and early unplanned readmission, respectively. Similar methods have also been applied to predict postsurgical outcome and complications, long-term narcotic use, healthcare costs, and mortality. It remains important to note that, while increasing interest in ML models for clinical outcome prediction has yielded numerous promising candidates for introduction into clinical practice, there remains a paucity of studies demonstrating objective utility in secondary prospective investigations. Clinical trials applying validated predictors and appropriate care interventions (such as personalized pain management regimens, increased postoperative vigilance, or re-evaluation of post-discharge care) remain important “next-steps” in the translation of ML models from theory into practice.
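
The AUROC values quoted above can be read as the probability that a model assigns a higher risk score to a randomly chosen patient who experienced the outcome than to a randomly chosen patient who did not. The following is a minimal sketch of that pairwise definition, not the evaluation code of any cited study; the function name `auroc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
def auroc(labels, scores):
    """Pairwise AUROC: fraction of (positive, negative) pairs in which the
    positive case receives the higher score; tied scores count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = outcome occurred (e.g., non-home discharge), 0 = it did not.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
print(round(auroc(labels, scores), 3))
```

In this toy example, eight of the nine positive-negative pairs are ranked correctly. An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is one way to put the reported 0.87 and 0.66 into perspective.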

ANNs and CNNs, outside of image interpretation and NLP (topics that will be addressed in the following sections), have also been explored in the context of clinical predictive modeling. Studies have compared the performance of neural networks to that of alternative approaches such as logistic regression, proposing these more complex model designs as feasible alternatives and potential improvements to the more frequently applied regression- and tree-based methods. In support of this, a prior study by Dreiseitl and Ohno-Machado demonstrated that, in a subset of applications and settings, ANNs may offer better class discrimination than does logistic regression. These investigations, while reflective of the broadening interest within the medical community in harnessing increasingly complex models, emphasize the recognition of important considerations. As described by Dreiseitl and Ohno-Machado, logistic regression performed comparably to ANNs in the majority of examined cases. It certainly is possible that, under many practical circumstances, ANNs offer indistinguishable performance compared to simpler approaches; as previously discussed, a cost-benefit assessment should be conducted by the prospective user to determine whether the increased resource utilization and development complexity is justified by the expected marginal improvements in performance. Lastly, in contexts where developers hope to use model development and validation to identify modifiable factors informative for subsequent prospective studies, the loss of interpretability associated with higher complexity models may hinder informed research design.

Image Classification

The manual interpretation of imaging incurs several downsides that may be ameliorated by automata. First, the low throughput nature of manual radiology review can result in reduced workflow speed and efficiency. Automated and semi-automated MRI interpretation can leverage the multithreaded and superior computational capabilities of modern-day computers to streamline radiology review. Second, heterogeneity in image interpretation across reviewers may compromise confidence in evaluation results. In prior evaluations of interrater concordance, reliability of findings generally ranged broadly from moderate to good. Improved reliability may be achieved by integrating deep learning approaches as either an independent or adjunctive process for establishing diagnoses. Third, detection of subtle pathologies may be difficult even for experienced radiologists. Studies evaluating application of deep CNNs for sensitive diagnosis of such lesions have yielded promising results. Taken together, these potential drawbacks suggest that the integration of AI into spine surgery diagnostics and practice may streamline radiologic evaluation while potentially improving interpretation accuracy and consistency.

While the interpretation of spinal imaging by automated systems has historically relied on shallow learning architectures such as support vector machine and random forest, the recent literature has shown an ever-increasing presence of deep learning in image classification. By the early 2010s, the potential of deep learning in imaging classification had been spotlighted by the development of graphics processing unit (GPU)-implemented CNNs capable of attaining previously unachievable levels of prediction accuracy with accelerated computational speed. Among the most recognizable was AlexNet, an eight-layer CNN that reduced misclassification error on the ImageNet Large Scale Visual Recognition Challenge 2012 dataset by more than 40%. During the development of early CNN-based computer vision approaches such as AlexNet, model performance was significantly limited by the heavy computational and memory resources required by deep learning frameworks. However, the growing understanding of neural network architecture and construction has yielded algorithms with increased resource and computational efficiency; further augmenting research in deep learning has been the industry-wide progression of computing power and data availability facilitating high dimensional analyses. Broadly, the impact of deep learning constructs on image recognition tasks may be best summarized by the sophistication of recent algorithms harnessing hundreds of millions of parameters that have now achieved classification errors orders of magnitude lower than that of AlexNet.

Specifically pertaining to spinal pathologies and spine surgery, the application of machine-assisted image recognition approaches may impact both presurgical diagnostics and perioperative management. Automating segmentation and annotation of vertebrae using CNNs may facilitate opportunities for sensitive detection and characterization of degenerative disk disease and acute spinal cord injury. Both Huang et al. and McCoy et al. modeled their approaches on the often-employed U-net architecture purposed for segmentation of T2-weighted MRI. The former introduced Spine Explorer, an algorithm for the automated detection and quantification of vertebral body and vertebral disk measurements (metrics including diameter, height, area, and signal intensity). Compared to gold standard manual expert segmentation, Spine Explorer achieved high contour accuracy (97.4% as evaluated by the Jaccard index, which measures the regional intersection of automated and gold standard contours over their union). Disk measurables obtained by Spine Explorer were compared to previously developed indices grading disk degeneration, demonstrating strong inverse correlation between CSF-adjusted disk intensity and disk degeneration severity. The latter study developed BASICseg for auto-segmentation of the spinal cord and spinal cord lesions. Compared to gold standard expert manual segmentation, BASICseg achieved a high Dice coefficient (a measure reflecting degree of regional overlap) of 0.93 during spinal cord segmentation. Notably, segmenting acute spinal cord injury lesions proved more difficult and, despite BASICseg achieving higher test set accuracy than prior algorithms, further research is necessary to improve contour fidelity. Taken together, these studies introduce approaches to the automated quantification of spinal anatomy and pathology that may supplement surgeon clinical experience when localizing and explaining symptomatology.
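
Because the Jaccard index and the Dice coefficient are both used above to report segmentation quality, it may help to see how each is computed from a pair of binary masks. The sketch below is illustrative only and is not taken from Spine Explorer or BASICseg; the function name `overlap_metrics` and the tiny 3 x 4 masks are hypothetical, and masks are assumed to be equally sized lists of rows containing 0 or 1.

```python
def overlap_metrics(pred, truth):
    """Return (dice, jaccard) for two equally shaped binary masks.

    dice    = 2 * |A and B| / (|A| + |B|)
    jaccard = |A and B| / |A or B|
    """
    inter = pred_sum = truth_sum = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            pred_sum += 1 if p else 0
            truth_sum += 1 if t else 0
    union = pred_sum + truth_sum - inter
    if union == 0:
        # Assumed convention: two empty masks are treated as a perfect match.
        return 1.0, 1.0
    return 2 * inter / (pred_sum + truth_sum), inter / union


# Hypothetical masks: 1 marks pixels labeled as the structure of interest.
truth = [[0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
pred = [[0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 1, 0, 0]]

dice, jaccard = overlap_metrics(pred, truth)
print(dice, jaccard)
```

The two metrics are monotonically related by J = D / (2 - D), so the Dice coefficient is never lower than the Jaccard index for the same pair of masks. This is worth keeping in mind when comparing a Jaccard-based result from one study with a Dice-based result from another.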

Deep learning may guide not only the diagnosis of spinal pathologies but also the practice of spine surgery itself. In a study aimed at automating surgical planning for pedicle screw insertion, Cai et al. constructed a deep CNN for digital three-dimensional segmentation and pedicle screw path simulation in spinal CT imaging. On withheld imaging series, segmentation achieved a Dice coefficient of nearly 97% and a Jaccard coefficient of over 93%. The pedicle screw localization was also accurate, with a mean squared error of 1.34 voxels during five-fold cross validation. Despite promising results, the notable limitations of the study include input imaging requirements and missing physiological qualities, such as biomechanical characterization of the planned surgical path. Regarding the former, limited computational resources necessitated scaling down input bulk and restricting CNN size. The latter concerns the importance of biophysical factors beyond anatomical setting and location. The biomechanical properties and composition of individual vertebral bodies, particularly in the setting of degenerative pathologies, may significantly alter surgical planning and should be considered in fully automated surgical planning workflows. Although such an approach has yet to be developed and validated, the implementation of current methods in a semi-automated manner and future elucidation of improved fully automated algorithms may improve surgical planning efficiency while reducing variance. Lastly, deep learning-enhanced planning and reduction of radiation in instrumented spine surgery through generation of synthetic CT images, as will be discussed later in this chapter, are promising developments.

Natural Language Processing

Beyond the enhanced interpretation of radiologic imaging using computer vision, automated dissection and interpretation of unstructured clinical reports offers tantalizing opportunities for optimizing spine surgery workflows. From origins reaching back to the 1960s, free-text parsing has remained a significant challenge. One of the earliest efforts to transform human-interpretable language into machine-understandable features was the Linguistic String Project whose development, led by Dr. Naomi Sager, served for decades as the archetype of NLP. The application of NLP frameworks to clinical, radiology, and pathology reports has yielded promising results, often outperforming manual alternatives in terms of reliability and accuracy. At a high level, NLP algorithms in medicine generally comprise two phases: the conversion of unstructured or semi-structured reports into a constellation of machine-interpretable information, ranging from semantics to syntactical context, and feature engineering for development of discriminative models for classification tasks. The expansion of electronic health records to near-ubiquitous presence in the 21st century has created significant opportunity for incorporation of NLP into medical practice. Increasing access to large, shared healthcare datasets, such as the MIMIC databases curated by the Beth Israel Deaconess Medical Center in Boston, Massachusetts, has also aided efforts to optimize NLP algorithms for the clinical setting. Together with increasing physician eagerness to improve care efficiency and effectiveness, the past two decades have seen rapid acceleration of NLP-related medical research and publications.

The formalization of an NLP system capable of processing free-text clinical records is a multistep process, requiring fragmentation of the document into text subunits called tokens that can subsequently be interpreted as discrete concepts using appropriate lexicons and the syntactic context in which each token was identified. Language databanks, such as the Unified Medical Language System and the frequently used SPECIALIST medical lexicon it encapsulates, are necessary to define and standardize identified tokens. Those systems geared toward extraction of data from clinical reports, such as MEDLEE, cTAKES, and MetaMap, consolidate these steps within an encompassing wrapper. The encoded language can then be contextualized using databases such as SNOMED-CT and can also serve as feature sets for additional classification algorithms. In an early example of NLP implementation in medicine, Murff et al. surveyed surgical inpatient admission records using the Multi-Threaded Clinical Vocabulary Server to identify postoperative complications and compare performance against manually coded complications. While both manual review and NLP-based queries offered high specificity, the latter achieved significantly higher sensitivity, particularly for identification of postoperative acute renal failure, sepsis, and pneumonia.

While still fledgling, NLP has been assessed for application in spine surgery primarily as a means of automated detection and surveillance of complications. Karhade et al. developed an NLP pipeline for the identification of incidental durotomy during spine surgery using free-text operative notes and benchmarked predictions against ground truth established by multiple independent expert review. Post-tokenization feature importance was established based on term frequency-inverse document frequency, which seeks to quantify differential prevalence of candidate tokens between the documents of differing strata in the training set (in this case, classified by the presence of incidental durotomy). Downstream classification was performed using gradient-boosted decision trees, and performance was significantly higher than that of durotomy identification using CPT and ICD codes (AUROC of 0.99 vs. 0.64 in the test set); subsequent external validation confirmed robust model performance. Similar approaches have been applied for the identification of other spinal pathologies, including low back pain, surgical site infections, and Modic type 1 endplate change.
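
To make the term frequency-inverse document frequency weighting mentioned above more concrete, the following sketch computes one common textbook variant: term frequency is a token's share of a note, and inverse document frequency is the natural logarithm of the number of notes divided by the number of notes containing the token. This is an assumption made for illustration, not the pipeline of Karhade et al.; the function name `tf_idf`, the whitespace tokenizer, and the three short notes are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {token: weight} dictionary per document.

    tf  = token count / number of tokens in the document
    idf = ln(number of documents / number of documents containing the token)
    """
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each token once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            tok: (cnt / total) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return weights


# Hypothetical operative-note fragments, invented for this example.
notes = [
    "dural tear repaired with suture",
    "no dural tear noted",
    "wound closed with suture",
]
weights = tf_idf(notes)
for tok, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{tok:10s} {w:.3f}")
```

Tokens that occur in only one note (here, "repaired") receive the largest weight within that note, while a token present in every note would receive a weight of zero. The resulting weight vectors are the kind of feature set that can then be passed to a downstream classifier such as gradient-boosted decision trees.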

While few modern NLP approaches used for interpretation of clinical reports leverage deep learning models, studies in informatics and AI have suggested that deep CNNs may significantly improve model accuracy. Machine automation of complex tasks, such as language translation and natural language interpretation (NLI), could spearhead significant shifts in applied clinical medicine. For example, accurate translators may facilitate collaborative efforts to query aggregated clinical records from diverse geographic regions by facilitating information transfer and standardization. NLI, which includes subtasks such as sentiment analysis and question answering, may also increase care efficiency and effectiveness through task multithreading and intelligent understanding of written and spoken language (both in the physician-physician and physician-patient settings). However, estimations of current resource requirements for state-of-the-art NLP models such as BERT and XLNet suggest improved algorithm efficiency may be required for increased scalability. Current industry efforts include those directed at increasing algorithm efficiency, such as the advent of the SustaiNLP 2020 workshop to emphasize sustainable NLP practices and exploration of more minimalistic model designs, such as qQRNN, which can approximate the performance of state-of-the-art approaches while reducing resource use by more than 300-fold.

Machine Learning in Clinical Practice: Synthetic CT and Predictive Analytics

In complex lumbar spine surgery, the combination of MRI and CT has long proven complementary. While the strength of CT imaging lies in visualizing osseous structures and allows assessment of spinal integrity and stability, MRI excels at soft tissue imaging, including nervous tissue. Still, CT images are often required for neuronavigation and treatment planning, such as for image-guided navigated biopsies or pedicle screw placement. CT inherently leads to radiation exposure, with an average lumbar spinal CT corresponding to an effective dose ranging from around 3.5 mSv up to 19.5 mSv. In addition, multiple imaging sessions increase the logistic burden for patients and healthcare providers alike, and also drive overall costs. The independent acquisition of both imaging modalities can also introduce complex workflows, with the potential for inter-modality registration errors.

For these reasons, MRI-only workflows, especially in radiotherapy, have increased markedly in recent years. The ability to derive synthetic CT images from routinely collected MR sequences, which are usually mandated anyway in spine surgery due to the necessity of delineating the neural structures, without radiation and a second imaging session would thus prove beneficial. As detailed before, deep learning can at times enable tackling such complex tasks that may have appeared impossible beforehand.

Similarly, predictive analytics can guide decision-making by producing accurate and, most importantly, individualized risk profiles, also enabling more confident patient counseling preoperatively.

In the following two case presentations, we aim to demonstrate the clinical use of synthetic CT along with predictive analytics in instrumented lumbar spine surgery.

Please note that while pedicle screw trajectory planning was successfully performed using synthetic CTs and the robot workstation, pedicle screw insertion was effectively carried out based on spiral CT, as there is not yet an ethical and scientific basis to operate on patients based on synthetic CT alone. The two cases are, however, real cases treated at our center using robotic guidance, and both pertain to lumbar spinal fusion augmented by the two aforementioned ML applications.

History and Clinical Presentation

Patient A

Patient A (female, 62 years) presented to us with predominant chronic low back pain (CLBP; Numeric Rating Scale [NRS] 7) as well as right-sided radiating pain in the L4 dermatome (NRS 5), and had previously undergone three decompression procedures at L3–L4 at another center, leading to a clinical conclusion of a failed back surgery syndrome (FBSS). Walking down stairs had become more difficult with the right leg. The patient had no left-sided complaints. Conservative therapy and analgesic interventions had no lasting effect; the patient was largely unable to work and indicated being unable to go on further with her CLBP. The patient, an obese and hypertensive smoker, was otherwise healthy and worked as a civil servant. The SCOAP-CERTAIN model predicted a high likelihood of clinically significant improvement in all of subjective functional impairment, back pain, and leg pain. Prior to considering surgery, the patient was required by us to cease smoking and lose weight.

Patient B

Patient B (male, 53 years) presented to us with progressive mechanical low back pain upon standing up and walking that had been present for over 17 years, without radiating leg pain, but with constant hypesthesia in the right thigh. Upon flexion, the pain increased in intensity. A pantaloon cast test (PCT) was carried out to simulate a fusion procedure, which led to a 90% reduction in back pain and the disappearance of the hypesthesia. After the removal of the pantaloon cast (re-challenge), the patient again developed low back pain and hypesthesia in the right thigh. The patient had never undergone spine surgery before, but had undergone multiple analgesic interventions (facet blocks and thermolesional facet interventions at L5–S1). These interventions and other conservative therapies had no lasting effects. The patient was obese but otherwise healthy, and the SCOAP-CERTAIN model predicted high chances of success after lumbar spinal fusion. We required the patient to lose weight to at least 100 kg before offering surgery.

Preoperative Imaging
