Introduction

Degenerative disease of the lumbar spine, including chronic low back pain (CLBP), lumbar spinal stenosis, lumbar disc herniation, and degenerative lumbar spondylolisthesis, is among the top three causes of disability in Western societies and imposes significant direct and indirect socio-economic costs [1]. The gold standard treatment for these chronic degenerative diseases is multidisciplinary therapy comprising exercise therapy, cognitive behavioural therapy, and pharmacological therapy, although certain patients who are unresponsive to long-term conservative treatment may benefit from fusion [2, 3]. Nonetheless, with some reports showing no benefit compared with conservative treatment in a randomized population, patient selection is vitally important [4]. Various prognostic tests exist that attempt to identify subsets of patients who might truly benefit from surgery as a “last resort”, but the validity of these tests is unclear [5, 6]. A relevant proportion of patients with intractable lumbar degenerative disease resistant to conservative therapy does ultimately benefit from lumbar fusion surgery; the difficult question is how to identify these subsets reliably and how to avoid unnecessary, unsuccessful surgery [3].

Clinical prediction models can summarize a large number of factors into a single, potentially more accurate prediction of surgical risk or benefit, tailored to the individual patient [7,8,9]. The implementation of machine learning (ML) is increasing exponentially, although methodological rigour is only rarely upheld [8, 10]. Without a thorough methodological foundation, the development of clinical prediction models can easily lead to pseudo-reliable predictions with seemingly high performance measures due to issues such as data leakage, class imbalance, and overfitting [8, 11]. If clinical prediction models are not properly externally validated, their real-world performance cannot be adequately estimated, and they should not be applied in clinical practice [12, 13].

For patients with degenerative disease of the lumbar spine in whom spinal fusion surgery is considered, accurate prediction of long-term outcome in individual patients has been demonstrated to be extraordinarily difficult [5, 14]. The aim of the FUSE-ML consortium was to assemble a large multinational dataset of patients undergoing lumbar spinal fusion for degenerative disease and to use it to develop robust clinical prediction models that incorporate surgical variables and are thoroughly developed and externally validated across a range of international centres.

Methods

Overview

A substantial multinational (7 countries), multicentre (11 centres) dataset (FUSE-ML) of patients who had undergone lumbar spinal fusion for degenerative disease was used to develop and externally validate an ML-based prediction tool for mid-term patient-reported outcomes. We then briefly compared its performance to that of the, to our knowledge, only other comparable externally validated clinical prediction model [14]. This study adheres to the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) guidelines and is registered on ClinicalTrials.gov (NCT05161130) [7]. The use of patient data for research purposes was approved by each local institutional review board (IRB), and patients provided informed consent or informed consent was waived, depending on the requirements of the local IRB.

Inclusion and exclusion criteria

Patients with the following indications for thoracolumbar pedicle screw placement were considered for inclusion: degenerative pathologies (one or more of the following: spinal stenosis, spondylolisthesis, degenerative disc disease, disc herniation, failed back surgery syndrome (FBSS), radiculopathy, pseudarthrosis). Exclusion criteria were: surgery with infection, vertebral tumour, or traumatic or osteoporotic fracture as the primary indication, or deformity surgery for scoliosis or kyphosis; moderate or severe scoliosis (coronal Cobb angle > 30° or Schwab classification sagittal modifier + or ++); surgery at more than 6 vertebral levels; missing endpoint data at 12 months; lack of informed consent; and age < 18 years.

Data collection

Each centre either extracted data retrospectively, exported it from a prospective registry, or collected it in a prospective registry supplemented by retrospectively collected variables, in all cases with complete mid-term follow-up. The following data were collected: age, gender, surgical indication, index level(s), height, weight, BMI, smoking status, American Society of Anesthesiologists (ASA) score, preoperative use of opioid pain medication, bronchial asthma as a comorbidity, prior thoracolumbar spine surgery, race/ethnicity, surgical approach, pedicle screw insertion technique, and use of a minimally invasive technique. PROMs included preoperative (baseline) and 12-month postoperative Oswestry Disability Index (ODI) (scaled from 0 to 100) or Core Outcome Measures Index (COMI) for multidimensional subjective functional impairment, the numeric rating scale (NRS) for back pain severity, and the NRS for leg pain severity [15, 16].

Primary endpoint definitions

Clinically relevant improvements in functional impairment (ODI or COMI) and back/leg pain were dichotomized using the minimum clinically important difference (MCID) according to validated thresholds: improvement from baseline to 12 months postoperatively of ≥ 15 points for ODI, ≥ 2.2 points for COMI, and ≥ 2 points for NRS pain severity [17,18,19]. Thus, improvements from baseline equal to or greater than these validated thresholds were counted as achievement of MCID in the respective score.
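As a minimal illustration of this dichotomization, the following R sketch derives the three binary endpoints from baseline and 12-month scores; the data frame and its column names (e.g. odi_base, odi_12m) are hypothetical assumptions, not the actual FUSE-ML variable names.

```r
# Hypothetical sketch: deriving MCID achievement from baseline and
# 12-month PROMs (improvement = baseline minus follow-up, since higher
# scores indicate worse symptoms). df: one row per patient; all column
# names are illustrative assumptions.
mcid_odi  <- 15    # ODI threshold [17]
mcid_comi <- 2.2   # COMI threshold [18]
mcid_nrs  <- 2     # NRS back/leg pain threshold [19]

# Functional impairment: use ODI where available, otherwise COMI
df$mcid_function <- ifelse(!is.na(df$odi_base),
                           (df$odi_base  - df$odi_12m)  >= mcid_odi,
                           (df$comi_base - df$comi_12m) >= mcid_comi)
df$mcid_backpain <- (df$nrs_back_base - df$nrs_back_12m) >= mcid_nrs
df$mcid_legpain  <- (df$nrs_leg_base  - df$nrs_leg_12m)  >= mcid_nrs
```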

Clinical prediction modelling

Numerical input variables were standardized by centring and scaling and transformed using the Yeo–Johnson transformation, and highly correlated variables (Pearson correlation coefficient ≥ 0.8) were filtered out. A preoperative or postoperative ODI of ≤ 22 [20], COMI of ≤ 3.05 [21], or NRS pain severity of ≤ 3 [16] was considered a probable “patient acceptable symptom state” (PASS) [22] based on established cut-offs. Patients with a preoperative PASS (minimal symptoms) in one of the three outcome dimensions were excluded from training for that respective dimension. Recursive feature elimination based on generalized linear models (GLMs) was carried out to identify an optimal, parsimonious set of inputs for each of the three models. Subsequently, GLMs were trained with elastic net regularization using the caret library [23]. During training, hyperparameters were tuned using fivefold cross-validation with 10 repeats, maximizing the area under the receiver operating characteristic curve (AUC). A k-nearest-neighbour imputer was trained to impute missing data. The threshold for binary classification was selected based on the “closest-to-(0, 1)” criterion and rounded. The models were then integrated into a web app and underwent external validation; no recalibration was carried out. Quantile-based 95% confidence intervals (CIs) of the discrimination and calibration metrics were obtained from 1000 bootstrap resamples. Standardized model coefficients are reported to allow for model explanation [23]. Finally, the models reported by Khor et al. [14] were reconstructed from the published coefficients, and their external validation performance was compared. Notably, the Khor et al. model takes insurance status as an input, which was not available within the FUSE-ML consortium. As has been done previously, and because virtually all patients in the FUSE-ML dataset stem from countries with either single-payer healthcare or compulsory health insurance, we adopted “Medicare/Medicaid” as the most appropriate choice for the entire cohort [12]. All analyses were carried out in R version 4.1.1.
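To make this pipeline concrete, a minimal sketch using the caret [23] and pROC packages is shown below. All object and column names (dev, X_dev, y_dev, odi_base) are illustrative assumptions rather than the actual FUSE-ML code; predictors are assumed to be numeric or dummy-coded, and y_dev a factor with levels “no”/“yes” for MCID achievement.

```r
library(caret)   # preprocessing, RFE, elastic net training [23]
library(pROC)    # ROC curves and threshold selection

# 0) Exclude patients already in a patient acceptable symptom state
#    (PASS) at baseline for this outcome dimension, e.g. ODI <= 22 [20].
dev   <- dev[is.na(dev$odi_base) | dev$odi_base > 22, ]
X_dev <- dev[, predictor_cols]   # hypothetical predictor columns
y_dev <- dev$mcid_function       # factor with levels c("no", "yes")

# 1) Preprocessing learned on the development data only: k-NN imputation,
#    centring/scaling, Yeo-Johnson transformation, and removal of
#    predictors with pairwise Pearson correlation >= 0.8.
pp  <- preProcess(X_dev, method = c("knnImpute", "center", "scale",
                                    "YeoJohnson", "corr"), cutoff = 0.8)
X_t <- predict(pp, X_dev)

# 2) Recursive feature elimination with a GLM base learner.
rfe_fit <- rfe(X_t, y_dev, sizes = 2:ncol(X_t),
               rfeControl = rfeControl(functions = lrFuncs,
                                       method = "cv", number = 5))
X_sel <- X_t[, predictors(rfe_fit), drop = FALSE]

# 3) Elastic net GLM, tuned by 5-fold cross-validation with 10 repeats,
#    maximizing the area under the ROC curve.
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
fit  <- train(X_sel, y_dev, method = "glmnet", metric = "ROC",
              trControl = ctrl, tuneLength = 10)

# 4) Classification threshold via the closest-to-(0, 1) criterion, rounded.
p_dev  <- predict(fit, X_sel, type = "prob")[, "yes"]
thresh <- round(coords(roc(y_dev, p_dev), x = "best",
                       best.method = "closest.topleft")$threshold, 2)
```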

Results

Patient cohort

Data from 1115 patients were provided by the 11 participating centres in total. The development cohort comprised 8 centres (817 patients, 42.7% male, age: 61.19 ± 12.36 years), while the remaining 3 centres were used for external validation (298 patients, 35.6% male, age: 59.73 ± 12.64 years). Achievement of MCID at 12 months was recorded in 761 (68.3%) patients for functional impairment, 862 (77.3%) patients for back pain severity, and 796 (71.4%) patients for leg pain severity. An overview of patient characteristics is provided in Table 1, and detailed patient characteristics, including missingness and data per centre, are shown in Supplementary Table 1. Overall, 3074 of 52,405 baseline data fields (5.9%) were incomplete.

Table 1 Summary of patient characteristics and outcome measures

Performance evaluation

Detailed model performance, including resampled development and external validation performance, is summarized in Table 2, and standardized model coefficients, enabling judgement of variable importance, are provided in Table 3. Calibration plots are shown in Fig. 1, including resampled training calibration, external validation calibration, and calibration of the Khor et al. model applied to the FUSE-ML external validation cohort. A detailed performance comparison with the Khor et al. model is available in Supplementary Table 2.
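The quantile-based bootstrap CIs reported throughout can be reproduced conceptually with a few lines of R; y and p denote the observed binary outcomes and predicted probabilities in the external validation cohort (illustrative names).

```r
# Illustrative quantile bootstrap (1000 resamples) for the external
# validation AUC; analogous resampling yields CIs for the other metrics.
set.seed(42)
boot_auc <- replicate(1000, {
  i <- sample(seq_along(y), replace = TRUE)   # resample patients
  as.numeric(pROC::auc(y[i], p[i]))
})
quantile(boot_auc, probs = c(0.025, 0.975))   # 95% CI
```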

Table 2 Discrimination and calibration metrics of the machine learning-based prediction models for clinically relevant improvement
Table 3 Model coefficients of the fully trained models
Fig. 1

Calibration curves of the three clinical prediction models for function, back pain, and leg pain on the resampled development cohort (a–c, cross-validation performance), on the external validation cohort (d–f, FUSE-ML models at external validation), and as generated from the performance of the Khor et al. [14] prediction model applied to the FUSE-ML external validation cohort (g–i). The predicted probabilities are divided into five equally sized groups and contrasted with the observed frequencies of clinically relevant improvement. Calibration intercept and slope are calculated; a perfectly calibrated model has a calibration intercept of 0 and a slope of 1. Metrics are provided with bootstrapped 95% confidence intervals
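For reference, calibration intercept and slope as shown in the figure can be estimated via standard logistic recalibration; this sketch again assumes y (observed binary outcome) and p (predicted probability) for the validation cohort.

```r
# Logistic recalibration sketch: the slope is the coefficient of the
# linear predictor; the intercept is estimated with the linear predictor
# held as a fixed offset. Perfect calibration: intercept 0, slope 1.
lp        <- qlogis(p)   # logit of the predicted probabilities
cal_slope <- coef(glm(y ~ lp, family = binomial))["lp"]
cal_int   <- coef(glm(y ~ 1, offset = lp, family = binomial))["(Intercept)"]
```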

Prediction of functional impairment

At external validation, the FUSE-ML prediction model for clinical success in terms of functional impairment (ODI/COMI) achieved an AUC of 0.67 (95% CI: 0.59–0.74), a sensitivity of 0.59 (95% CI: 0.52–0.66), and a specificity of 0.66 (95% CI: 0.55–0.77). In terms of calibration, we measured a calibration intercept of −0.07 (95% CI: −0.36 to 0.22) and a calibration slope of 0.63 (95% CI: 0.34–0.93). The standardized model coefficients indicated that predictions were mostly driven by higher baseline ODI/COMI scores, age, lower preoperative back pain severity, and application of a lateral surgical approach. The Khor et al. model achieved an AUC of 0.71 (95% CI: 0.64–0.77) on the same external validation cohort.

Prediction of back pain severity

Prediction of clinical success in terms of back pain severity in the external validation dataset was achieved with an AUC of 0.72 (95% CI: 0.64–0.79), a sensitivity of 0.72 (95% CI: 0.65–0.77), and a specificity of 0.64 (95% CI: 0.51–0.78). The calibration intercept was −0.38 (95% CI: −0.70 to 0.06) and the calibration slope 1.10 (95% CI: 0.62–1.57). Higher baseline back pain severity and a lateral surgical approach were assigned the highest importance by the model. The Khor et al. model demonstrated an AUC of 0.73 (95% CI: 0.65–0.79) at external validation.

Prediction of leg pain severity

At external validation, 12-month leg pain severity was predicted with an AUC of 0.64 (95% CI: 0.54–0.73), a sensitivity of 0.76 (95% CI: 0.71–0.82), and a specificity of 0.42 (95% CI: 0.26–0.57). The calibration intercept was 0.14 (95% CI: −0.22 to 0.51) and the calibration slope 0.49 (95% CI: −0.12 to 0.86). Judging by the model coefficients, greater baseline leg pain, a posterior surgical approach, and the absence of prior thoracolumbar surgery contributed most to the predictions. The Khor et al. model had a corresponding AUC of 0.63 (95% CI: 0.54–0.71).

Model deployment

The prediction models were integrated into a freely available, web-based application accessible at https://neurosurgery.shinyapps.io/fuseml/.
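A shiny deployment of such models can be minimal; the sketch below is a hypothetical skeleton with two example inputs and a serialized caret model (fit.rds), not the actual FUSE-ML application.

```r
library(shiny)
library(caret)   # provides the predict method for trained models

fit <- readRDS("fit.rds")   # hypothetical serialized caret model

ui <- fluidPage(
  numericInput("age", "Age (years)", value = 60, min = 18, max = 100),
  numericInput("odi", "Baseline ODI (0-100)", value = 40, min = 0, max = 100),
  textOutput("pred")
)

server <- function(input, output) {
  output$pred <- renderText({
    # Assemble a one-row data frame matching the model's (assumed) inputs
    newdata <- data.frame(age = input$age, odi_base = input$odi)
    p <- predict(fit, newdata, type = "prob")[, "yes"]
    sprintf("Predicted probability of achieving MCID: %.0f%%", 100 * p)
  })
}

shinyApp(ui, server)   # deployable to shinyapps.io, as for FUSE-ML
```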

Discussion

The rationale of the FUSE-ML study was to develop and thoroughly externally validate clinical prediction models for 12-month MCID in ODI/COMI, back pain, and leg pain in patients undergoing lumbar fusion for degenerative disease of the lumbar spine. Using data from 11 centres in 7 countries, a web app was generated. After thorough external validation, we found that the fully trained clinical prediction models demonstrated only moderate ability to discriminate between patients who did and those who did not benefit from lumbar fusion surgery (discrimination performance). Calibration performance, that is, the reliability of the predicted probabilities, was fair. Generally, our models performed comparably to those published previously by Khor et al., although our models appeared to require only around half the inputs to achieve the same performance, which streamlines implementation.

Our findings, coupled with those reported in the literature for patients with degenerative disease of the lumbar spine, demonstrate that accurate prediction of long-term postoperative PROMs in this patient population remains remarkably difficult, and that clinical prediction models should currently have only a minor role in clinical decision-making. It is well known that even expert surgeons can overestimate the benefits and underestimate the complications of certain procedures [24]. Clinical outcomes in degenerative disease of the lumbar spine and spinal fusion, in particular CLBP, FBSS, and low-grade spondylolisthesis, are known to be distinctly difficult to anticipate, and few independent predictors with a sufficiently large effect size are known [5, 14, 25]. Taking the example of discogenic CLBP, all recent randomized studies show that fusion surgery does not, overall, produce significantly better results than conservative treatment [4]. While surgery may not provide a benefit compared with conservative treatment for CLBP in the general patient population, there are subsets of patients who will truly benefit [5, 6]. Rigorous patient selection is thus the key to success in degenerative spine surgery.

In theory, clinical prediction models can provide valuable insights, since they enable calculation of individualized likelihoods of improvement or complications for each patient, as opposed to informing patients about a generalized treatment success rate based on historical data in the literature [26]. The hope of being able to predict the effects of fusion surgery more robustly by generating “objective” risk–benefit profiles for each individual patient has not been fulfilled to date [26]. Janssen et al. [27] achieved an externally validated AUC of 0.68 for prediction of MCID in the predominant pain complaint using a nomogram. Apart from this nomogram, to our best knowledge, the only other externally validated prediction tools that predict pain and functional outcomes in this population are the prediction models of Khor et al. [14]. The latter were developed using the data of 1965 adult lumbar fusion surgery patients collected from a registry of fifteen Washington state hospitals. These models have recently been externally validated at a single Dutch centre, demonstrating AUCs of 0.71–0.83, sensitivities of 0.64–1.00, and specificities of 0.38–0.65, with fair calibration [12]. That analysis demonstrated that the discrimination and calibration performance generalized relatively well to a new population, although this level of performance unfortunately still does not allow reliable decision support in actual clinical practice. FUSE-ML is largely based on the same inputs as those used in the Khor et al. [14] tool, although we attempted to improve upon the predictions by introducing surgical variables. In our extensive, multinational external validation study, the FUSE-ML models demonstrated only moderate discrimination and calibration, both of which appeared similar to the performance of the Khor et al. models when applied to our external validation dataset. Still, judging by these performance measures, these models would likely not be very helpful in clinical practice. The discrimination and calibration performance of expert surgeons has not yet been established for lumbar fusion in degenerative disease. As long as these metrics remain unknown, and as long as comparative or randomized studies do not demonstrate the superiority of a decision-making approach integrating machine learning, these supportive tools ought to be used only adjunctively and with great caution in this patient population.

Even with the considerable amount of development data available to us for FUSE-ML, and the application of, for example, regularization techniques, outcomes after lumbar spinal fusion remained difficult to predict with high reliability. One likely contributing factor is the input data: while we included a wide range of relevant socio-demographic, disease-specific, and surgical variables, the addition of imaging data for radiomic analysis and the inclusion of psychological factors could potentially improve predictions. The rationale behind the current approach was to include only a few simple variables that are easily available preoperatively, with the intention of keeping the prediction tools simple, accessible, and quick to use. This goal was achieved: we demonstrated that our models generalized to an external validation dataset approximately as well as previously published, robust models did, although the FUSE-ML models appeared to reach the same level of performance with only around half of the inputs required [14]. This parsimony matters, because more complex models that require hard-to-collect inputs are more prone to overfitting and may not be interpretable at all (“black box”) [28, 29].

Still, even in other patient populations, there is little to no high-quality evidence that clinical prediction models have any measurable clinical impact in their current state. A simulation analysis by Joshi et al. [30] found that prediction models in adult spinal deformity may decrease overall healthcare costs only when applied on a population scale, through better redirection of resources. Prospective clinical studies evaluating the real-world impact of integrating decision support tools into practice are currently not available. All of the above indicates a need for improving the methods, performance, and in silico/in vivo validation of clinical prediction models. However, caution must be exercised: the publication of clinical prediction models has increased exponentially over the past few years, as a result of equally exponential growth in access to computing power and “big data” [8]. Exactly because it has become relatively easy to generate prediction models, many of these publications fall into common methodological ML “traps”, which can escape even the reviewers of expert medical journals. Importantly, it is relatively easy to generate prediction models with seemingly high performance measures if certain concepts are disregarded, such as class imbalance, data leakage, adequate resampling, and proper validation, among others [8, 11]. Furthermore, the vast majority of published models have not undergone external validation and would very likely perform considerably worse in external validation studies [10, 13]. A recent review by Lubelski et al. highlighted the vast methodological deficits in the spinal prediction modelling literature [10]. Lastly, the hope that ML may improve predictive performance compared with “traditional statistical modelling” has not been fulfilled, as a systematic analysis by Christodoulou et al. concludes [31]. ML certainly has advantages when analysing highly dimensional data or imaging data, or in natural language processing and time series analysis, but for the “simple” tabulated clinical data underlying most prediction models, the advantages of ML over, for example, “traditional” generalized linear models likely do not outweigh their drawbacks [8, 31].

We do not recommend the use of clinical prediction models, even those with very high performance metrics, as absolute “red light” or “green light” indicators, but advocate carefully balancing all available clinical data against patient wishes and expectations as well as clinical expertise. There is a need for improved clinical prediction models in spinal fusion for degenerative disease of the lumbar spine, and their development will require major international collaborative efforts to collect larger amounts of data and to enable thorough validation of the developed models. The FUSE-ML collaborators will continue investigating approaches to improving patient selection in this population.

Strengths and limitations

Our study used data from 11 centres in different countries, with unified variable definitions, and the models have been made available as a web-based tool. Different degenerative spinal diseases were included; because the more common pathologies dominate the training data, our models may perform better for these, whereas performance may be limited for the less prevalent ones. Conversely, this heterogeneity in the training data may equip the models for the heterogeneous presentations of degenerative spinal disease. We also directly compared the performance of our models to the current “benchmark” model in spinal fusion surgery and demonstrated approximate equivalence of performance at external validation, as well as fair calibration of our models.

Our data consisted of a mix of retrospectively and prospectively collected data from institutional registries. Many definitions of MCID, and likewise of PASS, exist, and their choice determines the interpretation of the generated predictions [15]. We chose MCID thresholds based on robust MCID studies [17,18,19], and we excluded patients unlikely to improve by determining a minimally symptomatic state (PASS) based on thresholds from analyses anchored to patient-rated symptom satisfaction [16, 20, 21]. Our prediction tool does not include measures of quality of life or psychological factors, which might improve performance. Learning techniques rely on large amounts of development data, and their performance often continues to improve as the number of training samples grows. Thus, although we included a relatively large cohort of patients, further training with a larger sample is likely to improve the performance and generalization of the models. Finally, we excluded patients under the age of 18 and those with spinal deformity, and our models may not generalize when extrapolated to these patients.

Conclusions

Given the great heterogeneity of outcomes after lumbar spinal fusion for degenerative disease and the countless physical and psychological factors that may modulate the effects of the procedure, identifying those patients most likely to benefit from surgical treatment in an objective fashion remains difficult. Although assistive clinical prediction models can help quantify the potential benefits of surgery, and the externally validated FUSE-ML tool (https://neurosurgery.shinyapps.io/fuseml) may aid in individualized risk–benefit estimation, truly impacting clinical practice in the era of “personalized medicine” will necessitate improvements in the reliability of clinical prediction models in this patient population. When thoroughly externally validated, current approaches based on tabulated clinical data fail to reach the level of performance required to reliably prevent ineffective surgery or to allow meaningful decisions that are at least partially informed by such clinical prediction models.