This study was conducted and reported following the recommendations of the TRIPOD statement [15, 16]. The local ethics committee approved this study (BASEC #2020-01311).
Study design and study population
In total, 2799 patients from a multicentre prospective Swiss cohort study were included. The prognostic model was developed with data from 1588 patients (1395 outpatients and 193 hospitalized patients) and validated with 1211 patients (941 outpatients and 270 hospitalized patients). Hospitalized patients treated on a normal ward and/or in an intensive/intermediate care unit (ICU/IMC) were eligible. The patients were at least 18 years old and had been diagnosed with COVID-19 (positive SARS-CoV-2 PCR test) between February and December 2020 at one of four Swiss hospitals: the University Hospital Basel, University Hospital Bern, University Hospital Zurich and Cantonal Hospital Baden. After a follow-up time of between 60 and 425 days, the patients answered a questionnaire covering patient characteristics (for example sex, age, body mass index), symptoms and severity of the acute COVID-19 infection, comorbidities, cardiovascular risk factors, and the personal situation during the COVID-19 pandemic. The questionnaire was available in German, French, Italian and English.
Cohort data for derivation and validation
According to the sample size calculation, the data set was large enough to be split into a derivation and a validation cohort. The split was made on geographical grounds: geographic validation was chosen over standard cross-validation because it provides a more meaningful assessment of how the model generalizes [17]. The derivation cohort, used for model development, consisted of the patients recruited at the University Hospital Basel. The validation cohort contained the patients from the University Hospital Zurich, University Hospital Bern and Cantonal Hospital Baden.
Outcome definition and follow-up
More than 50 symptoms have been described as potentially related to long COVID [3]. We defined a composite outcome addressing three long COVID symptoms: reduced exercise tolerance/reduced resilience, shortness of breath and tiredness (REST). The focus was placed on the REST symptoms because they are potentially treatable/modifiable, are common [2, 3] and represent a significant burden on patients. The binary REST outcome, coded 0 (no REST symptoms) or 1 (at least one of the three REST symptoms), was assessed at follow-up, which was defined as at least 60 days after the acute COVID-19 infection.
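The composite outcome defined above amounts to a simple indicator; a minimal sketch in Python (the function and argument names are illustrative, not taken from the study's analysis code):

```python
def rest_outcome(reduced_exercise_tolerance, shortness_of_breath, tiredness):
    """Binary composite REST outcome at follow-up:
    1 if at least one of the three symptoms is present, else 0."""
    return int(reduced_exercise_tolerance or shortness_of_breath or tiredness)
```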
Candidate predictors
Twelve candidate predictors were considered for inclusion in the prognostic model, covering patients' demographic information, their number of comorbidities, the presence of cardiovascular risk factors, the number of symptoms during acute COVID-19, the severity of acute COVID-19 and gender-related characteristics. A detailed description of the candidate predictors and their coding is given below:
- Patient demographics, consisting of sex (1 female, 0 male), age at presentation and body mass index
- Number of comorbidities, i.e. the sum over the following pre-existing disease groups present prior to the COVID-19 infection. Each disease group was coded 0 (absent) or 1 (present), so each group contributes at most 1 point (e.g., a maximum of 1 point for heart disease):
  - Heart disease: heart failure, narrowing of the coronary arteries and/or heart attack, congenital heart disease, heart muscle inflammation (myocarditis)
  - Vascular disease: deep vein thrombosis in the leg, pulmonary embolism, stroke, blood clotting disorder, blood disease
  - Kidney disease
  - Diseases of the immune system, including autoimmune disease
  - Lung disease: asthma, chronic obstructive pulmonary disease, pulmonary hypertension
  - Nervous system disease
  - Liver disease
  - Infectious disease, including human immunodeficiency virus and hepatitis
  - Malignant cancer, currently or within the last five years
  - Psychiatric disease
  - Rheumatic disease
- Cardiovascular risk factors, including diabetes, high blood pressure, increased cholesterol and/or family history, coded 1 (at least one condition present) or 0 (no condition present)
- Candidate predictors related to the acute COVID-19 infection:
  - Absolute number of symptoms, including fever, shortness of breath, coughing, loss of smell, loss of sense of taste, gastro-intestinal problems, physical weakness, tiredness, headache and other symptoms. The answer “I don’t remember” was counted as one symptom.
  - Severity, categorized as outpatient, hospitalized on a normal ward, or admitted to an intensive or intermediate care unit (ICU/IMC), coded 1 (outpatient), 2 (normal ward) and 3 (ICU/IMC).
- Gender-related candidate predictors:
  - Responsibility for childcare/family members, on a numeric rating scale from 1 (no responsibility or not applicable/low risk) to 6 (full responsibility/high risk).
  - Main responsibility for the household, as a categorical (factor) candidate predictor coded 1 (no/low risk), 2 (partners contribute roughly equal shares) and 3 (yes, or I live alone/high risk).
  - Feeling of stress at home, measured on a numeric rating scale from 1 (no stress/low risk) to 10 (maximum stress/high risk).
The candidate predictors were chosen based on pre-existing medical knowledge and discussions with clinical experts. With the exception of the severity of the acute COVID-19 infection and the age at presentation, the candidate predictors were obtained from the questionnaire at follow-up. The patients provided information on their health status, and no clinical measurements were collected. The severity of the acute COVID-19 infection and the patient's age at first presentation were ascertained from the patient file.
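The comorbidity coding described above (at most one point per disease group) can be sketched as follows; the group labels, dictionary layout and function name are illustrative, since the study collected this information by questionnaire rather than in code:

```python
# Illustrative coding of the comorbidity count: each of the eleven disease
# groups contributes at most one point, regardless of how many individual
# diagnoses it covers (e.g., heart failure plus myocarditis is still 1 point).
DISEASE_GROUPS = ["heart", "vascular", "kidney", "immune", "lung", "nervous",
                  "liver", "infectious", "cancer", "psychiatric", "rheumatic"]

def comorbidity_count(present):
    """Sum of 0/1 indicators over the disease groups; 'present' maps
    a group label to True if any disease in that group is present."""
    return sum(1 for g in DISEASE_GROUPS if present.get(g, False))
```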
Sample size
The minimum sample size for developing and validating a prognostic model was calculated following the recommendations of Riley et al. [18, 19]. For the prediction of the REST outcome, with an observed prevalence of approximately 22% in the derivation cohort, an estimated variance explained by the prognostic model of 15% [18], an assumed shrinkage factor of 0.932 and twelve candidate predictors (10 variables), a minimum of 1006 patients is required. This corresponds to approximately 222 events and an events-per-predictor-parameter ratio of 18 in the derivation cohort. For validation, a sample size of 988 patients is required, assuming a calibration slope of 1 with a corresponding 95% confidence interval width of 0.35 and a calibration intercept of 0. The parameters \(I_\alpha\) = 0.140, \(I_{\alpha ,\beta }\) = −0.154 and \(I_\beta\) = 0.297 (for explanations, reference is made to [19]) were calculated from the linear predictors of the prediction model developed in the derivation cohort.
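As a quick arithmetic check of the events-per-parameter figure (using the rounded 22% prevalence, which gives about 221 events rather than the approximately 222 obtained from the exact prevalence):

```python
# Rough check of the derivation-cohort figures reported above.
n_required = 1006     # minimum sample size from the Riley et al. calculation
prevalence = 0.22     # observed REST prevalence (rounded) in the derivation cohort
n_parameters = 12     # candidate predictors considered

events = prevalence * n_required   # expected number of REST events
epp = events / n_parameters        # events per predictor parameter
```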
Missing data
Very few missing values were observed in the candidate predictors (at most 1.6%, in the variable responsibility for childcare/family members). They were replaced with a non-parametric iterative imputation method using a random forest algorithm [20]. In brief, the missing values of all candidate predictors are replaced temporarily. The iterative imputation method starts with the candidate predictor \(x_s\) containing the smallest proportion of missing values. A random forest is built with the non-missing values of \(x_s\) as dependent variable and all other candidate predictors as independent variables, and the missing values of \(x_s\) are imputed based on this forest. The procedure continues with the next candidate predictor and stops when a stopping criterion is reached. The iterative imputation was performed on the derivation, validation and total cohorts with the R package missForest [21]. Data were assumed to be missing at random.
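The loop structure of this imputation scheme can be sketched as follows. The study used the R package missForest; in this self-contained sketch the per-column random forest is replaced by a trivial stand-in model (the mean of the column's observed values), and the convergence-based stopping criterion by a fixed iteration count:

```python
def iterative_impute(data, n_iter=3):
    """missForest-style loop: 'data' is a dict of equal-length lists,
    with None marking a missing value. The real algorithm fits a random
    forest per column using all other columns as predictors; here the
    mean of the observed values stands in for that model."""
    missing = {k: [i for i, v in enumerate(col) if v is None]
               for k, col in data.items()}
    # Visit columns in order of increasing proportion of missing values.
    order = sorted(data, key=lambda k: len(missing[k]))
    imputed = {k: list(col) for k, col in data.items()}
    for _ in range(n_iter):  # missForest instead stops on a convergence criterion
        for k in order:
            if not missing[k]:
                continue
            observed = [imputed[k][i] for i in range(len(imputed[k]))
                        if i not in missing[k]]
            fill = sum(observed) / len(observed)  # stand-in prediction
            for i in missing[k]:
                imputed[k][i] = fill
    return imputed
```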
Statistical analysis methods
Candidate predictors were summarized with descriptive statistics by cohort: median and interquartile range (IQR) for continuous candidate predictors, and counts and percentages for categorical candidate predictors. The standardized mean difference (SMD) between the derivation and validation cohorts was calculated; an SMD greater than 0.1 was considered to indicate imbalance [22].
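For a continuous candidate predictor, a common definition of the SMD is the absolute mean difference divided by the pooled standard deviation of the two cohorts; a minimal sketch of that formula (not the R tableone implementation used in the study):

```python
import math

def smd_continuous(x, y):
    """SMD for a continuous variable between two cohorts x and y:
    |mean(x) - mean(y)| / sqrt((var(x) + var(y)) / 2)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)  # sample variances
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    return abs(mx - my) / math.sqrt((vx + vy) / 2)
```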
The prognostic model was developed by (I) selecting relevant predictors with three different methods and (II) estimating the parameters of the selected predictors. (III) The three resulting models were geographically validated on the validation cohort, and (IV) the final prognostic model was determined. A detailed description of the four steps is given in the following.
I) Selecting relevant predictors
A global logistic regression model was fitted with the REST outcome as dependent variable and all candidate predictors as independent variables. Interactions between age and number of comorbidities, age and body mass index, age and responsibility for childcare/family members, sex and presence of at least one cardiovascular risk factor, and body mass index and number of comorbidities were preselected based on clinical reasoning and included if there was evidence for an interaction (p-value ≤ 0.05). The assumptions of the logistic regression model were checked with leverage plots and Tukey-Anscombe plots. Nonlinear relationships between individual continuous candidate predictors and the REST outcome were investigated visually by fitting a generalized additive model [23], with the REST outcome as dependent variable, all candidate predictors as independent variables and smooth functions (penalized regression splines) for the continuous or ordinal candidate predictors age, body mass index, responsibility for childcare/family members and feeling of stress at home. Because collinearity may reduce the accuracy of the estimated coefficients [17], linear relationships between continuous candidate predictors were assessed by calculating Pearson’s correlation r and the variance inflation factor (VIF) [24]. Values of |r| \(\ge\) 0.7 and/or VIF \(\ge\) 5 were considered problematic [25], since a high r or VIF indicates that the estimated regression coefficients are unstable. Variable selection was performed with three methods: augmented backward elimination (ABE) [26], adaptive best-subset selection (ABESS) [27] and model-based recursive partitioning (MBRP) [28]. Each method is described in the following.
ABE
ABE combines backward elimination with the change-in-estimate criterion; a detailed description of the method is given by Dunkler et al. [26]. In brief, ABE starts with the global logistic model. More important predictors can be predefined as ‘passive’ candidate predictors and less important ones as ‘active’ candidate predictors; ABE is performed only on active candidate predictors and candidate predictors of unknown importance. Backward elimination can be based on the significance level \(\alpha\), Akaike’s information criterion (AIC) or the Bayesian information criterion (BIC). Candidate predictors removed by backward elimination are evaluated further with the change-in-estimate criterion, which checks, each time an active candidate predictor is removed, how much the coefficients of the passive candidate predictors change. Applying backward elimination alone might falsely exclude an important predictor; the additional change-in-estimate criterion in the ABE method minimizes this risk. Backward elimination was conducted based on the AIC, with the threshold of the relative change-in-estimate criterion set to 0.05 [26], using the R package abe. Candidate predictors were specified neither as ‘passive only’ nor as ‘active only’ variables; in other words, all candidate predictors were classified as predictors of unknown importance.
ABESS
The ABESS algorithm selects a few relevant predictors out of a set of candidate predictors such that the corresponding prognostic model has the highest prediction accuracy [27]. For this, the algorithm builds an initial set \(S_1\) consisting of the candidate predictors most correlated with the outcome; the remaining candidate predictors form a second set \(S_2\). Less important candidate predictors from \(S_1\) are iteratively replaced with relevant candidate predictors from \(S_2\) until the model error is minimized. The most suitable model sparsity level is determined by a data-driven procedure using the special information criterion developed for ABESS. In R, the best-subset selection algorithm is implemented in the package abess and the function abess.
MBRP
The MBRP algorithm builds a decision tree [28]. In brief, the algorithm starts by fitting a logistic regression model to the full derivation cohort and assessing the stability of its parameters. In case of instability in the candidate predictor estimates, the derivation cohort is split into two child nodes at the candidate predictor with the highest instability; the split point that locally optimizes the negative log-likelihood is determined, and one logistic regression model is fitted per node (in other words, per split). The procedure is repeated in each child node until there is no further evidence of parameter instability. The MBRP algorithm can handle nonlinear relationships and interactions between candidate predictors, and all predictors could potentially be used for partitioning. In R, the MBRP algorithm is implemented in the function glmtree of the package partykit.
II) Estimation of parameters
The logistic regression models \(\text {M}_\text {ABE}\) and \(\text {M}_\text {ABESS}\) were fitted with the predictors selected by the ABE and ABESS methods, respectively, as independent variables and the REST outcome as dependent variable. Non-linear and non-additive (interaction) effects were investigated [26]. The logistic regression model \(\text {M}_\text {MBRP}\) was fitted with the predictors selected by the MBRP approach as independent variables and the REST outcome as dependent variable.
III) Model validation
The model performance of \(\text {M}_\text {ABE}\), \(\text {M}_\text {ABESS}\) and \(\text {M}_\text {MBRP}\) was assessed by evaluating the overall performance, the relative performance (discrimination) and the absolute performance (calibration) in the validation cohort. An internal validation was conducted as well.
Overall performance
The overall performance was assessed with the scaled Brier score. The Brier score can be calculated with Brier = \(\frac{1}{N} \sum _{i~=~1}^N(y_i - \hat{y}_i)^2\), with the number of patients N, actual outcome y and predicted probability \(\hat{y}\) for each patient i [17]. A perfect prognostic model has a Brier score of 0. Since the Brier score depends on the prevalence of the outcome, the interpretation of the Brier score can be simplified by scaling the Brier score from 0% (non-informative model) to 100% (perfect model): \(\text {Brier}_\text {scaled}~=~(1 - \frac{\text {Brier}}{\text {Brier}_\text {max}}) \times 100\), with \(\text {Brier}_\text {max}~=~\frac{1}{N} \sum _{i~=~1}^N y_i \times (1 - \frac{1}{N} \sum _{i~=~1}^N y_i)\).
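The two formulas above can be sketched directly (pure Python, for illustration only; the study computed these in R):

```python
def brier(y, p):
    """Brier score: mean squared difference between the actual outcome y
    (0/1) and the predicted probability p, over all patients."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

def brier_scaled(y, p):
    """Scaled Brier score in percent: 0 = non-informative, 100 = perfect.
    Brier_max is the Brier score of the prevalence-only model."""
    prev = sum(y) / len(y)
    brier_max = prev * (1 - prev)
    return (1 - brier(y, p) / brier_max) * 100
```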
Relative performance
Model discrimination is the model’s ability to differentiate between patients with and without the outcome [17]. It was summarized with the concordance statistic c and its corresponding 95% confidence interval, calculated with a bootstrap resampling approach. For a binary outcome, the c statistic equals the area under the receiver operating characteristic (ROC) curve (AUC), a summary statistic of the ROC curve, which plots sensitivity on the y-axis against 1 − specificity on the x-axis. A perfect prognostic model has an AUC of 1; a non-informative prognostic model has an AUC of 0.5.
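As a sketch, the c statistic can be computed over all event/non-event pairs, which is equivalent to the Mann-Whitney U formulation of the AUC (the function name is illustrative; the study used R with bootstrap confidence intervals):

```python
def c_statistic(y, p):
    """c statistic (AUC for a binary outcome): the probability that a
    randomly chosen event receives a higher predicted probability than
    a randomly chosen non-event; ties count as 0.5."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if e > ne else 0.5 if e == ne else 0.0
               for e in events for ne in nonevents)
    return wins / (len(events) * len(nonevents))
```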
Absolute performance
The agreement between the actual outcome y (grouped into suitable bins) and the predicted probability \(\hat{y}\) is shown with a calibration plot [17]. A calibration slope smaller than 1 indicates overfitting: the predicted probabilities are too extreme, i.e. too high for high-risk patients and too low for low-risk patients. In R, the function val.prob.ci.2 of the package CalibrationCurves was used to draw the calibration plots and to calculate the calibration intercept, calibration slope and AUC. The calibration intercept corresponds to calibration-in-the-large [29].
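The calibration intercept (calibration-in-the-large) is conventionally estimated as the intercept of a logistic model with logit(\(\hat{y}\)) as an offset, i.e. with the slope fixed at 1. A minimal sketch solving this one-parameter problem with scalar Newton iteration (an illustration of the quantity, not the val.prob.ci.2 implementation used in the study):

```python
import math

def calibration_in_the_large(y, p, tol=1e-10):
    """Calibration intercept a solving sum(y) = sum(sigmoid(a + logit(p_i))),
    i.e. a logistic model with logit(p-hat) as offset and slope fixed at 1.
    a > 0 means the observed event rate exceeds the predicted rate."""
    logit = [math.log(pi / (1 - pi)) for pi in p]
    a = 0.0
    for _ in range(100):                      # scalar Newton iterations
        mu = [1 / (1 + math.exp(-(a + l))) for l in logit]
        grad = sum(y) - sum(mu)               # score of the intercept
        hess = sum(m * (1 - m) for m in mu)   # observed information
        step = grad / hess
        a += step
        if abs(step) < tol:
            break
    return a
```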
IV) Final prognostic model
The final prognostic model was defined by the model with the best model performance regarding calibration intercept, calibration slope and AUC. The final prognostic model was also fitted to the total cohort (derivation and validation cohort).
Software
The statistical analysis was conducted with the statistical software R, version 4.2.0 [30]. A dynamic reporting approach with a fully scripted analysis was chosen. The following R base packages (base, datasets, graphics, grDevices, grid, methods, parallel, splines, stats, stats4, utils) and other packages (abe 3.0.1, abess 0.4.5, bestglm 0.37.3, Cairo 1.5.15, CalibrationCurves 0.1.2, colorspace 2.0.3, doParallel 1.0.17, doRNG 1.8.2, dplyr 1.0.9, foreach 1.5.2, Formula 1.2.4, ggplot2 3.3.6, Hmisc 4.7.0, iterators 1.0.14, lattice 0.20.45, leaps 3.1, libcoin 1.0.9, magrittr 2.0.3, mgcv 1.8.40, missForest 1.5, mvtnorm 1.1.3, nlme 3.1.157, partykit 1.2.15, randomForest 4.7.1.1, readxl 1.4.0, regclass 1.6, rms 6.3.0, rngtools 1.5.2, rpart 4.1.16, skimr 2.1.4, SparseM 1.81, survival 3.3.1, tableone 0.13.2, VGAM 1.1.6, VIM 6.1.1, xtable 1.8.4) were used.