Validation and updating of risk models based on multinomial logistic regression
- Ben Van Calster^{1, 2},
- Kirsten Van Hoorde^{3},
- Yvonne Vergouwe^{2},
- Shabnam Bobdiwala^{4},
- George Condous^{5},
- Emma Kirk^{6},
- Tom Bourne^{1, 4, 7} and
- Ewout W. Steyerberg^{2}
https://doi.org/10.1186/s41512-016-0002-x
© The Author(s) 2017
Received: 3 April 2016
Accepted: 9 September 2016
Published: 8 February 2017
Abstract
Background
Risk models often perform poorly at external validation in terms of discrimination or calibration. Updating methods are needed to improve performance of multinomial logistic regression models for risk prediction.
Methods
We consider simple and more refined updating approaches to extend previously proposed methods for dichotomous outcomes. These include model recalibration (adjustment of intercept and/or slope), revision (re-estimation of individual model coefficients), and extension (revision with additional markers). We suggest a closed testing procedure to assist in deciding on the updating complexity. These methods are demonstrated on a case study of women with pregnancies of unknown location (PUL). A previously developed risk model predicts the probability that a PUL is a failed, intra-uterine, or ectopic pregnancy. We validated and updated this model on more recent patients from the development setting (temporal updating; n = 1422) and on patients from a different hospital (geographical updating; n = 873). Internal validation of updated models was performed through bootstrap resampling.
Results
Contrary to dichotomous models, we noted that recalibration can also affect discrimination for multinomial risk models. If the number of outcome categories is higher than the number of variables, logistic recalibration is obsolete because straightforward model refitting does not require the estimation of more parameters. Although recalibration strongly improved performance in the case study, the closed testing procedure selected model revision. Further, revision of functional form of continuous predictors had a positive effect on discrimination, whereas penalized estimation of changes in model coefficients was beneficial for calibration.
Conclusions
Methods for updating of multinomial risk models are now available to improve predictions in new settings. A closed testing procedure is helpful to decide whether revision is preferred over recalibration. Because multicategory outcomes increase the number of parameters to be estimated, we recommend full model revision only when the sample size for each outcome category is large.
Keywords
Calibration; Discrimination; Model updating; Multicategory outcome; Multinomial logistic regression; Prediction models; Risk models
Background
Prior to implementing risk prediction models in clinical practice to assist in patient management, their performance needs to be rigorously validated. Core elements of performance include discrimination (i.e., how well the model discriminates between the different categories) and calibration (i.e., the reliability of the predicted risks) [1, 2]. It is of particular importance to externally validate the model using data collected later in time (temporal validation) and/or in different locations or hospitals (geographical validation) [3, 4]. Disappointing validation results do not necessarily imply that the previously developed prediction model should be discarded, because the model contains crucial information such as which predictors are considered relevant. An attractive alternative is to perform some form of model updating, where we combine information from the previously developed model with new data [5]. This approach has clear practical relevance, because it is often not realistic to expect that a single model will work in all settings, due to differences in patient management protocols and referral patterns across centers and regions, and improvements in care over time. Updating is particularly useful for model validation in settings with different patient populations (e.g., primary vs secondary care), sometimes labeled “domain validation,” because of likely differences in case-mix, event rates, predictor definitions, and measurement methods [6].
Methods to update risk models for dichotomous outcomes focus on recalibration, revision, and extension [1, 5, 7]. Recalibration merely adjusts the model intercept and/or overall slope, where an overall slope adjustment implies a fixed proportional adjustment of all predictor coefficients. Model revision adjusts the individual model coefficients, and model extension refits the model while including additional markers.
Van Hoorde and colleagues assessed how dichotomous recalibration and revision techniques could be extended to multicategory outcomes for which risk estimation was based on a sequence of dichotomous logistic regression models (sequential dichotomous modeling) [8]. The aim of the current paper is to introduce methods to directly update risk models for multicategory outcomes based on multinomial logistic regression, which is the most commonly used method for nominal outcomes. We present recalibration, revision, and extension methods and a statistical test to direct the preferred strategy. We illustrate these methods on a case study.
Methods
Case study
Descriptive statistics for the case study of multicategory outcome prediction: original development data of model M4 (n = 197), the temporal updating data at SGH (n = 1422), and the geographical updating data at QCCH (n = 873)
| | Original development data (SGH) n = 197 | Temporal updating (SGH) n = 1422 | Geographical updating (QCCH) n = 873 |
|---|---|---|---|
| Age (years) | 30 (25–33) | 31 (26–35) | 32 (27–32) |
| Initial hCG (IU/L) | 265 (76–618) | 410 (154–941) | 530 (197–1563) |
| hCG ratio | 0.80 (0.33–1.99) | 1.04 (0.39–2.10) | 0.65 (0.34–1.49) |
| Initial progesterone (nmol/L) | 17 (4–66) | 21 (4–61) | 9 (3–34) |
| Outcome, n (%) | | | |
| Failed | 109 (55%) | 717 (50%) | 502 (58%) |
| IUP | 76 (39%) | 577 (41%) | 245 (28%) |
| Ectopic | 12 (6%) | 128 (9%) | 126 (14%) |
After applying the exclusion criteria (see the Appendix), data from 1422 (88%) patients were available at SGH and from 873 (80%) women at QCCH. At SGH, there were 717 (50%) failed PULs (FPUL), 577 (41%) intrauterine pregnancies (IUP), and 128 (9%) ectopic pregnancies (EP). The QCCH data contained 502 (58%) FPUL, 245 (28%) IUP, and 126 (14%) EP.
Updating methods
Updating methods for multinomial logistic regression models with the numbers of parameters that are estimated for updating in general and in the case study
| Category | Method and description | Number of parameters (general = case study) |
|---|---|---|
| Original | 0—no adjustments | 0 = 0 |
| Recalibration | 1—intercept recalibration: adjust intercepts | (k − 1) = 2 |
| Recalibration | 2—logistic recalibration: adjust intercepts and slopes | k × (k − 1) = 6 |
| Revision | 3—refitting: re-estimation of individual coefficients | (q + 1) × (k − 1) = 8 |
| Revision | 4—penalized refitting using recalibrated coefficients from method 2 as offset | (k + q + 1) × (k − 1) = 14 |
| Revision | 5—refitting including functional form: method 3, but hCGr modeled with rcs | (q′ + 1) × (k − 1) = 8 |
| Extension | 6—extension: similar to method 3 but log(progesterone) added | (q + m + 1) × (k − 1) = 10 |
| Extension | 7—penalized extension: similar to method 4 but log(progesterone) added | (k + q + m + 1) × (k − 1) = 16 |
We denote the number of outcome categories with k; in our case study, k = 3. Next, we distinguish between LP_{FvsI} and LP_{EvsI} on the one hand and LP_{x,FvsI} and LP_{x,EvsI} on the other. LP_{FvsI} and LP_{EvsI} are the linear predictors of the original M4 (Eq. 1), whereas LP_{x,FvsI} and LP_{x,EvsI} are the updated linear predictors for updating method x, with x = 1, …, 7. Finally, q denotes the number of variables in the model (i.e., including nonlinearity and interaction terms).
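For concreteness, the link between the k − 1 linear predictors and the k predicted risks can be sketched as follows (a minimal Python illustration with hypothetical linear predictor values; in the case study, the two linear predictors contrast FPUL and EP against the reference category IUP):

```python
import numpy as np

def multinomial_probs(linear_predictors):
    """Convert k-1 linear predictors (reference category implicit)
    into k predicted risks via the multinomial logit link."""
    lp = np.asarray(linear_predictors, dtype=float)
    exp_lp = np.exp(lp)
    denom = 1.0 + exp_lp.sum()
    # reference category first, then the k-1 modeled categories
    return np.concatenate(([1.0 / denom], exp_lp / denom))

# Example with hypothetical values for LP_FvsI and LP_EvsI
probs = multinomial_probs([1.2, -0.5])
```

The k predicted risks always sum to 1, so a model with k outcome categories only needs k − 1 linear predictors.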
Reference method
Method 0 applies the original prediction model without any adjustments.
Intercept recalibration
Note that multinomial logistic regression models have k − 1 linear predictors, with k the number of outcome categories. In each equation of Eq. 3, only one linear predictor corresponds to the outcomes that are compared (the “corresponding” linear predictor); the other linear predictors are labeled “non-corresponding.” For intercept recalibration, the coefficients of the corresponding linear predictors are fixed at 1 and those of the non-corresponding linear predictors at 0, so that only the intercepts are re-estimated. This update aims to improve calibration-in-the-large by aligning observed event rates and mean predicted risks [13].
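Intercept recalibration can be sketched as re-estimating the k − 1 intercepts by maximum likelihood while the original linear predictors enter as a fixed offset with slope 1 (a Python illustration rather than the authors' R code; function and variable names are ours):

```python
import numpy as np
from scipy.optimize import minimize

def recalibrate_intercepts(lp, y, k=3):
    """Method 1 sketch: re-estimate the k-1 intercepts while keeping
    the original linear predictors lp (an n x (k-1) array) as a fixed
    offset with slope 1; y codes the outcome as 0..k-1, with 0 the
    reference category."""
    def nll(a):
        z = lp + a                           # shift each linear predictor
        denom = 1.0 + np.exp(z).sum(axis=1)
        ll = -np.log(denom)                  # reference-category term
        for j in range(1, k):
            ll = ll + np.where(y == j, z[:, j - 1], 0.0)
        return -ll.sum()
    return minimize(nll, np.zeros(k - 1), method="BFGS").x
```

With constant linear predictors, the recalibrated intercepts simply reproduce the observed log odds of each category versus the reference, which is exactly the calibration-in-the-large adjustment.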
Logistic recalibration
Intercepts (α), coefficients for corresponding linear predictors (β), and coefficients for non-corresponding linear predictors (γ) are estimated in order to update the regression coefficients [13]. This method corrects miscalibration of the predicted probabilities from M4, such that there is no general over- or underestimation of risks and such that predicted risks are on average neither overly extreme nor overly modest. It may be surprising that the coefficients for non-corresponding linear predictors are not fixed at zero. Only if the original model is correct for the updating population will all βs equal 1 and all γs equal 0 in Eq. 4. When logistic recalibration is performed with all γs set to 0, there is no unique result: the updated model differs depending on the choice of reference category in the recalibration model [13]. In the Appendix, we work out the logistic recalibration formula for the case study.
Model refitting by re-estimating individual coefficients
Method 3 re-estimates all intercepts and predictor coefficients of M4 on the updating data, without reference to the original coefficient values.
Model refitting by penalized estimation of differences with recalibrated coefficients
Adding the linear predictors as offset implies that the changes in the intercepts and predictor coefficients with respect to method 2 are modeled. We used ridge penalization on these changes to shrink coefficients to their recalibrated values in order to prevent an overly complex model leading to too extreme risk predictions [14]. Without such penalization, methods 3 and 4 would be identical.
Model refitting including reassessment of functional form
Because M4 was originally developed on a small dataset, the quadratic effect for hCG ratio may be inadequate or the result of overfitting. Given that hCG ratio is the most important predictor, it is worthwhile to reassess its functional form. When using restricted cubic splines (rcs) with three knots, the number of parameters used to model hCG ratio remains two. We decided to keep the log-transformation for the average hCG level to limit the overall complexity of the model.
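A restricted cubic spline basis with three knots (Harrell's parameterization) can be sketched as follows; the knot locations below are illustrative, and in practice knots would be placed at quantiles of the observed hCG ratio:

```python
import numpy as np

def rcs_basis_3knots(x, knots):
    """Restricted cubic spline basis with 3 knots: returns x itself
    plus one nonlinear term, so the predictor still uses only two
    parameters (as with the original quadratic term). The basis is
    constrained to be linear before the first and beyond the last
    knot."""
    t1, t2, t3 = knots
    def pos3(u):
        return np.maximum(u, 0.0) ** 3
    nonlin = (pos3(x - t1)
              - pos3(x - t2) * (t3 - t1) / (t3 - t2)
              + pos3(x - t3) * (t2 - t1) / (t3 - t2)) / (t3 - t1) ** 2
    return np.column_stack([x, nonlin])
```

The linear-tail constraint is what distinguishes rcs from an unrestricted cubic spline and from the quadratic term, whose influence grows without bound for large hCG ratios.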
Model extension by refitting and adding a novel marker
Model extension using penalization
Closed testing procedure
Description of the closed testing procedure for updating of multinomial logistic regression models
| Step | Procedure |
|---|---|
| 1. Original model vs refitting | H0: both models have the same fit, log L_{original} = log L_{refitted}. Test: likelihood ratio test with (q + 1) × (k − 1) df. Result: if H0 is not rejected, choose the original model; else go to step 2. |
| 2. Intercept recalibration vs refitting | H0: both models have the same fit, log L_{int recal} = log L_{refitted}. Test: likelihood ratio test with q × (k − 1) df. Result: if H0 is not rejected, choose intercept recalibration; else go to step 3. |
| 3. Logistic recalibration vs refitting | H0: both models have the same fit, log L_{log recal} = log L_{refitted}. Test: likelihood ratio test with (q − k + 1) × (k − 1) df. Result: if H0 is not rejected, choose logistic recalibration; else choose refitting. |
- A.
Predetermine the alpha level α.
- B.
Use a likelihood ratio test at level α to compare the refitted model with the original model, with (q + 1) × (k − 1) degrees of freedom. If H0 is not rejected (p value > α), there is no statistically significant improvement in fit of the refitted model over the original model; the procedure stops and the original model is selected. If the test is significant, proceed to the next step.
- C.
Use a likelihood ratio test at level α to compare refitting with intercept recalibration, with q × (k − 1) degrees of freedom. If the test is not rejected, the procedure stops and intercept recalibration is selected. If the test is significant, proceed to the next step.
- D.
If q < k (fewer variables than outcome categories), logistic recalibration is obsolete and refitting is selected. Else, use a likelihood ratio test at level α to compare refitting with logistic recalibration, with (q − k + 1) × (k − 1) degrees of freedom. If the test is not rejected, logistic recalibration is selected. If the test is significant, refitting is selected.
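The steps above can be sketched in a few lines (Python; the log-likelihood values and returned labels are illustrative):

```python
from scipy.stats import chi2

def lr_test(loglik_simple, loglik_complex, df):
    """Likelihood ratio test of a nested (simpler) updating method
    against refitting; returns the p value."""
    stat = 2 * (loglik_complex - loglik_simple)
    return chi2.sf(stat, df)

def closed_testing(ll_orig, ll_int, ll_logrecal, ll_refit, q, k, alpha=0.05):
    """Closed testing procedure (steps B-D above)."""
    if lr_test(ll_orig, ll_refit, (q + 1) * (k - 1)) > alpha:
        return "original"
    if lr_test(ll_int, ll_refit, q * (k - 1)) > alpha:
        return "intercept recalibration"
    if q < k:  # logistic recalibration obsolete: no fewer parameters than refitting
        return "refitting"
    if lr_test(ll_logrecal, ll_refit, (q - k + 1) * (k - 1)) > alpha:
        return "logistic recalibration"
    return "refitting"
```

Because each comparison is against the refitted model, the sequence controls the overall type I error while stepping from the simplest to the most complex update.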
Ridge penalization
Methods 4 and 7 use ridge penalization which fits models using penalized maximum likelihood in order to obtain more stable and shrunken coefficients [14]. In methods 4 and 7, the coefficients of the predictors are shrunken towards the coefficients following logistic recalibration (method 2). Ridge penalization is implemented with the glmnet package in R [19]. The regularization parameter λ of the ridge penalty was estimated using 10-fold cross-validation with the deviance as performance criterion.
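The penalized estimation of coefficient changes can be sketched as follows (a simplified Python illustration of the idea behind methods 4 and 7 rather than the glmnet implementation; for simplicity it penalizes the intercept changes as well, which glmnet by default does not, and the penalty λ is passed in rather than chosen by cross-validation):

```python
import numpy as np
from scipy.optimize import minimize

def penalized_refit(X, offset, y, lam, k=3):
    """Method 4 sketch: model the changes (deltas) in intercepts and
    coefficients relative to the recalibrated model, whose linear
    predictors enter as a fixed offset (n x (k-1)). A ridge penalty
    lam shrinks the deltas towards zero, i.e., towards the
    recalibrated coefficients. With lam = 0 this reduces to
    straightforward refitting (method 3)."""
    n, q = X.shape
    Xd = np.hstack([np.ones((n, 1)), X])      # prepend intercept column
    def pen_nll(theta):
        delta = theta.reshape(k - 1, q + 1)
        z = offset + Xd @ delta.T             # updated linear predictors
        denom = 1.0 + np.exp(z).sum(axis=1)
        ll = -np.log(denom)
        for j in range(1, k):
            ll = ll + np.where(y == j, z[:, j - 1], 0.0)
        return -ll.sum() + lam * (theta ** 2).sum()
    theta0 = np.zeros((k - 1) * (q + 1))
    return minimize(pen_nll, theta0, method="BFGS").x.reshape(k - 1, q + 1)
```

As λ grows, all deltas shrink to zero and the penalized refit collapses onto the logistic recalibration solution supplied through the offset.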
Missingness
Extension methods (methods 6 and 7) add the progesterone level at presentation to the model. At SGH 47 (3.3%) women and at QCCH 109 (12.5%) women had a missing value for progesterone. We used single imputation to deal with the missing values for the current illustrative study, although multiple imputation might be preferred to fully account for uncertainty in the imputation process [20]. The log-transformed progesterone level was imputed via fully conditional specification that included age, the logarithm of hCG0, the logarithm of hCG48, and outcome [21].
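An FCS-style single imputation can be sketched with scikit-learn's IterativeImputer as a stand-in for the authors' implementation (the data below are synthetic placeholders for the five variables named above):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical stand-in data: columns for age, log(hCG0), log(hCG48),
# outcome code, and log(progesterone), the last with missing values.
rng = np.random.default_rng(1)
data = rng.normal(size=(200, 5))
data[rng.random(200) < 0.1, 4] = np.nan  # ~10% missing progesterone

# Fully conditional specification: each incomplete variable is imputed
# from regressions on the other variables, iterated until stable.
imputer = IterativeImputer(max_iter=10, random_state=0)
data_imputed = imputer.fit_transform(data)
```

Multiple imputation would repeat this with different random draws and combine the analyses, which better reflects imputation uncertainty [20].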
Performance evaluation: discrimination and calibration
Performance was evaluated using measures for discrimination and calibration. Optimism-corrected performance was based on bootstrap internal validation (500 bootstrap resamples), as recommended in [22].
Overall discrimination is evaluated using the polytomous discrimination index (PDI), a nominal version of the c-statistic [23]. For a set of patients containing one patient from each outcome category (i.e., a set of size k), PDI estimates the probability that a patient from a randomly chosen outcome category is correctly identified by the model. The patient from outcome category i is correctly identified in a set if this patient has the highest predicted risk of category i. A PDI of 0.7 means that, on average, an estimated 70% of the patients in such a set are correctly identified. Random performance corresponds to a PDI of 1/k [23]. In addition, c-statistics for all pairs of categories are calculated using the conditional risk method [24].
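A naive PDI estimator can be sketched as follows (Python; this version enumerates all sets and gives full credit whenever a patient attains the maximum, without the tie handling of the published estimator, so it is only practical for small datasets):

```python
import numpy as np
from itertools import product

def pdi(probs, y, k=3):
    """Naive polytomous discrimination index: over all sets containing
    one patient from each outcome category, the proportion of set
    members that are correctly identified, i.e., have the highest
    predicted risk for their own category within the set. probs is
    n x k; y codes the outcome as 0..k-1."""
    groups = [np.where(y == j)[0] for j in range(k)]
    total, correct = 0, 0
    for idx in product(*groups):          # one patient per category
        for j, i in enumerate(idx):
            total += 1
            # patient i (true category j) must top the set on risk of j
            if probs[i, j] == max(probs[m, j] for m in idx):
                correct += 1
    return correct / total
```

A perfect model attains a PDI of 1, whereas uninformative predictions drift towards 1/k.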
The common definition of calibration is that predicted risks should correspond to observed proportions per level of predicted risk: for patients with an estimated risk of event of 0.3, we expect 30% to have/develop this event. To assess calibration, we calculated calibration intercepts and calibration slopes (with 95% CI) [13]. Ideally, we expect calibration intercepts of 0 and a calibration slope of 1. The calibration intercepts indicate whether the risks are systematically overestimated (if <0) or underestimated (if >0). The calibration slopes indicate the presence of too extreme (if <1) or too modest (if >1) risk predictions. For the original model, we derive flexible calibration curves based on vector splines using the VGAM package in R [25]. This is similar to dichotomous calibration plots where observed proportions are based on loess or spline-based analyses [26, 27].
As an overall measure of performance that combines discrimination and calibration, the Brier score was computed. Brier scores were also optimism-corrected.
Results
Validation of the original M4 model
Polytomous discrimination index, pairwise c-statistics, and Brier score on the updating data after correction for optimism using bootstrapping
Updating method | PDI | c-statistic FPUL-IUP | c-statistic FPUL-EP | c-statistic IUP-EP | Brier |
---|---|---|---|---|---|
Temporal updating (SGH) | |||||
No updating | 0.87 (0.85–0.90) | 0.99 (0.98–0.99) | 0.92 (0.89–0.94) | 0.89 (0.85–0.93) | 0.172 (0.155–0.190) |
Intercept recalibration | 0.87 (0.84–0.89) | 0.99 (0.98–0.99) | 0.92 (0.89–0.94) | 0.89 (0.85–0.92) | 0.165 (0.148–0.183) |
Logistic recalibration | 0.88 (0.85–0.90) | 0.99 (0.98–>0.99) | 0.93 (0.91–0.95) | 0.91 (0.88–0.94) | 0.158 (0.143–0.173) |
Refitting | 0.88 (0.85–0.90) | 0.99 (0.99–>0.99) | 0.93 (0.91–0.95) | 0.91 (0.88–0.94) | 0.157 (0.141–0.172) |
Penalized refitting | 0.88 (0.85–0.90) | 0.99 (0.98–>0.99) | 0.93 (0.91–0.95) | 0.91 (0.88–0.94) | 0.158 (0.142–0.172) |
Refitting + rcs | 0.88 (0.86–0.91) | 0.99 (0.98–>0.99) | 0.93 (0.91–0.95) | 0.92 (0.89–0.95) | 0.153 (0.137–0.168) |
Extension | 0.89 (0.87–0.92) | 0.99 (0.99–>0.99) | 0.93 (0.92–0.95) | 0.93 (0.90–0.95) | 0.150 (0.135–0.165) |
Penalized extension | 0.89 (0.87–0.92) | 0.99 (0.99–>0.99) | 0.93 (0.92–0.95) | 0.93 (0.90–0.95) | 0.150 (0.135–0.165) |
Geographical updating (QCCH) | |||||
No updating | 0.80 (0.77–0.83) | 0.95 (0.93–0.97) | 0.91 (0.88–0.94) | 0.84 (0.79–0.88) | 0.286 (0.258–0.314) |
Intercept recalibration | 0.80 (0.77–0.83) | 0.95 (0.93–0.97) | 0.91 (0.88–0.94) | 0.84 (0.79–0.88) | 0.278 (0.247–0.310) |
Logistic recalibration | 0.80 (0.77–0.83) | 0.96 (0.94–0.97) | 0.93 (0.90–0.95) | 0.84 (0.79–0.88) | 0.267 (0.243–0.291) |
Refitting | 0.80 (0.77–0.83) | 0.96 (0.94–0.97) | 0.94 (0.92–0.96) | 0.84 (0.80–0.88) | 0.266 (0.243–0.291) |
Penalized refitting | 0.80 (0.77–0.83) | 0.96 (0.94–0.97) | 0.94 (0.91–0.95) | 0.84 (0.79–0.88) | 0.265 (0.242–0.289) |
Refitting + rcs | 0.82 (0.79–0.85) | 0.96 (0.94–0.97) | 0.94 (0.92–0.96) | 0.85 (0.81–0.89) | 0.261 (0.237–0.284) |
Extension | 0.81 (0.78–0.84) | 0.96 (0.94–0.97) | 0.94 (0.92–0.96) | 0.84 (0.80–0.88) | 0.262 (0.238–0.287) |
Penalized extension | 0.81 (0.78–0.84) | 0.96 (0.94–0.97) | 0.94 (0.92–0.96) | 0.84 (0.80–0.88) | 0.263 (0.239–0.287) |
Theoretical results
Due to the non-corresponding linear predictors, the number of coefficients to be estimated in logistic recalibration increases quickly with the number of outcome categories. Logistic recalibration requires the estimation of k − 1 intercepts and (k − 1)^{2} coefficients, hence k × (k − 1) parameters in total. Straightforward refitting of the model requires k − 1 intercepts and q × (k − 1) coefficients, hence (q + 1) × (k − 1) parameters in total. This implies that logistic recalibration is obsolete if q < k, because then refitting does not require more parameters.
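This parameter bookkeeping is easy to verify (a small helper of ours, reproducing the counts in the methods table):

```python
def n_params(k, q):
    """Parameters needed by logistic recalibration vs straightforward
    refitting of a multinomial model with k outcome categories and q
    variables (including nonlinearity and interaction terms)."""
    recal = k * (k - 1)        # (k-1) intercepts + (k-1)^2 slopes
    refit = (q + 1) * (k - 1)  # (k-1) intercepts + q*(k-1) coefficients
    return recal, refit

# Case study: k = 3 outcome categories, q = 3 model variables
assert n_params(3, 3) == (6, 8)
# With q < k, refitting needs no more parameters than recalibration
assert n_params(4, 2) == (12, 9)
```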
Discrimination
Discrimination improved only slightly with more elaborate updating. An interesting finding is that recalibration can affect discrimination, which is not possible for dichotomous risk models. For intercept recalibration, this effect was so small that it was not visible when rounding c-statistics to two decimals (Table 4). Logistic recalibration clearly improved discrimination (Table 4). For temporal updating, the PDI increased to 0.88 and the c-statistics to 0.93 for FPUL vs EP and 0.91 for IUP vs EP (Table 4). For geographical updating, the PDI remained at 0.80, but c-statistics increased to 0.96 (FPUL vs IUP) and 0.93 (FPUL vs EP) (Table 4). Refitting, refitting with inclusion of functional form, and extension led to further small improvements in discrimination. Model extension increased the PDI by +0.01 (geographical updating) or +0.02 (temporal updating) and pairwise c-statistics by at most +0.04.
Calibration
Intercept recalibration improved calibration substantially by correcting calibration-in-the-large (Fig. 2). Due to overfitting of M4, logistic recalibration further improved calibration: the calibration slopes improved to 0.98 (FPUL vs IUP) and 0.97 (EP vs IUP) at temporal updating and to 1 and 0.98 at geographical updating (Fig. 2). Calibration remained good for more elaborate updating methods, although refitting and extension led to slightly lower calibration slopes. This was corrected when penalized versions of these methods were used.
Overall performance (Brier) and closed testing procedure
The Brier score of the original model was 0.172 at temporal updating and 0.286 at geographical updating (Table 4). This improved gradually when intercept recalibration (0.165 and 0.278), logistic recalibration (0.158 and 0.267), or refitting (0.157 and 0.266) was used. Revision with inclusion of functional form improved Brier scores to 0.153 and 0.263.
Model extension resulted in Brier scores of 0.150 and 0.262. The likelihood ratio tests for the log of progesterone indicated its predictive value at temporal updating (OR FPUL vs IUP = 0.19 (95% CI, 0.12 to 0.30); OR EP vs IUP = 0.30 (0.19 to 0.46)) and at geographical updating (OR FPUL vs IUP = 0.60 (0.44 to 0.81); OR EP vs IUP = 0.67 (0.50 to 0.91)).
Results of the closed testing procedure
| Step | df | Temporal updating (SGH) | Geographical updating (QCCH) |
|---|---|---|---|
| 1. Original model vs refitting | 8 | Δℓ = 241.6, p < 0.0001 | Δℓ = 212.1, p < 0.0001 |
| 2. Intercept recalibration vs refitting | 6 | Δℓ = 169.1, p < 0.0001 | Δℓ = 172.7, p < 0.0001 |
| 3. Logistic recalibration vs refitting | 2 | Δℓ = 20.2, p < 0.0001 | Δℓ = 22.8, p < 0.0001 |
The updated model coefficients for each method and each dataset are provided in the Appendix.
Discussion
In this paper, we propose methods to update risk models based on multinomial logistic regression. As a case study, the M4 model to predict the outcome of pregnancies of unknown location [12] was updated temporally (using more recent data from the same setting) and geographically (using data from a different hospital). Seven updating methods were considered: two recalibration methods (intercept recalibration, logistic recalibration), three revision methods (refitting of individual coefficients, penalized refitting of individual coefficients, and refitting with reassessment of functional form of the most important predictor), and two extension methods (straightforward and penalized extension). A closed testing procedure was introduced to select between no updating, intercept recalibration, logistic recalibration, and refitting.
Conclusions for the case study on the M4 model
The original M4 model was poorly calibrated in both updating settings, but discrimination was very good. Steady but mild improvements in discrimination were observed when increasingly elaborate methods were used. The closed testing procedure suggested refitting in both updating settings. This was likely due to (1) slightly better discrimination, (2) the fact that revision methods should improve the average accuracy of risk predictions per individual, and (3) large validation sample size. Reassessment of functional form appeared to further improve discrimination, whereas penalized refitting had beneficial impact on calibration. Extending the model with progesterone further improved model discrimination.
Differences between updating of dichotomous vs multinomial risk models
Some dissimilarities can be seen between dichotomous and nominal updating methods. For dichotomous outcomes, recalibration does not change the c-statistic, since this is a rank order statistic not affected by linear transformation. A multinomial logistic model contains multiple linear predictors, one for each outcome category vs the reference category. Because of different adaptations to the linear predictors, recalibration methods can change the PDI (nominal c-statistic) and the pairwise c-statistics.
For multinomial risk models with k outcome categories, we have k − 1 calibration intercepts and (k − 1)^{2} calibration slopes. Hence, logistic recalibration is more complicated for multinomial logistic risk models and in fact even becomes obsolete when k is higher than the number of variables q in the original risk model (excluding intercepts, but including nonlinear and interaction terms). In these situations, logistic recalibration requires at least as many parameters as straightforward refitting.
In addition, a multinomial calibration plot is more complex than a dichotomous one [13]. For the former, we have one curve for each outcome category while for the latter, a single curve is sufficient. The same predicted risk for one category can be associated with different observed proportions depending on the predicted risks for the other categories [13]. Therefore, irrespective of whether a logistic or flexible calibration analysis is used, smoothing is needed in the calibration plots to visualize the overall relationship between predicted risks and observed proportions.
Finally, the number of parameters to be estimated increases with the number of outcome categories. For example, if the original model has q variables, straightforward refitting requires (q + 1) × (k − 1) parameters. Hence, the more categories, the more cumbersome model revision becomes.
Choice of updating method
Calibration can often be strongly improved with simple intercept and/or slope adjustments [1, 7, 8]. When updating a prediction model that was originally based on a small sample, as in our case study, intercept recalibration will typically be insufficient because the original model is likely overfitted. Recalibration corrects problems with the calibration intercepts and slopes; this was recently described as “weak” calibration [26]. In contrast, “strong” calibration is defined as the correspondence between predicted probabilities and observed proportions per covariate pattern. This is a utopian goal in empirical studies [26], but if we wish to approach strong calibration, revision methods should be preferred. These methods aim to correct bias in individual model coefficients and hence should on average lead to more accurate predictions per covariate pattern. In our case study, the closed testing procedure indicated that refitting was required, although the differences in discrimination and calibration performance measures were minor.
In research on updating methods for risk models, the functional form or optimal transformation of continuous predictors has thus far received limited attention. However, transformations used in the original model may not hold for every setting in which the model may be used. Settings will for example vary with respect to the homogeneity of the patient population, or the transformation used in the original model may be the result of overfitting. Our case study also showed that the functional form of the effect of hCG ratio could be improved from the original model.
In theory, one would always prefer revision methods in order to make optimal adjustments to the model. Computing power will usually not be an issue, but rather the available sample size is a key determinant for the choice of updating method. If sample size is limited, recalibration may already give very good value for money whereas revision may require too much from the available data. However, when sample size is large, model revision will help to further improve discrimination and/or accuracy of predicted risks [26, 29]. Reliably reassessing functional form may require even more data (e.g., updating a linearly modeled covariate with restricted cubic splines). For multinomial models, the number of outcome categories k is important as well. If k is larger than the number of model variables q in the original model, logistic recalibration is obsolete.
Existing evidence for dichotomous risk models recommends at least 100–200 cases in the smallest outcome category for reliable model validation [26, 30, 31]. Further, based on common guidelines for developing dichotomous models, at least 10 cases per coefficient in the smallest outcome category would be recommended to use model revision [1, 32]. The total number of coefficients equals q × (k − 1) for multinomial updating. For straightforward refitting, 20 cases per coefficient is preferable; otherwise, penalized refitting can be recommended [1]. If sample size is smaller, logistic recalibration is a defendable alternative. However, such guidelines for multinomial risk models require additional research. The closed testing procedure indirectly takes sample size into account because larger samples yield higher statistical power: for larger samples, the procedure will more easily suggest revision.
Further research
First, the influence of sample size (e.g., events per variable (EPV)) on the development, validation, and updating of multinomial logistic models for risk prediction, and on the resulting calibration slopes, should be investigated. Second, different penalization techniques for multinomial risk prediction models can be considered, including variants of the lasso [1, 19, 33–36]. Third, altering the functional form of one or more predictors during updating deserves further study. Fourth, it may be of interest to use benchmark values to distinguish between case-mix effects and wrong coefficients when explaining poor validation results of multinomial prediction models [37]. Fifth, updating methods should be evaluated within the context of dynamic/continuous updating, a topic that is becoming increasingly relevant [38]. Finally, updating techniques are needed for prediction models for ordinal outcomes.
Conclusions
Updating methods for dichotomous risk models were successfully adapted to multinomial risk models. Simple recalibration methods may work well even if the original prediction model was based on a relatively small sample. Since the number of parameters to be estimated increases with the number of outcome categories, we recommend full model revision only when the sample size is large. To decide on the appropriate updating complexity, the closed testing procedure is helpful because it will tend to favor recalibration in smaller samples and refitting in larger samples. If the available sample size is large, revision including reassessment of functional form may be considered to better tailor predictions to individual patients or covariate patterns.
Declarations
Acknowledgements
Not applicable.
Funding
The research is supported by KU Leuven Research Council (grant C24/15/037) and Research Foundation—Flanders (grant G049312N). Kirsten Van Hoorde was supported by a PhD grant of the Flanders’ Agency for Innovation by Science and Technology (IWT Vlaanderen). Yvonne Vergouwe was funded by the Netherlands Organization for Scientific Research (Grant 917.11.383). Tom Bourne is supported by the National Institute for Health Research (NIHR) Biomedical Research Centre based at Imperial College Healthcare NHS Trust and Imperial College London.
Availability of data and materials
Datasets will not be shared because we lack informed consent for the publication of patient data and have not sought approval from the local ethics committees.
Authors’ contributions
BVC and KVH conceived the study. BVC, KVH, YV, and EWS designed the study. BVC and KVH performed the statistical analysis. BVC, KVH, YV, and EWS interpreted the results. SB, GC, EK, and TB acquired the patient data. BVC and KVH drafted the manuscript. YV, SB, GC, EK, TB, and EWS revised the manuscript. All authors approved the final version for publication.
Competing interests
The authors declare that they have no competing interests.
Ethics approval and consent to participate
This study involves secondary analysis of data from studies that were registered as audits at St. George’s Hospital and Queen Charlotte’s and Chelsea Hospital (Imperial College London) and hence did not require official ethics approval from the requisite Ethics Committees.
Disclaimer
The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR, or the Department of Health.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.