Validation and updating of risk models based on multinomial logistic regression

Background Risk models often perform poorly at external validation in terms of discrimination or calibration. Updating methods are needed to improve performance of multinomial logistic regression models for risk prediction. Methods We consider simple and more refined updating approaches to extend previously proposed methods for dichotomous outcomes. These include model recalibration (adjustment of intercept and/or slope), revision (re-estimation of individual model coefficients), and extension (revision with additional markers). We suggest a closed testing procedure to assist in deciding on the updating complexity. These methods are demonstrated on a case study of women with pregnancies of unknown location (PUL). A previously developed risk model predicts the probability that a PUL is a failed, intra-uterine, or ectopic pregnancy. We validated and updated this model on more recent patients from the development setting (temporal updating; n = 1422) and on patients from a different hospital (geographical updating; n = 873). Internal validation of updated models was performed through bootstrap resampling. Results Contrary to dichotomous models, we noted that recalibration can also affect discrimination for multinomial risk models. If the number of outcome categories is higher than the number of variables, logistic recalibration is obsolete because straightforward model refitting does not require the estimation of more parameters. Although recalibration strongly improved performance in the case study, the closed testing procedure selected model revision. Further, revision of functional form of continuous predictors had a positive effect on discrimination, whereas penalized estimation of changes in model coefficients was beneficial for calibration. Conclusions Methods for updating of multinomial risk models are now available to improve predictions in new settings. A closed testing procedure is helpful to decide whether revision is preferred over recalibration. Because multicategory outcomes increase the number of parameters to be estimated, we recommend full model revision only when the sample size for each outcome category is large.


Background
Prior to implementing risk prediction models in clinical practice to assist in patient management, their performance needs to be rigorously validated. Core elements of performance include discrimination (i.e., how well the model discriminates between the different categories) and calibration (i.e., the reliability of the predicted risks) [1,2]. It is of particular importance to externally validate the model using data collected later in time (temporal validation) and/or in different locations or hospitals (geographical validation) [3,4]. Disappointing validation results do not necessarily imply that the previously developed prediction model should be discarded, because the model contains crucial information such as which predictors are considered relevant. An attractive alternative is to perform some form of model updating, where we combine information from the previously developed model with new data [5]. This approach has clear practical relevance, because it is often not realistic to expect that a single model will work in all settings, due to differences in patient management protocols and referral patterns across centers and regions, and improvements in care over time. Updating is particularly useful for model validation in settings with different patient populations (e.g., primary vs secondary care), sometimes labeled "domain validation," because of likely differences in case-mix, event rates, predictor definitions, and measurement methods [6].
Methods to update risk models for dichotomous outcomes focus on recalibration, revision, and extension [1,5,7]. Recalibration merely adjusts the model intercept and/or overall slope, where an overall slope adjustment implies a fixed proportional adjustment of all predictor coefficients. Model revision adjusts the individual model coefficients, and model extension refits the model while including additional markers.
Van Hoorde and colleagues assessed how dichotomous recalibration and revision techniques could be extended to multicategory outcomes for which risk estimation was based on a sequence of dichotomous logistic regression models (sequential dichotomous modeling) [8]. The aim of the current paper is to introduce methods to directly update risk models for multicategory outcomes based on multinomial logistic regression, which is the most commonly used method for nominal outcomes. We present recalibration, revision, and extension methods and a statistical test to direct the preferred strategy. We illustrate these methods on a case study.

Case study
This case study aims to guide further follow-up for women with a pregnancy of unknown location (PUL). A PUL involves a pregnant woman whose pregnancy cannot be visualized using transvaginal ultrasound [9,10]. It is important to estimate the likelihood that the PUL is a failed PUL (FPUL), an ectopic pregnancy (EP), or an intra-uterine pregnancy (IUP) [10]. Potential follow-up strategies are to perform (a) a urinary pregnancy test after 2 weeks if an FPUL is predicted, (b) a repeat ultrasound scan after 1 week if an IUP is predicted, and (c) a repeat ultrasound scan and human chorionic gonadotropin (hCG) assessment after 2 days if an EP is predicted [11]. M4 is a multinomial logistic regression model developed for this purpose that is based on the serum hCG levels at presentation (hCG0) and 48 h later (hCG48) [12]. M4 was developed on data from women recruited at St. George's Hospital (SGH) in London between March and November 2002 [12]. Among the 197 patients, 109 (55%) had a FPUL, 76 (39%) an IUP, and 12 (6%) an EP (Table 1). M4 has the following predictors: the logarithm of the average of hCG0 and hCG48 (hCGm) and the ratio of hCG48 and hCG0 (hCGr The predicted risks for each outcome category are obtained as Here, we validate and update the M4 model using more recent data from the same setting (temporal updating) and using data from another center (geographical updating

Updating methods
We implemented seven updating methods ( Table 2): two recalibration, three revision, and two extension methods.
We denote the number of outcome categories with k. In our case study, k = 3. Next, we make the distinction between LP FvsI and LP EvsI on the one hand and LP x,FvsI and LP x,EvsI on the other. LP FvsI and LP EvsI are the linear predictors of the original M4 (Eq. 1), whereas LP x,FvsI and LP x,EvsI are the updated linear predictors for updating method x, where x = 1, …, 7. Finally, q denotes the number of variables in the model (i.e., including nonlinearity and interaction terms).

Reference method
Method 0 applies the original prediction model without any adjustments.

Intercept recalibration
The simplest recalibration method (method 1) updates the intercepts of the two linear predictors LP FvsI and LP EvsI of M4:  3-refitting: re-estimation of individual coefficients (q + 1) × (k − 1) = 8 Revision 4-penalized refitting using recalibrated coefficients from method 2 as offset (k + q + 1) × (k − 1) = 14 5-refitting including functional form: method 3, but hCGr modeled with rcs (q′ + 1) × (k − 1) = 8 Extension 6-extension: similar to method 3 but log(progesterone) added (q + m + 1) × (k − 1) = 10 7-penalized extension: similar to method 5 but log(progesterone) added (k + q + m + 1) × (k − 1) = 16 hCGr human chorionic gonadotropin ratio, rcs restricted cubic spline, k number of outcome categories, q number of variables (including additional nonlinear and interaction terms, but excluding intercepts) in original model, q ′ number of variables when changing functional form of one or more predictors, m number of variables related to added markers Note that multinomial logistic regression models have k − 1 linear predictors, with k being the number of outcome categories. Per equation in Eq. 3, only one linear predictor corresponds to the outcomes that are compared (the "corresponding" linear predictor), with the other linear predictors labeled as "non-corresponding." For intercept recalibration, we assume that the coefficients for the corresponding linear predictors are equal to 1 whereas we assume that the noncorresponding linear predictors have a coefficient of 0. This update aims to improve calibration-in-thelarge by aligning observed event rates and mean predicted risks [13].

Logistic recalibration
For method 2, a multinomial logistic recalibration framework for LP FvsI and LP EvsI is applied [13]: Intercepts (α), coefficients for corresponding linear predictors (β), and coefficients for non-corresponding linear predictors (γ) are estimated in order to update the regression coefficients [13]. This method corrects miscalibration of the predicted probabilities from M4, such that there is no general over-or underestimation of risks and such that predicted risks are on average not overly extreme or overly modest. It may be surprising that coefficients for non-corresponding linear predictors are not fixed at zero. Only if the original model is correct for the updating population, all βs are 1 and all γs are 0 in Eq. 4. When performing logistic recalibration by setting all γs to 0, there is no unique result: the updated model will be different depending on the choice of reference category in the logistic recalibration model [13]. In the Appendix, we work out the logistic recalibration formula for the case study.

Model refitting by re-estimating individual coefficients
Method 3 re-estimates the intercepts and the coefficients of each predictor using the updating data. A straightforward refit using multinomial logistic regression is used: Model refitting by penalized estimation of differences with recalibrated coefficients Method 4 uses the recalibrated linear predictors from method 2 (i.e., LP 2,FvsI and LP 2,EvsI from Eq. 4) as an offset: Adding the linear predictors as offset implies that the changes in the intercepts and predictor coefficients with respect to method 2 are modeled. We used ridge penalization on these changes to shrink coefficients to their recalibrated values in order to prevent an overly complex model leading to too extreme risk predictions [14]. Without such penalization, methods 3 and 4 would be identical.

Model refitting including reassessment of functional form
Method 5 re-estimates model coefficients as in method 3, but now the functional form of hCG ratio is updated as well. This is done with a restricted cubic spline (rcs) fit with three knots for the log of hCG ratio [2]: Because M4 was originally developed on a small dataset, the quadratic effect for hCG ratio may be inadequate or the result of overfitting. Given that hCG ratio is the most important predictor, it is worthwhile to re-assess its functional form. When using rcs with three knots, the number of parameters used to model hCG ratio remains at two. We decide to keep the log-transformation for the average hCG level to limit the overall complexity of the model.

Model extension by refitting and adding a novel marker
Method 6 is similar to method 3 but adds the log of the progesterone level at presentation as a novel marker:

Model extension using penalization
Method 7 is similar to method 4 because the linear predictors from method 2 are used as offset and ridge penalization is used. Penalization then affects (1) the differences in the intercepts and coefficients of log(hCGm), hCGrc, and hCGrc 2 relative to logistic recalibration and (2) the coefficients for the new predictor: Closed testing procedure We extend a recently suggested closed testing procedure for updating dichotomous logistic models [15]. A closed testing procedure involves a hierarchical correction for multiple testing to control the overall type I error at the desired alpha level [16,17]. The procedure for updating dichotomous prediction models is based on the closed testing procedure that was introduced for variable selection with multivariable fractional polynomials [18]. The procedure for multinomial updating compares the original model (method 0) with intercept recalibration (method 1), logistic recalibration (method 2), and straightforward refitting (method 3) (Table 3). To compare four increasingly complex and nested updating methods, the proposed closed testing procedure sequentially assesses the required scope of model updating: from no updating to refitting. The overall null hypothesis to be tested at a predetermined alpha level is that the original model has the same fit as an updated alternative or else that log-likelihood(method 0) = log-likelihood(method 1) = loglikelihood(method 2) = log-likelihood(method 3). The procedure consists of the following steps: A. Predetermine the alpha level α. B. Use a likelihood ratio test at α to compare the refitted model with the original model. The degrees of freedom is (q + 1) × (k − 1). If this test is not rejected (p value >α), there is no statistically significant improvement in fit of the refitted model vs the original model. The procedure stops and the original model is selected. If the test is significant, proceed to the next step. C. Use a likelihood ratio test at α to compare refitting with intercept recalibration. The degrees of freedom is q × (k − 1). If the test is not rejected, the procedure stops and intercept recalibration is selected. If the test is significant, proceed to the next step. D. If q < k (fewer variables than outcome categories), the procedure stops. Else, use a likelihood ratio test at α to compare refitting with logistic recalibration. The degrees of freedom is (q − k + 1) × (k − 1). If the test is not rejected, logistic recalibration is selected. If the test is significant, refitting is selected.

Ridge penalization
Methods 4 and 7 use ridge penalization which fits models using penalized maximum likelihood in order to obtain more stable and shrunken coefficients [14]. In methods 4 and 7, the coefficients of the predictors are shrunken towards the coefficients following logistic recalibration (method 2). Ridge penalization is implemented with the glmnet package in R [19]. The regularization parameter λ of the ridge penalty was estimated using 10-fold cross-validation with the deviance as performance criterion.

Missingness
Extension methods (methods 6 and 7) add the progesterone level at presentation to the model. At SGH 47 (3.3%) women and at QCCH 109 (12.5%) women had a missing value for progesterone. We used single imputation to deal with the missing values for the current illustrative study, although multiple imputation might be preferred to fully account for uncertainty in the imputation process [20]. The log-transformed progesterone level was imputed via fully conditional specification that included age, the logarithm of hCG0, the logarithm of hCG48, and outcome [21].

Performance evaluation: discrimination and calibration
Performance was evaluated using measures for discrimination and calibration. Optimism-corrected performance Each test is performed at the prespecified overall alpha level H0 null hypothesis, L likelihood, df degrees of freedom was based on bootstrap internal validation (500 bootstrap resamples), as recommended in [22]. The overall discrimination is evaluated using the polytomous discrimination index (PDI), a nominal version of the c-statistic [23]. In a set of patients, one from each outcome category (i.e., a set of size k), PDI estimates the probability that a patient from a random outcome category is correctly identified by the model. The patient from outcome category i is correctly identified in a set if this patient has the highest predicted risk of category i. A PDI of 0.7 means that it is estimated that on average 70% of patients from a set is correctly identified. Random performance corresponds to a PDI of 1/k [23]. In addition, c-statistics for all pairs of categories are calculated using the conditional risk method [24].
The common definition of calibration is that predicted risks should correspond to observed proportions per level of predicted risk: for patients with an estimated risk of event of 0.3, we expect 30% to have/develop this event. To assess calibration, we calculated calibration intercepts and calibration slopes (with 95% CI) [13]. Ideally, we expect calibration intercepts of 0 and a calibration slope of 1. The calibration intercepts indicate whether the risks are systematically overestimated (if <0) or underestimated (if >0). The calibration slopes indicate the presence of too extreme (if <1) or too modest (if >1) risk predictions. For the original model, we derive flexible calibration curves based on vector splines using the VGAM package in R [25]. This is similar to dichotomous calibration plots where observed proportions are based on loess or spline-based analyses [26,27].
As an overall measure of performance that combines discrimination and calibration, the Brier score was computed. Brier scores were also optimismcorrected.

Validation of the original M4 model
The original model had very good discrimination. The PDI was 0.87 at temporal updating and 0.80 at geographical updating (Table 4). c-statistics were 0.99 and 0.95 for FPUL vs IUP, 0.92 and 0.91 for FPUL vs EP, and 0.89 and 0.84 for IUP vs EP (Table 4). Calibration of predicted risks was poor in both updating settings, as demonstrated by the calibration curves (Fig. 1). The calibration intercepts were 1.19 (FPUL vs IUP) and 0.60 (EP vs IUP) at temporal updating and 0.27 and 0.86 at geographical updating (Fig. 2). The calibration slopes were 0.82 (FPUL vs IUP) and  (Fig. 2).

Theoretical results
Due to the non-corresponding linear predictors, the number of coefficients to be estimated in logistic recalibration increases quickly with the number of outcome categories. Logistic recalibration requires the estimation of k − 1 intercepts and of (k − 1) 2 coefficients, hence k × (k − 1) parameters in total. Straightforward refitting of the model requires k − 1 intercepts and of q × (k − 1) coefficients, hence (q + 1) × (k − 1) parameters in total. This implies that logistic recalibration is obsolete if q < k because then refitting does not require more parameters.

Discrimination
Discrimination improved only slightly with more elaborate updating. An interesting finding is that recalibration can affect discrimination, which is not possible for dichotomous risk models. For intercept recalibration, this effect was so small that it was not visible when rounding c-statistics to two decimals (Table 4). Logistic recalibration clearly improved discrimination (Table 4). For temporal updating, the PDI increased to 0.88 and the c-statistics to 0.93 for FPUL vs EP and 0.91 for IUP vs EP (Table 4). For geographical updating, the PDI remained at 0.80, but c-statistics increased to 0.96 (FPUL vs IUP) and 0.93 (FPUL vs EP) ( Table 4). Refitting, refitting with inclusion of functional form, and extension led to further small improvements in discrimination. Model extension increased the PDI by +0.01 (geographical updating) or +0.02 (temporal updating) and pairwise c-statistics by at most +0.04.

Calibration
Intercept recalibration improved calibration substantially by correcting calibration-in-the-large (Fig. 2). Due to overfitting of M4, logistic recalibration further improved calibration: the calibration slopes improved to 0.98 (FPUL vs IUP) and 0.97 (EP vs IUP) at temporal updating and to 1 and 0.98 at geographical updating (Fig. 2). Calibration remained good for more elaborate updating methods, although refitting and extension led to slightly lower calibration slopes. This was corrected when penalized versions of these methods were used.

Overall performance (Brier) and closed testing procedure
The Brier score of the original model was 0.172 at temporal updating and 0.286 at geographical updating (Table 4) The closed testing procedure favored refitting over the original model or recalibration methods in both datasets (Table 5), despite minor performance improvements of refitting over logistic recalibration (Table 4, Fig. 2). This suggests that refitting resulted in a better model log-likelihood compared to logistic  recalibration and hence that refitting would result in improved predicted risks per patient. This is supported by reclassification graphs [28] shown in Fig. 3: predicted probabilities are on average higher for the true outcome after refitting than after logistic recalibration. For example, in panel C, the predicted risk of IUP for true IUP cases from SGH is often higher after refitting (y-axis) than after logistic recalibration (x-axis). The updated model coefficients for each method and each dataset are provided in the Appendix.

Discussion
In this paper, we propose methods to update risk models based on multinomial logistic regression. As a case study, the M4 model to predict the outcome of pregnancies of unknown location [12] was updated temporally (using more recent data from the same setting) and geographically (using data from a different hospital). Seven updating methods were considered: two recalibration methods (intercept recalibration, logistic recalibration), three revision methods (refitting of individual coefficients, penalized refitting of individual coefficients, and refitting with reassessment of functional form of the most important predictor), and two extension methods (straightforward and penalized extension). A closed testing procedure was introduced to select between no updating, intercept recalibration, logistic recalibration, and refitting.

Conclusions for the case study on the M4 model
The original M4 model was poorly calibrated in both updating settings, but discrimination was very good. Steady but mild improvements in discrimination were observed when increasingly elaborate methods were used. The closed testing procedure suggested refitting in both updating settings. This was likely due to (1) slightly better discrimination, (2) the fact that revision  FPUL (a, b), the predicted probability of IUP when the reference standard is IUP (c, d), the predicted probability of EP when the reference standard is EP (e, f) methods should improve the average accuracy of risk predictions per individual, and (3) large validation sample size. Reassessment of functional form appeared to further improve discrimination, whereas penalized refitting had beneficial impact on calibration.
Extending the model with progesterone further improved model discrimination.
Differences between updating of dichotomous vs multinomial risk models Some dissimilarities can be seen between dichotomous and nominal updating methods. For dichotomous outcomes, recalibration does not change the c-statistic, since this is a rank order statistic not affected by linear transformation. A multinomial logistic model contains multiple linear predictors, one for each outcome category vs the reference category. Because of different adaptations to the linear predictors, recalibration methods can change the PDI (nominal c-statistic) and the pairwise c-statistics.
For multinomial risk models with k outcome categories, we have k − 1 calibration intercepts and (k − 1) 2 calibration slopes. Hence, logistic recalibration is more complicated for multinomial logistic risk models and in fact even becomes obsolete when k is higher than the number of variables q in the original risk model (excluding intercepts, but including nonlinear and interaction terms). In these situations, logistic recalibration requires at least as many parameters as straightforward refitting.
In addition, a multinomial calibration plot is more complex than a dichotomous one [13]. For the former, we have one curve for each outcome category while for the latter, a single curve is sufficient. The same predicted risk for one category can be associated with different observed proportions depending on the predicted risks for the other categories [13]. Therefore, irrespective of whether a logistic or flexible calibration analysis is used, smoothing is needed in the calibration plots to visualize the overall relationship between predicted risks and observed proportions.
Finally, the number of parameters to be estimated increases with the number of outcome categories. For example, if the original model has q variables, straightforward refitting requires (q + 1) × (k − 1) parameters. Hence, the more categories, the more cumbersome model revision becomes.

Choice of updating method
Calibration can often be strongly improved with simple intercept and/or slope adjustments [1,7,8]. When updating a prediction model that was originally based on a small sample, as in our case study, intercept recalibration will typically be insufficient as it is likely that the original model is overfitted. Recalibration corrects problems with the calibration intercepts and slopes. This was recently described as "weak" calibration [26]. In contrast, "strong" calibration is defined as the correspondence between predicted probabilities and observed proportions per covariate pattern. This is a utopic goal in empirical studies [26], but if we would like to approach strong calibration, revision methods should be preferred. These methods aim to correct bias in individual model coefficients and hence should on average lead to more accurate predictions per covariate pattern. In our case study, the closed testing procedure indicated that refitting was required although there were minor differences in discrimination and calibration performance measures.
In research on updating methods for risk models, the functional form or optimal transformation of continuous predictors has thus far received limited attention. However, transformations used in the original model may not hold for every setting in which the model may be used. Settings will for example vary with respect to the homogeneity of the patient population, or the transformation used in the original model may be the result of overfitting. Our case study also showed that the functional form of the effect of hCG ratio could be improved from the original model.
In theory, one would always prefer revision methods in order to make optimal adjustments to the model. Computing power will usually not be an issue, but rather the available sample size is a key determinant for the choice of updating method. If sample size is limited, recalibration may already give very good value for money whereas revision may require too much from the available data. However, when sample size is large, model revision will help to further improve discrimination and/or accuracy of predicted risks [26,29]. Reliably reassessing functional form may require even more data (e.g., updating a linearly modeled covariate with restricted cubic splines). For multinomial models, the number of outcome categories k is important as well. If k is larger than the number of model variables q in the original model, logistic recalibration is obsolete.
Existing evidence for dichotomous risk models recommends at least 100-200 cases in the smallest outcome category for reliable model validation [26,30,31]. Further, based on common guidelines for developing dichotomous models, at least 10 cases per coefficient in the smallest outcome category would be recommended to use model revision [1,32]. The total number of coefficients equals q × (k − 1) for multinomial updating. For straightforward refitting, 20 cases per coefficient is preferable; otherwise, penalized refitting can be recommended [1]. If sample size is smaller, logistic recalibration is a defendable alternative. However, such guidelines for multinomial risk models require additional research. The closed testing procedure indirectly takes sample size in account by the fact that larger samples yield higher statistical power: for larger samples, the procedure will more easily suggest revision.

Further research
The influence of sample size (e.g., events per variable (EPV)) on the development, validation, and updating of multinomial logistic models for risk prediction as well as its influence on calibration slopes should be investigated. Second, different penalization techniques for multinomial risk prediction models can be considered, including variants of the Lasso [1,19,[33][34][35][36]. Third, research concerning altering functional form of one or more predictors in case of updating might be conducted. Fourth, it might be of interest to use and consider benchmark values to distinguish between case-mix effects and wrong coefficients when explaining poor validation results of multinomial prediction models [37]. Fifth, updating methods should be evaluated within the context of dynamic/continuous updating, a topic that becomes increasingly relevant [38]. Finally, updating techniques are needed for prediction models for ordinal outcomes.

Conclusions
Updating methods for dichotomous risk models were successfully adapted to multinomial risk models. Simple recalibration methods may work well even if the original prediction model was based on a relatively small sample. Since the number of parameters to be estimated increases with the number of outcome categories, we recommend full model revision only when the sample size is large. To decide on the appropriate updating complexity, the closed testing procedure is helpful because it will tend to favor recalibration in smaller samples and refitting in larger samples. If the available sample size is large, revision including reassessment of functional form may be considered to better tailor predictions to individual patients or covariate patterns.

Appendix 1 Exclusion criteria for the case study
For the study on pregnancies of unknown location, exclusion criteria were being lost to follow-up, a missing hCG level at presentation (hCG0) or a level ≤25 IU/L (which is considered as a negative pregnancy test), an unavailable 48-h hCG (hCG48) level, and an hCG48 level that is not taken within 1 to 3 days after the hCG0 level. Some patients do not have an hCG48 level due to an IUP or EP being detected using ultrasonography prior to this blood sample, a very low hCG0 level, failing to return in time, or being lost to follow-up. Of the 2708 patients, 2295 (85%) are available for analysis after applying the exclusion criteria. At SGH, 93 of 1610 women had an hCG0 level ≤25 IU/L, 12 had no hCG48 level, and 83 were lost to follow-up. At QCCH, 1 of 1098 women had no hCG0 level, 17 had an hCG0 level ≤25 IU/L, 125 had no hCG48 level, 22 had hCG48 measured more than 3 days after hCG0, and 60 were lost to follow-up. Therefore, data from 1422 (88%) patients were available at SGH and data from 873 (80%) women at QCCH. At SGH, there were 50% FPUL, 41% IUP, and 9% EP. The QCCH data contained 58% FPUL, 28% IUP, and 14% EP.