 Methodology
 Open Access
 Published:
Leaveoneout crossvalidation, penalization, and differential bias of some prediction model performance measures—a simulation study
Diagnostic and Prognostic Research volume 7, Article number: 9 (2023)
Abstract
Background
The performance of models for binary outcomes can be described by measures such as the concordance statistic (cstatistic, area under the curve), the discrimination slope, or the Brier score. At internal validation, data resampling techniques, e.g., crossvalidation, are frequently employed to correct for optimism in these model performance criteria. Especially with small samples or rare events, leaveoneout crossvalidation is a popular choice.
Methods
Using simulations and a real data example, we compared the effect of different resampling techniques on the estimation of cstatistics, discrimination slopes, and Brier scores for three estimators of logistic regression models, including the maximum likelihood and two maximum penalized likelihood estimators.
Results
Our simulation study confirms earlier studies reporting that leaveoneout crossvalidated cstatistics can be strongly biased towards zero. In addition, our study reveals that this bias is even more pronounced for model estimators shrinking estimated probabilities towards the observed event fraction, such as ridge regression. Leaveoneout crossvalidation also provided pessimistic estimates of the discrimination slope but nearly unbiased estimates of the Brier score.
Conclusions
We recommend to use leavepairout crossvalidation, fivefold crossvalidation with repetitions, the enhanced or the .632+ bootstrap to estimate cstatistics, and leavepairout or fivefold crossvalidation to estimate discrimination slopes.
Introduction
The concordance statistic (cstatistic) is a widely used measure to quantify the discrimination ability of models for binary outcomes. Calculating the cstatistic for the data on which the model was fitted will usually give too optimistic results for the model performance in subjects outside of the model development set, especially with small samples or rare events. This overoptimism can be corrected by data resampling techniques such as crossvalidation (CV) or the bootstrap. Leaveoneout (LOO) CV has the advantage of being applicable even with small samples where other techniques such as tenfold or fivefold CV might run into problems when, e.g., some of the CV subsets contain only one category of the binary outcome. Whereas LOO CV is known to yield nearly unbiased estimates for performance measures applicable to single observations such as the Brier score [16], it has been shown to induce negative bias into the cstatistic [1, 27]. Nevertheless, LOO CV to “crossvalidate” estimated probabilities and cstatistics is still in widespread use. For example, in the current standard implementation of logistic regression in the SAS/STAT 15.2 and Viya 3.4 software, PROC LOGISTIC, a onestep approximation to LOO CV is the only builtin crossvalidation method for estimated probabilities and also for the cstatistic [23].
The present paper aims to provide a better understanding of resampling techniques, in particular LOO CV, when combined with different estimators commonly used to fit logistic regression models with binary outcomes and when used to evaluate the model performance by means of the cstatistic, the Brier score, and the discrimination slope [30], which is an increasingly popular measure of predictive accuracy in binary outcome models. A similar aim was pursued by Iba et al. [15] but focusing on bootstrap methods only.
Throughout this paper, we focus on the situation of relatively few observations or events, where reliable predictive models are out of scope. Nevertheless, in such situations, it could still be of interest to use a prediction model to capture the data structure in a compact manner [26]. In such a modelbased description of the association of an outcome with several covariates, an unbiased measure of the discrimination ability of that model is still of interest.
The remainder of this paper is organized as follows: we start with an introduction of the measures of model performance, the resampling techniques, and the model estimators of interest. A study on the association between diabetes and the waisthip ratio serves as illustrative example. Subsequently, we provide an intuitive explanation of the problems with LOO CV using a simple toy example and present a comprehensive simulation study. Finally, we discuss the impact of our findings on routine statistical analyses. A preprint of this manuscript can be found on arXiv.org [9].
Methods
Measures of model performance
We denote the two outcome values as “event” and “nonevent” and assume that in logistic regression, the probability of an event is modeled.
The cstatistic is the proportion of pairs among all possible pairs of observations with contrary outcomes in which the estimated event probability is higher in the observation with the event than in the observation with the nonevent. It equals the area under the receiver operating characteristic curve [10].
The discrimination slope is the difference between the mean estimated probability of an event for observations with events and the mean estimated probability for observations with nonevents. Paralleling the construction of the cstatistic, the discrimination slope can be computed as the average pairwise difference in estimated probabilities, thus representing a parametric version of the cstatistic. It was suggested as “highly recommendable R^{2}substitute for logistic regression models” by Tjur [30] and recently revisited by Antolini et al. [2]. For model estimators which give average estimated probabilities equal to the event fraction, the discrimination slope is equal to the measure of explained variation for binary outcomes proposed in [25]; see the Appendix for a proof.
The Brier score is the mean squared difference between the binary outcome and the estimated probabilities [29]. It equals 0 for perfect models. The magnitude of the Brier score has to be interpreted in the context of the event fraction of a data set. For instance, at an event fraction of 0.5, a noninformative model with estimated probabilities equal to the event fraction assumes a Brier score of 0.25, while at an event fraction of 0.25, it is 0.1875. Unlike the cstatistic and the discrimination slope, the Brier score can be defined and computed for single observations.
Techniques to correct for overoptimism
Throughout this paper, we assume that the observations in the model development set are drawn from a large underlying population and that we are interested in the performance of the model applied to this underlying population excluding the observations in the development set. Calculating performance measures on the same data as used for model development will usually result in highly biased (“overoptimistic”) estimates which need to be corrected. Here, we describe some resampling techniques which provide optimismcorrected estimates of the cstatistic. If not mentioned otherwise, methodology straightforwardly generalizes to the discrimination slope and the Brier score. We denote by “apparent” measures that are calculated from the data on which the model was fitted without correction.
With ffold CV, the data are split into f parts or “folds” with approximately equal number of observations. A model is then fitted on the observations from f1 folds. Using this model, estimated probabilities for the observations in the excluded fth fold are calculated. By excluding each fold in turn, one obtains f cstatistics which are then averaged. To decrease variability caused by the random partitioning, the whole procedure is repeated r times and results are averaged. Here, we consider f = 5 and r = 40, i.e., 5fold CV with 40 repetitions. For brevity, we will not always explicitly mention these repetitions, but we always performed repeated 5fold CV.
Setting f to the sample size n (“leaveoneout CV”), the cstatistic cannot be computed using the averaging strategy explained above, as only one observation is excluded at each iteration. Instead, one has to resort to a pooling strategy by fitting n models using each possible subset of n1 observations, each time calculating the predictive probability for the observation excluded from model estimation, and computing a single cstatistic from the pooled n estimated probabilities. Notably, for performance measures applicable to single observations such as the Brier score, LOO CV can also be applied in conjunction with the averaging strategy and then gives the same result as the pooling strategy.
Leavepairout (LPO) CV is an approach that is independent of random sampling but based on cstatistics calculated within folds [1, 27]. With LPO CV, each pair of observations with contrary outcomes is excluded from the data and, in turn, a model is fitted on the remaining n − 2 observations, and estimated probabilities for the two excluded observations are calculated from this model. The LPO crossvalidated cstatistic is the proportion of pairs with concordant estimated probabilities, i.e., where the estimated probability of the observation with the event is higher than that of the observation with the nonevent. LPO CV can imply considerable computational burden: if k is the number of events, (n − k)k models have to be estimated, compared to only n models with LOO CV. For example, with 50 events among 100 observations, 2500 models must be fitted with LPO but only 100 with LOO CV. Whereas LPO CV generalizes straightforwardly to the discrimination slope, it is not clear how it should be adapted for the Brier score: simply averaging the Brier score computed for all leftout pairs will give biased estimates in the case of unbalanced outcomes because of the dependence of the Brier score on the event fraction. One solution would be to adequately weight contributions by events and nonevents. Here, we refrain from applying LPO CV to the Brier score.
In Harrell’s implementation of an enhanced bootstrap to correct bias due to overfitting [12], the bias is explicitly estimated and then subtracted from the apparent cstatistic. Specifically, 200 samples of n observations with replacement are drawn from the original data set. On each of these bootstrap resamples, a model is fitted and used to calculate cstatistics both for the bootstrap resample and the original data. An estimate of “optimism” is obtained by subtracting the average cstatistic in the original data from the average cstatistic in the bootstrap resamples. The enhanced bootstrap cstatistic is then given by the apparent cstatistic minus the estimate of optimism.
The .632+ bootstrap [7] is a weighted average of the apparent cstatistic and the average “outofthebag” cstatistic calculated from bootstrap resamples. The “outofthebag” cstatistic is obtained by fitting the model in a bootstrap resample and applying it to the observations not contained in that bootstrap resample. We give the technical details in the Appendix.
Penalized likelihood estimation methods
We investigated the performance of the resampling techniques in combination with the following estimators of logistic regression:

Maximum likelihood estimation (ML)

Logistic ridge regression (RR) [18]
FL amounts to penalization by the Jeffreys prior and was shown to reduce the bias in coefficient estimates compared to ML. With RR, the log likelihood is penalized by the square of the Euclidean norm of the regression parameters multiplied by a tuning parameter. We chose the tuning parameter by minimizing a penalized version of the Akaike’s information criterion (AIC) given by \(2l\left(\hat{\beta}\right)+2\ d{f}_e\) with \(l\left(\hat{\beta}\right)\) the log likelihood,
the effective degrees of freedom, and \({l}^{\ast}\left(\hat{\beta}\right)\) the penalized log likelihood [33]. This approach of optimizing the tuning parameter is less computerintensive than the optimization of crossvalidated measures and has been reported to yield similar or even superior results [12].
For the implementation of ML and FL, we used the Rpackage logistf with default convergence criteria [13, 22]. For RR, we applied the function lrm in the Rpackage rms, setting the singularity criterion to 10^{−15} [11].
If the data are separated, i.e., if a combination of explanatory variables perfectly predicts the outcome, then ML fails to produce finite regression coefficients, and some estimated probabilities will be exactly 0 or 1 [19]. By contrast, FL gives reasonable results in the case of separation. Under separation, RR will supply finite regression coefficients if the tuning parameter is greater than 0. However, CV or AIC optimization will often set the tuning parameter to 0 in case of separation, and then RR leads to the same problems as ML [19]. See Additional file 1: S1 for how we handled separation, linearly dependent explanatory variables, or binary outcomes restricted to one category occurring in bootstrap resamples or CV subsets.
Motivation
A real data example: association between waisthip ratio and diabetes
As part of a smoking cessation project among AfroAmericans in two rural Virginia counties, a screening examination on coronary heart disease risk factors was performed [34]. For illustration purposes, we focus on the association between the waisthip ratio and the presence of diabetes (defined by glycosylated hemoglobin >7.0), adjusted for gender, in the Virginia county Louisa. Among the 198 study participants, 14.6% (29 persons) were classified as having diabetes. A difference of 0.10 in the waisthip ratio was associated with an adjusted odds ratio, estimated with ML, of 1.9 (95% CI; 1.01, 3.58). On the level of estimated probabilities, this corresponds to probabilities of diabetes of 0.112 and 0.193 for females with a waisthip ratio of 0.8 or 0.9, respectively. In line with findings from previous studies [1, 27], LOO CV resulted in lower cstatistics for ML, FL, and RR than the other resampling techniques; see Fig. 1. While a model with no discriminative ability (a “random guess”) would yield a cstatistic of approximately 0.5, the LOO crossvalidated cstatistic for RR was even only 0.468, while the corresponding cstatistics of ML and FL both were 0.54. This may give the impression that RR supplies a model that performs even worse than a random guess, while ML and FL yield better models. All other resampling techniques gave similar cstatistics across model estimators.
Understanding the bias in LOO crossvalidated cstatistics
Figure 2 explains the bias in LOO crossvalidated cstatistics by illustrating the estimation process on an artificial toy example with 20 observations and 5 events. The crucial observation in Fig. 2 is that the estimated probability for a leftout event (CV iterations 1–5) was on average lower than for a leftout nonevent (CV iterations 6–20). If an event was left out, the data used in the model fitting consisted of only 4 events out of 19 observations (event fraction 0.210), compared to 5 out of 19 (event fraction 0.263) if a nonevent was left out. Hence, the LOO crossvalidated estimated probability tends to be too low for an event and too high for a nonevent. Consequently, LOO crossvalidated cstatistics (and discrimination slopes) based on pooling these crossvalidated estimated probabilities are biased low. Figure 2 also illustrates that the bias in LOO crossvalidated cstatistics is usually more severe for modeling methods yielding shrunken estimated probabilities such as ridge regression. This tendency can lead to undesired results if one optimizes the tuning parameter in RR using LOO crossvalidated cstatistics; see Additional file 1: Figure S1. Whereas for the null scenario, the discrimination ability of RR is independent of the penalization strength, optimization of LOO crossvalidated cstatistics favors models with less regularization.
Simulation study
We follow the ADEMP structured approach in describing the setup of our simulation study [21].
Aim
The aim of the simulation study was to compare the accuracy of the resampling techniques LOO CV, LPO CV, 5fold CV, enhanced bootstrap, and .632+ bootstrap in estimating cstatistics, discrimination slopes, and Brier scores for the model estimators ML, FL, and RR. In particular, the following questions should be answered:

Is the discrimination slope accurately estimated by LOO CV?

Does the performance of resampling methods differ between model estimators?

Which resampling methods estimate the cstatistic, the discrimination slope, and the Brier score most efficiently?
Data generating mechanism
Data generation was motivated by the structure of real data sets, where typically a mix of variables with different distributions is encountered [5]. By sampling from a multivariate normal distribution and applying certain transformations, we generated one binary, one ordinal, and three continuous explanatory variables; see Additional file 1: S2 and Table S1 for details. Binary outcomes y_{i} were drawn from Bernoulli distributions with the event probability following a logistic model. We considered twelve simulation scenarios in a factorial design combining sample size (n ∈ {50,100}), marginal event fraction (E(y) ∈ {0.25, 0.5}), and effect size (strong or weak effects of all explanatory variables, or null scenarios with no effects). More information on the magnitude of the effects is given in Additional file 1: S2. For each scenario, we created 1000 data sets.
Estimands
Our estimands are the cstatistic, the discrimination slope, and the Brier score for the model estimators ML, FL, and RR.
Methods
For each simulated dataset and each model estimator, we assessed the predictive accuracy in terms of cstatistics, discrimination slopes, and Brier scores by the five resampling methods LOO CV, LPO CV, 5fold CV, enhanced bootstrap, and .632+ bootstrap.
Performance measures
We compared the resamplingbased cstatistics, discrimination slopes, and Brier scores with those obtained if the estimated models were validated in the population, in our study approximated by an independent validation data set consisting of 100,000 observations. We described the performance of the resampling techniques in terms of mean and root mean squared difference (RMSD) of the resamplingbased cstatistics, discrimination slopes, and Brier scores to their respective independently validated (IV) counterparts. Finally, we calculated Monte Carlo standard errors for the mean squared difference [21] and the RMSD [17].
Results
First, we describe the distribution of the cstatistic, discrimination slope, and Brier score obtained in the independent validation set, which will serve as gold standard. The mean IV cstatistics ranged between 0.5 and 0.684; see Additional file 1: Table S2. RR achieved the largest mean IV cstatistics in nonnull scenarios, but there was little difference between model estimators.
For the mean IV discrimination slope, the differences between the model estimators showed a range of up to 0.04 units; see Additional file 1: Table S3. In nonnull scenarios, ML achieved the largest median IV discrimination slopes, with values of up to 0.135. RR yielded the smallest median IV discrimination slopes, which were at least 20% smaller than by ML in all scenarios.
The results for the IV Brier score were in contrast to those for the IV discrimination slope: now, ML performed worst in all scenarios, while RR resulted in the smallest mean Brier scores in all but one scenarios; see Additional file 1: Table S4.
In approximating IV cstatistics, LOO CV performed worst both with respect to mean difference (bias) and RMSD; see Fig. 3. The downward bias was most severe for RR and amounted to − 0.274 in the most unfavorable scenario. For this scenario, the magnitude of the bias with ML or FL was only about a quarter of the magnitude of the bias with RR. In all but two scenarios, the enhanced and the .632+ bootstraps yielded the smallest RMSD for RR, ML, and FL. Notably, the RMSD increased with increasing effect size for the .632+ bootstrap, whereas it decreased for all other resampling methods as expected. This behavior can be understood by looking at the definition of the .632+ bootstrap, which ensures that the .632+ cstatistic is always greater than or equal to the minimum of the apparent cstatistic and 0.5, resulting in a rightskewed distribution of the .632+ cstatistic especially for null scenarios. For all model estimators and all resampling techniques, the RMSD decreased with increasing sample size and increasing event fraction. The differences between resampling techniques were less pronounced with stronger effects, larger sample sizes, and balanced event fraction.
LOO CV also performed poorly in approximating the IV discrimination slope, yielding pessimistic estimates with a RMSD at least larger than the one by LPO CV and 5fold CV; see Fig. 4. However, the differences in RMSD across resampling techniques were fairly small. Only for RR the two bootstrap techniques sometimes resulted in discrimination slopes with substantially larger RMSD than LPO CV and 5fold CV. The .632+ bootstrap gave overly optimistic discrimination slopes, with an absolute mean difference to the IV values often larger than the one by LOO CV. On the other hand, in all but three simulation scenarios, the .632+ bootstrap yielded discrimination slopes with smallest median deviations. This discrepancy can be explained by the rightskewness of the distribution of the differences between optimismcorrected and IV discrimination slopes, which was especially pronounced for the .632+ bootstrap. With increasing sample size, the RMSD decreased for all resampling techniques and all model estimators. Again, the differences between resampling techniques were less pronounced with increasing effect size, sample size, and more balanced event fraction.
As described in the “Methods” section, LPO CV does not naturally generalize to the Brier score, so we only considered LOO CV, 5fold CV, enhanced bootstrap, and .632+ bootstrap. In all but three simulation scenarios, LOO CV performed best with respect to the mean difference to the IV Brier score for all model estimators; see Additional file 1: Figure S2. Similarly, as for the discrimination slope, the enhanced bootstrap gave overly pessimistic Brier scores with fairly large RMSD for RR, especially in scenarios with no or small effects. However, differences in RMSD were small between resampling methods, only in scenarios with small or no effects the .632+ bootstrap showed some benefit over the other techniques. With increasing sample size, the RMSD decreased for all resampling techniques and all model estimators.
The percentage of separated data sets was highest (18.2%) for the scenario with a sample size of 50, an event fraction of 0.25, and strong effects; see Additional file 1: Table S5. In this scenario, more than one third of bootstrap resamples were separated.
Discussion
The findings of our simulation study confirm that LOO CV yields pessimistic cstatistics [1, 27], but they also demonstrate that this bias depends on the choice of model estimator. Thus, LOO crossvalidated cstatistics should neither be used to describe the performance of a single model nor to compare the performance of a set of models estimators, e.g., in the optimization of tuning parameters in regularized regression. LPO CV, which was suggested as an alternative to LOO CV [1], indeed performed better both in terms of mean difference and RMSD to the IV cstatistic. However, the enhanced bootstrap and the .632+ bootstrap achieved a smaller RMSD in almost all simulation settings. The .632+ bootstrap is a weighted average of the apparent cstatistic and a certain overly corrected cstatistic which is set to 0.5 if smaller. In this way, it is ensured that the .632+ bootstrap cstatistics are greater than or equal to 0.5 (or the apparent cstatistic if the apparent cstatistic should be smaller than 0.5). One can apply a similar kind of winsorization with any resampling technique by reporting cstatistics smaller than 0.5 as 0.5, whereas in practice winsorizing at 0.5 rather entails a loss of information instead of a gain in precision, this approach leads to smaller RMSD to the IV cstatistic in simulations; see Additional file 1: Table S6. With this in mind, the superiority of the .632+ bootstrap in terms of RMSD to the IV cstatistic might appear less relevant. The performance of LPO CV, 5fold CV, enhanced bootstrap, and .632+ bootstrap in the estimation of cstatistics was too similar to give definite recommendations in favor of one of these techniques, which is in line with a previous study [27]. Thus, the choice might be guided by other criteria such as the dependency on data sampling, the extent of computational burden, the level of complexity of the approach, or the likeliness of encountering problems with model fitting in resamples. In particular, if in addition to the cstatistic the corresponding receiver operating characteristics curve should be estimated, different methods are required [20], since the resampling techniques discussed above do not provide rankings of the data necessary for estimating receiveroperating characteristic curves. One important point left aside in our simulation study is that if one wants to accurately estimate the performance of a model estimator consisting of multiple steps, e.g., a variable selection step and a coefficient estimation step, the model development should be systematically replayed in every bootstrap or CV sample, as emphasized by [28].
LOO CV also gives pessimistic estimates for discrimination slopes. Moreover, our simulations revealed unexpected behavior of some of the bootstrap techniques. First, the enhanced bootstrap and the .632+ bootstrap performed reasonably well for ML and FL but sometimes poorly for RR in estimating the discrimination slope. Second, the simple bootstrap resulted in estimates even more optimistic than the apparent discrimination slopes; see Additional file 1: S3. According to our simulation results, we suggest using LPO CV or 5fold CV to correct for optimism in discrimination slopes.
The Brier score is the only performance measure considered in this study which can be estimated by LOO CV using the averaging approach. As expected from the general theory on LOO CV [16], the LOO crossvalidated Brier scores were close to unbiased in our simulation. For the Brier score, there was little difference between resampling methods.
According to Austin and Steyerberg [3], the splitsample method, where a proportion of the data is excluded from model fitting and later used as independent data set for assessing the model performance, is quite popular among clinical investigators. However, we have not included the splitsample method in our simulation study as it is known to perform poorly, resulting in estimates of even larger RMSE than the apparent estimates [3]. Various authors proposed to use the internally leaveoneout crossvalidated calibration slope as global shrinkage factor to correct predictions for overestimation [6, 24, 31, 32]. We plan to investigate the calibration slope in this context in a separate report. Another interesting topic of future research is the reliable estimation of variability for internally validated performance measures (see, e.g., [4]).
Our study illustrates that the performance of resampling techniques can vary considerably between model estimators, even if the model estimators are similar in construction. Including for instance machine learning methods such as support vector machines into the comparison might even have revealed larger performance differences. This interaction between resampling techniques and model estimators implies that simulation studies aiming to assess the accuracy of a resampling technique should consider a broader set of model estimators to be widely applicable.
Summarizing, our study emphasizes that estimates provided by resampling techniques should be treated with caution, no matter whether one is interested in absolute values or a comparison between model estimators. Especially in studies with small samples or possibly spurious effects, it might be reasonable to scrutinize the validity of a performance measure estimate by applying an alternative resampling method.
Availability of data and materials
The data on the screening examination as part of the smoking cessation project is freely available at http://biostat.mc.vanderbilt.edu/DataSets.
References
Airola A, Pahikkala T, Waegeman W, De Baets B, Salakoski T. A comparison of AUC estimators in smallsample studies. In: Proceedings of the third International Workshop on Machine Learning in Systems Biology, PMLR 8:313. 2009. Available from https://proceedings.mlr.press/v8/airola10a.html.
Antolini L, Tassistro E, Valsecchi MG, Bernasconi DP. Graphical representations and summary indicators to assess the performance of risk predictors. Biom J. 2019;61(6):141729. https://doi.org/10.1002/bimj.201700186.
Austin PC, Steyerberg EW. Events per variable (EPV) and the relative performance of different strategies for estimating the outofsample validity of logistic regression models. Stat. Methods Med. Res. 2017;26:796–808.
S. Bates, T. Hastie, and R. Tibshirani, Crossvalidation: what does it estimate and how well does it do it?, arXiv (2021), arXiv:2104.00673. Available at https://arxiv.org/abs/2104.00673.
Binder H, Sauerbrei W, Royston P. Multivariable modelbuilding with continuous covariates: 1. performance measures and simulation design, Technical Report FDMPreprint 105. Germany: University of Freiburg; 2011.
Dunkler D, Sauerbrei W, Heinze G. Global, Parameterwise and Joint Shrinkage Factor Estimation. J Stat Softw. 2016;69(8):1–19. https://doi.org/10.18637/jss.v069.i08.
Efron B, Tibshirani R. Improvements on crossvalidation: the .632+ bootstrap method. J Am Stat Assoc. 1997;92:548–60.
Firth D. Bias reduction of maximumlikelihoodestimates. Biometrika. 1993;80:27–38.
A. Geroldinger, L. Lusa, M. Nold, and G. Heinze, On resampling methods for model assessment in penalized and unpenalized logistic regression, arXiv (2021), arXiv:2101.07640. Available at https://arxiv.org/abs/2101.07640.
Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36.
Harrell F. rms: Regression Modeling Strategies. R package, Version 5.12. https://CRAN.Rproject.org/package=rms.
Harrell FE. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis, Springer Series in Statistics, ed. New York: Springer; 2001.
Heinze G, Ploner M, Dunkler D, Southworth H. logistf: Firth’s Biasreduced Logistic Regression. R package, Version 1.23. https://CRAN.Rproject.org/package=logistf.
Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Stat. Med. 2002;21:2409–19.
Iba K, Shinozaki T, Maruo K, Noma H. Reevaluation of the comparative effectiveness of bootstrapbased optimism correction methods in the development of multivariable clinical prediction models. BMC Med. Res. Methodol. 2021;21:9.
James G, Witten D, Hastie T, Tibshirani RJ. An introduction to statistical learning, Springer Texts in Statistics, ed. New York: Springer; 2013.
Koehler E, Brown E, Haneuse SJPA. On the assessment of Monte Carlo error in simulationbased statistical analyses. Am. Stat. 2009;63:155–62.
Le Cessie S, Van Houwelingen HC. Ridge estimators in logistic regression. J. R. Stat. Soc. Ser. C. (Appl. Stat.). 1992;41:191–201.
Mansournia MA, Geroldinger A, Greenland S, Heinze G. Separation in logistic regression: causes, consequences, and control. Am. J. Epidemiol. 2018;187:864–70.
Montoya Perez I, Airola A, Bostrom PJ, Jambor I, Pahikkala T. Tournament leavepairout crossvalidation for receiver operating characteristic analysis. Stat. Methods Med. Res. 2019;28:2975–91.
Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat. Med. 2019;38:2074–102.
R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria 2022. https://www.Rproject.org/.
SAS Institute Inc., SAS/STAT User’s Guide, Version SAS 9.4/Viya 3.4 (2019). Available at https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_logistic_details32.htm.
Sauerbrei W. The use of resampling methods to simplify regression models in medical statistics. J Roy Stat Soc CApp. 1999;48:313–29.
Schemper M. Predictive accuracy and explained variation. Stat. Med. 2003;22:2299–308.
Shmueli G. To explain or to predict? Stat. Sci. 2010;25:289–310.
Smith GC, Seaman SR, Wood AM, Royston P, White IR. Correcting for optimistic prediction in small data sets. Am. J. Epidemiol. 2014;180:318–24.
Steyerberg EW, Bleeker SE, Moll HA, Grobbee DE, Moons KGM. Internal and external validation of predictive models: a simulation study of bias and precision in small samples. J. Clin. Epidemiol. 2003;56:441–7.
Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21:128–38.
Tjur T. Coefficients of determination in logistic regression modelsa new proposal: the coefficient of discrimination. Am. Stat. 2009;63:366–72.
Van Houwelingen JC. Shrinkage and penalized likelihood as methods to improve predictive accuracy. Statistica Neerlandica. 2001;55:17–34.
Van Houwelingen JC, Le Cessie S. Predictive value of statistical models. Stat. Med. 1990;9:1303–25.
Verweij PJM, Van Houwelingen HC. Penalized likelihood in Cox regression. Stat. Med. 1994;13:2427–36.
Willems JP, Saunders JT, Hunt DE, Schorling JB. Prevalence of coronary heart disease risk factors among rural blacks: a communitybased study. South. Med. J. 1997;90:814–20.
Acknowledgements
The authors would like to thank the two reviewers for suggestions that significantly improved the manuscript.
Funding
Open access funding provided by Austrian Science Fund (FWF). This research was funded by the Austrian Science Fund, project number I2276.
Author information
Authors and Affiliations
Contributions
AG and GH conceived the study. AG and GH designed the simulation experiment. AG coded the simulation experiment and analyzed the data. All authors interpreted the results and contributed to writing the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
S1. Problems in resampling techniques associated with small samples. S2. Data generating mechanism. S3. A side remark on the simple bootstrap: resampling may increase the optimism. Figure S1. Independently validated (solid line) and leaveoneout crossvalidated (dashed line) cstatistics for different penalization strengths in ridge regression on six artificially constructed data sets. The data were created in the same way as for one of the scenarios in our simulation study (null scenario, sample size of 50, marginal event fraction of 0.25). The xaxis shows the tuning parameter in ridge regression (lambda in the R package glmnet) with higher values corresponding to stronger penalization. For each data set we fitted 96 ridge regression models corresponding to a series of logequidistant tuning values. As in our simulation study, the independently validated cstatistics were obtained by validating the models on an independent data set consisting of 100,000 observations. As expected, the independently validated cstatistics are very close to the true value of 0.5. LOO, leaveoneout crossvalidation; IV, independently validated. Figure S2. Mean and root mean squared differences (RMSD) between Brier scores (BS) computed by data resampling techniques and independently validated (IV) BS for three different model estimators for the simulation settings with 50 observations, an event fraction of 0.25 and either no or strong effects. The Monte Carlo standard errors of both, the mean difference and of the root mean squared difference (x100), were smaller than 0.2 for all scenarios. ML, maximum likelihood; FL, Firth’s logistic regression; RR, ridge regression. LOO, leaveoneout crossvalidation; 5fold, 5fold crossvalidation; enhBT, enhanced bootstrap; .632+, .632+ bootstrap; app, apparent estimate. Table S1. Construction of explanatory variables in the simulation study, following Binder H, Sauerbrei W, Royston P. Multivariable ModelBuilding with Continuous Covariates: 1. Performance Measures and Simulation Design. Germany: University of Freiburg; 2011. Square brackets […] indicate that the argument is truncated to the next integer towards 0. The indicator function 1_{{…}} is equal to 1 if the argument is true and 0 otherwise. Table S2. Mean and standard deviation (x100) of independently validated (IV) cstatistics for different model estimators and all simulation scenarios. The standard deviation strongly depends on the number of new observations (in our case 100 000) used to estimate the IV cstatistics. Table S3. Mean and standard deviation (x100) of independently validated (IV) discrimination slope for different model estimators and all simulation scenarios. The standard deviation strongly depends on the number of new observations (in our case 100 000) used to estimate the IV discrimination slope. Table S4. Mean and standard deviation (x100) of independently validated (IV) Brier score for different model estimators and all simulation scenarios. The standard deviation strongly depends on the number of new observations (in our case 100 000) used to estimate the IV Brier score. Table S5. Percentage of separated data sets for the twelve simulation scenarios in the full data sets, in the data sets used for model fitting in leaveoneoutcrossvalidation, leavepairout crossvalidation and 5fold crossvalidation, respectively, and in the bootstrap data sets. Table S6. Mean difference and root mean squared difference (x100) between winsorized cstatistics computed by different resampling techniques and the independently validated (IV) value (as presented in Table S2) for simulation scenarios with sample size of 50 and event fraction of 0.25. Resampled cstatistics were winsorized by replacing values smaller than 0.5 by 0.5. Figure 3 shows the analogous results for the untransformed cstatistics.
Appendix
Appendix
Reinterpreting the discrimination slope
Here, we show that for an arbitrary model estimator which gives estimated probabilities \({\hat{\pi}}_i\) which are on average equal to the mean \(\overline{y}\) of the binary outcome y = (y_{i}), y_{i} ∈ {0, 1}, i = 1, …n, the discrimination slope is equal to the measure of explained variation \({\hat{V}}_B\) defined by Schemper [25]. Denote by \(\overline{\pi}={n}^{1}\left({\sum}_i{\hat{\pi}}_i\right)\) the average estimated probability, by assumption, we have \(\overline{\pi}=\overline{y}\). The explained variation \({\hat{V}}_B\) is defined as
The denominator is equal to \(2\overline{\pi}\left(1\overline{\pi}\right)\) as can be verified by splitting the sum into summands with y_{i} = 0 and summands with y_{i} = 1, noting that there are \(n\left(1\overline{\pi}\right)\)summands of the first type and \(n\overline{\pi}\)summands of the second type. Similarly, we can decompose
where we have used that \({\sum}_{i:{y}_i=1}{\hat{\pi}}_i\) is equal to \(n\overline{\pi}{\sum}_{i:{y}_i=0}{\hat{\pi}}_i\). Summarizing, \({\hat{V}}_B\) can be calculated as
Using the notation introduced above, the discrimination slope can be computed as \(\frac{\sum_{i:{y}_i=1}{\hat{\pi}}_i}{n\ \overline{\pi}}\frac{\sum_{i:{y}_i=0}{\hat{\pi}}_i}{n\ \left(1\overline{\pi}\right)}.\) Again exploiting that \({\sum}_{i:{y}_i=1}{\hat{\pi}}_i=n\overline{\pi}{\sum}_{i:{y}_i=0}{\hat{\pi}}_i\), we conclude that the discrimination slope and the measure of explained variation \({\hat{V}}_B\) agree.
Applying the .632+ bootstrap to the cstatistic
The .632+ bootstrap was introduced as a tool providing optimism corrected estimates for error rates [7]. It allows for different choices of the particular form of this error rate but assumes that the error rate can be assessed on the level of observations, i.e., quantifies the discrepancy between a predicted value and the corresponding observed outcome value. As both the cstatistic and the discrimination slope cannot be applied to single observations but only to collections of observations, we had to slightly modify the definitions.
The .632+ bootstrap estimate of the cstatistic, \({\hat{c}}^{.632+},\) is a weighted average of the apparent cstatistic \({\hat{c}}^{\textrm{app}}\) and an overly corrected bootstrap estimate \({\hat{c}}^{(1)}\). It is constructed as follows: the model is fitted on each of, say 200 bootstrap resamples (i.e., random samples of size n drawn with replacement), and is used to calculate the estimated probabilities for the observations omitted from the bootstrap resample. For each of the bootstrap resamples, the cstatistic is then calculated from the omitted observations. Finally, these cstatistics are averaged over all bootstrap resamples yielding the estimate \({\hat{c}}^{(1)}\).
The .632+ bootstrap estimate of the cstatistic is then given by
where \(\hat{w}=0.632/\left(10.368\ \hat{R}\right)\) with \(\hat{R}=\left({\hat{c}}^{\textrm{app}}{\hat{c}}^{(1)}\right)/\left(\ {\hat{c}}^{\textrm{app}}0.5\right).\) In order to ensure that \(\hat{R}\) falls between 0 and 1 such that \(\hat{w}\) ranges from 0.632 to 1, the following modifications are made:

set \({\hat{c}}^{(1)}\) to 0.5 if \({\hat{c}}^{(1)}\) is smaller than 0.5 and

set \(\hat{R}\) to 0 if \({\hat{c}}^{(1)}\)> \({\hat{c}}^{\textrm{app}}\) or if \(0.5\ge {\hat{c}}^{\textrm{app}}\).
The value 0.5 occurring in these modifications and in the denominator of \(\hat{R}\) is the expected cindex if the outcome is independent of the explanatory variables. The .632+ bootstrap estimate of the discrimination slope can be obtained analogously, just replacing 0.5 by 0 in the definitions above.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Geroldinger, A., Lusa, L., Nold, M. et al. Leaveoneout crossvalidation, penalization, and differential bias of some prediction model performance measures—a simulation study. Diagn Progn Res 7, 9 (2023). https://doi.org/10.1186/s41512023001460
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s41512023001460
Keywords
 Bootstrap
 Concordance statistic
 Discrimination slope
 Logistic regression
 Resampling techniques