A variety of statistics have been proposed as tools to help investigators evaluate diagnostic tests and prediction models. Sensitivity and specificity are generally reported for binary tests; for prediction models that give a continuous range of probabilities, discrimination (area under the curve (AUC) or concordance index) and calibration are recommended [1].
Recent years have seen considerable methodologic criticism of these traditional metrics, driven at least in part by interest in molecular markers. For instance, it has been argued that the AUC is insensitive: it does not markedly increase when a new marker is added to a model unless the odds ratio for that marker is very high [2, 3]. In 2008, Pencina and colleagues introduced the net reclassification improvement (NRI) as an alternative metric to the AUC [4]. The NRI measures the incremental prognostic impact of a new predictor when added to an existing prediction model for a binary outcome [5]. The metric became very widely used within a short period of time, a phenomenon attributed to the view that the change in AUC will be small even for valuable markers [6]. The NRI has since been debunked; Kerr et al. provide a comprehensive evaluation of its disadvantages [7]. Hilden and Gerds demonstrated that miscalibration can improve the NRI; critically, this means that the size of the test for the NRI is much larger than nominal levels under the null [5]. Hilden and Gerds’ commentary on their findings focuses on the concept of a “proper scoring rule”, that is, a metric that is optimized when the correct probabilities are used [5]. The authors mention the Brier score as an example of a proper scoring rule.
The Brier score improves on other statistical performance measures, such as the AUC, because it is influenced by both discrimination and calibration simultaneously, with smaller values indicating superior model performance. The Brier score also estimates a well-defined parameter in the population: the mean squared distance between the observed and predicted outcomes. The square root of the Brier score is thus the root-mean-squared distance between the observed and predicted values on the probability scale.
Hence, the Brier score would appear to be an attractive replacement for the NRI, and it has indeed been recommended and used in statistical practice to evaluate the clinical value of tests and models. For instance, the Encyclopedia of Medical Decision Making describes the use of the Brier score in the “Evaluation of statistical prediction rules” [Binder H, Graf E. Brier scores. In: [8]]. As a practical example, a study by la Cour Freiesleben et al. aimed to develop prognostic models for identifying patients’ risks of low and excessive response to conventional stimulation for in vitro fertilization/intracytoplasmic sperm injection, in order to ascertain whether a low or a high dosage level should be used. The conclusions concerned “the best prognostic model” for each of the two endpoints, with models selected on the basis of the Brier score. The authors then recommend that the models “be used for evidence-based risk assessment before ovarian stimulation and may assist clinicians in individual dosage [between two alternatives] of their patients” [9]. This is a clear example in which authors used the Brier score to make clinical recommendations.
The Brier score has also been used to evaluate binary diagnostic tests. For instance, Braga et al. [10] compared six binary decision rules for Zika infection with a novel prediction model that provided a semi-continuous score. Brier scores were reported for all comparators. The authors stated that the “lowest Brier score of 0.096” was for the prediction model, leading to the conclusion that “the model is useful for countries experiencing triple arboviral epidemics”. Similarly, Kloeckner et al. [11] used Brier scores to compare two binary risk groupings with a three-group categorization for survival after chemoembolization for liver cancer. They concluded that risk groupings were not “sufficient to support clear-cut clinical decisions”.
The Brier score depends on prevalence in such a way [12] that it may give undesirable results when clinical consequences are discordant with prevalence. For instance, if a disease were rare (low prevalence) but very serious and easily cured by an innocuous treatment (strong benefit to detection), the Brier score may inappropriately favor a more specific test over one with greater sensitivity. Indeed, this is approximately what was seen in the Zika virus paper [10], where the test with high sensitivity and moderate specificity (81% and 58%, respectively) had a much poorer Brier score than a test with low sensitivity but near-perfect specificity (29% and 97%).
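To make the arithmetic concrete, the following minimal sketch computes expected Brier scores for two such tests, using the result derived in the next section that, for a binary test read as predicting probabilities of 0 or 1, the expected Brier score equals the misclassification rate (“method 1”); the 10% prevalence is an assumption for illustration only.

```python
# Sketch: expected Brier score ("method 1") of a binary test, i.e., its
# misclassification rate, at an assumed prevalence. Illustrative only.
def brier_binary(sens, spec, prev):
    # False negatives among the diseased plus false positives among the healthy.
    return prev * (1 - sens) + (1 - prev) * (1 - spec)

prev = 0.10  # assumed prevalence, for illustration
print(brier_binary(sens=0.81, spec=0.58, prev=prev))  # sensitive test: ~0.40
print(brier_binary(sens=0.29, spec=0.97, prev=prev))  # specific test:  ~0.10
```

At this prevalence, the highly specific test attains a Brier score roughly four times lower, even though it misses 71% of cases.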
In this paper, we investigate scenarios in which we anticipate the Brier score might give a counter-intuitive rank ordering of tests and models. If the Brier score performs poorly in at least some common scenarios, this refutes any claim that it has general value as a metric for the clinical value of diagnostic tests or prediction models. As a comparator, we apply a decision-analytic net benefit method to the same scenarios. We start by introducing the Brier score and the decision-analytic alternative before applying both to four illustrative scenarios.
Brier score
The Brier score was introduced by Brier in 1950 for the verification of weather forecasts and has since been adopted outside meteorology as a simple scoring rule for assessing predictions of binary outcomes. It measures accuracy as the squared (Euclidean) distance between the observed outcome and the predicted probability assigned to that outcome for each observation [13]. The Brier score simultaneously captures discrimination and calibration, with low values being desirable.
It has previously been established that the Brier score is a proper scoring rule [14]. As overfitting results in miscalibration, this property penalizes overfit models. For instance, Hilden and Gerds generated regression trees (“greedy” and “modest”) from a training dataset with differing propensities to overfit. When the models were applied to a validation set, the Brier score was superior for the modest tree, although the NRI favored the greedy tree [5].
In terms of notation, D is a random variable representing the outcome and X is a random variable representing the predicted probability of the outcome. Consider a set of n patients and let the subscript i index the individual patient. Let \( d_i \) represent the observed outcome of patient i, such that \( d_i = 0 \) if the disease is absent and \( d_i = 1 \) if the disease is present. Let \( x_i \) denote the predicted probability of disease for the ith patient. The Brier score, the mean squared prediction error, is defined as:
$$ \mathrm{BS}\left(D,X\right)=E\left[{\left(D-X\right)}^2\right] $$
The expected value of the Brier score can be estimated using \( \frac{1}{n}{\sum}_{i=1}^n{\left({d}_i-{x}_i\right)}^2 \), provided that \( 0\le {x}_i\le 1 \) for all i = 1, 2, …, n.
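As a minimal sketch, this estimator can be computed directly from observed outcomes and predicted risks (the data below are invented for illustration):

```python
import numpy as np

def brier_score(d, x):
    """Mean squared prediction error between binary outcomes d and predicted probabilities x."""
    d = np.asarray(d, dtype=float)
    x = np.asarray(x, dtype=float)
    return np.mean((d - x) ** 2)

# Illustrative data: observed outcomes and predicted risks for five patients.
d = [0, 1, 0, 1, 1]
x = [0.10, 0.80, 0.30, 0.60, 0.90]
print(brier_score(d, x))  # 0.062
```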
We wish to calculate Brier scores for several hypothetical scenarios in which we vary the prevalence and the calibration of a model. In the case of a binary test, let \( T_i \) denote the result of the test for the ith patient, such that \( T_i = 1 \) if the test is positive for the disease and \( T_i = 0 \) if the test is negative. The expected Brier score can be represented by:
$$ E\left[\mathrm{BS}\right]=P\left(T=1,D=0\right)+P\left(T=0,D=1\right) $$
This equals the misclassification rate in this binary test setting. We will refer to this derivation as “method 1”. An alternative to viewing a binary test as giving probabilities of 0 or 1 is to use the probability of disease among test-positive patients (the positive predictive value) for a positive test and the probability of disease among test-negative patients (one minus the negative predictive value) for a negative test. We will refer to this derivation as “method 2”. This gives an expected Brier score:
$$ E\left[\mathrm{BS}\right]={\left(1-\mathrm{PPV}\right)}^2P\left(D=1,T=1\right)+{\mathrm{PPV}}^2P\left(D=0,T=1\right)+{\mathrm{NPV}}^2P\left(D=1,T=0\right)+{\left(1-\mathrm{NPV}\right)}^2P\left(D=0,T=0\right) $$
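The following minimal sketch compares the two derivations; the sensitivity, specificity, and prevalence values are the illustrative assumptions carried over from the Zika example above:

```python
def expected_brier(sens, spec, prev):
    """Expected Brier score of a binary test under methods 1 and 2."""
    # Joint probabilities of test result and disease status.
    p11 = prev * sens              # P(T=1, D=1)
    p10 = (1 - prev) * (1 - spec)  # P(T=1, D=0)
    p01 = prev * (1 - sens)        # P(T=0, D=1)
    p00 = (1 - prev) * spec        # P(T=0, D=0)
    ppv = p11 / (p11 + p10)
    npv = p00 / (p00 + p01)
    method1 = p10 + p01  # misclassification rate
    method2 = ((1 - ppv) ** 2 * p11 + ppv ** 2 * p10
               + npv ** 2 * p01 + (1 - npv) ** 2 * p00)
    return method1, method2

print(expected_brier(sens=0.81, spec=0.58, prev=0.10))  # ~ (0.397, 0.085)
```

Because method 2 replaces the extreme predictions of 0 and 1 with calibrated predictive values, its expected Brier score is never worse than that of method 1.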
Method 1 might therefore be seen as a miscalibrated version of method 2. In the case of logistic regression, the Brier score can be written as a function of a continuous covariate z and the regression coefficients:
$$ \log \left(\frac{P\left(D=1|Z=z\right)}{1-P\left(D=1|Z=z\right)}\right)=\mathrm{logit}\left(P\left(D=1|Z=z\right)\right)={\beta}_0+{\beta}_1z,\ \mathrm{where}\ Z\sim {f}_Z(z) $$
$$ D\mid Z=z\sim \mathrm{Bernoulli}\left({\mathrm{logit}}^{-1}\left({\beta}_0+{\beta}_1z\right)\right) $$
The Brier score can be represented using the joint distribution of D and X, where \( X={\mathrm{logit}}^{-1}\left({\beta}_0+{\beta}_1z\right) \):
$$ \mathrm{BS}\left(D,X\right)=E\left[{\left(D-X\right)}^2\right]=\sum \limits_{d=0}^1{\int}_{x=0}^1{\left(d-x\right)}^2{f}_{DX}\left(d,x\right) dx $$
where
$$ {f}_{DX}\left(d,x\right)={f}_{DZ}\left(d,z\right)\frac{{\left|{\beta}_1\right|}^{-1}}{x\left(1-x\right)},\ \mathrm{and}\ {f}_{DZ}\left(d,z\right)={f}_{D\mid Z=z}(d){f}_Z(z) $$
Therefore, the value of the Brier score in the case of logistic regression can be directly calculated using the following equation:
$$ \mathrm{BS}\left(D,X\right)=\sum \limits_{d=0}^1{\int}_{x=0}^1{\left(d-x\right)}^2\frac{{\left|{\beta}_1\right|}^{-1}}{x\left(1-x\right)}{x}^d{\left(1-x\right)}^{1-d}{f}_Z(z) dx,\mathrm{where}\ x=\frac{1}{1+{e}^{-\left({\beta}_0+{\beta}_1z\right)}} $$
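As a minimal numerical sketch, this integral can be evaluated by integrating over z instead of x (equivalent by the change of variables above); the standard normal distribution for Z and the coefficient values are assumptions for illustration:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def brier_logistic(beta0, beta1):
    """Brier score of a correctly specified logistic model with Z ~ N(0, 1)."""
    def integrand(z):
        x = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * z)))  # predicted risk at z
        # Sum over d in {0, 1} of (d - x)^2 P(D = d | z), weighted by f_Z(z);
        # for a calibrated model this simplifies to x * (1 - x) * f_Z(z).
        return ((1 - x) ** 2 * x + x ** 2 * (1 - x)) * norm.pdf(z)
    value, _ = quad(integrand, -np.inf, np.inf)
    return value

print(brier_logistic(beta0=-2.0, beta1=1.0))  # illustrative coefficients
```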
Net benefit
Net benefit is a decision-analytic statistic that incorporates benefits (true positives) and harms (false positives), weighting the latter to reflect relative clinical consequences [15]. Net benefit is often reported as a decision curve, in which net benefit is plotted against the threshold probability \( p_t \), defined as the minimum probability of disease \( \widehat{p} \) at which a patient will opt for treatment [16]. For example, a 5% threshold probability means that if a patient’s risk of disease is 5% or more, the patient should be treated; if it is less than 5%, treatment should be avoided. In other words, a threshold probability of 5% means that an untreated disease is considered 19 times worse than an unnecessary treatment. Net benefit has been shown to be a proper scoring rule, as any difference between the true probability of the event and the predicted probability decreases net benefit [17, 18].
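The factor of 19 in the example above follows directly from the odds at the threshold:
$$ \frac{p_{\mathrm{t}}}{1-{p}_{\mathrm{t}}}=\frac{0.05}{0.95}=\frac{1}{19} $$
that is, one missed case of disease is weighted as heavily as 19 unnecessary treatments.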
We wished to compare net benefit at threshold probabilities of interest for the binary tests and continuous prediction models in our example scenarios [16]. The net benefit of a risk prediction weighs the relative harms of false-positive and false-negative results according to the chosen threshold probability [16]:
$$ \mathrm{Net}\ \mathrm{benefit}=\mathrm{TPR}-\mathrm{FPR}\left(\frac{p_{\mathrm{t}}}{1-{p}_{\mathrm{t}}}\right) $$
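As a minimal sketch (here TPR and FPR are taken as the proportions of true-positive and false-positive results among all n patients; the operating characteristics and 10% prevalence are the illustrative assumptions used earlier):

```python
def net_benefit(tp, fp, n, pt):
    """Net benefit at threshold probability pt, given tp true positives and
    fp false positives among n patients."""
    return tp / n - (fp / n) * (pt / (1 - pt))

# Illustrative binary test: 1000 patients, 10% prevalence,
# sensitivity 0.81 and specificity 0.58, evaluated at a 5% threshold.
n, prev, sens, spec, pt = 1000, 0.10, 0.81, 0.58, 0.05
tp = prev * n * sens              # 81 true positives
fp = (1 - prev) * n * (1 - spec)  # 378 false positives
print(net_benefit(tp, fp, n, pt))  # 0.081 - 0.378 / 19 ≈ 0.061
```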