Methodological concerns about “concordance-statistic for benefit” as a measure of discrimination in predicting treatment benefit

Prediction algorithms that quantify the expected benefit of a given treatment conditional on patient characteristics can critically inform medical decisions. Quantifying the performance of treatment benefit prediction algorithms is an active area of research. A recently proposed metric, the concordance statistic for benefit (cfb), evaluates the discriminative ability of a treatment benefit predictor by directly extending the concept of the concordance statistic from a risk model with a binary outcome to a model for treatment benefit. In this work, we scrutinize cfb on multiple fronts. Through numerical examples and theoretical developments, we show that cfb is not a proper scoring rule. We also show that it is sensitive to the unestimable correlation between counterfactual outcomes and to the definition of matched pairs. We argue that measures of statistical dispersion applied to predicted benefits do not suffer from these issues and can be an alternative metric for the discriminatory performance of treatment benefit predictors.


Background
Precision medicine emphasizes optimizing medical care by individualizing treatment decisions based on each patient's unique characteristics. A better understanding of the heterogeneity of treatment effect is the foundation for formulating optimal treatment decisions [1]. Treatment benefit predictors, mathematical functions that predict the average treatment benefit conditional on individuals' characteristics, are important enablers of such individualized care [2].
Much of the methodological development in predictive analytics has been based on risk predictors, functions that return an estimate of the risk of an event given patient characteristics. Focusing instead on the prediction of treatment benefit presents a relatively new paradigm. As such, there is an increasing interest in evaluating the performance of treatment benefit predictors. For risk prediction, the performance of a predictor is often categorized into discrimination, calibration, and clinical utility (net benefit) [3]. Discrimination refers to the ability of the predictions to separate individuals with and without the outcome of interest. Calibration is about how close the predicted and the actual risks are. Net benefit evaluates the clinical utility of a risk prediction model by subtracting the harm of false-positive classifications from the benefit of true-positive classifications. The treatment benefit paradigm can incorporate these concepts. For example, calibration plots [4], discriminatory performance measures [5,6], and net benefit [7] have been extended to evaluate treatment benefit predictors. Maas et al. further extended other performance measures, such as the E-statistic, cross-entropy, and Brier score, to treatment benefit predictors [8].
Our work focuses on metrics (i.e., summaries) for the discriminatory performance of treatment benefit predictors. In particular, the concordance statistic for benefit (cfb) proposed by van Klaveren et al. (2018) evaluates the discriminatory performance of treatment benefit predictors by conceptually extending the idea of the concordance statistic (c-statistic) for risk predictors [5]. The metric, cfb, has been used in a number of applied studies. For instance, Meid et al. (2021) used cfb to evaluate a treatment benefit predictor for oral anticoagulants for preventing strokes, major bleeding events, and a composite of both [9]. Duan et al. (2019) used cfb to compare two models for predicting individual treatment benefit for intensive blood pressure therapy [10].
In this work, we scrutinize cfb on multiple fronts. In particular, we consider its properness as a scoring rule, its sensitivity to the correlation between counterfactual outcomes, and its sensitivity to the definition of matched patient pairs at the population level. In a recent preprint [11], Hoogland et al. also considered some theoretical and methodological issues around cfb. Some of their concerns connect with ours, as we will indicate later. The rest of this manuscript is structured as follows. We first review the original description of cfb [5] and provide an analogous definition at the population level. We then demonstrate several scenarios in which cfb is shown to be an improper scoring rule, followed by further discussions.

Notation
We consider scenarios arising from a binary treatment decision $T$ (control: 0 v. treated: 1) and a binary outcome $Y$ (unfavorable outcome: 0 v. favorable outcome: 1). The individual treatment benefit is usually formulated as the algebraic difference in outcomes under both treatment arms (i.e., treatment minus control). When the outcome is binary, individual treatment benefit can be described by a ternary variable $B$ with levels consisting of harm ($B = -1$), no effect ($B = 0$), and benefit ($B = 1$). In some narrow contexts, such as some cross-over studies where strong assumptions hold, $B$ can be directly observed. However, in most situations, $B$ is not observable. For instance, in a prototypical parallel-arm clinical trial, $B$ cannot be observed directly as one of the outcomes is counterfactual. van Klaveren et al. provided an algorithmic definition of cfb in such studies based on comparing outcomes from different treatment arms measured on two similar patients [5]. The definition of cfb in the above-mentioned work is based on a given sample. However, we stress that the descriptions of cfb involving phrases such as "the proportion of all possible pairs" and "probability that from two randomly chosen matched pairs" make clear how cfb can be readily interpreted at the population level. In what follows, and without loss of generality, we view cfb as a population-level attribute defined regardless of whether $B$ is observable or estimated.
A vector of baseline covariates for patients is denoted as $X$, and $E[B \mid X]$ is the average treatment benefit of a subpopulation stratified by $X$. Let a function of $X$ denoted by $h(x)$ be a treatment benefit predictor, i.e., $H = h(X)$ is taken as a prediction of $B$. The best possible predictor, denoted as $h^*(x) = E[B \mid X = x]$, is a special case in which the corresponding cfb is indicated as $\mathrm{cfb}^*$. We restrict our attention to randomized controlled trials (RCTs) in which $H^* = h^*(X)$ can be expressed as $E[Y \mid T = 1, X] - E[Y \mid T = 0, X]$, ranging between $-1$ and $1$.

Definition of cfb
The original definition of cfb directly extends the definition of the c-statistic from risk predictors for binary outcomes to treatment benefit predictors [5]. To define cfb, we randomly select two patients from the population, whose treatment benefit quantities are $(B_1, H_1)$ and $(B_2, H_2)$. If $B_1 \neq B_2$, the pair is concordant when the patient with the greater $B$ also has the greater $H$, discordant when that patient has the smaller $H$, and tied when $H_1 = H_2$. Otherwise, the pair is not considered. We score each concordant pair by 1, each tied pair by 0.5, and each discordant pair by 0. The cfb is the average of the scores. Mathematically, labeling the two patients so that $B_1 > B_2$, cfb can be expressed as

$$\mathrm{cfb} = \Pr(H_1 > H_2 \mid B_1 > B_2) + 0.5\,\Pr(H_1 = H_2 \mid B_1 > B_2). \qquad (1)$$

In many applications ties among $H$ will not occur, for instance, if $X$ has a continuous component and $h(\cdot)$ is smooth. In this case, cfb is the proportion of concordant pairs among the pairs satisfying $B_1 \neq B_2$, which can be written as $\mathrm{cfb} = \Pr(H_1 > H_2 \mid B_1 > B_2)$. This was the working definition by van Klaveren et al. [5]. By incorporating ties, a technique commonly used in defining the c-statistic, the concept behind cfb can be extended, allowing for a more intuitive demonstration of its properties in various scenarios. When $H$ is independent of $B$, $\mathrm{cfb} = 0.5$. Conversely, if a large proportion of pairs are concordant, indicating that units with greater $B$ also have greater $H$, then the value of cfb is close to its maximum of 1. This definition of cfb is descriptive and directly reflects the concordance association between predictions and actual benefits. Its simplicity allows us to focus on the conceptual aspect of cfb.
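The scoring scheme described above (1 for concordant, 0.5 for tied, 0 for discordant, averaged over pairs with unequal benefit) can be sketched for a discrete population as follows. This is a minimal illustration rather than the authors' implementation; the function name `cfb_from_joint` and the dictionary representation of the joint distribution of (B, H) are assumptions made for this example.

```python
from itertools import product


def cfb_from_joint(joint):
    """Population-level cfb for a discrete joint distribution of (B, H).

    `joint` maps (b, h) -> Pr(B = b, H = h). Two patients are drawn
    independently; only pairs with unequal benefit are scored.
    (Hypothetical helper, written for illustration only.)
    """
    concordant = discordant = tied = 0.0
    for ((b1, h1), w1), ((b2, h2), w2) in product(joint.items(), repeat=2):
        if b1 <= b2:
            continue  # label the pair so that patient 1 has the greater B
        w = w1 * w2
        if h1 > h2:
            concordant += w   # scored 1
        elif h1 < h2:
            discordant += w   # scored 0
        else:
            tied += w         # scored 0.5
    considered = concordant + discordant + tied
    return (concordant + 0.5 * tied) / considered


# A predictor independent of B, and a perfectly concordant predictor.
independent = {(b, h): 0.25 for b in (0, 1) for h in (0, 1)}
perfect = {(1, 1): 0.5, (0, 0): 0.5}
print(cfb_from_joint(independent), cfb_from_joint(perfect))
```

The two toy distributions recover the reference values mentioned in the text: 0.5 when H carries no information about B, and 1 when greater B always comes with greater H.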

Methodological concerns about cfb
cfb is an improper scoring rule
The definition of proper scoring rules dates back to Savage (1971) [12]. Gneiting and Raftery (2007) [13] defined the "proper scoring rule" for probability-type metrics, defining a metric as proper if the expectation of the metric is maximized (or minimized) when correct probabilities are used. This concept has since been frequently used to investigate the reliability of metrics [14,15]. Pepe et al. (2015) expanded on the concept of properness for metrics that evaluate the improvement in prediction performance gained by adding extra covariates [16]. While the technical definitions of a proper scoring rule vary slightly across the works cited above, the spirit is the same. The obvious adaptation to the treatment benefit prediction context is to demand that, in any given population, the summarizing metric, as a function of $h(\cdot)$, be maximized by the $h^*(\cdot)$ that truly outperforms other benefit predictors (in the sense of expected squared error).
According to the provided definition of cfb, we start with a distribution of $(B, X)$. We consider a single binary variable $X$. As such, the distribution of $(B \mid X)$ can be summarized by two probability triples $\{(p_{-1}, p_0, p_{+1}), (q_{-1}, q_0, q_{+1})\}$, respectively for $X = 0$ and $X = 1$. Particularly, $p_i = \Pr(B = i \mid X = 0)$ and $q_i = \Pr(B = i \mid X = 1)$, where $i \in \{-1, 0, +1\}$ denotes the level of $B$, and $\sum_i p_i = \sum_i q_i = 1$.
An example might be breast cancer surgery, where the treatment is surgical therapy and the control is conservative therapy. The tumor grade $X$ is binary with values (0: low grade; 1: high grade), and the individual treatment benefit, $B$, is ternary. Assume $\Pr(X = 1) = 0.5$, and consider the distribution of $(B \mid X)$ summarized by the two probability triples $\{(0.25, 0.01, 0.74), (0.14, 0.18, 0.68)\}$, with $E[B] = 0.515$ for the treatment.
Properties of cfb are investigated by comparing two treatment benefit predictors: $h^*(\cdot)$ and random guessing. Random guessing is the base predictor with $\mathrm{cfb} = 0.5$, as $H \perp\!\!\!\perp B$. For the predictor $h^*(\cdot)$, predictions take values from $\{h^*(0), h^*(1)\} = \{0.49, 0.54\}$. We anticipate that $\mathrm{cfb}^* \geq 0.5$, as the performance of random guessing should be no better than that of $h^*(\cdot)$. Note that we can use the probability triples to express the distribution of $(B \mid H^*)$ in this scenario. If we randomly select $\{(B_1, H_1), (B_2, H_2)\}$ from the population, the probability of getting concordant and discordant pairs is summarized in Table 1.
Because the two patients are drawn independently and $\Pr(X = 1) = 0.5$, each combination of covariate values for the pair has probability $0.5^2$. The event of having a concordant pair, $\{H_1^* > H_2^*, B_1 > B_2\}$, consists of 3 mutually disjoint scenarios, and its probability is $0.5^2 \sum_{a>b} q_a p_b = 0.05545$, where $a$ and $b$ denote the level of $B$. Similarly, the probability of having a discordant pair, $\{H_1^* < H_2^*, B_1 > B_2\}$, is $0.5^2 \sum_{a<b} q_a p_b = 0.05955$. Because discordant pairs are more probable than concordant pairs and every tied pair scores 0.5, definition (1) gives a $\mathrm{cfb}^*$ that is smaller than 0.5. In other words, cfb fails to sensibly contrast the predictive performance of $h^*(\cdot)$, resulting in the best possible predictor having a cfb below that of random guessing, indicating that cfb is not a proper scoring rule.
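The arithmetic of this example can be checked with a few lines of Python. The sketch below assumes the scoring convention stated in the definition section (tied pairs score 0.5 and all pairs with unequal benefit enter the denominator); the variable names are introduced here for illustration only.

```python
# Probability triples from the breast cancer example, indexed by the level of B.
p = {-1: 0.25, 0: 0.01, 1: 0.74}   # Pr(B = i | X = 0)
q = {-1: 0.14, 0: 0.18, 1: 0.68}   # Pr(B = i | X = 1)
w = 0.5 * 0.5                      # probability of any (X1, X2) combination


def greater(u, v):
    """Sum of u[a] * v[b] over all levels a > b."""
    return sum(u[a] * v[b] for a in u for b in v if a > b)


# Best possible predictor h*(x) = E[B | X = x].
h_star = {0: p[1] - p[-1], 1: q[1] - q[-1]}

concordant = w * greater(q, p)               # X1 = 1, X2 = 0 and B1 > B2
discordant = w * greater(p, q)               # X1 = 0, X2 = 1 and B1 > B2
tied = w * (greater(p, p) + greater(q, q))   # X1 = X2 and B1 > B2

# Assumed scoring: ties count 0.5 and stay in the denominator.
cfb_star = (concordant + 0.5 * tied) / (concordant + discordant + tied)
# Variant that drops tied pairs altogether, shown for comparison.
cfb_star_no_ties = concordant / (concordant + discordant)

print(h_star, concordant, discordant, cfb_star, cfb_star_no_ties)
```

Under these assumptions the concordant and discordant probabilities come out as 0.05545 and 0.05955, and the resulting value is roughly 0.491 with ties scored 0.5 (roughly 0.482 if tied pairs are dropped), so the conclusion that the best possible predictor falls below 0.5 does not hinge on how ties are handled.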
These improper scenarios can easily extend to a continuous $X$ (or a continuous $H^*$). One choice is through the connection between Bernoulli and Beta distributions.

Table 1 Joint probabilities of paired experiments
For instance, a binary $X$ with $\Pr(X = 1) = 0.5$ can be treated as the limit of a continuous distribution $\mathrm{Beta}(\varepsilon, \varepsilon)$ as $\varepsilon$ approaches 0. Therefore, we consider a continuous $X$ having a Beta distribution and define $\Pr(B \mid X)$ as a linear interpolation of two sets of probabilities $(p_{-1}, p_0, p_{+1})$ and $(q_{-1}, q_0, q_{+1})$, which satisfy (2) and (3). That is,

$$\Pr(B = i \mid X = x) = (1 - x)\,p_i + x\,q_i, \qquad i \in \{-1, 0, +1\}.$$

It is mathematically guaranteed that, by making the parameters of the Beta distribution small enough, $\mathrm{cfb}^* < 0.5$ (as demonstrated for the binary $X$). Appendix B provides two such examples that yield $\mathrm{cfb}^* < 0.5$.
Note that in the context of binary risk prediction and the absence of censoring, the c-statistic (equal to the area under the receiver operating characteristic curve) is a proper scoring rule [17]. Why is cfb an improper scoring rule despite its conceptual analogy with the c-statistic? It is because there exist pairs of probability triples that satisfy both (2) and (3) simultaneously for a ternary $B$. Conversely, no such probability sets exist when $B$ is binary. In the case of the c-statistic, where the outcome is binary, (2) and (3) cannot hold simultaneously when we additionally require that $q_{-1} + q_{+1} = p_{-1} + p_{+1} = 1$. This finding raises concerns about applying the c-statistic to non-binary outcomes. Similar rank-based measures for other metrics that pertain to non-binary outcomes are also at risk of being improper. For instance, Blanche et al. (2019) demonstrated that the c-statistic is not a proper scoring rule for the concordance of time-to-event values [18].
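The binary case can be verified by a short calculation. The sketch below assumes that the two requirements for an improper scenario amount to (i) a larger predicted benefit at $X = 1$ than at $X = 0$ and (ii) a concordant probability below the discordant one; this reading of the conditions is ours and is stated here only to make the argument concrete.

```latex
% Binary B: set p_0 = q_0 = 0, so p_{-1} = 1 - p_{+1} and q_{-1} = 1 - q_{+1}.
\begin{align*}
h^*(1) - h^*(0)
  &= (q_{+1} - q_{-1}) - (p_{+1} - p_{-1})
   = 2\,(q_{+1} - p_{+1}), \\
\sum_{a>b} q_a p_b - \sum_{a<b} q_a p_b
  &= q_{+1}\,p_{-1} - q_{-1}\,p_{+1}
   = q_{+1}(1 - p_{+1}) - (1 - q_{+1})\,p_{+1}
   = q_{+1} - p_{+1}.
\end{align*}
% Both differences share the sign of q_{+1} - p_{+1}: whenever h^* ranks X = 1
% above X = 0, concordant pairs are at least as probable as discordant pairs.
```

With a ternary $B$ the middle level breaks this link between the two differences, which is what allows examples such as the one above.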

Improperness under the counterfactual framework
Based on probability triples governing the distribution of $B$, we have characterized distributions of $(B \mid X)$ that yield $\mathrm{cfb}^* < 0.5$. In counterfactual scenarios, the outcome that would be observed under treatment $T = t$ is denoted as $Y^{(t)}$. The population defined by $(Y^{(0)}, Y^{(1)} \mid X)$ imposes a distribution of $B = Y^{(1)} - Y^{(0)}$ given $X$, but we cannot necessarily recover an arbitrary pair of probability triples describing $(B \mid X)$ this way. However, probability distributions for $(B \mid X)$ derived from a distribution for $(Y^{(0)}, Y^{(1)} \mid X)$ can still result in $\mathrm{cfb}^* < 0.5$. A way to specify such a population is by connecting counterfactual outcomes with the previously identified $(B \mid X)$ distributions that result in $\mathrm{cfb}^* < 0.5$. Under the assumption $Y^{(0)} \perp\!\!\!\perp Y^{(1)} \mid X$, we can numerically evaluate whether or not there exist distributions of $(Y^{(0)} \mid X)$ and $(Y^{(1)} \mid X)$ that can produce a given $(B \mid X)$ distribution. We found a subset of identified probability triples that can arise from a counterfactual starting point and yield $\mathrm{cfb}^* < 0.5$. The detailed screening process is provided in Appendix C. Similar to us, Hoogland et al. also cast the development of cfb in the counterfactual outcome framework as a starting point for their investigations [11].
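One way to check whether a given probability triple can arise from conditionally independent counterfactual outcomes is to solve for the two marginal risks directly. The sketch below is an illustration under that independence assumption and is not the screening procedure of Appendix C; the function name `independent_margins` is hypothetical. Writing $u = \Pr(Y^{(0)} = 1 \mid x)$ and $v = \Pr(Y^{(1)} = 1 \mid x)$, independence gives $\Pr(B = +1) = v(1-u)$ and $\Pr(B = -1) = u(1-v)$, so $u$ solves a quadratic.

```python
import math


def independent_margins(triple, tol=1e-12):
    """Solve for (u, v) = (Pr(Y0 = 1 | x), Pr(Y1 = 1 | x)) given a triple.

    `triple` = (Pr(B = -1), Pr(B = 0), Pr(B = +1)) with B = Y1 - Y0, assuming
    Y0 and Y1 are independent given X. Returns a sorted list of solutions,
    which is empty when no pair of margins reproduces the triple.
    (Hypothetical helper, written for illustration only.)
    """
    p_minus, _p_zero, p_plus = triple
    d = p_plus - p_minus                      # equals v - u
    # u(1 - v) = p_minus with v = u + d  =>  u^2 - (1 - d) u + p_minus = 0
    disc = (1.0 - d) ** 2 - 4.0 * p_minus
    if disc < -tol:
        return []
    root = math.sqrt(max(disc, 0.0))
    solutions = []
    for u in {((1.0 - d) - root) / 2.0, ((1.0 - d) + root) / 2.0}:
        v = u + d
        if -tol <= u <= 1.0 + tol and -tol <= v <= 1.0 + tol:
            solutions.append((u, v))
    return sorted(solutions)


print(independent_margins((0.25, 0.01, 0.74)))   # triple from the example
print(independent_margins((0.06, 0.38, 0.56)))   # built from u = 0.2, v = 0.7
```

Under this check, the low-grade triple of the breast cancer example has no solution, which is in line with the statement that only a subset of the identified triples can arise from a counterfactual starting point.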

cfb is sensitive to correlation between counterfactual outcomes
Specifying the population distribution via counterfactuals reveals another caveat of cfb. We find that $\mathrm{cfb}^*$ changes as a function of the conditional dependence between counterfactual outcomes. This finding is more general, but it is more easily understood in the context of continuous counterfactual outcomes that have linear relationships with normally distributed baseline characteristics. The setting allows a closed-form expression for $\mathrm{cfb}^*$, and the conditional dependency level is quantified by the correlation coefficient between unobserved random terms added to the linear relationships (see Appendix D for an example). We denote the correlation coefficient as $\rho$. In practice, as $Y^{(0)}$ and $Y^{(1)}$ cannot be observed simultaneously for the same individual, the conditional dependence is unidentifiable. Nevertheless, $\rho$ affects the distribution of $B$ and thus the value of cfb. The expression for dichotomous outcomes obtained by thresholding the continuous $Y^{(0)}$ and $Y^{(1)}$ does not lead to an obvious closed-form expression, but the dependence of cfb on $\rho$ stands.
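The dependence on $\rho$ can be made concrete with a small sketch of the linear-normal setting. The assumptions here are ours and may differ in detail from Appendix D: $X$ is standard normal, $B = \beta_t + \beta_{xt} X + (\varepsilon_1 - \varepsilon_0)$, and $(\varepsilon_0, \varepsilon_1)$ is bivariate normal with standard deviations `sd0`, `sd1` and correlation `rho`. For two independent patients, the differences $B_1 - B_2$ and $H_1^* - H_2^*$ are then jointly normal with correlation $r$, and the standard orthant-probability result gives $\Pr(H_1^* > H_2^* \mid B_1 > B_2) = 1/2 + \arcsin(r)/\pi$.

```python
import math


def cfb_star_linear(beta_xt, rho, sd0=1.0, sd1=1.0):
    """cfb* = Pr(H1* > H2* | B1 > B2) in an assumed linear-normal setting.

    Requires beta_xt != 0 or some residual noise, otherwise B is constant.
    (Hypothetical helper, written for illustration only.)
    """
    # Var(eps1 - eps0); clipped at 0 to absorb floating-point error.
    var_noise = max(sd0**2 + sd1**2 - 2.0 * rho * sd0 * sd1, 0.0)
    var_signal = beta_xt**2                  # Var(h*(X)) with X ~ N(0, 1)
    # Correlation between B1 - B2 and H1* - H2*.
    r = min(math.sqrt(var_signal / (var_signal + var_noise)), 1.0)
    return 0.5 + math.asin(r) / math.pi


for rho in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(rho, cfb_star_linear(beta_xt=1.0, rho=rho))
```

In this sketch the value for a fixed $h^*(\cdot)$ rises with $\rho$ and reaches 1 when the two error terms coincide, even though $\rho$ cannot be estimated from data in which only one counterfactual outcome is observed per patient.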

cfb is sensitive to definition of matched pairs
We consider a source population, specifically, its marginal distribution of $X$ and its conditional distribution of $(Y \mid T, X)$. In this context, to calculate cfb for a given $h(\cdot)$, a connection needs to be built between $(Y \mid T, X)$ and $(B \mid H)$. One way, as proposed by van Klaveren et al., is constructing a matched population consisting of matched patient pairs [5]. Specifically, in their original work, they used examples for $h(\cdot)$ based on a logistic regression model for the observed $Y$ given treatment and covariates and defined the observed benefit as the difference in outcomes between two similar patients in a matched pair. These two patients were from different treatment groups, and the similarity was defined in two ways: similarity in covariate patterns or similarity in predicted benefits. While the authors considered both definitions of matching acceptable, we show that matched pairs based on these two criteria can yield different cfb values. Within each matched pair, we denote the quantities of interest as $\{(Y_1, X_1, T_1), (Y_2, X_2, T_2)\}$.
To give concrete examples, say we have an RCT with $T \perp\!\!\!\perp X$. The covariate $X$ is presumed to be a ternary variable taking value $x \in \{0, 1, 2\}$, and we set the distribution of $X$ as

$$\Pr(X = 0) = a, \qquad \Pr(X = 1) = b, \qquad \Pr(X = 2) = 1 - a - b.$$

It is convenient to parameterize $\Pr(Y = 1 \mid T, X)$ as

$$\mathrm{logit}\,\Pr(Y = 1 \mid T = t, X = x) = \beta_0 + \beta_x x + \beta_t t + \beta_{xt}\,x t.$$

In this setting, the parameters $\{a, b, \beta_0, \beta_x, \beta_t, \beta_{xt}\}$ determine a distribution of $X$ and a distribution of $(Y \mid T, X)$, which are the basic ingredients for constructing a matched population. For the same ingredients, matching patients by $X$ or matching patients by $H$ can lead to different distributions of $(B \mid H)$ when $h(\cdot)$ is not a bijective function mapping $X$ to $H$ (see Appendix E.2 for a mathematical derivation). Therefore, it is possible to generate substantially different cfb values from two distinct distributions of $(B \mid H)$. Such sensitivity to the definition of matched pairs is also highlighted by Hoogland et al. and investigated in more detail [11].
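How the two matching definitions lead to different distributions of $(B \mid H)$ can be sketched as follows. This is an illustration under several assumptions that are ours: a logistic model with main effects and an interaction for $\Pr(Y = 1 \mid T, X)$, a treated patient drawn first and a control matched to it, independent outcomes within a pair, and the hypothetical function name `benefit_given_h`. It does not reproduce the closed-form expressions of the appendices.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def benefit_given_h(px, beta, h, match_on):
    """Return {H: (Pr(B=-1|H), Pr(B=0|H), Pr(B=+1|H))} for a matched population.

    px: {x: Pr(X = x)}; beta = (b0, bx, bt, bxt) in the assumed model
    logit Pr(Y = 1 | T = t, X = x) = b0 + bx*x + bt*t + bxt*x*t.
    match_on="X": the control shares the treated patient's X.
    match_on="H": the control is drawn from patients with the same h(X).
    """
    b0, bx, bt, bxt = beta

    def risk(t, x):
        return expit(b0 + bx * x + bt * t + bxt * x * t)

    out = {}
    for H in sorted({h(x) for x in px}):
        S = [x for x in px if h(x) == H]
        mass = sum(px[x] for x in S)
        wts = {x: px[x] / mass for x in S}           # Pr(X = x | H)
        plus = minus = 0.0
        for x1 in S:                                  # treated patient
            for x2 in S:                              # control patient
                if match_on == "X":
                    pair_w = wts[x1] if x1 == x2 else 0.0
                else:
                    pair_w = wts[x1] * wts[x2]
                y1, y0 = risk(1, x1), risk(0, x2)
                plus += pair_w * y1 * (1.0 - y0)      # (Y1, Y2) = (1, 0)
                minus += pair_w * (1.0 - y1) * y0     # (Y1, Y2) = (0, 1)
        out[H] = (minus, 1.0 - plus - minus, plus)
    return out


px = {0: 0.3, 1: 0.3, 2: 0.4}                         # illustrative values
beta = (0.0, 1.0, 0.5, 0.5)                           # illustrative values
h = lambda x: x * x - x - 1                           # maps {0, 1} -> -1, 2 -> 1
print(benefit_given_h(px, beta, h, "X"))
print(benefit_given_h(px, beta, h, "H"))
```

For the level of $H$ that pools two covariate values, the two definitions give different triples, whereas they agree for the level that corresponds to a single covariate value; when the risks do not depend on $X$, they agree everywhere.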
To illustrate the difference in cfb for the two definitions of matched pairs, we evaluate the discriminative ability of treatment benefit predictor $h(x) = x^2 - x - 1$ with prediction $H \in \{-1, 1\}$. This benefit predictor will be applied to different source populations, which are determined as follows. Parameters $a$ and $b$ take values from the sequence that starts at 0 with increment 0.01. Intending to screen as many combinations of $a$ and $b$ as possible, we consider all that satisfy $0 < a + b < 1$ and obtain 498,501 combinations. For each combination, $\beta_0$, $\beta_x$, $\beta_t$ and $\beta_{xt}$ are each generated independently from the uniform distribution on the interval $(-5, 5)$.
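The size of the screening grid can be checked by direct enumeration. The sketch below counts pairs $(a, b)$ on an evenly spaced grid subject to $0 < a + b < 1$; the function name `count_grid` and the two grid variants are assumptions made for this check. Integer indices are used instead of floating-point values to avoid rounding issues at the boundary.

```python
def count_grid(steps, allow_zero):
    """Count (a, b) on a grid of spacing 1/steps with 0 < a + b < 1.

    a = i / steps and b = j / steps; `allow_zero` controls whether the
    grid includes the value 0. (Hypothetical helper for illustration.)
    """
    start = 0 if allow_zero else 1
    return sum(
        1
        for i in range(start, steps + 1)
        for j in range(start, steps + 1)
        if 0 < i + j < steps
    )


print(count_grid(100, allow_zero=True))     # spacing 0.01, zero included
print(count_grid(1000, allow_zero=False))   # spacing 0.001, strictly positive
```

In this enumeration, a spacing of 0.001 with strictly positive $a$ and $b$ gives exactly 498,501 combinations, whereas a spacing of 0.01 starting at 0 gives 5,049, so the reported count appears to correspond to the finer grid.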
With the same set of parameters, we obtain two distributions of $(B \mid H)$ for matching on $X$ versus on $H$. We then compare the corresponding cfb calculated from the closed-form expression shown in Appendix A. Figure 2 shows the difference between the two cfbs among all generated source populations. In most cases, the absolute differences were smaller than 0.05. However, note that in some cases, the difference could be as much as 0.2466. This simple setup demonstrates that different definitions of matched pairs can induce substantial differences in cfb for the same $h(\cdot)$ in the same source population. When dealing with multiple covariates and a complex function $h(\cdot)$, the choice of matched pairs definition can have a considerable impact. Furthermore, when continuous predictors are present, exact matching will need to be relaxed to some form of close matching. Limited computational resources may further affect the choice of matching definition, potentially leading to less efficient results.
Further, even when consistently using the same definition, say matching by $X$, cfb is also sensitive to the sampling scheme for constructing matched pairs at the population level. One method of forming a matched population is to draw two independent patients from a joint distribution of $(Y, T, X)$ with conditioning on $X_1 = X_2$, $T_1 = 1$ and $T_2 = 0$. However, this procedure can change the distribution of the covariate $X$ from that in the source population, which may not be desirable for inference purposes. Alternatively, we can first draw a treated patient from the distribution of $(Y, X \mid T = 1)$ and then sequentially select a control group patient with the same $X$. Under an RCT setting and with an infinite sample, the second procedure does not alter the distribution of $X$ (see Appendix E.1 for a detailed explanation). Thus, it is necessary to carefully define the sampling scheme for forming matched patient pairs to prevent altering the distribution of the covariate $X$ in the source population.
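The shift in the covariate distribution under the first scheme can be seen in one line. The sketch below assumes a discrete $X$ with $\pi_x = \Pr(X = x)$ in the source population and an RCT with $T \perp\!\!\!\perp X$; the notation $\pi_x$ is introduced here for illustration.

```latex
% Two independent draws, conditioned on X_1 = X_2 (and T_1 = 1, T_2 = 0):
\Pr(X_1 = x \mid X_1 = X_2,\, T_1 = 1,\, T_2 = 0)
  = \frac{\pi_x^2}{\sum_{x'} \pi_{x'}^2}
% Example: (\pi_0, \pi_1) = (0.2, 0.8) becomes (0.04, 0.64)/0.68, about (0.059, 0.941).
% Sequential scheme: \Pr(X_1 = x \mid T_1 = 1) = \pi_x, and X_2 = X_1 by construction.
```

Conditioning on equal covariates therefore overweights common covariate values, whereas the sequential scheme leaves the source distribution of $X$ unchanged.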

Discussion
In this work, we presented three fundamental problems of cfb through examples and theoretical developments. First, we showed that cfb is not a proper scoring rule. In particular, we found that the best possible predictor $h^*(x) = E[B \mid X = x]$ can result in a cfb that is lower than the cfb of a useless predictor based on chance prediction. Improperness is a grave concern as it can lead to misleading conclusions about the performance of predictors of treatment benefit.
Further, we showed that cfb is sensitive to the unestimable correlation between counterfactual outcomes conditional on covariates and is also sensitive to the definition of matched pairs. These issues are indeed interrelated. Under the counterfactual framework and RCT settings, van Klaveren et al. suggested using matched patient pairs to create the target distribution of $(B \mid H)$ from the distribution of $(Y \mid T, X)$ [5]. But this matched population generally contains no information on the conditional dependency. The original work on cfb acknowledges this and makes it clear that the counterfactual outcomes under two treatment arms are assumed to be independent conditional on covariates [19]. Clearly, such a strong assumption could be violated in many applications, resulting in a cfb that might not be faithful to reality.
To keep the arguments intuitive, we demonstrated the aforementioned limitations of cfb in the context of a single explanatory variable. Involving multiple covariates will not alleviate the problems explained above. Ultimately, multivariate benefit predictors generate a scalar predicted benefit which can be considered a single explanatory variable. On the other hand, multivariate benefit predictors pose additional challenges, particularly with regard to matching. In addition to the sensitivity of cfb to the definition of matched pairs, the ultimate approximation required (e.g., specifying a maximum acceptable distance between matched high-dimensional $X$) can further undermine the accuracy of cfb.
For evaluating the discriminatory performance of benefit predictors, there are alternative metrics that do not suffer from the issues identified in this work. One particular metric is the Concentration of Benefit ($C_b$) [6]. $C_b$ is directly related to the Gini index and is concerned with the dispersion of the distribution of $E(B \mid X)$. It differs from a rank-based metric like cfb in that it remains a proper scoring rule, is not sensitive to unobserved correlation among counterfactuals, and does not require matching for its estimation. As such, as long as there are no ties in predicted benefits, a sample will lead to an unequivocal value of $C_b$. On a broader scale, how to develop and validate models for treatment benefit prediction is a nascent area of research, and critically evaluating the theoretical foundations and empirical performance of existing metrics should parallel the quest for the development of new ones.

To illustrate the extension of the examples in Fig. 1, we shaded in black the probability triples that satisfy the four inequalities, as displayed in Fig. 6. We find that only a subset of the previously found $(B \mid X)$ distributions with $\mathrm{cfb}^* < 0.5$ can arise from a counterfactual starting point, with this subset having relatively larger cfb values. Particularly, within the shaded area, the maximum cfb is 0.5 and the minimum is 0.4830. The mean and median of cfb are close to the maximum, at 0.4961 and 0.4969 respectively. But the overriding point is that there are distributions of $(Y^{(0)}, Y^{(1)} \mid X)$ which yield $\mathrm{cfb}^* < 0.5$.
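One step further, suppose $\Pr(Y^{(0)} = 1 \mid X)$ and $\Pr(Y^{(1)} = 1 \mid X)$ are described by logistic regressions:

$$\mathrm{logit}\,\Pr(Y^{(0)} = 1 \mid X = x) = \beta_0 + \beta_x x, \qquad \mathrm{logit}\,\Pr(Y^{(1)} = 1 \mid X = x) = (\beta_0 + \beta_t) + (\beta_x + \beta_{xt})\,x.$$

Each distribution of $(Y^{(0)}, Y^{(1)} \mid X)$ can then be mapped to a set of logistic regression model parameters that give $\mathrm{cfb}^* < 0.5$. For each pair of probability triples satisfying both (2) and (3), there exists a set of logistic regression parameters, $\{\beta_0, \beta_x, \beta_t, \beta_{xt}\}$, that yields $\mathrm{cfb}^* < 0.5$ for binary $X$. Specifically, the parameters are

$$\beta_0 = \mathrm{logit}\,\Pr(Y^{(0)} = 1 \mid X = 0), \qquad \beta_x = \mathrm{logit}\,\Pr(Y^{(0)} = 1 \mid X = 1) - \beta_0,$$
$$\beta_t = \mathrm{logit}\,\Pr(Y^{(1)} = 1 \mid X = 0) - \beta_0, \qquad \beta_{xt} = \mathrm{logit}\,\Pr(Y^{(1)} = 1 \mid X = 1) - \beta_0 - \beta_x - \beta_t.$$

We have demonstrated that the improper scenarios involving a continuous $X$ can be constructed based on the scenarios with a binary $X$. The same reasoning and process can be used to find a distribution of $(Y^{(0)}, Y^{(1)}, X)$ yielding $\mathrm{cfb}^* < 0.5$ with a continuous variable $X$.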

Appendix D: Correlation between counterfactual outcomes
Consider a continuous benefit $B$ with continuous variables $Y^{(0)}$ and $Y^{(1)}$. Suppose $X$ is generated from a standard normal distribution, and counterfactual outcomes are characterized by linear functions:

$$Y^{(0)} = \beta_0 + \beta_x X + \varepsilon_0, \qquad Y^{(1)} = (\beta_0 + \beta_t) + (\beta_x + \beta_{xt}) X + \varepsilon_1,$$

where $\varepsilon_0$ and $\varepsilon_1$ are the unobserved random terms, with correlation coefficient $\rho$.

Fig. 5 The probability $\Pr(B = b \mid X = x)$ and $h^*(X)$ for $(p_{-1}, p_0, p_{+1}) = (0.54, 0.37, 0.09)$ and $(q_{-1}, q_0, q_{+1}) = (0.68, 0.09, 0.31)$

Here, the set $S$ contains all possible $x \in \mathcal{X}$ such that $h(x) = H$, and the set $B_X$ consists of all possible $(y_1, y_2)$ pairs that satisfy $y_1 - y_2 = B$. If matched pairs are matched by $H$, it is possible for patients in a matched pair to have different values of the covariate $X$. Specifically, a matched population is created by first selecting a treated patient with $(Y_1 = y_1, T_1 = 1, X_1 = x_1)$ where $h(x_1) = H$. Then, another patient is selected with $(Y_2 = y_2, T_2 = 0, X_2 = x_2)$ where $h(x_2) = h(x_1) = H$. Thus, the distribution of $(B \mid H)$ can be expressed in terms of a set $B_H$ that, similarly, consists of all $(y_1, y_2)$ pairs for which the equivalence holds. When $h(\cdot)$ is an invertible function with domain $\mathcal{X}$ and codomain $\mathcal{H}$, matching on $X$ is equivalent to matching on $H$, as the two joint distributions referred to as the two matched populations are the same.
