 Methodology
 Open access
 Published:
Using ordinal outcomes to construct and select biomarker combinations for singlelevel prediction
Diagnostic and Prognostic Research volume 2, Article number: 8 (2018)
Abstract
Background
Biomarker studies may involve an ordinal outcome, such as no, mild, or severe disease. There is often interest in predicting one particular level of the outcome due to its clinical significance.
Methods
A simple approach to constructing biomarker combinations in this context involves dichotomizing the outcome and using a binary logistic regression model. We assessed whether more sophisticated methods offer advantages over this simple approach. It is often necessary to select among several candidate biomarker combinations. One strategy involves selecting a combination based on its ability to predict the outcome level of interest. We propose an algorithm that leverages the ordinal outcome to inform combination selection. We apply this algorithm to data from a study of acute kidney injury after cardiac surgery, where kidney injury may be absent, mild, or severe.
Results
Using more sophisticated modeling approaches to construct combinations provided gains over the simple binary logistic regression approach in specific settings. In the examples considered, the proposed algorithm for combination selection tended to reduce the impact of bias due to selection and to provide combinations with improved performance.
Conclusions
Methods that utilize the ordinal nature of the outcome in the construction and/or selection of biomarker combinations have the potential to yield better combinations.
Background
In some clinical settings, a patient can experience one of several outcomes. For example, he can have no, mild, or severe disease. In the setting of cancer diagnosis, a patient can be diseasefree or have cancer at one of several grades. However, it may be most important to be able to predict one particular level of the outcome, typically the level that poses the greatest health threat. In the examples just given, this may be severe disease or the presence of highgrade cancer. Here, investigators are interested in “singlelevel prediction,” but an ordinal outcome is available. Ordinal outcomes are polytomous, or multilevel, outcomes whose levels can be ordered by, for example, their clinical significance. In contrast, nominal outcomes are polytomous outcomes whose levels cannot be ordered. The question becomes whether and how the information from the ordinal outcome can be leveraged to improve prediction of the outcome level of interest.
It is becoming increasingly common for studies to measure several biomarkers in each participant. Such studies often seek to develop a combination of biomarkers that can be used in risk prediction. Developing a biomarker combination involves fitting, or constructing, a combination and, if more than one combination is available, selecting from among the candidates. When an ordinal outcome is available, the development of such combinations becomes potentially more complicated; we consider two such complications.
The first complication relates to the construction of biomarker combinations, specifically, how the biomarkers should be combined when an ordinal outcome is available but there is interest in singlelevel prediction. A natural and common approach is to dichotomize the outcome and fit a binary logistic regression model. Of course, this discards some information available in the ordinal outcome. We evaluate the potential benefits of alternative regression methods that utilize the ordinal outcome.
The second complication concerns how a biomarker combination should be selected. In many studies, the number of candidate biomarker combinations is quite large. Investigators may consider, for example, all possible pairs of biomarkers. In a study with 20 biomarkers, there would be nearly 200 such candidate combinations. One strategy is to choose the combination with the best performance in terms of singlelevel prediction. As with combination construction, it may be possible to leverage the additional information in the ordinal outcome to aid in combination selection. We propose an algorithm for doing so and provide examples to illustrate the benefits of this method.
We illustrate the application of our combination selection algorithm to data from the Translational Research Investigating Biomarker Endpoints in Acute Kidney Injury (TRIBEAKI) study, a study of acute kidney injury (AKI) after cardiac surgery [1]. This study aims to use biomarkers measured immediately after surgery to provide an earlier diagnosis of AKI, which is typically not diagnosed until several days after surgery. Clinical definitions of AKI include both mild and severe types, though severe AKI is of primary clinical interest due to its impact on longterm morbidity and mortality [2]. As a result, there is interest in developing a biomarker combination to diagnose severe AKI.
Constructing combinations
We consider a set of predictors X and an outcome D with K levels.
Models for binary outcomes
One set of regressionbased approaches involves treating the outcome as binary by first dichotomizing D and/or subsetting the data and subsequently fitting one or more binary logistic regression models [3–7]. These approaches include:

(i)
Simple. One binary logistic model based on dichotomizing D at some fixed level, k^{′}: logit{P(D≤k^{′}X=x)}=α+β^{T}x [8–12].

(ii)
Each level vs. others.K binary logistic models comparing each level to the combination of the other levels: [13, 14].

(iii)
Each level vs. reference. (K−1) binary logistic models comparing each level to a reference level k^{∗}: [15, 16].

(iv)
Sequential. (K−1) binary logistic models comparing each level to the combination of the levels above it: etc. [13].
Models for nominal outcomes
The baselinecategory logit model is a very flexible approach that considers the categorical nature of the outcome but does not incorporate the ordering [7, 17, 18]. The baselinecategory logit model, typically referred to as the “multinomial model,” can be written as
k=1,…,K−1 [17]. Thus, the baselinecategory logit model allows the effect of the predictors to vary with the level of the outcome [17]. The set of models specified by the each level vs. reference approach (defined in the “Models for binary outcomes” section) with k^{∗}=K as the reference level is parametrically equivalent to the baselinecategory logit model.
Models for ordinal outcomes
Several regression models are available that fully model D (i.e., do not combine different levels of the outcome together) while accounting for the ordered, categorical nature of D [7, 17]. Such ordinal methods do not assume equal spacing between the levels of D; they simply use the ordering of the levels of D [19].
Cumulative logit model
The cumulative logit model can be written as
k=1,…,K−1, where the α_{ k } are ordered in k [17]. Under model (1), the log cumulative odds ratio is proportional to the distance between the predictor values being compared and the proportionality constant does not depend on k [17]:
As a result of this proportionality (sometimes referred to as the parallel slopes assumption), model (1) is also called the proportional odds model [17]. It is possible to include a separate β vector, β_{ k }, for each value of k [4, 17, 20, 21]. However, doing so may lead to crossing of the cumulative probability curves P(D≤kX=x) for some values of X, violating the ordering of the cumulative probabilities [17, 20]. In other words, for a given value x, P(D≤1X=x) may exceed P(D≤2X=x), for example, which is not valid. Importantly, the estimates provided by the cumulative logit model may be biased under casecontrol sampling [12, 22].
Adjacentcategory logit model
The adjacentcategory logit model can be written as
k=1,…,K−1 [17]. The adjacentcategory logit model can be used with data from casecontrol studies [17]. The set of logits produced by the adjacentcategory logit model is equivalent to that produced by the baselinecategory logit model (defined in the “Models for nominal outcomes” section), except that the adjacentcategory logit model assumes a common β [17]. Thus, the adjacentcategory logit model takes advantage of the ordinal outcome to achieve parsimony but does not involve cumulative probabilities [17, 20]. The adjacentcategory logit is more natural when there is interest in describing the effect of the predictor in terms of the odds relating to particular outcome levels since it easily allows the comparison of any two levels of the outcome [20]. Importantly, the issue discussed above for the cumulative logit model with separate effects β_{ k } (i.e., potential crossing of the cumulative probability curves, which violates the ordering of the cumulative probabilities) is not a problem for the adjacentcategory logit model with separate effects since this model does not involve cumulative probabilities [17].
Continuationratio logit model
The continuationratio logit model can be written as
k=1,…,K−1, where the α_{ k } are ordered in k [17]. The continuationratio logit model may be useful when a sequential mechanism determines the outcome, i.e., when individuals have to “pass through” one level of the outcome to get to the next [17, 19]. The continuationratio logit model considers conditional probabilities as opposed to cumulative probabilities [19]. As with the cumulative logit model, the continuationratio logit model restricts the regression coefficients to be the same for all levels of k [19, 21, 23]. Allowing a separate β vector for each levels of k gives the sequential approach defined in the “Models for binary outcomes” section [4, 19]. The estimates provided by the continuationratio logit model may be biased under casecontrol sampling [12].
Stereotype model
The stereotype model was proposed by Anderson [24] as a sort of compromise between models that incorporate the ordinality of the outcome and more flexible models (i.e., the baselinecategory logit model defined in the “Models for nominal outcomes” section, which allows the entire coefficient vector to vary with k). The stereotype model actually includes a hierarchy of models that vary in flexibility, as defined by the dimension of the model [24]. The dimension of the model can range from one to the maximum dimension d, which is related to the number of predictors and outcome levels [24]. A stereotype model of maximum dimension is a reparameterization of the baselinecategory logit model (defined in the “Models for nominal outcomes” section) [25]. While different dimensions are possible, the term “stereotype model” is generally reserved for the onedimensional model and we focus on that model here. This model can be written as
k=1,…,K−1 [24]. Essentially, this model allows some variation in the coefficient vector, but restricts β_{ k }=ϕ_{ k }β [24]. Identifiability constraints must be imposed on the ϕ_{ k }; typically, these are ϕ_{1}=0 and ϕ_{ K }=1 [24]. The definition of the stereotype model also typically includes the requirement that ϕ_{1}<ϕ_{2}<…<ϕ_{ K }; when this holds, the model given in (2) is an ordered model [24]. However, it has been noted that this ordering does not need to be specified a priori and most statistical packages (e.g., R and Stata) do not impose such a restriction [4, 25]. In his examples, Anderson [24] did not assume ordering among the ϕ_{ k }; rather, he fit the onedimensional model and evaluated whether the estimates \(\hat {\phi }_{k}\) were ordered. Thus, Anderson [24] recommended fitting a fairly flexible model and assessing whether the data suggest ordering. In other words, the model allows users to judge whether the outcome levels are ordered or not based on the estimates \(\hat {\phi }_{k}\), giving a datadriven analysis of ordering [23]. The stereotype model can be used with casecontrol data [12].
Comparing modeling approaches
Some work has been done to compare the models described above, particularly in terms of the efficiency of the parameter estimators. Briefly, gains in efficiency have been found for the baselinecategory logit model relative to the each level vs. reference approach [15, 16], for the cumulative logit model relative to the simple approach [4, 6, 9], and for the stereotype model relative to the baselinecategory logit model [26]. These results suggest that information can be gained from using all of the data in a single model, not dichotomizing D, and/or incorporating the ordinal nature of D.
Armstrong and Sloan concluded that in general, if the order of the outcome levels can be specified with confidence, models for ordinal outcomes, such as the cumulative logit model or the continuationratio logit model, are preferable to more flexible models [4]. In other words, it is reasonable to expect that when the outcome is ordinal, information is gained when this ordinality is used by the model [12]. In comparing the cumulative logit model to the simple approach, Risselada et al. argued that the impact of “mild violations” of the proportional odds assumption is expected to be less severe than the loss of information resulting from dichotomizing D [7]. On the other hand, others have noted that ordinal models become “increasingly unrealistic” as the number of outcome levels and/or predictors increases [9, 26].
Applications in risk prediction
Polytomous outcomes are frequently encountered in the risk prediction setting and a common approach is to dichotomize the outcome and fit a binary logistic regression model (the simple approach defined above) [3, 7, 14, 27]. The literature on using polytomous outcomes for singlelevel prediction has largely focused on the area under the receiver operating characteristic (ROC) curve (AUC) as a measure of predictive capacity. Briefly, the AUC assesses the ability of a model to discriminate between individuals who have or will experience the outcome level of interest and those who do not have or will not experience the outcome level of interest; the AUC for a model that is able to perfectly separate these groups is 1, while the AUC for a useless model is 0.5 [28].
The previous work in this area has primarily involved using individual datasets to compare modeling strategies. Biesheuvel et al. compared the baselinecategory logit model to the sequential approach and found fairly similar AUCs for both strategies [14]. Roukema et al. compared the baselinecategory logit model, the sequential approach, and the each level vs. others strategy [13]. They found similar discriminatory power for all three strategies, though they employed variable selection procedures for all of the models, making comparisons difficult [13]. Harrell et al. note that models can exhibit lack of fit and yet still provide quite accurate predicted probabilities [29], which may explain why several studies have found similar results in terms of the AUC when comparing different modeling approaches. In particular, it may be the case that a given model fits well for some predictors and does not fit as well for others, but when the coefficient estimates are combined to calculate predicted probabilities, the result is a fairly accurate estimate.
Combination selection
Often, a number of candidate biomarker combinations are available, and some form of selection is required. When the goal is to use a biomarker combination for risk prediction, it seems appropriate to select combinations based on predictive capacity. For a binary outcome, one possibility is to use the AUC (e.g., [30]). That is, the AUC for each candidate combination is estimated, and the combination with the highest AUC is chosen.
Two challenges arise in utilizing this approach. The first is that when the same data are used to construct a biomarker combination and estimate the AUC (or other measure of performance) for that combination, the resulting AUC estimate will be optimistic relative to the AUC for the same fitted combination in independent data; we refer to this as “resubstitution bias” [31]. Methods such as bootstrapping can be used to correct the apparent AUC estimate [19].
An additional challenge applies to selection more generally. If many models are considered and a model is selected on the basis of some estimated measure of performance, that estimated measure of performance will be optimistically biased even if it is corrected for resubstitution bias; we refer to this as “model selection bias” [31]. This idea has been explored in the bioinformatics/machine learning literature, where estimates of the classification error rate are often used to select a model. Broadly, it has been found that the estimated error rate for a model selected on the basis of its favorable error rate will be optimistic relative to the same model’s error rate in independent data [32–38]. Cawley and Talbot call this issue “overfitting the model selection criterion” [34]. In general, when some form of model selection is done and the performance of the chosen model is evaluated without accounting for the selection, that is, treating the selected model as though it were prespecified, optimistic bias is expected [19, 35, 39–41].
Methods
For an outcome D with K levels, “singlelevel prediction” relates to predicting D=K. In other words, we are interested in developing a biomarker combination that can differentiate D=K from D<K. Thus, we focus on measures of performance that evaluate the model in this regard. However, this begs the question of whether the ordered, multilevel nature of the outcome can be used in constructing and/or selecting a biomarker combination.
Methods for constructing combinations
We have described several regressionbased approaches to modeling polytomous outcomes. In particular, we can dichotomize the outcome and/or subset the data and use one of the four binary strategies, we can treat the outcome as ordered and use one of the four ordinal approaches, or we can use the more flexible baselinecategory logit model. Using a binary strategy requires combining several levels of the outcome together or fitting several models to subsets of the data. Likewise, the ordinal models restrict the nature of the relationship between the biomarkers and the outcome so as to achieve parsimony. The baselinecategory logit model, on the other hand, imposes no such restrictions and includes all of the data in a single model; of course, this comes at the cost of having to estimate additional parameters. We use simulations to evaluate the impact of these modeling choices on the performance of the resulting estimated combinations. Our focus in this investigation is whether more sophisticated modeling approaches can offer improvements in performance in terms of singlelevel prediction over the simple approach, that is, a single binary logistic regression model.
Since we are interested in predicting D=K, we dichotomized D at k^{′}=K−1 in the simple approach defined in the “Models for binary outcomes” section, giving logit{P(D≤K−1)}. Furthermore, for the purposes of predicting D=K, the simple approach and the each level vs. others strategy are identical, so the latter was not considered further. Finally, as the baselinecategory logit model is parametrically equivalent to the each level vs. reference approach defined in the “Models for binary outcomes” section with reference level k^{∗}=K and the former is generally more efficient than the latter, we did not include the each level vs. reference strategy in our investigation. Thus, we considered seven different modeling strategies: the simple approach (“Simple”), the sequential strategy (“Sequential”), the cumulative logit model (“CumLogit”), the adjacentcategory logit model (“AdjCatLogit”), the continuationratio logit model (“ContRatLogit”), the stereotype model (“Stereo”), and the baselinecategory logit model (“BaselineCat”).
We considered two broad simulation scenarios. In the first scenario, the biomarkers were simulated such that the cumulative logit model with proportional odds did not hold; in the second scenario, the data were simulated under the cumulative logit model where the assumption of proportional odds held. In both scenarios, we considered two biomarkers, X=(X_{1},X_{2}). We considered outcomes with either 3 or 5 levels, that is, K=3 or K=5. The combinations were constructed using training data with 200, 400, 800, or 1600 observations and evaluated in test data with 10^{4} observations. The training set sizes reflect sample sizes often encountered in biomarker studies. We used large test sets to provide reliable estimates of model performance. We simulated data such that P(D=1)=0.1 or 0.5 and P(D=K)=0.05 or 0.3; when K=5, P(D=2), P(D=3), and P(D=4) were equal. Thus, we considered scenarios where the outcome level K, the target of prediction, was rare and scenarios where it was common.
We used each of the modeling strategies to fit a linear combination of the biomarkers X in the training data, yielding estimates \(\hat {\boldsymbol {\beta }}\). We then applied these estimates to the test data to determine . Finally, we assessed the ability of to discriminate between D=K and D<K in the test data via the AUC. In the AKI example given above, this is the AUC for severe AKI vs. no or mild AKI. As we are interested in singlelevel prediction, this measure is the most relevant metric by which to compare the methods.
In the simulations where the cumulative logit model with proportional odds did not hold (the first scenario mentioned above), the biomarkers had conditional bivariate normal distributions. In particular, for K = 3, we considered (XD = 1)∼N(0,Σ), (XD = 2) ∼ N(μ,Σ), and (XD = 3)∼N(2,Σ), and for K = 5, we considered (XD = 1)∼N(0,Σ), (XD = 2)∼N(0.5,Σ), (XD = 3)∼N(1,Σ), (XD = 4)∼N(μ,Σ), and (XD = 5)∼N(2,Σ). We used μ∈{−1,0,1,2,3} and Σ = 2I_{2}, where I_{2} is the twodimensional identity matrix. Other covariance matrices (including those with correlation between the biomarkers and unequal covariance matrices) were explored; details are given in (Additional file 1: Section S1.1). The parameter μ determines whether the biomarker means are ordered by D. For K = 3, when μ is between 0 and 2, the biomarker means can be ordered by D. The same is true for K = 5 when μ is between 1 and 2. When the biomarker means were not ordered by D, we anticipated that some of the ordinal and/or binary methods would not perform well; these scenarios were included to investigate situations where the expected ordering of an outcome is not borne out by the data. We considered conditional bivariate normal distributions in these simulations because they yield the baselinecategory logit model.
To evaluate data generated by the cumulative logit model with proportional odds (second scenario), we simulated two independent normal biomarkers, both with mean 1 and variance 0.25. The outcome was then simulated as a multinomial random variable, where the success probabilities of the K levels were determined by α_{ k }−β^{⊤}X such that the cumulative logit model held. Three sets of coefficients β were considered (β=(1,2),(1,1.5),(1,−1)) and values of α_{ k } were chosen such that the desired prevalences (given above) were achieved in a large dataset. We chose to use normally distributed biomarkers in this simulation as, in our experience, many biomarkers are approximately normally distributed after a log transformation.
The simulations were repeated 1000 times and are described in Table 1.
Methods for combination selection
As above, we suppose that for an outcome D with K levels, “singlelevel prediction” relates to predicting D=K.
As with combination construction, the presence of an ordinal outcome requires that decisions about how to select a biomarker combination be made. One strategy is to simply estimate the AUC for D=K vs. D<K (including correcting this estimate for resubstitution bias due to any model fitting) and select the combination with the highest estimated AUC. As discussed above, the estimated AUC for this selected combination will be optimistically biased relative to the AUC for the same fitted combination in independent data due to model selection bias. In other words, because of model selection bias, the estimated AUC for the selected combination, chosen precisely because of its high estimated AUC, will be optimistically biased compared to the performance of this fitted combination in external data. We propose an alternative strategy where combination selection is done on the basis of not only the AUC for D=K vs. D<K, but also the AUC for D=K−1 vs. D<K−1, the AUC for D=K−2 vs. D<K−2, and so on.
We anticipate that in some settings, the estimated AUC for D=K vs. D<K for the combination selected in this way will be less affected by model selection bias and so this combination may be preferred. In particular, if some of the same biomarkers are associated with multiple levels of the outcome, our proposed method could offer improvements over the standard approach. Furthermore, we expect our approach to be useful when many biomarkers have modest associations with the outcome and the candidate combinations include subsets of these biomarkers. For example, if two biomarkers have very strong associations with D=K vs. D<K (i.e., the biomarkers for D=K are very different from those for D<K) and the remainder have much weaker associations, if we consider all biomarker pairs, the AUC for D=K vs. D<K for the pair consisting of the two strongly associated biomarkers is expected to be much larger than this AUC for the other combinations and it is unlikely this difference is due entirely to model selection bias. Thus, in this scenario, the “standard” approach of selecting a combination on the basis of the estimated AUC for D=K vs. D<K would be expected to yield the best combination in terms of the true AUC for D=K vs. D<K.
More precisely, for K=3, we define our algorithm (including constructing combinations and estimating their performance) as follows.

(1)
In the training data, dichotomize D at D=3 vs. D<3 and construct all candidate biomarker combinations using binary logistic regression. That is, to construct a candidate combination involving biomarkers X, fit logit{P(D=3X=x)}=θ_{0}+θ^{⊤}x in the training data.

(2)
Based on the fitted combinations from (1), e.g., \(\hat {\theta }_{0} + \hat {\boldsymbol {\theta }}^{\top }\mathbf {x}\), estimate (i) the AUC for D=3 vs. D<3 and (ii) the AUC for D=2 vs. D=1 in the training data.

(3)
Generate B bootstrap samples from the training data.

(a)
In each bootstrap sample, dichotomize D at D=3 vs. D<3 and construct all candidate biomarker combinations using binary logistic regression.

(b)
For each of the fitted combinations from (a), estimate (i) the AUC for D=3 vs. D<3 and (ii) the AUC for D=2 vs. D=1 in both the bootstrap sample and the training data.

(c)
Estimate the resubstitution bias as the average difference between the AUC in the bootstrap sample and the AUC in the training data across the B samples.

(a)

(4)
Correct the estimated AUCs from (2) using the estimated bias from (3c).

(5)
Determine the ranks for each of the two sets of corrected AUCs from (4) across all fitted biomarker combinations. The “standard” approach involves choosing the combination with the best AUC for D=3 vs. D<3. The “new” approach involves choosing the combination with the best sum of ranks for the two AUCs.

(6)
Apply the two chosen combinations to independent test data and estimate the AUC for D=3 vs. D<3 for each of the two selected combinations from (5). The estimated model selection bias is the difference between the AUCs in the test data and the AUCs from (4).
In practice, test data may not be available, so it may not be possible to complete step (6). An R package including code to implement this method, multiselect, is available on CRAN.
We used simulations to investigate the potential benefits of the proposed method. We considered five examples as a proof of concept; these are not intended to be exhaustive. In the first two examples, the cumulative logit model with proportional odds held, while in the other three, it did not. Throughout the simulations, K=3, there were p=30 biomarkers, and we considered the set of candidate combinations to be all possible pairs of these biomarkers, constructed via binary logistic regression. We used B=50 bootstrap replicates, a training set of 400 observations, and a test set of 10^{4} observations. We repeated the simulations 500 times. Our choice of p represents a study with a modest, but not large, number of biomarkers. As with the combination construction simulations, we chose a training set size similar to what might be encountered in a cohort study and a test set size that would yield reliable AUC estimates. We chose B=50 to provide computational efficiency without sacrificing performance. In these simulations, we considered all biomarker pairs to be candidate combinations. However, the number of biomarkers to be selected may not be fixed in practice. Our method is expected to would work equally well if the set of candidate combinations was all combinations of, say, between two and five biomarkers.
In Example 1, we had X∼N(1,2Γ), where X was a vector of dimension 30 and Γ was a 30×30 matrix where the diagonal elements were 1 and the offdiagonal elements were 0.3. The linear predictor was β^{⊤}X, where β_{1} = 1,β_{2}=2,β_{3}=…=β_{16}=0.5,β_{17}=…=β_{30}=0.1. The outcome was simulated under the cumulative logit model such that P(D=1)=0.6, P(D=2)=0.3, and P(D=3)=0.1 in a large dataset. Example 2 was identical to Example 1, except that P(D=2)=0.335 and P(D=3)=0.065.
In Example 3, we had P(D=1)=0.6, P(D=2)=0.335, and P(D=3)=0.065. Additionally, (XD=1)∼N(0,2Γ),(XD=2)∼N(β^{(2)},2Γ), and (XD=3)∼N(β^{(3)},2Γ) where X was a vector of dimension 30, Γ was as defined above for Example 1, and \(\beta ^{(2)}_{1} = 1.5, \beta ^{(2)}_{2} = 1, \beta ^{(2)}_{3} = \ldots = \beta ^{(2)}_{16} = 0.5, \beta ^{(2)}_{17} = \ldots = \beta ^{(2)}_{30} = 0.1, \beta ^{(3)}_{1} = \beta ^{(3)}_{2} = 2, \beta ^{(3)}_{3} = \ldots = \beta ^{(3)}_{16} = 0.8\), and \(\beta ^{(3)}_{17} = \ldots = \beta ^{(3)}_{30} = 0.1.\) Example 4 was identical to Example 3, except that \(\beta ^{(2)}_{1} = 1\) and \(\beta ^{(3)}_{17} = \ldots = \beta ^{(3)}_{30} = 0.2.\) Finally, Example 5 was identical to Example 3, except that \(\beta ^{(2)}_{1} = 1\), \(\beta ^{(2)}_{17} = \ldots = \beta ^{(2)}_{30} = 0\), and \(\beta ^{(3)}_{17} = \ldots = \beta ^{(3)}_{30} = 0.2\).
These examples reflect a range of scenarios, including situations where D=3 is common and situations where it is relatively rare. The biomarkers in these examples are moderately correlated, as might be expected in practice. We considered scenarios where two biomarkers had stronger effects on D and the remaining biomarkers either had more modest effects or were not associated with some levels of D. In Examples 1 and 2, “stronger effects” mean β values farther from zero, which translate into odds ratios (where the odds correspond to P(D≤k)/ P(D>k)) that are farther from one. For Examples 3–5, “stronger effects” mean β^{(k)} farther from zero, which translate into biomarker means for D=k that are more different from those for D=1.
Results
Results for constructing combinations
First we consider the scenario where the cumulative logit model with proportional odds did not hold. We present the results for a training set size of 400; the results for the other sample sizes were similar. Here, we focus on the results for P(D=K)=0.05 and provide the full results in (Additional file 1: Section S1.1).
Table 2 presents the results for K=3 and Table 3 presents the results for K=5. We see that when μ=−1, μ=0, or μ=1, the simple approach was comparable to or better than the other approaches. When μ=2, the simple approach did slightly worse than some of the ordinal approaches, particularly for K=3. For μ=3, the sequential approach, the stereotype model, and/or the baselinecategory logit model offered some gains over the simple approach. In sum, when the cumulative logit model with proportional odds did not hold but there was some ordering in the outcome by the biomarkers (that is, μ<2), the simple approach did well, but when μ≥2, some of the alternative approaches demonstrated improved performance. This is expected, since when μ<2, the simple approach is able to separate D<K from D=K, and so performs well. Furthermore, as indicated above, it is not surprising that when μ was extreme, many of the ordinal and binary approaches did not perform well.
While our main goal was to see if another regression method could yield improvements over the simple approach, it is instructive to note some of the patterns for the other methods. For instance, when K=3 and μ=3, the cumulative logit, adjacentcategory logit, and continuationratio logit models performed poorly relative to some of the other methods. This is not surprising as these methods are most appropriate when there is some ordering of the outcome, but when μ=3, the mean of the biomarkers for D=2 exceeds the mean of the biomarkers for D=3. This had more of an impact for K=3 than for K=5, since for K=5, four out of the five outcome levels were still ordered and so there seems to have been less of an effect on the models’ performance.
Similarly, when K=3, P(D=1)=0.5, and μ=–1, the cumulative logit, adjacentcategory logit, and continuationratio logit models again did poorly compared to some of the alternatives. However, we did not see this when P(D=1)=0.1. This is because when P(D=1)=0.5, P(D=2)=0.45, but when P(D=1)=0.1, P(D=2)=0.85. Thus, when P(D=1)=0.1, the D=2 level essentially overwhelmed D=1, so the fact that D=2 is “out of order” had less of an impact on model fitting. On the other hand, when P(D=1)=0.5 and P(D=2)=0.45, both D=1 and D=2, and their lack of ordering relative to D=3, influenced model fitting. Likewise, when K=5, P(D=1)=0.1, and μ=–1, the cumulative logit, adjacentcategory logit, and continuationratio logit models did not perform well. A similar argument applies here, but in reverse: when P(D=1)=0.1, P(D=2)=P(D=3)=P(D=4)≈0.28, and when P(D=1)=0.5, P(D=2)=P(D=3)=P(D=4)=0.15. Thus, when P(D=1)=0.5, the “out of order” level, D=4, was overwhelmed by the other (ordered) levels, but when P(D=1)=0.1, P(D=4) is considerably higher, and the fact that D=4 is “out of order” had a larger influence on model fitting.
Finally, we note that in these situations, where the cumulative logit, adjacentcategory logit, and continuationratio logit models performed poorly, the other ordinal approach, the stereotype model, offered enough flexibility to provide models with good performance.
When the cumulative logit model with proportional odds held, the performance was comparable across the approaches considered for a training set size of 400 (Additional file 1: Section S1.2); similar patterns were seen for other sample sizes. Thus, even when the data were generated by an ordinal model, the simple approach did well in terms of the predictive capacity of the fitted combinations.
For small to moderate sample sizes, several of the approaches had issues with convergence. When the training set had 200 observations, the simple approach failed to converge in up to 3.1% of simulations, the sequential approach failed to converge in up to 38% of simulations, the stereotype model failed to converge in up to 2.6% of simulations, and the baselinecategory logit model failed to converge in up to 1.4% of simulations. For training data with 400 observations, the sequential approach failed to converge in up to 7% of simulations. The proportion of convergence failures was below 0.2% for all methods for larger sample sizes.
Results for combination selection
Table 4 presents the results for Examples 1 and 4 for the proposed combination selection method. The results for Examples 2, 3, and 5 show similar patterns; the full results are presented in (Additional file 1: Section S2). The results in Table 4 demonstrate some benefit to using the additional information available in the ordinal outcome to select a biomarker combination for singlelevel prediction, both in terms of the degree of model selection bias and the ability of the chosen combination to discriminate D=3 from D<3 in independent test data.
In all of the examples considered, the first two biomarkers (X_{1} and X_{2}) had larger coefficients than the other biomarkers. This suggests that the pair (X_{1},X_{2}) may offer better performance than the other candidate biomarker pairs. To explore this, we investigated how often this pair was chosen by each method. In Example 1, the pair (X_{1},X_{2}) was chosen in 19.2% of simulations by the standard approach and in 41.4% of simulations by the new approach. For Example 2, these numbers were 13.4 and 34.8%, respectively. In Example 3, they were 27.4 and 57.6%, and for Example 4, they were 44 and 82.4%. Finally, for Example 5, they were 50.2 and 79.8%.
There were no issues with the logistic regression model failing to converge in the Example 1 simulations, eight simulations (out of 500) had convergence issues in Example 2, and one simulation had convergence issues in each of Examples 3, 4, and 5.
Application to TRIBEAKI
We applied our proposed method for combination selection to data from the TRIBEAKI study. As noted above, the outcome in this study, AKI, is an ordinal outcome as patients may be diagnosed with no, mild, or severe AKI. Furthermore, of the biomarkers measured in the study, it is believed that only a subset are likely to be useful for early diagnosis. Thus, we considered all possible pairs of 14 biomarkers measured in the study.
In the TRIBEAKI study, severe AKI is defined as a doubling of serum creatinine over preoperative levels or the need for dialysis during the hospital stay and mild AKI is defined as an increase in serum creatinine of 50%. The TRIBEAKI study is a multicenter study, but we restricted attention to the largest center in order to avoid issues related to center differences. We used the biomarker measurements taken immediately after surgery and removed observations missing any of these measurements. This left 465 observations (61 with mild AKI and 30 with severe AKI). We also logtransformed the biomarker measurements. As in the simulations, we applied our proposed method with 50 bootstrap replications.
The results for the ten best combinations in terms of the AUC for severe vs. no/mild AKI are given in Table 5. The combination with the highest AUC for severe vs. no/mild AKI, which would be selected by the standard approach, includes urine interleukin18 (IL18) and plasma NterminalproBtype natriuretic peptide (NTproBNP). The estimated AUCs (corrected for resubstitution bias) for this combination were 0.8575 for severe vs. no/mild AKI and 0.6125 for mild vs. no AKI. The combination with the highest combined rank for the AUC for severe vs. no/mild AKI and the AUC for mild vs. no AKI, which would be selected by the proposed method, included plasma hearttype fatty acid binding protein (hFABP) and plasma interleukin6 (IL6). The estimated AUCs (corrected for resubstitution bias) for this combination were 0.8365 for severe vs. no/mild AKI and 0.6757 for mild vs. no AKI. Thus, the AUC for severe vs. no/mild AKI for this second combination is slightly lower, but the AUC for mild vs. no AKI is substantially higher. It may be reasonable to expect that the estimated AUC for severe vs. no/mild AKI for the second combination (0.8365) is less affected by model selection bias than is the estimated AUC for severe vs. no/mild AKI for the first combination (0.8575), which may motivate choosing to validate the second combination instead of the first.
Discussion
When there is interest in developing biomarker combinations for singlelevel prediction of an ordinal outcome, common practice is to dichotomize the outcome for combination construction and selection. We have considered whether the information in an ordinal outcome can be leveraged in the development of biomarker combinations for singlelevel prediction.
In the context of constructing biomarker combinations, we used simulations to compare seven regressionbased approaches: two binary approaches, four ordinal approaches, and one nominal approach. We considered a variety of datagenerating scenarios and found that when some separation in the biomarker distributions between D=K and D<K existed (i.e., μ<2 in our first simulation scenario) or when the cumulative logit model with proportional odds held, the simple approach based on dichotomizing the outcome tended to work well in terms of the ability of the resulting combinations to predict D=K.
We have also proposed a method that utilizes the ordinal nature of the outcome in selecting a biomarker combination, as opposed to selecting a combination based solely on its ability to predict the targeted level. Simulations provide evidence that use of the proposed method may result in less model selection bias and could lead to selecting combinations with greater predictive capacity. We applied this method to data from the TRIBEAKI study, where we demonstrated how the method could be used to select a combination in practice. This approach is expected to be most useful when there is some ordering in the biomarkers by the levels of D. It is important to study this method further in order to fully elucidate the settings in which it could be beneficial.
In using this method for selection, it is generally informative to look at the results for the candidate combinations, as we have done in Table 5 for the top ten pairs in the TRIBEAKI study. If there is a clear “winner” in terms of the AUC for D=3 vs. D<3, that is, if this AUC is substantially higher for one candidate combination, it is probably reasonable to select that combination, regardless of the AUC for D=2 vs. D=1. This is because it is unlikely that such a markedly higher AUC estimate is due to model selection bias. On the other hand, if several combinations have fairly similar performance in terms of the AUC for D=3 vs. D<3, it may be worth using the AUC for D=2 vs. D=1 to aid in selection. One possible extension of this method could involve using a weighted average of ranks for the two AUCs, rather than the sum; additionally, using a weighted average of the AUC values themselves (as opposed to their ranks) could be considered.
The proposed method for selection is most appropriate when a modest number of biomarkers is available. When the number of candidate biomarkers is very large, some form of preselection may be required, for example, using pvalues or some other measure of the univariate association between a biomarker and the outcome. However, univariate measures of association may not reflect a biomarker’s potential to improve prediction when combined with other predictors [42], so preselection is best avoided when feasible [43].
In some settings, a polytomous outcome arises from the categorization of a continuous outcome; in such instances, methods for continuous data may yield better performance. Our work is primarily concerned with situations where the outcome is not based on the categorization of a continuous variable, such as the AKI and cancer grade examples.
Several measures of performance for polytomous outcomes have been proposed, including the polytomous discrimination index [44], the ordinal cindex [45], and the hypervolume under the manifold [46], each of which generalizes the AUC from binary to polytomous outcomes. These measures summarize the ability of a combination to predict all levels of the outcome, and so are not wellsuited to the work discussed here, which pertains to the situation where predicting a particular level of the outcome is the primary goal. Given this goal, we examined the AUC for D=K vs. D<K in constructing combinations. Additionally, while our selection algorithm uses information from multiple AUCs, we also recommend considering the magnitude of the AUC for D=K vs. D<K in selecting a combination. As noted above, a combination with a much higher AUC for D=K vs. D<K (relative to the other candidate combinations) should be preferred regardless of its ability to predict the other levels. Finally, we note that we have focused on the AUC as a measure of performance and a tool for combination selection. Alternative measures of discrimination may also be useful and could yield different results.
Conclusions
When an ordinal outcome is available and there is interest in using biomarker combinations to predict a single level of the outcome, the common approach of dichotomizing the outcome to construct and/or select a combination necessarily discards some information. We have considered the utility of leveraging this information to advance the goal of singlelevel prediction.
Abbreviations
 AKI:

Acute kidney injury
 AUC:

Area under the receiver operating characteristic curve
 hFABP:

hearttype fatty acid binding protein
 IL6:

interleukin6
 IL18:

interleukin18
 NTproBNP:

NterminalproBtype natriuretic peptide
 ROC:

Receiver operating characteristic
 TRIBEAKI:

Translational research investigating biomarker endpoints in acute kidney injury
References
Parikh CR, Coca SG, ThiessenPhilbrook H, Shlipak MG, Koyner JL, Wang Z, et al. Postoperative biomarkers predict acute kidney injury and poor outcomes after adult cardiac surgery. J Am Soc Nephrol. 2011; 22(9):1748–57.
Coca SG, Singanamala S, Parikh CR. Chronic kidney disease after acute kidney injury: a systematic review and metaanalysis. Kidney Int. 2012; 81(5):442–8.
Steyerberg EW. Clinical Prediction Models: a Practical Approach to Development, Validation, and Updating. New York: Springer; 2008.
Armstrong BG, Sloan M. Ordinal regression models for epidemiologic data. Am J Epidemiol. 1989; 129(1):191–204.
Bartfay E, Donner A, Klar N. Testing the equality of twin correlations with multinomial outcomes. Ann Hum Genet. 1999; 63(4):341–9.
Maas AIR, Steyerberg EW, Marmarou A, McHugh GS, Lingsma HF, Butcher I, et al.Impact recommendations for improving the design and analysis of clinical trials in moderate to severe traumatic brain injury. Neurotherapeutics. 2010; 7(1):127–34.
Risselada R, Lingsma HF, Molyneux AJ, Kerr RSC, Yarnold J, Sneade M, et al.Prediction of two month modified rankin scale with an ordinal prediction model in patients with aneurysmal subarachnoid haemorrhage. BMC Med Res Methodol. 2010; 10(1):86.
Manor O, Matthews S, Power C. Dichotomous or categorical response? Analysing selfrated health and lifetime social class. Int J Epidemiol. 2000; 29(1):149–57.
McHugh GS, Butcher I, Steyerberg EW, Marmarou A, Lu J, Lingsma HF, et al.A simulation study evaluating approaches to the analysis of ordinal outcome data in randomized controlled trials in traumatic brain injury: results from the impact project. Clin Trials. 2010; 7(1):44–57.
Norris CM, Ghali WA, Saunders LD, Brant R, Galbraith D, Faris P, et al.Ordinal regression model and the linear regression model were superior to the logistic regression models. J Clin Epidemiol. 2006; 59(5):448–56.
Roozenbeek B, Lingsma HF, Perel P, Edwards P, Roberts I, Murray GD, et al.The added value of ordinal analysis in clinical trials: an example in traumatic brain injury. Crit Care. 2011; 15(3):127.
Scott SC, Goldberg MS, Mayo NE. Statistical assessment of ordinal outcomes in comparative studies. J Clin Epidemiol. 1997; 50(1):45–55.
Roukema J, van Loenhout RB, Steyerberg EW, Moons KGM, Bleeker SE, Moll HE. Polytomous regression did not outperform dichotomous logistic regression in diagnosing serious bacterial infections in febrile children. J Clin Epidemiol. 2008; 61(2):135–41.
Biesheuvel CJ, Vergouwe Y, Steyerberg EW, Grobbee DE, Moons KGM. Polytomous logistic regression analysis could be applied more often in diagnostic research. J Clin Epidemiol. 2008; 61(2):125–34.
Begg CB, Gray R. Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika. 1984; 71(1):11–18.
Bull SB, Donner A. A characterization of the efficiency of individualized logistic regressions. Can J Stat. 1993; 21(1):71–8.
Agresti A. Categorical Data Analysis, 3rd. New Jersey: John Wiley & Sons; 2013.
Bender R, Grouven U. Using binary logistic regression models for ordinal data with nonproportional odds. J Clin Epidemiol. 1998; 51(10):809–16.
Harrell FE. Regression Modeling Strategies: with Applications to Linear Models, Logistic Regression, and Survival Analysis, 2nd. New York: Springer; 2015.
Liu I, Agresti A. The analysis of ordered categorical data: an overview and a survey of recent developments. Test. 2005; 14(1):1–73.
Ananth CV, Kleinbaum DG. Regression models for ordinal responses: a review of methods and applications. Int J Epidemiol. 1997; 26(6):1323–33.
Strömberg U. Collapsing ordered outcome categories: a note of concern. Am J Epidemiol. 1996; 144(4):421–4.
Feldmann U, Steudel I. Methods of ordinal classification applied to medical scoring systems. Stat Med. 2000; 19(4):575–86.
Anderson JA. Regression and ordered categorical variables. J R Stat Soc Series B Stat Methodol. 1984; 46(1):1–30.
Lunt M. Prediction of ordinal outcomes when the association between predictors and outcome differs between outcome levels. Stat Med. 2005; 24(9):1357–69.
Campbell MK, Donner A. Classification efficiency of multinomial logistic regression relative to ordinal logistic regression. J Am Stat Assoc. 1989; 84(406):587–91.
Van Calster B, Valentin L, van Holsbeke C, Testa AC, Bourne T, van Huffel S, et al.Polytomous diagnosis of ovarian tumors as benign, borderline, primary invasive or metastatic: development and validation of standard and kernelbased risk prediction models. BMC Med Res Methodol. 2010; 10(1):96.
Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. United Kingdom: Oxford University Press; 2003.
Harrell FE, Margolis PA, Gove S, Mason KE, Mulholland EK, Lehmann D, et al.Development of a clinical prediction model for an ordinal outcome: the world health organization multicentre study of clinical signs and etiological agents of pneumonia, sepsis and meningitis in young infants. Stat Med. 1998; 17(8):909–44.
Gevaert O, De Smet F, Timmerman D, Moreau Y, De Moor B. Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics. 2006; 22(14):184–90.
Kerr KF, Meisner A, ThiessenPhilbrook H, Coca SG, Parikh CR. Rigor: reporting guidelines to address common sources of bias in risk model development. Biomark Res. 2015; 3(1):2.
Bernau C, Augustin T, Boulesteix AL. Correcting the optimal resamplingbased error rate by estimating the error rate of wrapper algorithms. Biometrics. 2013; 69(3):693–702.
Boulesteix AL, Strobl C. Optimal classifier selection and negative bias in error rate estimation: an empirical study on highdimensional prediction. BMC Med Res Methodol. 2009; 9(1):85.
Cawley GC, Talbot NLC. On overfitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res. 2010; 11:2079–107.
Chatfield C. Model uncertainty, data mining and statistical inference. J R Stat Soc Ser A Stat Soc. 1995; 158(3):419–66.
Ding Y, Tang S, Liao SG, Jia J, Oesterreich S, Lin Y, et al.Bias correction for selecting the minimalerror classifier from many machine learning models. Bioinformatics. 2014; 30(22):3152–8.
Jelizarow M, Guillemot V, Tenenhaus A, Strimmer K, Boulesteix AL. Overoptimism in bioinformatics: an illustration. Bioinformatics. 2010; 26(16):1990–8.
Varma S, Simon R. Bias in error estimation when using crossvalidation for model selection. BMC Bioinformatics. 2006; 7(1):91.
Lukacs PM, Burnham KP, Anderson DR. Model selection bias and Freedman‘s paradox. Ann Inst Stat Math. 2010; 62(1):117–25.
Steyerberg EW, Bleeker SE, Moll HA, Grobbee DE, Moons KGM. Internal and external validation of predictive models: a simulation study of bias and precision in small samples. J Clin Epidemiol. 2003; 56(5):441–7.
Ye J. On measuring and correcting the effects of data mining and model selection. J Am Stat Assoc. 1998; 93(441):120–31.
Kerr KF, Bansal A, Pepe MS. Further insight into the incremental value of new markers: the interpretation of performance measures and the importance of clinical context. Am J Epidemiol. 2012; 176(6):482–7.
Bansal A, Pepe MS. When does combining markers improve classification performance and what are implications for practice?Stat Med. 2013; 32(11):1877–92.
Van Calster B, van Belle V, Vergouwe Y, Timmerman D, van Huffel S, Steyerberg EW. Extending the cstatistic to nominal polytomous outcomes: the polytomous discrimination index. Stat Med. 2012; 31(23):2610–26.
Van Calster B, van Belle V, Vergouwe Y, Steyerberg EW. Discrimination ability of prediction models for ordinal outcomes: relationships between existing measures and a new measure. Biometrical J. 2012; 54(5):674–85.
Li J, Fine JP. ROC analysis with multiple classes and multiple tests: methodology and its application in microarray studies. Biostat. 2008; 9(3):566–76.
Acknowledgements
The authors wish to acknowledge the TRIBEAKI study investigators: Steven G. Coca (Department of Internal Medicine, Icahn School of Medicine at Mount Sinai, New York, New York), Amit X. Garg (Institute for Clinical Evaluative Sciences Western, London, Ontario, Canada; Division of Nephrology, Department of Medicine, and Department of Epidemiology and Biostatistics, University of Western Ontario, London, Ontario, Canada), Jay Koyner (Section of Nephrology, Department of Medicine, University of Chicago Pritzker School of Medicine, Chicago, Illinois), and Michael Shlipak (Kidney Health Research Collaborative, San Francisco Veterans Affairs Medical Center, University of California, San Francisco, San Francisco, California). Assays used in the TRIBEAKI study were donated by Abbott Diagnostics, Randox Laboratories, and Beckman Coulter.
Funding
This work was supported by the National Institutes of Health (F31 DK108356, R01 HL085757, K24 DK090203). The opinions, results, and conclusions reported in this article are those of the authors and are independent of the funding sources.
Availability of data and materials
The dataset analyzed in this work is not publicly available as such widespread sharing of TRIBEAKI study data was not stipulated in the ethics approval for the study. An R package with code to implement the combination selection method described in this paper, multiselect, is publicly available on CRAN.
Author information
Authors and Affiliations
Contributions
AM, CRP, and KFK conceived and designed the study. AM conducted the analyses and drafted the paper. AM, CRP, and KFK edited and finalized the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
All participants provided written informed consent and each institution’s research ethics board approved the study. The study was approved by the Yale University Human Investigation Committee (HIC Protocol Number: 0603001221). Ethics approval was also obtained from the other five centers included in the TRIBEAKI study; however, the details of these approvals cannot be given as the identities of these centers were removed from the data per the TRIBEAKI study’s data sharing protocol.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
An additional file (“Additional File 1.pdf”) contains Sections S1 and S2. Section S1 contains results for simulations comparing methods for constructing combinations when the cumulative logit model with proportional odds did not hold (Section S1.1) and when the cumulative logit model with proportional odds held (Section S1.2). Section S2 contains results for simulations comparing methods for combination selection. (PDF 3152 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Meisner, A., Parikh, C. & Kerr, K. Using ordinal outcomes to construct and select biomarker combinations for singlelevel prediction. Diagn Progn Res 2, 8 (2018). https://doi.org/10.1186/s4151201800283
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s4151201800283