 Methodology
 Open Access
 Published:
Combined performance of screening and variable selection methods in ultrahigh dimensional data in predicting timetoevent outcomes
Diagnostic and Prognostic Research volume 2, Article number: 21 (2018)
Abstract
Background
Building prognostic models of clinical outcomes is an increasingly important research task and will remain a vital area in genomic medicine. Prognostic models of clinical outcomes are usually built and validated utilizing variable selection methods and machine learning tools. The challenges, however, in ultrahigh dimensional space are not only to reduce the dimensionality of the data, but also to retain the important variables which predict the outcome. Screening approaches, such as the sure independence screening (SIS), iterative SIS (ISIS), and principled SIS (PSIS), have been developed to overcome the challenge of high dimensionality. We are interested in identifying important singlenucleotide polymorphisms (SNPs) and integrating them into a validated prognostic model of overall survival in patients with metastatic prostate cancer. While the abovementioned variable selection approaches have theoretical justification in selecting SNPs, the comparison and the performance of these combined methods in predicting timetoevent outcomes have not been previously studied in ultrahigh dimensional space with hundreds of thousands of variables.
Methods
We conducted a series of simulations to compare the performance of different combinations of variable selection approaches and classification trees, such as the least absolute shrinkage and selection operator (LASSO), adaptive least absolute shrinkage and selection operator (ALASSO), and random survival forest (RSF), in ultrahigh dimensional setting data for the purpose of developing prognostic models for a timetoevent outcome that is subject to censoring. The variable selection methods were evaluated for discrimination (Harrell’s concordance statistic), calibration, and overall performance. In addition, we applied these approaches to 498,081 SNPs from 623 Caucasian patients with prostate cancer.
Results
When n = 300, ISISLASSO and ISISALASSO chose all the informative variables which resulted in the highest Harrell’s cindex (> 0.80). On the other hand, with a small sample size (n = 150), ALASSO performed better than any other combinations as demonstrated by the highest cindex and/or overall performance, although there was evidence of overfitting. In analyzing the prostate cancer data, ISISALASSO, SISLASSO, and SISALASSO combinations achieved the highest discrimination with cindex of 0.67.
Conclusions
Choosing the appropriate variable selection method for training a model is a critical step in developing a robust prognostic model. Based on the simulation studies, the effective use of ALASSO or a combination of methods, such as ISISLASSO and ISISALASSO, allows both for the development of prognostic models with high predictive accuracy and a low risk of overfitting assuming moderate sample sizes.
Background
The proportional hazards (PH) model [1, 2] has been widely used for predicting timetoevent outcomes. The partial likelihood estimation of the PH model, however, is not appropriate for exploring the simultaneous relationship of the high dimensional variables with outcomes. For this reason, variable selection approaches, such as the least absolute shrinkage and selection operator (LASSO) [3], adaptive LASSO (ALASSO) [4], and a specific extension of random survival forests (RSF) [5], have been widely used to select the informative variables in a high dimensional setting [6,7,8,9]. In addition, LASSO, ALASSO, and RSF have been extended to timetoevent endpoints that are subject to censoring. Although these methods are capable of reducing the number of variables in high dimensionality, Fan et al. [10] and Zhao and Li [11] proposed methods, such as the sure independence screening (SIS) [10], the iterative SIS (ISIS) [10], and the principled SIS (PSIS) [11], to expedite computing time and improve estimation accuracy in a ultrahigh dimensional setting. Furthermore, Fan et al. define ultrahigh dimensionality by the exponential growth of the dimensionality in the sample size (that is, log(p) = O(n^{a}) for some a ∈ (0,0.5)) [9].
The SIS or PSIS aims to reduce the ultrahigh dimensional space to a manageable subset which encompasses the important variables. We define important variables as those in which the true coefficients are not equal to zero marginally on the survival outcomes, assuming a linear association between the loghazard function and the variables in the proportional hazard model. SIS or PSIS can select important variables, just as LASSO or ALASSO. When SIS or PSIS is combined with LASSO or ALASSO, however, SIS or PSIS is applied prior to running LASSO or ALASSO in the same way as employed in [9]. To emphasize the processing order among the methods, we broadly classify these methods into two groups: screening approaches which follow the genomic literature (SIS and PSIS) and variable selection methods (LASSO and ALASSO). Fan et al. [9, 10] theoretically proved that SIS is highly likely to choose all the important variables and showed that SIS and its variants are more efficient in excluding unimportant variables than LASSO alone.
We chose LASSO, ALASSO, and RSF due to their wide application in genomic medicine in selecting informative variables in high dimensional settings. Fan et al. [9, 10] showed that LASSO selects too many uninformative variables and thus the authors proposed to combine SIS with LASSO to efficiently exclude uninformative variables while keeping all the informative ones. Thus, the concept of implementing SIS, ISIS, and PSIS in high dimensional space before running ALASSO and RSF was innovative [9,10,11]. Moreover, statisticians who are not familiar with SIS, ISIS, and PSIS may want to consider using these variable methods as alternatives to select important variables and ultimately develop prognostic models with higher predictive accuracy in high dimensional space.
We are interested in identifying important germline singlenucleotide polymorphisms (SNPs) and integrating them into a validated prognostic model of overall survival in patients with metastatic prostate cancer. The data included 498,081 SNPs processed from blood samples from 623 Caucasian patients with prostate cancer after passing quality control. While the abovementioned variable selection approaches have theoretical justification in selecting important SNPs, the comparison and the performance of these combined methods have not been studied in ultrahigh dimensional space with timetoevent endpoints. In other words, it is unknown if the combined variable selection methods would perform well in a context of an ultrahigh number of variables. Our primary goal is to compare the performance of the methods in predicting a timetoevent outcome via a simulation study with 100,000 (100 k) variables. There is a plethora of variable selection methods, but it is not always apparent which approach is best to use in identifying important variables in ultrahigh dimensional space. Thus, another purpose of this paper is to provide applied statisticians with a basic understanding of these combined methods and to make recommendations on how best to use them.
The rest of this paper is organized as follows. We provide an overview of the widely used variable selection methods in the “Methods” section while offering a concise discussion of their advantages and disadvantages. In the “Results” section, we describe the design of the simulation studies and the comparison among the four single variable selection methods and the nine combinations of the variable selection methods. An ultrahigh dimensional example is used where the data are analyzed using a combination of methods to identify SNPs that will predict overall survival time for metastatic prostate cancer patients (“Discussion” section). The results of the simulations and recommendations are discussed in the “Conclusion” section.
Methods
Let p be the number of variables. A linear association between the loghazard function and the pdimensional variables is assumed in the PH model as follows:
where h(t x) is a hazard function of timetoevent t, provided that each of the p variables, x = (x_{1}, x_{2}, …, x_{p} )^{′}, is a vector with sample size n, for the x_{i} = (x_{i1}, x_{i2}, …, x_{in} )^{′} for i = 1, …, p. The parameters β = (β_{1}, β_{2}, …, β_{p})^{′} and h_{0}(t) are an arbitrary parameterfree baseline hazard function. We assume that the important variables are associated with the timetoevent based on (1). We consider the variable selection methods in order to only choose important variables of clinical outcomes, as presented in Fig. 1.
Screening approaches
We refer to SIS, ISIS, and PSIS as screening or variable selection approaches, and these words are used interchangeably. Fan et al. [10] introduced the SIS to decrease the number of variables p from the ultrahigh to a reduced subset size m for timetoevent outcomes. They choose m as ≤⌊n/ log(n)⌋ or ⌊n/(4 log(n))⌋ depending on the different variants of SIS. The former is known as “aggressive” SIS, and the latter is named “vanilla” SIS. We consider the aggressive SIS since its performance is shown to be more effective in high dimensional setting [10]. The aggressive SIS starts with randomly splitting the sample into two partitions, n_{1} and n_{2}, then SIS computes the marginal correlation between a single variable and the survival outcome within each partition. This is called marginal utility and is denoted as u_{k, 1} and u_{k, 2} for k = 1, …, p. The utility is obtained by maximizing the partial likelihood of each variable in the following way:
in which δ_{i} is the censoring indicator, x_{ik} is the kth element among the p variables, and R(y_{i}) is the risk set right before the time y_{i}, i.e., R(y_{i}) = {j; y_{j} ≥ y_{i}}. By ordering all variables in increasing order of their corresponding marginal utilities, u_{(1)} < u_{(2)} < … < u_{(p − 1)} < u_{(p)}, one selects the top s \( \left(=\left\lfloor \frac{n}{\log (n)}\right\rfloor \right) \) ranked variables for each partition. The details are presented in Fig. 2. Let I_{1} and I_{2} be the index sets of the selected s variables and are expressed as
where γ_{s, 1} and γ_{s, 2} are the cutoff values depending on s for each partition. The asymptotic probability that I_{1} ∩ I_{2} contains all the important variables is 1. This is the reason why this method is called sure screening.
The SIS is, however, unable to select important variables that are weakly and marginally associated with rightcensored outcomes. Furthermore, I_{1} ∩ I_{2} is most likely to retain unimportant variables that are highly correlated with the important variables [10]. To reduce the false negative and false positive rates, Fan et al. [10] also proposed ISIS and its aggressive variant. The iterative SIS begins similarly to SIS in the first iteration and derives the first subset, O_{1} containing variables selected from a variable selection method. Let m_{1} = O_{1} denote the first subset size. The second iteration repeats SIS but with the (p − m_{1}) variables and calculates the conditional utility for each partition given m_{1} variables. The selection method is again applied to the union set of m variables and O_{1} from which the second subset O_{2} is driven. The process runs iteratively until O_{j − 1} = O_{j}, for some j which is the number of iterations or until a prespecified number of iterations has been reached.
Despite the high probability for ISIS to select the important variables, there is no theoretical justification for choosing the cutoff values γ_{s, 1} and γ_{s, 2}. The lack of theoretical explanation motivates Zhao and Li [11] to develop the principled SIS for timetoevent endpoints. Furthermore, Zhao and Li provide the mathematical reasoning for the cutoff value γ_{m} by using the standard normal distribution and hence controlling the false positive rate. The PSIS requires the maximum partial likelihood estimates (MPLE) of the coefficients to be written as:
Once the false positive rate q_{m} is chosen, the cutoff value γ_{m} will be the quantile of the standard normal distribution, i.e., γ_{m} = Φ^{−1}(1 − q_{m}/2), assuming that the asymptotic null distribution of \( {I}_k{\left({\widehat{\beta}}_k\right)}^{1/2}{\widehat{\beta}}_k \) is N(0, 1), where I_{k} is the information matrix. Hence, the m variables are chosen in the following manner:
Variable selection methods
It is possible that the subset filtered by SIS or PSIS includes not only the important p^{∗} variables but also several of the unimportant (p − p^{∗}) variables. To remove the falsely selected variables, we employ some of the most widely used penalized regression methods in current clinical research, such as the LASSO and the ALASSO. Tibshirani [3] proposed LASSO with the L1 penalty while taking the survival analysis framework into consideration. The efficient computational algorithm is conducted in Simon et al. [12] by maximizing a scaled log partial likelihood of the PH model. The penalized maximum likelihood estimator is given by
where
Note that the R library glmnet scales the log partial likelihood by a factor of 2/n for convenience. LASSO evaluates (6) with a scalar α = 1 and a matrix W = I_{p} in which I_{p} is the pdimensional identity matrix in (7). ALASSO identically plugs in α = 1, but instead, a diagonal matrix, W, with \( 1/\mid \overset{\sim }{\beta}\mid \) as the diagonal entries in which \( \overset{\sim }{\beta }={\left({\overset{\sim }{\beta}}_1,{\overset{\sim }{\beta}}_2,\dots, {\overset{\sim }{\beta}}_p\right)}^{\prime } \) is a vector containing the MPLE of the log partial likelihood l(β). The coefficient estimates are obtained corresponding to the value of λ that minimizes the mean crossvalidated partial likelihood error in this paper. To be specific, 10fold crossvalidation was used. Only variables with nonzero coefficient estimates are finally selected by the penalized algorithm.
Unlike the penalized regression approaches, the random survival forest (RSF) is a nonparametric method and an extension of Breiman’s random forests [13] for analyzing rightcensored outcomes. These methods are known for their ability to easily deal with nonlinear effects, correlated variables, and variable interactions [13]. We implement minimal depth thresholding, which stems from the idea of observing the most important variables that are located on the upper nodes in a tree. Ishwaran et al. [14] provide the null distribution of the minimal depth of a maximal subtree that is given by
where D_{v} is the minimal depth for a given variable v, D(S) is the depth of the tree S, l_{d} is the number of nodes at depth d, and \( {L}_d=\sum \limits_{i=0}^{d1}{l}_i \). According to Ishwaran et al. [14], a variable v is chosen if its forest averaged minimal depth, D_{v} = d_{v}, is less than or equal to the mean minimal depth of D_{v} under the null distribution of (8). Thus, we build final models with the variables meeting the (mean) minimal depth thresholding.
Simulation studies
We conducted a series of simulations to compare the performance of the combined variable selection approaches in the ultrahigh dimensional setting within the survival framework. We used single variable selection methods as references and assumed a linear relationship between the loghazard function and the significant variables via the nonzero constant coefficients. In addition, we assumed that the baseline hazard function had a value of 1. We considered p^{∗} to be six because most articles in the context of the ultrahigh dimensional data assumed six or less important variables in their simulation studies as these methods are computationally intensive [10, 15, 16]. Another reason for considering a small number of important variables is that many genomewide association (GWAS) analysis identified only a few SNPs that are clinically important [17,18,19,20].
We limited p to 100 k, each variable having three classes since they were similar to the SNP genetic data that served as our motivation example. The random samples were generated assuming that the variables are mutually independent and identically distributed (IID). The IID variables were randomly and independently sampled from a multinomial distribution being specified with probabilities for the three classes, (0, 1, 2) as ((1 − q)^{2}, 2q(1 − q), q^{2}), respectively, which followed the HardyWeinberg equilibrium (HWE) [21, 22]. In the context of genetic data, q is the minor (risk) allele frequency (MAF) of a SNP and was assumed to represent 0.15 of SNPs. The MAF refers to the frequency of the allele with frequency no more than 50% in a population [22]. The three probabilities were formed given that the pairs of alleles were known to be independent, and hence, there was no deviation from the HWE.
The true regression coefficient values β were randomly chosen from (−1)^{u}(a+ z ), where u was sampled from a Bernoulli distribution having parameter 0.5, z generated from the standard Gaussian distribution. We followed Fan et al. [9] and set a = 1 and a = 2 for weak and strong signal strength, respectively. We considered two scenarios for the true values of β.
Strong signal:
Weak signal:
The true coefficients were multipliers of the first six variables, and hence, the other 99,994 variables had zero coefficients. Failure times were assumed following a Weibull distribution. The hazard function h_{w}(t x) was the function of the variables from a Weibull distribution, and it is expressed as
where τ was a shape parameter of the Weibull distribution. The failure times were randomly drawn from the Weibull distribution assuming τ=1.5. Also, we set the inverse of exp{τ(x_{1}β_{1} + x_{2}β_{2} + x_{3}β_{3} + x_{4}β_{4} + x_{5}β_{5} + x_{6}β_{6})} as the scale parameter.
The censoring rate C was specified as 20% across the simulation studies. Censored times were assumed to follow a uniform distribution on [0, θ_{c}]. For each specific censoring rate, θ_{c} was determined using the NewtonRaphson iteration (Appendix (A3) and (A5) of [23]). The failure time, T_{i}, for i = 1, …, n, was randomly generated from (9), whereas the censoring times C_{i}, for i = 1, …, n was randomly sampled from uniform distribution. Then, the observed failure times and censoring indicators were t_{i} = min(T_{i}, C_{i}) and δ_{i} = I(T_{i} ≤ C_{i}), respectively. We considered sample sizes of 150 and 300.
Because of the intensive computing time (100 k variables), we generated 300 data sets for each simulation scenario. Single variable selection approaches (SIS, PSIS, LASSO, and ALASSO) were examined as references. In addition, nine combinations of the variable selection methods were considered. These are ISISLASSO, SISLASSO, PSISLASSO, ISISALASSO, SISALASSO, PSISALASSO, ISISRSF, SISRSF, and PSISRSF. It is important to note that it is not possible to run ISIS alone as it works simultaneously with SIS in combination with a variable selection method. We sought to run RSF by itself; however, it had insufficient memory. We used the default maximum number of iteration (five) from the SIS library in R version 0.6 when running the ISIS combinations. To expedite computing time in running RSF, we set the number of trees in the forest and the maximum number of random split points to be 100 and 10, respectively, to increase computing speed. The split points split for a variable. For PSIS, we set the false positive rate at 0.001 (=q_{m} from (5)) over all simulation data sets as this will limit PSIS from selecting hundreds of variables in a reduced subset.
Prognostic models were developed based on the training datasets, whereas independent testing sets were generated from the underlying joint distribution of the variables. The testing sets had much larger sample sizes (n = 2000) than the training sets, as suggested by Hothorn et al. [24]. The above variable selection approaches were compared for their predictive ability using overall performance, discrimination, and calibration. For the overall performance, we picked up Graf et al. \( {R}_{BS}^2 \) [25], which assesses the percentage gain in predictive accuracy compared to the null model (model with no variables) at a single time point. In this paper, \( {R}_{BS}^2 \) was evaluated at 2 years because it was very close to the median overall survival time in our motivation example of prostate cancer. A positive value of \( {R}_{BS}^2 \) means that a prognostic model predicts better relative to the null model, whereas a negative value indicates that the model predicts poorly relative to the null model. Discrimination describes the ability of a prognostic model to distinguish between patients with and patients without the outcome of interest [26]. Discrimination was measured using Harrell’s cindex. A value of 0.5 represents random prediction whereas a value of 1 represents perfect discrimination [26]. Calibration refers to the agreement between the observed survival probability at 2 years versus the predicted probability at 2 years. Patients were divided into quartiles, and the average predicted survival probability was compared with the observed survival probability similar to how it is described by Harrell [26]. The calibration slope was estimated by regressing new survival outcomes on the predicted prognostic index [27]. Lastly, overfitting was evaluated by using the rms library in R using the validated command with 100 bootstrapped samples generated from the original sample.
The high computational burden was alleviated by using shared cluster computers at the University of Iowa Supercomputer Center. The LASSO, ALASSO, and RSF were executed in R using glmnet library version 2.05 and randomForestSRC version 2.1.0. We created R codes for PSIS according to the algorithms in [9] because there is no available library for PSIS. We employed R codes from the SIS library (version 0.6) and combined it with ALASSO and RSF in accordance with the techniques described in [28] since SIS can only implement SISLASSO or ISISLASSO. The R codes were written by the first author and are available on https://duke.box.com/s/q2wj7m4gxv0gnslaw91nooindjjat6cf.
Results
Overall performance
Table 1 shows the means and standard deviations of \( {R}_{BS}^2 \) values evaluated at 2 years out of the 300 simulations for the training sets. The highest \( {R}_{BS}^2(2) \) were observed for PSISLASSO and PSISALASSO when n = 150 regardless of the signal strength. For n = 300, and assuming a weak signal among the six variables, LASSO had the highest \( {R}_{BS}^2(2) \) which was 83% (Table 1). This implies that the final model containing the selected variables from LASSO had 83% gain relative to a null model in explaining the survival outcomes at 2 years. When n = 300, and assuming a strong signal among the six variables, PSISLASSO and PSIS had \( {R}_{BS}^2(2) \) above 85%, whereas LASSO had negative average \( {R}_{BS}^2(2) \), indicating that the average model size of 108 variables performed poorly compared to the null model.
We present the mean and standard deviation (SD) of the final model sizes in Table 2. We observed that PSIS, PSISLASSO, and PSISALASSO chose an excessive number of uninformative variables given that the number of informative variables was fixed at six. When n = 300, and assuming a strong signal among the six variables, the prediction models that employed LASSO performed poorly compared to the null model. When the sample size was 300 and a strong signal was assumed, the combinations with ISIS chose a small number of variables and their \( {R}_{BS}^2(2) \) values were almost 80% (Table 2).
Table 3 presents the mean number of important variables chosen in the final models over the 300 simulations in the training sets. In addition, we present the proportion of the number of unimportant variables selected in each model relative to the size of the final models. When n = 150, PSIS, PSISLASSO, PSISALASSO, LASSO, and ALASSO selected four or more important variables on average. These final models chose too many unimportant variables as the proportions of unimportant variables were greater than 85% (Table 3). On the other hand, when n = 300 and the signal strength was weak, ISISLASSO, ISISALASSO, and ISISRSF selected five or more important variables on average. These combinations however chose a small number of unimportant variables as the largest proportion was only 3.3% for ISISRSF (Table 3).
Overfitting and discrimination
We also assessed overfitting using the original samples (training sets). Table 4 presents the mean optimism and corrected cindex over 100 boostrapped samples when the sample size is 300. We observed the lowest mean optimism with SIS combination when the signal was weak, but these approaches led to the smallest corrected cindex. On the other hand, PSISLASSO had the highest mean optimism. LASSO, ALASSO, and the ISIS combinations had reasonable cindexes relative to the mean optimism (Table 4). The same pattern was observed when a strong signal among the important variables was assumed.
The mean and standard deviation values of C_{H} over the 300 simulations in the testing sets are presented in Table 5. We observed that when the signal is weak and n = 150, ALASSO had the highest cindex. On the other hand, prediction models from ISISLASSO, ISISALASSO, ISISRSF, and ALASSO had the highest average cindex when the sample size was 300 and the signal was weak. When the sample size was 150 and a strong signal was assumed, both LASSO and ALASSO had the highest cindex (Table 5). In contrast, when the sample size was large and a strong signal was assumed, all variable selection approaches had high cindices. The ISIS combinations had the highest cindex, followed by LASSO and ALASSOPSIS combinations, respectively.
Calibration
The mean and standard deviation of the calibration slopes for the testing sets are displayed in Table 6. Calibration slope identifies a linear association between the observed survival outcomes and the predicted prognostic index. When n = 300, the average calibration slopes from the prognostic models containing variables selected by ISISLASSO, ISISALASSO, and ISISRSF were close to a value of 1. This implies that these prognostic models were well calibrated. On the other hand, when n = 150, most variable selection approaches had calibration slopes considerably below 1, suggesting overfitting.
The models were also evaluated for their calibration by plotting the observed versus the predicted probabilities for the variable selection approaches that had the calibration slopes closet to 1. Figure 3a–d presents the calibration plots of the observed survival probability versus the predicted probability at 2 years. The observed survival probability was close to the predicted probability when n = 150 and the signal strength is weak (Fig. 3a). We detected similar patterns as the observed survival probability was close to the predicted probability for ISISALASSO when the sample size was 300 with a weak signal (Fig. 3b), for ISISLASSO when a strong signal among the variables is assumed and n = 150 (Fig. 3c) and for ISISLASSO when a strong signal among the variables is assumed n = 300 (Fig. 3d).
Realdata example
The variable selection methods (SIS, PSIS, LASSO, LASSO, RSF) and the nine combinations were used to identify SNPs that predict overall survival in men with metastatic prostate cancer who participated in a phase III clinical trial (CALGB 90401). Overall survival was defined as the date of death from date of random assignment. We confined our analysis to 623 genetically defined Caucasians men who participated in the GWAS study [29]. Deaths were observed in 94% of the 623 Caucasian men.
Among all the SNPs, 498,081 SNPs were selected after quality control utilizing GenABEL (R package): call rate < 99%, p value of HWE < 1e−08, MAF < 0.05, and removal of nonautosomal SNPs. The missing SNP values were simply replaced with the average of nonmissing SNP values. Although this may not be the optimal method for handling missing data, it is an effective ad hoc approach when the proportion of missing SNPs is relatively low. Eightyfive percent of the 498,081 SNPs that passed quality control were complete. The proportion of missing SNPs among 498,081 were 11.9% in 1 SNP, 2% had 2 SNPs, 0.5% had 3 SNPs, 0.2% had 4 SNPs missing, 0.10% had 5 SNPs missing, and 0.4% had more than 5 SNPs missing.
The dataset was randomly split into a 2:1 allocation ratio to training (n = 419) and testing (n = 204) sets, respectively. From the training set, important SNPs were selected using the different variable selection approaches. In all the proportional hazards models, SNPs were considered assuming additive models and were adjusted for risk score based on the predicted survival probability [30]. As expected, there was a slight indication of overfitting in the training set when the cindex from the original sample was compared to the bootstrapped samples. The corrected cindex is presented in Table 7. Three variable selection methods (ISISALASSO, SISLASSO, and SISALASSO) selected only the validated risk score. It is worth mentioning that for PSIS, PSISRSF, and LASSO, the corrected cindex could not be estimated due to the singularity problem among the predictors. The estimated coefficients of the selected SNPs from the training set were applied to the testing set, and we calculated the cindices and the 95% confidence intervals. As presented in Table 7, ISISALASSO, SISLASSO, and SISALASSO combinations achieved the highest cindex of 0.67.
Discussion
Advancement in laboratory technologies has led to ultrahigh dimensional data that are now being routinely captured in clinical studies. A critical element of genomic medicine is implementing validated prognostic models for identifying patients with cancer for clinical trial design and/or for optimal therapy. For example, the Decipher signature has been developed for predicting metastasis after radical prostatectomy in men with prostate cancer [31,32,33,34,35]. The Decipher score was developed using random forest and has been independently validated [32,33,34,35]. Another example is the wellknown oncotypeDx recurrence score that has been utilized to stratify randomization and to guide treatment in breast cancer patients in the TAILORx trial [36,37,38].
Prognostic models are usually built and validated utilizing common statistical methods and machine learning tools. The challenge in ultrahigh dimensional space is not only to reduce the dimensionality of the data, but to keep the important molecular variables in predicting the timetoevent endpoints. Another major challenge which is common in high dimensional data is overfitting, which is identified by the high accuracy for a prognostic model based on a training set, but a low accuracy is observed when the model is evaluated on an external validation dataset.
This article explores the feasibility of combining several variable selection methods in an ultrahigh dimensional setting for the purpose of developing prognostic models for timetoevent outcomes. To our knowledge, this is among the first articles that systematically compared the performance of ultrahigh dimensional screening with variable selection methods for timetoevent endpoints.
When the sample size was small (n = 150) and the signal strength among the variables was weak, ALASSO outperformed all other approaches having the highest cindex and the calibration slopes closest to 1. Assuming a moderate sample size (n = 300) and a strong signal among the six variables, the combinations with ISIS not only selected all the important variables, but also excluded the unimportant variables. When n = 300 and the signal strength is weak, the combinations with ISIS had similar discriminative abilities compared with ALASSO in terms of high average cindex. The calibration slope, however, for ALASSO (0.732) indicated overfitting as compared with the slope value for ISISALASSO (0.945). On the other hand, when n = 150 with a strong signal strength, ALASSO had the best performance. While the combinations with ISIS or SIS failed to select all the important variables, the combinations with PSIS chose too many unimportant variables.
Our results do not agree with Fan et al. [10], who demonstrated that SIS combined with LASSO is efficient in excluding unimportant variables rather than using only LASSO. This could be due to the fact that different performance measures (median and squared estimation errors) were reported [10]. Another possible explanation may be due to the fact that Fan et al. simulated correlated variables whereas we did not.
Other variable selection methods based on p values have also been used. One of these approaches is based on the false discovery rate (FDR). We have previously demonstrated the feasibility of the sequential use of FDR as a screening method with ISIS and other variable selection methods for predicting binary endpoints [39].
Turning back to our motivating example, the results from the prostate cancer data demonstrate the difficulty in choosing the important SNPs in predicting overall survival. The top three combinations, ISISALASSO, SISLASSO, and SISALASSO, outperformed other approaches, but they only selected the validated risk score [29]. The other combinations of variable selection methods had smaller cindices even if they included more SNPs in addition to the validated risk score. This highlights the importance of validating the selected SNPs. Although external validation is considered the gold standard, model developers may not have access to external data. Other validation procedures, such as bootstrapping, are acceptable approaches [40, 41].
There are some caveats in applying the results from our simulation studies to real data. First, our simulations start with randomly generating categorical variables of which each has an identical probability to select three classes: 0, 1, and 2. The identical probability assumption, however, may not hold in a real problem. The second limitation is due to the independently and identically distributed assumption in generating the variables. Actual SNPs may be correlated with each other, although in our example, we observed zero correlation among SNPs across genes and modest correlations among SNPs within the same genes. Thirdly, we fixed several parameters in the simulations, such as the number of important variables, sample size, MAF, and signal strength, and all of these factors may have an impact on the simulation studies. Lastly, we used the R codes of the COXvarISISscad function in SIS (version 0.6) which was available at the time of the simulation study to combine it with ALASSO and RSF. The results from the aggressive variant of ISISLASSO using the latest SIS library (version 0.86) may produce different results when the sample size is small and the signal strength is weak. Despite the above limitations, these simulations are valuable to modelers.
Another vital point to consider is the computational time for the combination variable selection methods as it depends on the sample size. For instance, running ISISALASSO and ALASSO for one training set with a sample size of 150 took 16 and 3 min, respectively. When n = 300, the processing time increased to 41 and 7 min for ISISALASSO and ALASSO, respectively. Furthermore, running 500,000 variables over 200 bootstrapped samples was computationally intensive, and for the ISISLASSO combination, it took 6 h simply to choose the important variables. Although ISIS is computationally intensive, the combination performed well as indicated in our results.
Based on our simulations, we provide the following guidelines as a tradeoff between predictive accuracy, calibration, and overfitting:

a)
When the sample size is small, and regardless of the strength of the signal among the variables, ALASSO performs well in selecting the important variables and achieving a high cindex.

b)
When the sample size is small and the signal among the covariate is strong, ALASSO again performs well in terms of having high cindex and high calibration slopes.

c)
When the sample size is large and a weak signal among the variables is present, ALASSO and the ISIS combinations had the highest performance and were comparable. The computing time, however, is much faster for ALASSO (six times faster than ISIS combination).

d)
When sample size is large and a strong signal is present, the ISIS combinations are the preferred approaches.
Conclusion
Building and validating prognostic models of clinical outcomes will remain an important research area in 21st genomic medicine. Choosing the appropriate variable selection method for training a model is a critical step in developing a robust prognostic model. Based on the simulation studies, the effective use of ALASSO or a combination of methods, such as ISISLASSO and ISISALASSO, allows both for the development of prognostic models with high predictive accuracy and a low risk of overfitting assuming moderate sample sizes.
Abbreviations
 ALASSO:

Adaptive LASSO
 AR:

Autocorrelation
 GWAS:

Genomewide association study
 IID:

Independent and identically distributed
 ISIS:

Iterative SIS
 LASSO:

Least absolute shrinkage and selection operator
 PSIS:

Principled SIS
 RSF:

Random survival forests
 SIS:

Sure independence screening
 SNP:

Singlenucleotide polymorphism
References
 1.
Cox DR. Regression models and lifetables (with discussion). J R Stat Soc Ser B. 1972;34:187–220.
 2.
Cox DR. Partial likelihood. Biometrika. 1975;62:269–76.
 3.
Tibshirani R. The LASSO method for variable selection in the Cox model. Stat Med. 1997;16:385–95.
 4.
Zhang HH, Lu W. Adaptive LASSO for Cox’s proportional hazards model. Biometrika. 2007;94:691–703. https://doi.org/10.1093/biomet/asm037.
 5.
Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2:841–60.
 6.
Cui C, Wang D. High dimensional data regression using LASSO model and neural networks with random weights. Inf Sci. 2016;372:505–17.
 7.
Ishwaran H, Kogalur UB, Chen X, et al. Random survival forests for highdimensional data. Stat Anal Data Min. 2011;4:115–32.
 8.
Huang J, Ma S, Zhang CH. Adaptive LASSO for sparse highdimensional regression models. Stat Sin. 2008;18:0613–8.
 9.
Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space (with discussion). J R Stat Soc Ser B. 2008;70:849–911.
 10.
Fan J, Feng Y, Wu Y. Highdimensional variable selection for Cox’s proportional hazards model. In: Berger JO, Cai T, Johnstone I, editors. Borrowing strength: theory powering applications—a Festschrift for Lawrence D. Brown. Beachwood: Institute of Mathematical Statistics; 2010. p. 70–86.
 11.
Zhao SD, Li Y. Principled sure independence screening for Cox models with ultrahighdimensional covariates. J Multivar Anal. 2012;105:397–411. https://doi.org/10.1016/j.jmva.2011.08.002.
 12.
Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw. 2011;39:1–13.
 13.
Breiman L. Random forests. Mach Learn. 2001;45:5–32.
 14.
Ishwaran H, Kogalur UB, Gorodeski EZ, Minn AJ, Lauer MS. Highdimensional variable selection for survival data. J Am Stat Assoc. 2010;105:205–17.
 15.
Zhu LP, Li L, Li R, Zhu LX. Modelfree feature screening for ultrahighdimensional data. J Am Stat Assoc. 2011;106:1464–75.
 16.
Li R, Zhong W, Zhu L. Feature screening via distance correlation learning. J Am Stat Assoc. 2012;107:1129–39.
 17.
Eeles RA, KoteJarai Z, Al Olama AA, et al. Identification of seven new prostate cancer susceptibility loci through a genomewide association study. Nat Genet. 2009;41(10):1116–21. https://doi.org/10.1038/ng.450.
 18.
KoteJarai Z, Olama AA, Giles GG, et al. Seven prostate cancer susceptibility loci identified by a multistage genomewide association study. Nat Genet. 2011;43(8):785–91. https://doi.org/10.1038/ng.882.
 19.
FitzGerald LM, Kwon EM, Conomos MP, et al. Genomewide association study identifies a genetic variant associated with risk for more aggressive prostate cancer. Cancer Epidemiol Biomark Prev. 2011;20(6):1196–203. https://doi.org/10.1158/10559965.
 20.
Li M, Mulkey F, Jiang C, et al. Identification of a genomic region between SLC29A1 and HSP90AB1 associated with risk of bevacizumabinduced hypertension: CALGB 80405 (Alliance). Cancer Clin Res. 2018. In press.
 21.
Palmer LJ, Burton PR, Smith GD. An introduction to genetic epidemiology. Policy Press at the University of Bristol.
 22.
Edwards AWF. Foundations of mathematical genetics. 2nd ed. Cambridge: Cambridge University Press; 2000. ISBN 0521775442
 23.
Halabi S, Singh B. Sample size determination for comparing several survival curves with unequal allocations. Stat Med. 2004;23:1793–815. https://doi.org/10.1002/sim.1771.
 24.
Hothorn T, Leisch F, Zeileis A, et al. The design and analysis of benchmark experiments. J Comput Graph Stat. 2005;14(3):675–99.
 25.
Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18:2529–45.
 26.
Harrell FE. Regression modeling strategies with applications to linear models, logistic and ordinal regression, and survival analysis. 2nd ed: Springer International Publishing; 2015.
 27.
Van Houwelingen HC. Validation, calibration, revision and combination of prognostic survival models. Stat Med. 2000;19:3401–15.
 28.
Saldana DF, Feng Y. SIS: an R package for sure independence screening in ultrahigh dimensional statistical models. J Stat Softw. 2018;83:1–25.
 29.
Kelly WK, Halabi S, Carducci M, et al. Randomized, doubleblind, placebocontrolled phase III trial comparing docetaxel and prednisone with or without bevacizumab in men with metastatic castrationresistant prostate cancer: CALGB 90401. J Clin Oncol. 2012;30(13):1534–40. https://doi.org/10.1200/JCO.2011.39.4767.
 30.
Halabi S, Lin CY, Kelly WK, et al. An updated prognostic model for predicting overall survival in firstline chemotherapy metastatic castrationresistant prostate cancer patients. J Clin Oncol. 2014;32(7):671–7. https://doi.org/10.1200/JCO.2013.52.3696.
 31.
Erho N, Crisan A, Vergara IA, et al. Discovery and validation of a prostate cancer genomic classifier that predicts early metastasis following radical prostatectomy. PLoS One. 2013;8:e66855. https://doi.org/10.1371/journal.pone.0066855.
 32.
Karnes RJ, Bergstralh EJ, Davicioni E, et al. Validation of a genomic classifier that predicts metastasis following radical prostatectomy in an at risk patient population. J Urol. 2013;190:2047–53.
 33.
Ross AE, Johnson MH, Yousefi K, et al. Tissuebased genomics augments postprostatectomy risk stratification in a natural history cohort of intermediate and highrisk men. Eur Urol. 2016;69:157–65.
 34.
Glass AG, Leo MC, Haddad Z, et al. Validation of a genomic classifier for predicting postprostatectomy recurrence in a community based health care setting. J Urol. 2016;195:1748–53.
 35.
Cooperberg MR, Davicioni E, Crisan A, et al. Combined value of validated clinical and genomic risk stratification tools for predicting prostate cancer mortality in a highrisk prostatectomy cohort. Eur Urol. 2015;67:326–33.
 36.
Paik S, Tang G, Shak S, et al. Gene expression and benefit of chemotherapy in women with nodenegative, estrogen receptor positive breast cancer. J Clin Oncol. 2006;24:3726–34.
 37.
Paik S, Shak S, Tang G, et al. A multigene assay to predict recurrence of tamoxifentreated, nodenegative breast cancer. N Engl J Med. 2004;351:2817–26.
 38.
Sparano JA, Gray RJ, Della F, et al. Prospective validation of a 21gene expression assay in breast cancer. N Engl J Med. 2015;373(21):2005–14.
 39.
Kim S, Halabi S. High dimensional variable selection with error control. Biomed Res Int. 2016:820945322.
 40.
Steyerberg EW, Moons KGM, van der Windt DA, et al. Prognosis research strategy (PROGRESS) 3: prognostic model research. PLoS Med. 2013;10(2):e1001381.
 41.
Moons KG, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med. 2015;162(1):W1–73.
Acknowledgements
The authors would like to thank Dr. Kevin Kelly and Dr. Howard Mcleod for ensuring access to the clinical trial and the GWAS data. In addition, the authors thank the Associate Editor and three reviewers for their helpful comments.
Funding
This research was supported in part by the National Institutes of Health Grant U01CA157703, United States Army Medical Research W81XWH1510467, and the Prostate Cancer Foundation.
Availability of data and materials
The data were simulated using R code which is available from the corresponding author. The example analyzed used data from CALGB 90401 which is available from the Alliance for Clinical Trials in Oncology.
Author information
Affiliations
Contributions
SH developed the concept of the study, participated in its design, and drafted the manuscript. LP performed the simulations and statistical analyses and helped draft the manuscript. Both authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
This analysis was performed under the approval of Duke University IRB.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Pi, L., Halabi, S. Combined performance of screening and variable selection methods in ultrahigh dimensional data in predicting timetoevent outcomes. Diagn Progn Res 2, 21 (2018). https://doi.org/10.1186/s4151201800434
Received:
Accepted:
Published:
Keywords
 Variable selection
 Calibration
 Overfitting
 Machine learning
 Proportional hazards model
 Prognostic models
 Elastic net
 Random forest
 High dimensional data
 Germline singlenucleotide polymorphism
 Survival outcomes