### Data source

We defined two cohorts from a Clinical Practice Research Datalink (CPRD) [18] dataset, which comprised primary care data linked with Hospital Episode Statistics [19] (HES), and mortality data provided by the Office for National Statistics (ONS) [20]. For the first cohort (referred to as historical cohort), the cohort entry date was the latest of attaining age 25 years, attaining 1 year follow-up as a permanently registered patient in CPRD, or 1 Jan 1998. The end of follow-up was the earliest date of patient’s transfer out of the practice or death, last data collection for practice, or 31 Dec 2015. Patients were excluded if they had a CVD event (identified through CPRD, HES or ONS) or statin prescription prior to their cohort entry date (code lists available in additional file 1). The second cohort comprised patients actively registered on 1 Jan 2016 (referred to as contemporary cohort). This cohort of patients represents a contemporary population, for whom a risk prediction model would subsequently be applied to estimate their CVD risks. To be eligible for this second cohort, a patient had to be aged 25–85 years on 1 Jan 2016, and be actively registered in CPRD with 1 year prior follow-up with no history of CVD or statin treatment.

### Overview

We mimicked the process of sampling an overarching target population for the development of a risk prediction model by randomly sampling *N* patients from the historical cohort (containing 1,965,079 and 1,890,582 individuals for female and male cohorts respectively). A risk prediction model was developed on this sample and used to generate risk scores for the contemporary cohort. This process was repeated 1000 times, giving 1000 risk scores for each patient, for each sample size. The sample sizes considered were *N* = 10,000, 50,000, 100,000, *N*_{epv10} (sample size required to meet the 10 events per predictor rule) and *N*_{min} (minimum sample size required to meet criteria outlined by Riley et al. [15]). We chose 10,000 as it is similar to the number of females and males used to develop ASSIGN [21] (6540 and 6757), Framingham [22] (3969 and 4522) and Pooled Cohort Equations [9] (9098 and 11 240). The upper limit of 100,000 was chosen to match the SCORE [23] equations, which were developed on 117,098 and 88,080 females and males respectively. The criteria by Riley et al. [15] ensure that overfitting is minimised on both the relative scale (through the shrinkage factor) and the absolute scale (small difference between apparent and adjusted proportion of variance explained), and that the overall risk is estimated with a sufficient level of precision. Derivation of *N*_{min} = 1434 (female) and 1405 (male) and *N*_{epv10} = 2954 (female) and 2297 (male) is described in additional file 2. There are also sample size formula suggested by van Smeden et al. [17], which focus on minimising the MAPE or root mean squared prediction error (rMSPE) of the resulting model; however, the formula are for logistic models, so they could not be used in this study. Prediction error is closely linked to the variability in risk considered in this work (if risk scores are unbiased and there was little variability, then the MAPE and rMSPE would both be small). It was important to consider prediction error in this work, and the process for doing this is outlined later in the “Methods” section.

### Generation of risk scores

The historical cohort and contemporary cohort were both split into female and male cohorts, and missing data was imputed using one stochastic imputation using the mice package [24]. All variables included in QRISK3 [7], including the Nelson Aalen estimate of the baseline cumulative hazard at the event time and the outcome indicator, were included in the imputation process. The following process was then carried out separately for females and males: 100,000 individuals were chosen at random from the historical cohort to form an internal validation cohort, the remaining individuals formed the development cohort. The development cohort (containing 1,865,079 and 1,790,582 individuals for female and male cohorts respectively) was then viewed as the population.

First, we calculated a 10-year risk for each patient in the contemporary cohort and the validation cohort using a model developed on the entire development cohort, called the population-derived risks. To do this, a Cox model was fit to the development cohort, where the outcome was defined as the time until the first CVD event. Predictor variables included in the model were continuous variables, and categorical variables with > 1% prevalence in all categories calculated from the entire development cohort (age, body mass index, cholesterol/high density lipoprotein ratio, family history of CVD, treated hypertension, smoking status, systolic blood pressure, Townsend deprivation index and type 2 diabetes). These 9 variables resulted in 13 model coefficients. This set of variables reflects the smaller number of variables used in models with lower sample sizes in practice [9, 21, 22]. The risks were calculated by multiplying the cumulative baseline hazard of the model at 10 years follow-up, by the exponent of the linear predictor for each individual, and converting into a survival probability using standard survival analysis relationships. Harrell’s C [25] and the calibration-in-the-large (mean predicted risk − observed/Kaplan Meier risk) of this model were also calculated in the validation cohort. Calibration is reported on the % scale (as an absolute difference in risk), as opposed to probability scale.

Next, for each value of *N*, we sampled *N* patients from this population (the development cohort) without replacement, 1000 times. The following process was repeated within each sample. A Cox model was fit to the sampled data using the techniques described in the previous paragraph. The developed model was used to generate 10-year risk scores for each individual in the contemporary cohort and the validation cohort. Harrell’s C [25] statistic for this model and the calibration-in-the-large were calculated in the validation cohort. The mean absolute prediction error (MAPE_{practical}) was also calculated for each model. This was the average (across patients) difference between the predicted risks and population-derived risks of patients in the validation cohort (difference calculated on the % scale, as opposed to probability). Note that we distinguish MAPE_{practical} from the MAPE used in the work by van Smeden et al. [17]. This is because in the present study, there is no “true” risk from which individual’s risk scores may deviate from and instead the population-derived risk is used. This can be thought of as a practical approximation to the MAPE metric used in the study by van Smeden et al. [17]. A graphical representation of the sampling process is given in Fig. 1.

### Analysis of stability of risk scores

For each sample size, the stability of risks for each patient in the contemporary cohort across the 1000 models was calculated in the following ways. First, the 5–95th percentile range of risks for each patient across the 1000 models was calculated. The distribution of these percentile ranges was then plotted in box plots stratified by the population-derived risk. Next, the 5–95th percentile range of risk for each patient was calculated across the subset of models with the highest C-statistic (top two thirds of models and top third of models). These percentile ranges were again presented in box plots stratified by population-derived risk. This process was repeated, restricting models to those where the calibration-in-the-large deviated from that of the population derived model the least (top two thirds of models and top third of models). This process was repeated again, restricting models to those where MAPE_{practical} was as small as possible (top two thirds of models and top third of models). This allowed us to explore whether only considering models with high discrimination, good calibration-in-the-large or small MAPE_{practical} would reduce the instability in the risk scores of individuals across these models. Finally, we grouped patients from the contemporary cohort into risk groups of width 1% as defined by their population-derived risk. The proportion of the 1000 models that classified a patient on the opposite side of the 10% risk threshold from the population-derived risk was then calculated (10% is threshold for statin eligibility according to the recommended guidelines in the UK [8]). This can be interpreted as the probability that an individual from a given risk group will be assigned a risk score on the opposite side of the treatment threshold, and highlights the impact this variability may have on an individual’s treatment decision in practice. For contrast, we also reported the net benefit [26, 27] of each model at the 10% threshold in the validation cohort, which informs on the impact this variability has on the population level.

Note that instability was assessed in the contemporary cohort as this cohort best represents the people who would have their risk assessed in practice today. Due to a lack of follow-up, model performance could not be assessed in the contemporary cohort. Instead, it was assessed in the same cohort the model was developed on (split sample internal validation), as would be done in practice if a dataset was not available for external validation.