Methods for Evaluation of medical prediction Models, Tests And Biomarkers (MEMTAB) 2018 Symposium

Introduction: Decisions about test availability for patient care are often based on limited evidence. Faecal calprotectin (FC) testing has been approved by NICE for the differential diagnosis of inflammatory bowel disease and irritable bowel syndrome in UK primary care in adults with unexplained abdominal complaints. The decision was based solely on evidence from secondary care. However, transferability of test accuracy estimates between settings cannot be assumed when patient populations differ between settings. We aimed to reassess the evidence against a primary care pathway with FC testing to evaluate what we know about test accuracy of FC testing in primary care. 
 
Methods: We updated the previous test accuracy review [1] of FC testing with colonoscopy as the reference standard. Meta-analyses in R version 3.4.1 explored heterogeneity. 
 
Results: Thirty-eight studies were eligible, including five from primary care. The studies’ patient populations, however, resembled a continuum from primary to secondary care. None of the studies sufficiently addressed the research question. Primary care studies either defined the target disease more broadly than the intended IBD group or did not use the preferred reference standard. The studies were highly heterogeneous in terms of tests and clinical questions, frequently offering more than one 2x2 diagnostic table for different tests and different clinical questions. Meta-analysing outcomes and investigating setting as a covariate was not feasible, as this would have required expressing a preference for one test and clinical question and disregarding others. Separate exploration of test type and clinical question by meta-regression showed that neither can be assumed to be generic. 
 
Discussion: We are lacking evidence to ascertain the assumed test performance of FC testing in primary care. Alternative approaches to simply categorising settings into primary and secondary care are needed to assess studies for their plausibility to reflect the performance of FC testing in primary care.


Background
Personal experience and a growing body of empirical studies (initiated by Gerd Gigerenzer) show that people generally find it hard to understand statistical measures of test accuracy. Sensitivity and specificity are technically useful for comparing assay performance because they are (mathematically at least) independent of study design and disease prevalence. For patients and clinicians, the clinically important measures for decision-making are predictive values and their relation to decision thresholds that depend on the personal values (positive and negative) placed on outcomes.

Objectives
To help people develop an intuitive understanding of diagnostic accuracy measures and their technical and clinical application.

Methods
We introduce the concepts of technical accuracy (sensitivity and specificity) and clinical accuracy (predictive values) to distinguish between the two main applications of test performance measures, and to be used alongside the concept of clinical utility. We developed two free interactive tools using the RStudio application "Shiny" that allow users to quantitatively and visually explore the effects on technical and clinical accuracy of true and false test results and prevalence. https://micncltools.shinyapps.io/TestAccuracy/ https://micncltools.shinyapps.io/ClinicalAccuracyAndUtility/ Both tools also show the effects of study sample size on uncertainties in test performance measures. The clinical accuracy tool visualises pre-test and post-test probabilities of disease in relation to clinical decision thresholds for positive and negative test results.
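The dependence of clinical accuracy on prevalence can be illustrated with a minimal sketch. This is not the code behind the Shiny tools; the function name and the example test characteristics (90% sensitivity, 95% specificity) are hypothetical and chosen only to show how predictive values shift while sensitivity and specificity stay fixed.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical test evaluated at three assumed prevalences.
for prevalence in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(0.90, 0.95, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

With technical accuracy held constant, the positive predictive value rises steeply as prevalence increases, which is the behaviour the clinical accuracy tool is described as visualising.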

Results
Using a point of care test for Clostridium difficile, we demonstrate the effect of prevalence and distinct clinical scenarios on the clinical accuracy and utility of the test in the UK NHS.

Conclusions
These tools may be useful for developers of clinical tests, authors of test evaluation reports, and clinicians and patients for interpreting and applying test results. Future developments should include tools to help people quantify their utilities for the outcomes resulting from acting/not acting on test results, and determine what their decision thresholds are.
Background: Risk prediction models for early-onset pre-eclampsia (requiring delivery <34 weeks' gestation) may improve maternal and infant health outcomes by identifying women who will benefit from management such as aspirin prophylaxis. Risk models using routinely measured factors are needed in settings where specialised tests are not available. However, few such models have been externally validated. Objective: To assess the performance of the Baschat (2014) [1] risk model, which incorporates history of chronic hypertension, diabetes and mean arterial pressure (MAP), to predict early-onset pre-eclampsia in early pregnancy using the Perinatal Antiplatelet Review of International Studies (PARIS) randomised controlled trial dataset. Methods: A retrospective individual-participant data meta-analysis to validate the Baschat model (reported sensitivity 55%/66% at 10%/20% false positive rates (FPRs) respectively, area-under-curve (AUC) 0.83). Trials were eligible if they did not select women based on the presence/absence of high-risk factors; enrolled women <28 weeks' gestation; and reported model predictors and pre-eclampsia. Women assigned to the control arm were included. Model performance was assessed by estimating sensitivity, specificity, positive (PPV) and negative (NPV) predictive value for predicting early-onset pre-eclampsia at: (i) 0.7% risk threshold to classify low- versus high-risk; and (ii) 10%/20% FPRs as reported in the original publication. The AUC and 95% confidence interval (CI) were calculated. Model calibration was assessed using the Hosmer and Lemeshow goodness-of-fit test and a calibration plot. Results: Three eligible trials included 4510 women. Pre-eclampsia prevalence was 4.9%. For prediction of early-onset pre-eclampsia (n=25, 0.6%), model sensitivity was 28.0% (95% CI 14.3-47.6%), specificity 84.3% (83.2-85.3%), PPV 1.0% (0.5-2.0%), NPV 99.5% (99.3-99.7%). At 10% and 20% FPRs, sensitivity was 20.0% (8.9-39.1%) and 32.0% (17.2-51.6%) respectively; AUC=0.55 (0.43-0.68), goodness-of-fit p=0.86.

Conclusion:
Model performance for predicting early-onset pre-eclampsia was poor in this validation population. Determining appropriate risk thresholds for assessment of clinical performance will be important for ongoing model development.
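As a rough consistency check on the reported predictive values, the 2x2 table can be approximately reconstructed from the published totals. This is only a sketch: the cell counts below are back-calculated from rounded percentages and are assumptions, not counts reported by the authors, and the function name is hypothetical.

```python
def reconstruct_table(n_total, n_cases, sensitivity, specificity):
    """Back-calculate approximate 2x2 cell counts from summary figures."""
    true_pos = round(sensitivity * n_cases)
    false_neg = n_cases - true_pos
    non_cases = n_total - n_cases
    true_neg = round(specificity * non_cases)
    false_pos = non_cases - true_neg
    return true_pos, false_pos, false_neg, true_neg


# Figures taken from the abstract: 4510 women, 25 early-onset cases,
# sensitivity 28.0%, specificity 84.3% at the 0.7% risk threshold.
tp, fp, fn, tn = reconstruct_table(4510, 25, 0.280, 0.843)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"PPV={100 * ppv:.1f}%  NPV={100 * npv:.1f}%")
```

Under these assumptions the reconstructed values agree with the reported PPV of 1.0% and NPV of 99.5%, showing how a very low outcome prevalence drives the PPV down even when specificity is moderately high.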
Background: Systematic reviews in prognosis can become unmanageable due to large numbers of predictors, many of which are considered by very few individual studies. However, a "rule of thumb" excluding predictors only found in few studies can risk excluding clinically important predictors. Methods are needed to select those that are worthwhile for review. Objectives: To describe methods used to select biomarkers for inclusion in a systematic review of prognostic factors for severe Crohn's disease. Methods: To manage the potentially large number of candidate predictors, we first subdivided the full review into four separate biomarker areas: (1) serological; (2) clinical; (3) genetic; and (4) combinations of tests/biomarkers. Only biomarkers reported in five or more primary studies were included automatically, with the remainder reviewed by a panel of gastroenterologists to identify those believed to be "promising" despite being reported in few studies. We stipulated a priori that only five such "promising" biomarkers would be included across all reviews. The panel was blinded to how many and which studies had considered each biomarker. Each member ranked their top five across all biomarker areas, with the top scoring biomarkers then being eligible. Results: Overall 169 candidate predictors were identified; 32 were included and 137 were excluded. In addition to those which were automatically eligible, the panel selected one additional biomarker each for the serological (CRP) and genetic (FOXO3A) reviews, while three were selected for the clinical review (severe endoscopic lesions, stricturing disease and response to therapy). Conclusion: Our approach eliminated a large volume of biomarkers with insufficient evidence to be clinically useful, and which were not considered promising by our panel.
Reference to expert opinion ensured the review did not exclude important or newer biomarkers while simultaneously minimising inclusion of results that have not been well evaluated in the literature.

time points, before and after red cell transfusion. MitoPO2 measurements were performed using a COMET monitor (Photonics Healthcare, Utrecht, The Netherlands) on skin primed for 4 hours with an ALA-containing patch (Alacare, Photonamic, Wedel, Germany) for induction of mitochondrial PpIX. Reported values are the mean mitoPO2 of 5 consecutive measurements at each time point. Results: A mitoPO2 measurement was obtained in all but 1 participant, most likely due to excessive chlorhexidine at the measurement site. All measurements were above the signal-to-noise ratio of 25, irrespective of severity of critical illness assessed via APACHE IV score (range 49-171). The median mitoPO2 before and after transfusion was 66.9 mmHg (IQR 61.5-77.7 mmHg) and 65.8 mmHg (IQR 57.5-87.2 mmHg), respectively. Median within-subject variability was limited during the first 3 hours after transfusion (3.96 mmHg, IQR 2.1-11.4 mmHg), but increased considerably after 24 hours (7.9 mmHg, IQR 4.3-13.9 mmHg). Conclusion: It is feasible to measure mitochondrial oxygen tension in critically ill patients. The measurements seem to be most reliable in the first 3 hours after patch removal. Interestingly, mitoPO2 values in our study population were higher than those previously reported in healthy volunteers.

P6
The area between curves, a non-parametric method to evaluate a biomarker for patient treatment selection
Background: Biological markers able to predict the benefit of a given treatment vs. another one are essential in precision medicine. Classically, a predictive marker is detected through testing a marker-by-treatment interaction in a parametric regression model, and most of the other methods rely on modelling the risk of event occurrence under each treatment arm. All these methods make assumptions that may be difficult to check.
Objectives: A simple approach, which does not make any parametric assumption, is proposed to detect and assess the overall predictive ability of a quantitative marker in clinical trials. Methods: This approach is a non-parametric and graphical method that relies on the area between the treatment-arm-specific ROC curves (ABC) as an indicator of the predictive ability of the marker. The approach is justified by the relationship between ROC curves and risk curves, the latter being key tools in assessing predictive markers. Results: A simulation study was conducted to assess the ABC estimation method and compare it with two approaches based on risk modelling: the Total Gain approach (TG) and the interaction approach. The simulations showed that the ABC estimate has a low relative bias and that its confidence interval has a good coverage probability. The mean relative bias in the ABC is at least as low as in the TG in almost all combinations of sample size, ABC, and risk. The power of the ABC estimation method was close to that of the interaction coefficient. The method was applied to PETACC-8 trial data on the use of FOLFOX4 vs. FOLFOX4 + cetuximab in stage III colon adenocarcinoma. It enabled detecting a predictive marker: the DDR2 gene amplification level.

Conclusion:
The ABC is a simple indicator that may be recommended as a first step in the identification and overall assessment of a predictive marker.
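The idea of an area between arm-specific ROC curves can be sketched as follows. This is an illustrative assumption about how such a quantity might be computed (empirical step ROC curves, absolute difference, trapezoidal integration on a common grid), not the authors' estimator, and all function names and simulated data are hypothetical.

```python
import numpy as np


def empirical_roc(marker, event):
    """Empirical ROC points (FPR, TPR); high marker values count as positive."""
    marker = np.asarray(marker, dtype=float)
    event = np.asarray(event, dtype=bool)
    thresholds = np.unique(marker)[::-1]  # cut-offs in descending order
    tpr = np.array([(marker[event] >= t).mean() for t in thresholds])
    fpr = np.array([(marker[~event] >= t).mean() for t in thresholds])
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))


def roc_on_grid(marker, event, grid):
    """Evaluate the step ROC curve (highest TPR with FPR <= t) on a grid."""
    fpr, tpr = empirical_roc(marker, event)
    idx = np.searchsorted(fpr, grid, side="right") - 1
    return tpr[idx]


def area_between_roc_curves(marker_a, event_a, marker_b, event_b, grid_size=1001):
    """Integrate |ROC_A - ROC_B| over false positive rates in [0, 1]."""
    grid = np.linspace(0.0, 1.0, grid_size)
    gap = np.abs(roc_on_grid(marker_a, event_a, grid)
                 - roc_on_grid(marker_b, event_b, grid))
    return float(np.sum((gap[1:] + gap[:-1]) / 2 * np.diff(grid)))


# Simulated trial: the marker relates to the event in arm A only.
rng = np.random.default_rng(0)
marker_a = rng.normal(size=300)
event_a = rng.random(300) < 1 / (1 + np.exp(-1.5 * marker_a))
marker_b = rng.normal(size=300)
event_b = rng.random(300) < 0.5
print(area_between_roc_curves(marker_a, event_a, marker_b, event_b))
```

When the marker discriminates events equally in both arms the two curves coincide and the area is close to zero; a larger area suggests that the marker's relationship with the outcome differs by treatment.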

P7
Development and internal validation of a prognostic model including quantitative fetal fibronectin to predict preterm delivery in symptomatic women (QUIDS study): an IPD meta-analysis
M. M. C. Bruijn 1, E. Schuit 2,6, R. D. Riley 3, J. Norrie 4, R. K. Morris 5, J. E. Norman 1, S. J. S. Stock 1 on behalf of the QUIDS team

Background
Accurate prediction of preterm delivery remains notoriously challenging. It would enable targeted interventions and reduce unnecessary hospital admissions and transfers. Quantitative fetal fibronectin (qfFN) is a new bedside test to improve diagnosis of preterm labour.

Objectives
To evaluate the accuracy of qfFN to rule out spontaneous preterm delivery within seven days, and to develop and internally validate a decision support tool for the management of symptomatic women.

Methods
We performed an IPD meta-analysis of 5 European studies of symptomatic women at 22+0 to 34+6 weeks' gestation. We used qfFN and clinical risk factors from a pre-defined set of predictors. We used multivariable logistic regression, firstly with all predictors and secondly with backward stepwise selection (threshold of p-value<0.1), to develop a prognostic model to predict preterm delivery within seven days. Multiple imputation was used for predictor values considered missing at random, and non-linear trends were allowed for continuous predictors. Clustering and between-study heterogeneity of outcome incidence were taken into account by a separate intercept term per study. The performance of the model was assessed by overall fit (Nagelkerke R²) and discrimination (AUC). Bootstrap re-sampling techniques were used for internal validation and optimism-adjustment using shrinkage.

Results
We included 1783 women, with 139 (7.8%) events of preterm delivery within seven days. Table 1 shows the prognostic model before and after variable selection. For the latter, besides qfFN, the model included smoking, ethnicity, nulliparity and multiple pregnancy. After applying a uniform shrinkage factor of 0.92, the model showed an R² of 0.39 and an AUC of 0.89 (95% CI 0.87-0.93).

Conclusion
A prognostic model including qfFN and clinical risk factors showed excellent performance in the prediction of preterm delivery. As part of the QUIDS study, the model (including choice of intercept) will be externally validated using data from a prospective cohort study at 26 UK sites.
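Applying a uniform shrinkage factor can be sketched as below. The coefficients, data and function name are hypothetical, and re-estimating the intercept so that the mean predicted risk matches the observed event rate is one common choice assumed here; the abstract does not state how the QUIDS intercept was handled.

```python
import numpy as np


def apply_uniform_shrinkage(coefs, X, y, shrinkage):
    """Shrink slope coefficients, then re-fit the intercept by bisection."""
    shrunk = shrinkage * np.asarray(coefs, dtype=float)
    linear_part = np.asarray(X, dtype=float) @ shrunk
    target = float(np.mean(y))  # observed event rate
    lo, hi = -20.0, 20.0
    for _ in range(100):
        mid = (lo + hi) / 2
        mean_risk = np.mean(1 / (1 + np.exp(-(mid + linear_part))))
        if mean_risk < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2, shrunk


# Hypothetical development data: two predictors, 8% event rate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.zeros(200)
y[:16] = 1
intercept, coefs = apply_uniform_shrinkage([1.2, -0.5], X, y, shrinkage=0.92)
print(intercept, coefs)
```

Shrinking the slopes towards zero pulls extreme predictions towards the average risk, which is the intended correction for optimism estimated by the bootstrap.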

Background
The UK Biobank dataset (http://www.ukbiobank.ac.uk/about-biobankuk/) is a resource established by the Wellcome Trust, available to researchers based anywhere. Over 500,000 UK participants contributed extensive health-related data, giving a unique opportunity to investigate predictors of disease. Data were collected from people aged 40-69; initial assessments took place from 2006 to 2010 and follow-up is ongoing.

Objectives
To use early life factors and clinical data to predict stroke and recurrent stroke. To develop a method to identify participants with stroke and date of stroke. Strokes and dates can be self-reported via touchscreen, reported in a nurse-led interview, or taken from hospital records. Self-reported stroke without corroboration is not reliable (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0137538), and hospital data is challenging to use.

Methods
We compared self-reported strokes, interview-reported strokes, and hospital stroke data and tried to ascertain consistency and accuracy. We estimated the proportion of missing data for key variables.

Results
7669 people out of 502,619 reported stroke at initial assessment. This was not confirmed in interview for 1068 participants, while 793 people did say they had had a stroke in interview but not via touchscreen. Only 75% of participants had an interview. Reported dates of stroke have inconsistencies. The hospital data uses consultant referral as the unit of analysis, so a single stroke may have multiple rows. 6548 participants had from 1 to 24 strokes. Admittance dates, needed to work out whether a participant has had two strokes or two consultant referrals, are incompletely collected, with 23% missing. 846 of the hospital strokes occurred prior to Biobank recruitment but were not self-reported via touchscreen; of these, 656 were also not picked up at interview. Missing data in non-stroke predictors can be extensive. For example, 33% did not report the age at which they left full-time education, and 67% are missing cognitive data.

Conclusion
UK Biobank is a huge resource, but poses challenges for researchers.

P10
Surprising results when selecting predictors for a clinical prediction rule
Francesca M. Chappell 1, Fay Crawford 2, Margaret Horne 3, on behalf of PODUS CPR Group

Background
We conducted a systematic review and meta-analysis of individual patient data (IPD) on predictors of diabetic foot ulceration. These predictors can be used to develop a clinical prediction rule for health professionals working directly with patients.

Objectives
To develop a clinical prediction rule.

Methods
Using IPD from nine studies (14897 patients), we chose candidate predictors based on (i) clinical plausibility, (ii) availability, (iii) consistency of definition, and (iv) acceptable heterogeneity. From 22 variables, this left six candidates: age, gender, diabetes duration, monofilament testing, pulses testing, and history of ulceration, to be used in a two-step meta-analysis (11522 patients). We used a tenth externally held dataset (1489 patients), not available to the project team, for validation. Predictors were considered validated if the external dataset's results were consistent with meta-analysis results and they achieved statistical significance.

Results
Three predictors were validated in the external dataset: an inability to feel a 10g monofilament, any absent pedal pulse and ulcer history, all binary. Two non-validated predictors were age and diabetes duration, generally considered highly plausible predictors of diabetes complications. They are also continuous variables, which have more statistical power than corresponding categorical variables. We therefore compared logistic regression models using the three validated predictors and all six predictors using discrimination (ROC plots and area under the curve) and calibration plots. The models using the three validated predictors did not perform worse than the models using six predictors.

Discussion
The three validated predictors are all foot-specific. The non-validated predictors are all "systemic". It may be that in the prediction of foot ulcer, data on foot health is more informative than data on the whole patient.

Conclusion
Understanding the clinical context and sound statistical methods are important in the selection of predictors.

Background
According to the Taiwan Patient Safety Reporting System 2016 annual report, 41% of events attributed to the "communication factor" involved communication problems "between medical staff and patients".

Objectives
To use simple, easy-to-understand questions on a Shared Decision Making platform so that medical staff can give a clear and complete explanation, and to support patients in comparing and assessing options, making a choice, expressing their willingness to accept treatment, and exercising medical consent.

Methods
This study will collect into a database, for all diagnoses related to thyroid cancer managed with radioactive iodine-131, the diagnostic statements, the purpose of the treatment, the methods of implementation, the possible complications, the success rate, the risk of non-treatment, the treatment alternatives and post-treatment precautions, together with health status, patient preferences and patient values. The physician can then use the platform to discuss directly with the patient and display the information the patient needs, integrated with the patient's own questions and concerns, to help the patient choose the most appropriate option.

Results
This study is based on the Iodine 131 examination project of Chang Gung Memorial Hospital, Kaohsiung Medical Center and the two major concepts of Evidence-Based Medicine and Shared Decision Making. I-131 Shared Decision Platform architecture is divided into five parts: Patient Search System, Shared Decision Making System, Health Education System, Evidence-Based Medicine System, Data Repository System.

Conclusion
The platform guides patients and their families through structured steps to weigh important considerations. Discussion between doctors and patients reduces gaps in their mutual understanding and brings together three elements: knowledge, communication and respect. This supports the philosophy of "Quality, Efficiency and Service", so as to obtain the best feasible treatment, protect patients' medical interests and enhance the quality of medical care.

Objective
To evaluate four approaches used to provide dynamically updated personalized predictions for a binary outcome based on a repeatedly measured biomarker: likelihood two-stage method (2SMLE), likelihood joint model (JMMLE), Bayesian two-stage method (2SB) and Bayesian joint model (JMB).

Method
We applied the four approaches to predict the development of gestational trophoblastic neoplasia (GTN) based on age and repeated measurements of human Chorionic Gonadotropin (hCG), using data from the Dutch Central Registry for hydatidiform moles at the Radboudumc in Nijmegen. We assessed the predictive power using the area under the ROC curves, and obtained dynamically updated predictions for new patients.

Results
The JMMLE failed to achieve convergence due to incomplete optimization. The remaining three approaches (2SMLE, 2SB and JMB) gave basically the same estimates, but with slightly higher posterior parameter estimates of the binary submodel of JMB. Using all available data, the three models equivalently showed excellent predictive power. The updated subject-specific predictions for new patients were approximately the same.

Conclusion
This study provides a comprehensive explanation and R syntax for a toolbox of approaches to obtain updated predictions of a binary outcome based on newly available measurements.

To explore the use of a Delphi process in selecting candidate predictors for use in the development of a prognostic model for Atrial Fibrillation (AF).

Methods:
A selection of AF expert healthcare professionals were invited to participate in a Delphi process to select candidate predictors from a group of patient characteristics. This process consisted of completing multiple surveys (rounds) with the aim of gaining consensus amongst the participants for each patient characteristic. Each characteristic was rated independently (using a Likert scale) on how important it is in predicting recurrence of AF. When consensus was reached, the results were analysed and the characteristics were ordered from the most to the least predictive.

Results:
Three rounds of the Delphi survey were completed, with the addition of a consensus meeting which concluded in 217 days. In round 1, 57 of 120 characteristics gained consensus (47.5%). In round 2, 35 of 63 characteristics gained consensus (55.6%) and in round 3, 11 of 28 characteristics gained consensus (39.3%). At the consensus meeting the remaining 17 characteristics (14.2%) were discussed and subsequently gained consensus.
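The round-by-round percentages follow from the number of characteristics still without consensus at the start of each round. A short sketch of that arithmetic, using only the counts reported above (the function name is hypothetical):

```python
def delphi_rounds(total, agreed_per_round):
    """Return per-round (considered, agreed, percent) and the number left over."""
    remaining = total
    rounds = []
    for agreed in agreed_per_round:
        rounds.append((remaining, agreed, round(100 * agreed / remaining, 1)))
        remaining -= agreed
    return rounds, remaining


rounds, left_for_meeting = delphi_rounds(120, [57, 35, 11])
for number, (considered, agreed, percent) in enumerate(rounds, start=1):
    print(f"Round {number}: {agreed} of {considered} ({percent}%)")
print(f"Left for consensus meeting: {left_for_meeting} "
      f"({round(100 * left_for_meeting / 120, 1)}% of all 120)")
```

Note that the denominators differ: each round's percentage is relative to the characteristics entering that round, whereas the 14.2% for the consensus meeting is relative to the original 120.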

Conclusions:
Undertaking a Delphi process requires a large amount of time, and commitment from each individual within the expert group, to adequately identify the most predictive patient characteristics for recurrence of AF. Overall, the Delphi process works efficiently when combining a group of experts' knowledge to identify candidate predictors for use in developing the prognostic model.
Background: Test accuracy reviews are increasingly published in the literature and their results are used in making clinical and policy decisions. In contrast to clinical trials, there has been little research into the determinants, magnitude, and impact of optimal sample size needed for test accuracy studies. The objective of our study is to assess the proportion of test accuracy systematic reviews that consider sample size when analyzing and interpreting results. Methods: We conducted a methodological systematic survey of test accuracy systematic reviews published in 2016 and 2017. We are reviewing a 1:1 stratified random sampling of 280 Cochrane vs. non-Cochrane systematic reviews. We will calculate the proportion of systematic reviews discussing sample size in the results, discussion and conclusion of included reviews. For each systematic review, we will calculate the preferred sample size required for accurate results using an equation that integrates the values of prevalence, margin of error and values of sensitivity or specificity (1). We will report the proportion of reviews that meet the minimum sample size.
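One widely used equation of the kind described, combining prevalence, margin of error and the anticipated sensitivity or specificity, is the Buderer-type formula. Whether this is the equation cited as (1) is an assumption; the sketch below, with hypothetical function names and input values, only illustrates the calculation.

```python
import math
from statistics import NormalDist


def sample_size_for_sensitivity(sensitivity, prevalence, margin, confidence=0.95):
    """Total participants needed so the sensitivity CI half-width is <= margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diseased_needed = z ** 2 * sensitivity * (1 - sensitivity) / margin ** 2
    return math.ceil(diseased_needed / prevalence)


def sample_size_for_specificity(specificity, prevalence, margin, confidence=0.95):
    """Total participants needed so the specificity CI half-width is <= margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    healthy_needed = z ** 2 * specificity * (1 - specificity) / margin ** 2
    return math.ceil(healthy_needed / (1 - prevalence))


# Hypothetical inputs: 90% accuracy measures, 20% prevalence, +/-5% margin.
print(sample_size_for_sensitivity(0.90, 0.20, 0.05))
print(sample_size_for_specificity(0.90, 0.20, 0.05))
```

Because the number of diseased participants scales with prevalence, low-prevalence settings need far larger totals to estimate sensitivity with the same precision.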

Results:
We are in the process of completing this work and we will have the results ready at the time of the presentation. Conclusion: The findings of this study will inform the test accuracy research community and clinicians about the current practice of considering sample size as a factor that may affect the quality of the results in both Cochrane and non-Cochrane reviews. We will also explore how often systematic reviews achieve a preferred minimum sample size to appropriately calculate test accuracy. This work will inform future initiatives to empirically assess the effect of imprecision in test accuracy reviews.

The objective of this research is to evaluate the predictive performance of regression methods to develop clinical risk prediction models using multicenter data, and provide guidelines for practice. To this end, we compared the predictive performance of standard logistic regression, generalized estimating equations, random intercepts logistic regression and fixed effects logistic regression. First, we presented a case study on the diagnosis of ovarian cancer using data from the International Ovarian Tumor Analysis group (IOTA). Subsequently, a simulation study investigated the performance of the different models as a function of the amount of clustering, development sample size, distribution of center-specific intercepts, the presence of a center-predictor interaction and the presence of a dependency between center effects and predictors. During validation, both new patients from centers in the development dataset and patients from new centers were included. The results showed that sufficiently large sample sizes led to calibrated predictions under conditional models and miscalibrated predictions under marginal models. Small sample sizes led to overfitting and unreliable predictions. This miscalibration was worse with more heavily clustered data. Calibration of random intercepts logistic regression was better than that of standard logistic regression even when center-specific intercepts were not normally distributed, a center-predictor interaction was present, center effects and predictors were dependent, or when the model was applied in a new center.
In conclusion, to make reliable predictions in a specific center, we recommend random intercepts logistic regression.

Background: Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were established in 2011. Studies of agreement and/or reliability are in our experience more often than not part of larger diagnostic accuracy studies, clinical trials, or epidemiological studies in which agreement and/or reliability are reported as quality control by using data of the main study. Unfortunately, the planning of such minor studies regularly fails to precede their conduct and/or researchers are unfamiliar with central concepts of agreement and reliability.

P22
Objectives: To propose 5 questions to be addressed in the planning phase from a statistical point of view in order to secure an appropriate analysis plan for an agreement and/or reliability study that actually illuminates what it is supposed to illuminate.

Methods:
We gathered examples from our consultancy experience and derived an overview sheet characterizing agreement and/or reliability studies. Then, we identified 5 central questions to fine-tune the statistical analysis and related these to respective items of GRRAS.

Adoption of a clinical test into NHS practice requires evidence on its accuracy, usability, clinical utility, affordability and cost-effectiveness. This entails evaluating the changes in clinical and economic outcomes to the care pathway (i.e. the journey that patients make through the healthcare system) resulting from potential adoption. Methods for evaluating utility and cost-effectiveness are well established, and new methods are being developed. However, little research has been done to evaluate the processes that provide the data for these evaluations, i.e. care pathway analysis, modelling, implementation and evaluation.

Objectives
We aim to identify new methodologies for care pathway analysis. In this instance we evaluated, through a case study, the utility of collecting and analysing NHS local guidelines.

Methods
The pathways used to recognise patients with suspected sepsis were compared between 14 Trusts and with the NICE guidelines. Recommended symptoms and thresholds were identified and categorized.

Results
The recommended physiological signs to consider for early identification of patients with suspected sepsis were consistent across sites and with NICE guidelines, but thresholds were different. The number of steps that would lead to identifying patients for review and initiation of the Sepsis 6 bundle was also different across Trusts. This leads to different numbers of patients being treated for sepsis across the UK, independently of the true disease prevalence.

Conclusion
The analysis of local guidelines: 1. clarified the physiological signs that influence the clinical decision making during the pathway; 2. supported the development of a high-level map common to the majority of Trusts, a first step for care pathway modelling; 3. provided insights about variability in patient care across different UK Trusts.
In conjunction with data from the Hospital Episode Statistics, the algorithms described in the guidelines can be powerful tools to calculate the ranges for prevalence and distributions of outcomes associated with specific diseases.

both models over-estimating in the higher risk groups (Fig 1).

Conclusion
We developed highly discriminatory models for classifying patients requiring early insulin therapy. Addition of GAD improved the model performance. Further investigation is required to identify the reason for the miscalibrations at the highest probabilities in external validation.

Methods: Sixteen prediction models were validated using a Dutch patient cohort of 1,001 men who underwent extended PLND between October 2008 and May 2017. Patient characteristics included serum prostate specific antigen (PSA), clinical tumor (cT) stage, primary and secondary Gleason scores, number of biopsy cores taken, and number of positive biopsy cores. Model performance was assessed using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. Calibration plots were used to visualize over- or underestimation by the models. Results: Lymph node involvement was identified in 276 (28%) patients. Patients with LNI had a higher PSA, higher primary Gleason pattern, higher Gleason score, higher number of harvested nodes, higher number of positive biopsy cores, and higher cT stage, compared to patients without LNI. Predictions generated by the 2012 Briganti nomogram (AUC = 0.76) and the MSKCC web-calculator including biopsy core information (AUC = 0.75) were found most accurate. Underestimation of LNI probability was present when looking at patients with a predicted probability below 20%. Conclusion: Models predicting LNI in PCa patients were externally validated in a Dutch patient cohort. The 2012 Briganti and the MSKCC nomograms were the most accurate prediction models available.

P30
Optimizing the risk threshold of lymph node involvement for performing pelvic lymph node dissection in prostate cancer patients: a cost-effectiveness analysis

Background: Clinical prediction models support decision making on the performance of pelvic lymph node dissection (PLND) in prostate cancer (PCa) patients. However, international guidelines recommend different risk thresholds to select patients who may benefit from PLND. Objectives: We aimed to quantify the cost-effectiveness of using different risk thresholds for predicted lymph node involvement (LNI) in PCa patients with the Briganti nomogram (2012) to inform decision making on omitting PLND. Methods: Four different thresholds (2%, 5%, 10% and 20%) used in practice for performing PLND were compared using a decision analytic model, using the 20% threshold as reference. Baseline characteristics for the hypothetical cohort were based on an actual Dutch patient cohort containing 925 patients who underwent extended PLND with risks of LNI predicted by the 2012 Briganti nomogram. Compared outcomes consisted of quality adjusted life years (QALYs) and costs. The best strategy was selected based on the incremental cost effectiveness ratio (ICER) when applying a willingness to pay (WTP) threshold of €20,000 per QALY gained. Probabilistic sensitivity analysis was performed with Monte Carlo simulation to assess the robustness of the results. Results: Costs and health outcomes were lowest (€7,207 and 6.22 QALYs) for the 20% threshold, and highest (€9,670 and 6.27 QALYs) for the 2% threshold. The ICERs for the 2%, 5%, and 10% thresholds, each compared with the next higher threshold (i.e. 5%, 10%, and 20%, respectively), were €84,974/QALY, €65,306/QALY, and €28,860/QALY. Applying a WTP threshold of €20,000, the probabilities of the 2%, 5%, 10%, and 20% strategies being cost-effective were 0%, 3%, 28%, and 69%, respectively.
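As a rough illustration of how an ICER and a willingness-to-pay comparison are computed, the sketch below uses the rounded costs and QALYs reported above for the 2% and 20% thresholds. The function names are assumptions, and the 2% versus 20% comparison is chosen only because both arms are fully reported; it is not one of the pairwise ICERs in the abstract, and rounding of the inputs makes the result approximate.

```python
def icer(cost_new: float, qaly_new: float, cost_ref: float, qaly_ref: float) -> float:
    """Incremental cost-effectiveness ratio: extra cost per extra QALY."""
    delta_qaly = qaly_new - qaly_ref
    if delta_qaly == 0:
        raise ValueError("strategies have identical QALYs; ICER is undefined")
    return (cost_new - cost_ref) / delta_qaly


def net_monetary_benefit(cost: float, qaly: float, wtp: float) -> float:
    """QALYs valued at the willingness-to-pay threshold, minus costs."""
    return wtp * qaly - cost


if __name__ == "__main__":
    WTP = 20_000.0  # EUR per QALY, as stated in the abstract
    # Rounded values from the Results (cost in EUR, QALYs).
    strategies = {"20% threshold": (7207.0, 6.22), "2% threshold": (9670.0, 6.27)}

    ratio = icer(*strategies["2% threshold"], *strategies["20% threshold"])
    print(f"ICER, 2% vs 20% threshold: {ratio:,.0f} EUR/QALY")

    # The strategy with the highest net monetary benefit is preferred at this WTP.
    for name, (cost, qaly) in strategies.items():
        print(name, round(net_monetary_benefit(cost, qaly, WTP)))
```

A strategy is considered cost-effective relative to its comparator when its ICER falls below the willingness-to-pay threshold, which is equivalent to having the higher net monetary benefit.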
Background: Policy making on diagnostics presents huge challenges which organisations like NICE have tried to address. High amongst these are that diagnostics have multiple ways in which they can bring benefit to patients, carers, health services and society. This makes the task of evaluating the impact and collecting evidence on those impacts complex. One suggestion how this complexity can be handled is a clear statement of value proposition, in which developers are specific not just about how and where the new test will be used, but also what they expect will be achieved. In this way attention is directed to aspects of impact which should receive priority in the evidence development process. We intend to describe the degree to which the concept of value proposition has been used in NICE's Diagnostic Guidance. This extends previous work on the use of end-to-end studies.
Objectives: To explore whether value proposition has been clearly described in NICE Diagnostic Guidance and whether evidence has been found which directly demonstrates the aspects of value proposition identified.
Methods: We will extend the approach used in past analysis of the methodological features of NICE guidance. All NICE diagnostics guidance will be interrogated. We will abstract data on the policy question addressed and the underlying value proposition and whether evidence has been identified on these aspects of value proposition. Analysis will be qualitative.
Results: This work is in progress.
Conclusion: Value proposition is a potentially useful way for policy makers to help make sense of the many different ways in which a test might represent an effective and cost-effective addition to health care. This project will inform the degree to which the concept is already being used by one prominent policy-maker, but also offer ways in which greater use can be made of it in the future.

There are limited data on the application and use of machine learning techniques to predict postpartum depression. Furthermore, there is scarce information on how machine learning techniques can be integrated in the epidemiological framework of identifying persons at risk in clinical psychology.

Objectives
We explore machine learning methods to develop predictive models for postpartum depression and compare them with current methods of predictive modeling.

Methods
Data are obtained from the Pregnancy Anxiety and Depression (PAD) prospective cohort study designed to investigate risk factors for antenatal and postnatal anxiety and depression. We use data retrieved from 6,930 participating women by questionnaires providing information on social support, anxiety and personality traits, as well as information on socio-economic status, lifestyle, and stressful life events during pregnancy. Assessments took place at baseline, 24 and 36 weeks of gestation and 6 months postnatal. We attempt to create classification and regression models using machine learning techniques such as logistic regression, decision trees, and linear discriminant analysis to predict postpartum depression, defined as a score ≥ 10 on the Edinburgh Postnatal Depression Scale. We then apply cross-validation and bootstrap techniques to compare the predictive validity, assumptions and interpretability of the methods.

Conclusions
This exploratory study aims to investigate the potential of machine learning methods for the prediction of postpartum depression risk in comparison to established statistical techniques with respect to suitability, applicability and accuracy of the methods. As a further step we aim to compare the predictive assessment, inferential strengths and weaknesses and statistical pitfalls that may appear from the use of such methods.

Background
Biomarkers are increasingly used to personalise treatment, and biomarker-guided trials are the gold standard for testing their clinical utility. A lack of trials is one of the main obstacles delaying translation of biomarker discoveries into clinic. Before a trial takes place, there must be robust evidence for the biomarker's validity. However, the extent of evidence required, and how it should be compiled, is unclear.

Objectives
We have undertaken a literature review to identify biomarker-guided randomised controlled trials (RCTs) and explore what evidence has been used to justify inclusion of a biomarker, and how the evidence was compiled.

Methods
We conducted a systematic search of four databases. Our search yielded 11399 papers when duplicates were removed. After screening titles and abstracts, 284 papers remained for full text screening. After full-text screening, and restricting to papers published in the past five years, 119 papers were included.

Results
The majority of trials were in the field of oncology (55.5%) with cardiovascular disease being the second most common (17.6%). Many trials justified use of a biomarker based on previous retrospective or pilot studies. Others were based on evidence from literature reviews, case studies, or in vitro/in vivo work. Some trials provided strong evidence for biomarker use, citing meta-analyses and previous RCTs.

Conclusion
To our knowledge, no prior review has systematically identified the methods used for compiling evidence for inclusion of biomarkers in previous biomarker-guided RCTs. We have identified large variations in methods, with several RCTs based on little evidence. We have also quantified how many of each biomarker-guided design have been utilised, as well as the clinical areas in which they have been used.
No standard approach exists for gathering evidence to justify inclusion of a biomarker in RCTs, and our further work will focus on optimal approaches for doing so.

The phrase "care pathway" refers to the journey a patient takes during an episode of healthcare. Mapping the care pathway for a medical condition is a vital step in the evaluation of a new diagnostic test, helping developers identify the optimal role of their test, that is, where it leads to the greatest patient and economic benefit. Care pathways can be established through interviews with relevant medical experts, which are transcribed and analyzed thematically in software packages such as NVivo or ATLAS. These packages provide a validated environment for qualitative research, but are expensive and rigid in terms of data manipulation and analysis options and thus have limited utility.
In a recent project we utilized the R programming language and an R package called 'RQDA' to thematically analyze interviews designed to elicit expert opinion on C.diff testing in the UK NHS. The advantage of using R is that it is free, powerful and flexible, and there is a strong community of users continually developing and advancing packages for use in the R environment.

Objective
To outline a potentially novel approach to thematic analysis in R using the RQDA package, demonstrated with interview data from an exploratory study that aimed to understand the potential role of a new point-of-care test for C.diff within the UK NHS.

Methods
We interviewed 10 clinicians with expertise in the diagnosis and management of C.diff infection in the UK NHS. These interviews were transcribed verbatim and thematically analyzed in R, using the RQDA package.

Results & Conclusions
This study resulted in the explication of a potentially novel approach to thematic analysis of interviews to inform care pathway analysis for new diagnostic tests. It is our view that this new approach is systematic, scientifically reproducible and widely available, and thus a useful approach to communicate to the wider diagnostic community.

Background
Multinomial Logistic Regression (MLR) has been advocated for developing clinical prediction models that distinguish between three or more unordered outcomes. Which factors drive the predictive performance of MLR is still unclear.

Objectives
We aim to identify the key factors that influence predictive performance of MLR models. Further, we aim to give guidance on the necessary sample size for multinomial prediction model development and on the usage of penalization during model development.

Methods
We present a full-factorial simulation study to examine the predictive performance of MLR models in relation to the relative size of outcome categories, number of predictors and the number of events per variable. Further, we present a case study in which we illustrate the development and validation of penalized and unpenalized multinomial prediction models for predicting malignancy of ovarian cancer.

Results
It is shown that MLR estimated by maximum likelihood yields overfitted prediction models in small to medium sized data. In most cases, the calibration and overall predictive performance of the multinomial prediction model is improved by using penalized MLR. Events per variable, the number of predictors and the frequencies of the outcome categories affect predictive performance.

Conclusion
As expected, our study demonstrates the need for optimism correction of the predictive performance measures when developing the multinomial logistic prediction model. We recommend the use of penalized MLR when prediction models are developed in small data sets, or in medium sized data sets with a small total sample size (i.e. when the sizes of the outcome categories are balanced). Our simulation study also highlights the importance of events per variable in the multinomial context as well as the total sample size.

Background
Biomarker-guided treatment is a rapidly developing area of medicine. A biomarker-guided trial is the gold standard approach to testing clinical utility of such an approach, and several biomarker-guided trial designs have been proposed. Due to their complexity, some of the designs are difficult to understand in terms of how they should be implemented and analysed. Further, due to the large number of different designs available, it is challenging to decide which is the most appropriate in a particular situation.

Objectives
To develop a user-friendly online tool, informed by a comprehensive literature review, to guide and inform those embarking on biomarker-guided trials in terms of optimal choice, design, practical application and analysis.

Methods
We undertook a comprehensive literature review. All unique biomarker-guided trial designs were identified and their design features, analysis approach and positive and negative qualities described. Importantly, a graphical representation of each trial design was developed, standardised to allow easy comparison of features across designs. Based on our review we developed our online tool, 'BiGTeD' to allow easy and free access to the information gathered.

Results
Our literature review identified 211 papers describing biomarker-guided trials. Information gathered during the review has been incorporated into our newly developed online tool, BiGTeD, a key feature of which is a clear and interactive graphical representation of each trial design to aid interpretation and understanding.

Conclusions
Navigating the literature to gain understanding of which biomarker-guided trial design to choose, and the practical implications of doing so, is difficult. Our online tool, BiGTeD (www.bigted.org), is aimed at improving understanding of the various biomarker-guided trial designs and provides valuable and much-needed guidance on their implementation in a user-friendly way. Knowledge on how to design, implement and analyse these trials is essential for testing the effectiveness of a biomarker-guided approach to treatment.

We are proposing a novel method for evaluating the diagnostic efficacy of a marker at the baseline level accounting for measurement error.

Methods
We propose a joint modelling approach to link the individual-level deviation of the baseline marker profile from the population mean and the risk of the clinical endpoint. At any time t, we define cases as individuals who became diseased prior to t and controls as individuals who survive beyond t. The estimated random effects at baseline are used to define the measurement error adjusted marker. We evaluate the proposed approach in several simulation studies by varying the variance of the measurement error and the strength of association between marker and risk of disease, and illustrate it in real data.
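The time-dependent case and control definition above can be sketched as follows. This is only an illustration of that definition applied to an observed marker: it does not implement the joint model or the measurement error adjustment, and it simply drops individuals censored before t, which is a simplifying assumption. The function name and the toy data are hypothetical.

```python
from typing import Sequence


def auc_at_time(
    times: Sequence[float],
    events: Sequence[int],
    markers: Sequence[float],
    t: float,
) -> float:
    """AUC at time t with cases = event observed at or before t and
    controls = still event-free beyond t.

    Assumption: individuals censored at or before t (event == 0, time <= t)
    are excluded rather than reweighted.
    """
    cases = [m for T, e, m in zip(times, events, markers) if e == 1 and T <= t]
    controls = [m for T, m in zip(times, markers) if T > t]
    if not cases or not controls:
        raise ValueError("no cases or no controls at this time point")
    wins = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(controls))


if __name__ == "__main__":
    # Toy follow-up data, invented for illustration.
    follow_up = [2, 3, 5, 6, 8, 9]          # time of event or censoring
    event = [1, 1, 0, 1, 0, 0]              # 1 = clinical endpoint observed
    baseline_marker = [2.1, 1.5, 0.7, 1.9, 0.4, 1.6]
    print(auc_at_time(follow_up, event, baseline_marker, t=4))
```

In the proposed approach, the observed baseline marker passed to such a function would be replaced by the estimated baseline random effect from the joint model.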

Results
The proposed measurement error adjusted marker performs better than the observed marker when compared against the true area under the ROC curve (AUC), with low bias and high coverage percentages.

Conclusion
An observed marker could underestimate the true diagnostic effectiveness due to measurement error, and hence useful markers might be overlooked. The proposed methodology effectively adjusts for measurement error when evaluating the diagnostic effectiveness of a marker.

The increasing availability of diagnostic tests and biomarkers is accompanied by an increase in health economic evaluations of these tests. However, such evaluations are typically complex and model-based because tests primarily affect health outcomes indirectly and real-world data on health outcomes are often lacking. General frameworks for conducting and reporting health economic evaluations are available but not specific enough to cover the intricacies of diagnostic test evaluation. In addition, certain aspects relevant to the evaluation may be unknown, and therefore unintentionally omitted from the evaluation. This leads to a loss of transparency, replicability, and (consequently) a loss of quality of such evaluations.

Objectives
To address the abovementioned challenges, this study aims to develop a comprehensive reporting checklist.

Methods
This study consisted of three main steps: 1) the development of an initial checklist based on a scoping review; 2) review and critical appraisal of the initial checklist by four independent experts; 3) development of a final checklist. Each item from the checklist is illustrated using an example from previous research.

Results
The scoping review followed by critical review by the four experts resulted in a checklist containing 43 items which ideally should be considered for inclusion in a model-based health economic evaluation. The extent to which these items were included, or discussed, in the studies identified in the scoping review varied substantially, with 13 items not being mentioned in ≥47 (75%) of the included studies.

Conclusion
As the importance of health economic evaluations of diagnostic tests and biomarkers is increasingly recognized, methods to increase their quality are necessary. The checklist developed in this study may contribute to improved transparency and completeness of such model-based health economic evaluations. Use of this checklist is encouraged to enhance their interpretation, comparability, and indirectly the validity of the results.

To investigate the impact of using DiagnOSAS, a screening tool that predicts the risk of OSAS in individuals suspected of this condition to guide PSG referral decisions, on health outcomes and costs, and to assess its cost-effectiveness in the Netherlands compared to usual care (no screening tool).

P39
Methods
A Markov cohort model was constructed to assess cost-effectiveness of the prediction tool in men aged 50 years. The diagnostic process of OSAS was simulated with and without the use of the DiagnOSAS tool, taking into account the risks and consequences of the most severe OSAS effects: car accidents, myocardial infarction and stroke. Base case cost-effectiveness was based on equal time to OSAS diagnosis with and without the use of the prediction tool. In a scenario analysis cost-effectiveness was assessed assuming that the prediction tool would halve this time to diagnosis.
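For readers unfamiliar with Markov cohort models, the sketch below shows the general mechanics: a cohort is distributed over health states, moved between them each cycle by a transition matrix, and costs and QALYs are accumulated with discounting. Everything in it is hypothetical (state names, transition probabilities, costs, utilities, the discount rate and the function name); none of it is taken from the DiagnOSAS model.

```python
from typing import List, Sequence, Tuple


def run_markov_cohort(
    start: Sequence[float],
    transition: Sequence[Sequence[float]],
    cost_per_cycle: Sequence[float],
    utility_per_cycle: Sequence[float],
    n_cycles: int,
    discount_rate: float = 0.0,
) -> Tuple[float, float, List[float]]:
    """Return (total cost, total QALYs, final state distribution) per person."""
    for row in transition:
        if abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each row of the transition matrix must sum to 1")
    state = list(start)
    total_cost = 0.0
    total_qaly = 0.0
    for cycle in range(n_cycles):
        weight = 1.0 / (1.0 + discount_rate) ** cycle  # discount factor
        total_cost += weight * sum(s * c for s, c in zip(state, cost_per_cycle))
        total_qaly += weight * sum(s * u for s, u in zip(state, utility_per_cycle))
        # Move the cohort one cycle forward: new_j = sum_i state_i * P[i][j]
        state = [
            sum(state[i] * transition[i][j] for i in range(len(state)))
            for j in range(len(state))
        ]
    return total_cost, total_qaly, state


if __name__ == "__main__":
    # HYPOTHETICAL three-state model with yearly cycles; all numbers invented.
    states = ["undiagnosed OSAS", "treated OSAS", "dead"]
    transition = [
        [0.70, 0.28, 0.02],
        [0.00, 0.99, 0.01],
        [0.00, 0.00, 1.00],
    ]
    cost, qaly, final = run_markov_cohort(
        start=[1.0, 0.0, 0.0],
        transition=transition,
        cost_per_cycle=[500.0, 800.0, 0.0],
        utility_per_cycle=[0.70, 0.85, 0.0],
        n_cycles=10,
        discount_rate=0.03,
    )
    print(f"cost per patient: {cost:.0f}, QALYs per patient: {qaly:.2f}")
    print(dict(zip(states, (round(x, 3) for x in final))))
```

Running the function once per strategy (with and without the tool) and differencing the totals gives the incremental costs and QALYs on which a cost-effectiveness comparison is based.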

Results
Base case results show that, within a 10 year time period, DiagnOSAS saves €226/patient at the cost of a negligible decrease in health outcomes (<0.01 quality-adjusted life years (QALYs)), resulting in an incremental cost-effectiveness ratio of €56,997/QALY. In the scenario with time-to-diagnosis halved, DiagnOSAS dominates usual care (i.e. is both cheaper and more effective). For a willingness-to-pay threshold of €20,000/QALY the probability that using DiagnOSAS is cost-effective equals 91.7% (base case) and 99.3% (time-to-diagnosis halved), respectively.

Conclusion
DiagnOSAS appears to be a cost saving alternative for the usual OSAS diagnostic strategy in the Netherlands. When this prediction tool succeeds in decreasing time-to-diagnosis, it could substantially improve health outcomes as well.

Background: Existing meta-analyses of depression screening tool accuracy have treated clinician-administered semi-structured diagnostic interviews and lay-administered fully structured diagnostic interviews as equivalent reference standards for assessing major depressive disorder (MDD). Semi-structured interviews are akin to a guided diagnostic conversation. Standardized questions are asked, but interviewers may insert additional queries and use clinical judgment to decide whether symptoms are present. In contrast, fully structured interviews are fully scripted. Standardized questions are read verbatim, without additional probes. Fully structured interviews are considered potentially more reliable but possibly less valid for MDD classification.
Objectives: To compare estimates of diagnostic test accuracy of the Patient Health Questionnaire-9 (PHQ-9) depression screening tool when semi- versus fully structured interviews are used as the reference standard.
Methods: Due to selective cutoff reporting within primary studies (Levis et al, AJE, 2017), we used an individual participant data meta-analysis approach to compare accuracy of the PHQ-9 across reference standards, using accuracy results for all cutoffs for all studies rather than only published results. Electronic databases were searched for datasets that compared PHQ-9 scores to MDD diagnosis based on validated interviews. For PHQ-9 cutoffs 5-15, we estimated pooled sensitivity and specificity separately among studies using semi-structured and fully structured interviews as the reference standard.
Results: Data were obtained from 43 of 53 eligible studies, for a total of 14,405 participants (1,763 MDD cases). Specificity estimates were similar across reference standards (within 2%); however, sensitivity estimates were 5-22% higher (median=18%, at standard cutoff of 10) when semi-structured interviews were used as the reference standard compared to fully structured interviews (Table 1).
Conclusion: The PHQ-9 classifies patients more accurately when semi-structured rather than fully structured interviews are used as the reference standard. Meta-analyses of depression screening tool accuracy should take into consideration potential differences in reference standards.
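With individual participant data, sensitivity and specificity can be computed at every cutoff rather than only at published ones. The sketch below does this for a single dataset, treating a PHQ-9 score at or above the cutoff as screen-positive. The function name and toy data are assumptions; pooling across studies, as done in the abstract, would require a meta-analytic model that is not shown here.

```python
from typing import Dict, Iterable, Sequence, Tuple


def accuracy_by_cutoff(
    scores: Sequence[int],
    has_mdd: Sequence[int],
    cutoffs: Iterable[int] = range(5, 16),  # PHQ-9 cutoffs 5 to 15
) -> Dict[int, Tuple[float, float]]:
    """Map each cutoff to (sensitivity, specificity) within one dataset."""
    n_cases = sum(1 for d in has_mdd if d)
    n_noncases = len(has_mdd) - n_cases
    if n_cases == 0 or n_noncases == 0:
        raise ValueError("need at least one MDD case and one non-case")
    result = {}
    for cutoff in cutoffs:
        true_pos = sum(1 for s, d in zip(scores, has_mdd) if d and s >= cutoff)
        true_neg = sum(1 for s, d in zip(scores, has_mdd) if not d and s < cutoff)
        result[cutoff] = (true_pos / n_cases, true_neg / n_noncases)
    return result


if __name__ == "__main__":
    # Toy data, invented for illustration (1 = MDD on the reference interview).
    phq9 = [3, 7, 10, 12, 15, 20, 4, 9]
    mdd = [0, 0, 1, 0, 1, 1, 0, 1]
    for cutoff, (sens, spec) in accuracy_by_cutoff(phq9, mdd).items():
        print(cutoff, round(sens, 2), round(spec, 2))
```

Running the same function separately on datasets that used semi-structured and fully structured interviews gives the per-study inputs for comparing reference standards.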

P41
Are semi-structured and fully structured diagnostic interviews equivalent reference standards for major depression? An individual participant data meta-analysis comparing diagnoses across diagnostic interviews

Background: Existing meta-analyses of depression screening tool accuracy have treated clinician-administered semi-structured diagnostic interviews and lay-administered fully structured diagnostic interviews as equivalent reference standards for assessing major depressive disorder (MDD). Semi-structured interviews are akin to a guided diagnostic conversation. Standardized questions are asked, but interviewers may insert additional queries and use clinical judgment to decide whether symptoms are present. In contrast, fully structured interviews are fully scripted. Standardized questions are read verbatim, without additional probes. Fully structured interviews are considered potentially more reliable but possibly less valid for MDD classification. No studies have assessed whether semi- and fully structured interviews differ in the likelihood that MDD will be diagnosed.
Objectives: To evaluate the association between interview method and odds of MDD diagnosis, controlling for depressive symptom scores and participant characteristics.
Methods: We analysed data collected for an individual participant data meta-analysis of Patient Health Questionnaire-9 (PHQ-9) diagnostic accuracy. Binomial Generalized Linear Mixed Models with a logit link were fit. An interaction between interview method and PHQ-9 scores was assessed.

Introduction: Evaluations of the impact of malaria rapid diagnostic tests (RDTs) have shown beneficial effects of RDTs on intermediate process outcomes, such as reduced time to diagnosis and treatment, but limited impact on later stage patient outcomes, such as morbidity and mortality.
These unclear benefits could be partly due to shortcomings in study design and factors influencing intervention fidelity (extent to which the test-treatment intervention is delivered as designed). We aim to critically review the designs, outcome measures and intervention fidelity of studies evaluating the impact of malaria RDTs on patient-important outcomes and explore factors that may influence intervention fidelity. Methods: We are conducting a systematic review of quantitative and qualitative studies. We have searched relevant electronic databases and grey literature and included studies based on predefined inclusion criteria. To evaluate the methodological quality of included studies, we are using the revised Cochrane risk of bias tool for randomized studies (RoB 2.0), the revised Cochrane risk of bias tool for non-randomised studies of interventions (ROBINS-I), the checklist to assess implementation (Ch-IMP) for assessing the quality of intervention delivery and an adaptation of the Critical Appraisal Skills Programme (CASP) tool for qualitative studies. Two authors have reviewed the search output and are currently extracting data and assessing methodological quality independently, resolving any disagreements by consensus. We will synthesize information from quantitative studies narratively and through descriptive statistics and use a thematic framework analysis approach for qualitative studies. Results and Discussion: Our electronic searches yielded 2731 hits of which 123 studies (quantitative (n=72) and qualitative (n=51)) have been included for data extraction. We will present the review results, including a graphical classification of key methodological issues affecting malaria RDT impact studies (see logic framework in Fig. 1) with discussion of considerations for selecting and interpreting a particular study design.

Introduction: Evaluating the impact of diagnostic tests on patients' health is complex.
Due to the multiple steps involved between the decision to administer a test and effect on patient's health, a broad range of outcomes can be measured in studies that evaluate the impact of tests on a patient's health, and various forms of bias can be introduced along this pathway. The revised Cochrane risk of bias tool for randomized studies (RoB 2.0), and that for non-randomised studies of interventions (ROBINS-I), focus on risk of bias (RoB) assessment in general but do not point out issues specific to test-treatment interventions which are a distinct type of complex intervention. We describe our experience in using the Cochrane RoB tools to investigate bias in primary studies evaluating the impact of malaria rapid diagnostic tests (RDTs) on patient-important outcomes. Methods: We searched relevant electronic databases and grey literature and included studies based on predefined inclusion criteria. We included any primary randomized or non-randomized study that compared a malaria RDT with one or more other diagnostic tests for malaria, with an aim of measuring the impact of these tests or strategies on patient-important outcomes. We are currently extracting data and using the ROB 2.0 tool for randomized studies and the ROBINS-I for non-randomised studies of interventions to assess RoB of included test treatment studies. Two authors have reviewed the search output and are currently extracting data and assessing RoB independently, resolving any disagreements by consensus. We will present our assessment of RoB across each domain and overall RoB results for included studies narratively, graphically and by descriptive statistics. Results and Discussion: Our data set contains 27 randomised studies and 22 non-randomised studies. During the conference we will present our RoB results as well as discuss special considerations for investigating the RoB in test-treatment studies. 
Background: For a clinician, diagnostic test results alone are not informative unless they are able to estimate the prevalence in their setting. Diagnostic Test Accuracy (DTA) reviews facilitate the interpretation of the pooled test performance by applying a pre-test probability to a hypothetical cohort. However, it is unknown what methods are used in DTA reviews to select the target condition's pre-test probability. Objectives: To assess what methods are used in Cochrane DTA reviews for selecting a pre-test probability to demonstrate a test's performance using summary sensitivity and/or specificity. Methods: DTA reviews were selected from the Cochrane Library on 2nd February 2018. Reviews were eligible when a pooled or summarized accuracy measure was provided. Data were extracted by one author and checked by a second author. Preliminary results: Of 81 DTA reviews, 59 were eligible, comprising 307 meta-analyses. The following methods for selecting a pre-test probability were observed: using one point estimate from in-

Preliminary conclusions: This is an ongoing study and updated results and conclusions will be presented during the conference. No consensus currently exists on what method should be used to select a representative pre-test probability. However, it is probably more informative to use multiple pre-test probabilities from data included for analyses (e.g. a point estimate and a measure of dispersion). Multiple pre-test probabilities could facilitate interpretation of the test's performance by clinicians in their own practice.

Background
Overdiagnosis can be defined as a screen-detected cancer that would not have been detected in the absence of screening. It can be estimated by the "excess-incidence" in the screened arm of a controlled trial but no consensus exists on how exactly this should be done.

Objectives
To determine the potential biases associated with excess-incidence estimates of overdiagnosis under different scenarios in the screening and post-screening period.

Methods
Cancer was assumed to progress from an undetectable state, to a pre-clinical state and finally to a clinical state as first described by Zelen and Feinleib (1969). Screening participants were categorised on the basis of their state before screening started, and of whether the disease progressed to the clinical state during the study, in the period following screening (relevant cases) or not at all (overdiagnosed). Standard mathematical manipulations were used to assess the sources of bias under four different scenarios: 1) excess in the screening arm immediately following the end of screening; 2) removing the prevalence round cases; 3) complete follow-up; 4) complete follow-up but where trial participants access screening after the trial finishes.

Results
Excess incidence in the screening arm at the end of screening (scenario 1) is biased upwards for overdiagnosis as it includes prevalent and incident round cases with clinical disease that would have arisen after screening. Scenario 2 is biased as it fails to include all relevant and overdiagnosed cancers. Scenario 3 is unbiased but only in the absence of screening in the period after the end of the trial as per scenario 4.

Conclusion
Estimates of overdiagnosis using the excess-incidence approach are subject to bias. Studies that follow up both cohorts provide unbiased estimates but only if screening is not accessed in the post-screening period.

Background: Shortcomings in study design have been hinted at as one of the possible causes of failures in translation of discovered biomarkers into clinical use, but systematic assessments of biomarker studies are scarce. Objective: We wanted to document study design features of recently reported evaluations of biomarkers in ovarian cancer. Methods: We performed a systematic search in PubMed (MEDLINE) for recent reports of studies evaluating the clinical performance of putative biomarkers in ovarian cancer. We extracted data on design features and study characteristics. Results: Our search resulted in 1,026 studies; 329 (32%) were found eligible after screening, of which we evaluated the first 200. Of these, 93 (47%) were single center studies. The median sample size was 156 (minimum 13 to maximum 50,078). Few studies reported eligibility criteria (17%), sampling methods (10%) or a sample size justification or power calculation (3%). Studies often used disjoint groups of patients, sometimes with extreme phenotypic contrasts; 46 studies included healthy controls (23%), but only 5 (3%) had exclusively included advanced stage cases. Conclusions: Our findings confirm the presence of suboptimal features in recent evaluations of the clinical performance of ovarian cancer biomarkers, and the need for a greater awareness of these issues. Accordingly, this may lead to premature claims about the clinical value of these markers or the risk of discarding other potential biomarkers that are urgently needed.

Background: Prediction models for cancer have been developed into tools to aid GP decision-making on referral of symptomatic patients in primary care.
These include mouse-mats, flip-charts, an electronic system for the Risk Assessment Tool (RAT) and an electronic system for Qcancer. Although these tools are available to GPs in the UK, an exploration of their effectiveness and any validation of the underlying prediction models are lacking. Objectives: To discuss the impact of available evidence on informing decisions on when prediction models are ready for use in practice, using examples from our recent systematic review of the clinical effectiveness of cancer risk prediction tools to aid decision making in primary care.

P47
Methods: We conducted two systematic reviews to assess: 1) the effectiveness of tools, 2) the validation of the prediction models. The systematic reviews identified evidence on any tool/prediction model that met our inclusion criteria, but here we focus on just the two models (and associated tools) already available to GPs: Qcancer and RATs. Electronic databases were searched, hits double-screened, data extracted and risk of bias of included studies was assessed. Results: 2 studies investigated the effectiveness of the RATs tool, one suggesting an increase in rapid referrals and investigations with the tool, the other suggesting no impact on time to diagnosis compared to no use of the tool. We found no studies investigating the impact of the Qcancer tool. The majority of Qcancer prediction models had been validated externally, by researchers not involved in the development of the models, and showed good performance. We did not find any external validation of the RATs models. Conclusion: Prediction models for cancer diagnosis in primary care are available for GPs to use, but neither has been fully evaluated or validated. We will highlight these gaps and discuss implications for further work and policy-making. Background Interobserver variability studies estimate variation between results in which more than one observer interprets the same data. Agreement in medical imaging interpretation is very important, particularly whether radiologists agree on the presence or absence of disease in an imaging dataset. Generally, interobserver variability study results are shown in a table, using statistical measures such as kappa and percentage agreement. A table format is however very limiting when presenting data from multiple observations made in the same patient especially where data includes disease location.

Objectives
We propose two graphical representations to better encapsulate the results of complex interobserver variability studies and to improve data accessibility.

Method
We performed a preliminary analysis of data from an interobserver variability study of small bowel ultrasound in diagnosing and staging Crohn's disease, performed as part of a larger diagnostic accuracy study (the METRIC trial). A subset of recruited patients underwent two ultrasound examinations performed and interpreted by two different radiologists. Radiologists documented the presence or absence of disease in 10 pre-defined bowel segments. For the reference standard, an expert consensus panel decided the patient disease status based on all clinical data collated during six months of patient follow-up. We developed novel graphical methods to present the interobserver data.

Results
We will present two different graphical presentations from the analysis we have completed. One shows where observers agreed and disagreed on disease location with the consensus panel results. The other shows agreement and disagreement by disease location separately for disease positive and disease negative locations. We also give examples of how this method could be extended to other similar scenarios.

Conclusion
Graphical representation of interobserver variability could improve understanding of the results and may provide more informative results than current summary statistics alone.

Background: MODY is a rare, young-onset, genetic form of diabetes. Diagnosing MODY is important to ensure appropriate treatment, but identifying MODY patients is challenging. Diagnostic testing is expensive, prohibiting universal testing. We developed the MODY probability calculator (https://www.diabetesgenes.org/mody-probability-calculator/, >34000 visitors to date), a validated model that calculates probability of MODY based on clinical features, to help clinicians prioritise which patients to refer for diagnostic testing. Objectives/Methods: To assess the use of the MODY calculator in the real world setting: 1) its performance in a population cohort of patients diagnosed <30y (n=1407), 2) its utility in clinical referrals sent to the Exeter molecular genetics diagnostic laboratory for MODY testing (n=1285) between 1/8/14 and 31/12/17. Results: 1) In the population cohort, 51/1407 (3.6%) were diagnosed with MODY; 1293 (45 MODY) had sufficient data to calculate their MODY probability. The model performed well (ROC AUC=0.9) and showed good calibration (Hosmer-Lemeshow p=0.24). 39/397 (10%) individuals with probabilities >3.6% had MODY (87% sensitivity, 69% specificity for this cutoff). 14/21 (67%) individuals with >75% probability had MODY (31% sensitivity, 99% specificity). 2) In the diagnostic laboratory, 621/1285 (48%) referrals reported use of the calculator. Referrals that stated they had used the calculator had a higher pick-up rate of MODY than those that did not (33% v 25%, p=0.002). MODY probability could be calculated on 425/664 referrals that did not use the calculator. The mean probability was lower compared with referrals that had used the calculator (16.5% v 42.6%, p<0.001).
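The sensitivity and positive predictive values quoted in the Results can be recomputed from the reported counts. The following is a minimal sketch, assuming that the 45 MODY cases with a calculable probability form the denominator for sensitivity. The function names are illustrative and are not part of the MODY calculator.

```python
def sensitivity(true_positives, total_cases):
    """Proportion of MODY cases whose probability exceeds the cutoff."""
    return true_positives / total_cases


def positive_predictive_value(true_positives, total_above_cutoff):
    """Proportion of individuals above the cutoff who have MODY."""
    return true_positives / total_above_cutoff


# Counts taken from the Results: 45 MODY cases had a calculable probability.
MODY_CASES = 45

# Cutoff >3.6%: 39 of the 397 individuals above the cutoff had MODY.
sens_low = sensitivity(39, MODY_CASES)
ppv_low = positive_predictive_value(39, 397)

# Cutoff >75%: 14 of the 21 individuals above the cutoff had MODY.
sens_high = sensitivity(14, MODY_CASES)
ppv_high = positive_predictive_value(14, 21)

print(f"cutoff >3.6%: sensitivity {sens_low:.0%}, PPV {ppv_low:.0%}")
print(f"cutoff >75%:  sensitivity {sens_high:.0%}, PPV {ppv_high:.0%}")
```

Raising the cutoff trades sensitivity for a higher pick-up rate among those referred, which is the trade-off a referring clinician faces when choosing a threshold.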
Conclusion: The MODY model appears to work well in a population setting, although analysis was limited by small numbers of MODY patients. The MODY model is frequently being used prior to sending referrals for MODY testing. Referrals that use the calculator appear to be more appropriate, with higher probabilities and a higher pick up rate of MODY.

Background
Biological variability (BV) studies aim to measure variability in a biomarker between and within individuals. Knowledge of BV allows the potential for a biomarker to diagnose and monitor disease to be assessed. Sample sizes for BV studies involve stating numbers of participants (n1), observations per participant (n2) and repeat assessments of each observation (n3). Little guidance exists to compute these values.
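One way to explore how n1, n2 and n3 affect precision is to simulate the nested design repeatedly and look at the spread of the resulting estimates. The sketch below is not the authors' method. It assumes a balanced design, normally distributed between-participant, within-participant and analytical components, and classical nested ANOVA estimators; all parameter values in the usage example are hypothetical.

```python
import random
from math import sqrt


def simulate_bv(n1, n2, n3, sd_g, sd_i, sd_a, mu=100.0, rng=None):
    """Simulate n1 participants x n2 observations x n3 repeat assessments."""
    rng = rng or random.Random()
    data = []
    for _ in range(n1):
        g = rng.gauss(0.0, sd_g)          # between-participant effect
        person = []
        for _ in range(n2):
            w = rng.gauss(0.0, sd_i)      # within-participant effect
            person.append([mu + g + w + rng.gauss(0.0, sd_a) for _ in range(n3)])
        data.append(person)
    return data


def estimate_sds(data):
    """Nested ANOVA estimates (sd_a, sd_i, sd_g); needs n1, n2, n3 >= 2."""
    n1, n2, n3 = len(data), len(data[0]), len(data[0][0])
    cell_means = [[sum(cell) / n3 for cell in person] for person in data]
    person_means = [sum(means) / n2 for means in cell_means]
    grand_mean = sum(person_means) / n1

    ss_a = sum(
        (y - cell_means[i][j]) ** 2
        for i, person in enumerate(data)
        for j, cell in enumerate(person)
        for y in cell
    )
    ss_i = n3 * sum(
        (cell_means[i][j] - person_means[i]) ** 2
        for i in range(n1)
        for j in range(n2)
    )
    ss_g = n2 * n3 * sum((m - grand_mean) ** 2 for m in person_means)

    ms_a = ss_a / (n1 * n2 * (n3 - 1))
    ms_i = ss_i / (n1 * (n2 - 1))
    ms_g = ss_g / (n1 - 1)

    # Negative variance estimates are truncated at zero.
    var_a = ms_a
    var_i = max(0.0, (ms_i - ms_a) / n3)
    var_g = max(0.0, (ms_g - ms_i) / (n2 * n3))
    return sqrt(var_a), sqrt(var_i), sqrt(var_g)


def estimate_range(n1, n2, n3, sd_g, sd_i, sd_a, n_sims=200, seed=1):
    """Range (min, max) of each estimated SD over repeated simulations."""
    rng = random.Random(seed)
    estimates = [
        estimate_sds(simulate_bv(n1, n2, n3, sd_g, sd_i, sd_a, rng=rng))
        for _ in range(n_sims)
    ]
    names = ("sd_a", "sd_i", "sd_g")
    return {
        name: (min(e[k] for e in estimates), max(e[k] for e in estimates))
        for k, name in enumerate(names)
    }


if __name__ == "__main__":
    # Hypothetical values: compare 10 versus 40 participants.
    for participants in (10, 40):
        print(participants, estimate_range(participants, 4, 2, 10.0, 5.0, 2.0))
```

Comparing the printed ranges for different choices of n1, n2 and n3 gives a rough, precision-based view of candidate sample sizes.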

Results
Increasing participants decreases the range of estimates for σA, σI and σG; increasing observations decreases the range of σA and σI, whereas the range of estimates of σG appeared constant. Increasing assessments decreases the range of σA, with the range of σI and σG unchanged. Increasing participants and observations decreases the range of estimates of II and RCV. II was overestimated with few participants. Increases in assessments made little change in the range of estimates of II and RCV. We have produced a Shiny app which allows the precision of estimates to be assessed for given parameter values: https://alicesitch.shinyapps.io/bvs_simulation/.

Conclusion
Sample size decisions for BV studies can use a precision-based approach. Changing numbers of participants, assessments and observations impacts on the precision of different estimates. Increasing the number of participants increases precision for all estimates. Simulation of the range of results obtained for a given sample size can guide planning of studies.

Background: Systematic and/or random errors in test measurement (collectively 'measurement uncertainty') can result from various factors along the testing pathway, from the time of day a test sample is taken to the specific platform used for sample analysis. The consequence of this uncertainty is that any observed test value may differ from the underlying 'true' target value. Crucially, although this uncertainty can significantly affect clinical accuracy and utility, it is rarely considered in test outcome/impact studies.

Objectives:
To identify current methodology utilized in studies assessing the impact of measurement uncertainty on test outcomes (including clinical accuracy, clinical utility and cost-effectiveness).

Methods:
A literature review, using MEDLINE, Embase, Web of Science and Biosis, was used to identify relevant studies published in the last 10 years. Subsequent citation tracking was conducted to identify additional material (published at any date). Ongoing data extraction is focused on identifying study aims, methods (in particular the components of measurement uncertainty addressed, data sources, input values and distributional assumptions) and the impact of measurement uncertainty on baseline results.

Results:
Based on interim findings, 45 studies conducted across a range of settings and indications have been identified. The majority utilize simulation techniques to explore the impact of measurement bias (systematic error) and/or imprecision (random error) on clinical accuracy or utility. Typically these draw on an 'error model' in which bias is assumed fixed and imprecision normally distributed, e.g.: [where CV = coefficient of variation and N(0,1) = a random draw from a normal distribution (mean 0, standard deviation 1)]. Both bias and imprecision have been reported to have a significant impact on test outcomes within these studies. Conclusions: Analysis of the final results will enable identification of key methodological considerations for future applications and research in this field. Clinical guidelines recommend cardiovascular disease (CVD) risk assessment to identify patients who will benefit from lifestyle advice +/-drug therapy. As current risk tools are not perfect, high-sensitivity cardiac troponin (hs-cTn), an independent predictor of CVD risk, has been suggested to improve risk classification for individuals at average risk.

Objectives:
To assess whether the clinical performance of risk assessment tools including hs-cTn supports further investigation for clinical use.

Methods:
We searched MEDLINE to identify studies comparing the performance of validated CVD risk assessment tools in the adult general population when adding hs-cTn. We extracted data on troponin, risk tools, risk categories, risk of bias, and performance measures: discrimination, calibration and reclassification. We summarised the proportions correctly upgraded (TP) or downgraded (TN) and falsely upgraded (FP) or downgraded (FN) with the addition of hs-cTn. We used the treatment threshold of 10% 10-year CVD risk to dichotomise low versus high risk. We calculated the number needed to screen (NNS) to avoid one additional cardiovascular event. We defined the minimum acceptable troponin model performance to support further investigation as a higher TP rate and/or a higher TN rate. We weighted the potential benefit of an additional TP higher than that of an additional TN. In case of a trade-off we considered fewer than 10 FP per 1 TP acceptable.
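The reclassification summary described above can be sketched as follows. This is an illustrative reading of the Methods, not the authors' code: it assumes that people who later had an event count as TP when moved above the 10% threshold and FN when moved below it, and that people without an event count as FP and TN respectively. The function name and all counts are hypothetical, and the NNS calculation is omitted because the abstract does not give its formula.

```python
def reclassification_summary(events_up, events_down, n_events,
                             nonevents_up, nonevents_down, n_nonevents,
                             max_fp_per_tp=10):
    """Summarise reclassification across a single risk threshold.

    events_up / events_down: people with an event moved above / below the
    threshold by the extended model (TP / FN).
    nonevents_up / nonevents_down: people without an event moved above /
    below the threshold (FP / TN).
    """
    net_tp_rate = (events_up - events_down) / n_events
    net_tn_rate = (nonevents_down - nonevents_up) / n_nonevents
    # FP per TP; infinite when nobody with an event was upgraded.
    fp_per_tp = nonevents_up / events_up if events_up else float("inf")
    return {
        "tp": events_up,
        "fn": events_down,
        "fp": nonevents_up,
        "tn": nonevents_down,
        "net_tp_rate": net_tp_rate,
        "net_tn_rate": net_tn_rate,
        "fp_per_tp": fp_per_tp,
        "fp_tp_acceptable": fp_per_tp < max_fp_per_tp,
    }


# Hypothetical counts for illustration only.
summary = reclassification_summary(
    events_up=30, events_down=10, n_events=200,
    nonevents_up=120, nonevents_down=200, n_nonevents=1800,
)
print(summary)
```

A positive net TP rate or net TN rate corresponds to the minimum acceptable performance described in the Methods; the FP per TP ratio matters only when the two rates move in opposite directions.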

Results:
Two studies reported adequate data for our analysis. Both reported a net improvement in TP and TN rate and modest reduction in NNS (Table 1). Neither reported troponin model performance in an external validation population.

Conclusion:
Two studies provide consistent evidence that including troponin in two different CVD risk assessment tools improves discrimination of patients who will/will not develop CVD to guide management at the risk threshold of 10% 10-year risk. These findings warrant further investigation, including external validation, for assessment of clinical benefits, harm and cost-effectiveness.

P57
Using clinical registry data to develop outcome prediction models in spine surgery L. Background: Spine surgeons need to be able to make evidence-based predictions on the outcome of surgery. The risks (e.g. complications) and benefits (e.g. pain alleviation) of treatment modalities have to be adequately communicated to their patients. There is a lack of validated prognostic tools to support spine surgeon and patient decisions in daily practice. Evidence on predictors for surgical outcomes is available; however, to date no studies have developed comprehensive clinical prediction models for spine surgery.

Objectives:
To use data from a spine unit collected within a large clinical spine registry to develop outcome prediction models for patients undergoing surgery after lumbar disc herniation.

Methods:
We built lasso regression models to identify relevant predictors and estimate the parameters for 12-month-outcome prediction of a quality of life (QoL) score, back and leg pain scores, surgical complications, and patient satisfaction. A freely available online prognostic tool was developed to present the predicted outcomes for individual patients, based on their pre-operative characteristics.
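To illustrate how lasso regression selects predictors, the sketch below fits a linear lasso by coordinate descent with soft-thresholding. It is not the authors' implementation: it assumes a continuous outcome (such as a pain score), a squared-error loss scaled by 1/(2n), and a fixed penalty rather than one tuned by cross-validation. The data and function names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=100):
    """Minimise (1/2n) * RSS + penalty * sum(|beta_j|) by coordinate descent.

    X is a list of rows, y a list of outcomes. Returns (intercept, betas).
    """
    n, p = len(X), len(X[0])
    # Centre predictors and outcome so the intercept is not penalised.
    y_mean = sum(y) / n
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [value - y_mean for value in y]

    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of predictor j with the residual that excludes j.
            rho = 0.0
            for i in range(n):
                others = sum(Xc[i][k] * beta[k] for k in range(p) if k != j)
                rho += Xc[i][j] * (yc[i] - others)
            rho /= n
            scale = sum(Xc[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, penalty) / scale if scale > 0 else 0.0

    intercept = y_mean - sum(b * m for b, m in zip(beta, x_means))
    return intercept, beta


# Hypothetical data: the first predictor is strong, the second is weak.
X_demo = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y_demo = [2.0, -2.0, 0.1, -0.1]
intercept_demo, beta_demo = lasso_coordinate_descent(X_demo, y_demo, penalty=0.1)
print(intercept_demo, beta_demo)
```

With this penalty the weak predictor's coefficient is set exactly to zero and the strong one is shrunk, which is the variable-selection behaviour that motivates using the lasso to identify relevant predictors.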

Results:
Data from 1127 patients (mean age 49yrs, 42% female) were used for model development. Number of previous spine surgeries, insurance class (private vs general), body-mass index, and preoperative leg pain were the strongest outcome predictors in most models. The R² of the models ranged from 0.16 to 0.21. A preliminary online tool was programmed for QoL and pain scores (Fig. 1).

Conclusion:
Clinical use of the tool requires further validation. Temporal validation of the models in the same spine unit is underway. Prospective collection of additional factors is planned to improve prediction precision. The main challenges include how and when to update models in ongoing data collection; generalisability of models to other clinics; and the limitation of this observational single-arm cohort study, which does not allow treatment comparisons.

P58
Latent class meta-analysis in a Cochrane review for improving accuracy estimates in the absence of a perfect reference standard Karen R. The slope of a plot of observed against estimated outcomes is a useful validation statistic for prediction models, for example when used as part of a four-part ABCD approach to validation. The slope is often referred to as a calibration slope. While some authors have used "calibration" to mean overall calibration, others have reserved the term "calibration in the large" for overall calibration and used calibration to mean the accuracy of a prediction rule at a more detailed level.

Objectives
To review current use and interpretation of the calibration slope, and compare it to the behaviour of the calibration slope in practice.

Methods
1. We searched for papers published in 2016 and 2017 using the calibration slope, and analysed the text to determine whether authors interpreted it as a measure of calibration, discrimination, or not explicitly as either. 2. We studied the behaviour of the calibration slope first in artificial examples, and secondly in a re-analysis of a previously published paper.

Results
In 40 papers using calibration slope, 30 (75%) interpreted it explicitly as a measure of calibration, 1 interpreted it explicitly as a measure of discrimination, and 9 as neither. Proof-of-concept examples show that the calibration slope can remain constant as calibration varies. In a real example the calibration slope correlates with the c-statistic (r=0.95; p<0.001) but not calibration-in-the-large (r=0.016, p>0.9).
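The observation that the calibration slope can stay constant while calibration varies can be illustrated with a small artificial example. The sketch below is a simplification and not the authors' analysis: it assumes a continuous outcome, estimates the slope by ordinary least squares of observed on predicted values, and uses the mean difference between observed and predicted values as calibration-in-the-large. The data are hypothetical.

```python
def ols_intercept_slope(x, y):
    """Ordinary least squares fit of y on x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def calibration_summary(predicted, observed):
    """Calibration slope, intercept, and calibration-in-the-large."""
    intercept, slope = ols_intercept_slope(predicted, observed)
    mean_error = sum(observed) / len(observed) - sum(predicted) / len(predicted)
    return {"slope": slope, "intercept": intercept, "mean_error": mean_error}


# Hypothetical outcomes and two sets of predictions.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
well_calibrated = [1.0, 2.0, 3.0, 4.0, 5.0]
overpredicting = [p + 2.0 for p in well_calibrated]  # every prediction 2 too high

good = calibration_summary(well_calibrated, observed)
shifted = calibration_summary(overpredicting, observed)
print(good)
print(shifted)
```

Both sets of predictions give a slope of 1, yet the second set overpredicts every outcome by 2 units. Only the intercept and the mean error reveal the difference, which is why the slope alone does not quantify calibration.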

Conclusions
Although the calibration slope is useful when used in combination with the intercept, it does not in itself quantify calibration. Many authors inadvertently fail to quantify calibration, by depending on the calibration slope alone. To prevent misunderstanding, and to promote the use of better strategies for prediction model validation, the term "calibration slope" should be retired in favour of a less misleading alternative.

P60
Sharing and use of biomarker discovery data and computations for research and education; Using R to improve reproducibility of data analyses Marc A. T. Teunis 1 , Jan Willem Lankhaar 2,3 , Eric Schoen 4 , Shirley Kartaram 1 , Raymond Pieters 1,5

Introduction
Working with big(ger) datasets has become an essential part of Life Sciences research. Due to the existence of a large amount of open data in this field it has become pivotal to use workflows that support the data analysis process in a reproducible way. Here we demonstrate such a workflow. We conclude that the combination of using literate programming, a self-written R-package to contain all data and analyses and the use of Git/Github.com greatly enhances reproducibility and the ability to share and publish the project work.

Methods
For our analytics workflow we used the statistical programming language R [1]. In this research project, young adult men were requested to cycle four different training protocols on a bike ergometer. Rest conditions were used as a control. Before, during and after exercise, blood and saliva were collected from the volunteers. Before and after cycling, the volunteers also donated a urine sample. Biological samples were analyzed for a range of biomarkers including hormones, cytokines and blood cells. Furthermore, metabolome and transcriptome analyses were performed. The main research question addressed in this study was whether we can use a subset of these biomarkers to classify the amount of exercise that was delivered. To answer this question we analyzed the data with multi-level statistical models and supervised machine learning.

Results
Due to the data-intensive nature of the project and the fact that many laboratories were involved, R was used in all phases of the analytics cycle. We implemented the 7 Guerrilla Analytics principles posed by Enda Ridge [2]. These principles help to maintain a link between the original data and the data in a curated and combined dataset, and ensure reproducibility of analyses and visualizations.

Conclusions
Here we demonstrate that these principles can be implemented in an R package, thereby contributing to the reproducibility of this research. We present our machine-learning experiments to illustrate the package.

Articles that met the eligibility criteria were included in the review. A PRISMA chart was used to depict the number of articles searched and included in the review. Data Analysis: Data will be collected from eligible articles using the data collection form developed by the authors. Data extracted from the articles will include: the method applied or proposed, important assumptions, case studies used, and the strengths and weaknesses of each method. The information obtained will be synthesised qualitatively.

Conclusion:
The review will describe novel methods using case studies, articulate the strengths and weaknesses of the methods and develop recommendations for their use.

Background
Heart failure (HF) is a major public health problem with rising prevalence, especially in the elderly. Survival rates for advanced HF patients are worse than those for breast or prostate cancer. Two decades of biomarker research highlighted the prognostic ability of certain markers, and informed the development of new or updated prognostic models.
Despite numerous published models and NICE's recognition of the need for prognosis information, no risk stratification models have been adequately established, nor have the quality of the models and the evidence they present been systematically brought together and tested.

Objectives
We hypothesise that HF-related biomarkers may add value to the traditional prognostic factors for HF clinical outcomes, independent of other co-morbidities present. We aim to test this hypothesis through a series of systematic reviews assessing the evidence on HF prognostic models using novel meta-analysis (MA) methodology and relevant critical appraisal tools.

Methods
We follow Cochrane methodology. Published search filters were combined for a sensitive literature search. Prognostic models including at least one HF-related biomarker were eligible. Independent pairs of co-authors carried out screening and data extraction. Based on the CHARMS and PROBAST checklists we considered model development studies with and without external validation in independent data, and model updating studies. MA will be carried out using recently published novel methodology.

Results
Searches yielded over 40,000 titles, highlighting the need for tighter, updated prognostic search filters. A pilot screening of 10% of these (i.e. 4,000) returned only 2% for full-text screening, with an ultimate estimate of 150 included models for evaluation.

Conclusions
This is a complex time constrained project with potential to advise on future HF prognostic model design; contribute to improved HF clinical management; apply recently developed MA methodology for combining prognostic model data, and inform the project for developing Cochrane methodology standards of prognostic model reviews.

Acknowledgements
Project funded by the British Heart Foundation (grant no. PG/17/49/33099)

Appendix
Planned systematic reviews
SR1: Characteristics and methodological quality of prognostic models in HF
SR2: Characteristics and methodological quality of studies exploring the added prognostic value of the biomarkers
SR3: Model validation quality and prediction accuracy of prognostic models in HF
SR4: Meta-analysis of the performance of prognostic models externally validated
SR5: Impact assessments of prognostic models in HF

P63
The incremental value of biomarkers to the Revised Cardiac Risk Index to predict major cardiac events and overall mortality after noncardiac surgery: a systematic review and meta-analysis Lisette M. Background Diagnosis in primary care can be challenging; many early symptoms of cancer are non-specific and low risk. Clinicians may use 'routine' blood tests in such patients for reassurance, assuming negative tests represent absence of disease. Diagnosis is a two-step process; the first Bayesian step is the clinicians' decision to perform a test; the second is the test result itself.

Objectives
To determine incidence of cancer in primary care populations with inflammatory marker (C-reactive protein, ESR and plasma viscosity), or platelet tests using primary care records.

Methods
Two independent prospective cohort studies of 40,000 and 200,000 UK primary care patients using the Clinical Practice Research Datalink (CPRD). The primary outcome for both studies was 1-year cancer incidence.

Results
For context, NICE recommends urgent cancer investigations or referral for patients with cancer risk of 3% or above. For inflammatory markers those with positive tests had a 1-year cancer incidence (PPV) of 2.80%, test negatives 1.28% and untested 0.84%, the last of these being marginally below expected figures from National Cancer Registry and Analysis Service (NCRAS). For platelets those with positive tests had 1-year cancer incidence of 7.84%, test negatives 2.82%, and population baseline from NCRAS 1.41%. For both tests a significant gender difference was demonstrated; men with normal inflammatory markers have 1.75% 1-year cancer incidence, compared to 0.98% for women; men with normal platelet count have 4.1% 1-year cancer incidence, compared to 2.2% in women.

Conclusions
These results demonstrate a clear Bayesian phenomenon in selection of patients for simple testing in primary care. The selection process identifies a group at significantly higher risk, with this additional risk not wholly eliminated by a negative result. This phenomenon demonstrates the need for clinical vigilance with negative test results. We anticipate a similar phenomenon occurs with other test results, and may occur in secondary care.
Background: Quality assessment of included studies is a crucial step in any systematic review (SR). Review and synthesis of prediction modelling studies is an evolving area and a tool facilitating quality assessment for prognostic and diagnostic prediction modelling studies is needed. Objectives: To introduce PROBAST, a tool for assessing the risk of bias and applicability of prediction modelling studies in a SR. Methods: A Delphi process, involving 40 experts in the field of prediction research, was used until agreement on the content of the final tool was reached. Existing initiatives in the field of prediction research such as the REMARK and TRIPOD reporting guidelines formed part of the evidence base for the tool development. The scope of PROBAST was determined with consideration of existing tools, such as QUIPS and QUADAS-2. Results: After six rounds of the Delphi procedure, a final tool was developed which utilises a domain-based structure supported by signalling questions similar to QUADAS-2. PROBAST assesses the risk of bias and applicability of prediction modelling studies. Risk of bias refers to any shortcomings in the study design, conduct or analysis leading to systematically distorted estimates of predictive performance or an inadequate model to address the research question. The predictive performance is typically evaluated using calibration, discrimination and sometimes classification measures. Assessment of applicability examines whether the prediction model development or validation study matches the systematic review question in terms of the target population, predictors, or outcomes of interest.

Background: Recent years have witnessed the development of Health Technology Assessment (HTA) methods for use by developers and public bodies to assess potential cost-effectiveness at the early stages of device development.
Objectives: 1) To provide an overview of current methods used; and 2) To identify issues and needs for future key methodological development in early health technology assessment.
Methods: Rapid review methods will be used to identify published methods papers and literature reviews related to early HTA by searching relevant electronic databases including MEDLINE, EMBASE, The National Health Services Economic evaluation database (NHS EED), the Cochrane library, and Econlit. Contacts will be made with research groups who have published early HTA work in both the UK and the Netherlands to identify relevant unpublished papers. Inclusion criteria will be research and review papers that report early HTA methods, as well as commentaries describing or discussing early HTA methods, published in English. The overview will extract data from papers to answer the below questions:

Background
Prediction models for people with type 2 diabetes are often static models, using only a person's risk profile at a single time point. As the cumulative amount of HbA1c ('glycemic burden') was found to be associated with various diabetes outcomes, including information on dynamics of HbA1c over time may increase the accuracy of predicting nephropathy over using a single HbA1c measurement only.

Objectives
To compare a 'static' prediction model based on Cox regression analysis to a joint modelling approach for the prediction of diabetic nephropathy, using a single or repeated HbA1c measurements respectively.

Methods
This study included 7616 people with type 2 diabetes from the Hoorn Diabetes Care System cohort, who were followed annually from 1998 onwards. Nephropathy was defined as macroalbuminuria.
For the Cox regression only the baseline HbA1c value was taken into account. For the joint model, repeated measurements of HbA1c were used. In both models, baseline variables sex, age, diabetes duration, systolic blood pressure, BMI, triglycerides and total cholesterol were included as other predictors. All variables were standardized before joint model analysis. Results were expressed in terms of hazard ratios and Harrell's C statistic for discrimination.

Results
In total, 394 (5.3%) people developed nephropathy during a mean follow-up of 6.3 (±4.9) years. In both models, sex, age, systolic blood pressure, BMI, triglycerides and HbA1c were independent predictors of diabetic nephropathy. Specifically, the hazard ratio for HbA1c was equal to 1.14 (95% CI 1.

Planning a clinical trial always involves uncertainty. On the one hand, the sample size calculation requires assumptions, which may prove to be false during the ongoing trial. On the other hand, it can be of interest to modify design aspects during the study (provided these modifications are pre-specified in the study protocol). While for treatment studies there are plenty of methods for such adaptive study designs, for diagnostic accuracy studies there are almost no methods for adaptive designs. Accordingly, no diagnostic accuracy trials with an adaptive design could be found in the literature. Since diagnostic trials fail very often because of wrong assumptions, there is a clear need to develop methods for adaptive designs in the field of diagnostic trials. An example is the recently published diagnostic trial from Waugh et al. [1]. In the talk I will present different settings where adaptations would be helpful in diagnostic trials and I will distinguish between blinded and unblinded sample size re-estimation. Furthermore, drawing from the literature on adaptive designs in the field of treatment studies, I will show where existing methods can be transferred to diagnostic trials [2,3]. Regarding the remaining blind spots, I will present existing and new methods specific to diagnostic trials [4].

P70
Confronting uncertainties in prognosis: Statistics should support an approach to clinical decision-making based on "Prepare for the worst. Hope for the best. Bet according to the odds" M Power 1 , BC Lendrem 2 , AJ Allen 2 , T Fanshawe 3 , J To develop an evidence-based approach to improving the utility of prognostic information.

Methods
We synthesized key literature on presenting and using prognosis information, and developed a prototype for visualizing prognostic statistics and confronting the inherent uncertainties, using ovarian cancer survival data from SEER as an example.

Results
Quantifying prognosis Can be expressed as:

Objectives
To compare methods to account for treatment use during follow-up when developing a prognostic model.

Methods
A prognostic (Cox) model to predict 5-year mortality risk without using selective beta-blockers was developed using the electronic health record data of 1905 patients (585 deaths). We compared 5 methods to account for selective beta-blocker use during follow-up: (i) excluding treated individuals, (ii) censoring treated individuals, (iii) inverse probability of censoring weighting after censoring treated individuals, (iv) including treatment as a binary covariate in the model and (v) including treatment use as a time-varying covariate in the model. The comparisons were repeated in a highly-treated patient subset and with a simplified prognostic model.

Results
324 (17%) patients began using selective beta-blockers during follow-up. The coefficients of the prediction models varied according to each modelling method. Excluding treated individuals resulted in a model that provided, on average, slightly higher predictions compared to a model that ignored treatment. However, these differences did not translate to substantial differences in predictive performance (c-statistic, calibration slope, Brier score) in any of the analyses.

Conclusion
Treatment hardly affected predictive performance in our case study. Despite theoretical advantages of certain methods to account for treatment use, in practice the actual benefit of applying these methods may be small. Further case studies and simulations are needed to investigate when it is necessary to take into account the effect of treatment when developing a prognostic model.

Objectives
To systematically review the evidence and assess the relative performance of clinical prediction models in the evaluation of patients with early rheumatoid arthritis.

Methods
A systematic review of studies describing the development, external validation and impact of eligible clinical prediction models was conducted in accordance with PRISMA guidelines and current best practice for undertaking prognostic reviews. Data on predictive performance were described in a narrative synthesis, presented separately for internal and external validation studies. Evidence synthesis using meta-analysis was considered for external validation studies.

Results
Twenty-two model development studies and one combined development/external validation study reporting 39 clinical prediction models for three relevant outcomes were included. Five external validation studies evaluating eight models for radiographic joint damage were also included. C statistics for radiographic progression outcomes (different definitions) ranged between 0.63 and 0.87 (n=8) and between 0.78 and 0.82 for Health Assessment Questionnaire (HAQ) outcomes (n=2). For models that had been externally validated, predictive performance varied considerably, suggesting unexplained heterogeneity in the populations in which the models are being tested. Three models (ASPIRE-CRP, ASPIRE-ESR, BeSt) were validated using the same outcome definition in two external populations. The random effects meta-analysis suggested the most favourable performance across external validations was for BeSt (C statistic 0.72, 95% CI: 0.20, 0.96). However, for all models, there is substantial uncertainty in the expected predictive performance in a new sample of patients, indicating that we cannot be confident that the performance of the models is better than would be expected by chance.
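A random effects meta-analysis of C statistics can be sketched as below. The abstract does not state which estimator or interval method was used, so this sketch makes its own assumptions: C statistics are pooled on the logit scale, between-study variance is estimated with the DerSimonian-Laird method, and the interval is a normal-approximation 95% interval. The very wide interval reported for BeSt suggests a more conservative interval method than this one. All input values in the usage example are hypothetical.

```python
from math import exp, log, sqrt


def logit(p):
    return log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + exp(-x))


def dersimonian_laird(estimates, std_errors):
    """Random effects pooling; returns (pooled, standard error, tau squared)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(weights)
    fixed = sum(w * e for w, e in zip(weights, estimates)) / sum_w
    # Cochran's Q measures dispersion around the fixed-effect estimate.
    q = sum(w * (e - fixed) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1.0 / (se ** 2 + tau2) for se in std_errors]
    sum_re = sum(re_weights)
    pooled = sum(w * e for w, e in zip(re_weights, estimates)) / sum_re
    return pooled, sqrt(1.0 / sum_re), tau2


def pool_c_statistics(c_stats, logit_std_errors, z=1.96):
    """Pool C statistics on the logit scale and back-transform."""
    pooled, se, tau2 = dersimonian_laird(
        [logit(c) for c in c_stats], logit_std_errors
    )
    return {
        "c_statistic": expit(pooled),
        "ci": (expit(pooled - z * se), expit(pooled + z * se)),
        "tau2": tau2,
    }


# Hypothetical external validations of one model.
result = pool_c_statistics([0.66, 0.78], [0.10, 0.12])
print(result)
```

With only two validation studies the between-study variance is poorly estimated, which is one reason pooled performance in a new population remains so uncertain.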

Conclusion
Meta-analysis was limited by the small number of external validation studies and the results do not provide a definitive conclusion about performance of the models in future studies. Reasons for the heterogeneity in performance could not be explored. Uncertainty remains over the optimal prediction model(s) for use in clinical practice.

Some fifty years ago, Schwartz and Lelouch discussed how therapeutic trials can try to resolve two very different problems. The first set of so-called explanatory trials aims at understanding a treatment, seeking to discover whether it causes benefit and/or if difference exists between two treatments. The second set of pragmatic trials aims at decision-making: these trials try to answer the question which treatment is preferable under usual clinical circumstances. The difference affects the definition of the treatments, the choice of study participants, and the way in which the treatments are compared. The PRagmatic Explanatory Continuum Indicator Summary (PRECIS-2) was later developed to further clarify this distinction. Diagnostic accuracy studies evaluate the performance of a test in correctly identifying those with and without the target condition, by comparing index test results with those of the clinical reference standard. We argue that, like therapeutic trials, diagnostic accuracy studies can also try to answer two different questions. One set, explanatory accuracy studies, aims at understanding how different conditions affect the distribution of test results. A second set, aimed at decision-making, evaluates the consequences of relying on the test's results for clinical management: pragmatic accuracy studies. Confusingly, both types of studies often present their findings in terms of sensitivity and specificity or the area under the ROC curve. The difference between pragmatic and explanatory diagnostic accuracy studies cannot be simplified as a matter of design-related bias and applicability. It has implications for the definition of the index test, the eligibility criteria and recruitment of study participants, the choice of the study outcomes, the analysis of results, and the interpretation. We present the building blocks for PRECIS-DTA, a tool that can be used in the design, analysis, interpretation, reporting and communication of diagnostic accuracy studies.

Oral Presentations
Objectives: To assess the relation between study characteristics and the results of external validation studies of prognostic models. Methods: We searched electronic databases for systematic reviews of prognostic models. Reviews from non-overlapping clinical fields were selected if they reported common performance measures (concordance (c)-statistic or ratio of observed over expected number of events (OE ratio)) from ten or more validations of the same model. From the included validation studies we extracted study design features, population characteristics, methods of predictor and outcome assessment, and the aforementioned performance measures. Random effects meta-regression was used to quantify the association between study characteristics and model performance.
Results: We included ten reviews, describing a total of 224 validations. Associations between study characteristics and model performance were heterogeneous across reviews. C-statistics were most associated with population characteristics and measurement of predictors and outcomes, e.g. validation in a continent different from the development study resulted in a higher c-statistic, compared to validation in the same continent (difference in logit c-statistic 0.10 [95% CI 0.04, 0.16]), and validations with eligibility criteria comparable to the development study were associated with higher c-statistics compared to narrower criteria (difference in logit c-statistic 0.21 [95% CI 0.07, 0.35]). Using a case-control design was associated with higher OE ratios, compared to using cohort data (difference in log OE ratio 0.97 [95% CI 0.38, 1.55]). Conclusion: Variation in performance of prognostic models appears mainly associated with variation in case-mix, study design, and predictor and outcome measurement methods. Researchers validating prognostic models should carefully take these study characteristics into account when interpreting the achieved performance of prognostic models.

Background: It is widely recommended that any developed (diagnostic or prognostic) prediction model is externally validated in terms of its predictive performance measured by calibration and discrimination. When multiple validations have been performed, a systematic review followed by a formal meta-analysis helps to summarize overall performance across multiple settings, and reveals under which circumstances the model performs suboptimally and may need adjustment. Objectives: To discuss how to undertake meta-analysis of the performance of prediction models with either a binary or a time-to-event outcome.
Methods: We address how to deal with incomplete availability of study-specific results (performance estimates and their precision), and how to produce summary estimates of the c-statistic, the observed:expected ratio and the calibration slope. Furthermore, we discuss the implementation of frequentist and Bayesian meta-analysis methods, and propose novel empirically based prior distributions to improve estimation of between-study heterogeneity in small samples. Finally, we illustrate all methods using two examples: meta-analysis of the predictive performance of EuroSCORE II and of the Framingham Risk Score. All examples and meta-analysis models have been implemented in our newly developed R package metamisc.
Results: Information on model discrimination and calibration was often incomplete, but could be restored for most studies. Although the proposed meta-analysis models yielded similar summary estimates, the Bayesian approach allows for more accurate estimation of between-study heterogeneity when few studies are included in the meta-analysis. Conclusion: Meta-analysis of prediction models is a feasible strategy despite the complex nature of corresponding studies. As developed prediction models are being validated increasingly often, and as the reporting quality is steadily improving, we anticipate that evidence synthesis of prediction model studies will become more commonplace in the near future. The R package metamisc is designed to facilitate this endeavor, and will be updated as new methods become available.
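As an illustration of what restoring incomplete study-level results can involve, the sketch below recovers an approximate standard error for a logit-transformed c-statistic from a reported 95% confidence interval, and an approximate standard error for the log observed:expected ratio from event counts using a delta-method (binomial) approximation with the expected count treated as fixed. This is a hedged Python sketch and not the metamisc implementation; the function names and the example numbers are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def logit_c_from_ci(c_stat, lower, upper, z=1.959964):
    """Logit c-statistic and its standard error recovered from a reported CI.

    Assumes the interval is approximately symmetric on the logit scale.
    """
    se = (logit(upper) - logit(lower)) / (2 * z)
    return logit(c_stat), se

def log_oe_from_counts(observed, expected, n):
    """Log O:E ratio and a delta-method standard error.

    Treats the observed count as binomial with n participants and the
    expected count as fixed, so var(log O) is approximately (1 - O/n) / O.
    """
    log_oe = math.log(observed / expected)
    se = math.sqrt((1 - observed / n) / observed)
    return log_oe, se

# Hypothetical validation study
print(logit_c_from_ci(0.75, 0.70, 0.80))
print(log_oe_from_counts(observed=60, expected=50, n=1000))
```

The transformed estimates and standard errors can then be passed to any standard random-effects meta-analysis routine and back-transformed for reporting.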

O4
Are Background: Cochrane has been publishing Diagnostic Test Accuracy (DTA) reviews for 10 years, and is close to publishing its 100th DTA review. The methods and reporting of Cochrane DTA reviews were designed to ensure they address patient management questions by providing evidence summaries suitable for incorporation in clinical guidelines.
Objectives: To assess the extent to which Cochrane DTA reviews have been incorporated in clinical guidelines; identify which guideline developers and topics are most likely to make use of Cochrane DTA evidence; note key features of reviews most cited.

Background
In April 2018 WHO held the first meeting of the Strategic Advisory Group of Experts in In Vitro Diagnostics (SAGE IVD) to define the methods that will be used to create a Model List of Essential in vitro Diagnostics (EDL). The EDL intends to provide evidence-based guidance, and set a reference for the development of national lists of essential IVDs. The initial EDL meeting looked at existing WHO guideline recommendations for tests for TB, HIV, hepatitis B and C, malaria and syphilis.

Objective
To identify key methodological challenges for WHO evidence review methods to support the EDL process.

Methods
Full reports for tests for TB, HIV, hepatitis B and C, malaria and syphilis were identified and methods and reporting compared against a 20-item framework developed from PRISMA-DTA, the Cochrane DTA Handbook, GRADE guidance and discussions of experts. One of three DTA experts reviewed guidance and noted both good practice and notable differences.

Results
Nine evaluations for TB, 8 for hepatitis B and C, 1 each for malaria and syphilis, and 16 for HIV were identified and reviewed. Methods and themes identified where harmonisation is required include: replacing the PICO question with one suited to test accuracy; emphasis on comparative accuracy; the value of protocols; the role of indirect evidence; assessment of risk of bias and applicability; use of existing systematic review evidence; statistical methods; reporting consequences of test use for accuracy evidence; evaluation of evidence beyond accuracy; grading and assessing the strength of evidence. We will illustrate key issues with examples.

Discussion
Standardisation of WHO evidence review methods for tests is needed to support development of the EDL. We will report the progress of the SAGE IVD in deciding on a methodological approach, and on the key outstanding methodological areas that require further development and research.

O6
Harnessing individual participant trial data alongside electronic health records to evaluate the potential of precision medicine: application to type 2 diabetes drug therapy
Background: Individual participant data from randomised trials are increasingly available for researchers to answer secondary research questions. Repositories include YODA and Clinical Study Data Request. There may be great potential to harness these data to evaluate potential precision medicine approaches. We propose a framework involving discovery analysis in routine clinical data followed by validation in trials, and apply this to evaluate a precision medicine approach to predict good response to type 2 diabetes drug therapy.
Methods: Discovery analysis: We included 30,511 patients with type 2 diabetes starting either an SGLT-2 inhibitor (SGLT2i) or a DPP4 inhibitor (DPP4i) in routine clinical data from the UK (CPRD). Associations between clinical measures and glycaemic response (6-month HbA1c minus baseline HbA1c) to each drug were evaluated individually using linear regression. Validation: From YODA, we pooled individual participant data from 6 randomised drug efficacy trials of SGLT2i, of which 2 had a DPP4i comparator arm (n=3929). In the pooled trial data we tested clinically relevant associations observed in CPRD using multivariable three-level (trial-patient-study visit) linear mixed-effects models.
Results: In CPRD, we identified key clinical features associated with differential response to the two drugs (Table 1). Higher baseline HbA1c was associated with greater response (a reduction in HbA1c) to both drugs, but to a greater extent with SGLT2i. Greater SGLT2i response was associated with higher eGFR and lower HDL. Greater DPP4i response was associated with lower triglycerides and lower BMI. All associations replicated in trial data (Table 1). Conclusion: The availability of individual trial data from repositories such as YODA and Clinical Study Data Request provides a tremendous opportunity to evaluate potential precision medicine approaches. Discovery in routine data followed by validation in trial data provides a principled framework to utilise trial data without data-mining. Our findings using this framework suggest there may be potential to develop prediction models for drug response in type 2 diabetes.

O7
Tailoring prediction models for use in new settings: Individual participant data meta-analysis for ranking model recalibration
Background: The availability of individual participant data (IPD) from multiple sources allows the external validation of a prediction model across multiple settings and populations. When applying an existing prediction model in a new population it is likely that it will suffer from some over- or underfitting, potentially causing poor predictive performance. However, rather than discarding the model outright, it may be possible to modify components of the model to improve its performance using model recalibration methods. Here, we consider how IPD meta-analysis methods can be used to compare and select the most appropriate recalibration method, or whether a completely new model is warranted in a particular setting. Methods: We examine four methods for recalibrating an existing logistic prediction model in cardiovascular disease across multiple centres: (i) re-estimation of the intercept, (ii) adjustment of the linear predictor as a whole (calibration slope), (iii) adjustment of individual heterogeneous predictor effects, and finally (iv) re-estimation of all model parameters. We use multivariate IPD meta-analysis to jointly synthesise calibration and discrimination performance across centres for each of the methods. The most appropriate recalibration method can then be evaluated based on the joint probability of achieving a given model performance in a new setting, using this to rank recalibration methods.
Results: We present a new Stata package allowing estimation of the joint probability of achieving a set level of model performance in a new setting for each recalibration method, therefore easily identifying the method with the highest probability. We show that the best recalibration method is case specific and promote the use of recalibration as opposed to developing new models unnecessarily when the probability of improved performance through recalibration is high.
Conclusions: Multivariate meta-analysis allows quantification of the most appropriate recalibration methods to improve the performance of an existing prediction model in new settings.

O8
Risk model based stratified patient management of cardiac chest pain versus uniform "non-invasive first" strategies: A summary of short term findings from the CE-MARC2 randomised trial

Background
In early diagnostic test evaluations the potential benefits of the introduction of a new technology in the current healthcare system are assessed in the challenging situation of limited empirical data. These evaluations provide tools to evaluate which technologies should progress to the next stage of evaluation.

Objectives
We aim to identify new approaches within the Bayesian framework for care pathway analysis for early test evaluations.

Methods
In this study a diagnostic test for patients suffering from Chronic Obstructive Pulmonary Disease (COPD) was evaluated with Bayesian networks, which provide a compact visualization of probabilistic dependencies and interdependencies. The structure of the network was inferred from the care pathway, a schematic representation of the journey of a patient in the healthcare system. After the network was inferred and reduced with arc reversal techniques, it was populated using expert judgement elicitation. The Bayesian network was then queried to evaluate whether the introduction of the test could reduce unnecessary hospital admissions. Uncertainty analyses were used to determine credible intervals for the comparison between the current and new pathway, and to identify influential parameters of the decision problem.

Results
We found that the adoption of the diagnostic test had the potential to reduce the number of missed COPD exacerbations of symptoms that could lead to late hospital admissions, and of unnecessary visits to A&E. The model inputs that most influenced the posterior distribution were identified as the probability that a patient would go to A&E if an exacerbation was suspected, the probability that the healthcare professionals in primary care refer patients to the hospital, and the sensitivity of the test.

Conclusion
These results are useful to companies to inform the choice of the target population, of potential early adopters and the identification of the technological focus to guide development of the test.

Background
The methodological challenge in low prevalence situations is that a classical diagnostic accuracy design requires large sample sizes to estimate sensitivity with adequate precision. Reducing sample sizes without introducing risk of bias is challenging.
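The scale of the problem can be made concrete with the usual normal-approximation sample size formula for a proportion: the number of diseased participants needed is z² × Se × (1 − Se) / d², and the total sample is that number divided by the prevalence. The sketch below is illustrative only; the function names and the example inputs (expected sensitivity 0.90, half-width 0.05, prevalence 2%) are assumptions and do not come from the abstract.

```python
import math
from statistics import NormalDist

def cases_needed(expected_sensitivity, half_width, confidence=0.95):
    """Diseased participants needed to estimate sensitivity to +/- half_width."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    variance = expected_sensitivity * (1 - expected_sensitivity)
    return math.ceil(z ** 2 * variance / half_width ** 2)

def total_sample_size(expected_sensitivity, half_width, prevalence, confidence=0.95):
    """Total participants needed in a classical (single cohort) accuracy design."""
    cases = cases_needed(expected_sensitivity, half_width, confidence)
    return math.ceil(cases / prevalence)

# Hypothetical scenario: the same precision target at two prevalences
print(cases_needed(0.90, 0.05))
print(total_sample_size(0.90, 0.05, prevalence=0.20))
print(total_sample_size(0.90, 0.05, prevalence=0.02))
```

Under these assumptions the required number of diseased participants is fixed, so the total sample size grows in inverse proportion to prevalence: a tenfold lower prevalence requires roughly ten times as many participants.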

Objectives
To collate and discuss designs and methods of diagnostic accuracy studies which can be used in low prevalence situations.

Methods
We performed a literature search in four electronic databases (Cochrane Library, Embase, Medline, Web of Science), used backward citation tracking, and invited experts to identify studies with relevant designs or methods. Two reviewers independently included studies describing a study design or method for estimating diagnostic accuracy in a low prevalence situation. Studies on prognostic tests or impact studies of diagnostic tests were excluded. During a one-day meeting with the expert group, the list of methods was discussed and recommendations were formulated.

Results
We identified four designs for single binary tests, one design for multiple conditions, and one design for comparing two tests without verification of double negatives. The four designs for single binary tests were stratification design, two-phase design, case-control design, and nested case-control design. Figure 1 shows the classical diagnostic accuracy design and the six designs that could reduce the total number of patients or the number of patients undergoing the reference standard or index test.

Conclusion: There was clear evidence of overdiagnosis, but the degree to which we could quantify this was constrained. We will reflect on whether we could improve on this in future updates of the systematic review and what data would be required in order to do this. We will also consider how claims from screening advocates that overdiagnosis can be easily mitigated by improved radiological techniques can be tested.

Background: Although there are a variety of approaches to evaluating the accuracy of tests, the terms used to describe these approaches are limited and lack standardization. In parallel with ongoing research to develop a more rational and informative set of study design labels for test accuracy, we are investigating the use made of study design labels in the diagnostic guidance of one national policy making body, NICE.
Objectives: To describe the range of study design terms used and to investigate whether different weight is given to different study designs in the final guidance.
Methods: We will extend the approach used in past analysis of the methodological features of NICE guidance. All NICE Diagnostics Guidance and underpinning summaries of the evidence will be interrogated, focusing on tests used for diagnosis. We will abstract data on: the policy question addressed; the accuracy evidence found and the inclusion criteria for the reviews of it; the study design terms used to describe the evidence; the quality assessment process; whether the evidence was sub-divided by different study designs; and whether the final guidance recognized any differences in study design. Analysis will be qualitative.
Results: Earlier investigations suggest little use of study design terms to recognize differences in accuracy study design. We will extend these initial observations.
Conclusion: The lack of a series of study design terms which quickly and reliably convey study designs which have different levels of intrinsic bias is an important barrier to good reporting of accuracy studies. Such terms are also critical for good secondary research. Without them, all accuracy studies may be considered equal, with quality assessment tools being the only means to recognize varying threats to validity arising from different study designs. These tools have not usually been designed for this purpose.

Background: Since the most appropriate threshold at which to operate a test is usually a key clinical question, there is a need to move beyond standard meta-analysis methods which: (i) do not provide summary estimates of accuracy at each threshold and (ii) can only synthesise a single pair of sensitivity and specificity from each study, despite studies often reporting data at more than one threshold. Some more advanced methods have recently been proposed, notably that of Steinhauser et al (2016), but a limitation is the need to pre-specify the distributional form of test results in the diseased and disease-free populations.

Objectives:
To develop a meta-analysis model which (i) provides estimates of the sensitivity and specificity of a test across all thresholds, (ii) makes use of all available data, (iii) makes less restrictive assumptions about the distributional form of test results than recently proposed approaches, (iv) works directly with count data (numbers of patients with test results above each threshold), rather than requiring normal approximations.

Methods:
We describe a multivariate meta-analysis model for count data, that can take any number of counts from each study and explicitly quantifies how accuracy depends on threshold. The model allows for a flexible range of distributions of underlying test results by estimating a transformation parameter as part of the model. We fit the model in Bayesian statistical software such as WinBUGS or JAGS.
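To make the input to such a model concrete, the sketch below shows the kind of per-study count data involved: the numbers of diseased and disease-free patients with results above each reported threshold, converted into sensitivity and specificity at each threshold. It only prepares and checks the data and does not implement the multivariate model or its transformation parameter; the function names, thresholds and counts are hypothetical.

```python
def accuracy_by_threshold(thresholds, diseased_above, n_diseased, healthy_above, n_healthy):
    """Sensitivity and specificity at each threshold from counts above the threshold."""
    rows = []
    for t, tp, fp in zip(thresholds, diseased_above, healthy_above):
        sensitivity = tp / n_diseased
        specificity = 1 - fp / n_healthy
        rows.append((t, sensitivity, specificity))
    return rows

def counts_are_consistent(thresholds, counts_above):
    """Counts above a higher threshold can never exceed counts above a lower one."""
    ordered = [c for _, c in sorted(zip(thresholds, counts_above))]
    return all(a >= b for a, b in zip(ordered, ordered[1:]))

# Hypothetical study reporting results at two thresholds
thresholds = [100, 400]
diseased_above = [45, 30]   # out of 50 diseased patients
healthy_above = [40, 10]    # out of 100 disease-free patients

assert counts_are_consistent(thresholds, diseased_above)
assert counts_are_consistent(thresholds, healthy_above)
for row in accuracy_by_threshold(thresholds, diseased_above, 50, healthy_above, 100):
    print(row)
```

Because the counts at different thresholds come from the same patients, they are correlated within a study, which is why a model for such data needs to treat them jointly rather than as independent 2x2 tables.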

Results:
We demonstrate with a case study meta-analysis, quantifying the accuracy of B type natriuretic peptide in diagnosing acute heart failure.

Conclusion:
Our new meta-analysis model estimates the sensitivity and specificity of a continuous test at all thresholds, and does not require the analyst to pre-specify the distributional form of underlying test results. Further, the model does not require normal approximations, which can perform poorly in the presence of small counts.

Background
Prediction models are often developed on a single data set; therefore, performance in different settings and populations is frequently poor. If so, one may validate and update or tailor the model to the validation situation at hand, but this is not always feasible if performance is too poor and the validation data set is too small. We propose to use measures of generalizability already during the development of prediction models, when large clustered development data sets are available.

Objectives
The aim of our methodology is to produce developed models that are more robust when applied across different settings and populations, and to prevent the need for constant validation and tailoring to local settings.

Methods
We apply several measures, namely existing measures such as the coefficient of variation, Gini's mean difference and the pooled variance as well as newly developed measures, in a variable selection procedure for developing a prediction model until it attains optimal performance within and across different settings and populations.
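The three existing measures named above have standard definitions, sketched here in Python. How the authors apply them within variable selection is not described in the abstract, so the application shown (summarising the spread of cluster-specific c-statistics for a candidate model) is an assumption, as are the function names and the example values.

```python
from itertools import combinations
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

def gini_mean_difference(values):
    """Mean absolute difference over all distinct pairs of values."""
    pairs = list(combinations(values, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def pooled_variance(variances, sizes):
    """Within-cluster variances weighted by their degrees of freedom."""
    dof = [n - 1 for n in sizes]
    return sum(d * v for d, v in zip(dof, variances)) / sum(dof)

# Hypothetical c-statistics of one candidate model in five clusters
cluster_c = [0.71, 0.74, 0.69, 0.78, 0.73]
print(coefficient_of_variation(cluster_c))
print(gini_mean_difference(cluster_c))
```

Under this reading, smaller values of the coefficient of variation or Gini's mean difference would indicate more consistent performance across clusters for a given candidate model.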

Results
We illustrate our proposed approach by modelling 30-day mortality of patients in critical care units. Using independent validation samples for the developed models, we assess the Brier score, calibration slope and c-statistic of the models. We perform a meta-analysis of these performance statistics to assess generalizability of the prediction model (e.g. as quantified by the between-cluster heterogeneity).

Conclusion
Our new approaches can be used for prediction model development in large clustered data sets, to develop better generalizable prediction models.

O17
Understanding , compared with no screening. A Gaussian Process metamodel was fitted to this sample and discrete evolutionary programming was applied to determine the optimal screening strategy (GP-DEP approach) for different colonoscopy capacity constraints. Sample size of predefined strategies was varied (n=25-200) to assess GP-DEP performance using bootstrapping, brute force exhaustive search, and comparison with ASCCA outcomes. Results GP-DEP provided stable optimal screening strategies for sample sizes n>=100. Compared with ASCCA, LYG and costs of the optimal strategies from GP-DEP were accurate and slightly too high, respectively. However, performance ranking of strategies was similar according to ASCCA and GP-DEP. GP-DEP resulted in better screening strategies (higher number of LYG) compared to just evaluating predefined strategies, for different capacity constraints (see Fig. 1). For sample size n=100 average predicted benefit of the optimal strategy identified by GP-DEP compared to the best strategy identified by ASCCA equalled 0.028 LYG (95%CI 0.013-0.043) per individual.

Conclusion
It is feasible and beneficial to optimize rather than evaluate test impact. Optimization using a meta-model of the ASCCA model allowed fast identification of the optimal screening strategy, even when constraints apply, and outperformed the best screening strategy as typically identified from a limited sample of predefined strategies.

Background: Transportability of prediction models can be hampered when predictors are measured differently at development and (external) validation. This may occur, for instance, when predictors are measured using different cut-off points or when tests are produced by different manufacturers. While such heterogeneity in predictor measurement across development and validation seems very common, little is known about the impact it may have on the performance of prediction models at external validation. Objectives: To define effects of predictor measurement heterogeneity on external performance of prediction models, by taking a measurement error perspective to describe measurement heterogeneity. Methods: Using analytical and simulation approaches, we examined the external predictive performance of a clinical prediction model under different scenarios of heterogeneous predictor measurement, using a well-known taxonomy of measurement error models to recreate heterogeneity in measurement procedures. Results: Heterogeneity in measurements of predictors can have a large impact on the external predictive performance of a prediction model, often leading to worse but possibly to improved external predictive performance. This may result in either overfitted or underfitted prediction models, to extents that the prediction model may no longer be clinically useful. Furthermore, our simulation study showed that commonly recommended shrinkage strategies (e.g. Ridge regression) may either improve or worsen the impact of heterogeneity in measurement procedures on the external predictive performance.
Conclusion: Our work highlights measurement heterogeneity as an important explanation of unanticipated out-of-sample performance of clinical prediction models, as dissimilarities in the measurements of tests and markers between development and validation deteriorate the actual predictive power of the model at external validation.
Background: Trials of medical tests present a series of challenges in their set-up and management that differ from randomised controlled trials (RCTs) of interventions. Birmingham Clinical Trials Unit (BCTU) manages and provides statistical support for a wide range of test evaluation trials as well as RCTs of interventions.
Objective: To identify unique challenges in the set-up and management of trials of tests in order to improve future trial design and management.
Method: Within the CTU we set up a working group to review experience of ten trials of tests for diagnosis, staging, screening and monitoring. We identified themes where particular challenges were noted which did not occur or were different for RCTs of interventions.
Results: The ten studies covered bladder overactivity, chronic kidney disease, thyroid nodules, neoplasia in chronic colitis, maternal group B streptococcal colonisation, causes of pelvic pain, ovarian cancer, extent and activity of Crohn's disease, staging of lung & colorectal cancer, and staging and management in ovarian cancer. Tests included: PET-CT, CT, MRI and ultrasound, biomarker measurements, development and evaluation of biomarker panels and near patient and laboratory based IVDs. Ten topics were identified that appear unique or to have higher impact on test studies than intervention RCTs including specific issues in: ethics and governance, patient selection, recruitment, uncertainty of diagnostic results, test processes and pathways, sample preparation and measurements, reference standards, follow up, adverse effects and diagnostic impact. Discussion: While some of these themes also occur in RCTs, the relative importance or risks differ from those in test studies. These themes will be presented in more depth using examples from the ten trials and strategies used to resolve or minimise the impact in specific trials will be reviewed. Identifying challenges in these studies is important to enhance the design and conduct of future test studies.

The incorporation of early Health Technology Assessment (HTA) might be beneficial for the Medical Device (MD) industry; however, evidence that industry is conducting early HTA remains scarce. Objectives This study aims to develop an evidence-based framework to understand whether, and to what extent, early HTA might drive product success of small and large enterprises (SEs and LEs). Methods This research encompassed four stages (Fig. 1). We conducted a key-informant process (stage 1) where 25 international experts identified a list of emergent HTA themes that they believed were important to company success.
A sample of 22 European and US selected companies then reached consensus on a list of key themes through a robust Delphi process (stage 2). Finally, in stage 3, we constructed the 'MEDKET' checklist for SEs and LEs by defining and prioritizing key themes using comments and ratings from stages 1 and 2.

Results
We found that SEs perceived success as business continuity, whereas LEs identified success as large-scale utilization and patient/user value. 'MEDKET' for SEs and for LEs included, respectively, 21 and 15 items, with 9 overlapping themes. In both groups, success was driven by three item categories: (i) R&D processes (e.g. starting time of assessment activities); (ii) device outcome-measures (e.g.

Background: This study compares former obstetric care as usual (Expect I) with risk-dependent care using a prediction tool (Expect II). The Expect I study externally validated 39 prediction models using data of 2,614 women prospectively included from 2013 to 2015. Clinically useful models were embedded in a web-based prediction tool. Additionally, risk-dependent care paths were developed, resulting in antenatal care tailored to the outcomes of individual risk assessments. Risk-dependent care was embraced by a consortium of obstetric healthcare professionals in the Dutch province of Limburg. Methods: Women receiving risk-dependent care are being enrolled in a prospective multicenter cohort (Expect II). Primary outcomes are adherence of healthcare professionals and compliance of women to key recommendations; e.g. adequate calcium intake in all women (Expect I, adequate calcium intake in 34% of women) and low-dose aspirin treatment to women at increased risk of preeclampsia (Expect I, actual use in the high-risk group: 1.5%).
Preliminary results: Ten months after introduction our prediction tool is being used in an estimated 24-40% of pregnant women (Fig. 1).

Background: The United States Food and Drug Administration (FDA) granted accelerated approval for the check-point inhibitor, pembrolizumab, to treat patients with locally-advanced or metastatic solid tumours of any origin that are mismatch-repair deficient (dMMR) or microsatellite instability-high who have progressed after prior treatment and have no satisfactory alternative treatment options. The FDA has referred to this indication as "tissue/site agnostic", whereas in Australia, the Medical Services Advisory Committee has referred to this as a "pan-tumour" approach. This pan-tumour approach is new for health technology assessment groups. To date, evaluation of the (cost-)effectiveness and safety of both the targeted cancer drug and the companion diagnostic test have been assessed for specific biomarkers, such as HER2, EGFR, and BRAF, in patients with common cancer types, such as melanoma, breast, colorectal or lung cancer. For these applications, the evidence base would generally include at least one randomised trial comparing the effectiveness of the targeted treatment in either the test-positive population (including falsely-positive patients), or the whole cohort (including patients with either false-positive or false-negative results).

Objective:
To provide guidance on the evidence needed to evaluate pan-tumour applications.

Method:
We examined the effectiveness of dMMR testing for access to pembrolizumab in tumours of diverse origin.

Results and conclusion:
Pan-tumour populations include rare tumour types that are supported by minimal clinical evidence, such as single arm studies. There are differences in the standard of care for tumours arising from diverse sites of origin, and there are limited data for determining the accuracy of the diagnostic test in these different tumour types. Furthermore, the prevalence of dMMR was highly variable across tumour types, greatly affecting the clinical validity (PPV and/or NPV) of the test. We caution that the proportion of patients with false-positive and false-negative test results, and the consequent adverse treatment outcomes, must be considered per cancer.

Background: Diagnostic and prognostic prediction models often perform poorly when externally validated. The reasons for variation in performance across data samples are not fully understood. Objectives: We investigate how differences in the measurement of predictors across settings affect the discriminative power and transportability of a prediction model. Methods: Differences in predictor measurement between data sets can be described formally using a "measurement error" taxonomy. Using this taxonomy, we derive an expression relating variation in the measurement of a continuous predictor to the area under the curve (AUC) of a logistic regression prediction model. This expression is then used to demonstrate how variation in measurements across samples affects the out-of-sample discriminative ability of a prediction model. We illustrate these findings with a diagnostic model using example data of patients suspected of having deep vein thrombosis. Results: When a predictor, such as D-dimer, is measured with more noise in one setting compared to another, which we conceptualize as a difference in "classical measurement error", the AUC decreases (Fig. 1a).
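The effect of added measurement noise on discrimination can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' derivation: the predictor distributions, the noise level, the shift, and the function names `auc` and `simulate` are all hypothetical, and the rank-based AUC of a single continuous predictor is used as a stand-in for the AUC of a one-predictor logistic model. A constant shift applied equally to cases and non-cases is included for contrast.

```python
import random


def auc(cases, controls):
    """Rank-based AUC: chance that a random case scores above a random non-case."""
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def simulate(noise_sd=0.0, shift=0.0, n=400, seed=1):
    """AUC of a predictor observed with random noise and/or a constant shift."""
    rng = random.Random(seed)
    # Hypothetical true predictor values: cases centred at 1, non-cases at 0.
    cases = [rng.gauss(1.0, 1.0) for _ in range(n)]
    controls = [rng.gauss(0.0, 1.0) for _ in range(n)]

    def measure(x):
        # Observed value = true value + constant shift + random noise.
        return x + shift + rng.gauss(0.0, noise_sd)

    return auc([measure(x) for x in cases], [measure(x) for x in controls])


auc_exact = simulate()                 # predictor measured without error
auc_noisy = simulate(noise_sd=1.5)     # extra random noise in the new setting
auc_shifted = simulate(shift=2.0)      # same constant offset for everyone

print(f"no error: {auc_exact:.3f}  noisy: {auc_noisy:.3f}  shifted: {auc_shifted:.3f}")
```

Because the same seed generates the same true values in each call, any difference between the three AUCs comes from the measurement process alone.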
In contrast, constant, "structural", error does not affect the AUC of a logistic regression model, provided that the magnitude of the error is the same among cases and non-cases (Fig. 1b). As the differences in measurement methods (and in turn differences in measurement error) become more complex, it becomes increasingly difficult to predict how the AUC will be affected. Conclusion: When a prediction model is applied to a new sample, its discriminative ability can change if the magnitude or structure of the measurement error is not exchangeable between the two settings. This provides an important starting point for researchers to better understand how differences in measurement methods can affect the performance of a prediction model when externally validating or implementing it in practice.

Background: When designing a study to develop a new risk prediction model, researchers should ensure their sample size is adequate in terms of the number of participants (n) and events (E) relative to the number of predictor parameters (p) considered for inclusion in the model. Current sample size calculations are based on "rules of thumb", such as at least 10 events per predictor parameter (EPP), which receive much debate and criticism.

O29
Objectives: To produce a new sample size formula for studies developing a prediction model with either binary or time-to-event outcomes. Specifically, to identify in advance of data collection, the sample size needed to minimize the expected optimism in predictor effect estimates, and thus the expected shrinkage required after model development.
Methods: We derive a closed-form sample size formula, based on the heuristic uniform shrinkage factor of Van Houwelingen and Le Cessie. The formula allows researchers to identify n, p and EPP that correspond to an expected shrinkage factor close to 1, such as 0.9, that reflects low overfitting. It requires researchers to pre-specify the anticipated Cox-Snell R² of the model, and we show how to identify realistic values of R² based on published information (e.g. C statistic) for existing models in the same field. A suitable margin of error in other relevant estimates (e.g. overall risks) is also recommended. Results: We illustrate the approach using examples of diagnostic and prognostic prediction models. This shows that, to target an expected shrinkage factor of 0.9, a new diagnostic model for Chagas disease requires an EPP of 3.9 and a new prognostic model for recurrent venous thromboembolism requires an EPP of 23.
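The calculation described in the Methods can be sketched as follows. The abstract does not state the formula itself; the sketch assumes it takes the form n = p / ((S − 1) ln(1 − R²/S)), where S is the targeted shrinkage factor and R² the anticipated Cox-Snell R². The function names and all input values below are hypothetical and are not intended to reproduce the Chagas or venous thromboembolism examples.

```python
import math


def min_sample_size(p, r2_cs, shrinkage=0.9):
    """Participants needed for an expected uniform shrinkage factor of `shrinkage`.

    Assumed form: n = p / ((S - 1) * ln(1 - R2_CS / S)).
    """
    if not 0.0 < shrinkage < 1.0:
        raise ValueError("shrinkage must lie strictly between 0 and 1")
    if not 0.0 < r2_cs < shrinkage:
        raise ValueError("r2_cs must lie strictly between 0 and shrinkage")
    return p / ((shrinkage - 1.0) * math.log(1.0 - r2_cs / shrinkage))


def events_per_parameter(n, outcome_proportion, p):
    """EPP implied by n participants for a binary outcome."""
    return n * outcome_proportion / p


# Hypothetical inputs: 20 predictor parameters, anticipated R2 of 0.2,
# and an outcome proportion of 0.3.
n_required = min_sample_size(p=20, r2_cs=0.2, shrinkage=0.9)
epp = events_per_parameter(math.ceil(n_required), outcome_proportion=0.3, p=20)
print(math.ceil(n_required), round(epp, 1))
```

Under this form, a lower anticipated R² or a shrinkage target closer to 1 increases the required sample size, which is why the resulting EPP varies by model and setting.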
Conclusion: Blanket rules of thumb for sample size are inappropriate, and our alternative proposal allows sample size and EPP to be tailored to the particular model and setting of interest.

Background: We developed an R package diagmeta that implements our model for meta-analysis of diagnostic test accuracy (DTA) studies allowing for multiple cutoffs (Steinhauser 2016).
Objectives: To make this statistical method accessible to users with a background in statistics, psychology, medicine, or public health.
Methods: The parametric model assumes that the values of the underlying biomarker follow two correlated distributions for individuals with/without the target condition. Data can be entered either as study label, cutoff, TP (true positive), TN (true negative), FP (false positive), FN (false negative), or as individual participant data (study label, individual's measurement, status). Users can choose between several mixed linear models and specify the type of distribution (logistic or normal), and the weighting method for studies (e.g., inverse variance weighting). For determination of an optimal cutoff, weights for sensitivity and specificity can be specified.
Results: The output of diagmeta includes basic information such as the number of studies and cutoffs, the empirical distribution of cutoffs, the optimal cutoff, sensitivity and specificity at this cutoff, and the area under the summary ROC curve. For given cutoffs, pairs of sensitivity and specificity with confidence intervals can be tabulated. If a prevalence is specified, predictive values are calculated. In addition, a flexible plot function is provided to produce cumulative distribution plots, density plots, Youden index curves, study-specific ROC curves, the summary ROC curve, and the summary operating point, optionally with a corresponding confidence region. Conclusion: The R package diagmeta implements one of few available statistical methods for meta-analysis of DTA studies with multiple cutoffs and is now readily accessible. We plan to continuously extend and update diagmeta, and possibly to include competing methods.

Background New predictors (e.g. biomarkers) are often assessed for their incremental value on top of existing prediction models in a new dataset, but methods to assess this incremental value differ, and may actually answer different research questions.
Objectives To describe various approaches to assess the incremental value of a new predictor and show that they differ in the research questions they address, and the (magnitude of the) estimated incremental value they identify. Methods We distinguish three approaches: assessment of incremental value with respect to 1) an existing model ("existing model"); 2) individual predictors of an existing model ("model revision"); and 3) a selection of individual predictors which may be part of an existing model ("new model development"). Using these three approaches we assessed the incremental value of the D-dimer test to a deep venous thrombosis prediction model.

Results
The approach influences the research question that is actually addressed and influences the (magnitude of the) estimated incremental value (Table 1). The incremental value of the D-dimer test decreased with increasing adjustments of the existing model to the new dataset. In the "existing model" approach, the misfit of the existing model in the new dataset allows room for the apparent incremental value of a new predictor. The "model revision" approach solves this and has been recommended as the preferred way to assess the incremental value of a new predictor in a new dataset. In the "new model development" approach, the primary interest is not incremental value, but rather which combination of existing predictors and a new predictor best predicts the outcome.
Conclusion We advise investigators in incremental value studies to more explicitly consider using an approach that is in line with the research question they aim to answer, and to be aware that the approach influences the incremental value that can be identified.

Background Biological variability (BV) studies measure the natural variability in test results occurring between and within individuals. BV estimates can guide appropriate use of tests for monitoring and diagnosis. Analysis of these studies routinely involves detecting and eliminating outliers. The risk of outlier removal inappropriately reducing estimates of variability is not known.

Objectives
To estimate the impact of commonly used methods to remove outliers in BV studies.
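A minimal sketch of the kind of simulation this objective implies is given below. It is deliberately simplified and is not the authors' design: it uses a single level of variation rather than the nested analytical, within-individual and between-individual structure of a BV study, applies only Tukey's IQR rule, and all sample sizes, seeds and function names are hypothetical. It shows how removing flagged values from data that contain no true outliers affects the estimated standard deviation.

```python
import random
import statistics


def tukey_keep(values, k=1.5):
    """Keep only values inside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


def mean_sd_bias(n_sims=2000, n=20, true_sd=1.0, seed=7):
    """Average bias of the SD estimate without and with outlier removal."""
    rng = random.Random(seed)
    raw, trimmed = [], []
    for _ in range(n_sims):
        # Outlier-free data: every value comes from the same normal distribution.
        sample = [rng.gauss(0.0, true_sd) for _ in range(n)]
        raw.append(statistics.stdev(sample))
        trimmed.append(statistics.stdev(tukey_keep(sample)))
    return statistics.mean(raw) - true_sd, statistics.mean(trimmed) - true_sd


raw_bias, trimmed_bias = mean_sd_bias()
print(f"bias without removal: {raw_bias:.4f}, with Tukey removal: {trimmed_bias:.4f}")
```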

Results
When outlier detection and removal are used in the absence of outliers, analytical, within-individual and between-individual variability are underestimated. Unnecessary removal of measures varied between methods; the median (Q1, Q3) [min, max] number removed across 5,000 simulations was 2 (0, 4) [0, 30] using the Cochran C test and 0 (0, 0) [0, 0] using Dixon's Q test. Cochran C test and Tukey's IQR rule created the greatest bias (−10.6 × 10⁻⁴, −15.5 × 10⁻⁴ and −85.5 × 10⁻⁴ for analytical, within-individual and between-individual standard deviations respectively). There were differences in the ability of outlier detection methods to detect real outliers dependent on the number present. Outliers correctly identified and removed ranged from a median of 0% to 100%. Conclusion Identification of outliers in BV studies should lead to data checking and correction where necessary. However, outlier detection methods should be used as sensitivity analyses as they may lead to underestimation of measures of variation.

Background Binary logistic regression is one of the most frequently applied statistical models for developing clinical prediction models. Developers of such models often rely on an Events Per Variable criterion (EPV), notably EPV ≥ 10, to determine the minimal sample size required and/or the maximum number of candidate predictors that can be examined.

Objectives
To improve upon the existing sample size guidance for binary logistic prediction models. Methods I present an extensive simulation study of the influence of EPV, events fraction, number of candidate predictors, the correlations and distributions of candidate predictor variables, area under the ROC curve, and predictor effects on the out-of-sample predictive performance of prediction models. The out-of-sample performance (calibration, discrimination and probability prediction error) of developed prediction models was evaluated before and after regression shrinkage and variable selection.

Results
The results indicate that EPV fails to have a strong relation with metrics of predictive performance, and is not an appropriate criterion for (binary) prediction model development studies. Out-of-sample predictive performance can better be approximated by considering the number of predictors, the total sample size and the events fraction.

Conclusion
Prediction modeling studies should not only consider EPV to determine sample size. Instead, new sample size criteria for prediction models should be developed that take into account: the number of candidate predictors, the total sample size and the events fraction. A simple-to-apply formula for such sample size calculations is presented.

In precision medicine, new biomarkers and genomic tests may contribute to decision-making and improve outcomes by targeting therapy to those who benefit most. There is discussion about the evidence base required to make recommendations about their use. Do we need trials, observational studies, models or a combination? Objectives To explore how evidence on the prognostic strength of a genomic signature can contribute to individualized decision making on starting adjuvant chemotherapy for women with breast cancer.

Methods
The MINDACT trial was a randomized trial that enrolled 6,693 women with early-stage breast cancer. A 70-gene signature (Mammaprint) was used to estimate genomic risk, and clinical risk was estimated by a dichotomized version of the Adjuvant!Online risk calculator. 2,187 women with discordant risk results were randomly allocated to chemotherapy or no chemotherapy. We simulated the full risk distribution of these women and estimated individual benefit, assuming a constant relative effect of chemotherapy.
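The last step, estimating individual benefit under a constant relative effect, can be illustrated with a short sketch. Everything numeric here is hypothetical: the simulated risk distribution (a beta distribution), the relative risk of 0.8, and the function name `absolute_benefit` are assumptions chosen for illustration and are not taken from MINDACT. The point is only that a fixed relative effect translates into an absolute benefit that scales with each woman's baseline risk.

```python
import random


def absolute_benefit(baseline_risk, relative_risk):
    """Absolute risk reduction when treatment multiplies risk by `relative_risk`."""
    return baseline_risk * (1.0 - relative_risk)


rng = random.Random(42)
# Hypothetical distribution of baseline risks of an event without chemotherapy.
baseline_risks = [rng.betavariate(2.0, 18.0) for _ in range(1000)]
benefits = [absolute_benefit(r, relative_risk=0.8) for r in baseline_risks]

mean_benefit = sum(benefits) / len(benefits)
print(f"mean benefit: {mean_benefit:.3f}, "
      f"range: {min(benefits):.3f} to {max(benefits):.3f}")
```

Even with an identical relative effect for everyone, the spread between the smallest and largest absolute benefit shows why an average trial result may not apply equally to each individual.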

Background:
Receiver operating characteristic (ROC) curves are widely used in reports on clinical risk prediction models. Although the intent to demonstrate the ability of a model to discriminate between patients with and without a certain condition might be sincere, their presentation and interpretation are often inadequate. Objectives: ROC curves yield an improvement over selective reporting at identified optimal thresholds in early stage studies. However, we argue that most published ROC curves contain little useful information and are used erroneously to evaluate clinical prediction models. At the bare minimum, classification thresholds should be displayed on the ROC plots.

Methods:
We encourage the use of classification plots, which plot sensitivity and specificity separately by threshold. Such classification plots can be supplemented with measures of clinical utility such as net benefit. We illustrate the usefulness of classification plots with a case study on residual mass diagnosis in metastatic testicular cancer patients. Results: ROC curves are common in the medical literature to evaluate the performance of clinical prediction models. Our pragmatic search revealed 62% of ROC curves were presented uninformatively. ROC curves provide little information over and above the area under the curve (AUC) as a summary of discriminatory ability when threshold information is not plotted.
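The quantities behind a classification plot can be computed as in the sketch below, which tabulates sensitivity, specificity, net benefit and standardized net benefit at each threshold. The toy risks and outcomes and the function name `classification_table` are hypothetical; net benefit is taken in its usual form, TP/n − FP/n × t/(1 − t), and standardized net benefit as net benefit divided by prevalence. Plotting the columns against the threshold would give the classification plot.

```python
def classification_table(risks, outcomes, thresholds):
    """Sensitivity, specificity and net benefit at each risk threshold."""
    n = len(outcomes)
    n_pos = sum(outcomes)
    n_neg = n - n_pos
    prevalence = n_pos / n
    rows = []
    for t in thresholds:
        # Classify as positive when the predicted risk reaches the threshold.
        tp = sum(1 for r, y in zip(risks, outcomes) if r >= t and y == 1)
        fp = sum(1 for r, y in zip(risks, outcomes) if r >= t and y == 0)
        net_benefit = tp / n - fp / n * (t / (1.0 - t))
        rows.append({
            "threshold": t,
            "sensitivity": tp / n_pos,
            "specificity": (n_neg - fp) / n_neg,
            "net_benefit": net_benefit,
            "standardized_net_benefit": net_benefit / prevalence,
        })
    return rows


# Hypothetical predicted risks and observed outcomes (1 = condition present).
risks = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
outcomes = [0, 0, 1, 0, 1, 0, 1, 1]
table = classification_table(risks, outcomes, thresholds=[0.25, 0.5, 0.75])
for row in table:
    print(row)
```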

Conclusion:
We recommend focusing on the AUC, sensitivity and specificity at clinically relevant thresholds, and, if a visualization of discriminatory ability is desired, classification plots where sensitivity and specificity are presented by threshold. Classification plots can be readily augmented with standardized net benefit to assess the potential clinical utility of a model.

Background Meta-analysis may produce estimates that are unrepresentative of a test's performance in practice. Tailored meta-analysis circumvents this by deriving an applicable region for the practice and selecting the studies compatible with the region. It requires the test positive rate, r, and the prevalence, p, to be estimated for the setting, but previous studies have assumed that the two are independent.

Objective
The aim is to investigate the effects a correlation between test positive rate and prevalence has on estimating the applicable region and how this affects tailored meta-analysis. Method Four methods for estimating 99% confidence intervals for r and p were investigated: Wilson's score, Clopper-Pearson's exact interval, the Bonferroni correction and Hotelling's T² statistic. These were analysed in terms of the coverage probability using simulation trials over different correlations, sample sizes, and values for r and p. The methods were then applied to two published meta-analyses with associated practice data and the effects on the applicable region, studies selected and summary estimates evaluated.
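As an illustration of one of the interval methods named above, the sketch below computes a Wilson score interval for a single proportion and shows how a Bonferroni correction widens each interval when two parameters (here r and p) are to be covered jointly. The counts are hypothetical, the function name `wilson_interval` is an assumption, and the sketch does not implement the Clopper-Pearson or Hotelling's T² approaches or the authors' simulation.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.99):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)
    )
    return centre - half_width, centre + half_width


# Hypothetical practice data: 50 test positives among 100 patients.
single = wilson_interval(50, 100, confidence=0.99)

# Bonferroni correction for two parameters: each interval uses 1 - 0.01 / 2.
bonferroni = wilson_interval(50, 100, confidence=1.0 - 0.01 / 2.0)

print(f"99% Wilson interval: {single[0]:.3f} to {single[1]:.3f}")
print(f"Bonferroni-adjusted: {bonferroni[0]:.3f} to {bonferroni[1]:.3f}")
```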

Results
Hotelling's T² statistic with a continuity correction had the highest median coverage (0.9971). This and the Clopper-Pearson method with a Bonferroni correction both had coverage consistently above 0.99. The coverage of Hotelling's T² statistic intervals varied the least across different correlations. For both meta-analyses, the number of studies selected was largest when Hotelling's T² statistic was used to derive the applicable region. In one instance this increased the sensitivity by over 4% compared with tailored meta-analysis estimates using other methods.

Conclusion
Tailored meta-analysis returns estimates which are tailored to practice, provided the applicable region is accurately defined. This is most likely when the 99% confidence intervals for test positive rate and prevalence are estimated using Hotelling's T² statistic with a continuity correction. Potentially, the applicable region may be obtained using routine electronic health data.