Table 1 Summary of performance measures for quantifying added value

From: Quantifying the added value of new biomarkers: how and how not

| Measure | Advantages | Disadvantages |
| --- | --- | --- |
| Likelihood-based measures | Reflects probability of obtaining the observed data | Based on assumed model |
|   Likelihood ratio (LR), change in AIC or BIC | The LR test is the uniformly most powerful test for nested models. The AIC and BIC can be used to assess non-nested models. | While powerful, statistical association or model improvement may not be of clinical importance. |
| Discrimination | Assesses separation of cases and non-cases | Only one component of model fit |
|   Difference in ROC curves, AUC, c-statistic | Assesses discrimination between those with and without the outcome of interest across the whole range of a continuous predictor or score. Useful for classification. | Based on ranks only. Does not assess calibration. Differences may not be of clinical importance. |
| Clinical risk reclassification | Examines difference in assigning to clinically important risk strata | Strata should be pre-defined. Loses information if strata are not clinically important. |
|   Reclassification calibration statistic | Assesses calibration within cross-classified risk strata | A test for each model is needed |
|   Categorical NRI | Can assess changes in important risk strata. Cases and non-cases can be considered separately. | Depends on the number of categories and cut points used |
|   NRI(p) | Nice statistical properties. Does not vary by event rate in the data. | May not be clinically relevant |
|   Conditional NRI | Indicates improvement within clinically important risk subgroups | Biased in its crude form, and a correction based on the full data is needed. |
| Category-free measures | Does not require cut points | May lose clinical intuition |
|   Brier score | Proper scoring rule | May be difficult to interpret; the maximum value depends on incidence of the outcome. |
|   NRI(0) | Continuous, does not depend on categories | Based on ranks only. Measure of association rather than model improvement. Behavior may be erratic if the new predictor is not normally distributed. |
|   IDI | Nice statistical properties. Related to the difference in model R² | Depends on event rate. Values are low and may be difficult to interpret. |
| Decision analytics | Estimates clinical impact of using model | Not a direct estimate of model fit or improvement. Need reasonable estimates of decision thresholds. |
|   Decision curve | Displays the net benefit across a range of thresholds | Does not compare model improvement directly but clinical consequences of using the models for treatment decisions |
|   Cost-benefit analysis | Compares costs and benefits of one model or treatment strategy vs. another | Need detailed estimates of costs and benefits of misclassification, including further diagnostic workup and treatments |
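
Several of the measures summarized in the table can be computed directly from the predicted risks of an old and a new model. The following is a minimal sketch, not the article's own code: the function names (`auc`, `brier`, `nri`, `idi`, `net_benefit`), the four-person data set, the single cut point of 0.25, and the decision threshold of 0.25 are all illustrative assumptions, and the formulas are the commonly used definitions of each measure.

```python
import numpy as np


def auc(y, p):
    """c-statistic: share of case/non-case pairs where the case has the
    higher predicted risk (ties count one half). Uses ranks only."""
    cases, noncases = p[y == 1], p[y == 0]
    diff = cases[:, None] - noncases[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def brier(y, p):
    """Brier score: mean squared difference between risk and outcome."""
    return float(np.mean((p - y) ** 2))


def nri(y, p_old, p_new, cuts=None):
    """Net reclassification improvement.

    With cut points, risks are first mapped to strata (categorical NRI).
    With cuts=None, any upward or downward change counts (NRI(0)).
    """
    if cuts is None:
        old, new = p_old, p_new
    else:
        old, new = np.digitize(p_old, cuts), np.digitize(p_new, cuts)
    up, down = new > old, new < old
    ev, ne = y == 1, y == 0
    nri_cases = up[ev].mean() - down[ev].mean()        # cases should move up
    nri_noncases = down[ne].mean() - up[ne].mean()     # non-cases should move down
    return float(nri_cases + nri_noncases)


def idi(y, p_old, p_new):
    """Integrated discrimination improvement: change in mean risk among
    cases minus change in mean risk among non-cases."""
    ev, ne = y == 1, y == 0
    gain_cases = p_new[ev].mean() - p_old[ev].mean()
    gain_noncases = p_new[ne].mean() - p_old[ne].mean()
    return float(gain_cases - gain_noncases)


def net_benefit(y, p, threshold):
    """Net benefit of treating everyone with risk >= threshold, as plotted
    on a decision curve: TP/n - FP/n * threshold / (1 - threshold)."""
    treat = p >= threshold
    n = len(y)
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    return float(tp / n - fp / n * threshold / (1 - threshold))


# Made-up data: two cases, two non-cases, risks from an old and a new model.
y = np.array([1, 1, 0, 0])
p_old = np.array([0.2, 0.4, 0.3, 0.1])
p_new = np.array([0.5, 0.6, 0.2, 0.1])

print("Change in AUC:   ", auc(y, p_new) - auc(y, p_old))
print("Change in Brier: ", brier(y, p_new) - brier(y, p_old))
print("Categorical NRI: ", nri(y, p_old, p_new, cuts=[0.25]))
print("NRI(0):          ", nri(y, p_old, p_new))
print("IDI:             ", idi(y, p_old, p_new))
print("Net benefit gain:", net_benefit(y, p_new, 0.25) - net_benefit(y, p_old, 0.25))
```

The sketch keeps cases and non-cases separate inside `nri`, which mirrors the table's point that the two groups can be considered separately. Changing `cuts` changes the categorical NRI, which illustrates its dependence on the number of categories and the cut points used.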