Diabetes after pregnancy: a study protocol for the derivation and validation of a risk prediction model for 5-year risk of diabetes following pregnancy



Pregnancy offers a unique opportunity to identify women at higher future risk of type 2 diabetes mellitus (DM). In pregnancy, a woman has greater engagement with the healthcare system, and certain conditions that are important markers of future DM risk, such as gestational DM (GDM), are more apt to manifest. This study protocol describes the development and validation of a risk prediction model (RPM) for estimating a woman’s 5-year risk of developing type 2 DM after pregnancy.


Data will be obtained from existing Ontario population-based administrative datasets. The derivation cohort will consist of all women who gave birth in Ontario, Canada between April 2006 and March 2014. Pre-specified predictors will include socio-demographic factors (age at delivery, ethnicity), maternal clinical factors (e.g., body mass index), pregnancy-related events (gestational DM, hypertensive disorders of pregnancy), and newborn factors (birthweight percentile). Incident type 2 DM will be identified by linkage to the Ontario Diabetes Database. Weibull accelerated failure time models will be developed to predict 5-year risk of type 2 DM. Measures of predictive accuracy (Nagelkerke’s R2), discrimination (C-statistics), and calibration plots will be generated. Internal validation will be conducted using a bootstrapping approach in 500 samples with replacement, and an optimism-corrected C-statistic will be calculated. External validation of the RPM will be conducted by applying the model in a large population-based pregnancy cohort in Alberta, and estimating the above measures of model performance. The model will be re-calibrated by adjusting baseline hazards and coefficients where appropriate.


The derived RPM may help identify women at high risk of developing DM in the 5 years after pregnancy, thereby facilitating lifestyle changes and more frequent screening for type 2 DM among women at higher risk.


Type 2 diabetes mellitus (DM) is a serious metabolic condition that affects over 400 million people and accounted for 1.6 million deaths worldwide in 2016 [1]. DM and its complications are major contributors to reductions in life expectancy and quality of life [2,3,4]. Costs to healthcare systems attributed to DM are substantial with estimated annual global costs of US$1.3 trillion or 1.8% of global gross domestic product [5]. These costs are expected to rise further with growing type 2 DM prevalence. Some of the highest increases in DM prevalence have occurred among young women [6, 7]. Developing type 2 DM at younger ages is associated with worse morbidity and mortality compared to developing the condition at older ages [8]. Identifying opportunities for reversing the rising prevalence in young women is therefore essential to reducing the burden of DM.

Pregnancy provides a unique opportunity for estimating future type 2 DM risk, and then implementing potentially risk reducing strategies. Due to the intense physiological demands of pregnancy, some women develop temporary conditions such as GDM and hypertensive disorders of pregnancy that are important markers for future DM risk [9, 10]. GDM occurs in 5 to 20% of pregnancies and is associated with a seven-fold increased risk of type 2 DM [9,10,11,12]. Pregnancy is also a time when women are more engaged with the healthcare system and may be more motivated to implement behavioral changes to improve their health for the sake of their family.

Despite strong evidence among non-pregnant, high-risk populations that type 2 DM can be prevented with lifestyle modifications [13, 14], there remains limited post-partum follow-up relating to DM risk among women at high risk of developing DM [15]. Evidence also suggests that many women with GDM perceive their 10-year risk of developing DM to be low [16, 17]. Yet, 10–40% of women with GDM progress to DM within the first 5 years after delivery [18,19,20], underscoring the need for better risk communication.

A risk prediction model (RPM) for estimating future DM risk following delivery will facilitate risk communication and will support clinicians in identifying women at high risk who might benefit from preventative interventions. Existing DM RPMs were derived in cohorts that included only a small number of pregnant women and do not include important predictors of DM that develop during pregnancy [21,22,23,24,25,26]. Models developed among postpartum women are limited to women with a previous history of GDM [27,28,29]. These models would not be applicable to women who do not develop GDM during pregnancy but who may still be at high risk of future DM. Furthermore, each of these models was derived in a small sample (N ≤ 395) with limited ethnic variation.

To address the need for a population-based DM RPM that may be applied to all postpartum women, we propose the development and validation of a novel model using unique administrative datasets that capture all births, comprising over 200,000 deliveries, within the population covered under a single-payer health system.


Data sources

The RPM will be derived and validated using population-based administrative data collected in Ontario, Canada. These datasets are held at ICES, an independent, non-profit research institute whose legal status under Ontario’s health information privacy law allows it to collect and analyze health care and demographic data, without consent, for health system evaluation and improvement. The datasets are linked using unique encoded identifiers and analyzed at ICES.

The major source of data for this project will be Ontario’s perinatal registry, the Better Outcomes Registry and Network (BORN). This dataset was established in 2009 and captures information from hospitals, midwifery practice groups, specialized antenatal clinics, prenatal screening laboratories, and fertility clinics for all deliveries occurring within hospitals in Ontario. The quality of BORN data elements was recently assessed and the data were found to have good agreement with data from patient charts [30]. Of the 29 data elements assessed, 23 elements had greater than 90% agreement, including maternal weight and height. Prior to BORN, pre- and peri-natal data were collated in the Niday database, now known as the BORN legacy data [31]. This earlier database was first created in 1997 and by 2008 captured data for 96% of deliveries in Ontario. While data quality for Niday data is lower, the percentage agreement between Niday and patient charts for important predictors, including infant birthweight and gestational age at delivery, exceeds 90% [31]. Data collated in both databases include maternal demographic, behavioral, and health status characteristics. Obstetric complications and delivery outcomes are also captured. BORN is funded by the Ontario Ministry of Health and Long Term Care and administered by The Children’s Hospital of Eastern Ontario.

Since the administrative data do not capture information on ethnicity, maternal ethnicity will be ascertained using a validated algorithm which uses surnames to assign an ethnicity (South Asian, Chinese, Other) to all residents of Ontario [32]. Linkage to the Ontario Laboratories Information System (OLIS) will be used to capture 50 g oral glucose challenge test (OGCT) values. OLIS was established in 2006 and holds data relating to medical laboratory test orders and results from community, hospital and public laboratories across Ontario. The OGCT is used to screen for GDM, is currently offered to all pregnant women and is typically administered between 24 and 28 weeks’ gestation [33]. Data relating to comorbidities such as cardiovascular disease and prior pregnancy history will be obtained from the Discharge Abstract Database (DAD). This database captures detailed clinical and administrative data for hospital admissions and day surgeries in Ontario.

To obtain outcome data, we will use the Ontario Diabetes Database (ODD). This database was established in 1991 and contains all individuals with DM in Ontario. This database captures data from hospital discharge abstracts, Ontario drug benefit claims and physician service claims and data from this database are currently available until 31st March 2019. The use of data in this project was authorized under section 45 of Ontario’s Personal Health Information Protection Act, which does not require review by a Research Ethics Board.


The derivation cohort will consist of all women aged between 16 and 50 years (at index pregnancy) whose pregnancy resulted in a live birth or stillbirth in Ontario between 1st April 2006 and 31st March 2014. No minimum duration of pregnancy will be required for inclusion in the derivation cohort. For women who had multiple pregnancies during this period, the first pregnancy occurring during the accrual window will be selected. The influence of this choice of pregnancy will be examined in sensitivity analyses involving the development and validation of models in separate cohorts containing a randomized choice of pregnancy. Women with pre-pregnancy DM, as ascertained using the ODD and available variables in BORN indicating pre-existing DM, will be excluded from the derivation cohort. Women who were ineligible for the Ontario Health Insurance Plan or were not Ontario residents in the 2 years prior to the index date will be excluded since predictor and outcome data for these women will be incomplete. The index date is defined as 6 months post-partum. Women who had a second pregnancy prior to the index date and women who died prior to the index date will also be excluded.

Main predictors

Table 1 lists the predictors and the pre-specified functional forms for each.

Table 1 Pre-specification of predictors for risk prediction model

Choice of the candidate predictors was informed by a systematic literature review, consultation with the project advisory board, clinicians, and the availability of the predictors within the study’s data sources. To enhance usability, predictors of the model were also chosen based on their likely availability to intended users of the model. Continuous variables with limited variation and binary variables with small counts will be excluded. While categorical variables have been pre-specified, frequency distributions will be examined and categories may be combined where there are small numbers. Interactions between all predictors and each of age, GDM and ethnicity will be considered for inclusion into the model due to expected differences in the effect of these predictors by the selected variables. Interactions which improve model fit will be included in the final model.


Women will be followed up for the incidence of type 2 diabetes. The incident onset of type 2 DM will be ascertained using a validated definition [34]. That algorithm requires one hospital record including a diabetes-specific International Classification of Disease (ICD) code, OR two physician claim records relating to diabetes treatment within 1 year of each other, OR a dispensing record for an anti-diabetic drug from the Ontario Drug Benefit. According to this definition, diabetes records occurring between 120 days before and 180 days after a hospital or primary care record of pregnancy care are considered to relate to gestational diabetes and are excluded.
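The case definition above can be sketched as a small function. This is only an illustration in Python of the three qualifying routes and the pregnancy exclusion window as described in the text; the function and argument names are hypothetical, records are reduced to bare dates, and any further rules of the validated algorithm [34] are not represented.

```python
from datetime import date, timedelta

def in_pregnancy_window(record_date, pregnancy_care_dates):
    """True if a record falls 120 days before to 180 days after any pregnancy-care record."""
    return any(
        care - timedelta(days=120) <= record_date <= care + timedelta(days=180)
        for care in pregnancy_care_dates
    )

def incident_dm_date(hospital_dates, claim_dates, drug_dates, pregnancy_care_dates):
    """Return the earliest date meeting the case definition, or None (hypothetical helper)."""
    def keep(dates):
        # Records inside the pregnancy window are attributed to gestational diabetes
        return sorted(d for d in dates if not in_pregnancy_window(d, pregnancy_care_dates))

    hospital, claims, drugs = keep(hospital_dates), keep(claim_dates), keep(drug_dates)
    candidates = []
    if hospital:                      # one hospital record is sufficient
        candidates.append(hospital[0])
    if drugs:                         # one dispensing record is sufficient
        candidates.append(drugs[0])
    # Two physician claims within 1 year of each other; the date of the first visit is used
    for first, second in zip(claims, claims[1:]):
        if (second - first).days <= 365:
            candidates.append(first)
            break
    return min(candidates) if candidates else None
```

Because the claims are sorted, checking adjacent pairs is enough: if any two claims lie within a year of each other, so do two consecutive ones.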

Where physician claim records are used to ascertain DM status, DM diagnosis date is set to the date of the first visit. The validated definition has a sensitivity of 90.0%, a specificity of 97.7%, and a positive predictive value of 82.6%. Women will be followed up from index date until the earliest of incident diabetes, death, new pregnancy conception date, or study end-date (31st March 2019).
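The follow-up rule (earliest of incident diabetes, death, new pregnancy conception, or study end) can be illustrated with a short Python sketch. The function name, the argument names, and the tie-breaking rule (a diabetes diagnosis on the same day as a censoring event counts as an event) are assumptions made for illustration, not details taken from the protocol.

```python
from datetime import date

STUDY_END = date(2019, 3, 31)  # study end-date stated in the protocol

def follow_up(index_date, dm_date=None, death_date=None, conception_date=None):
    """Return (follow-up in years, event flag) for one woman.

    The event flag is 1 when incident diabetes ends follow-up, otherwise 0 (censored).
    """
    end_points = [(STUDY_END, 0)]
    if dm_date is not None:
        end_points.append((dm_date, 1))
    if death_date is not None:
        end_points.append((death_date, 0))
    if conception_date is not None:
        end_points.append((conception_date, 0))
    # Earliest date wins; on a tie the diabetes event takes precedence (assumption)
    end_date, event = min(end_points, key=lambda p: (p[0], -p[1]))
    return (end_date - index_date).days / 365.25, event
```

The resulting pair (time, event) is the usual input format for time-to-event models.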

Sample size

For the period between 2014 and 2016, there were 231,618 women who delivered a baby and 756,956 person-years of follow-up. Overall, 2294 women developed incident type 2 DM during follow-up (unpublished data). Using the criteria specified by Riley et al. and the pmsampsize R package, we calculated the minimum sample size required to be 6193 women to minimize overfitting and ensure the estimation of precise model coefficients [35, 36].
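As a rough illustration of the quantities involved, the Python sketch below computes the crude incidence rate from the figures quoted above and evaluates one of the criteria described by Riley et al. [35, 36], the sample size needed to keep expected uniform shrinkage at or above a target. The number of candidate parameters (30) and the anticipated Cox-Snell R2 (0.05) used in the usage note are hypothetical placeholders, not values from this protocol, and the pmsampsize package applies additional criteria that are not shown here.

```python
import math

def incidence_per_1000_py(events, person_years):
    """Crude incidence rate per 1000 person-years."""
    return 1000 * events / person_years

def min_n_shrinkage(n_parameters, r2_cs, shrinkage=0.9):
    """Minimum sample size so that expected uniform shrinkage is at least `shrinkage`.

    n = p / ((S - 1) * ln(1 - R2_cs / S)), where p is the number of candidate
    parameters and R2_cs is the anticipated Cox-Snell R-squared.
    """
    return math.ceil(n_parameters / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage)))
```

For example, `incidence_per_1000_py(2294, 756956)` gives roughly 3 events per 1000 person-years, and `min_n_shrinkage(30, 0.05)` shows how the required sample size grows as the anticipated R2 falls.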

Analysis plan

All data manipulations will be carried out using Wickham’s tidyverse package [37] and modelling will be conducted using Harrell’s rms package in R [38]. The TRIPOD statement was used to devise the analysis plan and will be used to guide the reporting of the development and validation of the proposed model [39].

Data cleaning and coding of predictors

Continuous predictors will be assessed using descriptive summaries and histograms. Implausible values will be set to missing, and highly skewed predictors will be truncated at the 99.5th percentile. For example, if a woman has a pre-pregnancy BMI value of 51 kg/m2, which surpasses the calculated 99.5th percentile BMI of 50 kg/m2, then her BMI will be assigned as 50 kg/m2. Continuous predictors will be centered on their mean values. Continuous variables will be included in the model as linear or non-linear terms using restricted cubic splines, depending on model fit. Knot placement in restricted cubic splines will be based on the percentile distribution of the continuous variable. The definitions of all variables have been pre-specified to minimize risk of over-fitting.
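The truncation and centering steps can be sketched as follows. This Python example is illustrative only: the percentile is computed by linear interpolation (other definitions exist), centering is applied after truncation (an assumption, since the order is not stated), and missing values are assumed to have been removed beforehand.

```python
def percentile(values, q):
    """Percentile by linear interpolation between order statistics (q in 0-100)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def truncate_and_centre(values, q=99.5):
    """Cap values at the q-th percentile, then centre on the mean of the capped values."""
    cap = percentile(values, q)
    truncated = [min(v, cap) for v in values]
    mean = sum(truncated) / len(truncated)
    return [v - mean for v in truncated], cap, mean
```

With a 99.5th percentile BMI of 50 kg/m2, a recorded value of 51 kg/m2 would be replaced by 50 kg/m2 before centering, matching the example in the text.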

Missing data

Predictors with missingness exceeding 40% will be excluded from the RPM [40]. Missing data among the remaining predictors will be assumed missing at random, conditional on the available variables, and will be addressed using multiple imputation [41, 42]. Logistic regression models will be used to identify predictors of missingness that should be included in the imputation model. To identify the likely ability of the imputation model to accurately impute missing data, we will conduct exploratory analyses to examine associations between variables with missing data and available predictors.

The imputation model will contain all time to event, censoring, predictor, and auxiliary variables [43]. Predictor variables will be included in the imputation model in the same functional form as they will appear in the pre-specified prediction model (i.e., continuous, categorical). Auxiliary variables which can provide information on the missing values will be included to improve the accuracy of the imputations. For example, to facilitate the imputation of BMI, pre-pregnancy weight will be included in the imputation model as an auxiliary variable. The mice package in R will be used to generate 10 imputed datasets. The model will be generated in each imputed dataset and combined according to Rubin’s rules [41].
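Rubin’s rules combine the per-dataset estimates by averaging them and adding the between-imputation variability to the average within-imputation variance. The Python sketch below shows the calculation for a single coefficient; the function name is hypothetical and the inputs in the usage note are made-up numbers.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: coefficient estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total
```

For instance, `pool_rubin([0.5, 0.6, 0.7], [0.01, 0.01, 0.01])` returns a pooled estimate of 0.6 with a total variance larger than 0.01, reflecting the uncertainty added by the missing data.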

At model deployment when the outcome is unknown and when predictors may be unavailable, it will be necessary for the tool to impute missing predictor data in real-time using single or multiple imputation approaches, where feasible. To emulate multiple imputations in this setting, the model’s performance will be assessed in two sets of imputed datasets. The first set of imputed datasets will be derived using imputation models that include the outcome variables while the second set will be derived in imputation models that exclude the outcome variables. The assessment of model performance in the second dataset will more accurately describe likely performance at model deployment [44]. Should multiple imputations at deployment not be feasible, single imputation methods will be explored. The chosen single imputation approach for handling missing data at model deployment will be replicated during external validation to obtain an accurate assessment of likely model performance at deployment.

Model estimation

Model coefficients will be estimated using Weibull accelerated failure time models to calculate 5-year risk of type 2 DM. This parametric model was chosen since it allows for the calculation of more clinically meaningful parameters, including predicted survival time for different follow-up periods.
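To make the link between a fitted Weibull accelerated failure time model and a 5-year risk concrete, the Python sketch below assumes the log-linear parameterization log T = lp + scale × W, where W follows a standard minimum extreme value distribution, so that S(t) = exp(−exp((log t − lp) / scale)). The linear predictor and scale values passed in are hypothetical; the protocol does not report coefficients.

```python
import math

def weibull_aft_risk(linear_predictor, scale, horizon_years=5.0):
    """Predicted probability of the event by `horizon_years` under a Weibull AFT model."""
    z = (math.log(horizon_years) - linear_predictor) / scale
    return 1.0 - math.exp(-math.exp(z))

def median_time(linear_predictor, scale):
    """Predicted median time to event: exp(lp) * (ln 2) ** scale."""
    return math.exp(linear_predictor) * math.log(2) ** scale
```

A larger linear predictor corresponds to a longer expected time to diabetes and therefore a lower 5-year risk, which is why parametric models of this kind can report both risks and predicted survival times.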

A full model, including all predictors and important interactions, will be developed in the first instance. Since the practical application of the full model may be time-consuming for intended users, we will derive a less complicated model by applying Ambler’s step-down approach [45]. This approach involves regressing the prognostic index derived from the full model on the predictors. The predictor whose removal produces the smallest reduction in R2 is then dropped. This procedure will be repeated until the exclusion of any further predictor would lead to an R2 value below 0.95. This approach will ensure that variables that contribute very limited information to the model are removed. We will verify our model building approach in exploratory analyses by applying the least absolute shrinkage and selection operator (LASSO) to conduct variable selection and regularization.
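The step-down idea can be sketched with ordinary least squares: the full model's prognostic index is regressed on the predictors, and predictors are removed one at a time while R2 stays at or above 0.95. The Python code below is a simplified illustration with hypothetical function names; it treats every predictor as a single numeric column and is not a reproduction of the procedure in [45].

```python
def solve(A, c):
    """Solve A b = c by Gaussian elimination with partial pivoting."""
    k = len(c)
    M = [row[:] + [ci] for row, ci in zip(A, c)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            f = M[r][col] / M[col][col]
            for j in range(col, k + 1):
                M[r][j] -= f * M[col][j]
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (M[i][k] - sum(M[i][j] * b[j] for j in range(i + 1, k))) / M[i][i]
    return b

def r_squared(X, y):
    """R-squared from least squares of y on the columns of X (intercept added)."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    b = solve(A, c)
    fitted = [sum(bi * ri for bi, ri in zip(b, r)) for r in rows]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def step_down(data, prognostic_index, threshold=0.95):
    """data maps predictor name -> list of values; returns the retained names."""
    kept = list(data)
    while len(kept) > 1:
        # R-squared of each reduced model with one predictor left out
        trial = {}
        for name in kept:
            rest = [n for n in kept if n != name]
            trial[name] = r_squared(list(zip(*(data[n] for n in rest))), prognostic_index)
        drop = max(trial, key=trial.get)   # removal causing the smallest loss of R-squared
        if trial[drop] < threshold:
            break
        kept.remove(drop)
    return kept
```

A predictor that contributes almost nothing to the prognostic index is removed first, and the loop stops as soon as any further removal would push R2 below the threshold.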

Assessment of RPM discrimination

Discrimination, which describes the model’s ability to distinguish between women who did and did not develop type 2 DM, will be assessed using C-statistics. C-statistics will be calculated at various time points (e.g., 1 year, 2 years, 5 years). The clinical usefulness of the RPM will be quantified using the net benefit approach [46].
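One common form of the C-statistic for censored time-to-event data is Harrell’s C, which counts how often the woman who develops diabetes earlier was also given the higher predicted risk. The Python sketch below is a simple quadratic-time illustration with hypothetical names; whether the study will use this exact estimator is not stated in the protocol.

```python
def harrell_c(times, events, risk_scores, horizon=None):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is usable when subject i has an observed event before
    subject j's follow-up time (and, if given, no later than `horizon`).
    """
    concordant = tied = usable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1 or (horizon is not None and times[i] > horizon):
            continue
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    tied += 1
    if usable == 0:
        raise ValueError("no usable pairs")
    return (concordant + 0.5 * tied) / usable
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking; passing `horizon=1`, `2` or `5` restricts the index to events occurring by that year.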

Assessment of RPM calibration

Overall model calibration describes the agreement between observed and predicted risks. Calibration will be assessed using calibration plots and estimation of calibration slopes and calibration-in-the-large values in each imputed dataset. Calibration curves will be generated in each imputed dataset and combined into a single plot. Calibration slopes will be estimated at fixed time-points by regressing the observed risk of type 2 DM on the predicted prognostic index. Calibration-in-the-large will be estimated by comparing the mean observed risk estimated using the Kaplan-Meier method with the mean predicted risk. Calibration will be assessed within predefined groups, including by GDM status, ethnicity and age. Formal statistical testing of calibration using Hosmer-Lemeshow goodness of fit tests will not be performed, due to the large sample size.
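Calibration-in-the-large, as described above, compares the Kaplan-Meier estimate of observed risk at a fixed time point with the mean predicted risk. The Python sketch below illustrates that comparison for one dataset; the function names are hypothetical, and pooling across imputed datasets is not shown.

```python
def kaplan_meier_risk(times, events, horizon):
    """Observed risk by `horizon`: 1 minus the Kaplan-Meier survival estimate."""
    survival = 1.0
    at_risk = len(times)
    for t in sorted(set(times)):
        if t > horizon:
            break
        died = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)
        if died:
            survival *= 1 - died / at_risk
        at_risk -= leaving   # events and censored subjects both leave the risk set
    return 1 - survival

def calibration_in_the_large(times, events, predicted_risks, horizon):
    """Return observed risk, mean predicted risk, and their difference."""
    observed = kaplan_meier_risk(times, events, horizon)
    expected = sum(predicted_risks) / len(predicted_risks)
    return {"observed": observed, "expected": expected, "difference": observed - expected}
```

A difference close to zero indicates that, on average, predicted and observed risks agree; the same function can be applied within subgroups such as GDM status, ethnicity, or age band.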

RPM predictive accuracy

Overall predictive accuracy will be assessed with the Brier score and by estimating explained variation using Nagelkerke’s R2.
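The Brier score is the mean squared difference between predicted risk and observed outcome. The Python sketch below shows the simplest case, assuming every woman's outcome status at the chosen time point is known; with censored follow-up a weighted version would be needed, which is not shown here.

```python
def brier_score(outcomes, predicted_risks):
    """Mean squared error between binary outcomes (0/1) and predicted risks."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, predicted_risks)) / len(outcomes)
```

Lower values indicate better accuracy; a model that predicts 0.5 for everyone scores 0.25 regardless of the outcomes.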

RPM internal validation

To assess the degree of optimism in the estimated performance, internal validation will be performed using a bootstrapping approach with 500 resamples. Using this approach, the model will be derived in each of the 500 bootstrap samples. Each of the bootstrap sample models will then be applied within the original dataset and measures of performance will be calculated for each bootstrap sample model. The difference between each bootstrap model’s performance in its own bootstrap sample and its performance in the original dataset will be used to estimate the optimism-corrected measures of performance, including Nagelkerke’s R2 and C-statistics. Over-fitting will be quantified by the calculation of a uniform shrinkage factor. Where necessary, the uniform shrinkage factor will be used to adjust the model coefficients.
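The bootstrap optimism correction can be illustrated with a deliberately simple stand-in: a straight-line model scored by R2 rather than the Weibull model and C-statistic planned in the protocol. The Python sketch below (all names hypothetical) fits the model in each bootstrap sample, measures how much better it performs there than in the original data, and subtracts the average of that optimism from the apparent performance.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

def r_squared(model, xs, ys):
    """Proportion of variance in ys explained by the fitted line."""
    intercept, slope = model
    my = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_res / sum((y - my) ** 2 for y in ys)

def optimism_corrected(xs, ys, n_boot=500, seed=1):
    """Apparent, optimism and optimism-corrected performance via the bootstrap."""
    rng = random.Random(seed)
    n = len(xs)
    apparent = r_squared(fit_line(xs, ys), xs, ys)
    optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]      # sample with replacement
        bx, by = [xs[i] for i in idx], [ys[i] for i in idx]
        model = fit_line(bx, by)
        # Performance in the bootstrap sample minus performance in the original data
        optimism += r_squared(model, bx, by) - r_squared(model, xs, ys)
    optimism /= n_boot
    return {"apparent": apparent, "optimism": optimism, "corrected": apparent - optimism}
```

The same loop applies to any model and performance measure: only `fit_line` and `r_squared` would need to be swapped for the model-fitting and scoring steps of interest.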

Model presentation

The final regression model and validation results will be published in full in a peer-reviewed journal and according to the TRIPOD guidelines. The regression formula will be subsequently incorporated into a web-based calculator and will be integrated into electronic healthcare records. By integrating a calculator into electronic healthcare records, general practitioners will be able to readily access the necessary input data to estimate diabetes risk in postpartum women.

External validation

We will externally validate the algorithm using data from a population-based pregnancy cohort from Alberta, the Alberta Vital Statistics–Birth database [47]. This database is populated by maternal, pregnancy, and neonatal data and is linked to hospital admissions records (Discharge Abstract Database), emergency room/outpatient clinic visits records (the Ambulatory Care Classification System (ACCS)), and physician office visit records (Fee-for-Service Claims (CLAIM)). Linkage to the Alberta Health Care Insurance Population registry provides demographic data including ethnicity and rurality. Diabetes will be ascertained using ICD-9 and ICD-10 codes in hospitalization, ACCS, or CLAIM records. The RPM will be applied in this cohort to estimate 5-year risk of type 2 DM in this external population. The predictive performance of the RPM will be assessed by estimating the previously described measures of calibration and discrimination. Where appropriate, the RPM will be re-calibrated by adjusting the baseline hazard and the mean values of the predictors to those of the external validation cohort [48].


We have described a protocol for the development and validation of a novel RPM using a large population-based cohort derived from validated databases. This will be the first RPM derived for use among all women following pregnancy and not just for those that developed GDM. The RPM may be embedded in electronic health records to help to guide clinical decision-making in family practices and to communicate risk. A web-based calculator will also be generated to enable women to calculate their own risk of developing diabetes and potentially motivate them to make lifestyle changes. The development of the web-based calculator will be informed by data collected during a qualitative study to identify optimal approaches to communicating type 2 diabetes risk among post-partum women.


A key limitation of this study will be our inability to distinguish type 2 DM from type 1 DM; therefore, the minority of women diagnosed with type 1 DM following pregnancy will be included as outcome events. These women are unlikely to benefit from postpartum lifestyle preventative interventions. However, since approximately 90 to 95% of diagnosed DM cases are type 2 DM and this proportion increases with increasing age at diagnosis, only a very small number of the DM cases identified are likely to be type 1 DM. A second limitation is the presence of missing data in the derivation cohort. We will address this problem by only including predictors that are at least 60% complete and by using multiple imputation, a widely accepted approach to handling missing data. Some predictors, such as OGCT values, are unlikely to be available for every woman since OLIS does not capture results from all laboratories across the province. Since the OGCT is offered to all pregnant women and screening rates are high in Canada [49], missing data in this variable should predominantly relate to whether the attending site was contributing submissions to OLIS. Given that evidence suggests no differences in characteristics of people attending OLIS contributing and non-contributing sites [50], it may be reasonable to assume that these data are missing completely at random, in which case multiple imputation is an appropriate missing data handling approach [41]. However, exploratory analyses will be undertaken to determine whether OGCT values can be reasonably imputed using available data.

Thirdly, while we intend to validate our RPM in a temporal validation cohort consisting of women with GDM, it will be necessary to further assess the generalizability of the model to other non-overlapping populations of postpartum women. Finally, we do not have data relating to family history of diabetes, an established predictor of type 2 diabetes risk. However, it is unknown to what extent family history of diabetes predicts risk of diabetes over and above the available predictors, specifically oral glucose tolerance test values.


The global burden of type 2 DM is increasing and identifying opportunities to reduce risk of type 2 DM is a key priority. Pregnancy offers one such opportunity due to increased healthcare engagement and the development of conditions associated with type 2 DM risk. An accurate type 2 DM RPM may be used by clinicians to identify women at greatest risk and who would benefit from lifestyle interventions. Our proposed model will be the first type 2 DM RPM for use among all pregnant women. It will be derived in a diverse population using large administrative data sources and will be used to enhance post-partum care of women at high risk of developing type 2 DM.

Availability of data and materials

The data that support the findings of this study are available from ICES and the Manitoba Centre for Health Policy, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of the Institute for Clinical Evaluative Sciences or the Manitoba Centre for Health Policy. Where possible, the analytic code will be uploaded to a free, open platform.



ACCS: Ambulatory Care Classification System
BMI: Body mass index
BORN: Better Outcomes Registry and Network
CI: Confidence intervals
DM: Diabetes mellitus
GDM: Gestational diabetes mellitus
ICD: International Classification of Disease
ODD: Ontario Diabetes Database
OHIP: Ontario Health Insurance Plan
OLIS: Ontario Laboratories Information System
RPM: Risk prediction model
TRIPOD: Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis


  1. Zhou B, Lu Y, Hajifathalian K, et al. Worldwide trends in diabetes since 1980: a pooled analysis of 751 population-based studies with 4·4 million participants. The Lancet 2016;387(10027):1513-30.

  2. Huo L, Shaw JE, Wong E, et al. Burden of diabetes in Australia: life expectancy and disability-free life expectancy in adults with diabetes. Diabetologia. 2016;59(7):1437–45.

  3. Rawshani A, Rawshani A, Franzén S, et al. Mortality and cardiovascular disease in type 1 and type 2 diabetes. New England Journal of Medicine. 2017;376(15):1407–18.

  4. Walker J, Colhoun H, Livingstone S, et al. Type 2 diabetes, socioeconomic status and life expectancy in Scotland (2012–2014): a population-based observational study. Diabetologia. 2018;61(1):108–16.

  5. Bommer C, Heesemann E, Sagalova V, et al. The global economic burden of diabetes in adults aged 20–79 years: a cost-of-illness study. The Lancet Diabetes & Endocrinology 2017;5(6):423-30.

  6. Lipscombe LL, Hux JE. Trends in diabetes prevalence, incidence, and mortality in Ontario, Canada 1995-2005: a population-based study. Lancet (London, England) 2007;369(9563):750-56. doi: 10.1016/s0140-6736(07)60361-4 [published Online First: 2007/03/06]

  7. Menke A, Casagrande S, Geiss L, et al. Prevalence of and Trends in Diabetes Among Adults in the United States, 1988-2012. JAMA. 2015;314(10):1021–9.

  8. Al-Saeed AH, Constantino MI, Molyneaux L, et al. An Inverse Relationship Between Age of Type 2 Diabetes Onset and Complication Risk and Mortality: The Impact of Youth-Onset Type 2 Diabetes. Diabetes Care. 2016;39(5):823–9.

  9. Bellamy L, Casas J-P, Hingorani AD, et al. Type 2 diabetes mellitus after gestational diabetes: a systematic review and meta-analysis. The Lancet. 2009;373(9677):1773–9.

  10. Feig DS, Shah BR, Lipscombe LL, et al. Preeclampsia as a risk factor for diabetes: a population-based cohort study. PLoS medicine 2013;10(4):e1001425. doi: 10.1371/journal.pmed.1001425 [published Online First: 2013/04/24]

  11. Feig DS, Hwee J, Shah BR, et al. Trends in incidence of diabetes in pregnancy and serious perinatal outcomes: a large, population-based study in Ontario, Canada, 1996–2010. Diabetes Care 2014:DC_132717. doi: 10.2337/dc13-2717

  12. Zhu Y, Zhang C. Prevalence of gestational diabetes and risk of progression to type 2 diabetes: a global perspective. Curr Diab Rep. 2016;16(1):7–7.

  13. Diabetes Prevention Program Outcomes Study. Long-term effects of lifestyle intervention or metformin on diabetes development and microvascular complications over 15-year follow-up: the Diabetes Prevention Program Outcomes Study. The lancet Diabetes & endocrinology 2015;3(11):866-75. doi: 10.1016/s2213-8587(15)00291-0 [published Online First: 2015/09/18]

  14. Li G, Zhang P, Wang J, et al. The long-term effect of lifestyle interventions to prevent diabetes in the China Da Qing Diabetes Prevention Study: a 20-year follow-up study. Lancet (London, England) 2008;371(9626):1783-9. doi: 10.1016/s0140-6736(08)60766-7 [published Online First: 2008/05/27]

  15. McGovern A, Butler L, Jones S, et al. Diabetes screening after gestational diabetes in England: a quantitative retrospective cohort study. British Journal of General Practice. 2014;64(618):e17–23.

  16. Kim C, McEwen LN, Piette JD, et al. Risk perception for diabetes among women with histories of gestational diabetes mellitus. Diabetes Care 2007;30(9):2281-6. doi: 10.2337/dc07-0618 [published Online First: 2007/06/19]

  17. Mukerji G, Kainth S, Pendrith C, et al. Predictors of low diabetes risk perception in a multi-ethnic cohort of women with gestational diabetes mellitus. Diabetic medicine : a journal of the British Diabetic Association 2016;33(10):1437-44. doi: 10.1111/dme.13009 [published Online First: 2015/10/27]

  18. Kaul P, Savu A, Nerenberg KA, et al. Impact of gestational diabetes mellitus and high maternal weight on the development of diabetes, hypertension and cardiovascular disease: a population-level analysis. Diabet Med. 2015;32(2):164–73.

  19. Feig DS, Zinman B, Wang X, et al. Risk of development of diabetes mellitus after diagnosis of gestational diabetes. CMAJ 2008;179(3):229-34. doi: 10.1503/cmaj.080012 [published Online First: 2008/07/30]

  20. Mukerji G, Chiu M, Shah BR. Impact of gestational diabetes on the risk of diabetes following pregnancy among Chinese and South Asian women. Diabetologia. 2012;55(8):2148–53.

  21. Hippisley-Cox J, Coupland C. Development and validation of QDiabetes-2018 risk prediction algorithm to estimate future risk of type 2 diabetes: cohort study. BMJ. 2017;359:j5019.

  22. Rosella LC, Manuel DG, Burchill C, et al. A population-based risk algorithm for the development of diabetes: development and validation of the Diabetes Population Risk Tool (DPoRT). J Epidemiol Community Health. 2011;65(7):613–20.

  23. Kahn HS, Cheng YJ, Thompson TJ, et al. Two risk-scoring systems for predicting incident diabetes mellitus in U.S. adults age 45 to 64 years. Annals of internal medicine 2009;150(11):741-51. doi: 10.7326/0003-4819-150-11-200906020-00002 [published Online First: 2009/06/03]

  24. Chien K, Cai T, Hsu H, et al. A prediction model for type 2 diabetes risk among Chinese people. Diabetologia 2009;52(3):443-50. doi: 10.1007/s00125-008-1232-4 [published Online First: 2008/12/06]

  25. Schmidt MI, Duncan BB, Bang H, et al. Identifying individuals at high risk for diabetes: The Atherosclerosis Risk in Communities study. Diabetes Care 2005;28(8):2013-8. doi: 10.2337/diacare.28.8.2013 [published Online First: 2005/07/27]

  26. Schulze MB, Hoffmann K, Boeing H, et al. An accurate risk score based on anthropometric, dietary, and lifestyle factors to predict the development of type 2 diabetes. Diabetes Care. 2007;30(3):510–5.

  27. Barden A, Singh R, Walters B, et al. A simple scoring method using cardiometabolic risk measurements in pregnancy to determine 10-year risk of type 2 diabetes in women with gestational diabetes. Nutrition & diabetes 2013;3:e72. doi: 10.1038/nutd.2013.15 [published Online First: 2013/06/05]

  28. Köhler M, Ziegler AG, Beyerlein A. Development of a simple tool to predict the risk of postpartum diabetes in women with gestational diabetes mellitus. Acta Diabetologica. 2016;53(3):433–7.

  29. Kwak SH, Choi SH, Kim K, et al. Prediction of type 2 diabetes in women with a history of gestational diabetes using a genetic risk score. Diabetologia. 2013;56(12):2556–63.

  30. Dunn S, Lanes A, Sprague AE, et al. Data accuracy in the Ontario birth registry: a chart re-abstraction study. BMC Health Services Research. 2019;19(1):1001.

  31. Dunn S, Bottomley J, Ali A, et al. 2008 Niday perinatal database quality audit: report of a quality assurance project. Chronic Diseases and Injuries in Canada. 2011;32(1):32–42.

  32. Shah BR, Chiu M, Amin S, et al. Surname lists to identify South Asian and Chinese ethnicity from secondary data in Ontario, Canada: a validation study. BMC Medical Research Methodology. 2010;10(1):42.

  33. Feig DS, Berger H, Donovan L, et al. Diabetes and pregnancy. Canadian Journal of Diabetes. 2018;42:S255–S82.

  34. Lipscombe LL, Hwee J, Webster L, et al. Identifying diabetes cases from administrative data: a population-based validation study. BMC Health Services Research. 2018;18(1):316.

  35. Riley RD, Ensor J, Snell KIE, et al. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368:m441. doi: 10.1136/bmj.m441.

  36. Riley RD, Snell KI, Ensor J, et al. Minimum sample size for developing a multivariable prediction model: PART II - binary and time-to-event outcomes. Stat Med. 2019;38(7):1276–96. doi: 10.1002/sim.7992.

  37. tidyverse [program]. Springer-Verlag New York, 2017.

  38. rms: Regression modeling strategies [program]. 2015.

  39. Collins GS, Reitsma JB, Altman DG, et al. Transparent reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement. Annals of Internal Medicine. 2015;162(1):55–63.

  40. Jakobsen JC, Gluud C, Wetterslev J, et al. When and how should multiple imputation be used for handling missing data in randomised clinical trials - a practical guide with flowcharts. BMC Med Res Methodol. 2017;17(1):162. doi: 10.1186/s12874-017-0442-1.

  41. Rubin DB. Multiple imputation after 18+ years. Journal of the American Statistical Association. 1996;91(434):473–89.

  42. Schafer JL. Multiple imputation: a primer. Statistical Methods in Medical Research. 1999;8(1):3–15. doi: 10.1177/096228029900800102.

  43. Moons KG, Donders RA, Stijnen T, et al. Using the outcome for imputation of missing predictor values was preferred. Journal of Clinical Epidemiology. 2006;59(10):1092–101. doi: 10.1016/j.jclinepi.2006.01.009.

  44. Wood AM, Royston P, White IR. The estimation and use of predictions for the assessment of model performance using large samples with multiply imputed data. Biometrical Journal 2015;57(4):614-32. doi:

  45. Ambler G, Brady AR, Royston P. Simplifying a prognostic model: a simulation study based on clinical data. Statistics in medicine. 2002;21(24):3803–22.

  46. Vickers AJ, Van Calster B, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ. 2016;352:i6.

  47. Bowker SL, Savu A, Yeung RO, et al. Patterns of glucose-lowering therapies and neonatal outcomes in the treatment of gestational diabetes in Canada, 2009–2014. Diabetic Medicine. 2017;34(9):1296–302.

  48. Crowson CS, Atkinson EJ, Therneau TM. Assessing calibration of prognostic risk scores. Statistical Methods in Medical Research. 2016;25(4):1692–706. doi: 10.1177/0962280213497434.

  49. Donovan LE, Savu A, Edwards AL, et al. Prevalence and timing of screening and diagnostic testing for gestational diabetes: a population-based study in Alberta, Canada. Diabetes Care. 2015:dc151421. doi: 10.2337/dc15-1421.

  50. Iskander C, McArthur E, Nash DM, et al. Identifying Ontario geographic regions to assess adults who present to hospital with laboratory-defined conditions: a descriptive study. CMAJ Open. 2019;7(4):E624–E29.


The proposed datasets will be linked using unique encoded identifiers and analyzed at ICES. Parts of the material are based on data and/or information compiled and provided by CIHI. However, the statements expressed in the material are those of the author(s), and not necessarily those of CIHI.


This project is supported by ICES, an independent, not-for-profit entity funded by an annual grant from the Ontario Ministry of Health and Long-Term Care (MOHLTC). SHR is supported by a Diabetes Action Canada fellowship. This project is also supported by a PSI Foundation grant. LLL is supported by a Diabetes Investigator Award from Diabetes Canada and LCR is supported by a Canada Research Chair in Population Health Analytics. The analyses, conclusions, opinions, and statements expressed herein are solely those of the authors and do not reflect those of the funding or data sources; no endorsement is intended or should be inferred. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

Authors and Affiliations



This project was devised by LL and SHR. SHR drafted the manuscript. All authors contributed to the study design and the paper’s critical revision. All authors have read and approved the final version of the manuscript. SHR is responsible for the integrity of the work as a whole.

Corresponding author

Correspondence to Stephanie H. Read.

Ethics declarations

Ethics approval and consent to participate

ICES is a prescribed entity under section 45 of Ontario’s Personal Health Information Protection Act. Section 45 authorizes ICES to collect personal health information, without consent, for the purpose of analysis or compiling statistical information with respect to the management of, evaluation or monitoring of, the allocation of resources to or planning for all or part of the health system. Projects conducted under section 45, by definition, do not require review by a Research Ethics Board. This project was conducted under section 45 and approved by ICES’ Privacy and Legal Office.

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Read, S.H., Rosella, L.C., Berger, H. et al. Diabetes after pregnancy: a study protocol for the derivation and validation of a risk prediction model for 5-year risk of diabetes following pregnancy. Diagn Progn Res 5, 5 (2021).
