Here, we describe a joint modelling approach for the analysis of multistage IVF data. The approach includes distinct submodels for each of the response variables considered in the cycle. We include six response variables for patient *j =* 1,…,*n* and their embryos *i* = 1,…,*n*_{j}, and hence six submodels, in the current presentation: the number of oocytes (eggs) obtained from ovarian stimulation (a count,\( {y}_j^O \)); the fertilisation rate when the oocytes are mixed with sperm (\( {y}_j^M \)); two measures of embryo quality (cell evenness and degree of fragmentation \( {y}_{ij}^E \)and \( {y}_{ij}^F \), both measured using ordinal grading scales); an indicator denoting whether one or two embryos were transferred to the patient (denoted by a binary variable \( {y}_j^D \)) and another (\( {y}_j^L \)) indicating whether or not the transfer of embryos resulted in the live birth of one or more babies (a live birth event, or LBE) (Fig. 1). These are listed in temporal order, with the exception of the two embryo quality scales, which are coincident. We include the decision to transfer two embryos (known as double embryo transfer, or DET) in the model because it is an important predictor of transfer success which is partially determined by the outcomes of the earlier stages. A second feature is that once a patient has dropped out of the cycle, she does not appear in the submodels corresponding to the downstream responses.

### Joint model

The joint model requires the use of latent variable representations for the various submodels constituting the larger model. Each patient *j* has associated vectors of responses *y*_{j}= (\( {y}_j^O,{y}_j^M,{y}_{ij}^E,{y}_{ij}^F,{y}_j^D,{y}_j^L \)) and of underlying latent variables *z*_{j} = (\( {z}_j^O,{z}_j^M,{z}_j^E,{z}_j^F,{z}_j^D,{z}_j^L \)). Both of these vectors may be partially observed due to drop-out or outright failure before completion of the treatment. We then posit a multivariate normal distribution for the latent variables and estimate the elements of the correlation and variance-covariance matrices. We prefer to use distinct latent variables in each submodel to an approach based on a common latent variable which is scaled by factor loadings in each submodel (e.g. [12, 13]), as the linearity assumption required for the latter is too restrictive for present purposes [14]. The submodels for each stage are presented below, followed by the multivariate distribution of latent variables.

#### Stimulation phase submodel

For patient *j*, we assume the number of oocytes (eggs) obtained \( {y}_j^O \) follows a Poisson distribution and model the log of the rate parameter \( {\lambda}_j^o \) in the usual way:

$$ \log \left({\lambda}_j^o\right)={\boldsymbol{X}}_j^o{\boldsymbol{\beta}}^o+{z}_j^o $$

(1)

where \( {\boldsymbol{X}}_j^o \) is a row-vector of cycle-level covariates for patient *j*, *β*^{o} is a corresponding vector of regression parameters and \( {z}_j^o \) is a patient-specific latent variable that captures overdispersion in the oocyte yield. This submodel is fitted to all patients who start the cycle.

#### Fertilisation submodel

We model the number of embryos obtained when oocytes are mixed with sperm\( {y}_j^M \) in terms of its rate parameter \( {\lambda}_j^M \), again using a Poisson submodel:

$$ \log \left({\lambda}_j^M\right)=\log \left({y}_j^O\right)+{\boldsymbol{X}}_j^M{\boldsymbol{\beta}}^M+{z}_j^M $$

(2)

where \( {\boldsymbol{X}}_j^M \)_{,} *β*^{M} and \( {z}_j^M \) are analogous to the corresponding terms in the stimulation model. We now include an offset term corresponding to the logarithm of the number of oocytes obtained in the linear predictor. This submodel is fitted to all patients who have oocytes mixed with sperm. In some cycles, the number of oocytes mixed with sperm is less than the number obtained, so there is an implicit assumption in the model that any oocytes which were not mixed could not have been successfully fertilised. The assumption is reasonable, since the decision not to mix an oocyte with sperm is almost always based on the fact that the oocyte has been identified as being degenerate.

#### Embryo quality submodels

We include two measures of embryo quality: cell evenness (*y*^{E}) and degree of fragmentation (*y*^{F}). These are ordinal 1 to 4 grading scales measured at the level of individual embryos. We model these using cumulative logit submodels. For embryo *i* (where *i =* 1,2,…,*n*_{j}) nested in patient *j*, we have, for *k* = 1,2,3:

$$ {\displaystyle \begin{array}{c}\log \mathrm{it}\left({\gamma}_{kij}^E\right)={\alpha}_k^E-{\boldsymbol{X}}_{ij}^E{\boldsymbol{\beta}}_k^E-{z}_j^E\\ {}\log \mathrm{it}\left({\gamma}_{kij}^F\right)={\alpha}_k^F-{\boldsymbol{X}}_{ij}^F{\boldsymbol{\beta}}_k^F-{z}_j^F\end{array}} $$

(3)

where \( {\boldsymbol{X}}_{ij}^E \) and \( {\boldsymbol{X}}_{ij}^F \) are row-vectors of covariates, \( {\boldsymbol{\beta}}_k^E \) and \( {\boldsymbol{\beta}}_k^F \)are vectors of regression coefficients which may vary across the levels of *k* (relaxing the proportional odds assumption) and \( {z}_j^E \) and \( {z}_j^F \) are patient-level random effects (latent variables) which are identified due to the clustering of embryos within patients.\( {\gamma}_{kij}^E \) and \( {\gamma}_{kij}^F \) are cumulative probabilities of embryo *i* in patient *j* having a grade of *k* or lower for evenness and fragmentation degree respectively and \( {\alpha}_k^E \)and \( {\alpha}_k^F \) are threshold parameters, corresponding to the log-odds of the embryo having grade *k* or lower. The negative signs in these equations are used so that positive effects of a covariate may be interpreted as increasing the ordinal measure. These submodels are fitted to all embryos.

#### Double embryo transfer submodel

In order to jointly model the binary response DET, denoting the number of embryos transferred, with the other response variables, we use a latent variable representation of a probit regression model [22]. Let \( {y}_j^D=1 \) or 0 if patient *j* does or does not have DET, respectively. We define \( {y}_j^{D\ast } \) as a latent continuous variable underlying the binary \( {y}_j^D \), such that:

$$ {y}_j^D=\left\{\begin{array}{c}1\kern0.5em if\kern0.5em {y}_j^{D\ast}\kern0.5em \ge 0\\ {}0\kern0.5em if\kern0.5em {y}_j^{D\ast}\kern0.5em <\kern0.5em 0\end{array}\right. $$

(4)

A linear regression submodel for the latent \( {y}_j^{D\ast } \) is then used to estimate covariate effects:

$$ {\displaystyle \begin{array}{c}{y}_j^{D\ast }={\boldsymbol{X}}_j^D{\beta}^D+{z}_j^D\\ {}{z}_j^D\sim N\left(0,\kern0.5em 1\right)\end{array}} $$

(5)

where \( {\boldsymbol{X}}_j^D \) is a row-vector of patient-level covariates and *β*^{D} is a vector of regression coefficients. Fixing the variance of \( {z}_j^D \) to be 1 is mathematically equivalent to specifying a probit model for the probability that a patient will have DET. This formulation is used for convenience; it allows us to use \( {z}_j^D \) to link the DET submodel to the others by assuming an underlying multivariate normal distribution for the latent variables.

#### Live birth event submodel

In the same manner as for DET, we use a latent probit representation for \( {y}_j^L=1 \) or 0 corresponding to whether or not LBE obtains, with an underlying latent variable \( {y}_j^{L\ast } \):

$$ {y}_j^L=\left\{\begin{array}{c}1\kern0.5em if\kern0.5em {y}_j^{L\ast}\kern0.5em \ge \kern0.5em 0\\ {}0\kern0.5em if\kern0.5em {y}_j^{L\ast}\kern0.5em <0\end{array}\right. $$

(6)

Again, a linear regression submodel for the latent \( {y}_j^{L\ast } \) is then used to estimate covariate effects:

$$ {\displaystyle \begin{array}{c}{y}_j^{L\ast }={\boldsymbol{X}}_j^L{\boldsymbol{\beta}}^L+{z}_j^L\\ {}{z}_j^L\sim N\left(0,\kern0.5em 1\right)\end{array}} $$

(7)

with row vector of patient-level covariates \( {\boldsymbol{X}}_j^L \) and vector of regression coefficients *β*^{L}. The error term \( {z}_j^L \) again has a variance of 1 and is used to link the LBE submodel to the others. The DET and LBE submodels are fitted to patients who undergo the transfer procedure.

#### Covariates

Different covariates may be included in each of the covariate vectors \( {\boldsymbol{X}}_j^O \)_{,} \( {\boldsymbol{X}}_j^M \)_{,} \( {\boldsymbol{X}}_{ij}^E \)_{,} \( {\boldsymbol{X}}_{ij}^F \)_{,} \( {\boldsymbol{X}}_j^D \)_{,} \( {\boldsymbol{X}}_j^L \). If interest is in dynamically predicting the outcome of subsequent stages, conditional on responses at previous stages, this could include any of the response variables *y*_{j} occurring prior to this stage. For example, the number of fertilised eggs could be included in the submodels for embryo evenness and fragmentation, or these measures of embryo quality could be included in submodels for DET and LBE.

When including outcomes as covariates in a joint model, model identification can pose a challenge [23, 24]. Standard strategies include fixing parameters in the model (for example, fixing elements of the latent correlation matrix to be zero) and including non-identical sets of covariates in the submodels. When these approaches are used in a causal inference setting, this requirement translates to including instrumental variables in the submodels [21, 25, 26]. Where our goal is prediction rather than causal inference however, these strong structural assumptions are not required.

#### Latent variable distribution

We specify a multivariate normal distribution for the latent variables to connect the submodels, with variances of 1 for the probit DET and LBE submodels:

$$ \left[\begin{array}{c}{z}_j^O\\ {}{z}_j^M\\ {}{z}_j^E\\ {}{z}_j^F\\ {}{z}_j^D\\ {}{z}_j^L\end{array}\right]\sim MVN\left(\left[\begin{array}{c}0\\ {}0\\ {}0\\ {}0\\ {}0\\ {}0\end{array}\right],\left[\begin{array}{cccccc}{\theta}_O^2& {\eta}_1{\theta}_O{\theta}_M& {\eta}_2{\theta}_O{\theta}_E& {\eta}_3{\theta}_O{\theta}_F& {\eta}_4{\theta}_O& {\eta}_5{\theta}_O\\ {}.& {\theta}_M^2& {\eta}_6{\theta}_M{\theta}_E& {\eta}_7{\theta}_M{\theta}_F& {\eta}_8{\theta}_M& {\eta}_9{\theta}_M\\ {}.& .& {\theta}_E^2& {\eta}_{10}{\theta}_E{\theta}_F& {\eta}_{11}{\theta}_E& {\eta}_{12}{\theta}_E\\ {}.& .& .& {\theta}_F^2& {\eta}_{13}{\theta}_F& {\eta}_{14}{\theta}_F\\ {}.& .& .& .& 1& {\eta}_{15}\\ {}{\eta}_5{\theta}_O& \dots & \dots & \dots & \dots & 1\end{array}\right]\ \right) $$

(8)

where *θ*_{O}, *θ*_{M}, *θ*_{E}, *θ*_{F} are standard deviations for the latent variables corresponding to the first four submodels and *η*_{1}, *η*_{2}, …, *η*_{15} are off-diagonal elements of the correlation matrix. We note that this framework estimates the relationships between patient and embryo-level responses.