Interval censored survival analysis - an example with the Royston-Parmar flexible parametric model

The do file with all the code from this tutorial is available to download here

Interval censoring occurs when we only know an event occurred within a particular time window, rather than the exact time at which it happened. Such data is common in opthalmology and dentistry, where events are only picked up at scheduled appointments, but they actually occurred at some point since the previous visit. In this example, I will first show how to simulate interval censored survival times, and then show how to use merlin to fit an interval censored flexible parametric survival model. There are two obvious ways (could be more…) to simulate interval censored survival times - either under a discrete time setting, assuming particular event probabilities within time windows, or a continuous time setting. I will focus on the second, which I would always argue is more biologically plausible, and also as I wish to fit a continuous time survival model.

First we’ll simulate a dataset of 1000 observations, and include a binary variable which we will assume is a treatment group indicator coded 0 for control, and 1 for active treatment. I’ll assume allocation to each group is 50% at random. And of course I set a seed for reproducibility (we can’t be friends if you don’t do that).

. clear

. set seed 72549

. set obs 1000
number of observations (_N) was 0, now 1,000

. gen trt = runiform()>0.5

To keep things nice and simple I will use my survsim Stata command to simulate survival times from a Weibull distribution with shape and scale parameters, 1.2 and 0.1. I’ll assume a log hazard ratio due to treatment of -0.5. Let’s apply administrative censoring at 5 years.

. survsim stime event , dist(weib) lambda(0.1) gamma(1.2) covariates(trt -0.5) maxt(5)

I’ve simulated continuous, exactly observed survival times, which is what we would observe in an ideal world. Sadly, this isn’t always the case, and we may only observe events at specific times, such as scheduled doctor appointments. We will now apply interval censoring. I’m going to assume observations were only made at annual intervals, i.e. every year, so any events will be rounded up to the nearest year and the left interval taken as the previous year. This code is easily adapted to widen or shorten the intervals (feel free to go wild). The left interval will be stored in st1,

. gen st1 = floor(stime) if stime < 5
(590 missing values generated)

. replace stime = st1 + 1 if st1 < 5
(410 real changes made)

merlin identifies interval censored observations through the failure() indicator being coded as a 2, rather than a 1 for exactly observed events or a 0 for right censored observations, so we make that change (as survsim codes all events by default as 1).

. replace event = 2 if event == 1
(410 real changes made)

In this example, I’m assuming all observations are either interval censored events, or right censored. merlin allows any combination of exactly observed events, interval censoring, right censoring and left truncation.

Let’s have a quick look at the dataset (always recommended),

. list stime st1 event in 1/5

     | stime   st1   event |
  1. |     5     .       0 |
  2. |     5     .       0 |
  3. |     5     .       0 |
  4. |     5     4       2 |
  5. |     1     0       2 |

The first three observations are right censored at 5 years, and hence their st1 is missing. Observation 4 is an interval censored event which occurred between 4 and 5 years, and observation 5 is an event which occurred between 0 and 1 years.

Now we are all setup and ready to fit an interval censored survival model with merlin. As an aside, my approach to methods development focuses on general code, in that by adding interval censoring to merlin, it means that it will work with any of the inbuilt survival distributions, including the user-defined ones. However, here I will keep it relatively simple and show how to fit a Royston-Parmar flexible parametric survival model which uses restricted cubic splines to model the baseline log cumulative hazard function, but importantly now allowing for interval censoring.

The important option to focus on is the linterval() option, which lets you pass the variable which contains the left interval time for those observations which are interval censored, which must have their associated event indicator specified as a 2. The right side of the interval is simply that in your outcome variable, stime in our case. I specify my family() to be rp and use 3 degrees of freedom for the baseline log cumulative hazard function (1 df is equivalent to a Weibull - the true model), and add in the failure() and linterval() options,

. merlin  (stime trt , family(rp, df(3) failure(event) linterval(st1))) 
variables created: _rcs1_1 to _rcs1_3

Fitting full model:

Iteration 0:   log likelihood = -1883.2234  
Iteration 1:   log likelihood = -1553.3417  
Iteration 2:   log likelihood = -1323.3832  
Iteration 3:   log likelihood = -1316.8441  
Iteration 4:   log likelihood = -1316.8434  
Iteration 5:   log likelihood = -1316.8434  

Fixed effects regression model                  Number of obs     =      1,000
Log likelihood = -1316.8434
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
stime:       |            
         trt |  -.6211561   .1019953    -6.09   0.000    -.8210632    -.421249
       _cons |  -.6726925   .0631678   -10.65   0.000    -.7964992   -.5488858
    Warning: Baseline spline coefficients not shown - use ml display

Our model fits smoothly and we estimate our log hazard ratio due to treatment (and let’s face it, the syntax is elegant). I really am keeping things simple, since we could assess proportional hazards in our treatment effect (which we should always do…) by including an interaction between a function of time and trt, and we can do any number of other things (non-linear effects, frailties, joint models, etc…)

For comparison, we can also fit the same model using Patrick Royston’s stpm (type findit stpm in Stata) command (this has been superseded by stpm2, but stpm2 does not support interval censoring). stpm requires that for non interval censored observations, the left interval contains the event time, so we make that adjustment first. Our data must also be stset for stpm, so I create a new event indicator called died which is 1 for any events, rather than a 2 as was required for merlin.

. replace st1 = stime if st1==.
(590 real changes made)

. gen died = event>0

. stset stime, f(died)

     failure event:  died != 0 & died < .
obs. time interval:  (0, stime]
 exit on or before:  failure

      1,000  total observations
          0  exclusions
      1,000  observations remaining, representing
        410  failures in single-record/single-failure data
      4,200  total analysis time at risk and under observation
                                                at risk from t =         0
                                     earliest observed entry t =         0
                                          last observed exit t =         5

. stpm trt, scale(h) df(3) left(st1)

initial:       log likelihood = -1317.5787
rescale:       log likelihood = -1317.5787
rescale eq:    log likelihood = -1317.5787
Iteration 0:   log likelihood = -1317.5787  
Iteration 1:   log likelihood = -1316.8442  
Iteration 2:   log likelihood = -1316.8434  
Iteration 3:   log likelihood = -1316.8434  

                                                  Number of obs   =       1000
                                                  Wald chi2(1)    =      37.09
Log likelihood = -1316.8434                       Prob > chi2     =     0.0000

          _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
s0           |
       _cons |   1.164719    .061942    18.80   0.000     1.043315    1.286123
s1           |
       _cons |  -.0349766   .0150964    -2.32   0.021    -.0645651   -.0053882
s2           |
       _cons |  -.0043584   .0091708    -0.48   0.635    -.0223329     .013616
xb           |
         trt |  -.6211561   .1019953    -6.09   0.000    -.8210632    -.421249
       _cons |  -2.242726   .1107818   -20.24   0.000    -2.459854   -2.025597
Deviance =  2633.687 (1000 observations.)

We obtain identical estimates for the effect of treatment, and log-likelihood (which is reassuring). I also did a time comparison between merlin and stpm, and found merlin to be about 10 times faster. There’s a huge amount of extensions that you could fit with merlin, allowing for interval censoring, but hopefully this simple example gives a primer for both data setup and syntax.

This example used merlin version 2.0.2.


comments powered by Disqus