The do file with all the code from this tutorial is available to download here

Interval censoring occurs when we only know an event occurred within a particular time window, rather than the exact time at which it happened. Such data is common in opthalmology and dentistry, where events are only picked up at scheduled appointments, but they actually occurred at some point since the previous visit. In this example, I will first show how to simulate interval censored survival times, and then show how to use `merlin`

to fit an interval censored flexible parametric survival model. There are two obvious ways (could be more…) to simulate interval censored survival times - either under a discrete time setting, assuming particular event probabilities within time windows, or a continuous time setting. I will focus on the second, which I would always argue is more biologically plausible, and also as I wish to fit a continuous time survival model.

First we’ll simulate a dataset of 1000 observations, and include a binary variable which we will assume is a treatment group indicator coded 0 for control, and 1 for active treatment. I’ll assume allocation to each group is 50% at random. And of course I set a seed for reproducibility (we can’t be friends if you don’t do that).

```
. clear
. set seed 72549
. set obs 1000
number of observations (_N) was 0, now 1,000
. gen trt = runiform()>0.5
```

To keep things nice and simple I will use my `survsim`

Stata command to simulate survival times from a Weibull distribution with shape and scale parameters, 1.2 and 0.1. I’ll assume a log hazard ratio due to treatment of -0.5. Let’s apply administrative censoring at 5 years.

```
. survsim stime event , dist(weib) lambda(0.1) gamma(1.2) covariates(trt -0.5) maxt(5)
```

I’ve simulated continuous, exactly observed survival times, which is what we would observe in an ideal world. Sadly, this isn’t always the case, and we may only observe events at specific times, such as scheduled doctor appointments. We will now apply interval censoring. I’m going to assume observations were only made at annual intervals, i.e. every year, so any events will be rounded up to the nearest year and the left interval taken as the previous year. This code is easily adapted to widen or shorten the intervals (feel free to go wild). The left interval will be stored in `st1`

,

```
. gen st1 = floor(stime) if stime < 5
(590 missing values generated)
. replace stime = st1 + 1 if st1 < 5
(410 real changes made)
```

`merlin`

identifies interval censored observations through the `failure()`

indicator being coded as a 2, rather than a 1 for exactly observed events or a 0 for right censored observations, so we make that change (as `survsim`

codes all events by default as 1).

```
. replace event = 2 if event == 1
(410 real changes made)
```

In this example, I’m assuming all observations are either interval censored events, or right censored. `merlin`

allows any combination of exactly observed events, interval censoring, right censoring and left truncation.

Let’s have a quick look at the dataset (always recommended),

```
. list stime st1 event in 1/5
+---------------------+
| stime st1 event |
|---------------------|
1. | 5 . 0 |
2. | 5 . 0 |
3. | 5 . 0 |
4. | 5 4 2 |
5. | 1 0 2 |
+---------------------+
```

The first three observations are right censored at 5 years, and hence their `st1`

is missing. Observation 4 is an interval censored event which occurred between 4 and 5 years, and observation 5 is an event which occurred between 0 and 1 years.

Now we are all setup and ready to fit an interval censored survival model with `merlin`

. As an aside, my approach to methods development focuses on general code, in that by adding interval censoring to `merlin`

, it means that it will work with any of the inbuilt survival distributions, including the user-defined ones. However, here I will keep it relatively simple and show how to fit a Royston-Parmar flexible parametric survival model which uses restricted cubic splines to model the baseline log cumulative hazard function, but importantly now allowing for interval censoring.

The important option to focus on is the `linterval()`

option, which lets you pass the variable which contains the left interval time for those observations which are interval censored, which must have their associated event indicator specified as a 2. The right side of the interval is simply that in your outcome variable, `stime`

in our case. I specify my `family()`

to be `rp`

and use 3 degrees of freedom for the baseline log cumulative hazard function (1 df is equivalent to a Weibull - the true model), and add in the `failure()`

and `linterval()`

options,

```
. merlin (stime trt , family(rp, df(3) failure(event) linterval(st1)))
variables created: _rcs1_1 to _rcs1_3
Fitting full model:
Iteration 0: log likelihood = -1883.2234
Iteration 1: log likelihood = -1553.3417
Iteration 2: log likelihood = -1323.3832
Iteration 3: log likelihood = -1316.8441
Iteration 4: log likelihood = -1316.8434
Iteration 5: log likelihood = -1316.8434
Fixed effects regression model Number of obs = 1,000
Log likelihood = -1316.8434
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
stime: |
trt | -.6211561 .1019953 -6.09 0.000 -.8210632 -.421249
_cons | -.6726925 .0631678 -10.65 0.000 -.7964992 -.5488858
------------------------------------------------------------------------------
Warning: Baseline spline coefficients not shown - use ml display
```

Our model fits smoothly and we estimate our log hazard ratio due to treatment (and let’s face it, the syntax is elegant). I really am keeping things simple, since we could assess proportional hazards in our treatment effect (which we should always do…) by including an interaction between a function of time and `trt`

, and we can do any number of other things (non-linear effects, frailties, joint models, etc…)

For comparison, we can also fit the same model using Patrick Royston’s `stpm`

(type `findit stpm`

in `Stata`

) command (this has been superseded by `stpm2`

, but `stpm2`

does not support interval censoring). `stpm`

requires that for non interval censored observations, the left interval contains the event time, so we make that adjustment first. Our data must also be `stset`

for `stpm`

, so I create a new event indicator called `died`

which is 1 for any events, rather than a 2 as was required for `merlin`

.

```
. replace st1 = stime if st1==.
(590 real changes made)
. gen died = event>0
. stset stime, f(died)
failure event: died != 0 & died < .
obs. time interval: (0, stime]
exit on or before: failure
------------------------------------------------------------------------------
1,000 total observations
0 exclusions
------------------------------------------------------------------------------
1,000 observations remaining, representing
410 failures in single-record/single-failure data
4,200 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 5
. stpm trt, scale(h) df(3) left(st1)
initial: log likelihood = -1317.5787
rescale: log likelihood = -1317.5787
rescale eq: log likelihood = -1317.5787
Iteration 0: log likelihood = -1317.5787
Iteration 1: log likelihood = -1316.8442
Iteration 2: log likelihood = -1316.8434
Iteration 3: log likelihood = -1316.8434
Number of obs = 1000
Wald chi2(1) = 37.09
Log likelihood = -1316.8434 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
_t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
s0 |
_cons | 1.164719 .061942 18.80 0.000 1.043315 1.286123
-------------+----------------------------------------------------------------
s1 |
_cons | -.0349766 .0150964 -2.32 0.021 -.0645651 -.0053882
-------------+----------------------------------------------------------------
s2 |
_cons | -.0043584 .0091708 -0.48 0.635 -.0223329 .013616
-------------+----------------------------------------------------------------
xb |
trt | -.6211561 .1019953 -6.09 0.000 -.8210632 -.421249
_cons | -2.242726 .1107818 -20.24 0.000 -2.459854 -2.025597
------------------------------------------------------------------------------
Deviance = 2633.687 (1000 observations.)
```

We obtain identical estimates for the effect of treatment, and log-likelihood (which is reassuring). I also did a time comparison between `merlin`

and `stpm`

, and found `merlin`

to be about 10 times faster. There’s a huge amount of extensions that you could fit with `merlin`

, allowing for interval censoring, but hopefully this simple example gives a primer for both data setup and syntax.

This example used `merlin`

version 2.0.2.