# Using merlin for survival analysis in Stata

merlin provides many different survival models for the analysis of time to event data; however, it differs in one core thing compared to other survival analysis commands in Stata - it does not require you to stset your data. As such, it doesn’t use the internal _t _t0 _d _st variables, created by stset. This is both good and bad.

Let’s see why.

$~$

### Survival analysis in Stata

Let’s simulate some data (I think I hide it well that I favour simulated data than real…I prefer a simple life). So let’s assume a Weibull baseline, and a binary treatment group, with a beneficial treatment reducing the mortality rate due to a hazard ratio of 0.607. My hazard rate is as follows,

$$h(t) = \lambda \gamma t^{\gamma - 1} \exp (\beta_{1} X)$$

where $X$ is my binary treatment group variable, with 50% allocated to each arm, and $\beta_{1} = -0.5$ is the associated log hazard ratio. survsim does this nicely (subtle compliment of my own command there),

. clear

. set seed 98798                                         // reproducibility

. set obs 1000                                           // sample size
number of observations (_N) was 0, now 1,000

. gen trt = runiform()>0.5                               // binary treatment variable

. survsim stime died, dist(weib) lambda(0.01) gamma(1.5) /// baseline hazard
>                     covariates(trt -0.5)               //  effect of treatment



So I have my survival times in stime, which are either the event time or the censoring time, which is reflected in whether my event indicator, died, has been coded a 1 or a 0, respectively. We’re doing survival analysis in Stata, so the natural next step is to stset our data.

. stset stime, failure(died)

failure event:  died != 0 & died < .
obs. time interval:  (0, stime]
exit on or before:  failure

------------------------------------------------------------------------------
1,000  total observations
0  exclusions
------------------------------------------------------------------------------
1,000  observations remaining, representing
219  failures in single-record/single-failure data
9,079.168  total analysis time at risk and under observation
at risk from t =         0
earliest observed entry t =         0
last observed exit t =        10



Now stset is awesome. It provides extensive testing and validation of survival time data, preventing a multitude of mistakes, and allowing all sorts of flexibility including efficient handling of dates. Any st command in Stata requires you to stset your data, hence once it has been stset, Stata can trust the input. The stset above is relatively simple (deliberately - see above on a simple life), but regardless of that, stset creates the same four variables,

. list _t0 _t _d _st in 1/5

+----------------------------+
| _t0          _t   _d   _st |
|----------------------------|
1. |   0          10    0     1 |
2. |   0   .27256038    1     1 |
3. |   0          10    0     1 |
4. |   0          10    0     1 |
5. |   0          10    0     1 |
+----------------------------+


• _t0 the time at which each observation becomes at risk
• _t the failure or censoring time
• _d the event indicator (1 for an event, 0 for censored)
• _st the sample indicator - whether each observation should be used in subsequent commands, coded 1 for yes, 0 for no

Once we have these internal variables, it’s almost trivial to create Kaplan-Meier curves, fit survival models, estimate rates etc., since we don’t have to specify the outcome variables. Of course, we can re-stset at any time if we want to make any changes to our sample.

So, for example, to a fit a Weibull model, adjusting for treatment, we use

. streg trt, distribution(weibull) nohr

failure _d:  died
analysis time _t:  stime

Fitting constant-only model:

Iteration 0:   log likelihood = -702.64756
Iteration 1:   log likelihood = -693.68358
Iteration 2:   log likelihood = -693.51296
Iteration 3:   log likelihood =  -693.5129

Fitting full model:

Iteration 0:   log likelihood =  -693.5129
Iteration 1:   log likelihood = -684.37417
Iteration 2:   log likelihood = -684.17429
Iteration 3:   log likelihood =  -684.1742
Iteration 4:   log likelihood =  -684.1742

Weibull PH regression

No. of subjects =        1,000                  Number of obs    =       1,000
No. of failures =          219
Time at risk    =  9079.167592
LR chi2(1)       =       18.68
Log likelihood  =    -684.1742                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------
_t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
trt |  -.5919618   .1393104    -4.25   0.000    -.8650052   -.3189185
_cons |  -4.249019   .2170205   -19.58   0.000    -4.674371   -3.823667
-------------+----------------------------------------------------------------
/ln_p |   .3008501   .0657086     4.58   0.000     .1720636    .4296366
-------------+----------------------------------------------------------------
p |   1.351007   .0887728                      1.187753    1.536699
1/p |   .7401887   .0486368                      .6507455    .8419256
------------------------------------------------------------------------------



and we simply don’t have to worry about our survival times, our entry times, or our event indicator.

I can’t say the same for merlin.

### Using merlin with st data

merlin is not a survival analysis command…but merlin is also a survival analysis command. With the first statement I mean that merlin does not require your dataset to be stset and hence it won’t care if you’ve stset anything. By the second statement I mean that merlin can fit a wide range of survival models (and a whole lot more), but you need to tell it the outcome variables manually. This is because merlin caters for a lot more than just survival models, so it has to cover a lot of other outcome types and syntax combinations. Fitting a Weibull model with merlin is still pretty simple though. Ignoring our stset data, we can fit the same$^*$ Weibull model as follows,

. merlin (stime trt, family(weibull, failure(died)))

Fitting full model:

Iteration 0:   log likelihood = -9079.1676
Iteration 1:   log likelihood = -1116.9133  (not concave)
Iteration 2:   log likelihood = -1033.9946
Iteration 3:   log likelihood = -1016.8322
Iteration 4:   log likelihood = -1016.2367
Iteration 5:   log likelihood = -1016.2285
Iteration 6:   log likelihood = -1016.2285

Mixed effects regression model                  Number of obs     =      1,000
Log likelihood = -1016.2285
------------------------------------------------------------------------------
|      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
stime:       |
trt |  -.5919619   .1393104    -4.25   0.000    -.8650053   -.3189186
_cons |  -4.249018   .2170208   -19.58   0.000    -4.674371   -3.823665
log(gamma) |   .3008499   .0657087     4.58   0.000     .1720632    .4296366
------------------------------------------------------------------------------



Now the first variable of any model in merlin must be your outcome, and in the survival case it’s our survival/censoring time variable stime. Since we have censoring, we also need to tell it the name of the failure indicator variable through the failure() option within family(), hence including died in there.

So we can do survival analysis in merlin. However, as I said above, stset is awesome for many reasons, so why not make the best of both worlds. The point of this post has probably occured to you by now…let’s stset our data, and simply use the _* variables in our merlin model specification. In other words, our above merlin model becomes,

. merlin (_t trt, family(weibull, failure(_d)))

Fitting full model:

Iteration 0:   log likelihood = -9079.1676
Iteration 1:   log likelihood = -1116.9133  (not concave)
Iteration 2:   log likelihood = -1033.9946
Iteration 3:   log likelihood = -1016.8322
Iteration 4:   log likelihood = -1016.2367
Iteration 5:   log likelihood = -1016.2285
Iteration 6:   log likelihood = -1016.2285

Mixed effects regression model                  Number of obs     =      1,000
Log likelihood = -1016.2285
------------------------------------------------------------------------------
|      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_t:          |
trt |  -.5919619   .1393104    -4.25   0.000    -.8650053   -.3189186
_cons |  -4.249018   .2170208   -19.58   0.000    -4.674371   -3.823665
log(gamma) |   .3008499   .0657087     4.58   0.000     .1720632    .4296366
------------------------------------------------------------------------------



which is of course identical to the above model, but we get some reassurance from the use of the _* variables because they’ve made it out the other side of stset. Now I didn’t use all of them, since I had no delayed entry in my model (i.e. _t0 = 0 for all observations) we didn’t need _t0, but we of course can use it anyway through the ltruncated() option,

. merlin (_t trt, family(weibull, failure(_d) ltruncated(_t0)))

Fitting full model:

Iteration 0:   log likelihood = -9079.1676
Iteration 1:   log likelihood = -1116.9133  (not concave)
Iteration 2:   log likelihood = -1033.9946
Iteration 3:   log likelihood = -1016.8322
Iteration 4:   log likelihood = -1016.2367
Iteration 5:   log likelihood = -1016.2285
Iteration 6:   log likelihood = -1016.2285

Mixed effects regression model                  Number of obs     =      1,000
Log likelihood = -1016.2285
------------------------------------------------------------------------------
|      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_t:          |
trt |  -.5919619   .1393104    -4.25   0.000    -.8650053   -.3189186
_cons |  -4.249018   .2170208   -19.58   0.000    -4.674371   -3.823665
log(gamma) |   .3008499   .0657087     4.58   0.000     .1720632    .4296366
------------------------------------------------------------------------------



Finally, the sample indicator _st variable also wasn’t needed because all observations were used, i.e. there were no exclusions reported in the stset above, but to make sure we cover that in the case that there were issues, our full model specification is,

. merlin (_t trt if _st==1, family(weibull, failure(_d) ltruncated(_t0)))

Fitting full model:

Iteration 0:   log likelihood = -9079.1676
Iteration 1:   log likelihood = -1116.9133  (not concave)
Iteration 2:   log likelihood = -1033.9946
Iteration 3:   log likelihood = -1016.8322
Iteration 4:   log likelihood = -1016.2367
Iteration 5:   log likelihood = -1016.2285
Iteration 6:   log likelihood = -1016.2285

Mixed effects regression model                  Number of obs     =      1,000
Log likelihood = -1016.2285
------------------------------------------------------------------------------
|      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_t:          |
trt |  -.5919619   .1393104    -4.25   0.000    -.8650053   -.3189186
_cons |  -4.249018   .2170208   -19.58   0.000    -4.674371   -3.823665
log(gamma) |   .3008499   .0657087     4.58   0.000     .1720632    .4296366
------------------------------------------------------------------------------



which now binds together the benefits of both commands.

Let’s take it one step further to really show the potential benefits of this combined approach. Often we want to restrict follow-up for a certain time period, especially in observational studies. In our simulated data, let’s say we want to restrict follow-up to the first 5 years. Well if I was using my stime and died variables directly in merlin, I’d have to do some data manipulation.

. gen stime5 = stime if stime <= 5
(915 missing values generated)

. replace stime5 = 5 if stime > 5

. gen died5 = died if stime <= 5
(915 missing values generated)

. replace died5 = 0 if stime > 5



I’ve had to create new survival time and event indicator variables to account for the imposed censoring at 5 years, which is easy to screw up. We can now call merlin again, remembering to use the new variables,

. merlin (stime5 trt, family(weibull, failure(died5)))

Fitting full model:

Iteration 0:   log likelihood = -4798.4847
Iteration 1:   log likelihood = -446.47651
Iteration 2:   log likelihood = -442.35172
Iteration 3:   log likelihood = -425.88576
Iteration 4:   log likelihood = -424.66484
Iteration 5:   log likelihood = -424.66277
Iteration 6:   log likelihood = -424.66277

Mixed effects regression model                  Number of obs     =      1,000
Log likelihood = -424.66277
------------------------------------------------------------------------------
|      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
stime5:      |
trt |  -.5064117    .222564    -2.28   0.023    -.9426291   -.0701943
_cons |  -3.994034    .235494   -16.96   0.000    -4.455594   -3.532475
log(gamma) |   .1121091   .1072638     1.05   0.296    -.0981242    .3223423
------------------------------------------------------------------------------



Now if we’d stuck with stset, restricting follow-up is trivial by using the exit() option,

. stset stime, failure(died) exit(time 5)

failure event:  died != 0 & died < .
obs. time interval:  (0, stime]
exit on or before:  time 5

------------------------------------------------------------------------------
1,000  total observations
0  exclusions
------------------------------------------------------------------------------
1,000  observations remaining, representing
85  failures in single-record/single-failure data
4,798.485  total analysis time at risk and under observation
at risk from t =         0
earliest observed entry t =         0
last observed exit t =         5



and of course since it still creates the appropriate _* variables, our combined merlin model using the _* variables from above (going back to the simplest one), doesn’t need to change at all:

. merlin (_t trt, family(weibull, failure(_d)))

Fitting full model:

Iteration 0:   log likelihood = -4798.4847
Iteration 1:   log likelihood = -446.47651
Iteration 2:   log likelihood = -433.80837
Iteration 3:   log likelihood = -429.72891
Iteration 4:   log likelihood =  -424.7134
Iteration 5:   log likelihood = -424.66281
Iteration 6:   log likelihood = -424.66277

Mixed effects regression model                  Number of obs     =      1,000
Log likelihood = -424.66277
------------------------------------------------------------------------------
|      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_t:          |
trt |  -.5064117   .2225639    -2.28   0.023     -.942629   -.0701943
_cons |  -3.994033   .2354939   -16.96   0.000    -4.455593   -3.532474
log(gamma) |   .1121088   .1072638     1.05   0.296    -.0981244     .322342
------------------------------------------------------------------------------



which gives us a pretty robust and flexible approach for survival analysis with merlin.

$^*$ Note, the log likelihoods disagree between streg and merlin because streg adds log(_t) to the log hazard function contribution of events, and merlin does not.