`merlin` provides many different survival models for the analysis of time-to-event data; however, it differs in one core way from other survival analysis commands in Stata: it does not require you to `stset` your data. As such, it doesn't use the internal `_t _t0 _d _st` variables created by `stset`. This is both good and bad.

Let's see why.


### Survival analysis in Stata

Let's simulate some data (I think I hide it well that I favour simulated data over real…I prefer a simple life). So let's assume a Weibull baseline hazard and a binary treatment group, with a beneficial treatment reducing the mortality rate with a hazard ratio of 0.607. My hazard rate is as follows,

$$h(t) = \lambda \gamma t^{\gamma - 1} \exp (\beta_{1} X)$$

where $X$ is my binary treatment group variable, with 50% allocated to each arm, and $\beta_{1} = -0.5$ is the associated log hazard ratio. `survsim` does this nicely (subtle compliment of my own command there),

```
. clear
. set seed 98798 // reproducibility
. set obs 1000 // sample size
number of observations (_N) was 0, now 1,000
. gen trt = runiform()>0.5 // binary treatment variable
. survsim stime died, dist(weib) lambda(0.01) gamma(1.5) /// baseline hazard
> maxt(10) /// admin. censoring
> covariates(trt -0.5) // effect of treatment
```
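As a cross-check outside Stata (this is not `survsim` itself, just a sketch of the standard inversion method it is based on), survival times from this hazard can be drawn by inverting the cumulative hazard $H(t) = \lambda t^{\gamma} \exp(\beta_{1} X)$, giving $T = \left(-\log U / (\lambda \exp(\beta_{1} X))\right)^{1/\gamma}$ for $U \sim \text{Uniform}(0,1)$:

```python
import numpy as np

rng = np.random.default_rng(98798)
n = 1_000
lam, gamma, beta = 0.01, 1.5, -0.5

# binary treatment, roughly 50% per arm
trt = (rng.uniform(size=n) > 0.5).astype(int)

# invert the cumulative hazard H(t) = lam * t**gamma * exp(beta * x):
# T = (-log(U) / (lam * exp(beta * x))) ** (1 / gamma)
u = rng.uniform(size=n)
t = (-np.log(u) / (lam * np.exp(beta * trt))) ** (1 / gamma)

# administrative censoring at 10 years, as in maxt(10)
stime = np.minimum(t, 10.0)
died = (t <= 10.0).astype(int)
```

With these parameter values roughly a fifth of observations experience the event before 10 years, in line with the 219 failures reported by `stset` below.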

So I have my survival times in `stime`, which are either the event time or the censoring time, reflected in whether my event indicator, `died`, has been coded 1 or 0, respectively. We're doing survival analysis in Stata, so the natural next step is to `stset` our data.

```
. stset stime, failure(died)
failure event: died != 0 & died < .
obs. time interval: (0, stime]
exit on or before: failure
------------------------------------------------------------------------------
1,000 total observations
0 exclusions
------------------------------------------------------------------------------
1,000 observations remaining, representing
219 failures in single-record/single-failure data
9,079.168 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 10
```

Now `stset` is awesome. It provides extensive testing and validation of survival time data, preventing a multitude of mistakes, and allowing all sorts of flexibility, including efficient handling of dates. Any `st` command in Stata requires you to `stset` your data, so once it has been `stset`, Stata can trust the input. The `stset` above is relatively simple (deliberately; see above on a simple life), but regardless, `stset` creates the same four variables,

```
. list _t0 _t _d _st in 1/5
+----------------------------+
| _t0 _t _d _st |
|----------------------------|
1. | 0 10 0 1 |
2. | 0 .27256038 1 1 |
3. | 0 10 0 1 |
4. | 0 10 0 1 |
5. | 0 10 0 1 |
+----------------------------+
```

- `_t0`: the time at which each observation becomes at risk
- `_t`: the failure or censoring time
- `_d`: the event indicator (1 for an event, 0 for censored)
- `_st`: the sample indicator, i.e. whether each observation should be used in subsequent commands (1 for yes, 0 for no)

Once we have these internal variables, it's almost trivial to create Kaplan-Meier curves, fit survival models, estimate rates, etc., since we don't have to specify the outcome variables. Of course, we can re-`stset` at any time if we want to make any changes to our sample.

So, for example, to fit a Weibull model, adjusting for treatment, we use

```
. streg trt, distribution(weibull) nohr
failure _d: died
analysis time _t: stime
Fitting constant-only model:
Iteration 0: log likelihood = -702.64756
Iteration 1: log likelihood = -693.68358
Iteration 2: log likelihood = -693.51296
Iteration 3: log likelihood = -693.5129
Fitting full model:
Iteration 0: log likelihood = -693.5129
Iteration 1: log likelihood = -684.37417
Iteration 2: log likelihood = -684.17429
Iteration 3: log likelihood = -684.1742
Iteration 4: log likelihood = -684.1742
Weibull PH regression
No. of subjects = 1,000 Number of obs = 1,000
No. of failures = 219
Time at risk = 9079.167592
LR chi2(1) = 18.68
Log likelihood = -684.1742 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
_t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
trt | -.5919618 .1393104 -4.25 0.000 -.8650052 -.3189185
_cons | -4.249019 .2170205 -19.58 0.000 -4.674371 -3.823667
-------------+----------------------------------------------------------------
/ln_p | .3008501 .0657086 4.58 0.000 .1720636 .4296366
-------------+----------------------------------------------------------------
p | 1.351007 .0887728 1.187753 1.536699
1/p | .7401887 .0486368 .6507455 .8419256
------------------------------------------------------------------------------
```

and we simply don't have to worry about our survival times, our entry times, or our event indicator.

I can't say the same for `merlin`.
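As an aside, since I used `nohr`, the `trt` coefficient above is a log hazard ratio; exponentiating recovers the hazard ratio and its confidence interval. A quick check, done in Python rather than Stata:

```python
import math

# reported log hazard ratio and 95% CI limits from the streg output above
b, lo, hi = -0.5919618, -0.8650052, -0.3189185

hr = math.exp(b)                   # hazard ratio, approx. 0.553
ci = (math.exp(lo), math.exp(hi))  # approx. (0.421, 0.727)
```

which agrees with the hazard ratio of 0.607 used in the simulation, within sampling error.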

### Using `merlin` with `st` data

`merlin` is not a survival analysis command…but `merlin` is also a survival analysis command. By the first statement I mean that `merlin` does not require your dataset to be `stset`, and hence it won't care if you've `stset` anything. By the second statement I mean that `merlin` can fit a wide range of survival models (and a whole lot more), but you need to tell it the outcome variables manually. This is because `merlin` caters for a lot more than just survival models, so it has to cover a lot of other outcome types and syntax combinations. Fitting a Weibull model with `merlin` is still pretty simple, though. Ignoring our `stset` data, we can fit the same$^*$ Weibull model as follows,

```
. merlin (stime trt, family(weibull, failure(died)))
Fitting full model:
Iteration 0: log likelihood = -9079.1676
Iteration 1: log likelihood = -1116.9133 (not concave)
Iteration 2: log likelihood = -1033.9946
Iteration 3: log likelihood = -1016.8322
Iteration 4: log likelihood = -1016.2367
Iteration 5: log likelihood = -1016.2285
Iteration 6: log likelihood = -1016.2285
Mixed effects regression model Number of obs = 1,000
Log likelihood = -1016.2285
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
stime: |
trt | -.5919619 .1393104 -4.25 0.000 -.8650053 -.3189186
_cons | -4.249018 .2170208 -19.58 0.000 -4.674371 -3.823665
log(gamma) | .3008499 .0657087 4.58 0.000 .1720632 .4296366
------------------------------------------------------------------------------
```

Now, the first variable of any model in `merlin` must be your outcome, and in the survival case it's our survival/censoring time variable, `stime`. Since we have censoring, we also need to tell it the name of the failure indicator variable through the `failure()` option within `family()`, hence including `died` in there.

So we can do survival analysis in `merlin`. However, as I said above, `stset` is awesome for many reasons, so why not make the best of both worlds? The point of this post has probably occurred to you by now…let's `stset` our data, and simply use the `_*` variables in our `merlin` model specification. In other words, our above `merlin` model becomes,

```
. merlin (_t trt, family(weibull, failure(_d)))
Fitting full model:
Iteration 0: log likelihood = -9079.1676
Iteration 1: log likelihood = -1116.9133 (not concave)
Iteration 2: log likelihood = -1033.9946
Iteration 3: log likelihood = -1016.8322
Iteration 4: log likelihood = -1016.2367
Iteration 5: log likelihood = -1016.2285
Iteration 6: log likelihood = -1016.2285
Mixed effects regression model Number of obs = 1,000
Log likelihood = -1016.2285
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_t: |
trt | -.5919619 .1393104 -4.25 0.000 -.8650053 -.3189186
_cons | -4.249018 .2170208 -19.58 0.000 -4.674371 -3.823665
log(gamma) | .3008499 .0657087 4.58 0.000 .1720632 .4296366
------------------------------------------------------------------------------
```

which is of course identical to the above model, but we get some reassurance from the use of the `_*` variables because they've made it out the other side of `stset`. Now, I didn't use all of them: since I had no delayed entry in my model (i.e. `_t0 = 0` for all observations), we didn't need `_t0`, but we can of course use it anyway through the `ltruncated()` option,

```
. merlin (_t trt, family(weibull, failure(_d) ltruncated(_t0)))
Fitting full model:
Iteration 0: log likelihood = -9079.1676
Iteration 1: log likelihood = -1116.9133 (not concave)
Iteration 2: log likelihood = -1033.9946
Iteration 3: log likelihood = -1016.8322
Iteration 4: log likelihood = -1016.2367
Iteration 5: log likelihood = -1016.2285
Iteration 6: log likelihood = -1016.2285
Mixed effects regression model Number of obs = 1,000
Log likelihood = -1016.2285
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_t: |
trt | -.5919619 .1393104 -4.25 0.000 -.8650053 -.3189186
_cons | -4.249018 .2170208 -19.58 0.000 -4.674371 -3.823665
log(gamma) | .3008499 .0657087 4.58 0.000 .1720632 .4296366
------------------------------------------------------------------------------
```

Finally, the sample indicator variable `_st` also wasn't needed, because all observations were used, i.e. there were no exclusions reported in the `stset` above. But to cover the case where there were, our full model specification is,

```
. merlin (_t trt if _st==1, family(weibull, failure(_d) ltruncated(_t0)))
Fitting full model:
Iteration 0: log likelihood = -9079.1676
Iteration 1: log likelihood = -1116.9133 (not concave)
Iteration 2: log likelihood = -1033.9946
Iteration 3: log likelihood = -1016.8322
Iteration 4: log likelihood = -1016.2367
Iteration 5: log likelihood = -1016.2285
Iteration 6: log likelihood = -1016.2285
Mixed effects regression model Number of obs = 1,000
Log likelihood = -1016.2285
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_t: |
trt | -.5919619 .1393104 -4.25 0.000 -.8650053 -.3189186
_cons | -4.249018 .2170208 -19.58 0.000 -4.674371 -3.823665
log(gamma) | .3008499 .0657087 4.58 0.000 .1720632 .4296366
------------------------------------------------------------------------------
```

which now binds together the benefits of both commands.

Let's take it one step further, to really show the potential benefits of this combined approach. Often we want to restrict follow-up to a certain time period, especially in observational studies. In our simulated data, let's say we want to restrict follow-up to the first 5 years. Well, if I was using my `stime` and `died` variables directly in `merlin`, I'd have to do some data manipulation.

```
. gen stime5 = stime if stime <= 5
(915 missing values generated)
. replace stime5 = 5 if stime > 5
(915 real changes made)
. gen died5 = died if stime <= 5
(915 missing values generated)
. replace died5 = 0 if stime > 5
(915 real changes made)
```
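The rule being applied here is simple: cap the time at 5, and zero the event indicator for anyone whose original time was beyond 5. A minimal sketch of the same transformation in Python (with hypothetical values, not the post's dataset), just to make the rule explicit:

```python
import numpy as np

# hypothetical (stime, died) pairs, not the simulated dataset from the post
tau = 5.0
stime = np.array([0.27, 4.9, 6.3, 10.0])
died = np.array([1, 0, 1, 0])

stime5 = np.minimum(stime, tau)          # cap follow-up at tau
died5 = np.where(stime <= tau, died, 0)  # events after tau become censored

# -> stime5 = [0.27, 4.9, 5.0, 5.0], died5 = [1, 0, 0, 0]
```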

I've had to create new survival time and event indicator variables to account for the imposed censoring at 5 years, which is easy to screw up. We can now call `merlin` again, remembering to use the new variables,
```
. merlin (stime5 trt, family(weibull, failure(died5)))
Fitting full model:
Iteration 0: log likelihood = -4798.4847
Iteration 1: log likelihood = -446.47651
Iteration 2: log likelihood = -442.35172
Iteration 3: log likelihood = -425.88576
Iteration 4: log likelihood = -424.66484
Iteration 5: log likelihood = -424.66277
Iteration 6: log likelihood = -424.66277
Mixed effects regression model Number of obs = 1,000
Log likelihood = -424.66277
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
stime5: |
trt | -.5064117 .222564 -2.28 0.023 -.9426291 -.0701943
_cons | -3.994034 .235494 -16.96 0.000 -4.455594 -3.532475
log(gamma) | .1121091 .1072638 1.05 0.296 -.0981242 .3223423
------------------------------------------------------------------------------
```

Now, if we'd stuck with `stset`, restricting follow-up is trivial using the `exit()` option,

```
. stset stime, failure(died) exit(time 5)
failure event: died != 0 & died < .
obs. time interval: (0, stime]
exit on or before: time 5
------------------------------------------------------------------------------
1,000 total observations
0 exclusions
------------------------------------------------------------------------------
1,000 observations remaining, representing
85 failures in single-record/single-failure data
4,798.485 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 5
```

and of course, since it still creates the appropriate `_*` variables, our combined `merlin` model using the `_*` variables from above (going back to the simplest one) doesn't need to change at all:

```
. merlin (_t trt, family(weibull, failure(_d)))
Fitting full model:
Iteration 0: log likelihood = -4798.4847
Iteration 1: log likelihood = -446.47651
Iteration 2: log likelihood = -433.80837
Iteration 3: log likelihood = -429.72891
Iteration 4: log likelihood = -424.7134
Iteration 5: log likelihood = -424.66281
Iteration 6: log likelihood = -424.66277
Mixed effects regression model Number of obs = 1,000
Log likelihood = -424.66277
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_t: |
trt | -.5064117 .2225639 -2.28 0.023 -.942629 -.0701943
_cons | -3.994033 .2354939 -16.96 0.000 -4.455593 -3.532474
log(gamma) | .1121088 .1072638 1.05 0.296 -.0981244 .322342
------------------------------------------------------------------------------
```

which gives us a pretty robust and flexible approach for survival analysis with `merlin`.

$^*$ Note, the log likelihoods disagree between `streg` and `merlin` because `streg` adds log(`_t`) to the log hazard function contribution of events, and `merlin` does not.
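To make the footnote concrete: under that description, the two log likelihoods differ by exactly $\sum_i d_i \log t_i$, which does not involve the model parameters, so the estimates are unaffected. A sketch in Python (on freshly simulated data, evaluating both versions at the true parameter values):

```python
import numpy as np

# simulate Weibull PH data, as in the post
rng = np.random.default_rng(42)
n = 200
lam, gamma, beta = 0.01, 1.5, -0.5
x = (rng.uniform(size=n) > 0.5).astype(int)
t = (-np.log(rng.uniform(size=n)) / (lam * np.exp(beta * x))) ** (1 / gamma)
d = (t <= 10.0).astype(int)
t = np.minimum(t, 10.0)

# Weibull PH: log h(t) = log(lam) + log(gamma) + (gamma - 1) log(t) + beta x
#             H(t)     = lam * t**gamma * exp(beta x)
log_h = np.log(lam) + np.log(gamma) + (gamma - 1) * np.log(t) + beta * x
H = lam * t ** gamma * np.exp(beta * x)

ll_merlin = np.sum(d * log_h - H)                 # merlin-style contribution
ll_streg = np.sum(d * (log_h + np.log(t)) - H)    # log(t) added for events

diff = ll_streg - ll_merlin  # equals sum of log(t) over events
```

In the output above, this offset is $-684.17 - (-1016.23) \approx 332.05$ for the full-follow-up model, constant in the parameters, which is why the coefficients and standard errors agree to the reported precision.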