I promise there are 436 other reasons to use `merlin`…and more, of course.

In this example I’m going to keep things simple. There will be a single non-linear effect, representing age, and it will impact survival. I’m going to simulate some survival data with a non-linear effect of age, fit some models accounting for the non-linear effect of age, and make predictions for specified values of age. The point of this post is to show you how we would do the last two of the list, first using `stpm2`, and then with `merlin`.


### Fitting a model with non-linear effects and obtaining predictions

To simulate survival data we use my `survsim` command. I’ll simulate from this model,

$$h(t) = \lambda \gamma t^{\gamma - 1} \exp (\beta_{1} X + \beta_{2} X^2)$$

i.e. a Weibull baseline, with both age ($X$) and its square impacting survival, under proportional hazards. I choose sensible values for my baseline parameters, and for the coefficients for age and age$^2$. Coding it up, generating a dataset of 1000 observations, simulating age from $N(30,5^2)$, and applying administrative censoring at 10 years, we have:

```
. clear
. set seed 98798
. set obs 1000
number of observations (_N) was 0, now 1,000
. gen age = rnormal(30,5)
. gen age2 = age^2
. survsim stime died, dist(weib) lambda(0.01) gamma(1.5) /// -baseline hazard-
> maxt(10) /// -admin. censoring-
> cov(age 0.01 age2 0.001) // -effect of age and age^2-
```

We `stset` our data,

```
. stset stime, failure(died)
failure event: died != 0 & died < .
obs. time interval: (0, stime]
exit on or before: failure
------------------------------------------------------------------------------
1,000 total observations
0 exclusions
------------------------------------------------------------------------------
1,000 observations remaining, representing
653 failures in single-record/single-failure data
6,756.636 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 10
```

We can now start fitting models. I’m going to fit Royston-Parmar flexible parametric survival models using the `stpm2` command. For argument’s sake, I will assume 1 degree of freedom (spline terms) for the baseline hazard (equivalent to a Weibull model). We could of course adjust for age and age squared, but I tend to favour using splines to model non-linear relationships. As with all standard estimation commands in Stata, we have to generate our basis functions (for splines or fractional polynomials, say) before we fit our model, since the basis functions must exist in our dataset. I will create a three degrees of freedom spline of age (orthogonalising the splines, which often improves convergence), using the `rcsgen` command,

```
. rcsgen age, gen(agesp) df(3) orthog
Variables agesp1 to agesp3 were created
```

This generates my basis functions with the stub name `agesp`, and adds them to our dataset,

```
. list age agesp* in 1/5
+-------------------------------------------------+
| age agesp1 agesp2 agesp3 |
|-------------------------------------------------|
1. | 31.52036 .31932984 .6334313 .6667398 |
2. | 34.74015 .96970184 -.07238199 .94019475 |
3. | 37.66811 1.5611261 -1.1346309 -.09415333 |
4. | 30.68798 .15119764 .71965446 .34161031 |
5. | 27.73592 -.44509645 .64863397 -.8348252 |
+-------------------------------------------------+
```

This way, we can now fit any model, adjusting for the splines of age in the main `varlist`. With `stpm2`, we can use

```
. stpm2 agesp?, scale(h) df(1)
Iteration 0: log likelihood = -1149.8816
Iteration 1: log likelihood = -1149.8319
Iteration 2: log likelihood = -1149.8319
Log likelihood = -1149.8319 Number of obs = 1,000
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb |
agesp1 | .4146441 .0408094 10.16 0.000 .3346591 .494629
agesp2 | -.0780577 .0408809 -1.91 0.056 -.1581828 .0020673
agesp3 | .0452994 .0395197 1.15 0.252 -.0321577 .1227565
_rcs1 | 1.11596 .0385797 28.93 0.000 1.040346 1.191575
_cons | -.7999409 .0445594 -17.95 0.000 -.8872757 -.712606
------------------------------------------------------------------------------
```

Now, estimates of spline term coefficients are arguably meaningless - we can’t interpret them directly. Hence, we need to be able to obtain predictions/fitted values from our model. The post-estimation `predict` command is one of the most impressive features of `stpm2`. However, it’s not so easy to obtain predictions for specific covariate patterns when we have things like spline functions in the model. Let’s say I want to predict the survival function for a patient aged 50…well, `predict` comes with the `at()` option, which lets the user specify fixed values of variables in the model to predict at. This would be easy if I’d just adjusted for age - we could use `at(age 50)` - but what are the values of my spline basis function variables, `agesp1 agesp2 agesp3`, for an age of 50? To get the right numbers we have to go back to `rcsgen` and start over.

We first need to re-calculate our original basis functions, this time storing the knot locations and the orthogonalisation matrix, so that we can calculate the correct values of the basis functions for an age of 50.

```
. drop agesp*
. rcsgen age, gen(agesp) df(3) orthog
Variables agesp1 to agesp3 were created
. local ageknots `r(knots)'
. matrix agermat = r(R)
```

I’ve stored the age knots in a local macro, and the orthogonalisation matrix in a matrix called `agermat`. Now we can calculate the values of our basis functions for an age of 50, safe in the knowledge that we are using the same knot locations and orthogonalisation matrix, so our spline is comparable. Since we only want the basis function values for a single value of age, we can use the `scalar(#)` option, where `#` is our age of interest; `rcsgen` then saves the basis function values in scalars, with the stub given in `gen()`. So,

```
. rcsgen, scalar(50) gen(predage) knots(`ageknots') rmatrix(agermat)
Scalars predage1 to predage3 were created
```

gives us the basis function values for an age of 50, stored in scalars `predage1`, `predage2` and `predage3`. We can (finally) now predict our survival function. To evaluate the numbers in the `at()` option, I use the `` `=' `` shortcut to evaluate the number stored in each scalar.

```
. predict s1, survival at(agesp1 `=predage1' agesp2 `=predage2' agesp3 `=predage3')
```

Which gives us our predicted survival function for a patient aged 50.
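If you want to see the curve itself, a quick way (a sketch, assuming `s1` was created by the `predict` call above and `_t` is the analysis time variable from `stset`) is to plot the prediction against analysis time:

```stata
// Sketch: plot the predicted survival curve for a patient aged 50
// (assumes s1 exists from the predict call above, and _t comes from stset)
line s1 _t, sort ytitle("S(t) for age 50") xtitle("Years since entry")
```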

Phew. Still with me? Let’s see if `merlin` can make life easier.

One of the differences between `merlin` and other estimation commands is in how non-linear effects *can* be modelled. I should say, we could create our basis functions first and simply include them in the main `varlist`, just as with `stpm2`, but there is an alternative. `merlin` has element types, which do not have to be just a variable name. One of these is the `rcs()` element type, which essentially brings the `rcsgen` functionality directly into the model specification. It’s easiest to just see it and translate it back to what we fitted above, so in `merlin` we have,

```
. merlin (stime rcs(age, df(3) orthog) /// main effect of age
> , family(rp, df(1) failure(died)))
variables created: _rcs1_1 to _rcs1_1
variables created for model 1, component 1: _cmp_1_1_1 to _cmp_1_1_3
Fitting full model:
Iteration 0: log likelihood = -2362.2946
Iteration 1: log likelihood = -2079.2237
Iteration 2: log likelihood = -2073.1993
Iteration 3: log likelihood = -2073.1912
Iteration 4: log likelihood = -2073.1912
Mixed effects regression model Number of obs = 1,000
Log likelihood = -2073.1912
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
stime: |
rcs():1 | .4146476 .0408083 10.16 0.000 .3346649 .4946303
rcs():2 | -.078111 .0408836 -1.91 0.056 -.1582414 .0020193
rcs():3 | .045297 .0395261 1.15 0.252 -.0321728 .1227667
_cons | -.7999446 .0445595 -17.95 0.000 -.8872797 -.7126095
------------------------------------------------------------------------------
Warning: Baseline spline coefficients not shown - use ml display
```

`merlin` creates the basis functions for you, tells you what they are called, and leaves them behind in the dataset. Importantly, it labels them in the output so you can see what’s what. The elegant thing is that we only have to specify the variable `age` to be expanded in splines, and as such changing the degrees of freedom, etc., is extremely simple to do. No `rcsgen` necessary.
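For example, moving to a five degrees of freedom spline for age is a one-character change to the model call (a sketch, not run above; everything else stays the same):

```stata
// Sketch: same model as above, but with 5 spline df for age -
// no need to regenerate any basis variables by hand
merlin (stime rcs(age, df(5) orthog) ///
    , family(rp, df(1) failure(died)))
```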

But it gets better. Since `merlin` creates the spline variables internally, when we want predictions at different values of `age` the spline gets automatically recreated, and as such, getting a prediction for someone aged 50 is as simple as…

```
. predict s2, survival at(age 50)
```

Which is pretty frikking simple. Just to convince you, we can compare our predictions from the two models:

```
. list s1 s2 in 1/10
+-----------------------+
| s1 s2 |
|-----------------------|
1. | .00363344 .00362434 |
2. | .00694324 .00692784 |
3. | .71584372 .71573725 |
4. | .69920439 .69909306 |
5. | .00144058 .00143638 |
|-----------------------|
6. | .47862698 .4784699 |
7. | .00239235 .00238591 |
8. | .00144058 .00143638 |
9. | .08789838 .08780305 |
10. | .04960264 .04953618 |
+-----------------------+
```

There are some minor differences after the 4th decimal place, but the agreement is pretty good$^{*}$. Now just imagine if you had a second spline…and then a spline-spline interaction…and then time-dependent effects on them… Convinced `merlin` is easier…? I’ll do a follow-up post with a more complex example soon.

$*$ It looks like the differences arise because `rcsgen` and `merlin`’s `rcs()` element use very slightly different percentiles for the default knot locations. There is also the fact that `stpm2` and `merlin` have some subtle differences under the hood, but that’s too boring to write about here.