Reason no. 437 to use merlin: Modelling and prediction with non-linear effects

I promise there are 436 other reasons to use merlin…and more, of course.

In this example I’m going to keep things simple. There will be a single non-linear effect, representing age, and it will impact survival. I’m going to simulate some survival data with a non-linear effect of age, fit some models accounting for the non-linear effect of age, and make predictions for specified values of age. The point of this post is to show you how we would do the last two of the list, first using stpm2, and then with merlin.


Fitting a model with non-linear effects and obtaining predictions

To simulate survival data we use my survsim command. I’ll simulate from this model,

$$h(t) = \lambda \gamma t^{\gamma - 1} \exp (\beta_{1} X + \beta_{2} X^2)$$

i.e. a Weibull baseline, and both age (X) and it’s square impacting survival, under proportional hazards. I choose sensible values for my baseline parameters, and for the coefficients for age and age$^2$. Coding it up, generating a dataset of 1000 observations, simulating age from $N(30,5^2)$, and applying administrative censoring at 10 years, we have:

. clear

. set seed 98798

. set obs 1000
number of observations (_N) was 0, now 1,000

. gen age = rnormal(30,5)

. gen age2 = age^2

. survsim stime died, dist(weib) lambda(0.01) gamma(1.5) /// -baseline hazard-
>                     maxt(10)                           /// -admin. censoring-
>                     cov(age 0.01 age2 0.001)           //  -effect of age and age^2

we stset our data,

. stset stime, failure(died)

     failure event:  died != 0 & died < .
obs. time interval:  (0, stime]
 exit on or before:  failure

      1,000  total observations
          0  exclusions
      1,000  observations remaining, representing
        653  failures in single-record/single-failure data
  6,756.636  total analysis time at risk and under observation
                                                at risk from t =         0
                                     earliest observed entry t =         0
                                          last observed exit t =        10

We can now start fitting models. I’m going to fit Royston-Parmar flexible parametric survival models using the stpm2 command. For arguments sake, I will assume 1 degree of freedom (spline terms) for the baseline hazard (equivalent to a Weibull model). We can of course adjust for age and age squared, but I tend to favour using splines to model non-linear relationships. As with all standard estimation commands in Stata, we have to generate our basis functions (either for splines or fractional polynomials, say), before we fit our model, since the basis functions must exist in our dataset. I will create a three degrees of freedom spline of age (orthogonalising the splines, which often improves convergence), using the rcsgen function,

. rcsgen age, gen(agesp) df(3) orthog
Variables agesp1 to agesp3 were created

This generates my basis functions with the stub name of agesp, and adds them to our dataset,

. list age agesp* in 1/5

     |      age       agesp1       agesp2       agesp3 |
  1. | 31.52036    .31932984     .6334313     .6667398 |
  2. | 34.74015    .96970184   -.07238199    .94019475 |
  3. | 37.66811    1.5611261   -1.1346309   -.09415333 |
  4. | 30.68798    .15119764    .71965446    .34161031 |
  5. | 27.73592   -.44509645    .64863397    -.8348252 |

This way, we can now fit any model, adjusting for the splines of age in the main varlist. With stpm2, we can use

. stpm2 agesp?, scale(h) df(1)

Iteration 0:   log likelihood = -1149.8816  
Iteration 1:   log likelihood = -1149.8319  
Iteration 2:   log likelihood = -1149.8319  

Log likelihood = -1149.8319                     Number of obs     =      1,000

             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
xb           |
      agesp1 |   .4146441   .0408094    10.16   0.000     .3346591     .494629
      agesp2 |  -.0780577   .0408809    -1.91   0.056    -.1581828    .0020673
      agesp3 |   .0452994   .0395197     1.15   0.252    -.0321577    .1227565
       _rcs1 |    1.11596   .0385797    28.93   0.000     1.040346    1.191575
       _cons |  -.7999409   .0445594   -17.95   0.000    -.8872757    -.712606

Now estimates of spline term coefficients are arguably meaningless - we can’t interpret them directly. Hence, we need to be able to obtain predictions/fitted values, from our model. The post-estimation tools in the predict function of stpm2 is one of the most impressive features of stpm2. However, it’s not so easy to obtain predictions for specific covariate patterns, when we have things like spline functions in the model. Let’s say I want to predict the survival function for a patient aged 50…well, the predict function comes with the at() option, which lets the user specify fixed values of variables in your model that you want to predict at. So I want to predict for a patient aged 50…this would be easy if I’d just adjusted for age, we could use at(age 50) - but what is the value of my spline functions, i.e. the basis function variables included in my model, agesp1 agesp2 agesp3, for an age of 50? To get the right numbers we have to go back to rcsgen and start over.

We first need to re-calculate our original basis functions, and this time store the knot locations and the orthogonalisation matrix, in order to calculate the correct values of the basis functions for an age of 50.

. drop agesp*

. rcsgen age, gen(agesp) df(3) orthog
Variables agesp1 to agesp3 were created

. local ageknots `r(knots)'

. matrix agermat = r(R)

I’ve stored the age knots in a local macro, and the orthogonalisation matrix in a matrix called agermat. Now we can calculate the values of our basis functions for an age of 50, safe that we are using the same knot locations and orthog. matrix, so our spline is comparable. Since we only want the basis function values for a single value of age, we can use the scalar(#) option, where # is our age of interest, which means rcsgen then saves the basis function values in scalars, with the stub given in gen(). So,

. rcsgen, scalar(50) gen(predage) knots(`ageknots') rmatrix(agermat)
Scalars predage1 to predage3 were created

gives us the basis function values for an age of 50, stored in scalars predage1 predage2 predage3. We can (finally) now predict our survival function. To evaluate the numbers in the at() statement, I use the shortcut `=' to evaluate the number stored in the scalar.

. predict s1, survival at(agesp1 `=predage1' agesp2 `=predage2' agesp3 `=predage3')

Which gives us our predicted survival function for a patient aged 50.

Phew. Still with me? Let’s see if merlin can make life easier.

One of the differences of merlin to other estimation commands, is in how non-linear effects can be modelled. I should say, we could create our basis functions first, and simply include them in the main varlist, just as with stpm2, but there is an alternative. merlin has element types, which do not just have to be a variable name. One of the types is the rcs() element type, which essentially brings in the rcsgen functionality directly within the model specification. It’s easiest to just see and translate it back to what we fitted above, so in merlin we have,

. merlin (stime   rcs(age, df(3) orthog)      /// main effect of age
>        , family(rp, df(1) failure(died)))   
variables created: _rcs1_1 to _rcs1_1
variables created for model 1, component 1: _cmp_1_1_1 to _cmp_1_1_3

Fitting full model:

Iteration 0:   log likelihood = -2362.2946  
Iteration 1:   log likelihood = -2079.2237  
Iteration 2:   log likelihood = -2073.1993  
Iteration 3:   log likelihood = -2073.1912  
Iteration 4:   log likelihood = -2073.1912  

Mixed effects regression model                  Number of obs     =      1,000
Log likelihood = -2073.1912
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
stime:       |            
     rcs():1 |   .4146476   .0408083    10.16   0.000     .3346649    .4946303
     rcs():2 |   -.078111   .0408836    -1.91   0.056    -.1582414    .0020193
     rcs():3 |    .045297   .0395261     1.15   0.252    -.0321728    .1227667
       _cons |  -.7999446   .0445595   -17.95   0.000    -.8872797   -.7126095
    Warning: Baseline spline coefficients not shown - use ml display

merlin creates the basis functions for you, tells you what they are called, and leaves them behind in the dataset. Importantly, it labels them in the output so you can see what’s what. The elegant thing is that we only have to specify the variable age to be expanded in splines, and as such changing the degrees of freedom etc., is extremely simple to do. No rcsgen necessary.

But it get’s better. Since merlin creates the spline variables internally, when we want predictions at different values of age, the spline gets automatically recreated, and as such, getting a prediction for someone aged 50 is as simple as…

. predict s2, survival at(age 50)

Which is pretty frikking simple. Just to convince you, we can compare our predictions from the two models:

. list s1 s2 in 1/10

     |        s1          s2 |
  1. | .00363344   .00362434 |
  2. | .00694324   .00692784 |
  3. | .71584372   .71573725 |
  4. | .69920439   .69909306 |
  5. | .00144058   .00143638 |
  6. | .47862698    .4784699 |
  7. | .00239235   .00238591 |
  8. | .00144058   .00143638 |
  9. | .08789838   .08780305 |
 10. | .04960264   .04953618 |

There’s some minor differences after the 4th decimal place, but the agreement is pretty good$^{*}$. Now just imagine if you had a second spline…and then a spline-spline interaction…and then time-dependent effects on them… Convinced merlin is easier…? I’ll do a follow-up post with a more complex example soon.

$*$ Looks like the differences are because rcsgen and merlin’s rcs() elements use very slightly different percentiles for the default knot locations. There is also the fact that stpm2 and merlin have some subtle differences under the hood, but this is too boring to write about here.


comments powered by Disqus