merlin version 2

I’ve just pushed merlin version 2 to this website. There lots of new features and improvements, all documented in the full development version history, but below I’ll take you briefly through some of the headlines.

In case you missed it, merlin has recently been published in the Stata Journal (and yes, it’s now already out of date):

Crowther MJ. merlin - a unified framework for data analysis and methods development in Stata. Stata Journal 2020;20(4):763-784. (Pre-print: https://arxiv.org/abs/1806.01615).

$~$


merlin version 2.0.0 is now released


The development version is now up on my website, and you can install it by typing:

net install merlin, from("https://www.mjcrowther.co.uk/code/merlin")

It’ll be on SSC very soon as well.

Here’s what’s new…

Speed gains

merlin has always been slow. From its inception, it used a gf0 evaluator, which meant both the score and Hessian were calculated by Stata’s inbuilt numerical derivative engine (see deriv() in Mata). This meant many (many) calls to the log likelihood function, tweaking the parameter estimates to obtain the score and Hessian elements through finite differences. All well and good, but slow. With version 2, I’ve started to address this. If you fit a univariate or multivariate model with only fixed effects (no random effects), then the most commonly used families now use a gf2 evaluator, and hence can be in the region of 10 to hundreds of times faster, depending on the model specification. Importantly, they will also be on a par with setting-specific commands, such as streg for parametric survival models, and stpm2 for flexible parametric survival models. Examples:

  • 100,000 observations, 3 covariates, Royston-Parmar model with 3 degrees of freedom:
    • merlin 1.15.3: 28.2 seconds
    • merlin 2.0.0: 1.3 seconds
    • stmerlin (using merlin 2.0.0): 1.7 second
    • stpm2: 1.7 seconds

Ironically, in this case merlin is faster than stmerlin - this won’t usually be the case. stmerlin fits a Weibull model first for bettng starting values, but it’s a simple model, so that actually adds more comp. time than it saves. Similar with stpm2, it fits a Cox model first.

  • 100,000 observations, 3 covariates, Weibull model:
    • merlin 1.15.3: 5.2 seconds
    • merlin 2.0.0: 0.87 seconds
    • streg: 0.45 seconds

The details:

  • The following models now use a gf2 evaluator (analytic score and Hessian), resulting in substantial speed gains:
    • Univariate or multivariate (stacked/stratified) fixed effects models:
      • family(rp)
      • family(loghazard)
      • family(hazard)
      • family(exponential)
      • family(weibull)
      • family(cox)
      • family(logchazard)
      • family(gaussian)
      • family(poisson)
      • family(pwexponential)
      • family(gompertz)
      • note, this applies to all models in stmerlin
      • survival models maintain this speed gain even in the presence of complex time-time interactions, due to an internal chain rule function
      • note, if parameters are linked across models in a multivariate model, then gf0 will be used
    • The following univariate models will use a gf1 evaluator (analytic score only):
      • family(rp) with interval-censoring or as a relative survival model (bhazard() option)
      • family(loghazard) with interval-censoring
      • family(exponential) with interval-censoring
      • family(weibull) with interval-censoring
      • family(pwexponential) with interval-censoring
      • family(gompertz) with interval-censoring
    • The following univariate models will still use a gf0 evaluator:
      • family(exponential) with bhazard() i.e. a relative survival model
      • family(weibull) with bhazard() i.e. a relative survival model
      • family(pwexponential) with bhazard() i.e. a relative survival model
      • family(gompertz) with bhazard() i.e. a relative survival model

Predict-orama

In terms of new features, the predict command has inadvertently stolen the show. I’ve added:

  • standardised/study population-averaged predictions are now available through the standardise option of predict. This specifies that the predicted statistic be computed marginally with respect to the independent variables, implemented by calculating the statistic separately for all observed covariate patterns, and taking the average. standardise can be used in combination with at(), or at1() and at2() to obtain estimates of marginal estimands. Currently only available with fixed effects models.
  • predict, totalsurvival added to allow prediction of the all-cause survival function, following the estimation of a competing risks survival model.
  • conditional survival and cumulative incidence predictions are now available through the ltruncated() option of predict. This allows prediction of survival or cumulative incidence function, conditional on being event free at user-defined times > 0.
  • predict, reffects and predict, reses added to calculate cluster-specific posterior means (empirical Bayes estimates) of the random effects, and their standard errors, respectively. Currently only available for two-level models.
  • predict, fitted added to calculate cluster-specific predictions after fitting a two-level model. Predictions include both estimates of the fixed effects and posterior means of the random effects.
  • predict, userfunction() added to allow a user-defined prediction to be obtained, in the form of a Mata function, utilising merlin’s utility functions.

A new model

family(hazard, ...) has been added to define a general additive hazard model. Random effects are not supported yet. That’s it really.

A multilevelled behavioural change

This was one of my biggest annoyances that should’ve been fixed a long time ago. The following model

merlin (y rcs(time, df(3)) fp(time, powers(1 2))#M2[id]@1 M1[id]@1, family(gaussian))

would previously create only 2 random effects - each of the fp(time, powers(1 2)) basis functions would be interacted with the same single random effect M2[id]. Now, behaviour is much more expected. All interactions will be formed first, and then each term will have its own randon effect. So in the above example, M2a[id] and M2b[id] will be created.

. clear

. set obs 1000
number of observations (_N) was 0, now 1,000

. gen id = _n

. gen b0 = rnormal()

. gen b1 = rnormal()

. gen b2 = rnormal()

. expand 5
(4,000 observations created)

. bys id: gen time = _n-1

. gen y = rnormal((5+b0) - (0.1 + b1)*time + (0.2+b2)*time^2, 1)

. merlin (y rcs(time, df(3)) fp(time, powers(1 2))#M2[id]@1 M1[id]@1, family(gaussian))
variables created for model 1, component 1: _cmp_1_1_1 to _cmp_1_1_3
variables created for model 1, component 2: _cmp_1_2_1 to _cmp_1_2_2

Fitting fixed effects model:

Fitting full model:

Iteration 0:   log likelihood = -16936.285  
Iteration 1:   log likelihood = -11475.904  (not concave)
Iteration 2:   log likelihood = -11401.162  (not concave)
Iteration 3:   log likelihood = -11251.369  
Iteration 4:   log likelihood = -11115.513  
Iteration 5:   log likelihood = -11008.786  
Iteration 6:   log likelihood = -10984.951  
Iteration 7:   log likelihood = -10984.836  
Iteration 8:   log likelihood = -10984.836  

Mixed effects regression model                  Number of obs     =      5,000
Log likelihood = -10984.836
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y:           |            
     rcs():1 |   .0504887   .0633136     0.80   0.425    -.0736038    .1745812
     rcs():2 |  -.0539352   .0277131    -1.95   0.052     -.108252    .0003815
     rcs():3 |  -.0864406   .0277131    -3.12   0.002    -.1407573   -.0321239
fp()#M2[id~1 |          1          .        .       .            .           .
fp()#M2[id~2 |          1          .        .       .            .           .
      M1[id] |          1          .        .       .            .           .
       _cons |   4.913935   .0440763   111.49   0.000     4.827547    5.000323
  sd(resid.) |    .996021   .0145346                      .9679374     1.02492
-------------+----------------------------------------------------------------
id:          |            
      sd(M1) |   .9837806   .0365473                      .9146949    1.058084
     sd(M2a) |   .9265701    .043276                      .8455171    1.015393
     sd(M2b) |   1.014224   .0233859                       .969409    1.061111
------------------------------------------------------------------------------

User-friendly-er

Regardless of how complex a model you can fit with merlin, I think one of the biggest contributions is its syntax for specifying non-linear effects directly within your model statement, rather than having to derive spline or fractional polynomial terms first. This benefit grows even more so when wanting to predict from such a model. See this post for more.

In this release, I now let you use the rcs() and fp() elements (and others), when fitting a model with stmerlin, which makes life a whole lot simpler.

. clear

. set obs 1000
number of observations (_N) was 0, now 1,000

. gen age = rnormal(50,5)

. gen age2 = age^2

. gen female = runiform()>0.5

. gen bmi = 15+20*runiform()

. survsim stime died, dist(weibull) lambda(0.1) gamma(1.2) maxtime(10) ///
>                     covariates(age 0.01 age2 -0.0002 female -0.5 bmi 0.02)

. stset stime, failure(died)

     failure event:  died != 0 & died < .
obs. time interval:  (0, stime]
 exit on or before:  failure

------------------------------------------------------------------------------
      1,000  total observations
          0  exclusions
------------------------------------------------------------------------------
      1,000  observations remaining, representing
        855  failures in single-record/single-failure data
  4,679.541  total analysis time at risk and under observation
                                                at risk from t =         0
                                     earliest observed entry t =         0
                                          last observed exit t =        10

. stmerlin female rcs(age, df(3)) bmi, distribution(cox)
Obtaining initial values

variables created for model 1, component 2: _cmp_1_2_1 to _cmp_1_2_3

Fitting full model:

Iteration 0:   log likelihood = -5300.7523  
Iteration 1:   log likelihood = -5300.7518  

Survival model                                  Number of obs     =      1,000
Log likelihood = -5300.7518
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_t:          |            
      female |  -.4703461    .069257    -6.79   0.000    -.6060873   -.3346048
     rcs():1 |   .0056165   .0368823     0.15   0.879    -.0666715    .0779044
     rcs():2 |   .0003498   .0005052     0.69   0.489    -.0006404    .0013401
     rcs():3 |  -.0003274    .000489    -0.67   0.503    -.0012859    .0006311
         bmi |   .0190829   .0059862     3.19   0.001       .00735    .0308157
------------------------------------------------------------------------------

. //predict survival for a male aged 35 (without having to re-derive any spline variables!) and with bmi of 30,
. predict s1, survival at(age 35 bmi 30) zeros

. //predict standardised survival for a female - averaging over the observed age and bmi distributions,
. predict s2, survival at(female 1) standardise

The never-ending desire for certification…

Certification has improved a lot with version 2, but merlin is worked on as part of my academic time, i.e. for my research, so it comes with no warranty etc., and hence the amount of time I spend on certification is never going to be as much as I would like. So, there will be bugs. Please test, report bugs, and I’ll fix them. Simple.

This post used survsim version 4.0.4 and merlin version 2.0.0.

Related

comments powered by Disqus