I’ve just pushed
merlin
version 2 to this website. There lots of new features and improvements, all documented in the full development version history, but below I’ll take you briefly through some of the headlines.In case you missed it,
merlin
has recently been published in the Stata Journal (and yes, it’s now already out of date):Crowther MJ. merlin - a unified framework for data analysis and methods development in Stata. Stata Journal 2020;20(4):763-784. (Pre-print: https://arxiv.org/abs/1806.01615).
$~$
merlin
version 2.0.0 is now released
The development version is now up on my website, and you can install it by typing:
net install merlin, from("https://www.mjcrowther.co.uk/code/merlin")
It’ll be on SSC very soon as well.
Here’s what’s new…
Speed gains
merlin
has always been slow. From its inception, it used a gf0
evaluator, which meant both the score and Hessian were calculated by Stata’s inbuilt numerical derivative engine (see deriv()
in Mata
). This meant many (many) calls to the log likelihood function, tweaking the parameter estimates to obtain the score and Hessian elements through finite differences. All well and good, but slow. With version 2, I’ve started to address this. If you fit a univariate or multivariate model with only fixed effects (no random effects), then the most commonly used families now use a gf2
evaluator, and hence can be in the region of 10 to hundreds of times faster, depending on the model specification. Importantly, they will also be on a par with setting-specific commands, such as streg
for parametric survival models, and stpm2
for flexible parametric survival models. Examples:
- 100,000 observations, 3 covariates, Royston-Parmar model with 3 degrees of freedom:
merlin
1.15.3: 28.2 secondsmerlin
2.0.0: 1.3 secondsstmerlin
(using merlin 2.0.0): 1.7 secondstpm2
: 1.7 seconds
Ironically, in this case merlin
is faster than stmerlin
- this won’t usually be the case. stmerlin
fits a Weibull model first for bettng starting values, but it’s a simple model, so that actually adds more comp. time than it saves. Similar with stpm2
, it fits a Cox model first.
- 100,000 observations, 3 covariates, Weibull model:
merlin
1.15.3: 5.2 secondsmerlin
2.0.0: 0.87 secondsstreg
: 0.45 seconds
The details:
- The following models now use a
gf2
evaluator (analytic score and Hessian), resulting in substantial speed gains:- Univariate or multivariate (stacked/stratified) fixed effects models:
family(rp)
family(loghazard)
family(hazard)
family(exponential)
family(weibull)
family(cox)
family(logchazard)
family(gaussian)
family(poisson)
family(pwexponential)
family(gompertz)
- note, this applies to all models in
stmerlin
- survival models maintain this speed gain even in the presence of complex time-time interactions, due to an internal chain rule function
- note, if parameters are linked across models in a multivariate model, then
gf0
will be used
- The following univariate models will use a
gf1
evaluator (analytic score only):family(rp)
with interval-censoring or as a relative survival model (bhazard() option)family(loghazard)
with interval-censoringfamily(exponential)
with interval-censoringfamily(weibull)
with interval-censoringfamily(pwexponential)
with interval-censoringfamily(gompertz)
with interval-censoring
- The following univariate models will still use a gf0 evaluator:
family(exponential)
withbhazard()
i.e. a relative survival modelfamily(weibull)
withbhazard()
i.e. a relative survival modelfamily(pwexponential)
withbhazard()
i.e. a relative survival modelfamily(gompertz)
withbhazard()
i.e. a relative survival model
- Univariate or multivariate (stacked/stratified) fixed effects models:
Predict-orama
In terms of new features, the predict
command has inadvertently stolen the show. I’ve added:
- standardised/study population-averaged predictions are now available through the
standardise
option ofpredict
. This specifies that the predicted statistic be computed marginally with respect to the independent variables, implemented by calculating the statistic separately for all observed covariate patterns, and taking the average. standardise can be used in combination with at(), or at1() and at2() to obtain estimates of marginal estimands. Currently only available with fixed effects models. predict, totalsurvival
added to allow prediction of the all-cause survival function, following the estimation of a competing risks survival model.- conditional survival and cumulative incidence predictions are now available through the
ltruncated()
option of predict. This allows prediction of survival or cumulative incidence function, conditional on being event free at user-defined times > 0. predict, reffects
andpredict, reses
added to calculate cluster-specific posterior means (empirical Bayes estimates) of the random effects, and their standard errors, respectively. Currently only available for two-level models.predict, fitted
added to calculate cluster-specific predictions after fitting a two-level model. Predictions include both estimates of the fixed effects and posterior means of the random effects.predict, userfunction()
added to allow a user-defined prediction to be obtained, in the form of aMata
function, utilisingmerlin
’s utility functions.
A new model
family(hazard, ...)
has been added to define a general additive hazard model. Random effects are not supported yet. That’s it really.
A multilevelled behavioural change
This was one of my biggest annoyances that should’ve been fixed a long time ago. The following model
merlin (y rcs(time, df(3)) fp(time, powers(1 2))#M2[id]@1 M1[id]@1, family(gaussian))
would previously create only 2 random effects - each of the fp(time, powers(1 2))
basis functions would be interacted with the same single random effect M2[id]
. Now, behaviour is much more expected. All interactions will be formed first, and then each term will have its own randon effect. So in the above example, M2a[id]
and M2b[id]
will be created.
. clear
. set obs 1000
number of observations (_N) was 0, now 1,000
. gen id = _n
. gen b0 = rnormal()
. gen b1 = rnormal()
. gen b2 = rnormal()
. expand 5
(4,000 observations created)
. bys id: gen time = _n-1
. gen y = rnormal((5+b0) - (0.1 + b1)*time + (0.2+b2)*time^2, 1)
. merlin (y rcs(time, df(3)) fp(time, powers(1 2))#M2[id]@1 M1[id]@1, family(gaussian))
variables created for model 1, component 1: _cmp_1_1_1 to _cmp_1_1_3
variables created for model 1, component 2: _cmp_1_2_1 to _cmp_1_2_2
Fitting fixed effects model:
Fitting full model:
Iteration 0: log likelihood = -16936.285
Iteration 1: log likelihood = -11475.904 (not concave)
Iteration 2: log likelihood = -11401.162 (not concave)
Iteration 3: log likelihood = -11251.369
Iteration 4: log likelihood = -11115.513
Iteration 5: log likelihood = -11008.786
Iteration 6: log likelihood = -10984.951
Iteration 7: log likelihood = -10984.836
Iteration 8: log likelihood = -10984.836
Mixed effects regression model Number of obs = 5,000
Log likelihood = -10984.836
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
y: |
rcs():1 | .0504887 .0633136 0.80 0.425 -.0736038 .1745812
rcs():2 | -.0539352 .0277131 -1.95 0.052 -.108252 .0003815
rcs():3 | -.0864406 .0277131 -3.12 0.002 -.1407573 -.0321239
fp()#M2[id~1 | 1 . . . . .
fp()#M2[id~2 | 1 . . . . .
M1[id] | 1 . . . . .
_cons | 4.913935 .0440763 111.49 0.000 4.827547 5.000323
sd(resid.) | .996021 .0145346 .9679374 1.02492
-------------+----------------------------------------------------------------
id: |
sd(M1) | .9837806 .0365473 .9146949 1.058084
sd(M2a) | .9265701 .043276 .8455171 1.015393
sd(M2b) | 1.014224 .0233859 .969409 1.061111
------------------------------------------------------------------------------
User-friendly-er
Regardless of how complex a model you can fit with merlin
, I think one of the biggest contributions is its syntax for specifying non-linear effects directly within your model statement, rather than having to derive spline or fractional polynomial terms first. This benefit grows even more so when wanting to predict from such a model. See this post for more.
In this release, I now let you use the rcs()
and fp()
elements (and others), when fitting a model with stmerlin
, which makes life a whole lot simpler.
. clear
. set obs 1000
number of observations (_N) was 0, now 1,000
. gen age = rnormal(50,5)
. gen age2 = age^2
. gen female = runiform()>0.5
. gen bmi = 15+20*runiform()
. survsim stime died, dist(weibull) lambda(0.1) gamma(1.2) maxtime(10) ///
> covariates(age 0.01 age2 -0.0002 female -0.5 bmi 0.02)
. stset stime, failure(died)
failure event: died != 0 & died < .
obs. time interval: (0, stime]
exit on or before: failure
------------------------------------------------------------------------------
1,000 total observations
0 exclusions
------------------------------------------------------------------------------
1,000 observations remaining, representing
855 failures in single-record/single-failure data
4,679.541 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 10
. stmerlin female rcs(age, df(3)) bmi, distribution(cox)
Obtaining initial values
variables created for model 1, component 2: _cmp_1_2_1 to _cmp_1_2_3
Fitting full model:
Iteration 0: log likelihood = -5300.7523
Iteration 1: log likelihood = -5300.7518
Survival model Number of obs = 1,000
Log likelihood = -5300.7518
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_t: |
female | -.4703461 .069257 -6.79 0.000 -.6060873 -.3346048
rcs():1 | .0056165 .0368823 0.15 0.879 -.0666715 .0779044
rcs():2 | .0003498 .0005052 0.69 0.489 -.0006404 .0013401
rcs():3 | -.0003274 .000489 -0.67 0.503 -.0012859 .0006311
bmi | .0190829 .0059862 3.19 0.001 .00735 .0308157
------------------------------------------------------------------------------
. //predict survival for a male aged 35 (without having to re-derive any spline variables!) and with bmi of 30,
. predict s1, survival at(age 35 bmi 30) zeros
. //predict standardised survival for a female - averaging over the observed age and bmi distributions,
. predict s2, survival at(female 1) standardise
The never-ending desire for certification…
Certification has improved a lot with version 2, but merlin
is worked on as part of my academic time, i.e. for my research, so it comes with no warranty etc., and hence the amount of time I spend on certification is never going to be as much as I would like. So, there will be bugs. Please test, report bugs, and I’ll fix them. Simple.
This post used survsim
version 4.0.4 and merlin
version 2.0.0.