A long overdue update to survsim

survsim version 3.0.0


survsim was, scarily, the first command of my PhD. A look at the version history shows I released the first version on 9th September 2011…which is a rather terrifying amount of time ago. The last updates were back in 2013, which were mainly bug fixes. I think it’s safe to say it’s a useful command - its download numbers from SSC seem perfectly respectable, and the useful commands from a development perspective tend to be the ones you can ignore for a long time, and yet still get emails about.

However, it’s been due some tender love and care for awhile now. And by that, I mean some extensions. At least it started out that way, and ended up a complete re-write of the entire package, predominantly because I can now code a lot better than 9 years ago…

In short, there are three big updates in this release, including:

  1. The ability to simulate directly from a fitted merlin survival model. This extension came directly from my work on the simulation engine in the multistate package (see my post on a major update to it). The main command in there is predictms, whose primary method to calculate predictions is through simulation. predictms takes survival model objects, saved using estimates store, as inputs, and simulates a large number of observations using the estimated coefficients (and VCV), at specified covariate values. The algorithms used in there to generate the survival times, are the same ones used in survsim. What you can do now is simulate directly from a fitted survival model that has been estimated using the merlin command. Previously, survsim could only simulated survival times from a specified distribution() or (log) (cumulative) hazard function, and you passed it variable names and associated values for the log hazard ratios, and time-dependent effects.
  2. The ability to simulate from an arbitray competing risks model, defined by parametric or user-defined cause-specific hazard functions, or combinations of them. This extension came from the inevitable re-write of the whole command. The competing risks implementation was pretty clunky, so I set about generalising and making it much more efficient and reliable.
  3. The ability to simulate from a conditional survival distribution, i.e. allowing for delayed entry/left truncation. This is so far synced with all settings, except simulating from a fitted merlin model.

You can install it using:

net install survsim, from("https://www.mjcrowther.co.uk/code/survsim/")

or it will be up on the SSC archive very soon.

With the introduction of version 3.0.0, survsim has four core settings for the simulation of survival data. There is now a top-level help file, found using help survsim, with further help nested files covering each of the four areas. Let’s go through an example of each:

$~$

1. Simulating survival times from standard parametric distributions


Let’s simulate survival times from a Weibull distribution, with a binary treatment group, trt, and a continuous covariate, age, under proportional hazards:

$$ h(t) = \lambda \gamma t^{\gamma - 1} \exp (\text{trt} \beta_{1} + \text{age} \beta_{2}) $$

I’ll simulate 300 observations, and pick some distributions for the covariates, which should be self-explanatory,

. set obs 300
number of observations (_N) was 0, now 300

. set seed 134987

. gen trt = runiform()>0.5

. gen age = rnormal(50,3)

We then call survsim,

. survsim stime, distribution(weibull) lambda(0.1) gamma(1.2) covariates(trt -0.5 age 0.01)

which stores our simulated survival times in the new variable stime. If we wanted to apply right-censoring, we could first generate some censoring times,

. gen censtime = runiform() * 5

and now add the maxtime() option to survsim, remembering to also specify a second new variable name for the event indicator,

. survsim stime2 died2, distribution(weibull) lambda(0.1) gamma(1.2)     ///
>                       covariates(trt -0.5 age 0.01) maxtime(censtime)

We could also:

  • add time-dependent effects using the tde() option
  • add left-truncation/delayed entry using the ltruncated() option
  • simulate from a 2-component mixture distribition using the mixture option

$~$

2. Simulating survival times from a user-defined (log) (cumulative) hazard function


The most flexible form of simulating survival data with survsim is by specifying a custom hazard or cumulative hazard function, such as:

$$ h(t) = h_{0}(t) \exp (\text{trt} \beta_{1}) $$ where $$ h_{0}(t) = \exp (-1 + 0.02 t - 0.03 t^2 + 0.005 t^3) $$

which can be done, on the loghazard() scale for simplicity, using

. survsim stime3 died3, loghazard(-1 :+ 0.02:*#t :- 0.03:*#t:^2 :+ 0.005:*#t:^3)   ///
>                       maxtime(1.5) covariates(trt -0.5)
Warning: 195 survival times were above the upper limit of maxtime()
         They have been set to maxtime()
         You can identify them by _survsim_rc = 3

with a common right censoring time of 1.5 years specified using maxtime(1.5). We could make the treatment effect diminish over time by,

. survsim stime4 died4, loghazard(-1 :+ 0.02:*#t :- 0.03:*#t:^2 :+ 0.005:*#t:^3)    ///
>                       covariates(trt -0.5) tde(trt 0.03) maxtime(1.5)
Warning: 194 survival times were above the upper limit of maxtime()
         They have been set to maxtime()
         You can identify them by _survsim_rc = 3

which will form an interaction between trt, its coefficient 0.03 and linear time, which is the default. We could instead simulate from a model on the cumulative hazard scale, using the logchazard() option instead.

$~$

3. Simulating survival times from a fitted merlin survival model


We can directly simulate from a fitted model, by passing an estimates object to survsim through the model() option. This is similar to Patrick Royston’s stsurvsim (see SJ paper here). Let’s fit a Weibull model to a standard survival dataset, using the merlin command:

. webuse brcancer, clear
(German breast cancer data)

. stset rectime, f(censrec=1) scale(365)

     failure event:  censrec == 1
obs. time interval:  (0, rectime]
 exit on or before:  failure
    t for analysis:  time/365

------------------------------------------------------------------------------
        686  total observations
          0  exclusions
------------------------------------------------------------------------------
        686  observations remaining, representing
        299  failures in single-record/single-failure data
  2,113.425  total analysis time at risk and under observation
                                                at risk from t =         0
                                     earliest observed entry t =         0
                                          last observed exit t =  7.284932

. merlin (_t hormon , family(weibull, failure(_d)))

Fitting full model:

Iteration 0:   log likelihood = -2113.4247  
Iteration 1:   log likelihood =  -898.2654  
Iteration 2:   log likelihood = -868.11925  
Iteration 3:   log likelihood =  -868.0269  
Iteration 4:   log likelihood = -868.02684  

Mixed effects regression model                  Number of obs     =        686
Log likelihood = -868.02684
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_t:          |            
      hormon |  -.3932405   .1248267    -3.15   0.002    -.6378962   -.1485847
       _cons |  -2.196012   .1094092   -20.07   0.000     -2.41045   -1.981574
  log(gamma) |   .2509974   .0496958     5.05   0.000     .1535953    .3483994
------------------------------------------------------------------------------

I’m using the benefits of stset to declare the survival variables, which I can use directly within merlin for simplicity (see this post on using stset with merlin for more info.)

We then simply store the model object, calling it whatver we like, such as the imaginative name of m1,

. estimates store m1

This we then pass to survsim to simulate a dataset, of the same size, using our fitted results,

. survsim stime5 died5, model(m1) maxtime(7)

The option maxtime() is required in this case. We can then fit the same model as before, and of course get slightly different results, because we have sampling variability.

. stset stime5, failure(died5)

     failure event:  died5 != 0 & died5 < .
obs. time interval:  (0, stime5]
 exit on or before:  failure

------------------------------------------------------------------------------
        686  total observations
          0  exclusions
------------------------------------------------------------------------------
        686  observations remaining, representing
        483  failures in single-record/single-failure data
  3,041.221  total analysis time at risk and under observation
                                                at risk from t =         0
                                     earliest observed entry t =         0
                                          last observed exit t =         7

. merlin (_t hormon , family(weibull, failure(_d)))

Fitting full model:

Iteration 0:   log likelihood = -3041.2211  
Iteration 1:   log likelihood =  -1380.975  
Iteration 2:   log likelihood = -1344.9574  
Iteration 3:   log likelihood = -1344.9459  
Iteration 4:   log likelihood = -1344.9459  

Mixed effects regression model                  Number of obs     =        686
Log likelihood = -1344.9459
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_t:          |            
      hormon |  -.3048995   .0975271    -3.13   0.002    -.4960492   -.1137498
       _cons |  -2.286831   .1059302   -21.59   0.000    -2.494451   -2.079212
  log(gamma) |   .2849967    .040284     7.07   0.000     .2060416    .3639519
------------------------------------------------------------------------------

To be convinced that everything is working, we could simulate a suitably large dataset (to essentially eliminate Monte Carlo error), say with 1,000,000 observations, fit a chosen model, simulate another suitably large dataset from that fitted model, and re-fit the true model. Parameter estimates should come out the same (ish). I leave that to you to try…if you can figure out that sentence.

Note, survsim will simulate the same number of observations as _N, so make sure any covariates in your model are filled up in all rows, otherwise missing values will be produced.

All survival models that merlin can fit, can subsequently be simulated from, using survsim. I think that’s pretty awesome.

$~$

4. Simulating competing risks data from specified cause-specific hazard functions


survsim simulates competing risk data using the cause-specific hazard setting described by Beyersmann et al. (2009), utitlising the general simulation algorithm described by Crowther and Lambert (2013). Let’s simulate from a competing risk model with 2 competing events. The first cause-specific hazard has a Weibull distribution, with no covariates. The second cause-specific hazard model has an exponential distribution, with a beneficial treatment effect. Right censoring is applied at 10 years.

. clear

. set obs 1000
number of observations (_N) was 0, now 1,000

. gen trt = runiform()>0.5

. survsim stime6 event6 , hazard1(dist(weibull) lambda(0.1) gamma(1.2))                ///
>                         hazard2(dist(exponential) lambda(0.01) covariates(trt -0.5)) ///
>                         maxtime(10)
Warning: 212 survival times were above the upper limit of maxtime()
         They have been set to maxtime()
         You can identify them by _survsim_rc = 3

We can see which events occurred with

. tabulate event6

     event6 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        212       21.20       21.20
          1 |        747       74.70       95.90
          2 |         41        4.10      100.00
------------+-----------------------------------
      Total |      1,000      100.00

Now let’s simulate from a competing risk model with 3 competing events. The first cause-specific hazard has a user defined baseline hazard function, with a harmful treatment effect. The second cause-specific hazard model has a Weibull distribition, with a beneficial treatment effect. The third cause-specific hazard has a user-defined baseline hazard function, with an initially beneficial treatment effect that reduces linearly with respect to log time. Right censoring is applied at 3 years. Phew.

. survsim stime7 event7 , hazard1(user(exp(-2 :+ 0.2:* log(#t) :+ 0.1:*#t)) covariates(trt 0.1))  ///
>                         hazard2(dist(weibull) lambda(0.01) gamma(1.3) covariates(trt -0.5))     ///
>                         hazard3(user(0.1 :* #t :^ 1.5) covariates(trt -0.5) tde(trt 0.1)        ///
>                                 tdefunction(log(#t)))                                           ///
>                         maxtime(3)
Warning: 345 survival times were above the upper limit of maxtime()
         They have been set to maxtime()
         You can identify them by _survsim_rc = 3

. tabulate event7

     event7 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        345       34.50       34.50
          1 |        342       34.20       68.70
          2 |         24        2.40       71.10
          3 |        289       28.90      100.00
------------+-----------------------------------
      Total |      1,000      100.00

You can see that this can get as complex as necessary. I currently let you use up to 50 cause-specific hazards, just in case you’re feeling particularly adventurous.

Related

comments powered by Disqus