`survsim`

version 3.0.0

`survsim` was, scarily, the first command of my PhD. A look at the version history shows I released the first version on 9th September 2011…which is a rather terrifying amount of time ago. The last updates were back in 2013, which were mainly bug fixes. I think it’s safe to say it’s a useful command - its download numbers from SSC seem perfectly respectable, and the useful commands, from a development perspective, tend to be the ones you can ignore for a long time and yet still get emails about.

However, it’s been due some tender love and care for a while now. And by that, I mean some extensions. At least it started out that way, and ended up as a complete re-write of the entire package, predominantly because I can now code a lot better than I could 9 years ago…

In short, there are three big updates in this release:

- The ability to simulate directly from a fitted `merlin` survival model. This extension came directly from my work on the simulation engine in the `multistate` package (see my post on a major update to it). The main command in there is `predictms`, whose primary method to calculate predictions is through simulation. `predictms` takes survival model objects, saved using `estimates store`, as inputs, and simulates a large number of observations using the estimated coefficients (and VCV), at specified covariate values. The algorithms used in there to generate the survival times are the same ones used in `survsim`. What you can do now is simulate directly from a fitted survival model that has been estimated using the `merlin` command. Previously, `survsim` could only simulate survival times from a specified `distribution()` or (log) (cumulative) hazard function, and you passed it variable names and associated values for the log hazard ratios and time-dependent effects.
- The ability to simulate from an arbitrary competing risks model, defined by parametric or user-defined cause-specific hazard functions, or combinations of them. This extension came from the inevitable re-write of the whole command. The competing risks implementation was pretty clunky, so I set about generalising it and making it much more efficient and reliable.
- The ability to simulate from a conditional survival distribution, i.e. allowing for delayed entry/left truncation. This is so far synced with all settings, except simulating from a fitted `merlin` model.

You can install it using:

```
net install survsim, from("https://www.mjcrowther.co.uk/code/survsim/")
```

or it will be up on the SSC archive very soon.

With the introduction of version 3.0.0, `survsim` has four core settings for the simulation of survival data. There is now a top-level help file, found using `help survsim`, with further nested help files covering each of the four areas. Let’s go through an example of each:

$~$

#### 1. Simulating survival times from standard parametric distributions

Let’s simulate survival times from a Weibull distribution, with a binary treatment group, `trt`, and a continuous covariate, `age`, under proportional hazards:

$$ h(t) = \lambda \gamma t^{\gamma - 1} \exp (\text{trt} \beta_{1} + \text{age} \beta_{2}) $$

I’ll simulate 300 observations, and pick some distributions for the covariates, which should be self-explanatory,

```
. set obs 300
number of observations (_N) was 0, now 300
. set seed 134987
. gen trt = runiform()>0.5
. gen age = rnormal(50,3)
```

We then call `survsim`,

```
. survsim stime, distribution(weibull) lambda(0.1) gamma(1.2) covariates(trt -0.5 age 0.01)
```

which stores our simulated survival times in the new variable `stime`. If we wanted to apply right-censoring, we could first generate some censoring times,

```
. gen censtime = runiform() * 5
```

and now add the `maxtime()` option to `survsim`, remembering to also specify a second new variable name for the event indicator,

```
. survsim stime2 died2, distribution(weibull) lambda(0.1) gamma(1.2) ///
> covariates(trt -0.5 age 0.01) maxtime(censtime)
```

We could also:

- add time-dependent effects using the `tde()` option
- add left-truncation/delayed entry using the `ltruncated()` option
- simulate from a 2-component mixture distribution using the `mixture` option
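For intuition on what happens under the hood in this setting: under proportional hazards, the Weibull survival function is $S(t) = \exp(-\lambda e^{X\beta} t^{\gamma})$, so solving $S(t) = U$ for $U \sim \text{Uniform}(0,1)$ gives survival times in closed form. Here is a minimal Python sketch of the same example - an illustration of the underlying maths, not the `survsim` implementation:

```python
import numpy as np

rng = np.random.default_rng(134987)
n = 300

# covariates, mirroring the Stata example above
trt = (rng.uniform(size=n) > 0.5).astype(float)
age = rng.normal(50, 3, size=n)

# Weibull hazard h(t) = lam * gam * t^(gam-1) * exp(xb)
lam, gam = 0.1, 1.2
xb = -0.5 * trt + 0.01 * age

# inverse transform: solve S(t) = U for t
u = rng.uniform(size=n)
stime = (-np.log(u) / (lam * np.exp(xb))) ** (1 / gam)

# apply right-censoring, as with the maxtime() option
censtime = rng.uniform(size=n) * 5
died = (stime <= censtime).astype(int)
stime_obs = np.minimum(stime, censtime)
```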

$~$

#### 2. Simulating survival times from a user-defined (log) (cumulative) hazard function

The most flexible form of simulating survival data with `survsim` is by specifying a custom hazard or cumulative hazard function, such as:

$$ h(t) = h_{0}(t) \exp (\text{trt} \beta_{1}) $$ where $$ h_{0}(t) = \exp (-1 + 0.02 t - 0.03 t^2 + 0.005 t^3) $$

which can be done, on the `loghazard()` scale for simplicity, using

```
. survsim stime3 died3, loghazard(-1 :+ 0.02:*#t :- 0.03:*#t:^2 :+ 0.005:*#t:^3) ///
> maxtime(1.5) covariates(trt -0.5)
Warning: 195 survival times were above the upper limit of maxtime()
They have been set to maxtime()
You can identify them by _survsim_rc = 3
```

with a common right-censoring time of 1.5 years specified using `maxtime(1.5)`. We could make the treatment effect diminish over time by,

```
. survsim stime4 died4, loghazard(-1 :+ 0.02:*#t :- 0.03:*#t:^2 :+ 0.005:*#t:^3) ///
> covariates(trt -0.5) tde(trt 0.03) maxtime(1.5)
Warning: 194 survival times were above the upper limit of maxtime()
They have been set to maxtime()
You can identify them by _survsim_rc = 3
```

which will form an interaction between `trt`, its coefficient `0.03`, and linear time, which is the default. We could instead simulate from a model on the cumulative hazard scale, using the `logchazard()` option.
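With a user-defined hazard like this there is generally no closed-form inverse, so survival times have to be generated by combining numerical integration of the hazard with root finding, as in the Crowther and Lambert (2013) algorithm mentioned later. A rough Python sketch of that idea for the log hazard above (illustrative only - here a trapezoidal cumulative hazard on a grid plus interpolation stands in for proper quadrature and root finding):

```python
import numpy as np

rng = np.random.default_rng(2021)

# log hazard from the example above: baseline + treatment effect
def loghaz(t, trt):
    return -1 + 0.02*t - 0.03*t**2 + 0.005*t**3 - 0.5*trt

# no closed form for the inverse of H(t), so build H on a fine
# grid by trapezoidal integration and invert it by interpolation
grid = np.linspace(0, 1.5, 2001)          # up to maxtime(1.5)

def draw_time(trt):
    haz = np.exp(loghaz(grid, trt))
    H = np.concatenate(([0.0],
        np.cumsum((haz[1:] + haz[:-1]) / 2 * np.diff(grid))))
    target = -np.log(rng.uniform())       # solve H(t) = -log(U)
    if target > H[-1]:                    # no event by maxtime -> censored
        return 1.5, 0
    return float(np.interp(target, H, grid)), 1

sims = [draw_time(trt) for trt in (rng.uniform(size=200) > 0.5)]
```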

$~$

#### 3. Simulating survival times from a fitted `merlin` survival model

We can directly simulate from a fitted model, by passing an `estimates` object to `survsim` through the `model()` option. This is similar to Patrick Royston’s `stsurvsim` (see the SJ paper here). Let’s fit a Weibull model to a standard survival dataset, using the `merlin` command:

```
. webuse brcancer, clear
(German breast cancer data)
. stset rectime, f(censrec=1) scale(365)
     failure event:  censrec == 1
obs. time interval:  (0, rectime]
 exit on or before:  failure
    t for analysis:  time/365
------------------------------------------------------------------------------
        686  total observations
          0  exclusions
------------------------------------------------------------------------------
        686  observations remaining, representing
        299  failures in single-record/single-failure data
  2,113.425  total analysis time at risk and under observation
                at risk from t =         0
     earliest observed entry t =         0
          last observed exit t =  7.284932
. merlin (_t hormon , family(weibull, failure(_d)))
Fitting full model:
Iteration 0: log likelihood = -2113.4247
Iteration 1: log likelihood = -898.2654
Iteration 2: log likelihood = -868.11925
Iteration 3: log likelihood = -868.0269
Iteration 4: log likelihood = -868.02684
Mixed effects regression model                  Number of obs     =        686
Log likelihood = -868.02684
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_t:          |
      hormon |  -.3932405   .1248267    -3.15   0.002    -.6378962   -.1485847
       _cons |  -2.196012   .1094092   -20.07   0.000     -2.41045   -1.981574
  log(gamma) |   .2509974   .0496958     5.05   0.000     .1535953    .3483994
------------------------------------------------------------------------------
```

I’m using the benefits of `stset` to declare the survival variables, which I can use directly within `merlin` for simplicity (see this post on using `stset` with `merlin` for more info).

We then simply store the model object, calling it whatever we like, such as the imaginative name of `m1`,

```
. estimates store m1
```

This we then pass to `survsim` to simulate a dataset of the same size, using our fitted results,

```
. survsim stime5 died5, model(m1) maxtime(7)
```

The option `maxtime()` is required in this case. We can then fit the same model as before, and of course get slightly different results, because of sampling variability.

```
. stset stime5, failure(died5)
     failure event:  died5 != 0 & died5 < .
obs. time interval:  (0, stime5]
 exit on or before:  failure
------------------------------------------------------------------------------
        686  total observations
          0  exclusions
------------------------------------------------------------------------------
        686  observations remaining, representing
        483  failures in single-record/single-failure data
  3,041.221  total analysis time at risk and under observation
                at risk from t =         0
     earliest observed entry t =         0
          last observed exit t =         7
. merlin (_t hormon , family(weibull, failure(_d)))
Fitting full model:
Iteration 0: log likelihood = -3041.2211
Iteration 1: log likelihood = -1380.975
Iteration 2: log likelihood = -1344.9574
Iteration 3: log likelihood = -1344.9459
Iteration 4: log likelihood = -1344.9459
Mixed effects regression model                  Number of obs     =        686
Log likelihood = -1344.9459
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_t:          |
      hormon |  -.3048995   .0975271    -3.13   0.002    -.4960492   -.1137498
       _cons |  -2.286831   .1059302   -21.59   0.000    -2.494451   -2.079212
  log(gamma) |   .2849967    .040284     7.07   0.000     .2060416    .3639519
------------------------------------------------------------------------------
```
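Conceptually, simulating from the stored model just amounts to plugging the fitted coefficients back into the inverse-transform formula. A hypothetical Python sketch using the estimates from the first fit above - assuming the parameterisation $h(t) = \lambda \gamma t^{\gamma-1} \exp(x\beta)$ with $\lambda = \exp(\texttt{\_cons})$ and $\gamma = \exp(\texttt{log(gamma)})$; note that `survsim` itself works from the stored `estimates` object and the observed covariate rows, not a fresh covariate draw:

```python
import numpy as np

rng = np.random.default_rng(7)

# fitted values read off the merlin output above; the Weibull
# parameterisation assumed here is h(t) = lam*gam*t^(gam-1)*exp(xb),
# with lam = exp(_cons) and gam = exp(log(gamma))
b_hormon, cons, log_gamma = -0.3932405, -2.196012, 0.2509974
lam, gam = np.exp(cons), np.exp(log_gamma)

n = 686
# hypothetical covariate draw for illustration only - survsim
# reuses the observed hormon values in the dataset
hormon = (rng.uniform(size=n) > 0.6).astype(float)

# inverse transform with the fitted coefficients
u = rng.uniform(size=n)
stime = (-np.log(u) / (lam * np.exp(b_hormon * hormon))) ** (1 / gam)

# maxtime(7): administrative censoring at 7 years
died = (stime <= 7).astype(int)
stime = np.minimum(stime, 7.0)
```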

To be convinced that everything is working, we could simulate a suitably large dataset (to essentially eliminate Monte Carlo error), say with 1,000,000 observations, fit a chosen model, simulate another suitably large dataset from that fitted model, and re-fit the true model. Parameter estimates should come out the same (ish). I leave that to you to try…if you can figure out that sentence.

Note, `survsim` will simulate the same number of observations as `_N`, so make sure any covariates in your model are filled in for all rows, otherwise missing values will be produced.

All survival models that `merlin` can fit can subsequently be simulated from using `survsim`. I think that’s pretty awesome.

$~$

#### 4. Simulating competing risks data from specified cause-specific hazard functions

`survsim` simulates competing risks data using the cause-specific hazards approach described by Beyersmann et al. (2009), utilising the general simulation algorithm described by Crowther and Lambert (2013). Let’s simulate from a competing risks model with 2 competing events. The first cause-specific hazard has a Weibull distribution, with no covariates. The second cause-specific hazard has an exponential distribution, with a beneficial treatment effect. Right censoring is applied at 10 years.

```
. clear
. set obs 1000
number of observations (_N) was 0, now 1,000
. gen trt = runiform()>0.5
. survsim stime6 event6 , hazard1(dist(weibull) lambda(0.1) gamma(1.2)) ///
> hazard2(dist(exponential) lambda(0.01) covariates(trt -0.5)) ///
> maxtime(10)
Warning: 212 survival times were above the upper limit of maxtime()
They have been set to maxtime()
You can identify them by _survsim_rc = 3
```

We can see which events occurred with

```
. tabulate event6
     event6 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        212       21.20       21.20
          1 |        747       74.70       95.90
          2 |         41        4.10      100.00
------------+-----------------------------------
      Total |      1,000      100.00
```
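The cause-specific hazards approach works by drawing an event time from the total (all-cause) hazard, then assigning the cause at that time with probability $h_k(t) / \sum_j h_j(t)$. A minimal Python sketch of the two-cause example above - bisection stands in for the root finding, and this is an illustration of the method rather than the actual `survsim` implementation:

```python
import numpy as np

rng = np.random.default_rng(99)
n = 1000
trt = (rng.uniform(size=n) > 0.5).astype(float)

def sim_cr(trt_i):
    # cause 1: Weibull hazard, lambda=0.1, gamma=1.2, no covariates
    h1 = lambda t: 0.1 * 1.2 * t**0.2
    # cause 2: exponential (constant) hazard with treatment effect
    h2 = 0.01 * np.exp(-0.5 * trt_i)
    # draw from the total hazard: solve H(t) = -log(U), where
    # H(t) = 0.1*t^1.2 + h2*t, by bisection
    target = -np.log(rng.uniform())
    lo, hi = 0.0, 1e6
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 0.1 * mid**1.2 + h2 * mid < target:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    if t > 10:                      # maxtime(10) -> right-censored
        return 10.0, 0
    # assign the cause with probability h_k(t) / (h1(t) + h2)
    p1 = h1(t) / (h1(t) + h2)
    return t, 1 if rng.uniform() < p1 else 2

times, events = zip(*(sim_cr(x) for x in trt))
```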

Now let’s simulate from a competing risks model with 3 competing events. The first cause-specific hazard has a user-defined baseline hazard function, with a harmful treatment effect. The second cause-specific hazard has a Weibull distribution, with a beneficial treatment effect. The third cause-specific hazard has a user-defined baseline hazard function, with an initially beneficial treatment effect that diminishes linearly with respect to log time. Right censoring is applied at 3 years. Phew.

```
. survsim stime7 event7 , hazard1(user(exp(-2 :+ 0.2:* log(#t) :+ 0.1:*#t)) covariates(trt 0.1)) ///
> hazard2(dist(weibull) lambda(0.01) gamma(1.3) covariates(trt -0.5)) ///
> hazard3(user(0.1 :* #t :^ 1.5) covariates(trt -0.5) tde(trt 0.1) ///
> tdefunction(log(#t))) ///
> maxtime(3)
Warning: 345 survival times were above the upper limit of maxtime()
They have been set to maxtime()
You can identify them by _survsim_rc = 3
. tabulate event7
     event7 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        345       34.50       34.50
          1 |        342       34.20       68.70
          2 |         24        2.40       71.10
          3 |        289       28.90      100.00
------------+-----------------------------------
      Total |      1,000      100.00
```

You can see that this can get as complex as necessary. I currently let you use up to 50 cause-specific hazards, just in case you’re feeling particularly adventurous.