Recent years have seen substantial growth in data infrastructure, and a concerted drive to improve the availability and quality of electronic health records as a primary source of data for medical research. This project seeks to take advantage of these developments by developing the novel advanced statistical techniques required to analyse such data. The methods will then be applied to answer clinically relevant questions in cardiovascular disease and cancer research, with the potential for substantial impact in both areas and, more widely, across a diverse range of clinical areas.

The first project aims to decompose the pathways of cardiovascular disease, using data from linked electronic health records in the UK, to improve understanding of which risk factors are associated with different outcomes. For example, a patient may begin healthy, then experience a first heart attack, and subsequently a stroke. By fitting a model to each of these transitions from state to state, we can identify important risk factors that could be used to identify patients at increased risk of subsequent cardiovascular events. By modelling the whole profile of a patient, we make the most efficient use of the available data, and will also be able to develop predictions of future events tailored to individual patients. This aspect is crucially important in communicating information to patients, and in ensuring that such information is both understandable and meaningful.
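As a rough illustration of the state-to-state idea, the sketch below lays out patient histories as one row per possible transition and computes a crude rate (events divided by time at risk) for each transition. The three patients, the state names, and the helper functions are hypothetical and invented for this example. A real analysis would fit a regression model with risk factors to each transition rather than a single crude rate.

```python
from collections import defaultdict

# Hypothetical pathway from the example: healthy -> heart attack (mi) -> stroke.
TRANSITIONS = {"healthy": ["mi"], "mi": ["stroke"], "stroke": []}


def to_transition_rows(history, end_of_follow_up):
    """Expand one patient's history into one row per transition at risk.

    history: list of (time, state) pairs sorted by time, starting in 'healthy'.
    """
    rows = []
    for i, (start, state) in enumerate(history):
        if i + 1 < len(history):
            stop, next_state = history[i + 1]
        else:
            # No further event: the patient is censored in the current state.
            stop, next_state = end_of_follow_up, None
        for target in TRANSITIONS[state]:
            rows.append({
                "from": state,
                "to": target,
                "start": start,
                "stop": stop,
                "status": int(target == next_state),  # 1 = transition observed
            })
    return rows


def crude_rates(all_rows):
    """Events per unit of time at risk, separately for each transition."""
    events = defaultdict(int)
    time_at_risk = defaultdict(float)
    for row in all_rows:
        key = (row["from"], row["to"])
        events[key] += row["status"]
        time_at_risk[key] += row["stop"] - row["start"]
    return {k: events[k] / time_at_risk[k] for k in time_at_risk if time_at_risk[k] > 0}


# Invented follow-up data (times in years): (history, end of follow-up).
patients = [
    ([(0.0, "healthy"), (2.0, "mi"), (5.0, "stroke")], 6.0),
    ([(0.0, "healthy"), (4.0, "mi")], 8.0),
    ([(0.0, "healthy")], 10.0),
]
rows = [r for history, end in patients for r in to_transition_rows(history, end)]
rates = crude_rates(rows)
```

The row-per-transition layout is the useful part: once the data are in this form, each transition can be given its own model and its own set of risk factors.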

The second substantive project will further develop and apply joint models of longitudinal and survival data. These models describe a biomarker, such as blood pressure, that is measured repeatedly over time and with error, and relate changes in the biomarker to the rate of an event of interest, such as death. We will investigate the relationship between changes in haemoglobin levels over time and the rate of cancer diagnoses, using Swedish and Danish registry data. This may identify important haemoglobin trajectories, allowing targeted monitoring, earlier intervention, or earlier diagnosis. Methodology will be developed to allow these computationally intensive methods to be applied to such a large database, providing a widely applicable methodological framework.
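For readers who prefer notation, one common way to write a joint model is sketched below. It is an assumed, illustrative form (a linear trajectory with patient-specific intercept and slope, linked to the event rate through the current biomarker value) and is not necessarily the specification this project will use. All symbols are introduced here for illustration.

```latex
\begin{align}
  y_i(t) &= m_i(t) + \varepsilon_i(t), \qquad \varepsilon_i(t) \sim N(0, \sigma^2) \\
  m_i(t) &= (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})\, t, \qquad (b_{0i}, b_{1i})^\top \sim N(\mathbf{0}, \Sigma) \\
  h_i(t) &= h_0(t) \exp\!\left( \gamma^\top z_i + \alpha\, m_i(t) \right)
\end{align}
```

Here $y_i(t)$ is the observed biomarker for patient $i$ at time $t$, $m_i(t)$ is the underlying error-free trajectory, $h_i(t)$ is the event rate with baseline $h_0(t)$, $z_i$ are baseline covariates, and $\alpha$ measures how strongly the current biomarker value is associated with the event rate.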

The third project will investigate how changes in blood pressure over time are associated with the risk of experiencing cardiovascular events, such as a heart attack or stroke. We have available a vast resource of data from electronic health records, collected at GP practices across the country. Such data exhibit a hierarchical structure, with repeated biomarker measurements nested within patients, who are in turn nested within GP practices. It is important to account for this hierarchical structure in our analyses. This project will extend joint longitudinal-survival models to account for such structures, which will allow us to investigate factors measured at the practice level, such as deprivation status.
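The hierarchy described here can be written, under illustrative assumptions, as a joint model with an extra practice-level random effect. The form below is a sketch only; the symbols and the choice of a random intercept for each practice are assumptions made for this example.

```latex
\begin{align}
  y_{jk}(t) &= m_{jk}(t) + \varepsilon_{jk}(t), \qquad \varepsilon_{jk}(t) \sim N(0, \sigma^2) \\
  m_{jk}(t) &= \beta_0 + \beta_1 t + u_k + b_{0jk} + b_{1jk}\, t \\
  h_{jk}(t) &= h_0(t) \exp\!\left( \gamma^\top z_{jk} + \delta^\top w_k + \alpha\, m_{jk}(t) \right)
\end{align}
```

Here $j$ indexes patients and $k$ indexes GP practices, $u_k \sim N(0, \tau^2)$ is a practice-level random effect, $b_{0jk}$ and $b_{1jk}$ are patient-level random effects, $z_{jk}$ are patient-level covariates, and $w_k$ are practice-level covariates such as deprivation status.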

Alongside the above projects, we believe it is crucial to develop freely available, user-friendly software that implements the methodology, to release it to the research community, and to run courses that help transfer the methods into practice. This is particularly important because the increased availability of large datasets means that methods must be able to handle ‘big data’ efficiently, which requires advanced programming skills.

To conclude, this project will apply the methods in both cardiovascular and cancer epidemiology; by releasing software, however, the methodological work can be applied in many other disease areas, helping to answer a variety of clinically important questions across the range of health research.