Statistical methods are crucial for understanding reality and advancing scientific knowledge: they provide rigorous models and methods for analysing data and drawing correct conclusions. In practice, a large number of research questions arise in fields such as health studies, epidemiology, biology, and the environmental and social sciences.
In all these fields, the data exhibit complex patterns whose analysis requires sophisticated statistical tools. The mathematical statistician therefore faces a variety of challenges. The first is to propose a model, together with a statistical method, that properly solves the given problem. Second, the theoretical properties of the proposed method must be studied to identify the situations in which it leads to consistent answers. Particular cases are then explored through simulation studies, in which the performance of the method is investigated in realistic scenarios. Third, code or user-friendly software should be packaged to help practitioners apply the method to a specific data set. Finally, collaborating with researchers in other areas, to provide statistical expertise in the application of the methods, is also a goal.
This project aims to cover all these facets of statistical research. In survival analysis, researchers are interested in modelling and analysing the time until an event happens. The available data are often censored and/or truncated, which means that the event times are only partially observed (censoring) or are observed only when they fall within a certain range (truncation). The survival literature has proposed suitable methods for working with such incomplete information on the event times. These features substantially complicate the statistical analysis of the data.
The aim of this project is to solve a number of open problems related to univariate and multivariate time-to-event data, whose solution would represent a major step forward in survival analysis and multi-state models. The real problems that we aim to solve arise in the EPIPorto, Lisbon cohort of MSM and COVID-Scope cohorts. It is of interest to estimate the joint distribution of successive times (e.g., age at disease onset and time from disease onset to death) in a three-state progressive model in which various types of censoring and truncation must be taken into account.
Information on the cohort is obtained through intermittent visits, successive cross-sections or follow-ups, so that special combinations of left-truncated, right-censored and interval-censored data appear. Due to the complex nature of these models, many problems are still open and rigorous theory is rather scarce in this area. In many applications doubly truncated data are encountered; this phenomenon is much less well known, and much more difficult to handle, than one-sided truncation.
The analysis of doubly truncated data is relevant, for example, in epidemiological applications in which observation of the time of interest is limited to events occurring between two specific calendar dates. Since the seminal paper by Efron and Petrosian on nonparametric estimation of a doubly truncated distribution, several contributions have appeared, including those of our research team. An important assumption in the random double truncation model is independence between the truncation times and the target time; in practice, however, the target time may depend on the truncation times, which can make the nonparametric maximum likelihood estimator (NPMLE) inconsistent. An extension of the NPMLE to dependent truncation, which assumes a suitable copula structure for the times involved, has been introduced, but the study of other important targets such as the hazard function is still open, including the bandwidth selection issue. The choice of the copula family, which describes the dependence structure, is important because it has an impact on the final estimator. One possible approach to copula selection is an information criterion such as the AIC, but formal goodness-of-fit tests for the copula model under double truncation are still missing; developing them is a challenging task that we intend to address.
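To make the estimator discussed above concrete, the following is a minimal sketch of a fixed-point iteration for the NPMLE under independent double truncation, in the spirit of the Efron and Petrosian approach. It is written in Python purely for illustration and is not the project's R package. The function names `npmle_double_truncation` and `simulate_doubly_truncated`, the tolerance, the iteration cap and the simulation design are all assumptions introduced here.

```python
import numpy as np


def npmle_double_truncation(x, u, v, tol=1e-8, max_iter=5000):
    """Sketch of an NPMLE for doubly truncated data (independent truncation).

    x: observed target times; u, v: left and right truncation limits,
    with u[i] <= x[i] <= v[i] for every observed triple.
    Returns the estimated probability mass placed on each x[j].
    """
    x, u, v = (np.asarray(a, dtype=float) for a in (x, u, v))
    if np.any((x < u) | (x > v)):
        raise ValueError("each x[i] must lie inside [u[i], v[i]]")
    n = len(x)
    # J[i, j] = 1 if x[j] falls inside the i-th truncation interval.
    J = ((u[:, None] <= x[None, :]) & (x[None, :] <= v[:, None])).astype(float)
    f = np.full(n, 1.0 / n)  # start from the empirical masses
    for _ in range(max_iter):
        F = J @ f  # mass each interval captures (positive since J[i, i] = 1)
        f_new = 1.0 / (J.T @ (1.0 / F))  # fixed-point update
        f_new /= f_new.sum()  # renormalise to a probability vector
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    return f


def simulate_doubly_truncated(n, seed=0):
    """Hypothetical simulation design: X ~ U(0, 1), U ~ U(-0.5, 0.5),
    V = U + 0.75; a triple is observed only if U <= X <= V."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, 20 * n)
    u = rng.uniform(-0.5, 0.5, 20 * n)
    v = u + 0.75
    keep = (u <= x) & (x <= v)
    if keep.sum() < n:
        raise RuntimeError("not enough observed triples; increase the batch")
    return x[keep][:n], u[keep][:n], v[keep][:n]
```

When every truncation interval contains all observed times, the update returns equal masses 1/n, so the estimator reduces to the ordinary empirical distribution. Under genuine double truncation the NPMLE may fail to exist or be non-unique for some samples, and the iteration can then converge slowly; this sketch does not check for that.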
During the current project, we will update an existing R package devoted to the analysis of doubly truncated data, adding bandwidth selectors for the hazard function, and we will develop a new package that includes the proposed advances in the nonparametric estimation of curves under dependence. The SUMcohort project will contribute to the fields of survival analysis under double truncation and multi-state models by considering complicated combinations of censoring and truncation, weakly dependent lifetimes, copula models, goodness-of-fit tests and bandwidth selection methods, among others. It also includes practical solutions to real problems posed by epidemiologists and physicians at a participating institution, arising from cohort-based studies.