The analysis of count data within the framework of regression models plays a crucial role in many applied research fields. Owing to this widespread use, a wide range of count data models has been developed to accommodate various features of such data. This dissertation focuses on Bayesian modelling of count data that are subject to potential underreporting.
Underreporting is a common problem in applications, e.g., in criminology or epidemiology, and arises from a fallible mechanism in the data collection process. As a result, inference based on the observed (reported) counts, which are only a fraction of the true counts, will be biased. To account for underreporting, the basic concept is to specify a joint model for the data-generating process of events and the fallible reporting process, where the responses in both processes are related to a set of regressors. The most popular model for underreported count data is the Poisson-Logistic (Pogit) model. It is based on a standard Poisson regression model for the true counts and assumes a logit regression model for the reporting process. Identification of the Pogit model is an important issue and requires additional information on the reporting process.
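As a minimal sketch of the joint model described above (not the implementation used in this thesis), the following Python simulation draws true counts from a Poisson regression and thins them with a logit reporting probability. The coefficient values, the single regressors x and w, and the helper names sample_poisson and simulate_pogit are hypothetical choices made for illustration only. Under these assumptions the reported counts are again Poisson distributed with mean p * lambda, so the observed data inform only the product of reporting probability and event intensity, which illustrates why identification needs additional information on the reporting process.

```python
import math
import random


def sample_poisson(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method (adequate for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_pogit(rng, x, w, beta, alpha):
    """Simulate one unit: true count from a Poisson regression, reported count by logit thinning."""
    lam = math.exp(beta[0] + beta[1] * x)                     # event intensity
    p = 1.0 / (1.0 + math.exp(-(alpha[0] + alpha[1] * w)))    # reporting probability
    true_count = sample_poisson(rng, lam)
    # Each true event is reported independently with probability p (binomial thinning).
    reported = sum(rng.random() < p for _ in range(true_count))
    return true_count, reported, lam, p


rng = random.Random(1)
beta = (1.0, 0.5)    # hypothetical coefficients of the event model
alpha = (0.0, 1.0)   # hypothetical coefficients of the reporting model
n = 20000

total_true = total_reported = 0
expected_reported = 0.0
for _ in range(n):
    x, w = rng.uniform(-1, 1), rng.uniform(-1, 1)
    t, r, lam, p = simulate_pogit(rng, x, w, beta, alpha)
    total_true += t
    total_reported += r
    expected_reported += p * lam

print("mean true count:    ", total_true / n)
print("mean reported count:", total_reported / n)
print("mean of p * lambda: ", expected_reported / n)
```

In such a run the average reported count should lie clearly below the average true count and close to the average of p * lambda, so a Poisson regression fitted to the reported counts alone would estimate the thinned intensity rather than the event intensity.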
This thesis considers Bayesian inference for the Pogit model and extends the model in several ways. The proposed extensions make it possible to model underreported clustered as well as (underreported) overdispersed count data. Furthermore, Bayesian variable selection with spike-and-slab priors is incorporated in both parts of the joint model to identify relevant regressors. Accounting for underreporting relies on additional information on the reporting process, which may be provided by different sources: validation data, parameter restrictions (as a result of variable selection), or informative prior distributions.
To deal with overdispersion of count data due to omitted covariates, this thesis considers Poisson mixture models with different heterogeneity distributions. These models are also appropriate alternatives to the Poisson distribution for the counts in the Pogit model. In general, overdispersion may have various causes and can result either from the event-generating process or from the data collection process.
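The effect of unobserved heterogeneity can be illustrated with a small simulation. The sketch below assumes a gamma heterogeneity distribution with unit mean, which is only one possible choice and is not necessarily among those used in this thesis; the values of mu and shape and the helper sample_poisson are hypothetical. Under this assumption the marginal counts have mean mu and variance mu + mu^2 / shape, so the variance exceeds the mean.

```python
import math
import random


def sample_poisson(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method (adequate for moderate means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(7)
mu = 3.0      # hypothetical marginal mean of the counts
shape = 2.0   # hypothetical gamma shape; the heterogeneity variance is 1 / shape
n = 20000

counts = []
for _ in range(n):
    # Unit-mean multiplicative heterogeneity standing in for omitted covariates.
    u = rng.gammavariate(shape, 1.0 / shape)
    counts.append(sample_poisson(rng, mu * u))

mean = sum(counts) / n
variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
print("sample mean:    ", mean)
print("sample variance:", variance)
print("theoretical variance mu + mu^2 / shape:", mu + mu ** 2 / shape)
```

A plain Poisson model would force the variance to equal the mean, so the gap between the two printed sample moments is the overdispersion that the mixture accounts for.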
Bayesian inference for the presented models is based on Markov chain Monte Carlo (MCMC) sampling schemes that rely on data augmentation and auxiliary mixture sampling techniques. The main goal is to represent each model as a conditionally Gaussian regression model in auxiliary variables, which allows for more general and complex model specifications and a straightforward implementation of variable selection. The sampling schemes are implemented in an R package, which is available on the Comprehensive R Archive Network (CRAN). The proposed methods are illustrated in real data applications from the field of epidemiology, where they account for underreporting in the analysis of cervical cancer death risk and of norovirus infections.