Alberto Cassese is an assistant professor at the Department of Statistics, Computer Science, Application “G. Parenti” of the University of Florence. He was our guest at the D2 Seminar series, presenting his work on “Bayesian negative binomial mixture regression models for the analysis of sequence count and methylation data”.

This is the abstract of his talk: A Bayesian hierarchical mixture regression model is developed for studying the association between a multivariate response, measured as counts on a set of features, and a set of covariates. We have available RNASeq and DNA methylation data on breast cancer patients at different stages of the disease. We account for heterogeneity and over-dispersion of count data by considering a mixture of negative binomial distributions and incorporate the covariates into the model via a linear modeling construction on the mean components. Our modeling construction employs selection techniques allowing the identification of a small subset of features that best discriminate the samples, simultaneously selecting a set of covariates associated to each feature. Additionally, it incorporates known dependencies into the feature selection process via Markov random field priors. On simulated data, we show how incorporating existing information via the prior model can improve the accuracy of feature selection. In the case study, we incorporate knowledge on relationships among genes via a gene network, extracted from the KEGG database. Our data analysis identifies genes that are discriminatory of cancer stages and simultaneously selects significant associations between those genes and DNA methylation sites. A biological interpretation of our findings reveals several biomarkers that can help to understand the effect of DNA methylation on gene expression transcription across cancer stages.
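For readers less familiar with the building blocks mentioned in the abstract, here is a minimal sketch of a negative binomial count model whose mean depends on a group (for example, a cancer stage) and on covariates through a log-linear term. It is only an illustration, not the model presented in the talk: the mean/dispersion parametrization, the function names, and all numerical values are assumptions made for this example, and the feature selection and Markov random field priors are not covered.

```python
import math

def nb_log_pmf(y, mu, phi):
    """Log-probability of count y under a negative binomial distribution
    with mean mu and dispersion phi (variance = mu + mu**2 / phi).
    This parametrization is an assumption of the sketch."""
    return (math.lgamma(y + phi) - math.lgamma(phi) - math.lgamma(y + 1)
            + phi * math.log(phi / (phi + mu))
            + y * math.log(mu / (phi + mu)))

def log_linear_mean(intercept, coefs, covariates):
    """Mean on the count scale: exp(intercept + sum_j coef_j * x_j)."""
    return math.exp(intercept + sum(b * x for b, x in zip(coefs, covariates)))

def nb_variance(mu, phi):
    """Variance implied by the mean/dispersion parametrization above."""
    return mu + mu ** 2 / phi

# Hypothetical values for one feature (e.g. one gene) in one sample.
covariates = [0.5, -1.2]                       # e.g. two methylation measurements
coefs = [0.8, 0.0]                             # a zero coefficient: covariate not selected
intercepts = {"stage_1": 1.0, "stage_2": 2.0}  # group-specific baseline levels
phi = 2.0                                      # small phi means strong over-dispersion

# One mean per group: a feature is discriminatory when these means differ.
means = {group: log_linear_mean(a, coefs, covariates)
         for group, a in intercepts.items()}

for group, mu in means.items():
    print(group, "mean:", round(mu, 2), "variance:", round(nb_variance(mu, phi), 2))
```

With a Poisson model the variance would equal the mean; here the extra term `mu**2 / phi` lets the variance exceed it, which is what "over-dispersion" refers to.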

Could you explain briefly what your research work is about?

I develop statistical models for the integrative analysis of high-dimensional datasets. In particular, these models have been developed for applications in genetics.

What is the major motivation for your study?

The idea is to equip researchers from other fields with advanced methods to analyze their data, instead of relying on simpler models whose assumptions may not be realistic.

What is the most interesting or unexpected result you found in your case study?

When we ran an enrichment analysis using our method and some competing methods, I was worried that the results might not confirm that our model was the best choice. Instead, the results really suggest that our model is capable of identifying the most interesting findings. That was really exciting!

What are the different applications that the model you proposed can have, and what can be the possible future outcomes of your work?

In principle, the model applies to any dataset where the outcome variables are measured as counts, and where the interest is in selecting a small subset of these outcome variables that is capable of discriminating among known groups. As an example, still in biology, single-cell data or microbiome data may have these characteristics. Of course, although some features may be in common, there could still be differences that need to be accounted for in the model to be fitted, and this can lead to interesting extensions of the proposed work.

Alberto Cassese is a member of the Florence Center for Data Science. You can watch the recording of his seminar at this page.