Wednesday, 8 February 2012

A stick-breaking likelihood for categorical data analysis

We have a new paper appearing at the forthcoming AISTATS 2012. You can get the paper and code here:
M. E. Khan, S. Mohamed, B. M. Marlin and K. P. Murphy. A stick-breaking likelihood for categorical data analysis with latent Gaussian models, AISTATS, April 2012.
In this paper we look at building models for the analysis of categorical (multi-class) data. We try to be as general as possible, and look at both multi-class Gaussian process classification and categorical factor analysis. Emtiyaz will soon be on the post-doc trail, so you might hear about this live in a lab near you soon. Existing models use probit and logit link functions; here we look at a third, new likelihood function, which we call the stick-breaking likelihood (related to the stick-breaking you know from Bayesian non-parametrics). We combine this likelihood with variational inference and show convincing results in favour of our new likelihood. One of the key messages is that this likelihood, in combination with the proposed variational EM algorithm, gives better correspondence between the marginal likelihood and the prediction error. Thus choosing hyperparameters by optimising the marginal likelihood will also give good prediction accuracy, which is not the case with other approaches. The paper has all the details, and all the Matlab code is online as well, so feel free to play around with it and let us know what you think.

Of course the stick-breaking likelihood has some limitations, such as its dependence on the order of the categories, but this is something we are going to look at further, along with other assumptions. While we do say something about multi-class EP in the paper, we are working on a longer technical report for multi-class GP classification that will look into this a bit more.
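
To make the idea concrete, here is a minimal Python sketch of how a stick-breaking construction turns real-valued scores into category probabilities, and why the order of the categories matters. This is an illustration only, not the Matlab code released with the paper: the function name `stick_breaking_probs`, the example scores in `eta`, and the convention that the last category takes whatever stick is left over are assumptions made for this sketch, so see the paper for the exact parameterisation.

```python
import math


def sigmoid(x):
    """Logistic function, mapping a real score to a fraction in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def stick_breaking_probs(eta):
    """Map len(eta) real-valued scores to len(eta) + 1 category probabilities.

    Illustrative convention (an assumption of this sketch): category k takes
    a fraction sigmoid(eta[k]) of the stick left over by categories 0..k-1,
    and the final category takes the remainder.
    """
    probs = []
    remaining = 1.0  # length of stick not yet assigned to any category
    for score in eta:
        fraction = sigmoid(score)
        probs.append(remaining * fraction)
        remaining *= 1.0 - fraction
    probs.append(remaining)  # last category gets what is left
    return probs


eta = [0.5, -1.0, 2.0]  # hypothetical scores for a 4-category problem
probs = stick_breaking_probs(eta)
reordered = stick_breaking_probs(list(reversed(eta)))

print(probs, sum(probs))
print(reordered)
```

The probabilities sum to one by construction. Feeding the same scores in a different order gives a different set of probabilities, not just a permutation of the original ones, which is the order dependence mentioned above; a softmax over the same scores would simply permute its outputs.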

Here is the abstract:
The development of accurate models and efficient algorithms for the analysis of multivariate categorical data are important and long-standing problems in machine learning and computational statistics. In this paper, we focus on modeling categorical data using Latent Gaussian Models (LGMs). We propose a novel stick-breaking likelihood function for categorical LGMs that exploits accurate linear and quadratic bounds on the logistic log-partition function, leading to an effective variational inference and learning framework. We thoroughly compare our approach to existing algorithms for multinomial logit/probit likelihoods on several problems, including inference in multinomial Gaussian process classification and learning in latent factor models. Our extensive comparisons demonstrate that our stick-breaking model effectively captures correlation in discrete data and is well suited for the analysis of categorical data.
