Wednesday, March 23, 2011

Pre-meeting commentary - Dhananjay

Focus Paper : Reading tea leaves: How humans interpret topic models. Chang et al. NIPS 2009
Related Paper : Topic Evolution in a Stream of Documents. André Gohr et al. SIAM Data Mining (SDM) 2009

The related paper addresses the problem of document collections whose nature changes over time. It adapts both the feature space (the vocabulary) and the underlying document model as the stream evolves. The idea is to maintain a PLSA model for every fixed-length time window. Rather than relearning the model from scratch for each window, the paper proposes an adaptive scheme that updates the PLSA model of the previous window.

The PLSA model is parameterized by documents, words and topics, and EM is used for inference. When the window advances, the documents that came before the new window are discarded, and so is the vocabulary that no longer occurs in it. The MAP estimates from the previous window serve as the starting estimates for the current window. Words that are new in the current window are "folded in", and the model is inferred again.
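To make the window update concrete, here is a minimal sketch of carrying the topic-word estimates P(w|z) from one window to the next. It is not the paper's fold-in procedure: the function name `adapt_topic_word`, the dict-based layout, and the small seed weight `eps` for unseen words are all assumptions made for illustration. The result would only be a starting point for EM on the current window.

```python
def adapt_topic_word(prev_p_w_z, new_vocab, eps=1e-3):
    """Carry P(w|z) from the previous window over to a new vocabulary.

    prev_p_w_z: list of dicts, one per topic, mapping word -> P(w|z).
    new_vocab:  iterable of words that occur in the current window.
    eps:        hypothetical seed weight for words unseen so far (assumption).
    """
    vocab = sorted(set(new_vocab))
    adapted = []
    for topic in prev_p_w_z:
        # Surviving words keep their old estimate; new words get eps.
        # Words absent from the current window are dropped implicitly.
        weights = {w: topic.get(w, eps) for w in vocab}
        total = sum(weights.values())
        # Renormalize so each topic is again a distribution over words.
        adapted.append({w: v / total for w, v in weights.items()})
    return adapted


# Hypothetical two-topic model from the previous window.
prev = [
    {"retrieval": 0.5, "query": 0.3, "index": 0.2},
    {"retrieval": 0.1, "query": 0.1, "index": 0.8},
]
# "index" no longer occurs in the current window; "ranking" is new.
current = adapt_topic_word(prev, ["query", "retrieval", "ranking"])
```

An independent PLSA run would instead start from random distributions over the current vocabulary, which is the baseline used in the comparison.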

For evaluation, the authors used the ACM SIGIR conferences from 2000 to 2007. They compared their model against an independent PLSA model learned separately for each window. The difference between the two was the initialization: the adaptive model started from the MAP estimates of the previous window, while the independent PLSA was initialized randomly. The comparison was carried out for two window lengths (1 year and 2 years), with documents in both natural and random order. The average perplexity (50 iterations, over k = 1, 2, 4, ..., 128) for adaptive PLSA comes out about 5% lower than for independent PLSA.
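For reference, perplexity under a PLSA model can be computed as below. This is a sketch using the standard definition, exp of the negative average per-token log-likelihood with P(w|d) = sum over z of P(w|z) P(z|d); the commentary does not spell out the exact formula used in the paper, so the definition, the function name `perplexity`, and the data layout are assumptions.

```python
import math


def perplexity(counts, p_w_z, p_z_d):
    """Perplexity of a PLSA model on a set of documents.

    counts: dict doc -> dict word -> count n(d, w).
    p_w_z:  list of dicts, one per topic, mapping word -> P(w|z).
    p_z_d:  dict doc -> list of P(z|d), one entry per topic.
    """
    log_lik = 0.0
    n_tokens = 0
    for doc, words in counts.items():
        for word, n in words.items():
            # Mixture probability of the word in this document.
            p = sum(p_z_d[doc][z] * p_w_z[z].get(word, 0.0)
                    for z in range(len(p_w_z)))
            # math.log raises ValueError if the model gives the word
            # zero probability; smoothing is left out of this sketch.
            log_lik += n * math.log(p)
            n_tokens += n
    return math.exp(-log_lik / n_tokens)
```

Lower is better: a model that spreads probability uniformly over V words has perplexity V, so a 5% reduction means the adaptive model is, on average, slightly less "surprised" by the words in each window.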
