CMU Advanced NLP Seminar 2011

Wednesday, March 16, 2011

I read

Comparing clusterings: An information based distance. Meila, M. 2007. Journal of Multivariate Analysis

www.stat.washington.edu/mmp/Papers/compare-jmva-revised.ps

This paper provides a different approach to deriving the distance metric used to compare clusterings in the focus paper. The goal of this metric, called variation of information (VI), is to be intuitive while also having desirable mathematical properties. From a set of axioms, the author derives the definition VI(C,C') = H(C|C') + H(C'|C), where H(X|Y) denotes the conditional entropy of X given Y. This function is a true metric on the space of clusterings, which makes possible more types of reasoning about that space. VI does not depend directly on dataset size, so distances are comparable between different datasets. The author proves that VI is the only function that satisfies all the desired properties, although some of these properties are somewhat nonintuitive, relating to the way the metric interacts with combinations of clusters. The principal advantage of this metric over others is that it is comparable across datasets and experimental conditions without rescaling; such rescaling is generally not mathematically justified.
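To make the definition concrete, here is a minimal sketch of computing VI from two label assignments over the same items. It is not code from the paper: the function name `variation_of_information`, the list-of-labels input format, and the use of natural logarithms (so the result is in nats) are all my own assumptions for illustration.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Return VI(C, C') = H(C|C') + H(C'|C) in nats.

    labels_a and labels_b are hypothetical inputs: equal-length sequences
    giving the cluster label of each item under the two clusterings.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    if n == 0:
        raise ValueError("need at least one item")

    count_a = Counter(labels_a)              # cluster sizes in C
    count_b = Counter(labels_b)              # cluster sizes in C'
    joint = Counter(zip(labels_a, labels_b))  # sizes of pairwise intersections

    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # -p(a,b) * log p(a|b) contributes to H(C|C'),
        # -p(a,b) * log p(b|a) contributes to H(C'|C).
        vi -= p_ab * (log(n_ab / count_b[b]) + log(n_ab / count_a[a]))
    return vi


if __name__ == "__main__":
    same = variation_of_information([0, 0, 1, 1], [1, 1, 0, 0])
    independent = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
    print(same, independent)
```

In the usage lines, the first pair is the same partition with the labels swapped, so the distance is zero. The second pair is two statistically independent two-way splits, so the distance equals the sum of the two entropies, 2 log 2.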
