Wednesday, March 9, 2011

Related paper: A Method of Automated Nonparametric Content Analysis for Social Science, Daniel J. Hopkins and Gary King
Focus paper: General Purpose Computer-Assisted Clustering and Conceptualization, Justin Grimmer and Gary King

The related paper focuses on estimating the proportions of categories (classes) in a document collection, instead of classifying individual documents (which is what is mostly done in computer science). They first review two existing approaches to estimating proportions: 1) sample a subset of documents and hand-label them to estimate the category proportions, and 2) classify each document individually and aggregate the predictions to calculate the proportions. They explain why both approaches have problems, and then propose two new methods. The first one applies existing individual classification techniques, but then estimates the misclassification errors per category and uses them to correct the aggregated category proportions. The second one estimates the proportions directly, without doing any individual classification. The problem can be framed as a regression problem in which the class proportions are the regression coefficients. Because of computational and sparsity issues, they sample subsets of words and estimate the proportions for each subset. The results are then averaged.
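To make the first method more concrete for myself, here is a minimal sketch of the correction step in Python with NumPy. It is my own toy illustration and not the authors' code: the function name corrected_proportions, the 2x2 error matrix and all numbers are made up, and the clipping and renormalizing at the end is a simplification I added so the result is a valid set of proportions. The idea is that the observed share of predicted labels equals the error matrix times the true proportions, so the true proportions can be recovered by solving that linear system.

```python
import numpy as np

def corrected_proportions(confusion, predicted_props):
    """Correct aggregated classifier output for per-category errors.

    confusion[i, j]   : estimated P(predicted = i | true = j), e.g. from a
                        hand-labeled set (hypothetical input).
    predicted_props[i]: share of unlabeled documents predicted as category i.
    """
    # Solve predicted_props = confusion @ true_props in the least-squares sense.
    true_props, *_ = np.linalg.lstsq(confusion, predicted_props, rcond=None)
    # Simplification (my assumption): clip negatives and renormalize to sum to 1.
    true_props = np.clip(true_props, 0.0, None)
    return true_props / true_props.sum()

# Made-up example: the classifier often mislabels category 1 as category 0.
confusion = np.array([[0.9, 0.3],
                      [0.1, 0.7]])
true_props = np.array([0.2, 0.8])

# What plain aggregation of the predictions would report.
predicted_props = confusion @ true_props

estimate = corrected_proportions(confusion, predicted_props)
print("aggregated:", predicted_props)
print("corrected: ", estimate)
```

In this toy case plain aggregation reports about 42% for the first category, although the true share is 20%, and the corrected estimate recovers the true proportions. As far as I understand it, the second method has the same linear form, with word patterns taking the place of the predicted labels.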

I think the paper gave a nice explanation of the goals and alternatives. Furthermore, it was interesting because I had never thought about estimating proportions instead of individual classifications, and I now know why you would want to use other methods instead of just aggregating individual classification predictions.

