Wednesday, April 20, 2011
Premeeting post - Dhananjay
I read the paper "Text Genre Detection Using Common Word Frequencies" by E. Stamatatos, N. Fakotakis, and G. Kokkinakis (http://www.aclweb.org/anthology/C/C00/C00-2117.pdf). The paper presents a simple method for classifying documents into genres, using the frequencies of occurrence of the most frequent words as style markers. The idea builds on a paper by Burrows (1987), which uses similar style markers but requires additional preprocessing: expansion of contracted forms (e.g. I'm to I am), separation of common homographic forms (e.g. to as infinitive and as preposition), handling of proper names, and text sampling. The method proposed in this paper removes these restrictions. Also, instead of extracting the list of most frequent words from the training corpus, it takes the list from the BNC. Classification is performed with discriminant analysis. In the results section, the authors report a lower error rate (2.5%) than the Burrows method (6.25%). The comparison, however, is made at each method's minimum: the 30 most frequent words for the proposed method and the 55 most frequent words for the Burrows method. After 40-50 words, the Burrows method has a lower error rate. The authors claim that this drop in performance is due to overfitting. The paper also shows that using punctuation reduces the error rate.
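To make the style markers concrete, here is a minimal sketch of the feature extraction step as I understand it from the summary above. It is not the authors' code. The word list `COMMON_WORDS` is a short hypothetical stand-in for the BNC-derived list (the paper's best result uses 30 words), and the tokenizer and the function name `style_markers` are my own assumptions.

```python
import re
from collections import Counter

# Hypothetical stand-in for the BNC most-frequent-word list;
# the actual list used in the paper is not reproduced here.
COMMON_WORDS = ["the", "of", "and", "a", "in", "to", "is", "was", "it", "for"]

def style_markers(text, common_words=COMMON_WORDS):
    """Return one relative frequency per common word for the given text."""
    # Simple lowercase tokenization (an assumption). Contractions are kept
    # as single tokens and homographs are not separated, in line with the
    # paper dropping the Burrows preprocessing steps.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in common_words]

vec = style_markers("The cat sat on the mat, and it was happy.")
```

Each document becomes a fixed-length vector like `vec`, and the vectors for a labeled corpus could then be passed to a discriminant analysis classifier, for example scikit-learn's `LinearDiscriminantAnalysis`.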