Focus paper: Genre Distinctions for Discourse in the Penn Treebank
This week we read Bonnie Webber's paper from ACL 2009. The paper provides genre information for the articles in the Penn Treebank. Four different genres are characterized by the discourse connectives and senses manually annotated in the Penn Discourse Treebank. The findings concern differences between genres in the senses associated with various discourse connectives, as well as differences between genres in the senses of lexically marked and unmarked discourse relations. The paper argues that genre matters, and that lexically marked relations are not a good model for automatic sense labeling of non-lexically-marked discourse relations.
I read "Automatic sense prediction for implicit discourse relations in text" by Pitler et al. from ACL-IJCNLP 2009 (Singapore). It explores word-level features for automatic sense prediction of implicit discourse relations. Previously used word-pair features are examined and their weaknesses evaluated. As an experiment, three classifiers are run on four classification tasks, and F-scores show improvement over a baseline that samples labels from the true class distribution.
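Word-pair features of the kind the paper examines are, at their core, the cross product of the tokens in a relation's two arguments; a minimal sketch (the function name and token examples are mine, not from the paper):

```python
from itertools import product

def word_pair_features(arg1_tokens, arg2_tokens):
    # One feature per (w1, w2) pair, with w1 drawn from the first
    # argument of the relation and w2 from the second.
    return {f"{w1}|{w2}" for w1, w2 in product(arg1_tokens, arg2_tokens)}

feats = word_pair_features(["it", "rained"], ["we", "stayed", "home"])
# 2 x 3 = 6 pair features, e.g. "rained|stayed"
```

The feature space grows quadratically in vocabulary size, which is one reason the paper scrutinizes these features' weaknesses.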
Dhananjay read "Text genre detection using common word frequencies" by Stamatatos et al. from COLING 2000. The paper introduces a method for genre classification using the frequencies of common words as style markers. Continuing the work of Burrows (1987), the authors remove most of the restrictions present in that earlier work (expanding contractions, etc.). They used the BNC to extract frequency lists, and discriminant analysis to do the classification. The error rate decreases from 6.25 to 2.5, although the comparison uses only the 30 most frequent words, as opposed to the 55 used before.
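The style markers amount to relative frequencies of the most common words in a text; a rough sketch of how such a feature vector could be built (the word list here is illustrative, not the actual BNC-derived list):

```python
from collections import Counter

# Illustrative stand-in for the top of a BNC-derived frequency list.
COMMON_WORDS = ["the", "of", "and", "to", "a"]

def style_vector(tokens, common_words=COMMON_WORDS):
    # Relative frequency of each common word in the text; these
    # vectors would then feed into discriminant analysis.
    counts = Counter(t.lower() for t in tokens)
    total = len(tokens)
    return [counts[w] / total for w in common_words]

vec = style_vector("the cat sat on the mat".split())
# "the" occurs 2 times out of 6 tokens -> first component is 2/6
```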
Daniel read "Building a discourse-tagged corpus in the framework of rhetorical structure theory" by Carlson et al. from SIGdial 2002. The paper describes the construction of a discourse corpus, the RST corpus, that aims for high reliability of tags. Based on the Rhetorical Structure Theory framework, annotation consists of segmenting sentences into elementary discourse units and building labeled trees over them. Inter-annotator agreement was tracked throughout, and kappa scores were consistently high. The published corpus comprises 176,000 words from Penn Treebank articles.
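Cohen's kappa, the agreement measure tracked, corrects observed agreement for the agreement expected by chance; a small self-contained version (my own sketch, not the paper's code):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # Observed agreement: fraction of items the two annotators label identically.
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label distribution.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

cohens_kappa(["N", "N", "S", "S"], ["N", "N", "S", "N"])
# observed agreement 0.75, chance agreement 0.5 -> kappa = 0.5
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance, which is why it is preferred over raw percent agreement for annotation studies like this one.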
Both Dong and Weisi read papers that address the problem of identifying the arguments of discourse connectives.
Dong read "Discourse Connective Argument Identification with Connective Specific Rankers" by Elwell and Baldridge from ICSC '08. The authors model discourse connectives individually, in contrast to previous research that uses only a single classifier. Because the data available for each specific connective is limited, the connective-specific models are interpolated with more general global models. The evaluation used a maximum entropy ranker on data from the Penn Discourse Treebank. Overall, the interpolation improved results.
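The interpolation idea can be sketched as a linear mix of a connective-specific scorer and a global scorer over candidate argument spans (a conceptual sketch with made-up features and weights, not the authors' implementation):

```python
def score(features, weights):
    # Linear model: sum of the weights of the active features.
    return sum(weights.get(f, 0.0) for f in features)

def interpolated_score(features, specific_w, global_w, lam=0.5):
    # Mix the connective-specific model with the global model,
    # so sparse connective-specific data is backed up by global evidence.
    return lam * score(features, specific_w) + (1 - lam) * score(features, global_w)

def best_candidate(candidates, specific_w, global_w):
    # Rank candidate argument spans and pick the highest scoring one.
    return max(candidates, key=lambda feats: interpolated_score(feats, specific_w, global_w))

# Hypothetical example: two candidate spans described by feature sets.
specific_w = {"left_of_conn": 1.5}
global_w = {"same_sentence": 0.5}
cands = [{"left_of_conn", "same_sentence"}, {"same_sentence"}]
best = best_candidate(cands, specific_w, global_w)  # the first candidate wins
```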
Weisi read "Automatically identifying the arguments of discourse connectives" by Wellner and Pustejovsky from EMNLP 2007. The problem: given a connective, identify the discourse segments that serve as its arguments. They use a log-linear model along with the head-finding algorithm from Collins (1999), and the two argument models are also combined using Collins' perceptron. The evaluation compares the predicted arguments to those in the Penn Discourse Treebank, and results improve on previous models.