Wednesday, April 20, 2011

Pre-meeting - Daniel

Focus paper: Genre Distinctions for Discourse in the Penn Treebank
Related paper: Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. Building a discourse-tagged corpus in the framework of rhetorical structure theory. SIGdial 2002.

This paper describes the construction of a discourse corpus, the RST corpus. The authors' goals are to create a fairly large corpus whose annotations are firmly grounded in a theory of discourse and can be applied with high reliability. The corpus is based on the Rhetorical Structure Theory (RST) framework, although the authors do not really describe what that is. To tag a passage, each sentence is broken up into Elementary Discourse Units (EDUs), which roughly correspond to clauses. The EDUs are then assembled into a labeled tree. Each node in the tree is labeled with a discourse relation, drawn from a shallow hierarchy of 78 fine-grained labels grouped into 16 classes. A small set of well-trained annotators worked over a fairly long period of time in order to produce reliable output. The authors tracked inter-annotator agreement throughout the process and found consistently high kappa scores. The published corpus consists of articles from the Penn Treebank (PTB) totalling 176,000 words.
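To make the annotation scheme concrete, here is a minimal sketch of how a labeled discourse tree over EDUs might be represented. Everything in it is an assumption made for illustration: the `Node` class, the `edus` helper, the example sentence, and the relation names are hypothetical and are not taken from the paper or the corpus files. The sketch also assumes that relation labels sit on internal nodes while leaves hold the EDU text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a hypothetical discourse tree.

    Leaves carry the text of a single EDU; internal nodes carry a
    discourse relation label and one or more children.
    """
    relation: Optional[str] = None   # set on internal nodes (assumption)
    text: Optional[str] = None       # set on leaves, i.e. EDUs
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def edus(node: Node) -> List[str]:
    """Return the EDU texts under `node` in left-to-right order."""
    if node.is_leaf():
        return [node.text]
    result: List[str] = []
    for child in node.children:
        result.extend(edus(child))
    return result


# Invented example sentence split into three clause-like EDUs.
# The relation names below are illustrative labels only.
tree = Node(
    relation="concession",
    children=[
        Node(text="Although the market fell,"),
        Node(
            relation="explanation",
            children=[
                Node(text="investors stayed calm"),
                Node(text="because earnings were strong."),
            ],
        ),
    ],
)

for i, edu in enumerate(edus(tree), start=1):
    print(i, edu)
```

Reading the leaves from left to right recovers the original segmentation, while the labels on the internal nodes record how the spans relate to each other.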

