Tuesday, May 3, 2011

Public Note

This blog is a collective journal kept by participants in the Spring 2011 iteration of the Advanced Natural Language Processing Seminar (11-713) at the Language Technologies Institute at Carnegie Mellon University, taught by Noah Smith.  The participants in the course agreed for the journal to be made public in case others would find it useful.

The seminar operated as follows.  Each week, the instructor chose a focus paper from recent literature in NLP.  This was announced by Friday (usually).  By Monday, participants left a comment on the announcement saying what other paper they would read, individually, in addition to the focus paper.  On Wednesday, each participant posted a summary of the two papers.  The seminar met each Thursday, with mostly informal discussion, and students were encouraged to post follow-up thoughts for about half of the discussions.

Saturday, April 30, 2011

Post meeting - Dhananjay - 7 April

We discussed phrasal alignment for machine translation. The discussion was informative in two respects. First, in the focus paper the search space is reduced by restricting alignments to those derivable by an inversion transduction grammar (ITG), and we had an introduction to what an ITG is. Second, MIRA is used for classification. MIRA is an incremental (online) learning algorithm in which each class is represented by a vector, and an instance is assigned to the class whose vector is most similar. If the instance is incorrectly classified, then all the interesting vectors (the correct class's vector and any vectors more similar to the instance than it) are adjusted to pull the instance toward the correct class. I was initially apprehensive that the order of training instances might affect the class vectors. However, since the training instances are iterated over multiple times in random order, there is little chance of this happening.
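The margin-violation update described above can be sketched as follows. This is a minimal, hypothetical illustration over raw class prototype vectors (the function name and setup are my own; MIRA as actually used in the paper solves a margin-based quadratic program over features):

```python
import numpy as np

def mira_update(W, x, y_true, margin=1.0):
    """One MIRA-style update on class prototype vectors.

    W: (num_classes, dim) matrix, one prototype row per class.
    x: feature vector of the training instance.
    If the true class does not beat the best wrong class by `margin`,
    both prototypes are moved by the smallest step that closes the gap.
    """
    scores = W @ x
    wrong = [c for c in range(W.shape[0]) if c != y_true]
    y_hat = max(wrong, key=lambda c: scores[c])  # highest-scoring wrong class
    loss = margin - (scores[y_true] - scores[y_hat])
    if loss > 0:  # margin violated: adjust the two "interesting" vectors
        tau = loss / (2 * (x @ x))  # smallest step that restores the margin
        W = W.copy()
        W[y_true] += tau * x
        W[y_hat] -= tau * x
    return W
```

Iterating such an update over shuffled training data for several epochs is what makes the result largely insensitive to instance order, as noted above.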

Post meeting - Dhananjay - 21 Apr

We discussed Bonnie Webber's paper from ACL 2009 on genre distinctions in the Penn Treebank. It was discussed that genre distinctions are important for sense disambiguation. I suggested that genre distinction is similar to topic modeling, except that the former is classification while the latter is clustering. It was pointed out that in topic models multiple topics generate a document, whereas in genre classification a document belongs to only one genre. We also discussed different discourse connectives.

Friday, April 29, 2011

Post meeting summary for the first week discussion

During the meeting we discussed semantic parsing, with the focus paper "Inducing Probabilistic CCG Grammars from Logical Form with Higher-order Unification" by Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman, which appeared in EMNLP 2010. The paper builds on the UAI 2005 best paper by Luke Zettlemoyer and Michael Collins, with a few extensions, including a representation that generalizes over other formalisms and lexical items that can be multi-word. People noted that semantic parsing is not necessarily tied to CCG, as any other formalism, e.g. LFG, could potentially be augmented with logical forms for semantic parsing. Some participants read related papers that address different problems, such as "Unsupervised Semantic Parsing" by Hoifung Poon and Pedro Domingos, which tackles the problem in an unsupervised setting and handles syntactic variation given the same semantic representation; others read papers that approach semantic parsing with machine translation methods. We also discussed potential beneficiaries of semantic parsing beyond natural language interfaces to databases; textual entailment was suggested as one possibility.

Thursday, April 28, 2011

Premeeting - Dhananjay

I read "Joint Unsupervised Coreference Resolution with Markov Logic" by Poon and Domingos (EMNLP 2008). The paper argues that unsupervised approaches, though attractive because of the abundance of unlabeled training data, are underexplored because they are more difficult. The method uses Markov logic to perform joint inference. The head word of a mention is determined using Stanford parser rules, which gives better precision than simply choosing the rightmost word. Two mentions are clustered together if they have the same head word. Since this does not work for pronominal mentions, predicates for gender, entity type, and number are checked. In addition, apposition and predicate nominals are incorporated via predicates. I did not fully digest the inference procedure for this network.
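The same-head-word clustering step can be illustrated with a toy sketch (the function, its interface, and the pronoun list are my own; the actual model resolves pronouns jointly through gender, number, and entity-type predicates in Markov logic rather than leaving them unresolved):

```python
def cluster_by_head_word(heads):
    """Toy version of the same-head-word heuristic: non-pronominal
    mentions sharing a head word fall into one cluster. Pronouns are
    left as singletons here, standing in for the model's separate
    gender/number/entity-type treatment."""
    PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}
    clusters = {}
    for i, head in enumerate(heads):
        # pronouns get a unique key so they never merge by head word alone
        key = ("PRON", i) if head.lower() in PRONOUNS else head.lower()
        clusters.setdefault(key, []).append(i)
    return sorted(clusters.values())
```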

The results show a 7% increase in F-measure over the baseline H&K system. However, compared with systems that determine the head word the same way (by choosing the rightmost word), precision decreases, although recall remains good.

Since mentions with the same head word are clustered without any additional features such as distance, mentions whose head words are common nouns may be incorrectly clustered together.

Wednesday, April 27, 2011

Pre-meeting - Daniel

Focus paper: Unsupervised Ontology Induction from Text
Related paper: Unsupervised Semantic Parsing

This paper presents the USP system, upon which the system of the focus paper is based. The authors aim to create a fully unsupervised system capable of parsing text into a deep semantic representation. Instead of trying to learn syntax and semantics at the same time, the system takes syntactic parses as given and induces semantics from them. The first stage in the process is a deterministic mapping from the given dependency tree to a quasi-logical form (QLF): essentially just a first-order logical representation of the dependency tree. The goal of semantic parsing is then to partition these representations into separate lambda forms and cluster lambda forms that have the same meaning. The model is trained using Markov Logic Networks, and parsing is done via greedy search through the space of possible partitions of the QLF.
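The deterministic dependency-tree-to-QLF step can be sketched roughly like this (the function and predicate naming are simplified stand-ins of my own; USP's actual QLF is built from lemmas and Stanford dependency labels):

```python
def dep_tree_to_qlf(tokens, edges):
    """Map a dependency tree to a quasi-logical form: one unary atom
    per token (its lemma applied to a node constant) and one binary
    atom per dependency edge (the relation applied to head and child)."""
    atoms = [f"{lemma}(n{i})" for i, lemma in enumerate(tokens)]
    atoms += [f"{rel}(n{h}, n{d})" for h, rel, d in edges]
    return atoms
```

For example, the sentence "Microsoft buys Powerset" with edges nsubj(buys, Microsoft) and dobj(buys, Powerset) yields a conjunction of atoms over node constants, which later stages partition into lambda forms.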

Since there is no clear way to evaluate an unsupervised semantic parser intrinsically, the authors apply their system to the task of question answering. They restrict their attention to the biomedical domain and compare against other open-domain question answering systems, which their system outperforms. In addition, they manually inspect their system's output and claim that the induced semantic clusters tend to be very coherent and that the system is capable of understanding many different types of paraphrase.

Pre-meeting - Alan

Related paper: Open Information Extraction from the Web
Focus paper: Unsupervised Ontology Induction from Text

The related paper I read introduces a new extraction paradigm called Open Information Extraction (OIE). In this paradigm, the system makes a single pass over the corpus and extracts a large number of relations. The system requires no human input; it automatically discovers and stores relations of interest, independent of domain. Open IE operates without knowing the relations a priori, whereas standard IE systems operate only on relations specified in advance by the user.

As for experiments and evaluation, the paper uses a fully implemented OIE system called TextRunner which, when compared to the state-of-the-art KnowItAll web extraction system, achieves a noticeably lower error rate and significantly improved runtime performance while maintaining a similar level of accuracy.

The system's basic architecture consists of a single-pass extractor, which makes one pass over the entire corpus and, using a part-of-speech tagger, extracts candidate tuples for all possible relations. Each candidate tuple is then sent to a self-supervised learner, which classifies it as "trustworthy" or not. Lastly, a redundancy-based assessor assigns a probability to each retained tuple.
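As a rough illustration of the redundancy-based assessor (the real system uses a probabilistic "urns" model; the simple count-based formula and function here are my own stand-ins):

```python
from collections import Counter

def assess(extracted_tuples, k=1.0):
    """Toy redundancy-based assessor: a tuple extracted more often
    (ideally from distinct sentences) receives a higher probability,
    here count / (count + k) for a smoothing constant k."""
    counts = Counter(extracted_tuples)
    return {t: c / (c + k) for t, c in counts.items()}
```

The intuition is simply that a relation asserted independently many times across the web is more likely to be correct than one seen once.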

Two notable properties of TextRunner are its performance and, consequently, its scalability. Its runtime is constant in the number of relations, as opposed to linear. Although a direct comparison is difficult, TextRunner extracted all relations in about 85 CPU hours in a test run, whereas KnowItAll took 6.3 hours per relation.