Thursday, February 10, 2011

Post-Meeting Commentary for Feb. 10, 2011

I guess after some time I realized that imposing constraints by coupling extractor training should actually be quite intuitive. In general, if you have multiple sets of information, it is natural to ask whether one set could increase the amount of information in the other, or vice versa.

So although bootstrapped learners are common for semi-supervised learning, there are a few other approaches, such as EM, graphical modeling, or the graph-based scheme Prof. Smith mentioned during seminar. It's difficult, but I feel that if we can keep pushing the accuracy achievable from just simple seed data, that would be a major step up from supervised learning, even though it probably still needs a lot of work.

I also found the papers Daniel and Brendan introduced pretty interesting. The idea of the card-pyramid-like data structure, regardless of the end goal, seems like a cool approach, although I guess its usefulness was unclear in the end. Daniel introduced the FACTORIE library, which can construct graphical models with pretty good performance results. I recall Markov Logic Networks were mentioned as the basis for comparison, although I am unfamiliar with those at the moment.

As a note, I have yet to find anyone's sharing to be uninteresting, although sometimes I feel like the conversation goes a bit out of the scope of my current knowledge base.

-Alan

Reading for 3/3/11: Reichart and Rappoport, EMNLP 2010




Author:  Roi Reichart and Ari Rappoport
Venue:  EMNLP 2010
Leader:  Alan
Request:  When you post to the blog, please include:
  1. Your name (plus "leader") if you are leading the discussion
  2. Which focus paper this post relates to
  3. Whether this is the pre-meeting review or the post-meeting summary
Reminders:
  • Leave a comment on this post (non-anonymously) giving the details of the related paper you will read (include a URL), by Monday, February 28.
  • Post your commentary (a paragraph) as a new blog post, by Wednesday, March 2.

Week 4 pre-meeting summary

Collective Cross-Document Relation Extraction Without Labelled Data
Limin Yao, Sebastian Riedel, and Andrew McCallum
EMNLP 2010
Pre-meeting summary

The focus paper proposes an approach that jointly models entity type prediction and relation extraction, and that explicitly models compatibility (in this paper, selectional preferences). Furthermore, information across documents is used to exploit redundancy. Freebase is used as the source of distant supervision. Their evaluation showed substantial performance gains, especially when testing on out-of-domain data.

The related papers were all close to the focus paper. There was overlap (three people) for the paper about distant supervision. Other papers were about applying constraints to semi-supervised learning, joint entity and relation extraction, and Factorie, a programming language for graphical models.

Weisi, Dhananjay, and Dong read "Distant supervision for relation extraction without labeled data". This paper introduces the 'distant supervision' paradigm and serves as a basis for the focus paper. Freebase is used to extract training instances, and logistic regression is then used as the classifier. They used both lexical and syntactic features, and found that syntactic features outperform lexical features for ambiguous or lexically distant relations. Evaluation was done with held-out data and by manual inspection. Negative examples were created by selecting random entity pairs that were not in a Freebase relation. It is not totally clear why they chose to sample 1% as negative examples rather than a different number.

Daniel read "Factorie: Probabilistic programming via imperatively defined factor graphs". Factorie is a combined imperative and declarative language for specifying conditional undirected graphical models. It was used in the focus paper to construct the graphical model. In their evaluation, the authors obtained a 20-25% error reduction and a 3-15 times speedup over the next best system, which used Markov Logic Networks.

Alan read "Coupled Semi-Supervised Learning for Information Extraction". The paper takes a semi-supervised learning approach to extracting both entities and relations. Semi-supervised learning often suffers from low accuracy. In their approach, multiple extractors are trained together, and the resulting constraints are applied to increase the accuracy of each extractor. In their evaluation, adding the constraints yielded significantly higher precision.

Brendan read "Joint entity and relation extraction using Card-Pyramid Parsing". A pyramid structure is placed over the chunks of a sentence; the nodes are possible relations between pairs of chunks. Their motivation and goal don't seem very clear; for example, it isn't clear whether they wanted to create single coherent trees. In their evaluation, joint inference sometimes improved performance.

In addition, Michael read "Bi-directional Joint Inference for Entity Resolution and Segmentation Using Imperatively-Defined Factor Graphs" and Matt read "Learning 5000 relational extractors", but there was no write-up for these.

Wednesday, February 9, 2011

"Pyramid Parsing" paper (comments by Brendan)

I liked the focus paper a lot.

My paper was "Joint entity and relation extraction using Card-Pyramid Parsing" by Kate and Mooney (CoNLL 2010). They place a "pyramid" structure over the chunks of a sentence, where the nodes are possible relations that could hold between pairs of chunks. (Like a CKY chart. But the semantics are supposed to be that a node refers to the leaves of its span endpoints... I think.) There are productions for, e.g., how relations are composed of entity types. I had a hard time telling if the goal is to create a single coherent tree, exactly. They do a parse using classifiers for entity and relation recognition as features for parse decisions (Nivre-like, kind of). They have results showing that joint inference sometimes improves performance. But I was still confused by the motivation of their approach, beyond joint inference (for which one can imagine many other reasonable approaches).
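
To make the data structure concrete, here is a minimal sketch of how I picture the pyramid being filled bottom-up, CKY-style. This is my own reconstruction under that reading, not the authors' code; entity_clf and relation_clf stand in for their trained classifiers.

    # Leaves are entity chunks; node (i, j) holds the candidate relation
    # between the chunks at positions i and j (its span endpoints).
    def build_pyramid(chunks, entity_clf, relation_clf):
        n = len(chunks)
        labels = {}
        # Level 0: classify each chunk with an entity-type classifier.
        for i, chunk in enumerate(chunks):
            labels[(i, i)] = entity_clf(chunk)
        # Higher levels, bottom-up as in CKY: node (i, j) scores possible
        # relations between the entity types at its two endpoints.
        for width in range(1, n):
            for i in range(n - width):
                j = i + width
                labels[(i, j)] = relation_clf(chunks[i], chunks[j],
                                              labels[(i, i)], labels[(j, j)])
        return labels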

Pre-meeting commentary for week 4

Related paper - Distant supervision for relation extraction without labeled data
(Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky)

The paper introduces distant supervision as a new paradigm for training classifiers. The paradigm is applied to mine relations from unlabelled data.

Distant supervision is achieved by referring to Freebase. Each relation in Freebase is mined in the training data to extract textual features used to train the classifier. The features considered are:
* Lexical features - (1) the sequence of words, (2) their POS tags, (3) which entity came first, and (4) a window of k words to the left of e1 and to the right of e2, along with their POS tags
* Syntactic features - (1) the dependency path and (2) one window node
In the preprocessing step, consecutive words with the same named entity tag (and occurring consecutively in the parse tree) are 'chunked'. Negative examples are provided by using random unrelated (according to Freebase) entities. Evaluation is done using held-out data from Freebase and human evaluation (separately). Syntactic features outperform lexical features in cases where the sentences that mention the relations are ambiguous.
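
As a rough illustration of the conjunctive lexical feature (my own sketch reconstructed from the description above, not the authors' code), where tokens and pos are parallel lists and e1, e2 are (start, end) spans with e1 first:

    def lexical_feature(tokens, pos, e1, e2, k=2):
        # Words/POS between the two entities, plus k-word windows on
        # either side; everything is conjoined into one sparse indicator.
        between = tuple(zip(tokens[e1[1]:e2[0]], pos[e1[1]:e2[0]]))
        left = tuple(zip(tokens[max(0, e1[0]-k):e1[0]],
                         pos[max(0, e1[0]-k):e1[0]]))
        right = tuple(zip(tokens[e2[1]:e2[1]+k], pos[e2[1]:e2[1]+k]))
        return ("LEX", left, between, right, "E1_BEFORE_E2")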

Response for Feb. 10, 2011

To recap, I read

Coupled Semi-Supervised Learning for Information Extraction

As a starting note, I want to comment that the focus paper was fairly easy to digest this time around. Compared to previous papers, there was sufficient background given and more tangible examples to keep a sanity check on what was going on.

The supplemental paper I read this week looks at semi-supervised learning for extracting both categories (entities) and relations. Supervised learning is costly in the sense that sufficient amounts of labeled data are required, so semi-supervised learning seeks to mitigate that cost, at a highly undesirable sacrifice in accuracy. The selling point here is that different information extractors may be able to tell us something about one another. So instead of learning individual information extractors on their own, we can train multiple extractors together and apply the resulting constraints to increase the accuracy of each extractor. Intuitively, this seems like a general step in the right direction, since it seems fairly likely that some relations can help restrict the possibilities of certain other relations. Another problem this could apply to is semantic drift in bootstrap learning methods, which is mentioned briefly.

In terms of results, several algorithms are compared against versions of themselves augmented with an additional coupling procedure that filters out candidates using mutual exclusion and type checking. The versions with the additional constraints showed significantly higher average precision for the promoted instances in their bootstrapping learner. Pretty cool stuff from in-house.
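
To make the coupling concrete, here is a minimal sketch of the filtering step as I understand it (my own illustration, not the authors' implementation): each extractor proposes candidates, and candidates violating mutual-exclusion or argument type-checking constraints are dropped before promotion.

    def coupled_filter(cat_candidates, rel_candidates, promoted, mutex, arg_types):
        # promoted: dict category -> set of already-promoted instances
        # mutex: dict category -> categories mutually exclusive with it
        # arg_types: dict relation -> (arg1 category, arg2 category)
        kept_cats = [
            (cat, x) for cat, x in cat_candidates
            # Mutual exclusion: x must not already be promoted under a
            # category declared exclusive with cat (e.g. City vs. Person).
            if not any(x in promoted.get(other, set())
                       for other in mutex.get(cat, ()))
        ]
        kept_rels = [
            (rel, (a, b)) for rel, (a, b) in rel_candidates
            # Type checking: a relation's arguments must be promoted
            # instances of the categories the relation expects.
            if a in promoted.get(arg_types[rel][0], set())
            and b in promoted.get(arg_types[rel][1], set())
        ]
        return kept_cats, kept_rels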

-Alan

Pre-meeting commentary for week 4

Focus paper: Collective Cross-Document Relation Extraction Without Labeled Data. Limin Yao, Sebastian Riedel, and Andrew McCallum. EMNLP 2010

Related paper: Factorie: Probabilistic programming via imperatively defined factor graphs. Andrew McCallum, Karl Schultz, and Sameer Singh. NIPS 2009.

This paper describes the system, Factorie, that underlies the work of the focus paper. Factorie is a combined imperative and declarative language for specifying conditional undirected graphical models, and it provides routines for learning the parameters of such models using MCMC methods. Models are specified by describing the variables present, the factors that are used to score assignments to the variables (and how they are shared), and a proposal function for generating a proposal distribution. This last step is optional, and generic methods such as Gibbs sampling can be chosen instead. They use a method called SampleRank to avoid having to compute marginals or do full decoding of the input data.
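
Factorie itself is a Scala library, so the following is only a Python-flavored sketch of the imperative idea as I understand it: factors are ordinary functions that score an assignment, and inference is a user-supplied proposal loop (SampleRank's weight-update step is omitted).

    import math, random

    def run_mcmc(assignment, factors, propose, steps=1000):
        # assignment: dict variable -> value; factors: list of functions
        # that score an assignment; propose: returns a modified copy.
        def score(a):
            return sum(f(a) for f in factors)
        current = score(assignment)
        for _ in range(steps):
            candidate = propose(assignment)
            cand = score(candidate)
            # Metropolis-style acceptance on factor scores. (Factorie's
            # SampleRank also updates factor weights during sampling,
            # which is omitted here.)
            if cand >= current or random.random() < math.exp(cand - current):
                assignment, current = candidate, cand
        return assignment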

To demonstrate the system, the authors use Factorie for the problem of joint segmentation and coreference of paper citations. They obtain a 20-25% error reduction and a 3-15 times speedup over the next best system, one that uses Markov Logic Networks.

Pre-meeting Post from Weisi Duan

I have read the paper "Distant supervision for relation extraction without labeled data" by Mike Mintz et al. The paper serves as the basis of the focus paper. The main idea is to use distant supervision to bootstrap the training instances, and then use a multi-class logistic regression classifier to predict the relation. The features are a key part: the authors utilized both lexical and syntactic features, and compared the performance of using both against each individually. The evaluation is done in a similar way to the focus paper, since the focus paper builds on this one. The authors sampled 1% of the entity pairs unrelated according to Freebase to use as negative cases; while this is intuitively plausible, I wonder what the theoretical foundation for it is. I don't see exactly how it falls out as a bias-variance trade-off. The authors could also evaluate recall to make the results stronger, which I feel might show how much the conjunctive features (which tend to induce high precision) affect the recall.

Comments week 4

Focus paper: Collective Cross-Document Relation Extraction Without Labeled Data, Yao et al., EMNLP 2010
Pre-meeting

My related paper was:
Distant supervision for relation extraction without labeled data
Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky, ACL 2009

Previous approaches for relation extraction include supervised, unsupervised, and bootstrapping approaches. Mintz et al. propose 'distant supervision'. Their approach is built on the following intuition: if two entities participate in a relation, any sentence that contains those two entities is likely to express that relation. They therefore use an external database (e.g., Freebase) to create a (noisy) training set.

Their approach is as follows:

- For each pair of entities that appears in some Freebase relation, find all sentences containing both
- Extract features; features are aggregated across those sentences
- Train a classifier (a minimal sketch of this pipeline is given below)
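
As a rough sketch of that pipeline (my own illustration, not the authors' code; the helper functions entity_pairs_in and extract_features are hypothetical stand-ins):

    from collections import defaultdict

    def build_training_set(sentences, kb, entity_pairs_in, extract_features):
        # kb: dict (e1, e2) -> Freebase relation name
        examples = defaultdict(list)
        for sent in sentences:
            for e1, e2 in entity_pairs_in(sent):
                relation = kb.get((e1, e2))
                if relation is not None:
                    # Features from every sentence mentioning the pair are
                    # aggregated into one training instance per pair.
                    examples[(e1, e2, relation)].extend(
                        extract_features(sent, e1, e2))
        return examples  # feed to a multi-class logistic regression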

The focus paper notes that distant supervision doesn't work well when the target text and the database (e.g., Freebase) are not closely related. This is often due to violations of compatibility constraints (such as selectional preferences). Their model therefore explicitly models compatibility. Furthermore, it jointly models entity type prediction and relation extraction, and it exploits redundancy by using information from multiple documents. They use a CRF where the variables represent facts and the factors measure compatibility. Experiments were done using Freebase as distant supervision on the Wikipedia and New York Times datasets. They obtained a large performance gain over the baseline on the NYT dataset. Their gain on the Wikipedia set was smaller, but that is because the Wikipedia test set is very similar to Freebase.

Friday, February 4, 2011

Week 3 - post meeting

Focus paper: An Entity-Level Approach to Information Extraction
Aria Haghighi and Dan Klein, ACL 2010
Post-meeting

Some of the topics covered in the meeting were
* Comparison of the task of the focus paper with more recent tasks (such as Read the Web)
* Evaluation venues (MUC, DUC, TAC etc.). One of the current tasks in TAC is the Knowledge Base Population task (http://nlp.cs.qc.cuny.edu/kbp/2010/)
Another task which is somewhat related is the entity track in TREC (http://ilps.science.uva.nl/trec-entity/, such as entity list completion).
* The difference between the focus paper and their NAACL paper. Overall, people agreed that it was very similar (but with some changes, such as different task, partially supervised etc.) but that it was ok because they had clearly stated it in the focus paper.
* The way they did the inference (variational EM, etc.).
* Real world applications and relevance of these template-filling tasks for companies

This time there was a lot of overlap between the choice of related papers.
The papers read by people tackling the IE task could be divided into two approaches: the more heuristic, task-based approaches, often using classifiers (Patwardhan and Riloff 2009; Liao and Grishman 2010), and approaches where a formal model is defined (the focus paper).

Thursday, February 3, 2011

Reading for 2/10/11: Yao et al., EMNLP 2010




Author:  Limin Yao, Sebastian Riedel, and Andrew McCallum
Venue:  EMNLP 2010
Leader:  Dong
Request:  When you post to the blog, please include:
  1. Your name (plus "leader") if you are leading the discussion
  2. Which focus paper this post relates to
  3. Whether this is the pre-meeting review or the post-meeting summary
Reminders:
  • Leave a comment on this post (non-anonymously) giving the details of the related paper you will read (include a URL), by Monday, February 7.
  • Post your commentary (a paragraph) as a new blog post, by Wednesday, February 9.

Week 3 - Pre-meeting commentary

This week's focus paper discussed a generative solution to the template-filling problem. The generation process is composed of three components: semantic (generating entities for roles), discourse (entity indicators for each mention), and mention generation. Learning is done through variational EM. A major assumption of the paper is the one-to-one mapping of roles to entities.

The related papers read this week had a great amount of overlap. They concerned (1) Co-reference resolution, (2) Another approach to role filling using event recognition, and (3) Event extraction.

Matt read "A unified model of phrasal and sentential evidence for information extraction" by S. Patwardhan and E. Riloff. The paper proposes using a classifier's judgment of whether a sentence conveys an event as a factor in deciding whether a mention in that sentence fills a role. The authors claim that one of the significant contributions of the paper is classifying a sentence as an event sentence. The GLACIER model outperforms a context-only baseline on test data due primarily to (1) extracting entities with inconclusive local context but clearer sentence-level context, and (2) reducing false positives by identifying uneventful sentences and not attempting entity extraction.

Daniel, Alan, and Weisi read "Coreference Resolution in a Modular, Entity-Centered Model". This model differs from the focus paper in that it uses a log-linear model over multiple features for the discourse component, as compared to tree distance alone. Another difference is that it does not assume that a role maps to a single entity.

Dong read "Using Document Level Cross-Event Inference to Improve Event Extraction". It brings a much broader context, the whole document, to the template-filling problem. It tries to find the easy cases first, and then uses this information to tag the hard cases. It assumes that mentions of a word are likely to trigger the same event type in a given context, that there is a strong correlation between event types, that roles are consistent across events, and that the document provides strong context.

Brendan read "FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text", which discusses a fast transducer to extract events and their attributes. It is a pipeline of pattern recognizers. The domain-independent and domain-specific operations are clearly demarcated, making the system easy to adapt to new domains.

Wednesday, February 2, 2011

Comments for week 3

I read the paper --

S. Patwardhan and E. Riloff. 2009. A unified model of phrasal and sentential evidence for information extraction. In Empirical Methods in Natural Language Processing (EMNLP).

This paper combines two components to jointly model the probability using the following factorized distribution:

P(EvSent(S_NPi), PlausFillr(NPi) | F) = P(EvSent(S_NPi) | F) * P(PlausFillr(NPi) | EvSent(S_NPi), F)

where NPi is the candidate noun phrase, S_NPi is the sentence containing it, and F are the features.

The first component of the model is the "sentential event recognizer," which uses sentence-level features, and the second component is the "plausible role-filler recognizer." For the sentence classifier, they tried Naive Bayes, but (as is expected) it does not give very good probability estimates. Therefore, they decided to use an SVM and normalize the scores to get a probability. They used NB for the role-filler recognizer.
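
As a tiny illustration of how the factorization might be computed at test time (my own sketch, not the authors' code; the sigmoid is one plausible normalization, not necessarily the one they used):

    import math

    def normalize_svm_score(margin):
        # Map an SVM margin to a probability with a sigmoid (as in
        # Platt scaling); the paper normalizes scores, though not
        # necessarily in exactly this way.
        return 1.0 / (1.0 + math.exp(-margin))

    def role_filler_probability(svm_margin, p_filler):
        # P(EvSent, PlausFillr | F) =
        #     P(EvSent | F) * P(PlausFillr | EvSent, F)
        return normalize_svm_score(svm_margin) * p_filler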

From an ML/statistical perspective this is a far simpler model than the focus paper's, which uses a more complicated graphical model and jointly trains all the components of the model, even using unannotated data.

Matt

Week 3 - Related reading

For this week, I read A Unified Model of Phrasal and Sentential Evidence for Information Extraction; authors: Siddharth Patwardhan and Ellen Riloff. The primary objective of the research is to use sentential context, along with phrasal context, to identify entities that fill a particular role. The authors focus on NP extraction. They use a joint probability of whether the entity occurs in a sentence that discusses an event and whether it satisfies a particular role. Although the authors use classifier-based approaches to determine the latter probability, they acknowledge that pattern-based approaches can also be used.

The authors claim that one of the significant contributions of the paper is classifying a sentence as an event sentence. For this, the main problem is generating a training corpus.


Commentary for Feb 2

I read the paper Coreference Resolution in a Modular, Entity-Centered Model. This paper presents a mostly unsupervised approach to coreference resolution, with similar methods to the focus paper. Instead of trying to match mentions to slots in frames, they match mentions to entities. The system deals with a hierarchy of mentions which are present in the text, abstract entities, and types, which are classes of entities. This allows the model to make generalizations across multiple different entities.
The system uses a hierarchical generative process, where first a list of entities is drawn by drawing a list of types and then an entity from each of those types. Then, mentions are drawn from the entities using a sequential distance-dependent Chinese restaurant process. Finally, each of these mentions generates a surface realization.
To train the model, each level of the generative hierarchy is updated using EM in turn, until all have converged. This is an approximation to running normal EM, which would be computationally infeasible given the model. All of the training is done on unlabeled data, except for prototypes of the types, which are hard-coded in at the beginning of training.
The explicit modeling of discourse is interesting in this paper, and seems like a good starting point for generative models of discourse that could be incorporated into other models.
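
As a toy sketch of the distance-dependent CRP piece (illustrative only; the decay function and parameters are my assumptions, not the paper's): each new mention either links to an earlier mention with probability decaying in distance, joining its entity, or starts a new entity.

    import math, random

    def ddcrp_assignments(n_mentions, alpha=1.0, decay=0.5):
        entity_of = []
        for i in range(n_mentions):
            # Weight for linking to each earlier mention j decays with
            # distance; the final weight (alpha) is a self-link.
            weights = [math.exp(-decay * (i - j)) for j in range(i)] + [alpha]
            r = random.uniform(0, sum(weights))
            for j, w in enumerate(weights):
                r -= w
                if r <= 0:
                    break
            if j == i:                      # self-link: start a new entity
                entity_of.append(max(entity_of, default=-1) + 1)
            else:                           # link to mention j: same entity
                entity_of.append(entity_of[j])
        return entity_of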

Comments for February 2, 2011

To recap, I read

Coreference Resolution in a Modular, Entity-Centered Model

In the paper, the authors apply an entity-centered model that shares many features with the one used in the focus paper. Their approach is unsupervised, and their generative model makes use of distributional entity types, which is one of the major factors that separates this paper from the template-filling approach.

The flow of the generative model consists of three basic modules: one for semantics, one for discourse, and one for mention generation. From what I understand, the respective components in the focus paper are almost exactly the same, except that roles are replaced by types. Intuitively speaking, template-filling and coreference resolution seem like sibling problems, which fits with the fact that the two papers share authors and a year of publication.

The learning procedure divides the variables into subgroups and optimizes them in a round-robin update scheme. Again, this is not unlike the variational EM algorithm used in the focus paper. Evaluation is on several standard coreference resolution metrics that I am mostly unfamiliar with, but overall, the results show sizable improvement (reduced error rates) over previous work on almost all metrics.

What's cool about this work is that it exploits information at multiple levels, considering both individual entities and entity types. Of course, the motivation for this is that a better grasp of semantic constraints is the key to improved coreference resolution systems.

Still getting used to reading research papers...


-Alan

Week 3 - Pre-meeting Commentary - Michael

Focus paper: An Entity-Level Approach to Information Extraction
Related paper: A Unified Model of Phrasal and Sentential Evidence for Information Extraction
Authors: Siddharth Patwardhan and Ellen Riloff

Patwardhan and Riloff present GLACIER, a probabilistic model for extracting role-filling entities from sentences. Like Haghighi and Klein, the authors incorporate sentential information beyond local context to determine role-fillers, though not to the extent of the focus work as context is still limited to the sentence level. The system employs a sentential event recognizer to determine if a sentence discusses a relevant event for which role-filling entities can be extracted, followed by a plausible role-filler recognizer to extract such entities. The role-filler recognizer is implemented as a Naive Bayes classifier that considers contextual features generated by various off-the-shelf NLP tools such as named entity recognizers, shallow parsers, and semantic dictionaries. The sentential event recognizer, implemented alternatively as a NB classifier or SVM classifier, uses similar features calculated for all sentence NPs plus additional sentence-level features. The GLACIER model outperforms a context-only baseline on test data due primarily to (1) extracting entities with inconclusive local context but clearer sentence-level context, and (2) reducing false positives by identifying uneventful sentences and not attempting entity extraction. When viewed alongside the focus work, these results provide an intermediate data point that helps demonstrate the benefit of incrementally increasing context scope and model sophistication to improve performance of IE systems.

Week 3 Comments

Name: Weisi Duan
Focus paper: "An Entity-Level Approach to Information Extraction" by Aria Haghighi and Dan Klein.
This is a pre-meeting review.

I read the paper "Coreference Resolution in a Modular, Entity-Centered Model", in which the authors used a generative model that decomposes into three models. The semantic model and mention model seem to be similar to the two models in the focus paper. The difference seems to be the discourse model, for which they used a log-linear model that utilizes more features than the one in the focus paper, which uses only tree distance. Another difference is that in the focus paper the roles are mapped to the entities one-to-one, while in this paper the types are mapped one-to-many. Finally, the variational inference in this paper seems to jointly infer both the parameters and the entities. I wonder how one would do the inference with Gibbs sampling in both of these papers. One final thing is about evaluation: since the entities can take any form, I am not sure how exactly the extracted entities are mapped to the gold standard, e.g., mapped by overlap of the word lists of their properties. It is interesting to see the way they represented the entities using variable-length word lists; this could also be used in WSD to represent word senses, because there are lots of senses in WordNet with only one synonym. As said in the paper, the word list can skew the mention model toward the entity, and I guess this could also be done for a word sense.

Comments week 3

Focus paper: An Entity-Level Approach to Information Extraction
Aria Haghighi and Dan Klein, ACL 2010
Pre-meeting

In addition to the focus paper, I read:

Using Document Level Cross-Event Inference to Improve Event Extraction
Shasha Liao and Ralph Grishman, ACL 2010
http://aclweb.org/anthology-new/P/P10/P10-1081.pdf

Both papers are about the template-filling problem, and they both try to incorporate more information than just the local context as evidence. The focus paper presents a generative model, while the related paper incorporates global context features in their classifiers.

The related paper by Liao and Grishman presents an approach that uses document-level information to improve information extraction (event and role extraction). Their approach is built on several intuitions:

- First extract easier cases, then use this information to tag the harder cases
- If a word triggers a particular event, other instances of the word probably also trigger events of the same type
- There is a strong correlation between event types
- Role consistency
- Document level information can make labeling more consistent.

Their approach first applies a baseline, state-of-the-art IE system. This system extracts information independently for each sentence. For their second step, they keep only the high-confidence extracted events and roles, using a heuristic threshold. Then two additional classifiers are used, whose features draw on the high-confidence extracted events and roles. For example, one feature they used was a binary indicator of whether the particular event type was also present elsewhere in the document.
I think their intuitions are really nice, and the paper provided a nice analysis to motivate them. However, I think their final approach could be more sophisticated than generating binary indicator features as document-level information.
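
To make the document-level feature idea concrete, here is a hypothetical feature function of the kind described (my own illustration, not the authors' code):

    def cross_event_features(sentence_index, confident_events, event_types):
        # confident_events: list of (sentence_index, event_type) pairs
        # kept after the confidence threshold in the first pass.
        feats = {}
        for etype in event_types:
            feats["doc_has_" + etype] = int(any(
                idx != sentence_index and t == etype
                for idx, t in confident_events))
        return feats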

I like the approach of Haghighi and Klein because they present a generative model. Furthermore, they seem to use more sophisticated distance information (such as tree distance) in comparison with Liao and Grishman.

Tuesday, February 1, 2011

Brendan on Hobbs et al 1997, "FASTUS"

I (Brendan) read the classic paper,

  • FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text
    Jerry R. Hobbs, Douglas Appelt, John Bear, David Israel, Megumi Kameyama, Mark Stickel, Mabry Tyson
    1997 (in a book or something), available at: http://arxiv.org/abs/cmp-lg/9705013


The task is templated information extraction: find events and their attributes, as defined by a frame/relation schema, from newswire text. It's like the merger/acquisition recognition problem the Haghighi and Klein paper uses. The FASTUS system was the best? or one of the best? systems competing in some of the MUC (Message Understanding Conference) shared task competitions that ran in the late 80's to early 90's.

There's a rollicking good NLP engineering war story behind it. They started with a complicated semantic analysis system with a parser and other things. It was slow. They were near the submission deadline and had a bad system, and realized they could take the finite-state component and use it for the entire system. Because development was orders of magnitude faster -- it ran in 10 minutes on the dataset, instead of many hours -- they could quickly iterate their rule engineering and dramatically improved their accuracy scores over the course of just a few weeks.

FASTUS is a cascaded finite-state extraction system. That means it runs 5 layers of pattern recognizers/chunkers in a pipeline; the output from one layer is the input to the next. (A toy sketch of such a cascade follows the layer list below.)

1. Complex words
=> fixed expression multiwords and names. Sometimes look at immediate context for name recognition.
=> editorial: I think fixed multiwords is a hugely neglected area for document understanding, today.

2. Basic phrases
=> small noun chunks, verb chunks, critical function word classes, certain entity classes (e.g. Location, Company Name).

3. Complex phrases
=> modifier and PP attachments to verb and noun chunks.

4. Domain events
=> e.g. [Company] [Start] [Activity] in/on [Date]
=> Note that these are much easier to engineer with the preprocessing above.
=> They expand patterns into alternate ordered forms; this is what finite-state is supposed to be bad at because it requires cross-producting out your space.

5. Merging structures
=> Coreference resolution via exact name match. They don't have a KB?
=> This gets the final frame structures for evaluation.
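
To illustrate the cascade idea, here is a toy sketch of the architecture (my own, not FASTUS's actual rules or code): each stage is a table of rewrite rules over the previous stage's output, and porting to a new domain means swapping the later tables.

    import re

    def layer(patterns):
        # Build one cascade stage from (regex, replacement) rewrite rules.
        compiled = [(re.compile(p), r) for p, r in patterns]
        def run(text):
            for pat, repl in compiled:
                text = pat.sub(repl, text)
            return text
        return run

    # Layers 1-3 (universal) and 4-5 (domain-specific) are just more rule
    # tables; only the later ones need rewriting for a new domain.
    complex_words = layer([(r"\bNew York\b", "[LOC New York]")])
    domain_events = layer([(r"\[ORG (\w+)\] acquired \[ORG (\w+)\]",
                            r"[ACQUISITION acquirer=\1 target=\2]")])

    def cascade(text, stages):
        for stage in stages:
            text = stage(text)
        return text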


Since the system is all finite-state, it's very fast. They report it being an order of magnitude faster than competing approaches.

The subdivision between layers helps to understand the system. They claim that steps 1-3 are linguistically universal, and therefore domain-independent. Steps 4-5 are specific for the domain. They say they port the system to new domains by only having to rewrite steps 4-5.

I really like how they approach syntactic tagging and chunking. It's much more rational than the way the NLP field defines POS tagging and parsing without any context of the final application.

If I had to write an IE system in a month, I'd use an approach basically like this. The challenge for machine learning is to better automate the whole system. Haghighi and Klein cite work that claims to replace steps 2-4ish with HMM or CRF sort of things, and H&K themselves model steps 4-5ish. This is the right direction but at 4 pages it's awfully limited relative to the overall goal.