Focus paper: Collective Cross-Document Relation Extraction Without Labeled Data, Yao et al., EMNLP 2010
My related paper was:
Distant supervision for relation extraction without labeled data
Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky, ACL 2009
Previous approaches to relation extraction include supervised, unsupervised and bootstrapping approaches. Mintz et al. propose 'distant supervision'. Their approach is built on the following intuition: if two entities participate in a relation, any sentence that contains those two entities is likely to express that relation. They therefore use an external database (e.g. Freebase) to create a (noisy) training set.
Their approach is as follows:
- For each pair of entities that appears in some Freebase relation, find all sentences containing both entities.
- Extract features from each of these sentences and aggregate them across sentences for the pair.
- Train a classifier on the aggregated features.
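
The steps above can be sketched roughly as follows. This is only a toy illustration, not the implementation from Mintz et al.: the knowledge base `KB`, the example sentences, and the helper names `words_between` and `build_training_set` are all made up here, entity mentions are found by plain substring matching, and the only feature is the words between the two mentions. The classifier training step is left out.

```python
from collections import defaultdict

# Hypothetical toy knowledge base: (entity1, entity2) -> relation name.
KB = {
    ("Barack Obama", "Honolulu"): "born_in",
    ("Google", "Larry Page"): "founded_by",
}

# Hypothetical unlabeled text.
SENTENCES = [
    "Barack Obama was born in Honolulu in 1961.",
    "Barack Obama visited Honolulu last week.",
    "Google was founded by Larry Page and Sergey Brin.",
    "Larry Page spoke in Honolulu.",
]


def words_between(sentence, e1, e2):
    """Return one crude feature (the words between the two mentions),
    or None if the sentence does not mention both entities."""
    i, j = sentence.find(e1), sentence.find(e2)
    if i == -1 or j == -1:
        return None
    if i < j:
        middle = sentence[i + len(e1):j]
    else:
        middle = sentence[j + len(e2):i]
    return "between:" + "_".join(middle.split())


def build_training_set(kb, sentences):
    """Aggregate features over all sentences that mention each KB pair
    and label the pair with its KB relation (the distant label)."""
    features = defaultdict(list)
    for (e1, e2) in kb:
        for sentence in sentences:
            feat = words_between(sentence, e1, e2)
            if feat is not None:
                features[(e1, e2)].append(feat)
    return [(pair, feats, kb[pair]) for pair, feats in features.items()]
```

Calling `build_training_set(KB, SENTENCES)` gives one training example per entity pair. Note that the "visited" sentence also ends up under the `born_in` label, which is exactly the kind of noise the distant supervision assumption introduces.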
The focus paper notes that distant supervision does not work well when the target text and the database (e.g. Freebase) are not closely related. This is often caused by violations of compatibility constraints (such as selectional preferences). Their model therefore models compatibility explicitly. Furthermore, it jointly models entity type prediction and relation extraction, and it exploits redundancy by using information from multiple documents. They use a CRF in which the variables represent facts and the factors measure compatibility. Experiments were done using Freebase as distant supervision on the Wikipedia and New York Times datasets. On the NYT dataset the model improved substantially over the baseline. The gain on the Wikipedia set was smaller, but that is because the Wikipedia test set is very similar to Freebase.
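
To make the idea of a compatibility factor a bit more concrete, here is a toy sketch of joint inference over one relation variable and two entity type variables. It is not the model, training procedure or inference method from Yao et al.: all scores, the label sets, and the names `compatibility` and `best_assignment` are invented for illustration, and the best assignment is found by brute-force enumeration, which only works for a graph this small.

```python
from itertools import product

ENTITY_TYPES = ["person", "location"]
RELATIONS = ["born_in", "none"]

# Hypothetical local scores, standing in for per-variable factors.
type_score_e1 = {"person": 1.0, "location": 0.2}
type_score_e2 = {"person": 0.9, "location": 0.8}
relation_score = {"born_in": 1.2, "none": 1.0}


def compatibility(rel, t1, t2):
    """Toy selectional preference: born_in expects (person, location).
    Incompatible combinations receive a large penalty."""
    if rel == "born_in" and (t1, t2) != ("person", "location"):
        return -5.0
    return 0.0


def total_score(rel, t1, t2):
    """Sum of the local scores and the compatibility factor."""
    return (relation_score[rel] + type_score_e1[t1]
            + type_score_e2[t2] + compatibility(rel, t1, t2))


def best_assignment():
    """Enumerate every joint assignment and return the highest scoring one."""
    candidates = product(RELATIONS, ENTITY_TYPES, ENTITY_TYPES)
    return max(candidates, key=lambda a: total_score(*a))
```

With these made-up numbers, choosing each variable independently would give `born_in` with two `person` arguments, which violates the selectional preference. Scoring the variables jointly instead prefers `("born_in", "person", "location")`: the compatibility factor flips the second entity's type rather than dropping the relation.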