Wednesday, April 27, 2011

Pre-Meeting Alan

Related paper: Open Information Extraction from the Web
Focus paper: Unsupervised Ontology Induction from Text

The related paper I read introduces a new extraction paradigm called Open Information Extraction (OIE, or Open IE). In this paradigm, the system makes a single pass over the corpus and extracts a large set of relational tuples. It requires no human input; it automatically discovers and stores relations of interest, independent of domain. Open IE operates without knowing the relations a priori, whereas standard IE systems only operate on relations specified in advance by the user.

As for experiments and evaluation, the paper uses a fully implemented OIE system called TextRunner. Compared to the state-of-the-art KnowItAll web extraction system, TextRunner has a noticeably lower error rate and runs significantly faster, while producing a comparable number of correct extractions.

The basic architecture of the system has three parts. A single-pass extractor makes one pass over the entire corpus and extracts candidate tuples for all possible relations, using a part-of-speech tagger to do so. Each candidate tuple is then labeled "trustworthy" or not by a classifier that comes from a self-supervised learner. Lastly, a redundancy-based assessor assigns a probability to each retained tuple.
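To make the three stages concrete for myself, here is a minimal toy sketch in Python. It is not TextRunner's code: the tiny tag lexicon (`TOY_TAGS`), the pronoun-rejecting rule in `is_trustworthy`, and the probability formula in `assess` are all stand-ins I made up for illustration. The real system uses a trained tagger, a learned classifier, and its own probabilistic model.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_TAGS = {
    "edison": "NOUN", "invented": "VERB", "the": "DET", "phonograph": "NOUN",
    "it": "PRON", "liked": "VERB", "radio": "NOUN",
}

def tag(sentence):
    """Lowercase, strip a trailing period, and look each word up in the toy lexicon."""
    words = sentence.lower().rstrip(".").split()
    return [(w, TOY_TAGS.get(w, "X")) for w in words]

def extract_candidates(tagged):
    """Stage 1: for every verb, pair the nearest noun/pronoun on the left
    with the nearest noun on the right to form (arg1, relation, arg2)."""
    candidates = []
    for i, (word, pos) in enumerate(tagged):
        if pos != "VERB":
            continue
        left = [w for w, p in tagged[:i] if p in ("NOUN", "PRON")]
        right = [w for w, p in tagged[i + 1:] if p == "NOUN"]
        if left and right:
            candidates.append((left[-1], word, right[0]))
    return candidates

def is_trustworthy(candidate, tagged):
    """Stage 2 stand-in: a hand-written rule instead of a learned classifier.
    Here a tuple is rejected when its first argument is not a noun."""
    pos_of = dict(tagged)
    return pos_of.get(candidate[0]) == "NOUN"

def assess(counts, p_single=0.6):
    """Stage 3 stand-in: the more often a tuple is seen, the higher its score.
    The formula 1 - (1 - p)^k is an assumption, not the paper's model."""
    return {t: 1 - (1 - p_single) ** k for t, k in counts.items()}

def run_pipeline(corpus):
    """One pass over the corpus; no relation names are given up front."""
    counts = Counter()
    for sentence in corpus:
        tagged = tag(sentence)
        for candidate in extract_candidates(tagged):
            if is_trustworthy(candidate, tagged):
                counts[candidate] += 1
    return assess(counts)

if __name__ == "__main__":
    corpus = [
        "Edison invented the phonograph.",
        "Edison invented the phonograph",
        "It liked the radio.",
    ]
    for tup, prob in run_pipeline(corpus).items():
        print(tup, round(prob, 2))
```

The point of the sketch is only the shape of the pipeline: nothing in `run_pipeline` names a relation in advance, and a tuple that shows up more than once ends up with a higher score than it would from a single sighting.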

The two big selling points of TextRunner are its speed and, as a consequence, its scalability. Its runtime is constant in the number of relations, whereas KnowItAll's grows linearly with the number of relations. Although it's a little hard to compare the two directly, TextRunner extracted all relations in about 85 CPU hours on a test run, while KnowItAll took 6.3 hours per relation.
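A quick back-of-the-envelope calculation shows what constant versus linear means with these numbers. This is only a sketch: it treats the 85 CPU hours and the 6.3 hours per relation as if they were the same unit, which (as noted above) they are not quite, and the function names are my own.

```python
TEXTRUNNER_TOTAL_HOURS = 85.0        # one pass, all relations (figure from the post)
KNOWITALL_HOURS_PER_RELATION = 6.3   # cost repeated for every relation

def textrunner_hours(num_relations):
    """Constant: the single pass costs the same however many relations there are."""
    return TEXTRUNNER_TOTAL_HOURS

def knowitall_hours(num_relations):
    """Linear: each additional relation adds another 6.3 hours."""
    return KNOWITALL_HOURS_PER_RELATION * num_relations

# Number of relations at which the two costs are equal (under the assumption above).
break_even = TEXTRUNNER_TOTAL_HOURS / KNOWITALL_HOURS_PER_RELATION

if __name__ == "__main__":
    print(f"break-even at about {break_even:.1f} relations")
    for n in (1, 10, 100, 1000):
        print(n, textrunner_hours(n), round(knowitall_hours(n), 1))
```

Under that simplification, the per-relation approach is only cheaper when you care about roughly a dozen relations or fewer; past that point the single pass wins, and the gap keeps growing with every relation added.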
