Tuesday, February 1, 2011

Brendan on Hobbs et al 1997, "FASTUS"

I (Brendan) read the classic paper,

  • FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text
    Jerry R. Hobbs, Douglas Appelt, John Bear, David Israel, Megumi Kameyama, Mark Stickel, Mabry Tyson
    1997 (in a book or something), available at: http://arxiv.org/abs/cmp-lg/9705013

The task is templated information extraction: find events and their attributes, as defined by a frame/relation schema, from newswire text. It's like the merger/acquisition recognition problem the Haghighi and Klein paper uses. The FASTUS system was the best? or one of the best? systems competing in some of the MUC (Message Understanding Conference) shared task competitions that ran in the late 80's to early 90's.

There's a rollicking good NLP engineering war story behind it. They started with a complicated semantic analysis system with a parser and other things. It was slow. They were near the submission deadline and had a bad system, and realized they could take the finite-state component and use it for the entire system. Because development was orders of magnitude faster -- it ran in 10 minutes on the dataset, instead of many hours -- they could quickly iterate their rule engineering and dramatically improved their accuracy scores over the course of just a few weeks.

FASTUS is a cascaded finite-state extraction system. That means it runs 5 layers of pattern recognizers/chunkers in a pipeline. The output from one layer is the input to the next.

1. Complex words
=> fixed expression multiwords and names. Sometimes look at immediate context for names recognition.
=> editorial: I think fixed multiwords is a hugely neglected area for document understanding, today.

2. Basic phrases
=> small noun chunks, verb chunks, critical function word classes, certain entity classes (e.g. Location, Company Name).

3. Complex phrases
=> modifier and PP attachments to verb and noun chunks.

4. Domain events
=> e.g. [Company] [Start] [Activity] in/on [Date]
=> Note that these are much easier to engineer with the preprocessing above.
=> They expand patterns into alternate ordered forms; this is what finite-state is supposed to be bad at because it requires cross-producting out your space.

5. Merging structures
=> Coreference resolution via exact name match. They don't have a KB?
=> This gets the final frame structures for evaluation.

Since the system is all finite-state, it's very fast. They report it being an order of magnitude faster than competing approaches.

The subdivision between layers helps to understand the system. They claim that steps 1-3 are linguistically universal, and therefore domain-independent. Steps 4-5 are specific for the domain. They say they port the system to new domains by only having to rewrite steps 4-5.

I really like that how they approach syntactic tagging and chunking. It's much more rational than the way the NLP field defines POS tagging and parsing without any context of the final application.

If I had to write an IE system in a month, I'd use an approach basically like this. The challenge for machine learning is to better automate the whole system. Haghighi and Klein cite work that claims to replace steps 2-4ish with HMM or CRF sort of things, and H&K themselves model steps 4-5ish. This is the right direction but at 4 pages it's awfully limited relative to the overall goal.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.