Tuesday, May 3, 2011

Public Note

This blog is a collective journal kept by participants in the Spring 2011 iteration of the Advanced Natural Language Processing Seminar (11-713) at the Language Technologies Institute at Carnegie Mellon University, taught by Noah Smith.  The participants in the course agreed for the journal to be made public in case others would find it useful.

The seminar operated as follows.  Each week, the instructor chose a focus paper from recent literature in NLP.  This was announced by Friday (usually).  By Monday, participants left a comment on the announcement saying what other paper they would read, individually, in addition to the focus paper.  On Wednesday, each participant posted a summary of the two papers.  The seminar met each Thursday, with mostly informal discussion, and students were encouraged to post follow-up thoughts for about half of the discussions.

Saturday, April 30, 2011

Post meeting - Dhananjay - 7 April

We discussed phrasal alignment for machine translation. The discussion was informative on two subjects. First, in the focus paper, the search space is reduced by searching only within inversion transduction grammars (ITG), and we had an introduction to what ITG is. The second was using MIRA for classification. MIRA is used as an incremental learning algorithm where each class is represented by a vector. An instance is assigned to the class whose vector is most similar to it. If the instance is incorrectly classified, then all the relevant vectors (the vector of the actual class and the vectors that are more similar to the instance than the actual class vector) are changed to move the instance towards the correct class. I was earlier apprehensive that the order of training instances may affect the class vectors. However, since the training instances are iterated over multiple times and in random order, there is very little chance of that happening.
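To make the update concrete, here is a minimal sketch of a simplified, single-constraint MIRA-style update. It is not the focus paper's implementation: it only adjusts the gold class and the single highest-scoring wrong class (rather than every class that outscores the gold one), and the names `mira_update`, `train`, the margin of 1 and the cap `C` are my own assumptions for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict(weights, x):
    # The instance goes to the class whose vector scores highest against it.
    return max(weights, key=lambda label: dot(weights[label], x))

def mira_update(weights, x, gold, C=1.0):
    """Update only on a mistake; returns True if the weights changed."""
    pred = predict(weights, x)
    if pred == gold:
        return False
    sq_norm = dot(x, x)
    if sq_norm == 0:
        return False
    # Loss: how far the wrong class outscores the gold class, plus a margin of 1.
    loss = dot(weights[pred], x) - dot(weights[gold], x) + 1.0
    # Smallest step that fixes the violation, capped at C (assumed hyperparameter).
    tau = min(C, loss / (2.0 * sq_norm))
    weights[gold] = [w + tau * xi for w, xi in zip(weights[gold], x)]
    weights[pred] = [w - tau * xi for w, xi in zip(weights[pred], x)]
    return True

def train(data, labels, dim, epochs=10, seed=0):
    rng = random.Random(seed)
    weights = {label: [0.0] * dim for label in labels}
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # random order on every pass over the training set
        for x, gold in data:
            mira_update(weights, x, gold)
    return weights

# Toy, linearly separable data (hypothetical).
data = [([1.0, 0.0], "a"), ([0.0, 1.0], "b")]
weights = train(data, labels=["a", "b"], dim=2)
```

The shuffling inside `train` is the part that speaks to my worry about instance order: each pass sees the instances in a different order.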

Post meeting - Dhananjay - 21 Apr

We discussed Bonnie Webber's paper from ACL 2009 on genre distinctions in the Penn Treebank. It was discussed that genre distinctions are important for sense disambiguation. My view was that genre distinction is similar to topic modeling, except that the first is classification while the second is clustering. It was pointed out that while in topic models multiple topics generate a document, a document belongs to only one genre. We also discussed different discourse connectives.

Friday, April 29, 2011

Post meeting summary for the first week discussion

During the meeting we discussed semantic parsing, with the focus paper "Inducing Probabilistic CCG Grammars from Logical Form with Higher-order Unification" by Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman, which appeared in EMNLP 2010. The paper was based on the 2005 UAI best paper by Luke Zettlemoyer and Michael Collins, with a few extensions, including a representation that generalizes over other formalisms and lexical items that can be multi-word. People noted that semantic parsing is not necessarily constrained to CCG, as any other formalism, e.g. LFG, could potentially be augmented with logical forms to conduct semantic parsing. People read other related papers that address different problems, such as "Unsupervised Semantic Parsing" by Hoifung Poon and Pedro Domingos, which addresses the problem in an unsupervised setting and handles syntactic variations of the same semantic representation. People also read papers that conduct semantic parsing with machine translation methods. We also discussed the potential beneficiaries of semantic parsing, besides natural language interfaces to databases and so on. Textual entailment was suggested as a possible one.

Thursday, April 28, 2011

Premeeting - Dhananjay

I read Joint Unsupervised Coreference Resolution with Markov Logic by Poon and Domingos (EMNLP'08). The paper claims that unsupervised approaches, though attractive due to the abundance of unlabeled training data, are not explored much as they are more difficult. The method uses Markov logic to do joint inference. The head word is determined by using Stanford parser rules. This gives better precision than just choosing the rightmost word as the head word. Two mentions are clustered together if they have the same head word. Since this doesn't work for pronominal entities, predicates for gender, entity type and number are checked. In addition, apposition and predicate nominals are also incorporated using predicates. I didn't digest the inference of this network.

The results show a 7% increase in the F-measure from the baseline H&K system. However, in the systems where the determination of the head word is similar (choosing the rightmost word), the precision decreases (although we have a good recall).

Since they cluster mentions with the same headword, without using any feature such as distance, some mentions whose head words are common nouns may be incorrectly clustered.
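A toy sketch of the head-word clustering idea, using the rightmost-word heuristic rather than the Stanford parser rules. The function names and the example mentions are made up for illustration; this is not the paper's Markov logic model.

```python
from collections import defaultdict

def rightmost_head(mention):
    """Baseline heuristic: treat the last token of the mention as its head word."""
    return mention.lower().split()[-1]

def cluster_by_head(mentions, head_fn=rightmost_head):
    """Put two mentions in the same cluster whenever their head words match."""
    clusters = defaultdict(list)
    for mention in mentions:
        clusters[head_fn(mention)].append(mention)
    return dict(clusters)

# Hypothetical mentions from one document.
mentions = [
    "the Pittsburgh company",
    "a German company",
    "Barack Obama",
    "President Obama",
]
clusters = cluster_by_head(mentions)
```

Here the two "Obama" mentions are merged correctly, but the two different companies also end up in one cluster, which is the common-noun problem mentioned above.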

Wednesday, April 27, 2011

Pre-meeting - Daniel

Focus paper: Unsupervised Ontology Induction from Text
Related paper: Unsupervised Semantic Parsing

This paper presents the USP system, upon which the system of the focus paper is based. The authors aim to create a fully unsupervised system capable of parsing text to a deep semantic representation. Instead of trying to learn both syntax and semantics at the same time, this system takes syntactic parses as given and induces semantics from them. The first stage in the process is a deterministic mapping from the given dependency tree to a quasi-logical form (QLF): essentially just a first-order logical representation of the dependency tree. Then, the goal of the semantic parsing is to partition these representations into separate lambda forms, and cluster lambda forms that have the same meaning. Training the model is done using Markov Logic Networks, and parsing is done using a greedy search through the space of possible partitions of the QLF.
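As a rough illustration of the first stage, the sketch below turns a dependency tree into a list of first-order atoms: one unary atom per token and one binary atom per dependency edge. The function name, the input format and the example sentence are my own assumptions; the paper's actual conversion may differ in its details.

```python
def dependency_to_qlf(tokens, edges):
    """Map a dependency tree to quasi-logical-form atoms.

    tokens: list of lemmas, numbered from 1 in sentence order.
    edges:  list of (label, head_index, dependent_index) triples.
    """
    # Each token becomes a unary atom over its own node constant.
    atoms = [f"{lemma}(n{i})" for i, lemma in enumerate(tokens, start=1)]
    # Each dependency edge becomes a binary atom between two node constants.
    atoms += [f"{label}(n{head},n{dep})" for label, head, dep in edges]
    return atoms

# Hypothetical parse of "Utah borders Idaho".
tokens = ["Utah", "borders", "Idaho"]
edges = [("nsubj", 2, 1), ("dobj", 2, 3)]
qlf = dependency_to_qlf(tokens, edges)
```

The semantic parser then has to decide how to partition these atoms and which partitions to cluster together.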

Since there is no clear way to evaluate an unsupervised semantic parser intrinsically, the authors apply their system to the task of question answering. They restrict their attention to the biomedical domain and compare to other open-domain question answering systems, which their system outperforms. In addition, they manually inspect the outputs of their system, and claim that the semantic clusters produced tend to be very coherent and that their system is capable of understanding many different types of paraphrases.

Pre-Meeting Alan

Related paper: Open Information Extraction from the Web
Focus paper: Unsupervised Ontology Induction from Text

The related paper I read introduces a new extraction paradigm called Open Information Extraction (OIE). In this paradigm, the system passes over the corpus a single time and is able to extract a sufficient number of relations. The system requires no human input but automatically discovers and stores relations of interest, independent of domain. Open IE operates without knowing the relations a priori, whereas standard IE systems only operate on relations given to them a priori by the user.

As for experiments and evaluation, the paper uses a fully implemented OIE system called TextRunner, which, when compared to the state-of-the-art KnowItAll web extraction system, has a noticeably lower error rate and significantly improved performance while maintaining a similar accuracy rate.

The basic architecture of the system consists of a single pass extractor, which passes over the entire corpus to extract tuples for all possible relations. It uses a part of speech tagger to do this. Each candidate tuple is then sent to a self-supervised learner which classifies it "trustworthy" or otherwise. Lastly there is a redundancy-based assessor that assigns a probability to each retained tuple.

Two big things about TextRunner are its performance and thus scalability. The runtime is constant in the number of relations as opposed to linear. Although it's a little hard to directly compare, TextRunner extracted all relations in about 85 CPU hours on a test run, where KnowItAll took 6.3 hours per relation.
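A quick back-of-the-envelope calculation with the two numbers above shows where constant versus linear runtime starts to matter. It takes the 85 and 6.3 hour figures at face value and assumes KnowItAll's cost grows exactly linearly, which is my simplification given that the two systems are hard to compare directly.

```python
TEXTRUNNER_TOTAL_HOURS = 85.0        # one pass, all relations
KNOWITALL_HOURS_PER_RELATION = 6.3   # cost paid again for every relation

def knowitall_hours(num_relations):
    """Total KnowItAll time under the assumed linear cost model."""
    return KNOWITALL_HOURS_PER_RELATION * num_relations

# Number of relations at which the two systems would cost the same.
break_even = TEXTRUNNER_TOTAL_HOURS / KNOWITALL_HOURS_PER_RELATION
```

Under these assumptions the break-even point is between 13 and 14 relations, so TextRunner comes out ahead as soon as more than a handful of relations are of interest.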

Pre-meeting Post from Weisi Duan

I read the paper “Unsupervised Semantic Parsing” by Hoifung Poon and Pedro Domingos, which appeared in EMNLP 2009. The paper is about learning semantic representations in the unsupervised setting, utilizing Markov Logic. My problem with the paper is that it does not give enough examples to illustrate what is actually going on under the hood of the Markov Logic rules, which are templates of features. My guess about the inference is that, given the QLF and the learned clusters, the QLF are assigned to the clusters. During learning, the inference step is to first find a possible clustering through search, then evaluate the MAP assignment probability for the clustering, and use the assignment for the parameter estimation. The clusters, which are represented as constants, are the things being searched. The paper argues that it handles syntactic variations of the same semantic representation.

About the focus paper, I feel it has the same problem as the related paper: it is not evaluated directly on the task that it claims to solve. It would be better if it were evaluated on some ontology test set. It gives me the feeling that it is just a small extension of USP to make it do IE better.

Pre-meeting Dong

Pre-meeting (Dong Nguyen).
Related paper: Semantic taxonomy induction from heterogenous evidence
Focus paper: Unsupervised Ontology Induction from Text

The related paper focuses on taxonomy induction. Their novel contributions are jointly taking evidence of multiple relationships into account and handling polysemy. They view a taxonomy as a set of relations. In this paper they focus on two relations: hyponyms and cousinhood (i and j are mn-cousins if their closest least common subsumer is within exactly m and n links). They add two constraints related to transitivity and cousinhood to the taxonomy structure.
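To check my understanding of (m,n)-cousinhood, here is a small sketch that finds the least common subsumer of two nodes and counts the links from each node up to it. It assumes every node has a single parent, which is a simplification compared with WordNet, and the toy taxonomy and function names are made up.

```python
def ancestors(node, parent):
    """Return [node, parent(node), ...] following parent links up to the root."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def cousin_degree(i, j, parent):
    """Return (m, n): links from i and from j up to their least common subsumer."""
    depth_from_j = {node: d for d, node in enumerate(ancestors(j, parent))}
    for m, node in enumerate(ancestors(i, parent)):
        if node in depth_from_j:
            return m, depth_from_j[node]
    return None  # no common subsumer

# Hypothetical single-parent taxonomy (child -> hypernym).
parent = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
}
```

With this taxonomy, `cousin_degree("dog", "wolf", parent)` gives `(1, 1)` and `cousin_degree("dog", "cat", parent)` gives `(2, 2)`.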

They define the probability of a taxonomy as the joint probability of its relations. Relations have prior and posterior probability given evidence (for example lexical and syntactic patterns). A greedy local search is used to find a taxonomy.

Experiments are done by extending WordNet. Human judges evaluated a random sample of generated links. They had good performance with 84% precision and a 70% improvement over non-joint algorithms.

For the focus paper, it would be nice if they had evaluated the extracted relations directly instead of using a task based approach (for example manual judges). I'm also not sure if ontology is the right term to use for the kind of structures they extracted.

Sunday, April 24, 2011

Reading for 4/28/11: Poon and Domingos, ACL 2010

Remember:  we will start the meeting at 5pm, not the usual time, to enable attendance at Radha Rao's retirement party.

Author:  Hoifung Poon and Pedro Domingos
Venue:  ACL 2010
Leader:  none
Request:  When you post to the blog, please include:
  1. Your name 
  2. Which focus paper this post relates to
  3. Whether this is the pre-meeting review or the post-meeting summary
  • Leave a comment on this post (non-anonymously) giving the details of the related paper you will read (include a URL), by Tuesday, April 26.
  • Post your commentary (a paragraph) as a new blog post, by Wednesday, April 27.

Saturday, April 23, 2011

Post-Meeting Alan

This week we talked about discourse relations as well as the Penn Tree Bank and the Penn Discourse Tree Bank.

There was discussion on assumptions people have made about the Penn Tree Bank and the surprising variety of genres amongst the material it contains. The focus paper was a little different than most this time, with less theory and not so algorithm-heavy as papers we read before. We talked about how papers in NLP are usually one of these two natures. The paper was very much experiment-based, and we agreed that it probably took a long time to get the right perspective and relation for the dataset.

From the discussion, discourse is very much dependent on other tasks, one of which we pointed out was co-reference resolution. Kevin brought up a few points about intra-sentence relations versus inter-sentence relations and from the results of the paper, there seems to be a lot going on within sentences, which could be pretty useful for doing things such as machine translation.

Other related papers covered topics pretty close to the focus paper. Dong and Weisi read about finding the arguments of discourse connectives, Daniel talked about the RST corpus, a discourse-annotated subset of the Penn Tree Bank built on Rhetorical Structure Theory with an emphasis on high tag reliability, and Dhananjay talked about genre detection using common word frequencies as style markers. We talked about how discourse is usually not a big focus for undergraduates, and we mentioned some interesting research projects that could be done with Wikipedia.

Friday, April 22, 2011

Post meeting comment Dong

We first compared PDTB (Penn Discourse TreeBank) annotation with RST (Rhetorical Structure Theory). We then discussed the Penn Treebank itself. The corpus has mostly been treated as news, and many don't realize that it contains different genres such as poetry. We then discussed some other datasets, such as the Brown dataset, which is a dataset from the 1960s. For the focus paper, the conclusion was that it was data driven, not too theoretical, but it was unclear what the direct applications would be. The type of research (analysis) is very different from most papers found in NLP conferences nowadays. The analysis itself is probably not that much work, but it often requires looking at the data a lot and doing many analyses before coming up with an analysis like this. We also discussed discourse in general, but it still remains somewhat vague.

Thursday, April 21, 2011

Pre-Meeting Summary 4/21

Leader: Alan
Focus paper: Genre Distinctions for Discourse in the Penn Treebank

This week we read Bonnie Webber's paper from ACL 2009. The paper provides genre information about the articles in the Penn Tree Bank. Four different genres were characterized by discourse connectives and their senses, which are manually annotated in the Penn Discourse Tree Bank. The findings concern differences between genres in the senses associated with various discourse connectives, and also differences between genres in the senses of marked and unmarked discourse relations. The paper argues that genre is important and that lexically marked relations are not a good model for automatic sense labeling of non-lexically marked discourse relations.

I read "Automatic sense prediction for implicit discourse relations in text" by Pitler et al. from ACL-IJCNLP, Singapore 2009. It talks about using word-level features for automatic sense prediction of implicit discourse relations. Previously used word-pair features are examined and their weaknesses are evaluated. Three classifiers and four classification tasks are run as an experiment, and f-scores show improvement over a baseline that assigns classes at random in proportion to the true distribution.

Dhananjay read "Text genre detection using common word frequencies" from Stamatatos et al. COLING 2000. The paper introduces a method to do genre classification using common word frequencies as style markers. Continuing the work of Burrows (1987), the authors remove most of the restrictions that were present then (expanding contractions, etc.). They used the BNC to extract frequency lists, and discriminant analysis to do classification. The error rate is decreased from 6.25% to 2.5%, but the comparison is made for only the 30 most frequent words as opposed to 55 before.

Daniel read "Building a discourse-tagged corpus in the framework of rhetorical structure theory" Carlson et al. from SIGdial 2002. The paper talks about constructing a discourse corpus called the RST corpus that aims for high reliability of tags. Based on the Rhetorical Structure Theory framework, it consists of labeling trees of sentence segments called elementary discourse units. The inter-annotator agreement was tracked and kappa scores were consistently high. The published corpus is composed of 176,000 words from Penn Tree Bank articles.

Both Dong and Weisi read papers that address the problem of identifying the arguments of discourse connectives.

Dong read "Discourse connective argument Identification with Connective Specific Rankers" by Elwell and Baldridge from ICSC '08. The authors of this paper attempt to model discourse connectives individually, in contrast to previous research that only considers a single classifier. More global models are used for interpolation because less data is available for each specific connective. The evaluation used a maximum entropy ranker and the data was from the Penn Discourse Tree Bank. Overall, the interpolation improved results.

Weisi read "Automatically identifying the arguments of discourse connectives" from Wellner and Pustejovsky in EMNLP 2007. The problem is: given a connective, what should the discourse segments be? They use a log linear model along with the head finding algorithm from Collins 1999. The two models, one for each argument, are also wrapped up in Collins' perceptron. The evaluation consists of comparing the predicted arguments to those in the Penn Discourse Tree Bank. Results improve on previous models.

Pre-Meeting Alan

Pre-meeting Alan (leader)
Focus paper: Genre Distinctions for Discourse in the Penn Treebank
Related paper: Automatic sense prediction for implicit discourse relations in text

The related paper I read focused on automatically identifying the sense of implicit discourse relations. Here they only focus on the most general senses of comparison, contingency, temporal, and expansion.

A significant portion of the paper is spent on various word pair features and how they can help determine information about the discourse relation between two text spans. They analyze prior work using word pair features and identify its shortcomings in trying to capture semantic oppositions. They then use their own features including polarity tags, verb classes, modality, context, language-model-based probabilities (WSJ-LM), etc. They use various sections of the Penn Discourse Treebank for training and testing, and run four different binary classification tasks to identify each relation. The classifiers included Naive Bayes, Max Ent. and AdaBoost, implemented in MALLET.

As a metric for evaluation they use f-score for distinguishing a single sense versus something that is not that sense (other). The baseline is a random assignment of classes in proportion to the true distribution in the test set. The largest gain is in the Contingency prediction task, using the combination of polarity, verb information, first and last words, modality, and context.
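To get a feel for how strong this baseline is, here is a small sketch that computes the f-score from expected counts when "this sense" is guessed at random with the same probability as its true proportion `p`. Using expected counts instead of averaging over many random runs is my own shortcut, and the function names are made up.

```python
def f_score(tp, fp, fn):
    """Balanced F-measure from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def expected_random_baseline_f(p):
    """F-score from expected counts (per instance) for a sense with true proportion p.

    The baseline labels an instance with the sense with probability p,
    independently of the instance's true label.
    """
    tp = p * p          # truly the sense and guessed as the sense
    fp = (1 - p) * p    # not the sense but guessed as the sense
    fn = p * (1 - p)    # truly the sense but guessed as "other"
    return f_score(tp, fp, fn)
```

Precision and recall both come out equal to `p`, so the baseline f-score is roughly the proportion of the sense itself. This means rare senses have a very low baseline to beat.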

Wednesday, April 20, 2011

Premeeting post - Dhananjay

I read the paper - Text Genre Detection Using Common Word Frequencies by E. STAMATATOS, N. FAKOTAKIS, and G.KOKKINAKIS (http://www.aclweb.org/anthology/C/C00/C00-2117.pdf). The paper presents a simple method to classify documents into genres by using the frequencies of occurrence of the most frequent words as style markers. The idea builds on a paper by Burrows (1987) which uses similar style markers with additional steps such as expansion of I'm to I am; separation of common homographic forms (e.g. to as infinitive and preposition), proper names and text sampling. The method proposed in this paper removes these restrictions. Also, instead of using the training corpus for extracting the list of most frequent words, the BNC is used. They use discriminant analysis to perform classification. In the results section, they report a decrease in the error rate (2.5%) as compared to the Burrows method (6.25%). The comparison is however made for the 30 most frequent words and the 55 most frequent words respectively, which are the minima for the two methods. After 40-50 words, the Burrows method has a lower error rate. The authors claim that the decrease in performance is due to overfitting. The paper also shows that using punctuation reduces the error rate.

Pre-meeting - Daniel

Focus paper: Genre Distinctions for Discourse in the Penn Treebank
Related paper: Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. Building a discourse-tagged corpus in the framework of rhetorical structure theory. SIGdial 2002.

This paper describes the construction of a discourse corpus, the RST corpus. The goals of the authors are to create a fairly large corpus with annotations firmly grounded in a theory of discourse that has a high tag reliability. The corpus is based on the Rhetorical Structure Theory framework, although the authors do not really describe what that is. In order to tag a passage, each sentence is broken up into Elementary Discourse Units, which roughly correspond to clauses. Then, all of the EDUs are assembled into a labeled tree. Each node in the tree is labeled with a discourse relation, drawn from a shallow hierarchy consisting of 16 classes and 78 fine-grained labels. A small set of well trained annotators was employed over a fairly long period of time in order to give reliable output. The authors kept track of inter-annotator agreement throughout the process, and found consistently high kappa scores. The published corpus consists of articles from the PTB totalling 176,000 words.

Tuesday, April 19, 2011

Pre-meeting Post from Weisi Duan

I have read the paper “Automatically Identifying the Arguments of Discourse Connectives” by Ben Wellner and James Pustejovsky, which appeared in EMNLP 2007. The paper is targeted at the problem of identifying discourse segments and whether a relation exists between two identified segments. The authors have formulated the problem as: given a connective, what should the discourse segments be? (The relation seems to already exist, since the connective is given as input.) The method employs a log linear model that wraps up features between the connective and the possible argument (the lexical head of the discourse segment). The arguments are identified through the head finding algorithm (Collins, 1999). The training method is not well described, as the authors claim “the correct candidate receives 100% probability mass and the wrong ones receive 0”, which sounds like Collins’ perceptron and definitely not max entropy. The authors conducted some ablation of the features and discovered that the dependency parse features are better than constituent parse features. To utilize the features between the two arguments that one connective could have, the two models (one for each argument of the connective) are further wrapped up in Collins’ perceptron. The results turned out to be better than those of the previous models. The evaluation is conducted by comparing the predicted arguments (lexical heads) to the ones in the Penn Discourse Tree Bank. I like the paper, but the formulation of the problem is a little sloppy in that the mapping from argument to discourse segment is not one-to-one, which means finding the argument does not necessarily lead to the correct discourse segment. If finding the argument is their whole goal, they should probably treat the different segments as hidden variables and sum them out.

Pre-meeting Dong

Pre-meeting (Dong Nguyen).
Related paper: Discourse Connective Argument Identification with Connective Specific Rankers
Focus paper: Genre Distinctions for Discourse in the Penn Treebank

The goal of the related paper is to automatically identify the arguments of discourse connectives ('and', 'however' etc). Previous research looked at training a single classifier for this task. The authors in this paper argue that connectives differ, and thus to be effective, it is better to model the individual connectives.

Their approach trains models for specific connectives, but to overcome the smaller amount of training data available for each connective individually, they interpolate these with more global models. Specifically, they interpolated with a model trained over all connectives together and with models trained over connective types (they divided all connectives into one of three types). They used a maximum entropy ranker and the Penn Discourse Treebank dataset. Results improved when using the interpolation model.
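One way to picture the interpolation is as a weighted mixture of the scores that the three models give to each candidate argument. The sketch below assumes a simple linear mixture with fixed weights; I am not sure this is exactly how the paper combines the models, and the weights, scores and candidate heads are all made up.

```python
def interpolated_score(model_scores, weights):
    """Weighted sum of the scores from the individual models."""
    return sum(weights[name] * model_scores[name] for name in weights)

def rank_candidates(candidates, weights):
    """Sort candidate arguments from best to worst interpolated score."""
    return sorted(
        candidates,
        key=lambda c: interpolated_score(c["scores"], weights),
        reverse=True,
    )

# Hypothetical interpolation weights for the three levels of model.
weights = {"connective": 0.5, "type": 0.3, "all": 0.2}

# Hypothetical candidate argument heads with a score from each model.
candidates = [
    {"head": "said", "scores": {"connective": 0.2, "type": 0.7, "all": 0.8}},
    {"head": "rose", "scores": {"connective": 0.9, "type": 0.4, "all": 0.3}},
]
ranked = rank_candidates(candidates, weights)
```

With these numbers "rose" is ranked first (0.63 against 0.47), because the connective-specific model is trusted most, while the global models still contribute when the specific model has seen little data.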

In addition to this new interpolation approach, they introduced new features based on syntax, morphology and relations with other connectives.

Overall, I think it was a nice paper. Good idea and nice evaluation.

Friday, April 15, 2011

Post-meeting - Daniel

Focus paper: Poetic Statistical Machine Translation: Rhyme and Meter

This week we discussed how standard machine translation techniques can be adapted in order to give output in metered, rhymed, or otherwise constrained form. There were two papers directly on this subject: the focus paper and the paper Dong and I read. We decided that the focus paper did not use particularly interesting modelling, nor did it really demonstrate that its results were interesting. The other paper also suffered from a lack of good evaluation, but it used a richer model that made significantly fewer assumptions about the problem.
Although the task of translating into metered verse is not a particularly useful task on its own, there are potentially other similar tasks that could be quite useful. In our discussion, we thought about translating between technical and non-technical text, producing more easily memorizable text, and applications to marketing. We also had a brief discussion about the history of phrase-based MT, but did not really conclude anything.

Thursday, April 14, 2011

Reading for 4/22/11: Webber, 2009

Author:  Bonnie Webber
Venue:  ACL 2009
Leader:  Alan Zhu
Request:  When you post to the blog, please include:
  1. Your name 
  2. Which focus paper this post relates to
  3. Whether this is the pre-meeting review or the post-meeting summary
  • Leave a comment on this post (non-anonymously) giving the details of the related paper you will read (include a URL), by Monday, April 18.
  • Post your commentary (a paragraph) as a new blog post, by Wednesday, April 21.

Pre-meeting summary (Dong - leader)

Pre-meeting Summary (Dong Nguyen - leader)
Focus paper: Poetic Statistical Machine Translation: Rhyme and Meter

The focus paper constrained the search for translations with meter, rhyme and length constraints. Instead of post hoc filtering for translations that fit the rhyme/meter constraints, they incorporate the constraints as feature functions during the search for translations. One problem, which was also observed in the related papers, is that it's not clear how to evaluate poetry generation/translation.

Most of the related papers read focused on Rhyme and Poetry in NLP. In addition, one paper was on machine translation.

Daniel and Dong read Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation, Greene et al. EMNLP 2010. The paper only deals with sonnets, a poetry form with fairly strict iambic pentameter. Stress patterns of each word are modeled using Finite State Transducers (FST) and EM is used to learn weights. The main problem with this model is that the pronunciation probabilities are context independent, which makes the processing of single syllable words difficult. The learned word stress patterns are used for poetry generation and translation. However, in both cases they didn't perform a quantitative evaluation.

Dhananjay read the paper Poetry Generation in COLIBRI, Diaz-Agudo et al. in Advances in Case-Based Reasoning 2002. The paper uses a case based reasoning approach to poetry generation. The approach is built on the idea of taking an existing poem and adapting it to the current scenario by replacing words. The method is divided into three parts: retrieval, adaptation and revision.

Alan read Using an on-line dictionary to find rhyming words and pronunciation for unknown words, Roy J. Byrd and Martin Chodorow, ACL 1985. This is an old paper from 1985 and focuses on implementing a system that finds rhymes and determines how to pronounce words the system has not seen before. The authors take a pronunciation-based approach to identify rhyming words. To pronounce unknown input words, they try to find overlapping substrings in the input word that are already present in the words in the dictionary. Because of the lack of hard data or experiments, it's hard to evaluate the system, but this might be caused by the technological limitations of that time.

Weisi read Discriminative training and maximum entropy models for statistical machine translation by Och and Ney, ACL 2002.
The paper frames the machine translation source channel into the log-linear model, which makes adding features easier. Training is done using generalized iterative scaling.

Pre-meeting Post from Weisi Duan

I have read the paper “Discriminative Training and Maximum Entropy Models for Statistical Machine Translation” by Franz Josef Och and Hermann Ney, which appeared in ACL 2002. The paper frames the machine translation source channel model in the log-linear framework and thus makes adding features to the model easier. More specifically, all the probabilistic components in the source channel model can be used as features and combined with other features that come from different information sources. The training is done through generalized iterative scaling (GIS). The inference in GIS seems to be done in MAP, such that all probability mass given a source sentence is allocated to the hypothesis that is closest to a possible gold-standard sentence. The inference is done with dynamic programming, with an n-best list for global features. The evaluation includes 7 different measures (SER, WER, PER, mWER, BLEU, SSER, IER), and as more distinct features are added, the performance improves in terms of the measures.
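A minimal sketch of the log-linear idea: each hypothesis gets a weighted sum of feature values, and the source channel model falls out as the special case where the only features are the log language model and log translation model probabilities, both with weight 1. The feature names, values and weights below are invented for illustration, and the sketch only scores a fixed list of hypotheses rather than doing a real search.

```python
import math

def loglinear_score(features, weights):
    """Weighted sum of feature values; features without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def best_hypothesis(hypotheses, weights):
    """Pick the translation hypothesis with the highest log-linear score."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))

def posteriors(hypotheses, weights):
    """Normalize exponentiated scores over the given hypothesis list."""
    scores = [loglinear_score(h["features"], weights) for h in hypotheses]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

# Hypothetical hypotheses for one source sentence.
hypotheses = [
    {"text": "the house is small",
     "features": {"log_lm": -3.0, "log_tm": -2.0, "length": 4}},
    {"text": "small the house is",
     "features": {"log_lm": -4.0, "log_tm": -1.5, "length": 4}},
]

# Source channel special case: only the two model components, each with weight 1.
source_channel_weights = {"log_lm": 1.0, "log_tm": 1.0}
best = best_hypothesis(hypotheses, source_channel_weights)
```

Adding a new information source then just means adding another entry to each feature dictionary and learning a weight for it.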

Pre-meeting - Daniel

Focus paper: Poetic Statistical Machine Translation: Rhyme and Meter
Related Paper: Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation

This paper presents a generative model of rhythm in poetry, and shows how it can be applied to the task of poetry translation. The paper only deals with sonnets, a poetry form using fairly strict iambic pentameter. For the model of rhythm, two elements are combined: the CMU pronunciation dictionary and an FST-based model of stress. The FST model learns pronunciation patterns of each word by constraining each line to have one of four standard stress patterns, and using EM to learn weights of word-specific pronunciation FSTs. For poetry generation, the authors mention that the FST model is augmented with the CMU dictionary, but do not discuss how. On the task of assigning stresses to words on held-out data, the FST method achieves 94% per word accuracy and 81% per line. The authors note that the main problem with this model is that the lexical pronunciation probabilities learned are context independent, which makes the processing of single syllable words difficult.
For generation, the authors use a language-model based method, combined with the pronunciation model and a simplified model of rhyme. This model is not evaluated directly, but produces at least amusing poems.
For translation, the pronunciation model is applied as part of the language model in a PBMT system. The authors do not do any sort of quantitative evaluation, due to the intrinsic difficulty of evaluating poetry automatically. The authors find that when testing on training data, the resulting poems tend to more closely resemble human translations than the outputs of a pure PBMT system, but when testing on test data, the system often fails to produce output.

Pre-Meeting Alan

Related paper: Using an on-line dictionary to find rhyming words and pronunciation for unknown words
Focus paper: Poetic Statistical Machine Translation: Rhyme and Meter

The related paper I read this week was a little dated, being from ACL 1985. But it was a pretty interesting read, detailing a computer system WordSmith that was being built by IBM at the time. WordSmith is a multi-dimensional dictionary - that is, given an input word, it can display words with similar pronunciation, words that are likely to rhyme with the input word, and words with similar end-spellings. It can also attempt to generate pronunciations of words that do not exist in its dictionary. The main focus of the paper is on the methodology for implementing the system that finds rhymes, and also the system that determines how to pronounce words that the system has not seen before.

Instead of a spelling-based algorithm to identify rhyming words, they use a pronunciation-based approach. There is a three-part encoding scheme that starts off with mapping pronunciation symbols to single-byte codes that represent phonetic segments. The second part of the encoding scheme arranges the word segments in order of importance for determining rhyme. Lastly, the rearranged segments are reversed, grouped based on the position of the primary-stress syllable, and sorted (according to the reversed order). I will skip over the details here, but the only real limitation of their approach is the fundamental disagreement among people about which rhymes are good and which are not so good.
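
To make the "reverse and sort" idea above concrete, here is a minimal sketch. It does not reproduce WordSmith's byte encoding; it assumes ARPAbet-style phoneme lists with stress digits (as in the CMU dictionary), and the names rhyme_key, rhyme_groups and the toy LEXICON are made up for illustration.

```python
def rhyme_key(phones):
    """Phonemes from the primary-stressed vowel to the end, reversed."""
    # Locate the last vowel carrying primary stress (digit 1).
    stressed = [i for i, p in enumerate(phones) if p.endswith("1")]
    start = stressed[-1] if stressed else 0
    # Drop the stress digits so that only the segments are compared.
    tail = [p.rstrip("012") for p in phones[start:]]
    return tuple(reversed(tail))


# Toy lexicon (hypothetical entries, ARPAbet with stress digits).
LEXICON = {
    "nation": ["N", "EY1", "SH", "AH0", "N"],
    "station": ["S", "T", "EY1", "SH", "AH0", "N"],
    "light": ["L", "AY1", "T"],
    "night": ["N", "AY1", "T"],
    "delight": ["D", "IH0", "L", "AY1", "T"],
}


def rhyme_groups(lexicon):
    """Group words whose reversed stressed tails are identical."""
    groups = {}
    for word in sorted(lexicon):
        groups.setdefault(rhyme_key(lexicon[word]), []).append(word)
    return groups


groups = rhyme_groups(LEXICON)
print(groups[rhyme_key(LEXICON["night"])])
```

Sorting a whole dictionary by this reversed key would place candidate rhymes next to each other, which is roughly the effect the paper's third encoding step is after.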

To try and pronounce an unknown input word, the basic approach is to find overlapping substrings in the input word that are present in words already in the dictionary. Basically it's like a probabilistic version of minimum edit distance. The substrings are chosen greedily in the sense that the chosen substrings are the longest that can be matched in the dictionary file.
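
The greedy longest-match part can be sketched as follows. This is a simplification under my own assumptions: the chunks here do not overlap, and the chunk table CHUNKS is a hand-made toy rather than substrings harvested from a real dictionary file. The function name greedy_pronounce is hypothetical.

```python
def greedy_pronounce(word, chunk_table):
    """Cover the word left to right, always taking the longest known chunk."""
    i, phones = 0, []
    while i < len(word):
        # Try the longest remaining substring first, then shorter ones.
        for j in range(len(word), i, -1):
            chunk = word[i:j]
            if chunk in chunk_table:
                phones.extend(chunk_table[chunk])
                i = j
                break
        else:
            raise KeyError(f"no known chunk covers {word[i:]!r}")
    return phones


# Toy chunk-to-phoneme table (assumed, not taken from the paper).
CHUNKS = {
    "b": ["B"],
    "l": ["L"],
    "br": ["B", "R"],
    "ight": ["AY", "T"],
}

print(greedy_pronounce("blight", CHUNKS))
print(greedy_pronounce("bright", CHUNKS))
```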

Most of the ideas introduced by the paper are interesting and seem like they would do well in a full-fledged system such as the one described. There isn't any hard data or experiments to show the progress of the system, which may have been just a technological limitation at the time. But the paper definitely brings up some psycholinguistic questions about how words can be and are represented.

Wednesday, April 13, 2011

Premeeting - Dhananjay

Related paper - Poetry Generation in COLIBRI

The paper discusses using a case-based reasoning approach to poetry generation. Initially, the paper gives an overview of the system - COLIBRI and the ontology - CBROnto. It frames the task as follows - (1) Take an existing poem, (2) Adapt it to the current scenario by replacing words. The problem is divided into three parts - (1) Retrieval, (2) Adaptation and (3) Revision. Retrieval consists of extracting cases of the poem based on its structure. It then constructs a query as a sequence of words that the poem would be based upon. It selects the cases with the largest number of POS tags in common with the query. In the adaptation phase, it substitutes words from the query with the new words. The revision phase evaluates and repairs the proposed solution. Features such as rhyme and meter are considered in this phase. For me, poetry generation is a misnomer for the process. This work probably lays down certain ideas of template-based poetry generation.
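
The retrieval criterion (most POS tags in common with the query) could look roughly like this. It is only a sketch: I am assuming the overlap is counted as a multiset intersection, and the function names and the toy cases are hypothetical, not COLIBRI's API.

```python
from collections import Counter


def pos_overlap(query_tags, case_tags):
    """Number of POS tags shared by query and case, counted with multiplicity."""
    return sum((Counter(query_tags) & Counter(case_tags)).values())


def retrieve(query_tags, cases):
    """Return the name of the stored case with the largest tag overlap."""
    return max(cases, key=lambda name: pos_overlap(query_tags, cases[name]))


# Hypothetical stored cases: one POS sequence per poem line.
cases = {
    "line_a": ["DT", "NN", "VBZ", "DT", "NN"],
    "line_b": ["PRP", "VBP", "RB"],
}
print(retrieve(["DT", "NN", "VBZ", "JJ"], cases))
```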

Pre-meeting Dong

Pre-meeting (Dong Nguyen).
Related paper: Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation.
Focus paper: Poetic Statistical Machine Translation: Rhyme and Meter

The related paper was on analysis of rhythmic poetry, with applications to generation and translation. While the focus paper briefly touches on the issue of finding which syllables in a word are stressed and resorts to the CMU Dictionary, the related paper argues that the CMU dictionary is not sufficient. The meter for the text is known beforehand, thus the focus is to map a word to syllable-stress patterns. They propose to use Finite State Transducers to convert English words to stress patterns. Every word initially has transitions to all stress pattern subsequences of lengths 1 to 4. EM training is then used to train the model. Their data is augmented with external data to get more data for training, and they allow common alternative patterns to occur.
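
As a small illustration of the initialization step, the candidate stress patterns for a word can be enumerated like this. This is my reading of the summary above, not the authors' code: I assume the subsequences are contiguous slices of the known meter, with strict iambic pentameter written as 0101010101 (0 = unstressed, 1 = stressed), and that EM starts from uniform weights. The function names are hypothetical.

```python
def stress_subsequences(meter, max_len=4):
    """All distinct contiguous slices of the meter with length 1..max_len."""
    found = {
        meter[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(meter) - n + 1)
    }
    return sorted(found, key=lambda s: (len(s), s))


def uniform_init(meter, max_len=4):
    """Uniform starting weights over candidate patterns, before any EM step."""
    patterns = stress_subsequences(meter, max_len)
    return {p: 1.0 / len(patterns) for p in patterns}


IAMBIC_PENTAMETER = "0101010101"
print(stress_subsequences(IAMBIC_PENTAMETER))
```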

They also discuss possible applications such as poetry generation. However, they merely show an example of a produced text, and don't do a quantitative evaluation.

For translation, they use a cascade of weighted FSTs. Again, the evaluation is not quantitative. Furthermore, the experimental setup was probably somewhat unusual, in that the text being translated was already in the training set. On held-out data the system had much more difficulty.

Friday, April 8, 2011

Post-meeting 4/8 - Alan

Post-Meeting Commentary
Focus Paper: Discriminative Modeling of Extraction Sets for Machine Translation

Yesterday during the seminar we talked about a good variety of topics relevant to machine translation. First we talked about Inversion Transduction Grammars (ITG) and how they work, namely the permutation possibilities they can generate, and how they help solve the word alignment problem which is a fundamental component in MT. Dong read a paper about statistical phrase-based translation which led to discussion on some of the earlier papers in MT.
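
For the permutation question, a quick way to see what a binary ITG can and cannot reorder is the recursive check below. It is a sketch under the usual formulation (a span is split into two adjacent sub-spans that are kept in order or swapped); the function name itg_reachable is made up. The two length-4 permutations it rejects, 2413 and 3142, are the well-known "inside-out" cases.

```python
from itertools import permutations


def itg_reachable(perm):
    """True if the permutation can be built by nested straight/inverted splits."""
    perm = tuple(perm)
    if len(perm) <= 1:
        return True
    for k in range(1, len(perm)):
        left, right = perm[:k], perm[k:]
        # Straight rule: left block entirely below right block.
        # Inverted rule: left block entirely above right block.
        if max(left) < min(right) or min(left) > max(right):
            if itg_reachable(left) and itg_reachable(right):
                return True
    return False


reachable_4 = [p for p in permutations(range(1, 5)) if itg_reachable(p)]
print(len(reachable_4), "of 24 permutations of length 4 are reachable")
```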

Another big component in the discussion was supervised versus unsupervised word alignment models, and I feel like in general there were more useful things said about unsupervised approaches, one of the reasons being there may be too many dependencies in a system if a supervised approach is used; unsupervised approaches by nature seem more flexible. Daniel and I discussed the paper we read, which provided an unsupervised model to improve alignment for tree transducer-based syntactic machine translation models. I found the idea of a Markov walk along a parse tree to be pretty cool and there was a nice drop in AER, but the authors could not say anything about the fact that it improves translation since there was no incorporation into a full system. Dhananjay read a paper on MIRA which led to some interesting talk about multi-class labeling problems and structure prediction problems. Chris also brought up some points about the nice DP method used in the focus paper.

At the end we talked about several things, including why the paper Daniel and I read might not have experimented with a full system, and this led to brief discussions about what code is open-source and why things may be difficult to share amongst the research community. We also discussed some limitations of the focus paper. A major one that Kevin noticed was that sentences past length 40 were ignored, which accounts for roughly 20% of all sentences in news articles, namely the Newswire data used in training and evaluation. This is kind of unsettling, since sentence stability and BLEU scores correlate significantly with length, and so limiting the data used may suggest something about the performance of the model on the data left out.

Overall I enjoyed the discussion, as my purpose for taking the class is to learn about some of the more relevant and complicated problems in NLP. I may have skipped over some conversation topics, so feel free to comment - I am adding them as I remember.


Thursday, April 7, 2011

Reading for 4/14/11: Genzel et al., EMNLP 2010

Author:  Dmitriy Genzel, Jakob Uszkoreit, and Franz Och
Venue:  EMNLP 2010
Leader:  Dong Nguyen
Request:  When you post to the blog, please include:
  1. Your name 
  2. Which focus paper this post relates to
  3. Whether this is the pre-meeting review or the post-meeting summary
  • Leave a comment on this post (non-anonymously) giving the details of the related paper you will read (include a URL), by Monday, April 11.
  • Post your commentary (a paragraph) as a new blog post, by Wednesday, April 13.

Pre-meeting summary 4/7 - Dhananjay (Leader)

This week's focus paper tries to extract phrasal translation rules for MT. The alignment is modeled as a multiclass labeling problem. The labels are ITG alignments with a constraint over the span (3 in this paper). The feature space is constructed by exploring the space of links (sure links if available). The multiclass labeling problem is resolved by using MIRA.

For the related paper, Alan and Daniel read an unsupervised word alignment method by the same authors - Tailoring Word Alignments to Syntactic Machine Translation. The innovative step that they take is to generate a parse tree for the target language, and use a syntax-sensitive distortion component that conditions on the tree. The idea is that these trees can alter the probabilities of transitions between alignment positions so that distortions which respect tree structure can be preferred. The model is trained using plain EM. The model drastically reduces the number of alignments that cross constituencies, but only mildly improves alignment scores, generally preferring recall over precision.

Weisi and Dong read Statistical Phrase-Based Translation. It presents a translation model and decoder and compares different ways to build phrase translation tables. Their decoder uses a beam search algorithm. The search involves selecting a sequence of untranslated foreign words and an English phrase, and updating the hypothesis cost. An important observation by the authors was that while phrases help during translation, syntax-based phrases do not.

I read a paper that describes MIRA (margin infused relaxed algorithm) that is used for the multiclass labeling problem in the focus paper. A prototype vector is developed for each label, and a similarity score is calculated. The instance is assigned the label with highest similarity score. The training phase involves updating this prototype vector space.

Pre-meeting comment - Dhananjay

Related paper - Ultraconservative Online Algorithms for Multiclass Problems.
Koby Crammer and Yoram Singer, JMLR 2003

The paper proposes a class of algorithms for multiclass classification. The intuition is to maintain one prototype vector per class. Given an input instance, the algorithm computes a similarity score. The instance is assigned the class with the highest similarity. The prototypes are iteratively developed as a part of training. These algorithms are classified as ultraconservative algorithms in the sense that only those prototypes are updated for which the similarity scores are greater than the correct label.

Briefly, for each instance that is incorrectly classified, an error set is constructed, consisting of the labels whose similarity score is greater than that of the correct label. Each prototype corresponding to a label in the error set is reduced by a factor of the input instance. This is equivalent to moving the prototype vector away from the input vector (making it more dissimilar). The prototype of the correct label is aligned more towards the input vector by adding the instance. All the factors are chosen such that they add up to 0.
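
A minimal sketch of one such update, to make the error set and the "factors add up to 0" constraint concrete. This uses the simplest choice of factors (the correct label gets +1 and the penalty is split uniformly over the error set); it is not the MIRA update itself, which picks the factors by solving a small optimization problem. Ties count as errors here so that the all-zero starting point triggers an update. All names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def predict(prototypes, x):
    """Label whose prototype has the highest similarity with x."""
    scores = [dot(m, x) for m in prototypes]
    return max(range(len(scores)), key=scores.__getitem__)


def ultraconservative_update(prototypes, x, y):
    """Update prototypes in place; return the factors (tau) that were used."""
    scores = [dot(m, x) for m in prototypes]
    # Error set: wrong labels scoring at least as high as the correct one.
    error_set = [r for r, s in enumerate(scores) if r != y and s >= scores[y]]
    taus = [0.0] * len(prototypes)
    if not error_set:
        return taus  # correctly classified: nothing changes
    taus[y] = 1.0
    for r in error_set:
        taus[r] = -1.0 / len(error_set)
    # Only the correct prototype and the error-set prototypes move.
    for r, tau in enumerate(taus):
        if tau:
            prototypes[r] = [m + tau * xi for m, xi in zip(prototypes[r], x)]
    return taus


prototypes = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
taus = ultraconservative_update(prototypes, [1.0, 2.0], y=1)
print(taus, predict(prototypes, [1.0, 2.0]))
```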

Using this setup, the authors go on to show a class of additive algorithms and multiplicative algorithms. For each step, an optimization problem is constructed such that the norm of the prototype vector matrix is minimized. The variables are the factors by which the prototype vectors are adjusted.

Wednesday, April 6, 2011

Pre-Meeting 4/6 - Alan

Supplemental Paper: Tailoring Word Alignments to Syntactic Machine Translation
Focus Paper: Discriminative Modeling of Extraction Sets for Machine Translation

The supplemental paper I read was by the same authors as the focus paper. The purpose of the paper was to address the problems caused by word alignment errors in syntactic MT systems that extract tree transducer rules. When word alignments of a sentence pair violate the constituent structure of the target sentence, the size of the minimal translational units increases. Units that span larger segments have poor ability to generalize and thus lead to the blocking of many rules that may be present.

An unsupervised word alignment model is presented that is an extension of both the HMM model of Ney and Vogel (1996) and of the system described by Galley et al. (2006). The innovative step that they take is to generate a parse tree for the target language, and use a syntax-sensitive distortion component that conditions on the tree. The idea is that these trees can alter the probabilities of transitions between alignment positions so that distortions which respect tree structure can be preferred. The regular HMM alignment model only uses string distance for its distortion model, where a key aspect of this paper is to now use a new kind of shortest path between two positions defined by a first degree Markov walk through the tree that consists of popping up from a leaf, moving to different branches, and pushing down to a new leaf, all done probabilistically.
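
To illustrate how a tree path differs from string distance, here is a deterministic sketch that counts the pops and pushes on the path between two leaves. The paper's walk is probabilistic; this only shows the path shape it is defined over. The tree, the parent-pointer representation, and the function names are my own assumptions.

```python
def path_to_root(node, parent):
    """Nodes from the given node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_walk(src, dst, parent):
    """(pops, pushes) needed to go from leaf src to leaf dst through the tree."""
    up = path_to_root(src, parent)
    down = path_to_root(dst, parent)
    ancestors = set(up)
    # Lowest common ancestor: first node above dst that is also above src.
    lca = next(n for n in down if n in ancestors)
    return up.index(lca), down.index(lca)


# Toy parse of "the dog barks": S -> NP VP, NP -> the dog, VP -> barks.
PARENT = {"the": "NP", "dog": "NP", "barks": "VP", "NP": "S", "VP": "S"}
print(tree_walk("the", "dog", PARENT))
print(tree_walk("dog", "barks", PARENT))
```

Both jumps have string distance 1, but the second one crosses a constituent boundary and needs a longer path through the tree, which is the kind of distinction a syntax-sensitive distortion model can exploit.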

Training of the model is done using general EM. The performance metric used is the standard alignment error rate (AER) metric. Evaluation is done on both French-English and Chinese-English manually aligned data sets. Their model succeeds in drastically reducing AER, meaning that the number of alignments that violate constituent structure is significantly reduced. However, the big question of whether this improves actual translation as a whole is still left in the open as there was no incorporation into a full MT system.
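
For reference, AER as it is usually defined over sure links S, possible links P (with S a subset of P) and predicted links A is 1 - (|A∩S| + |A∩P|) / (|A| + |S|). A small sketch, with a made-up toy example (the link sets below are not from the paper):

```python
def aer(predicted, sure, possible):
    """Alignment error rate; links are (source_index, target_index) pairs."""
    a, s = set(predicted), set(sure)
    p = set(possible) | s  # sure links always count as possible
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


sure = {(0, 0), (1, 1)}
possible = {(2, 1)}
predicted = {(0, 0), (2, 1), (2, 2)}
print(round(aer(predicted, sure, possible), 3))
```

Lower is better, and predicting exactly the sure links gives an AER of 0.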


Pre-meeting comment - Daniel

Focus Paper: Discriminative Modeling of Extraction Sets for Machine Translation
Related Paper: Tailoring Word Alignments to Syntactic Machine Translation

This paper presents an unsupervised word alignment method that aims to create alignments that are beneficial specifically to tree transducer-based syntactic machine translation models. The problem with standard IBM model 4 alignments is that alignments frequently cross constituent boundaries, which prevents rules from being extracted, leading to poorer performance. Roughly, the method uses target side trees to softly prefer alignments which do not violate constituents. The method is a modification of the HMM alignment method, which takes into account a weighted tree distance instead of string distance in the distortion model. The model is trained using plain EM. The model drastically reduces the number of alignments that cross constituencies, but only mildly improves alignment scores, generally preferring recall over precision. Unfortunately, the authors do not show the effects of the model when plugged into an end-to-end system, so it is difficult to say whether or not the method is actually beneficial.

Pre-meeting Dong

Pre-meeting (Dong Nguyen).
Related paper: Statistical Phrase-Based Translation
Focus paper: Discriminative Modeling of Extraction Sets for Machine Translation

The related paper presents a translation model and decoder and compares different ways to build phrase translation tables. Their decoder uses a beam search algorithm. The search involves selecting a sequence of untranslated foreign words and an English phrase, and updating the hypothesis cost.

They experimented with three different methods to build phrase translation tables:
* Phrases from Word Based alignments (Giza++)
* Taking syntactic phrases into account
* Joint phrase model
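
The first method in the list above (phrases from word alignments) rests on extracting phrase pairs that are consistent with the alignment. Below is a simplified sketch of that consistency check. It assumes 0-based (source, target) alignment links, takes only the tightest target span for each source span (it does not extend over unaligned target words, which a fuller implementation would), and uses a hypothetical function name.

```python
def extract_phrases(n_src, alignment, max_len=3):
    """Phrase pairs ((s1, s2), (t1, t2)) consistent with the word alignment."""
    pairs = set()
    for s1 in range(n_src):
        for s2 in range(s1, min(s1 + max_len, n_src)):
            tgt = [t for s, t in alignment if s1 <= s <= s2]
            if not tgt:
                continue  # source span has no aligned words
            t1, t2 = min(tgt), max(tgt)
            if t2 - t1 + 1 > max_len:
                continue
            # Consistent: no target word inside the span links outside it.
            if any(t1 <= t <= t2 and not s1 <= s <= s2 for s, t in alignment):
                continue
            pairs.add(((s1, s2), (t1, t2)))
    return pairs


# Toy alignment for a 3-word sentence pair where the last two words swap.
alignment = {(0, 0), (1, 2), (2, 1)}
for pair in sorted(extract_phrases(3, alignment)):
    print(pair)
```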

They used Europarl corpus for evaluation. Most experiments were done by translating German to English. They also experimented with some additional language pairs. Some observations they made were:
* Small phrases of up to three words lead to a high level of accuracy
* Syntactic restrictions hurt
* Heuristics based on word alignments work well.
* What works depends on language pair and size of training corpus.

Monday, April 4, 2011

Pre-meeting Post from Weisi Duan

I have read the paper “Statistical Phrase-Based Translation” as the related paper for the focus paper. The paper describes a framework in which different phrase extraction heuristics can be used to generate phrases. The authors have compared the performance of different phrase extraction methods with IBM4, and conclude that phrases help during translation while syntax-based phrases do not. The beam search in the decoder feels a little ad hoc, and the search operations are not clear enough to me in certain cases, such as whether phrases could overlap or not, and how the distortion probability is used during decoding, e.g. whether there can be an operation like swapping two phrases in the hypothesis.

For the focus paper, because of my limited knowledge of machine translation frameworks, I am not sure how the extraction sets are used in decoding, e.g. in what way the parameters for them as features are estimated, whether as an LM or discriminatively.

Friday, April 1, 2011

Post meeting summary by Weisi Duan

During the meeting, we first discussed standard LDA, including its representation, inference, and learning. We then discussed Labeled LDA, which is a supervised model that ties the labels of the documents to the hidden topics to obtain a distribution over the vocabulary given labels. We discussed the perks of this model, such as adding a prior as well as parameter tying. We also discussed issues involving Twitter data, e.g. the dialog model by Alan Ritter et al.; POS tagging on Twitter and why it is difficult; and the APIs and the potential problems that they bring, such as the samples not being intact dialogues. We discussed researching a problem that is well formulated in the sense that there is something concrete to evaluate, e.g. the structured model for modeling Twitter dialects can be used to predict the location of the speaker, which is a useful task.

Post-meeting - Daniel

The meeting this week started off with a discussion of LDA, and moved on to a discussion about Twitter. We went over the graphical model specification of LDA as well as the way it was modified to create Labeled LDA, the method used in the focus paper. We concluded that Labeled LDA was an interesting extension to LDA that was able to handle supervision more successfully than the more common Supervised LDA.
In the Twitter-related half of the discussion, we discussed general properties of Twitter, as well as a method for identifying discourse acts in threads of tweets. The discourse act paper had a major flaw in that its only quantitative evaluation did not work, and it did not clearly specify the problem it was trying to solve. The methods that it used also seemed more suitable for longer conversations, rather than the 3-6 message threads present in the Twitter data. We also discussed general properties of Twitter, but did not really conclude anything on that topic.

Reading for 4/7/11: DeNero and Klein, ACL 2010

Author:  John DeNero and Dan Klein
Venue:  ACL 2010
Leader:  Dhananjay Kulkarni
Request:  When you post to the blog, please include:
  1. Your name 
  2. Which focus paper this post relates to
  3. Whether this is the pre-meeting review or the post-meeting summary
  • Leave a comment on this post (non-anonymously) giving the details of the related paper you will read (include a URL), by Monday, April 4.
  • Post your commentary (a paragraph) as a new blog post, by Wednesday, April 6.

Post Meeting - Alan

A good portion of the discussion was dedicated to LDA and labeled LDA, which I found useful since I initially had only a very cloudy idea of how LDA works just from the fact it has been mentioned in the past two focus papers. I'm not sure of how well it works in practice, but it seems to be a favorite in terms of people who do research concerning topic models so there is some solidity behind the method.

I guess a big concern with the supplemental paper I read was that the evaluation metrics were all pretty hand-wavy. When it comes down to it, it's hard to say anything about the success of applying new techniques when there isn't a well-specified and intuitive standard to show something significant about a given approach - visually speaking, their results seem alright, but it's hard to say anything really constructive. Also, the fact they only used a small subset of the data, and truncated conversations to 3-6 posts (when some conversations consisted of more than 200 posts) is pretty questionable. For what it's worth, the amount of data they collected is fairly complete and fairly large, which will probably be pretty useful for future research. I guess it was also interesting to see topic models used for a slightly different purpose.

In conclusion we also commented on the current popularity of Twitter based on a well-implemented API and size of data publicly available. The data is fairly raw, which makes it noisy and hard to model, but also captures human correspondence on a more genuine level than normal text. There is probably some cool stuff to be found, but the general success of twitter-focused research ventures is questionable.


Thursday, March 31, 2011

Post-meeting Dong

A large part of today's meeting was about Labeled LDA. The model was compared to standard LDA and Supervised LDA (http://www.cs.princeton.edu/~blei/papers/BleiMcAuliffe2007.pdf). It was mentioned that Supervised LDA is theoretically very nice, but in practice it is often hard to get working. For Labeled LDA, it was noted that the inference for the labels of the test documents was not worked out well (they had simplified it to standard LDA inference).

We also discussed the paper on discovering dialogue acts in Twitter. Overall, it wasn't clear what the goal was or how they defined dialogue acts, and therefore the evaluation was not very satisfying. We weren't sure what evaluation method would fit better.

We also discussed the overall popularity of looking at Twitter data nowadays. One of the main advantages is that it's easy to get (their API seems to be pretty good). Also, its language is less formal than, for example, the WSJ, so it could be more interesting if you're interested in more natural language.

Pre-meeting summary

The focus paper for this week, Characterizing Microblogs with Topic Models, uses a variation of LDA, Labeled LDA, to analyze the topics present in Twitter, and then uses these for the tasks of rating tweets and recommending users to follow. They construct several datasets for this purpose: a small corpus of tweets labeled with topics and a set of tweets rated by users that received them. Overall, the features extracted from their topic model significantly improve performance on both tasks.

Only three different related papers were read this week: a paper about modeling conversations in twitter, a paper about Labeled LDA, and a paper about analyzing twitter discourse.

Dhananjay and Alan read "Unsupervised Modeling of Twitter Conversations", Alan Ritter, Colin Cherry, Bill Dolan, NAACL 2010. This paper discusses unsupervised methods for analyzing dialog acts in series of tweets. They present three methods, the EM Conversation model, the Conversation+Topic model, and the Bayesian Conversation model, where the Conversation+Topic model was their principal contribution. They perform several types of evaluation: qualitative analysis of the output of the system, held-out likelihood, and a new task, conversation ordering, which consists of reordering a scrambled conversation.

Weisi and Dong read "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora", Daniel Ramage et al, EMNLP 2009. This paper presents the modified form of LDA that is used in the focus paper. This model allows some documents to be specified with labels, which constrains the latent topics that are allowed to be associated with that document. They evaluate the model on many different tasks, comparing to one-vs-rest SVM, although Weisi notes that comparing to vanilla LDA would have been good in order to understand the quantitative effects of adding the labels.

I read "Beyond Microblogging: Conversation and Collaboration via Twitter", Honeycutt, C. and Herring, S., HICSS 2009. This paper discusses and analyzes dialog in Twitter, particularly relating to the use of the @ symbol. The authors sampled 37k tweets over the course of a day, and hand-annotated a sample of these in order to answer questions about the presence and function of dialog in Twitter, and the relation of the @ symbol to dialog. They found that dialogs are fairly prevalent on Twitter, and tended to be medium-length between two people. Further, the use of the @ symbol was strongly related to discourse acts involving interacting with other people.

Wednesday, March 30, 2011

Pre-meeting comment - Dhananjay

Focus paper: Characterizing Microblogs with Topic Models
Related paper: Unsupervised Modeling of Twitter Conversations; Alan Ritter, Colin Cherry, Bill Dolan, NAACL 2010

The paper proposes an unsupervised approach to the task of modeling dialogue acts. It is based on the conversation model by Barzilay and Lee (2004), using HMMs for topic transitions, with each topic generating a message. Another latent variable - source - is added that generates a particular word. A source may be one of (1) The current post's dialogue act, (2) Conversation's topic and (3) General English. Evaluation is done by comparing the probability of the held-out test data using Chib's estimator. Training is performed on a set of 10000 randomly sampled conversations with 3-6 posts. An interpretation of the model based on the 10-act dialogue model is presented. I observed that their interpretation presents a transition graph which is acyclic (excluding the self loops). This probably could be because of the length of the conversations and also because transitions with probability less than 0.1 are not shown. They are however not discussed. A comparison metric based on the ordering of the posts is proposed. The posts for a particular conversation are permuted and the Kendall tau coefficient is calculated.
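
A minimal sketch of the ordering metric, assuming the procedure is: score every permutation of a conversation's posts with the model, keep the best one, and compare it with the true order using Kendall's tau. The scoring function below is a toy stand-in for the model's log-probability, and all names are hypothetical.

```python
from itertools import combinations, permutations


def kendall_tau(order):
    """Tau between a proposed order and the true one.

    order[i] is the true index of the post placed at position i;
    the sequence needs at least two posts.
    """
    pairs = list(combinations(range(len(order)), 2))
    concordant = sum(1 for i, j in pairs if order[i] < order[j])
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


def best_ordering(n_posts, log_prob):
    """Permutation of post indices with the highest model score."""
    return max(permutations(range(n_posts)), key=log_prob)


def toy_log_prob(order):
    # Stand-in for a trained model: prefers posts near their true position.
    return -sum(abs(p - i) for i, p in enumerate(order))


print(kendall_tau(best_ordering(4, toy_log_prob)))
print(kendall_tau([3, 2, 1, 0]))
```

Enumerating all permutations is only feasible because the conversations are short (3-6 posts).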

Pre-Meeting Commentary Alan

Related paper: Unsupervised Modeling of Twitter Conversations;
Alan Ritter, Colin Cherry, Bill Dolan, NAACL 2010

Focus paper: Characterizing Microblogs with Topic Models

The related paper I read this week proposes an unsupervised method for discovering dialogue structure or "dialogue acts" in twitter conversations. The idea was to automatically extract information that says something about the nature of the interactions between people in new mediums such as twitter. This is a pretty cool problem since the conversational aspects of English or any language seem to be one of the harder problems one could pose in NLP. The authors crawled Twitter using its API and obtained the posts of a sample of users, and all the replies to their posts, extracting entire conversation trees. All the data amounted to 1.3 million conversations. Only 10,000 random conversations are used, and scaling the models to the entire corpus is left for future work. The authors introduce three models, the EM Conversation model, the Conversation+Topic model, and the Bayesian Conversation model, the second being an extension of the first.

The Conversation+Topic model is basically a modified HMM borrowed from some previous work on multi-document summarization. Just using the Conversational model wasn't quite good enough since topic and dialogue structure were mixed in the results and the focus is on dialogue structure. They use an LDA framework to modify the model to account for topic and thus separate content words from dialogue indicators. For the inference engine, the HMM dp is swapped for Gibbs sampling. Slice sampling is also applied.

To evaluate the set of generated dialogue acts, the authors use both qualitative and quantitative evaluations. The qualitative evaluation really only focuses on the Conversation+Topic model, and they go through a 10-act model, showing the probability on the transitions between dialogue acts. Most of the acts are reasonable and do a good job of illustrating the fact that Twitter is a microblog. They also display word lists and example posts for each dialogue act, which are fairly convincing. For quantitative evaluation the paper introduces a new task of conversation ordering: given a random set of conversations, all permutations of the conversations are generated and the probability of each permutation is evaluated as if it were an unseen conversation. It appears that although useful, this metric does not directly imply anything about the interpretability of the model.

It seems the paper does a decent job at planting a first step in terms of unsupervised dialogue act tagging. The work done doesn't seem overly complex, but the observations and data they collected seem like they could be useful for at least a couple more runs.

Pre-meeting comment - Daniel

Focus Paper: Characterizing Microblogs with Topic Models. Ramage, D. et al. 2010.
Related Paper: Beyond Microblogging: Conversation and Collaboration via Twitter. Honeycutt, C. and Herring, S. 2009

The paper that I read presents an analysis of dialog on Twitter, and specifically the use of the @ symbol. The authors analyze 37k tweets sampled from four 1-hour periods in order to answer four primary questions:
  1. What is the language break-up of tweets over time, and do different languages use the @ symbol to different degree?
  2. How do English tweets use the @ symbol?
  3. What topics are present in tweets, and how does the presence of the @ symbol affect this distribution?
  4. How does the @ symbol function with regards to interactive exchanges?
In answer to the first question, they found that the use of the @ symbol does not vary significantly over time or language. They categorized the use of '@' into a variety of categories: addresses, references, and several uses not related to the @username construction. Of the instances of '@' in a sample of 1500 tweets, 90% were addresses, 5% were references, and the last 5% was distributed among the remaining categories. The authors devised a semantic coding scheme for classifying tweets into one of twelve categories, and manually tagged around 200 tweets. Overall, they found that tweets with '@' in them tended to be more interactive; they were more likely to address others directly, make requests of others, or provide information.
To analyze dialog, the authors extracted all of the conversations that occurred in the sample of tweets, and found that the typical conversation was between two people, consisted of 3-5 tweets, and occurred over a period of 15-30 minutes. Further, they found that the use of the @ symbol was strongly tied to the presence of dialog.

Sunday, March 27, 2011

Pre-meeting post from Weisi Duan

I have read the paper “Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora” by Daniel Ramage et al., which appeared in EMNLP 2009. This paper tackles the problem of how to make the topics represent some known labels, which is not addressed by standard LDA. Unaware of this paper, I recently came up with a similar model, in which I tied the word senses to the topics, so as to discover sense indicators for different senses. I have submitted a paper on this, and now I am not sure if I am going to get a rejection... This related paper is nice in that it has conducted numerous experiments to show that the model is useful in certain situations. However, for certain experiments, the setup is not very clear and readers have to figure it out by themselves, e.g. in the snippet extraction task, it might be that they train first on the same data and then test on it for the extraction. Also, in the Tagged Web Page task, they independently select 3000 docs for cross validation, and this suggests a leak of test data. Although the leak is the same for both of the two models, I feel it would be nice if they could clearly separate the training and testing data.

For the focus paper, I am not sure the way they calculate the Fleiss Kappa makes sense, because they suggest they are doing a one-vs-other comparison for each single category, and how they treat agreement on the “other” category is unknown. I feel it matters here how the probability of chance agreement in the Fleiss Kappa is calculated. The other thing is that it would be nice if they could present a comparison against standard LDA in the ranking experiments, since this way we would be sure that the label information that L-LDA brings in makes a difference, because maybe standard LDA could also achieve the same results.
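
To show where the chance-agreement term comes from, here is a sketch of the standard Fleiss' kappa computation, plus one possible way of collapsing a table to one-vs-other. Whether the focus paper collapses categories this way is exactly what is unclear to me, so one_vs_other is an assumption, and the tables are toy data.

```python
def fleiss_kappa(table):
    """table[i][j] = number of raters who put item i into category j.

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(table)
    n_raters = sum(table[0])
    n_cats = len(table[0])
    # Share of all ratings that fall into each category.
    p_j = [sum(row[j] for row in table) / (n_items * n_raters)
           for j in range(n_cats)]
    # Observed agreement per item.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in table]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)  # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


def one_vs_other(table, j):
    """Collapse a multi-category table to [category j, everything else]."""
    return [[row[j], sum(row) - row[j]] for row in table]


print(fleiss_kappa([[3, 0], [0, 3]]))
print(one_vs_other([[1, 1, 1], [3, 0, 0]], 0))
```

Because the "other" column absorbs all remaining categories, its share feeds directly into the chance-agreement term, so the collapsed kappa can differ a lot from the multi-category one.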

Pre-meeting Dong

Pre-meeting (Dong Nguyen).
Related paper: Labeled LDA: A supervised topic model for credit attribution in multi-label corpora
Focus paper: Characterizing Microblogs with Topic Models

The related paper introduced Labeled LDA, which is used as the main method in the focus paper. Labeled LDA defines a one-to-one correspondence between LDA's latent topics and user tags. In comparison with previous models such as Supervised LDA, this model allows documents to be associated with multiple labels. The inference is similar to the standard LDA model, except the topics of a particular document are restricted to the topic set that is associated with the labels of that document. I liked that they evaluated the model on a range of tasks: topic visualization, snippet extraction and multilabel text classification (compared with a strong baseline: one-vs-rest SVM).
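
The restriction on inference can be sketched with a small collapsed Gibbs sampler in which each token's topic is only ever drawn from its document's label set. This is my own toy version under assumed hyperparameters (alpha, beta) and made-up data, not the authors' implementation.

```python
import random
from collections import defaultdict


def labeled_lda_gibbs(docs, labels, n_iters=50, alpha=0.1, beta=0.01, seed=0):
    """docs: list of token lists; labels: list of label lists, one per doc."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [defaultdict(int) for _ in docs]
    topic_word = defaultdict(lambda: defaultdict(int))
    topic_total = defaultdict(int)

    def bump(d, k, w, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Random initialization, already restricted to each document's labels.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.choice(labels[d]) for _ in doc])
        for w, k in zip(doc, z[d]):
            bump(d, k, w, +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                bump(d, z[d][i], w, -1)  # remove the token's current topic
                # Usual collapsed LDA conditional, but only over labels[d].
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in labels[d]
                ]
                z[d][i] = rng.choices(labels[d], weights=weights)[0]
                bump(d, z[d][i], w, +1)
    return z, topic_word


docs = [["goal", "match", "team"], ["vote", "party", "team"], ["goal", "vote"]]
labels = [["sports"], ["politics"], ["sports", "politics"]]
z, topic_word = labeled_lda_gibbs(docs, labels, n_iters=20)
print(z)
```

Dropping the restriction (letting every document use every topic) gives back the standard LDA sampler, which is why the comparison with plain LDA would have been a natural baseline.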

I'm not sure what to think of the focus paper. Much of the paper builds on the four dimensions: substance, style, status and social. Although they seem to make sense, these were identified by interviewing a small and not representative group. Also the way they identified the labeled dimensions in Twitter and the mapping to these dimensions seem somewhat ad hoc, so I'm not sure what to think of their Twitter characterization. Because for the ranking experiments, they don't use the 4S dimensions but only the topic distribution of the tweet, it would have been nice if they had also compared with standard LDA.

Saturday, March 26, 2011

Reading for 3/31/11: Ramage et al., ICWSM 2010

Author:  Daniel Ramage, Susan Dumais, Dan Liebling
Venue:  ICWSM 2010
Leader:  Daniel Mills
Request:  When you post to the blog, please include:
  1. Your name 
  2. Which focus paper this post relates to
  3. Whether this is the pre-meeting review or the post-meeting summary
  • Leave a comment on this post (non-anonymously) giving the details of the related paper you will read (include a URL), by Monday, March 28.
  • Post your commentary (a paragraph) as a new blog post, by Wednesday, March 30.