NLP Lecture 04: Sequence Labeling with HMMs, POS Tagging
Keywords: Sequence Labeling with Hidden Markov Models, Part-of-Speech Tagging.
Garden-Path Sentences
- The horse raced past the barn.
- The horse raced past the barn fell.
- The old dog the footsteps of the young.
Parts-of-Speech
- Classes of words that behave alike:
- Appear in similar contexts.
- Perform a similar grammatical function in the sentence.
- Undergo similar morphological transformations.
- ~9 traditional parts-of-speech:
- noun, pronoun, determiner, adjective, verb, adverb, preposition, conjunction, interjection
Parts-of-Speech Tagging
- Goal: Translate from a sequence of words \((w_1, w_2, ..., w_n) \in V^*\), to a sequence of tags \((t_1, t_2, ..., t_n ) \in T^*\).
- NLP is full of translation problems from one structure to another. A basic solution:
- Construct a search space of possible translations.
- Find the best path through this space (decoding) according to some performance measure.
Bayesian Inference for Sequence Labeling
- Assume each word \(w_i\) in the observed sequence \((w_1, w_2, ..., w_n) \in V^*\) was generated by some hidden variable \(t_i\).
- Infer the most likely sequence of hidden variables given the sequence of observed words.
Markov Chains
- A Markov chain is a sequence of random variables \(X_1, X_2, ...\)
- The domain of these variables is a set of states.
- Markov assumption: Next state depends only on current state. \(P(X_{n+1}|X_1, X_2, ..., X_n) = P(X_{n+1}|X_n)\)
- This is a special case of a weighted finite state automaton (WFSA).
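A minimal sketch of the Markov assumption in action: the toy chain below samples a POS-tag sequence where each step looks only at the current state. The tag set and all probabilities here are invented for illustration.

```python
import random

# Toy transition table over POS tags (all states and probabilities invented).
# Each row's probabilities sum to 1.
transitions = {
    "START": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
    "DT":    {"NN": 0.9, "JJ": 0.1},
    "JJ":    {"NN": 1.0},
    "NN":    {"VB": 0.5, "NN": 0.2, "STOP": 0.3},
    "VB":    {"DT": 0.4, "NN": 0.3, "STOP": 0.3},
}

def sample_chain(max_len=10):
    """Sample a state sequence; the next state depends only on the current one."""
    state, path = "START", []
    for _ in range(max_len):
        states, probs = zip(*transitions[state].items())
        state = random.choices(states, weights=probs)[0]
        if state == "STOP":
            break
        path.append(state)
    return path

print(sample_chain())  # e.g. ['DT', 'NN', 'VB', 'NN']
```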
Hidden Markov Models (HMMs)
Generative (Bayesian) probability model. Observations: the sequence of words. Hidden states: the sequence of part-of-speech tags.
- Hidden sequence is generated by an n-gram language model (typically a bi-gram model).
- \[t_0 = START\]
- \[P(t_1, t_2, ..., t_n) = \prod_{i=1}^nP(t_i|t_{i-1})\]
- There are two types of probabilities: transition probabilities \(P(t_i|t_{i-1})\) and emission probabilities \(P(w_i|t_i)\). \begin{equation} P(t_1,t_2, \ldots, t_n, w_1,w_2,\ldots,w_n) = P(t_1|START)P(w_1|t_1)P(t_2|t_1)P(w_2|t_2) \cdots P(t_n|t_{n-1})P(w_n|t_n) \end{equation} \begin{equation} P(t_1,t_2, \ldots, t_n, w_1,w_2,\ldots,w_n) = \prod_{i=1}^nP(t_i|t_{i-1})P(w_i|t_i) \end{equation}
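A small worked example of this factorization: the sketch below multiplies toy transition and emission probabilities along a single tag sequence (all numbers are invented for illustration).

```python
# Toy transition and emission tables (invented numbers).
transition = {("START", "DT"): 0.6, ("DT", "NN"): 0.9, ("NN", "VB"): 0.5}
emission   = {("DT", "the"): 0.7, ("NN", "dog"): 0.1, ("VB", "barks"): 0.05}

def joint_probability(tags, words):
    """Product of P(t_i | t_{i-1}) * P(w_i | t_i), with t_0 = START."""
    prob, prev = 1.0, "START"
    for t, w in zip(tags, words):
        prob *= transition.get((prev, t), 0.0) * emission.get((t, w), 0.0)
        prev = t
    return prob

print(joint_probability(["DT", "NN", "VB"], ["the", "dog", "barks"]))
# 0.6*0.7 * 0.9*0.1 * 0.5*0.05 = 0.000945
```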
Important Tasks on HMMs
Decoding: Given a sequence of words, find the most likely tag sequence, i.e. the highest-probability sequence of hidden states. (Bayesian inference using the Viterbi algorithm)
Evaluation: Given a sequence of words, find the total probability of this word sequence under an HMM. Note that we can view the HMM as another type of language model. (Forward algorithm)
Training: Estimate emission and transition probabilities from training data. (MLE from tagged data; the Forward-Backward, a.k.a. Baum-Welch, algorithm when only the word sequences are observed)
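A minimal sketch of MLE training from a tiny tagged corpus (the corpus and tag names are made up): transition probabilities are relative frequencies of tag bigrams, emission probabilities are relative frequencies of (tag, word) pairs.

```python
from collections import Counter

# Toy tagged corpus: two sentences of (word, tag) pairs.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("old", "JJ"), ("dog", "NN")],
]

context_counts = Counter()  # how often each tag appears as a left context
tag_counts = Counter()      # how often each tag emits a word
trans_counts = Counter()
emit_counts = Counter()

for sentence in corpus:
    prev = "START"
    for word, tag in sentence:
        context_counts[prev] += 1
        trans_counts[(prev, tag)] += 1
        tag_counts[tag] += 1
        emit_counts[(tag, word)] += 1
        prev = tag

def p_trans(prev, tag):
    # P(t | t_prev) = count(t_prev, t) / count(t_prev)
    return trans_counts[(prev, tag)] / context_counts[prev]

def p_emit(tag, word):
    # P(w | t) = count(t, w) / count(t)
    return emit_counts[(tag, word)] / tag_counts[tag]

print(p_trans("START", "DT"))  # 2/2 = 1.0
print(p_emit("NN", "dog"))     # 2/2 = 1.0
```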
Decoding HMMs
- Goal: Find the path with the highest total probability (given the words). \begin{equation} \arg \max_{t_1, \ldots, t_n} \prod_{i=1}^n P(t_i|t_{i-1}) P(w_i|t_i) \end{equation}
Viterbi Algorithm
- Input: Sequence of observed words \(w_1,..., w_n\).
- Create a table \(\pi\), such that each entry \(\pi[k,t]\) contains the score of the highest-probability sequence ending in tag \(t\) at time \(k\).
- Initialize \(\pi[0,START]=1.0\) and \(\pi[0,t]=0.0\) for all tags \(t \in T\) (START is a special symbol, not in \(T\)).
- for \(k=1\) to \(n\):
- for \(t \in T\):
- \[\pi[k,t] \leftarrow \max_s \pi[k-1, s]\cdot P(t|s) \cdot P(w_k|t)\]
- return \(\max_s \pi[n,s]\) (keep backpointers at each step to recover the best tag sequence itself).
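A minimal Python sketch of this algorithm, assuming `transition` and `emission` dicts keyed by tag pairs and (tag, word) pairs as in the earlier toy sketches; it also keeps backpointers so the best tag sequence can be read off at the end.

```python
def viterbi(words, tags, transition, emission):
    # pi[k][t] = probability of the best tag sequence for w_1..w_k ending in t
    pi = [{"START": 1.0}]
    backpointer = [{}]
    for k, w in enumerate(words, start=1):
        pi.append({})
        backpointer.append({})
        for t in tags:
            best_s, best_score = None, 0.0
            for s, prev_score in pi[k - 1].items():
                score = (prev_score
                         * transition.get((s, t), 0.0)
                         * emission.get((t, w), 0.0))
                if score > best_score:
                    best_s, best_score = s, score
            pi[k][t] = best_score
            backpointer[k][t] = best_s
    # Follow backpointers from the best final tag
    # (assumes every word has at least one tag with nonzero probability).
    last = max(pi[-1], key=pi[-1].get)
    sequence = [last]
    for k in range(len(words), 1, -1):
        last = backpointer[k][last]
        sequence.append(last)
    return list(reversed(sequence)), max(pi[-1].values())
```

With the toy tables from the earlier sketch, `viterbi(["the", "dog", "barks"], ["DT", "NN", "VB", "JJ"], transition, emission)` would return `["DT", "NN", "VB"]` together with its probability 0.000945.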
Trigram Language Model
- Instead of using a unigram context, use a bigram context.
- Think of this as having states that represent pairs of tags.
So the HMM probability for a given tag and word sequence is: \begin{equation} \prod_{i=1}^nP(t_i|t_{i-2}t_{i-1})P(w_i|t_i) \end{equation}
- Need to handle data sparseness when estimating transition probabilities (for example using backoff or linear interpolation)
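A minimal sketch of linear interpolation for the trigram transition probabilities, assuming `p_trigram`, `p_bigram`, and `p_unigram` are dicts of MLE estimates; the lambda weights below are placeholders and would normally be tuned on held-out data (they must sum to 1).

```python
def p_trans_interpolated(t, t_prev2, t_prev1,
                         p_unigram, p_bigram, p_trigram,
                         lambdas=(0.5, 0.3, 0.2)):
    """Smoothed estimate of P(t | t_prev2, t_prev1) by linear interpolation."""
    l3, l2, l1 = lambdas
    return (l3 * p_trigram.get((t_prev2, t_prev1, t), 0.0)
            + l2 * p_bigram.get((t_prev1, t), 0.0)
            + l1 * p_unigram.get(t, 0.0))
```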
More POS tagging tricks
It is also often useful in practice to add an end-of-sentence marker (just like we did for n-gram language models). \begin{equation} P(t_1,\ldots,t_n,w_1,\ldots,w_n) = \left[\prod_{i=1}^nP(t_i|t_{i-2}t_{i-1})P(w_i|t_i)\right]P(t_{n+1}|t_n) \end{equation} where \(t_{-1} = t_{0} = START\) and \(t_{n+1} = STOP\).
- Another useful trick is to replace rare or unknown words with "pseudo-words" representing an entire word class.
- Using a smoothed trigram HMM with these tricks, we can build a tagger that is close to the state of the art (~97% accuracy on the Penn Treebank).
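A minimal sketch of the pseudo-word idea: rare or unknown words are mapped to a class based on surface form before training and tagging. The classes below are illustrative, not the exact inventory used in any particular tagger.

```python
import re

def pseudo_word(word):
    """Map a rare/unknown word to a pseudo-word class based on its surface form."""
    if re.fullmatch(r"\d{4}", word):
        return "_FOUR_DIGITS_"        # e.g. years like "1984"
    if re.fullmatch(r"\d+", word):
        return "_ALL_DIGITS_"
    if re.fullmatch(r"[A-Z][a-z]+", word):
        return "_CAPITALIZED_"        # likely a proper noun
    if word.isupper():
        return "_ALL_CAPS_"           # acronyms like "NLP"
    if word.endswith("ing"):
        return "_ING_SUFFIX_"
    return "_UNK_"

print(pseudo_word("1984"))      # _FOUR_DIGITS_
print(pseudo_word("Crockett"))  # _CAPITALIZED_
```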
HMMs as Language Models
- We can also use an HMM as a language model (language generation, MT, ...), i.e. evaluate \(P(w_1,...,w_n)\) for a given sentence. What is the advantage over a plain word n-gram model?
- Problem: There are many tag sequences that could have generated \(w_1, ..., w_n\). \begin{equation} P(w_1,\ldots,w_n,t_1,\ldots,t_n) = \prod_{i=1}^nP(t_i|t_{i-1})P(w_i|t_i) \end{equation}
- This is an example of spurious ambiguity.
- Need to compute: \begin{equation} P(w_1,\ldots,w_n) = \sum_{t_1,\ldots,t_n}P(w_1,\ldots,w_n,t_1,\ldots,t_n) = \sum_{t_1,\ldots,t_n} \left[\prod_{i=1}^nP(t_i|t_{i-1})P(w_i|t_i)\right] \end{equation}
Forward Algorithm
- Input: Sequence of observed words \(w_1,..., w_n\).
- Create a table \(\pi\), such that each entry \(\pi[k,t]\) contains the total probability of all tag sequences ending in tag \(t\) at time \(k\), i.e. \(P(w_1, ..., w_k, t_k = t)\).
- Initialize \(\pi[0,START]=1.0\) and \(\pi[0,t]=0.0\) for all tags \(t \in T\).
- for \(k=1\) to \(n\):
- for \(t \in T\):
- \[\pi[k,t] \leftarrow \sum_s \pi[k-1, s]\cdot P(t|s) \cdot P(w_k|t)\]
- return \(\sum_s \pi[n,s]\).
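A minimal sketch mirroring the Viterbi code above, with the max replaced by a sum, so the return value is \(P(w_1, ..., w_n)\) marginalized over all tag sequences (same assumed `transition`/`emission` dict format as in the earlier sketches).

```python
def forward(words, tags, transition, emission):
    # pi[k][t] = total probability of all tag sequences for w_1..w_k ending in t
    pi = [{"START": 1.0}]
    for k, w in enumerate(words, start=1):
        pi.append({})
        for t in tags:
            pi[k][t] = sum(prev_score
                           * transition.get((s, t), 0.0)
                           * emission.get((t, w), 0.0)
                           for s, prev_score in pi[k - 1].items())
    # Sum over all possible final tags to get P(w_1, ..., w_n).
    return sum(pi[-1].values())
```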