NLP Lecture 04: Sequence Labeling with HMMs, POS Tagging

Keywords: Sequence Labeling with Hidden Markov Models, Part-of-Speech Tagging.

Garden-Path Sentences

  • The horse raced past the barn.
  • The horse raced past the barn fell.
  • The old dog the footsteps of the young.

Parts-of-Speech

  • Classes of words that behave alike:
    • Appear in similar contexts.
    • Perform a similar grammatical function in the sentence.
    • Undergo similar morphological transformations.
  • ~9 traditional parts-of-speech:
    • noun, pronoun, determiner, adjective, verb, adverb, preposition, conjunction, interjection

Parts-of-Speech Tagging

  • Goal: Translate from a sequence of words \((w_1, w_2, ..., w_n) \in V^*\) to a sequence of tags \((t_1, t_2, ..., t_n) \in T^*\).
  • NLP is full of translation problems from one structure to another. Basic solution:
    1. Construct search space of possible translations.
    2. Find best paths through this space (decoding) according to some performance measure.

Bayesian Inference for Sequence Labeling

  • Assume each word \(w_i\) in the observed sequence \((w_1, w_2, ..., w_n) \in V^*\) was generated by some hidden variable \(t_i\).
  • Infer the most likely sequence of hidden variables given the sequence of observed words.

Markov Chains

  • A Markov chain is a sequence of random variables \(X_1, X_2, ...\)
  • The domain of these variables is a set of states.
  • Markov assumption: The next state depends only on the current state (a small sampling sketch follows this list). \(P(X_{n+1}|X_1, X_2, ..., X_n) = P(X_{n+1}|X_n)\)
  • This is a special case of a weighted finite state automaton (WFSA).
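
A minimal sketch of sampling from such a chain, using an invented two-state weather chain (the states and probabilities are made up purely for illustration); note that picking the next state only ever looks at the current one:

```python
import random

# Toy transition probabilities P(next | current) for a two-state chain.
# States and numbers are invented for illustration only.
P = {
    "Sunny": {"Sunny": 0.8, "Rainy": 0.2},
    "Rainy": {"Sunny": 0.4, "Rainy": 0.6},
}

def sample_chain(start, steps):
    state, chain = start, [start]
    for _ in range(steps):
        nxt = random.choices(list(P[state]), weights=list(P[state].values()))[0]
        chain.append(nxt)
        state = nxt  # only the current state is needed to choose the next one
    return chain

print(sample_chain("Sunny", 5))  # e.g. ['Sunny', 'Sunny', 'Rainy', 'Rainy', 'Sunny', 'Sunny']
```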

Hidden Markov Models (HMMs)

  • Generative (Bayesian) probability model. Observations: sequences of words. Hidden states: sequence of part-of-speech labels.

  • The hidden sequence is generated by an n-gram language model (typically a bigram model).
    • \[t_0 = START\]
    • \[P(t_1, t_2, ..., t_n) = \prod_{i=1}^nP(t_i|t_{i-1})\]
  • There are two types of probabilities: transition probabilities \(P(t_i|t_{i-1})\) and emission probabilities \(P(w_i|t_i)\). \begin{equation} P(t_1,t_2, …, t_n, w_1,w_2,…,w_n) = P(t_1|START)P(w_1|t_1)P(t_2|t_1)P(w_2|t_2) \cdots P(t_n|t_{n-1})P(w_n|t_n) \end{equation} \begin{equation} P(t_1,t_2, …, t_n, w_1,w_2,…,w_n) = \prod_{i=1}^nP(t_i|t_{i-1})P(w_i|t_i) \end{equation}
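
As a toy illustration of how the joint probability factorizes into alternating transition and emission terms (all tags and numbers below are invented for the example):

```python
# Toy bigram HMM for the sentence "the dog" tagged DT NN.
# All probabilities are invented purely for illustration.
trans = {("START", "DT"): 0.4, ("DT", "NN"): 0.5}   # P(t_i | t_{i-1})
emit  = {("DT", "the"): 0.6, ("NN", "dog"): 0.1}    # P(w_i | t_i)

# P(DT, NN, the, dog) = P(DT|START) P(the|DT) P(NN|DT) P(dog|NN)
p = trans[("START", "DT")] * emit[("DT", "the")] * trans[("DT", "NN")] * emit[("NN", "dog")]
print(p)  # 0.4 * 0.6 * 0.5 * 0.1 ≈ 0.012 (up to floating-point rounding)
```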

Important Tasks on HMMs

  • Decoding: Given a sequence of words, find the most likely tag sequence. (Bayesian inference using the Viterbi algorithm)

  • Evaluation: Given a sequence of words, find the total probability for this word sequence given an HMM. Note that we can view the HMM as another type of language model. (Forward algorithm)

  • Training: Estimate emission and transition probabilities from training data. (MLE from tagged data; the Forward-Backward, a.k.a. Baum-Welch, algorithm when the tags are unobserved)

Decoding HMMs

  • Goal: Find the path with the highest total probability (given the words) \begin{equation} \arg \max_{t_1, …, t_n} \prod_{i=1}^n P(t_i|t_{i-1}) P(w_i|t_i) \end{equation}

Viterbi Algorithm

  • Input: Sequence of observed words \(w_1,..., w_n\).
  • Create a table \(\pi\), such that each entry \(\pi[k,t]\) contains the score of the highest-probability sequence ending in tag \(t\) at time \(k\).
  • initialize \(\pi[0,start]=1.0\) and \(\pi[0,t]=0.0\) for all tags \(t \in T\).
  • for \(k=1\) to \(n\):
    • for \(t \in T\):
      • \[\pi[k,t] \leftarrow \max_s \pi[k-1, s]\cdot P(t|s) \cdot P(w_k|t)\]
  • return \(\max_s \pi[n,s]\) (keep backpointers at each step to recover the actual tag sequence; see the sketch below).
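
A minimal Python sketch of this procedure, with backpointers added so the actual tag sequence, not just its probability, is returned. The dictionaries `trans` and `emit` are assumed inputs holding pre-estimated (and ideally smoothed) probabilities \(P(t|s)\) and \(P(w|t)\):

```python
from collections import defaultdict

def viterbi(words, tags, trans, emit):
    """Viterbi decoding for a bigram HMM.

    Assumed inputs: trans[(s, t)] = P(t|s) and emit[(t, w)] = P(w|t) are
    plain dicts of pre-estimated probabilities, and tags is the tag set T
    (not including the START state). Returns the most likely tag sequence.
    """
    n = len(words)
    # pi[k][t] = probability of the best tag sequence ending in tag t at position k
    pi = [defaultdict(float) for _ in range(n + 1)]
    back = [{} for _ in range(n + 1)]   # backpointers for recovering the path
    pi[0]["START"] = 1.0

    for k in range(1, n + 1):
        w = words[k - 1]
        for t in tags:
            best_s, best_score = None, 0.0
            for s in pi[k - 1]:
                score = pi[k - 1][s] * trans.get((s, t), 0.0) * emit.get((t, w), 0.0)
                if score > best_score:
                    best_s, best_score = s, score
            if best_s is not None:
                pi[k][t] = best_score
                back[k][t] = best_s

    # pick the best final tag and follow the backpointers
    t = max(pi[n], key=pi[n].get)
    path = [t]
    for k in range(n, 1, -1):
        t = back[k][t]
        path.append(t)
    return list(reversed(path))
```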

Trigram Language Model

  • Instead of using a unigram context, use a bigram context.
    • Think of this as having states that represent pairs of tags.
  • So the HMM probability for a given tag and word sequence is: \begin{equation} \prod_{i=1}^nP(t_i|t_{i-2}t_{i-1})P(w_i|t_i) \end{equation}

  • Need to handle data sparseness when estimating transition probabilities, for example using backoff or linear interpolation (see the sketch below).
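
A sketch of one common way to handle that sparseness, linear interpolation of trigram, bigram, and unigram maximum-likelihood estimates. The layout of `counts` and the lambda weights are assumptions for illustration; in practice the lambdas would be tuned on held-out data:

```python
def interpolated_trans(t, t1, t2, counts, lambdas=(0.6, 0.3, 0.1)):
    """Estimate P(t | t2, t1), i.e. P(t_i | t_{i-2}, t_{i-1}) with t2 = t_{i-2}, t1 = t_{i-1}.

    counts is assumed to hold raw tag n-gram counts from training data under
    the keys "unigram", "bigram", "trigram", and "total"; the lambda weights
    are placeholders (they must sum to 1) and would normally be tuned.
    """
    l3, l2, l1 = lambdas
    # maximum-likelihood estimates at each order (0 when the context is unseen)
    tri = (counts["trigram"].get((t2, t1, t), 0) / counts["bigram"][(t2, t1)]
           if counts["bigram"].get((t2, t1), 0) > 0 else 0.0)
    bi = (counts["bigram"].get((t1, t), 0) / counts["unigram"][t1]
          if counts["unigram"].get(t1, 0) > 0 else 0.0)
    uni = counts["unigram"].get(t, 0) / counts["total"]
    return l3 * tri + l2 * bi + l1 * uni
```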

More POS tagging tricks

  • It is also often useful in practice to add an end-of-sentence marker (just like we did for n-gram language models). \begin{equation} P(t_1,…,t_n,w_1,…,w_n) = [\prod_{i=1}^nP(t_i|t_{i-2}t_{i-1})P(w_i|t_i)]P(t_{n+1}|t_n) \end{equation} where \({t_{-1} = t_{0} = START}\) and \(t_{n+1} = STOP\).

  • Another useful trick is to replace rare or unknown words with “pseudo-words”, each representing an entire word class (a sketch follows after this list).
  • Using a smoothed trigram HMM with these tricks, we can build a tagger that is close to the state of the art (~97% accuracy on the Penn Treebank).
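
A sketch of the pseudo-word idea: map rare or unseen words to coarse classes based on their surface form. The class names, patterns, and the `vocab` argument are illustrative choices, not a fixed standard:

```python
import re

def pseudo_word(word, vocab):
    """Map a rare/unknown word to a coarse class label.

    vocab is assumed to be the set of frequent training words; the class
    names and patterns below are illustrative, not canonical.
    """
    if word in vocab:
        return word
    if re.fullmatch(r"\d{4}", word):
        return "_fourDigitNum_"      # e.g. years such as "1984"
    if re.fullmatch(r"\d+", word):
        return "_otherNum_"
    if word[:1].isupper():
        return "_capitalized_"
    if word.endswith("ing"):
        return "_ingWord_"
    return "_unknown_"
```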

HMMs as Language Models

  • We can also use an HMM as a language model (language generation, MT, …), i.e. evaluate \(P(w_1,...,w_n)\) for a given sentence. What is the advantage over a plain word n-gram model?
  • Problem: There are many tag sequences that could have generated \(w_1, ..., w_n\). \begin{equation} P(w_1,…,w_n,t_1,…,t_n) = \prod_{i=1}^nP(t_i|t_{i-1})P(w_i|t_i) \end{equation}
  • This is an example of spurious ambiguity.
  • Need to compute: \begin{equation} P(w_1,…,w_n) = \sum_{t_1,…,t_n}P(w_1,…,w_n,t_1,…,t_n) = \sum_{t_1,…,t_n} [\prod_{i=1}^nP(t_i|t_{i-1})P(w_i|t_i)] \end{equation}

Forward Algorithm

  • Input: Sequence of observed words \(w_1,..., w_n\).
  • Create a table \(\pi\), such that each entry \(\pi[k,t]\) contains the total probability of all tag sequences ending in tag \(t\) at time \(k\).
  • initialize \(\pi[0,start]=1.0\) and \(\pi[0,t]=0.0\) for all tags \(t \in T\).
  • for \(k=1\) to \(n\):
    • for \(t \in T\):
      • \[\pi[k,t] \leftarrow \sum_s \pi[k-1, s]\cdot P(t|s) \cdot P(w_k|t)\]
  • return \(\sum_s \pi[n,s]\).
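
A sketch under the same assumptions as the Viterbi code above; the only substantive change is replacing the max over predecessor states with a sum:

```python
from collections import defaultdict

def forward(words, tags, trans, emit):
    """Forward algorithm for a bigram HMM.

    Same assumed inputs as the Viterbi sketch above:
    trans[(s, t)] = P(t|s), emit[(t, w)] = P(w|t).
    Returns P(w_1, ..., w_n), summed over all tag sequences.
    """
    n = len(words)
    pi = [defaultdict(float) for _ in range(n + 1)]
    pi[0]["START"] = 1.0

    for k in range(1, n + 1):
        w = words[k - 1]
        for t in tags:
            # sum over predecessor states instead of taking the max
            pi[k][t] = sum(pi[k - 1][s] * trans.get((s, t), 0.0) * emit.get((t, w), 0.0)
                           for s in pi[k - 1])

    return sum(pi[n].values())
```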

© 2022 Jiawei Lu. All rights reserved.