NLP Lecture 04: Sequence Labeling with HMMs, POS Tagging

Keywords: Sequence Labeling with Hidden Markov Models, Part-of-Speech Tagging.

Garden-Path Sentences

  • The horse raced past the barn.
  • The horse raced past the barn fell.
  • The old dog the footsteps of the young.

Parts-of-Speech

  • Classes of words that behave alike:
    • Appear in similar contexts.
    • Perform a similar grammatical function in the sentence.
    • Undergo similar morphological transformations.
  • ~9 traditional parts-of-speech:
    • noun, pronoun, determiner, adjective, verb, adverb, preposition, conjunction, interjection

Parts-of-Speech Tagging

  • Goal: Translate from a sequence of words \((w_1, w_2, ..., w_n) \in V^*\) to a sequence of tags \((t_1, t_2, ..., t_n) \in T^*\).
  • NLP is full of translation problems from one structure to another. Basic solution:
    1. Construct search space of possible translations.
    2. Find best paths through this space (decoding) according to some performance measure.

Bayesian Inference for Sequence Labeling

  • Assume each word \(w_i\) in the observed sequence \((w_1, w_2, ..., w_n) \in V^*\) was generated by some hidden variable \(t_i\).
  • Infer the most likely sequence of hidden variables given the sequence of observed words.

Markov Chains

  • A Markov chain is a sequence of random variables \(X_1, X_2, ...\)
  • The domain of these variables is a set of states.
  • Markov assumption: The next state depends only on the current state (a small sampling sketch follows this list). \(P(X_{n+1}|X_1, X_2, ..., X_n) = P(X_{n+1}|X_n)\)
  • This is a special case of a weighted finite state automaton (WFSA).
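
A minimal sketch of sampling from such a chain, using an invented two-state weather chain (the states and probabilities are made up purely for illustration); note that picking the next state only ever looks at the current one:

```python
import random

# Toy transition probabilities P(next | current) for a two-state chain.
# States and numbers are invented for illustration only.
P = {
    "Sunny": {"Sunny": 0.8, "Rainy": 0.2},
    "Rainy": {"Sunny": 0.4, "Rainy": 0.6},
}

def sample_chain(start, steps):
    state, chain = start, [start]
    for _ in range(steps):
        nxt = random.choices(list(P[state]), weights=list(P[state].values()))[0]
        chain.append(nxt)
        state = nxt  # only the current state is needed to choose the next one
    return chain

print(sample_chain("Sunny", 5))  # e.g. ['Sunny', 'Sunny', 'Rainy', 'Rainy', 'Sunny', 'Sunny']
```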

Hidden Markov Models (HMMs)

  • Generative (Bayesian) probability model. Observations: sequences of words. Hidden states: sequence of part-of-speech labels.

  • The hidden sequence is generated by an n-gram language model (typically a bigram model).
    • \[t_0 = START\]
    • \[P(t_1, t_2, ..., t_n) = \prod_{i=1}^nP(t_i|t_{i-1})\]
  • There are two types of probabilities: transition probabilities \(P(t_i|t_{i-1})\) and emission probabilities \(P(w_i|t_i)\). \begin{equation} P(t_1,t_2, …, t_n, w_1,w_2,…,w_n) = P(t_1|START)P(w_1|t_1)P(t_2|t_1)P(w_2|t_2) \cdots P(t_n|t_{n-1})P(w_n|t_n) \end{equation} \begin{equation} P(t_1,t_2, …, t_n, w_1,w_2,…,w_n) = \prod_{i=1}^nP(t_i|t_{i-1})P(w_i|t_i) \end{equation}
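
As a toy illustration of how the joint probability factorizes into alternating transition and emission terms (all tags and numbers below are invented for the example):

```python
# Toy bigram HMM for the sentence "the dog" tagged DT NN.
# All probabilities are invented purely for illustration.
trans = {("START", "DT"): 0.4, ("DT", "NN"): 0.5}   # P(t_i | t_{i-1})
emit  = {("DT", "the"): 0.6, ("NN", "dog"): 0.1}    # P(w_i | t_i)

# P(DT, NN, the, dog) = P(DT|START) P(the|DT) P(NN|DT) P(dog|NN)
p = trans[("START", "DT")] * emit[("DT", "the")] * trans[("DT", "NN")] * emit[("NN", "dog")]
print(p)  # 0.4 * 0.6 * 0.5 * 0.1 ≈ 0.012 (up to floating-point rounding)
```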

Important Tasks on HMMs

  • Decoding: Given a sequence of words, find the most likely tag sequence. (Bayesian inference using the Viterbi algorithm)

  • Evaluation: Given a sequence of words, find the total probability for this word sequence given an HMM. Note that we can view the HMM as another type of language model. (Forward algorithm)

  • Training: Estimate emission and transition probabilities from training data. (MLE from tagged data; the Forward-Backward, a.k.a. Baum-Welch, algorithm when the tags are unobserved)

Decoding HMMs

  • Goal: Find the path with the highest total probability (given the words) \begin{equation} \arg \max_{t_1, …, t_n} \prod_{i=1}^n P(t_i|t_{i-1}) P(w_i|t_i) \end{equation}

Viterbi Algorithm

  • Input: Sequence of observed words \(w_1,..., w_n\).
  • Create a table \(\pi\), such that each entry \(\pi[k,t]\) contains the score of the highest-probability sequence ending in tag \(t\) at time \(k\).
  • initialize \(\pi[0,start]=1.0\) and \(\pi[0,t]=0.0\) for all tags \(t \in T\).
  • for \(k=1\) to \(n\):
    • for \(t \in T\):
      • \[\pi[k,t] \leftarrow \max_s \pi[k-1, s]\cdot P(t|s) \cdot P(w_k|t)\]
  • return \(\max_s \pi[n,s]\) (keep backpointers at each step to recover the actual tag sequence; see the sketch below).
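
A minimal Python sketch of this procedure, with backpointers added so the actual tag sequence, not just its probability, is returned. The dictionaries `trans` and `emit` are assumed inputs holding pre-estimated (and ideally smoothed) probabilities \(P(t|s)\) and \(P(w|t)\):

```python
from collections import defaultdict

def viterbi(words, tags, trans, emit):
    """Viterbi decoding for a bigram HMM.

    Assumed inputs: trans[(s, t)] = P(t|s) and emit[(t, w)] = P(w|t) are
    plain dicts of pre-estimated probabilities, and tags is the tag set T
    (not including the START state). Returns the most likely tag sequence.
    """
    n = len(words)
    # pi[k][t] = probability of the best tag sequence ending in tag t at position k
    pi = [defaultdict(float) for _ in range(n + 1)]
    back = [{} for _ in range(n + 1)]   # backpointers for recovering the path
    pi[0]["START"] = 1.0

    for k in range(1, n + 1):
        w = words[k - 1]
        for t in tags:
            best_s, best_score = None, 0.0
            for s in pi[k - 1]:
                score = pi[k - 1][s] * trans.get((s, t), 0.0) * emit.get((t, w), 0.0)
                if score > best_score:
                    best_s, best_score = s, score
            if best_s is not None:
                pi[k][t] = best_score
                back[k][t] = best_s

    # pick the best final tag and follow the backpointers
    t = max(pi[n], key=pi[n].get)
    path = [t]
    for k in range(n, 1, -1):
        t = back[k][t]
        path.append(t)
    return list(reversed(path))
```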

Trigram Language Model

  • Instead of using a unigram context, use a bigram context.
    • Think of this as having states that represent pairs of tags.
  • So the HMM probability for a given tag and word sequence is: \begin{equation} \prod_{i=1}^nP(t_i|t_{i-2}t_{i-1})P(w_i|t_i) \end{equation}

  • Need to handle data sparseness when estimating transition probabilities, for example using backoff or linear interpolation (see the sketch below).
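
A sketch of one common way to handle that sparseness, linear interpolation of trigram, bigram, and unigram maximum-likelihood estimates. The layout of `counts` and the lambda weights are assumptions for illustration; in practice the lambdas would be tuned on held-out data:

```python
def interpolated_trans(t, t1, t2, counts, lambdas=(0.6, 0.3, 0.1)):
    """Estimate P(t | t2, t1), i.e. P(t_i | t_{i-2}, t_{i-1}) with t2 = t_{i-2}, t1 = t_{i-1}.

    counts is assumed to hold raw tag n-gram counts from training data under
    the keys "unigram", "bigram", "trigram", and "total"; the lambda weights
    are placeholders (they must sum to 1) and would normally be tuned.
    """
    l3, l2, l1 = lambdas
    # maximum-likelihood estimates at each order (0 when the context is unseen)
    tri = (counts["trigram"].get((t2, t1, t), 0) / counts["bigram"][(t2, t1)]
           if counts["bigram"].get((t2, t1), 0) > 0 else 0.0)
    bi = (counts["bigram"].get((t1, t), 0) / counts["unigram"][t1]
          if counts["unigram"].get(t1, 0) > 0 else 0.0)
    uni = counts["unigram"].get(t, 0) / counts["total"]
    return l3 * tri + l2 * bi + l1 * uni
```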

More POS tagging tricks

  • It is also often useful in practice to add an end-of-sentence marker (just like we did for n-gram language models). \begin{equation} P(t_1,…,t_n,w_1,…,w_n) = [\prod_{i=1}^nP(t_i|t_{i-2}t_{i-1})P(w_i|t_i)]P(t_{n+1}|t_n) \end{equation} where \({t_{-1} = t_{0} = START}\) and \(t_{n+1} = STOP\).

  • Another useful trick is to replace rare or unknown words with “pseudo-words”, each representing an entire word class (a sketch follows after this list).
  • Using a smoothed trigram HMM with these tricks, we can build a tagger that is close to the state of the art (~97% accuracy on the Penn Treebank).
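
A sketch of the pseudo-word idea: map rare or unseen words to coarse classes based on their surface form. The class names, patterns, and the `vocab` argument are illustrative choices, not a fixed standard:

```python
import re

def pseudo_word(word, vocab):
    """Map a rare/unknown word to a coarse class label.

    vocab is assumed to be the set of frequent training words; the class
    names and patterns below are illustrative, not canonical.
    """
    if word in vocab:
        return word
    if re.fullmatch(r"\d{4}", word):
        return "_fourDigitNum_"      # e.g. years such as "1984"
    if re.fullmatch(r"\d+", word):
        return "_otherNum_"
    if word[:1].isupper():
        return "_capitalized_"
    if word.endswith("ing"):
        return "_ingWord_"
    return "_unknown_"
```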

HMMs as Language Models

  • We can also use an HMM as a language model (language generation, MT, …), i.e. evaluate \(P(w_1,...,w_n)\) for a given sentence. What is the advantage over a plain word n-gram model?
  • Problem: There are many tag sequences that could have generated \(w_1, ..., w_n\). \begin{equation} P(w_1,…,w_n,t_1,…,t_n) = \prod_{i=1}^nP(t_i|t_{i-1})P(w_i|t_i) \end{equation}
  • This is an example of spurious ambiguity.
  • Need to compute: \begin{equation} P(w_1,…,w_n) = \sum_{t_1,…,t_n}P(w_1,…,w_n,t_1,…,t_n) = \sum_{t_1,…,t_n} [\prod_{i=1}^nP(t_i|t_{i-1})P(w_i|t_i)] \end{equation}

Forward Algorithm

  • Input: Sequence of observed words \(w_1,..., w_n\).
  • Create a table \(\pi\), such that each entry \(\pi[k,t]\) contains the total probability of all tag sequences ending in tag \(t\) at time \(k\).
  • initialize \(\pi[0,start]=1.0\) and \(\pi[0,t]=0.0\) for all tags \(t \in T\).
  • for \(k=1\) to \(n\):
    • for \(t \in T\):
      • \[\pi[k,t] \leftarrow \sum_s \pi[k-1, s]\cdot P(t|s) \cdot P(w_k|t)\]
  • return \(\sum_s \pi[n,s]\).
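
A sketch under the same assumptions as the Viterbi code above; the only substantive change is replacing the max over predecessor states with a sum:

```python
from collections import defaultdict

def forward(words, tags, trans, emit):
    """Forward algorithm for a bigram HMM.

    Same assumed inputs as the Viterbi sketch above:
    trans[(s, t)] = P(t|s), emit[(t, w)] = P(w|t).
    Returns P(w_1, ..., w_n), summed over all tag sequences.
    """
    n = len(words)
    pi = [defaultdict(float) for _ in range(n + 1)]
    pi[0]["START"] = 1.0

    for k in range(1, n + 1):
        w = words[k - 1]
        for t in tags:
            # sum over predecessor states instead of taking the max
            pi[k][t] = sum(pi[k - 1][s] * trans.get((s, t), 0.0) * emit.get((t, w), 0.0)
                           for s in pi[k - 1])

    return sum(pi[n].values())
```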

© 2022 Jiawei Lu. All rights reserved.