NLP Lecture 03: n-gram language models

Keywords: n-gram language models

Language Modeling

  • Task: predict the next word given the context.
  • Used in speech recognition, handwritten character recognition, spelling correction, text entry UI, machine translation,…

Probability of the Next Word

  • Idea: We do not need to model domain, syntactic, and lexical knowledge perfectly.
  • Instead, we can rely on the notion of probability of a sequence (letters, words…).

Markov Assumption

  • \(P(w_n|w_1, w_2, ..., w_{n-1})\) is difficult to estimate.
  • The longer the sequence becomes, the less likely \(w_1 w_2 w_3 ... w_{n-1}\) will appear in training data.
  • Instead, we make the following simple independence assumption (Markov assumption): the probability of seeing \(w_n\) depends only on the previous \(k-1\) words. \begin{equation} P(w_n|w_1, w_2, w_3, …, w_{n-1}) \approx P(w_n|w_{n-k+1}, …, w_{n-1}) \end{equation}

n-grams

  • The sequence \(w_n\) is a unigram.
  • The sequence \(w_{n-1}, w_n\) is a bigram.
  • The sequence \(w_{n-2}, w_{n-1}, w_n\) is a trigram.
  • The sequence \(w_{n-3}, w_{n-2}, w_{n-1}, w_n\) is a quadrigram…
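
For concreteness, a minimal Python sketch (not part of the original notes) of extracting n-grams from a tokenized sentence:

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i want to eat Chinese food".split()
print(ngrams(tokens, 1))  # unigrams: ('i',), ('want',), ...
print(ngrams(tokens, 2))  # bigrams: ('i', 'want'), ('want', 'to'), ...
print(ngrams(tokens, 3))  # trigrams: ('i', 'want', 'to'), ...
```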

Bi-gram Language Model

  • Using the Markov assumption and the chain rule: \begin{equation} P(w_1, w_2, w_3, …, w_n) \approx P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_2) \cdots P(w_n|w_{n-1}) \end{equation}

  • It is more consistent to use only bigrams: \begin{equation} P(w_1|start) \cdot P(w_2|w_1) \cdot P(w_3|w_2) \cdots P(w_n|w_{n-1}) \cdot P(end|w_n) \end{equation}

Variable-Length Language Models

  • We typically don’t know what the length of the sentence is.
  • Instead, we use a special marker END/STOP that indicates the end of a sentence.
  • We typically just augment the sentence with START and END/STOP markers to provide the appropriate context (for example, START i want to eat Chinese food END).
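
A minimal sketch of assembling a sentence probability from bigram probabilities with START/END markers; the `bigram_prob` table below is a made-up toy distribution, not estimates from a real corpus:

```python
def sentence_probability(tokens, bigram_prob, start="START", end="END"):
    """Multiply bigram probabilities P(w_i | w_{i-1}) over the padded sentence."""
    padded = [start] + tokens + [end]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        prob *= bigram_prob.get((prev, cur), 0.0)  # unseen bigram -> 0 before smoothing
    return prob

# toy bigram probabilities, purely illustrative
bigram_prob = {("START", "i"): 0.25, ("i", "want"): 0.33, ("want", "to"): 0.66,
               ("to", "eat"): 0.28, ("eat", "Chinese"): 0.02,
               ("Chinese", "food"): 0.52, ("food", "END"): 0.68}
print(sentence_probability("i want to eat Chinese food".split(), bigram_prob))  # ~1.1e-4
```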

Log Probabilities

  • Probabilities become very small (shrinking by a few orders of magnitude per token), so long products can underflow floating-point precision.
  • We often work with log probabilities in practice. \begin{equation} p(w_1…w_n) = \prod_{i=1}^n p(w_i|w_{i-1}) \end{equation}

    \begin{equation} \log p(w_1…w_n) = \sum_{i=1}^n \log p(w_i|w_{i-1}) \end{equation}
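
A small sketch of why logs matter numerically: multiplying many small per-token probabilities underflows to zero, while summing their logarithms stays well within floating-point range (the probabilities below are made up):

```python
import math

token_probs = [1e-4] * 300           # 300 tokens, each with probability 1e-4 (illustrative)

product = 1.0
for p in token_probs:
    product *= p
print(product)                       # 0.0 -- the product underflows double precision

log_prob = sum(math.log(p) for p in token_probs)
print(log_prob)                      # about -2763.1, easily representable
```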

Estimating n-gram Probabilities

  • We can estimate n-gram probabilities using maximum likelihood estimates. \begin{equation} p(w|u) = \frac{count(u,w)}{count(u)} \end{equation}

  • Or for trigrams: \begin{equation} p(w|u, v) = \frac{count(u, v, w)}{count(u, v)} \end{equation}
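
A minimal sketch of these maximum likelihood estimates for bigrams on a toy two-sentence corpus; the count tables and helper names are illustrative:

```python
from collections import Counter

corpus = [["START", "i", "want", "to", "eat", "END"],
          ["START", "i", "want", "Chinese", "food", "END"]]

bigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def p_mle(w, u):
    """P(w | u) = count(u, w) / count(u)."""
    return bigram_counts[(u, w)] / context_counts[u]

print(p_mle("want", "i"))   # count(i, want) / count(i) = 2/2 = 1.0
print(p_mle("to", "want"))  # count(want, to) / count(want) = 1/2 = 0.5
```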

Unseen Tokens

  • Typical approach to unseen tokens:
    • Start with a specific lexicon of known tokens.
    • Replace all tokens in the training and testing corpus that are not in the lexicon with an UNK token.
  • Practical approach:
    • Lexicon contains all words that appear more than k times in the training corpus.
    • Replace all other tokens with UNK.
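
A sketch of this practical recipe, assuming a count threshold k and an UNK symbol as described above:

```python
from collections import Counter

def build_lexicon(training_sentences, k=1):
    """Keep only the words that appear more than k times in the training corpus."""
    freq = Counter(w for sent in training_sentences for w in sent)
    return {w for w, c in freq.items() if c > k}

def replace_unknowns(sentences, lexicon, unk="UNK"):
    """Map every out-of-lexicon token to the UNK symbol."""
    return [[w if w in lexicon else unk for w in sent] for sent in sentences]

train = [["i", "want", "to", "eat"], ["i", "want", "Chinese", "food"]]
lexicon = build_lexicon(train, k=1)               # {'i', 'want'}
print(replace_unknowns(train, lexicon))           # rare training words become UNK
print(replace_unknowns([["i", "like", "food"]], lexicon))  # unseen test words become UNK too
```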

Unseen Contexts

  • Two basic approaches:
    • Smoothing / Discounting: Move some probability mass from seen trigrams to unseen trigrams.
    • Back-off: use (n-1)-gram, (n-2)-gram, … probabilities to compute the n-gram probability.
  • Other techniques:
    • Class-based back-off: use the back-off probability of a specific word class / part of speech.

Zipf’s Law

  • Problem: n-grams (and most other linguistic phenomena) follow a Zipfian distribution.
  • A few words occur very frequently.
  • Most words occur very rarely. Many are seen only once.

  • Zipf’s law: a word’s frequency is approximately inversely proportional to its rank in the frequency-ordered word list.
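  • In symbols: \begin{equation} f(r) \propto \frac{1}{r}, \quad \text{so} \quad r \cdot f(r) \approx \text{const} \end{equation} e.g., the second most frequent word occurs about half as often as the most frequent one, the third about a third as often, and so on.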

Smoothing

Smoothing flattens spiky distributions by moving some probability mass from seen events to rare and unseen ones.

Additive Smoothing

  • Classic approach: Laplacian, a.k.a. additive smoothing.

    \begin{equation} P(w_i) = \frac{count(w_i) + 1}{N+V} \end{equation}

    • N is the number of tokens
    • V is the number of types (i.e. size of the vocabulary)
  • \begin{equation} P(w|u) = \frac{count(u, w) + 1}{count(u) + V} \end{equation}

  • Inaccurate in practice.
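
For concreteness, a minimal sketch of the add-one estimate for bigrams; the count tables and vocabulary size below are made up:

```python
def p_laplace(w, u, bigram_counts, context_counts, vocab_size):
    """P(w | u) = (count(u, w) + 1) / (count(u) + V)."""
    return (bigram_counts.get((u, w), 0) + 1) / (context_counts.get(u, 0) + vocab_size)

# toy counts: the context "want" was seen twice; vocabulary of V = 8 types
bigram_counts = {("want", "to"): 1, ("want", "Chinese"): 1}
context_counts = {"want": 2}
print(p_laplace("to", "want", bigram_counts, context_counts, vocab_size=8))   # (1+1)/(2+8) = 0.2
print(p_laplace("eat", "want", bigram_counts, context_counts, vocab_size=8))  # (0+1)/(2+8) = 0.1, unseen but nonzero
```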

Linear Interpolation

  • Use the denser distributions of shorter n-grams to “fill in” sparse n-gram distributions.

    \begin{equation} P(w|u, v) = \lambda_1 \cdot p_{mle}(w|u,v) + \lambda_2 \cdot p_{mle}(w|v) + \lambda_3 \cdot p_{mle}(w) \end{equation}

  • Where, \(\lambda_1, \lambda_2, \lambda_3 > 0\) and \(\lambda_1 + \lambda_2 + \lambda_3 = 1\).
  • Works well in practice (though without much theoretical justification).
  • Parameters can be estimated on development data (for example, using Expectation Maximization).
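
A sketch of the interpolation formula; the component `p_mle_*` models and the λ values below are placeholders (in practice the λs would be tuned on development data):

```python
def p_interpolated(w, u, v, p_mle_tri, p_mle_bi, p_mle_uni, lambdas=(0.6, 0.3, 0.1)):
    """lambda1 * P(w|u,v) + lambda2 * P(w|v) + lambda3 * P(w), with the lambdas summing to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_mle_tri(w, u, v) + l2 * p_mle_bi(w, v) + l3 * p_mle_uni(w)

# toy component models, purely illustrative
p_tri = lambda w, u, v: 0.0   # the trigram (u, v, w) was never seen
p_bi  = lambda w, v: 0.4
p_uni = lambda w: 0.05
print(p_interpolated("food", "eat", "Chinese", p_tri, p_bi, p_uni))  # 0.6*0 + 0.3*0.4 + 0.1*0.05 = 0.125
```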

Discounting

  • Idea: set aside some probability mass, then fill in the missing mass using back-off.
  • \(count^*(v, w) = count(v, w) - \beta\) where \(0<\beta<1\).
  • Then, for all seen bigrams: \begin{equation} p(w|v) = \frac{count^*(v, w)}{count(v)} \end{equation}
  • For each context v the missing probability mass is \begin{equation} \alpha(v) = 1 - \sum_{w:c(v,w)>0} \frac{count^*(v, w)}{count(v)} \end{equation}
  • We can now divide this held-out mass between the unseen words (evenly or using back-off).
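
A sketch of the discounted counts and the missing mass \(\alpha(v)\) for a single context; the discount β and the toy counts are made up:

```python
def missing_mass(v, bigram_counts, context_counts, beta=0.5):
    """alpha(v) = 1 - sum over seen w of (count(v, w) - beta) / count(v)."""
    seen = [c for (u, w), c in bigram_counts.items() if u == v and c > 0]
    return 1.0 - sum((c - beta) / context_counts[v] for c in seen)

bigram_counts = {("the", "dog"): 3, ("the", "cat"): 2, ("the", "end"): 1}
context_counts = {"the": 6}
# alpha("the") = 1 - ((3-0.5) + (2-0.5) + (1-0.5)) / 6 = 1 - 4.5/6 = 0.25
print(missing_mass("the", bigram_counts, context_counts, beta=0.5))
```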

Katz’ Backoff

  • Divide the held-out probability mass proportionally to the unigram probability of the unseen words in context v.

    \begin{equation} p(w|v) = \begin{cases} \frac{count^*(v, w)}{count(v)} & \text{if } count(v,w) > 0, \\ \alpha(v) \times \frac{p_{mle}(w)}{\sum_{u:count(v,u) = 0} p_{mle}(u)} & \text{otherwise.} \end{cases} \end{equation}
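
A sketch along the lines of this formula; `p_mle_uni`, the vocabulary, and the counts are illustrative placeholders:

```python
def p_katz(w, v, bigram_counts, context_counts, p_mle_uni, vocab, beta=0.5):
    """Discounted MLE for seen bigrams; back off to rescaled unigram probabilities otherwise."""
    c = bigram_counts.get((v, w), 0)
    if c > 0:
        return (c - beta) / context_counts[v]
    # missing probability mass alpha(v) for this context
    alpha = 1.0 - sum((bc - beta) / context_counts[v]
                      for (u, x), bc in bigram_counts.items() if u == v)
    # renormalize the unigram probabilities over the words unseen after v
    unseen = [x for x in vocab if bigram_counts.get((v, x), 0) == 0]
    denom = sum(p_mle_uni(x) for x in unseen)
    return alpha * p_mle_uni(w) / denom

bigram_counts = {("the", "dog"): 3, ("the", "cat"): 2, ("the", "end"): 1}
context_counts = {"the": 6}
vocab = {"the", "dog", "cat", "end", "fish"}
p_uni = lambda w: 0.2                      # uniform toy unigram model
print(p_katz("dog", "the", bigram_counts, context_counts, p_uni, vocab))   # (3-0.5)/6 ≈ 0.417
print(p_katz("fish", "the", bigram_counts, context_counts, p_uni, vocab))  # 0.25 * 0.2/0.4 = 0.125
```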

Evaluating n-gram models

  • Extrinsic evaluation: apply the model in a downstream application (for example, language classification) and evaluate that application.
  • Intrinsic evaluation: measure how well the model approximates unseen language data.
    • Can compute the probability of each sentence according to the model. Higher probability -> better model.
    • Typically we compute Perplexity instead.

Perplexity

  • Perplexity (per word) measures how well the n-gram model predicts the sample.
  • Given a corpus of \(m\) sentences \(s_i\), where \(M\) is the total number of tokens in the corpus.
  • Perplexity is defined as \(2^{-l}\), where \(l = \frac{1}{M} \sum_{i=1}^m \log_2 p(s_i)\).
  • Lower perplexity = better model. Intuition:
    • Assume we are predicting one word at a time.
    • With a uniform distribution, all successor words are equally likely, and perplexity equals the vocabulary size.
    • Perplexity can be thought of as an “effective vocabulary size”.
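
A sketch of per-word perplexity following the definition above; `log2_sentence_prob` is a hypothetical stand-in for an n-gram model's scoring function:

```python
import math

def perplexity(sentences, log2_sentence_prob):
    """2^(-l) with l = (1/M) * sum_i log2 p(s_i), where M is the total token count."""
    M = sum(len(s) for s in sentences)
    l = sum(log2_sentence_prob(s) for s in sentences) / M
    return 2 ** (-l)

# sanity check: a uniform model over a vocabulary of size V gives perplexity V
V = 1000
uniform_log2_prob = lambda sent: len(sent) * math.log2(1.0 / V)
print(perplexity([["a"] * 10, ["b"] * 7], uniform_log2_prob))  # ≈ 1000.0
```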

© 2022 Jiawei Lu. All rights reserved.