NLP Lecture 03: n-gram language models
Keywords: n-gram language models
Language Modeling
- Task: predict the next word given the context.
- Used in speech recognition, handwritten character recognition, spelling correction, text entry UI, machine translation,…
Probability of the Next Word
- Idea: We do not need to model domain, syntactic, and lexical knowledge perfectly.
- Instead, we can rely on the notion of probability of a sequence (letters, words…).
Markov Assumption
- \(P(w_n|w_1, w_2, ..., w_{n-1})\) is difficult to estimate.
- The longer the sequence becomes, the less likely \(w_1 w_2 w_3 ... w_{n-1}\) will appear in training data.
- Instead, we make the following simplifying independence assumption (Markov assumption): the probability of seeing \(w_n\) depends only on the previous \(k-1\) words. \begin{equation} P(w_n|w_1, w_2, w_3, …, w_{n-1}) \approx P(w_n|w_{n-k+1}, …, w_{n-1}) \end{equation}
n-grams
- The sequence \(w_n\) is a unigram.
- The sequence \(w_{n-1}, w_n\) is a bigram.
- The sequence \(w_{n-2}, w_{n-1}, w_n\) is a trigram.
- The sequence \(w_{n-3}, w_{n-2}, w_{n-1}, w_n\) is a quadrigram…
Bi-gram Language Model
Using the Markov assumption and the chain rule: \begin{equation} P(w_1, w_2, w_3, …, w_n) \approx P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_2) \cdots P(w_n|w_{n-1}) \end{equation}
It is more consistent to use only bigrams by padding with start/end markers: \begin{equation} P(w_1|start) \cdot P(w_2|w_1) \cdot P(w_3|w_2) \cdots P(w_n|w_{n-1}) \cdot P(end|w_n) \end{equation}
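A minimal sketch of scoring a sentence with such a bigram model; the probability table below is a made-up toy example, not estimated from real data:

```python
# Toy bigram probability table; the values are made up for illustration.
bigram_prob = {
    ("<s>", "i"): 0.25, ("i", "want"): 0.33, ("want", "to"): 0.66,
    ("to", "eat"): 0.28, ("eat", "food"): 0.02, ("food", "</s>"): 0.40,
}

def bigram_sentence_prob(tokens, probs):
    """P(w_1..w_n) ~= P(w_1|START) * P(w_2|w_1) * ... * P(END|w_n)."""
    padded = ["<s>"] + tokens + ["</s>"]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= probs.get((prev, cur), 0.0)  # unseen bigrams get probability 0 here
    return p

print(bigram_sentence_prob(["i", "want", "to", "eat", "food"], bigram_prob))
```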
Variable-Length Language Models
- We typically don’t know what the length of the sentence is.
- Instead, we use a special marker END/STOP that indicates the end of a sentence.
- We typically just augment the sentence with START and END/STOP markers to provide the appropriate context. (START i want to eat Chinese food END)
Log Probabilities
- Probabilities become very small very quickly (shrinking by a few orders of magnitude per token).
We often work with log probabilities in practice. \begin{equation} p(w_1…w_n) = \prod_{i=1}^n p(w_i|w_{i-1}) \end{equation}
\begin{equation} \log p(w_1…w_n) = \sum_{i=1}^n \log p(w_i|w_{i-1}) \end{equation}
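A small sketch of the same computation in log space, assuming a bigram probability table like the toy one above; summing logs avoids the numerical underflow that multiplying many small probabilities would cause:

```python
import math

# Sketch of scoring in log space: summing log probabilities avoids the
# numerical underflow caused by multiplying many small probabilities.
def sentence_logprob(tokens, probs):
    padded = ["<s>"] + tokens + ["</s>"]
    logp = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = probs.get((prev, cur), 0.0)
        if p == 0.0:
            return float("-inf")  # an unseen bigram zeroes out the whole product
        logp += math.log(p)
    return logp
```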
Estimating n-gram Probabilities
We can estimate n-gram probabilities using maximum likelihood estimates. \begin{equation} p(w|u) = \frac{count(u,w)}{count(u)} \end{equation}
Or for trigrams: \begin{equation} p(w|u, v) = \frac{count(u, v, w)}{count(u, v)} \end{equation}
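A sketch of how the bigram MLE estimates could be collected from a toy tokenized corpus; the corpus and padding symbols here are illustrative only:

```python
from collections import Counter

# Sketch of maximum likelihood estimation of bigram probabilities,
# p(w|u) = count(u, w) / count(u), from a toy tokenized corpus.
def mle_bigram_probs(sentences):
    context_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return {(u, w): c / context_counts[u] for (u, w), c in bigram_counts.items()}

toy_corpus = [["i", "want", "to", "eat"], ["i", "want", "food"]]
print(mle_bigram_probs(toy_corpus)[("i", "want")])  # 1.0 in this toy corpus
```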
Unseen Tokens
- Typical approach to unseen tokens:
  - Start with a specific lexicon of known tokens.
  - Replace all tokens in the training and testing corpus that are not in the lexicon with an UNK token.
- Practical approach (sketched below):
  - The lexicon contains all words that appear more than k times in the training corpus.
  - Replace all other tokens with UNK.
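A sketch of this practical recipe, with a hypothetical threshold k and UNK symbol:

```python
from collections import Counter

# Sketch: keep words that appear more than k times in the training corpus,
# map everything else to a hypothetical UNK token.
def build_lexicon(train_sentences, k=1):
    counts = Counter(w for sent in train_sentences for w in sent)
    return {w for w, c in counts.items() if c > k}

def replace_unknown(sentences, lexicon, unk="<UNK>"):
    return [[w if w in lexicon else unk for w in sent] for sent in sentences]
```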
Unseen Contexts
- Two basic approaches:
- Smoothing / Discounting: Move some probability mass from seen trigrams to unseen trigrams.
- Back-off: use (n-1)-gram, (n-2)-gram, … probabilities to compute the n-gram probability.
- Other techniques:
- Class-based backoff: use the back-off probability of a specific word class / part-of-speech.
Zipf’s Law
- Problem: n-grams (and most other linguistic phenomena) follow a Zipfian distribution.
- A few words occur very frequently.
- Most words occur very rarely; many are seen only once.
- Zipf’s law: a word’s frequency is approximately inversely proportional to its rank in the word distribution list.
Smoothing
Smoothing flattens spiky distributions.
Additive Smoothing
Classic approach: Laplace smoothing, a.k.a. additive smoothing.
\begin{equation} P(w_i) = \frac{count(w_i) + 1}{N+V} \end{equation}
- N is the number of tokens
- V is the number of types (i.e. size of the vocabulary)
\begin{equation} P(w|u) = \frac{count(u, w) + 1}{count(u) + V} \end{equation}
Inaccurate in practice.
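A minimal sketch of the add-one (Laplace) smoothed bigram estimate, assuming count dictionaries and a vocabulary like those in the earlier MLE sketch:

```python
# Sketch of add-one (Laplace) smoothed bigram probability,
# p(w|u) = (count(u, w) + 1) / (count(u) + V), where V is the vocabulary size.
def laplace_bigram_prob(u, w, bigram_counts, unigram_counts, vocab):
    V = len(vocab)
    return (bigram_counts.get((u, w), 0) + 1) / (unigram_counts.get(u, 0) + V)
```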
Linear Interpolation
Use the denser distributions of shorter n-grams to “fill in” sparse n-gram distributions.
\begin{equation} P(w|u, v) = \lambda_1 \cdot p_{mle}(w|u,v) + \lambda_2 \cdot p_{mle}(w|v) + \lambda_3 \cdot p_{mle}(w) \end{equation}
- where \(\lambda_1, \lambda_2, \lambda_3 > 0\) and \(\lambda_1 + \lambda_2 + \lambda_3 = 1\).
- Works well in practice (but not a lot of theoretical justification why).
- Parameters can be estimated on development data (for example, using Expectation Maximization).
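A sketch of linear interpolation over precomputed MLE tables; the lambda weights below are illustrative placeholders rather than tuned values:

```python
# Sketch of linear interpolation over precomputed MLE tables for trigrams,
# bigrams, and unigrams. The lambda weights are illustrative placeholders;
# in practice they would be tuned on development data (e.g. with EM).
def interpolated_trigram_prob(w, u, v, p_tri, p_bi, p_uni,
                              lambdas=(0.6, 0.3, 0.1)):
    l1, l2, l3 = lambdas  # must be positive and sum to 1
    return (l1 * p_tri.get((u, v, w), 0.0)
            + l2 * p_bi.get((v, w), 0.0)
            + l3 * p_uni.get(w, 0.0))
```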
Discounting
- Idea: set aside some probability mass, then fill in the missing mass using back-off.
- \(count^*(v, w) = count(v, w) - \beta\) where \(0<\beta<1\).
- Then, for all seen bigrams: \begin{equation} p(w|v) = \frac{count^*(v, w)}{count(v)} \end{equation}
- For each context \(v\), the missing probability mass is \begin{equation} \alpha(v) = 1 - \sum_{w:count(v,w)>0} \frac{count^*(v, w)}{count(v)} \end{equation}
- We can now divide this held-out mass between the unseen words, either evenly or using back-off (see the sketch below).
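A sketch of the discounting step with an illustrative \(\beta\), computing the discounted bigram estimates and the held-out mass \(\alpha(v)\) for each context:

```python
# Sketch of discounting: subtract an illustrative beta from every seen bigram
# count, then compute the held-out mass alpha(v) per context v.
# `bigram_counts` maps (v, w) -> count(v, w); `context_counts` maps v -> count(v).
def discount_and_alpha(bigram_counts, context_counts, beta=0.5):
    disc_probs = {(v, w): (c - beta) / context_counts[v]
                  for (v, w), c in bigram_counts.items()}
    alpha = {}
    for v in context_counts:
        seen_mass = sum(p for (ctx, _), p in disc_probs.items() if ctx == v)
        alpha[v] = 1.0 - seen_mass
    return disc_probs, alpha
```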
Katz’ Backoff
Divide the held-out probability mass proportionally to the unigram probability of the unseen words in context v.
\begin{equation} p(w|v) = \begin{cases} \frac{count^*(v, w)}{count(v)} & \text{if } count(v,w) > 0, \\ \alpha(v) \cdot \frac{p_{mle}(w)}{\sum_{u:count(v,u) = 0} p_{mle}(u)} & \text{otherwise.} \end{cases} \end{equation}
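A sketch of the resulting Katz backoff estimate for bigrams, reusing the discounted probabilities and \(\alpha(v)\) from the previous sketch; the vocabulary and unigram table are assumed inputs:

```python
# Sketch of Katz backoff for bigrams: seen bigrams keep their discounted
# estimate; alpha(v) is split among unseen words in proportion to their
# unigram MLE probabilities. `vocab` and `unigram_probs` are assumed inputs.
def katz_bigram_prob(w, v, disc_probs, alpha, unigram_probs, vocab):
    if (v, w) in disc_probs:
        return disc_probs[(v, w)]
    unseen = [u for u in vocab if (v, u) not in disc_probs]
    denom = sum(unigram_probs.get(u, 0.0) for u in unseen)
    if denom == 0.0:
        return 0.0
    return alpha.get(v, 0.0) * unigram_probs.get(w, 0.0) / denom
```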
Evaluating n-gram models
- Extrinsic evaluation: Apply the model in an application (for example language classification). Evaluate the application.
- Intrinsic evaluation: measure how well the model approximates unseen language data.
- Can compute the probability of each sentence according to the model. Higher probability -> better model.
- Typically we compute Perplexity instead.
Perplexity
- Perplexity (per word) measures how well the n-gram model predicts the sample.
- Given a test corpus of \(m\) sentences \(s_i\), let \(M\) be the total number of tokens in the corpus.
- Perplexity is defined as \(2^{-l}\), where \(l = \frac{1}{M} \sum_{i=1}^m \log_2 p(s_i)\).
- Lower perplexity = better model.
- Intuition:
  - Assume we are predicting one word at a time.
  - Under a uniform distribution, all successor words are equally likely, and the perplexity equals the vocabulary size.
  - Perplexity can therefore be thought of as an “effective vocabulary size”.
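A sketch of per-word perplexity, where sentence_prob is any function returning the model probability of a sentence (for example, the bigram scorer sketched earlier):

```python
import math

# Sketch of per-word perplexity: 2^(-l) with l = (1/M) * sum_i log2 p(s_i),
# where M is the total number of tokens in the test corpus and `sentence_prob`
# is any function returning the model probability of a sentence.
def perplexity(sentences, sentence_prob):
    M = sum(len(sent) for sent in sentences)
    l = sum(math.log2(sentence_prob(sent)) for sent in sentences) / M
    return 2 ** (-l)
```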