NLP Lecture 03: n-gram language models
Keywords: n-gram language models
Language Modeling
- Task: predict the next word given the context.
- Used in speech recognition, handwritten character recognition, spelling correction, text entry UI, machine translation,…
Probability of the Next Word
- Idea: We do not need to model domain, syntactic, and lexical knowledge perfectly.
- Instead, we can rely on the notion of probability of a sequence (letters, words…).
Markov Assumption
- \(P(w_n|w_1, w_2, ..., w_{n-1})\) is difficult to estimate.
- The longer the sequence becomes, the less likely \(w_1 w_2 w_3 ... w_{n-1}\) will appear in training data.
- Instead, we make the following simplifying independence assumption (Markov assumption): the probability of seeing \(w_n\) depends only on the previous \(k-1\) words. \begin{equation} P(w_n|w_1, w_2, w_3, …, w_{n-1}) \approx P(w_n|w_{n-k+1}, …, w_{n-1}) \end{equation}
n-grams
- The sequence \(w_n\) is a unigram.
- The sequence \(w_{n-1}, w_n\) is a bigram.
- The sequence \(w_{n-2}, w_{n-1}, w_n\) is a trigram.
- The sequence \(w_{n-3}, w_{n-2}, w_{n-1}, w_n\) is a quadrigram…
Bi-gram Language Model
Using the Markov assumption and the chain rule: \begin{equation} P(w_1, w_2, w_3, …, w_n) \approx P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_2) \cdots P(w_n|w_{n-1}) \end{equation}
It is more consistent to use only bigrams by padding with start/end markers: \begin{equation} P(w_1|start) \cdot P(w_2|w_1) \cdot P(w_3|w_2) \cdots P(w_n|w_{n-1}) \cdot P(end|w_n) \end{equation}
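A minimal sketch of scoring a sentence with such a bigram model; the probability table below is a made-up toy example, not estimated from real data:

```python
# Toy bigram probability table; the values are made up for illustration.
bigram_prob = {
    ("<s>", "i"): 0.25, ("i", "want"): 0.33, ("want", "to"): 0.66,
    ("to", "eat"): 0.28, ("eat", "food"): 0.02, ("food", "</s>"): 0.40,
}

def bigram_sentence_prob(tokens, probs):
    """P(w_1..w_n) ~= P(w_1|START) * P(w_2|w_1) * ... * P(END|w_n)."""
    padded = ["<s>"] + tokens + ["</s>"]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= probs.get((prev, cur), 0.0)  # unseen bigrams get probability 0 here
    return p

print(bigram_sentence_prob(["i", "want", "to", "eat", "food"], bigram_prob))
```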
Variable-Length Language Models
- We typically don’t know what the length of the sentence is.
- Instead, we use a special marker END/STOP that indicates the end of a sentence.
- We typically just augment the sentence with START and END/STOP markers to provide the appropriate context. (START i want to eat Chinese food END)
Log Probabilities
- Probabilities become very small very quickly (shrinking by a few orders of magnitude per token).
We often work with log probabilities in practice. \begin{equation} p(w_1…w_n) = \prod_{i=1}^n p(w_i|w_{i-1}) \end{equation}
\begin{equation} \log p(w_1…w_n) = \sum_{i=1}^n \log p(w_i|w_{i-1}) \end{equation}
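A small sketch of the same computation in log space, assuming a bigram probability table like the toy one above; summing logs avoids the numerical underflow that multiplying many small probabilities would cause:

```python
import math

# Sketch of scoring in log space: summing log probabilities avoids the
# numerical underflow caused by multiplying many small probabilities.
def sentence_logprob(tokens, probs):
    padded = ["<s>"] + tokens + ["</s>"]
    logp = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = probs.get((prev, cur), 0.0)
        if p == 0.0:
            return float("-inf")  # an unseen bigram zeroes out the whole product
        logp += math.log(p)
    return logp
```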
Estimating n-gram Probabilities
We can estimate n-gram probabilities using maximum likelihood estimates. \begin{equation} p(w|u) = \frac{count(u,w)}{count(u)} \end{equation}
Or for trigrams: \begin{equation} p(w|u, v) = \frac{count(u, v, w)}{count(u, v)} \end{equation}
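A sketch of how the bigram MLE estimates could be collected from a toy tokenized corpus; the corpus and padding symbols here are illustrative only:

```python
from collections import Counter

# Sketch of maximum likelihood estimation of bigram probabilities,
# p(w|u) = count(u, w) / count(u), from a toy tokenized corpus.
def mle_bigram_probs(sentences):
    context_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return {(u, w): c / context_counts[u] for (u, w), c in bigram_counts.items()}

toy_corpus = [["i", "want", "to", "eat"], ["i", "want", "food"]]
print(mle_bigram_probs(toy_corpus)[("i", "want")])  # 1.0 in this toy corpus
```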
Unseen Tokens
- Typical approach to unseen tokens:
  - Start with a specific lexicon of known tokens.
  - Replace all tokens in the training and testing corpus that are not in the lexicon with an UNK token.
- Practical approach (sketched below):
  - The lexicon contains all words that appear more than k times in the training corpus.
  - Replace all other tokens with UNK.
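A sketch of this practical recipe, with a hypothetical threshold k and UNK symbol:

```python
from collections import Counter

# Sketch: keep words that appear more than k times in the training corpus,
# map everything else to a hypothetical UNK token.
def build_lexicon(train_sentences, k=1):
    counts = Counter(w for sent in train_sentences for w in sent)
    return {w for w, c in counts.items() if c > k}

def replace_unknown(sentences, lexicon, unk="<UNK>"):
    return [[w if w in lexicon else unk for w in sent] for sent in sentences]
```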
Unseen Contexts
- Two basic approaches:
- Smoothing / Discounting: Move some probability mass from seen trigrams to unseen trigrams.
- Back-off: use (n-1)-gram, (n-2)-gram, … probabilities to compute the n-gram probability.
- Other techniques:
- Class-based backoff: use the back-off probability of a specific word class / part-of-speech.
Zipf’s Law
- Problem: n-grams (and most other linguistic phenomena) follow a Zipfian distribution.
- A few words occur very frequently.
- Most words occur very rarely; many are seen only once.
- Zipf’s law: a word’s frequency is approximately inversely proportional to its rank in the word distribution list.
Smoothing
Smoothing flattens spiky distributions.
Additive Smoothing
Classic approach: Laplace smoothing, a.k.a. additive smoothing.
\begin{equation} P(w_i) = \frac{count(w_i) + 1}{N+V} \end{equation}
- N is the number of tokens
- V is the number of types (i.e. size of the vocabulary)
\begin{equation} P(w|u) = \frac{count(u, w) + 1}{count(u) + V} \end{equation}
Inaccurate in practice.
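A minimal sketch of the add-one (Laplace) smoothed bigram estimate, assuming count dictionaries and a vocabulary like those in the earlier MLE sketch:

```python
# Sketch of add-one (Laplace) smoothed bigram probability,
# p(w|u) = (count(u, w) + 1) / (count(u) + V), where V is the vocabulary size.
def laplace_bigram_prob(u, w, bigram_counts, unigram_counts, vocab):
    V = len(vocab)
    return (bigram_counts.get((u, w), 0) + 1) / (unigram_counts.get(u, 0) + V)
```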
Linear Interpolation
Use the denser distributions of shorter n-grams to “fill in” sparse n-gram distributions.
\begin{equation} P(w|u, v) = \lambda_1 \cdot p_{mle}(w|u,v) + \lambda_2 \cdot p_{mle}(w|v) + \lambda_3 \cdot p_{mle}(w) \end{equation}
- where \(\lambda_1, \lambda_2, \lambda_3 > 0\) and \(\lambda_1 + \lambda_2 + \lambda_3 = 1\).
- Works well in practice (but not a lot of theoretical justification why).
- Parameters can be estimated on development data (for example, using Expectation Maximization).
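A sketch of linear interpolation over precomputed MLE tables; the lambda weights below are illustrative placeholders rather than tuned values:

```python
# Sketch of linear interpolation over precomputed MLE tables for trigrams,
# bigrams, and unigrams. The lambda weights are illustrative placeholders;
# in practice they would be tuned on development data (e.g. with EM).
def interpolated_trigram_prob(w, u, v, p_tri, p_bi, p_uni,
                              lambdas=(0.6, 0.3, 0.1)):
    l1, l2, l3 = lambdas  # must be positive and sum to 1
    return (l1 * p_tri.get((u, v, w), 0.0)
            + l2 * p_bi.get((v, w), 0.0)
            + l3 * p_uni.get(w, 0.0))
```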
Discounting
- Idea: set aside some probability mass, then fill in the missing mass using back-off.
- \(count^*(v, w) = count(v, w) - \beta\) where \(0<\beta<1\).
- Then, for all seen bigrams: \begin{equation} p(w|v) = \frac{count^*(v, w)}{count(v)} \end{equation}
- For each context \(v\), the missing probability mass is \begin{equation} \alpha(v) = 1 - \sum_{w:count(v,w)>0} \frac{count^*(v, w)}{count(v)} \end{equation}
- We can now divide this held-out mass between the unseen words, either evenly or using back-off (see the sketch below).
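A sketch of the discounting step with an illustrative \(\beta\), computing the discounted bigram estimates and the held-out mass \(\alpha(v)\) for each context:

```python
# Sketch of discounting: subtract an illustrative beta from every seen bigram
# count, then compute the held-out mass alpha(v) per context v.
# `bigram_counts` maps (v, w) -> count(v, w); `context_counts` maps v -> count(v).
def discount_and_alpha(bigram_counts, context_counts, beta=0.5):
    disc_probs = {(v, w): (c - beta) / context_counts[v]
                  for (v, w), c in bigram_counts.items()}
    alpha = {}
    for v in context_counts:
        seen_mass = sum(p for (ctx, _), p in disc_probs.items() if ctx == v)
        alpha[v] = 1.0 - seen_mass
    return disc_probs, alpha
```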
Katz’ Backoff
Divide the held-out probability mass proportionally to the unigram probability of the unseen words in context v.
\begin{equation} p(w|v) = \begin{cases} \frac{count^*(v, w)}{count(v)} & \text{if } count(v,w) > 0, \\ \alpha(v) \cdot \frac{p_{mle}(w)}{\sum_{u:count(v,u) = 0} p_{mle}(u)} & \text{otherwise.} \end{cases} \end{equation}
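A sketch of the resulting Katz backoff estimate for bigrams, reusing the discounted probabilities and \(\alpha(v)\) from the previous sketch; the vocabulary and unigram table are assumed inputs:

```python
# Sketch of Katz backoff for bigrams: seen bigrams keep their discounted
# estimate; alpha(v) is split among unseen words in proportion to their
# unigram MLE probabilities. `vocab` and `unigram_probs` are assumed inputs.
def katz_bigram_prob(w, v, disc_probs, alpha, unigram_probs, vocab):
    if (v, w) in disc_probs:
        return disc_probs[(v, w)]
    unseen = [u for u in vocab if (v, u) not in disc_probs]
    denom = sum(unigram_probs.get(u, 0.0) for u in unseen)
    if denom == 0.0:
        return 0.0
    return alpha.get(v, 0.0) * unigram_probs.get(w, 0.0) / denom
```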
Evaluating n-gram models
- Extrinsic evaluation: Apply the model in an application (for example language classification). Evaluate the application.
- Intrinsic evaluation: measure how well the model approximates unseen language data.
- Can compute the probability of each sentence according to the model. Higher probability -> better model.
- Typically we compute Perplexity instead.
Perplexity
- Perplexity (per word) measures how well the n-gram model predicts the sample.
- Given a test corpus of \(m\) sentences \(s_i\), let \(M\) be the total number of tokens in the corpus.
- Perplexity is defined as \(2^{-l}\), where \(l = \frac{1}{M} \sum_{i=1}^m \log_2 p(s_i)\).
- Lower perplexity = better model.
- Intuition:
  - Assume we are predicting one word at a time.
  - Under a uniform distribution, all successor words are equally likely, and the perplexity equals the vocabulary size.
  - Perplexity can therefore be thought of as an “effective vocabulary size”.
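A sketch of per-word perplexity, where sentence_prob is any function returning the model probability of a sentence (for example, the bigram scorer sketched earlier):

```python
import math

# Sketch of per-word perplexity: 2^(-l) with l = (1/M) * sum_i log2 p(s_i),
# where M is the total number of tokens in the test corpus and `sentence_prob`
# is any function returning the model probability of a sentence.
def perplexity(sentences, sentence_prob):
    M = sum(len(sent) for sent in sentences)
    l = sum(math.log2(sentence_prob(sent)) for sent in sentences) / M
    return 2 ** (-l)
```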