NLP Lecture 02: Naive Bayes’ Classifier

Keywords: Language Classification, Probability Review, Machine Learning Background, Naive Bayes’ Classifier.

Linguistic Terminology

  • Sentence: Unit of written language.
  • Utterance: Unit of spoken language.
  • Word Form: the inflected form of a word as it actually appears in the corpus, e.g., “produced”.
  • Word Stem: the part of the word that never changes between morphological variations, e.g., “produc”.
  • Lemma: an abstract base form shared by word forms that have the same stem, part of speech, and word sense; it stands for the class of words with that stem, e.g., “produce”.
  • Type: number of distinct words in a corpus (vocabulary size).
  • Token: Total number of word occurrences.

Tokenization

The process of segmenting text (a sequence of characters) into a sequence of tokens (words).
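
Tokenization can be as simple as a handful of regular-expression rules. Below is a minimal sketch; the regular expression and the example sentence are illustrative assumptions, not the tokenizer used in this lecture.

    import re

    def tokenize(text):
        """Segment a character sequence into word tokens.

        A simple rule-based sketch: alphanumeric runs are kept as tokens
        and punctuation marks become separate tokens.
        """
        return re.findall(r"\w+|[^\w\s]", text)

    print(tokenize("The cats produced 3 kittens."))
    # ['The', 'cats', 'produced', '3', 'kittens', '.']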

Lemmatization

Converting word forms into their lemmas (base forms).
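
One practical way to lemmatize is with NLTK’s WordNetLemmatizer; the sketch below assumes the nltk package and its WordNet data are installed.

    from nltk.stem import WordNetLemmatizer  # requires nltk and the 'wordnet' corpus

    lemmatizer = WordNetLemmatizer()

    # Map inflected word forms to their lemmas; pos tells the lemmatizer
    # the part of speech ('v' = verb, 'n' = noun).
    print(lemmatizer.lemmatize("produced", pos="v"))  # produce
    print(lemmatizer.lemmatize("cats", pos="n"))      # cat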

Probabilities in NLP

Probabilities make it possible to combine evidence from multiple sources systematically, using Bayesian statistics.

Bayesian Statistics

Typically, we observe some evidence (for example, words in a document) and the goal is to infer the “correct” interpretation (for example, the topic of a text). Probabilities express the degree of belief we have in the possible interpretations.

  • Prior probabilities: Probability of an interpretation prior to seeing any evidence.

  • Conditional (Posterior) probability: Probability of an interpretation after taking evidence into account.

Probability Basics

  • sample space \(\Omega\)
  • random variable \(X\)
  • probability distribution \(P(\omega)\)
  • joint probability: \(P(A, B)\)
  • conditional probability: \(P(A|B) = \frac{P(A,B)}{P(B)}\)
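
A small worked example with made-up numbers: if the joint probability that a document is about sports (\(A\)) and contains the word “ball” (\(B\)) is \(P(A, B) = 0.06\), and \(P(B) = 0.10\), then

\begin{equation} P(A|B) = \frac{P(A, B)}{P(B)} = \frac{0.06}{0.10} = 0.6 \end{equation}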

Bayes’ Rule

\begin{equation} P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \end{equation}
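
A worked example with made-up numbers: suppose 20% of emails are spam, the word “offer” appears in 50% of spam emails and in 5% of non-spam emails. Expanding the denominator with the law of total probability,

\begin{equation} P(spam \mid offer) = \frac{P(offer \mid spam) \cdot P(spam)}{P(offer)} = \frac{0.5 \times 0.2}{0.5 \times 0.2 + 0.05 \times 0.8} \approx 0.71 \end{equation}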

Independence

  • Independent:
    • \[P(A) = P(A|B)\]
    • \[P(A, B) = P(A) \cdot P(B)\]
  • Conditionally Independent
    • \[P(B, C|A) = P(B|A) \cdot P(C|A)\]
    • \[P(B|A,C) =P(B|A)\]

Probabilities and Supervised Learning

  • Given: Training data consisting of training examples \(data = (x_1, y_1), …, (x_n, y_n)\).
  • Goal: Learn a mapping \(h\) from \(x\) to \(y\).
  • Two approaches:
    • Discriminative algorithms learn \(P(y | x)\) directly.
    • Generative algorithms use Bayes’ rule \begin{equation} P(y|x) = \frac{P(x|y) \cdot P(y)}{P(x)} \end{equation}

Discriminative Algorithms

  • Model the conditional distribution of the label given the data.
  • Learns decision boundaries that separate instances of the different classes.
  • To predict a new example, check on which side of the decision boundary it falls.
  • Examples:
    • support vector machine (SVM)
    • decision trees
    • random forests
    • neural networks
    • log-linear models

Generative Algorithms

  • Assume the observed data is being “generated” by a “hidden” class label.
  • Build a different model for each class.
  • To predict a new example, check it under each of the models and see which one matches best.
  • Estimate \(P(x|y)\) and \(P(y)\). Then use Bayes’ rule \begin{equation} P(y|x) = \frac{P(x|y) \cdot P(y)}{P(x)} \end{equation}
  • Examples:
    • Naive Bayes
    • Hidden Markov Models
    • Gaussian Mixture Models
    • PCFGs

Naive Bayes

Rules

\begin{equation} P(Label, X_1, …, X_d) = P(Label) \prod_i P(X_i|Label) \end{equation}

\begin{equation} \begin{split} P(Label|X_1, …, X_d) & = \frac{P(Label) \prod_i P(X_i|Label)}{P(X_1, …, X_d)} \\ & = \alpha \left[ P(Label) \prod_i P(X_i|Label) \right] \end{split} \end{equation}

Naive Bayes Classifier

\begin{equation} y^* = \arg\max_y P(y) \prod_i P(x_i|y) \end{equation}
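
A minimal sketch of this decision rule in Python, assuming the tables prior (mapping \(y \mapsto P(y)\)) and likelihood (mapping \((x_i, y) \mapsto P(x_i|y)\)) have already been estimated as in the next section. The names and the tiny probability floor for unseen words are illustrative choices, not part of the lecture.

    import math

    def predict(tokens, prior, likelihood):
        """Return arg max over y of P(y) * prod_i P(x_i | y).

        Log-probabilities are summed instead of multiplying raw
        probabilities to avoid numerical underflow on long documents.
        """
        best_label, best_score = None, float("-inf")
        for y, p_y in prior.items():
            score = math.log(p_y)
            for x in tokens:
                # Crude floor for words never seen with label y
                # (see "Some Issues to Consider" below).
                score += math.log(likelihood.get((x, y), 1e-10))
            if score > best_score:
                best_label, best_score = y, score
        return best_label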

Training the Naive Bayes’ Classifier

Estimate the prior \(P(y)\) and the class-conditional (likelihood) probabilities \(P(x_i|y)\) using Maximum Likelihood Estimation (MLE):

\begin{equation} P(y) = \frac{Count(y)}{\sum_{y’\in Y}Count(y’)} \end{equation}

\begin{equation} P(x_i|y) = \frac{Count(x_i, y)}{Count(y)} \end{equation}
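
A minimal training sketch under one common interpretation of the formulas above: the training data is a list of (tokens, label) pairs, and \(Count(y)\) in the denominator of \(P(x_i|y)\) is read as the total number of word tokens observed with label \(y\) (the multinomial variant). All names are illustrative.

    from collections import Counter

    def train_naive_bayes(data):
        """Estimate P(y) and P(x_i | y) by relative frequencies (MLE)."""
        label_counts = Counter()       # Count(y): documents with label y
        word_label_counts = Counter()  # Count(x_i, y)
        tokens_per_label = Counter()   # word tokens observed with label y
        for tokens, y in data:
            label_counts[y] += 1
            for x in tokens:
                word_label_counts[(x, y)] += 1
                tokens_per_label[y] += 1

        n_docs = sum(label_counts.values())
        prior = {y: c / n_docs for y, c in label_counts.items()}
        likelihood = {(x, y): c / tokens_per_label[y]
                      for (x, y), c in word_label_counts.items()}
        return prior, likelihood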

Some Issues to Consider

  • What if there are words that do not appear in the training set? What if a word appears only once?
  • What if the plural of a word never appears in the training set?
  • How are extremely common words (e.g., “the”, “a”) handled?

© 2022 Jiawei Lu. All rights reserved.