NLP Lecture 02: Naive Bayes’ Classifier
Keywords: Language Classification, Probability Review, Machine Learning Background, Naive Bayes’ Classifier.
Linguistic Terminology
- Sentence: Unit of written language.
- Utterance: Unit of spoken language.
- Word Form: the inflected form of a word as it actually appears in the corpus, e.g. “produced”.
- Word Stem: the part of the word that stays the same across morphological variations, e.g. “produc”.
- Lemma: an abstract base form shared by word forms that have the same stem, part of speech, and word sense; it stands for the class of word forms with that stem, e.g. “produce”.
- Type: a distinct word in the corpus; the number of types is the vocabulary size.
- Token: an individual occurrence of a word; the number of tokens is the total length of the corpus in words.
Tokenization
The process of segmenting text (a sequence of characters) into a sequence of tokens (words).
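As a rough illustration (the regular expression and example sentence are mine, not from the lecture), a minimal tokenizer could look like the sketch below; it also makes the type/token distinction above concrete:

```python
import re

def tokenize(text):
    """Segment a character sequence into word tokens.

    A simple regex that keeps alphabetic words (with an optional internal
    apostrophe) and treats everything else as a delimiter.
    """
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text.lower())

tokens = tokenize("The cats sat on the mat, and the cat purred.")
print(tokens)            # ['the', 'cats', 'sat', 'on', 'the', 'mat', 'and', 'the', 'cat', 'purred']
print(len(tokens))       # 10 tokens
print(len(set(tokens)))  # 8 types
```

Real tokenizers also have to handle punctuation, numbers, hyphenation, and clitics (e.g. “don’t”), which this sketch ignores.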
Lemmatization
Converting word forms into their lemmas (base forms).
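A purely illustrative sketch of lemmatization as a dictionary lookup; the table is a hypothetical fragment, and real lemmatizers use morphological analysis and part-of-speech information:

```python
# Toy lemmatizer: maps inflected word forms to their lemma via a lookup
# table. The entries below are a hypothetical fragment for illustration.
LEMMA_TABLE = {
    "produced": "produce",
    "produces": "produce",
    "producing": "produce",
    "cats": "cat",
    "sat": "sit",
}

def lemmatize(token):
    # Fall back to the word form itself when the table has no entry.
    return LEMMA_TABLE.get(token, token)

print([lemmatize(t) for t in ["the", "cats", "produced", "milk"]])
# ['the', 'cat', 'produce', 'milk']
```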
Probabilities in NLP
Probabilities make it possible to combine evidence from multiple sources in a systematic way (using Bayesian statistics).
Bayesian Statistics
Typically, we observe some evidence (for example, words in a document) and the goal is to infer the “correct” interpretation (for example, the topic of a text). Probabilities express the degree of belief we have in the possible interpretations.
Prior probabilities: Probability of an interpretation prior to seeing any evidence.
Conditional (Posterior) probability: Probability of an interpretation after taking evidence into account.
Probability Basics
- sample space \(\Omega\): the set of all possible outcomes
- random variable \(X\): a function that maps each outcome to a value
- probability distribution \(P(\omega)\): assigns a probability to each outcome, with \(\sum_{\omega \in \Omega} P(\omega) = 1\)
- joint probability \(P(A, B)\): the probability that \(A\) and \(B\) both occur
- conditional probability: \(P(A|B) = \frac{P(A,B)}{P(B)}\)
Bayes’ Rule
\begin{equation} P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \end{equation}
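A small worked example of Bayes’ rule with made-up numbers (a toy spam scenario, not from the lecture):

```python
# A = "document is spam", B = "document contains the word 'free'".
# All numbers below are invented for illustration.
p_A = 0.2           # prior P(A): 20% of documents are spam
p_B_given_A = 0.6   # likelihood P(B|A)
p_B_given_notA = 0.1

# Total probability: P(B) = P(B|A)P(A) + P(B|not-A)P(not-A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))  # 0.6: seeing "free" raises the belief in spam from 0.2 to 0.6
```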
Independence
- Independent:
- \[P(A) = P(A|B)\]
- \[P(A, B) = P(A) \cdot P(B)\]
- Conditionally Independent (given \(A\)):
- \[P(B, C|A) = P(B|A) \cdot P(C|A)\]
- \[P(B|A,C) = P(B|A)\]
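The independence identities can be checked numerically on a toy joint distribution; the sketch below builds a distribution that is independent by construction (all numbers are illustrative):

```python
from itertools import product

# Toy joint distribution over two binary variables A and B, constructed so
# that P(A=a, B=b) = P(A=a) * P(B=b), i.e. A and B are independent.
p_A = {0: 0.7, 1: 0.3}
p_B = {0: 0.4, 1: 0.6}
joint = {(a, b): p_A[a] * p_B[b] for a, b in product(p_A, p_B)}

# Recover the marginals from the joint by summing out the other variable.
marg_A = {a: sum(joint[a, b] for b in p_B) for a in p_A}
marg_B = {b: sum(joint[a, b] for a in p_A) for b in p_B}

# Independence check: P(A, B) == P(A) * P(B) for every cell of the joint.
independent = all(abs(joint[a, b] - marg_A[a] * marg_B[b]) < 1e-12 for a, b in joint)
print(independent)  # True
```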
Probabilities and Supervised Learning
- Given: Training data consisting of training examples \(data = (x_1, y_1), …, (x_n, y_n)\).
- Goal: Learn a mapping \(h\) from \(x\) to \(y\).
- Two approaches:
- Discriminative algorithms learn \(P(y | x)\) directly.
- Generative algorithms use Bayes’ rule \begin{equation} P(y|x) = \frac{P(x|y) \cdot P(y)}{P(x)} \end{equation}
Discriminative Algorithms
- Model the conditional distribution of the label given the data.
- Learn decision boundaries that separate instances of the different classes.
- To predict a new example, check on which side of the decision boundary it falls.
- Examples:
- support vector machine (SVM)
- decision trees
- random forests
- neural networks
- log-linear models
Generative Algorithms
- Assume the observed data is being “generated” by a “hidden” class label.
- Build a different model for each class.
- To predict a new example, evaluate it under each of the models and see which one matches best.
- Estimate \(P(x|y)\) and \(P(y)\), then use Bayes’ rule \begin{equation} P(y|x) = \frac{P(x|y) \cdot P(y)}{P(x)} \end{equation}
- Examples:
- Naive Bayes
- Hidden Markov Models
- Gaussian Mixture Models
- PCFGs
Naive Bayes
Rules
\begin{equation} P(Label, X_1, …, X_d) = P(Label) \prod_i P(X_i|Label) \end{equation}
\begin{equation} \begin{split} P(Label|X_1, …, X_d) & = \frac{P(Label) \prod_i P(X_i|Label)}{P(X_1, …, X_d)} \\ & = \alpha [P(Label) \prod_i P(X_i|Label)] \end{split} \end{equation}
Naive Bayes Classifier
\begin{equation} y^* = \arg \max_y P(y) \prod_i P(x_i|y) \end{equation}
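A minimal sketch of this decision rule, assuming the prior \(P(y)\) and the per-feature likelihoods \(P(x_i|y)\) have already been estimated (for instance as in the training sketch below); all names are illustrative:

```python
def predict(tokens, prior, likelihood):
    """Return argmax_y P(y) * prod_i P(x_i | y) for a tokenized document.

    `prior[y]` holds P(y) and `likelihood[y][x]` holds P(x|y); both are
    assumed to have been estimated beforehand.
    """
    best_label, best_score = None, -1.0
    for y in prior:
        score = prior[y]
        for x in tokens:
            score *= likelihood[y].get(x, 0.0)  # unseen words get probability 0 here
        if score > best_score:
            best_label, best_score = y, score
    return best_label
```

In practice the product of many small probabilities underflows, so implementations typically sum log-probabilities instead of multiplying.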
Training the Naive Bayes’ Classifier
Estimate the prior \(P(y)\) and the class-conditional likelihoods \(P(x_i|y)\) using Maximum Likelihood Estimation (MLE)
\begin{equation} P(y) = \frac{Count(y)}{\sum_{y'\in Y}Count(y')} \end{equation}
\begin{equation} P(x_i|y) = \frac{Count(x_i, y)}{Count(y)} \end{equation}
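A minimal sketch of these MLE estimates on a tiny invented corpus; \(Count(y)\) in the likelihood is read here as the total number of word tokens observed with label \(y\):

```python
from collections import Counter, defaultdict

# Tiny labelled corpus, made up for illustration.
data = [
    (["free", "offer", "click"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["free", "meeting"], "ham"),
]

label_counts = Counter(y for _, y in data)
word_counts = defaultdict(Counter)
for tokens, y in data:
    word_counts[y].update(tokens)

# P(y) = Count(y) / sum_y' Count(y')
prior = {y: c / sum(label_counts.values()) for y, c in label_counts.items()}

# P(x_i|y) = Count(x_i, y) / Count(y), with Count(y) taken as the number of
# word tokens seen with label y, so the estimates sum to 1 over the vocabulary.
likelihood = {
    y: {x: c / sum(counts.values()) for x, c in counts.items()}
    for y, counts in word_counts.items()
}

print(prior)               # spam: 1/3, ham: 2/3
print(likelihood["spam"])  # 'free', 'offer', 'click' each get probability 1/3
```

These counts already hint at the issues below: any word never seen with a label gets probability 0 under that label, which zeroes out the whole product.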
Some Issues to Consider
- What if a word does not appear in the training set? What if it appears only once?
- What if the plural of a word never appears in the training set?
- How are extremely common words (e.g., “the”, “a”) handled?