Classification#
Classification is a fundamental task in the field of pattern recognition, involving the assignment of input data into predefined categories based on their features.
Mathematical Definition of Classification#
Classification can be mathematically defined as the task of learning a function \( f \) that maps input vectors to discrete output labels. This function \( f \) is learned from a set of training data \( D \).
Given a training dataset:

\[ D = \{ (\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n) \} \]
where:
\( \mathbf{x}_i \in \mathbb{R}^d \) represents the \( i \)-th input vector with \( d \) features.
\( y_i \in \mathbb{R} \) represents the corresponding output label (for classification, \( y_i \) typically takes discrete values).
The objective is to learn a function \( f \) such that:

\[ f(\mathbf{x}_i) \approx y_i \quad \text{for all } i = 1, \ldots, n \]
Here, \( f \) maps an input vector \( \mathbf{x} \) from the \( d \)-dimensional input space \( \mathbb{R}^d \) to a scalar value \( y \) in the output space \( \mathbb{R} \):

\[ f : \mathbb{R}^d \to \mathbb{R} \]
For classification, the function \( f \) usually predicts discrete class labels. The output space \( \mathbb{R} \) can be interpreted differently depending on the type of classification:
Binary Classification: \( y \in \{0, 1\} \) or \( y \in \{-1, 1\} \).
Multi-Class Classification: \( y \in \{1, 2, \ldots, K\} \), where \( K \) is the number of classes.
Multi-Label Classification: Each \( y \) can be a vector of binary values indicating the presence of multiple classes simultaneously.
Example of a Classification Function#
A common example is logistic regression for binary classification. Here, the function \( f \) is defined as:

\[ f(\mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) \]
where:
\( \mathbf{w} \in \mathbb{R}^d \) is the weight vector.
\( b \in \mathbb{R} \) is the bias term.
\( \sigma(z) = \frac{1}{1 + e^{-z}} \) is the sigmoid function that maps the linear combination of features to a probability between 0 and 1.
The predicted class label \( \hat{y} \) is then obtained by thresholding the output of \( f \):

\[\begin{split} \hat{y} = \begin{cases} 1 & \text{if } f(\mathbf{x}) \geq 0.5 \\ 0 & \text{otherwise} \end{cases} \end{split}\]
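This prediction rule can be sketched in a few lines of NumPy (the function names here are illustrative, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    # Map any real value to the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b, threshold=0.5):
    # f(x) = sigma(w . x + b); threshold the probability to get a class label.
    p = sigmoid(np.dot(w, x) + b)
    return 1 if p >= threshold else 0
```

A large positive score \( \mathbf{w}^\top \mathbf{x} + b \) yields a probability near 1 and hence class 1; a large negative score yields class 0.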
Learning the Function \( f \)#
The function \( f \) is learned by optimizing an objective function, such as minimizing the classification error or maximizing the likelihood of the observed data. For example, in logistic regression, the parameters \( \mathbf{w} \) and \( b \) are typically learned by minimizing the cross-entropy loss:

\[ L(\mathbf{w}, b) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log f(\mathbf{x}_i) + (1 - y_i) \log \left( 1 - f(\mathbf{x}_i) \right) \right] \]
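As a minimal sketch of this optimization, the loss can be minimized with plain batch gradient descent (the learning rate and step count below are arbitrary choices for illustration):

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    # Average negative log-likelihood of labels y under predicted probabilities p.
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic(X, y, lr=0.1, steps=1000):
    # Batch gradient descent on the cross-entropy loss of logistic regression.
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad_w = X.T @ (p - y) / n              # dL/dw
        grad_b = np.mean(p - y)                 # dL/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

On linearly separable data the loss decreases steadily as the decision boundary settles between the classes.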
Classification Algorithms#
Numerous algorithms have been developed for classification tasks, each with its strengths and weaknesses. Some popular classification algorithms include:
Fisher's Linear Discriminant: A linear method that projects the data onto the direction that best separates the classes, maximizing between-class variance relative to within-class variance.
Decision Trees: Non-linear models that split data into subsets based on feature values, forming a tree-like structure.
Support Vector Machines (SVM): Powerful models that find the optimal hyperplane to separate different classes.
Bayesian Classifier: Uses Bayes' rule to estimate the probability that a data point belongs to each class from class priors and class-conditional likelihoods (Naive Bayes is a common instance).
k-Nearest Neighbors (k-NN): A simple, instance-based learning algorithm that classifies a data point based on the majority class of its k nearest neighbors.
Neural Networks: Highly flexible models capable of capturing complex patterns, especially useful in deep learning for tasks like image and speech recognition.
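To make one of these concrete, k-NN from the list above fits in a few lines of NumPy. Euclidean distance and an unweighted majority vote are the usual defaults, though not the only choices:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Classify x by the majority label among its k nearest training points.
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]             # majority vote
```

Note that k-NN has no training phase at all: the "model" is simply the stored training set, which is why it is called instance-based learning.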
A simple algorithm for classification#
In binary classification using kernel methods, you can classify a test point \( x \) by comparing its distance to the centers of mass (centroids) of positive and negative examples. Here’s a step-by-step derivation and explanation based on the kernelized distance approach:
Step-by-Step Derivation#
Define the Center of Mass:
Let \( S^+ \) and \( S^- \) be the sets of positive and negative training examples, respectively. Their centers of mass (centroids) in the feature space induced by the map \( \phi \) are:
\[ \phi_S^+ = \frac{1}{|S^+|} \sum_{x_i \in S^+} \phi(x_i) \]
\[ \phi_S^- = \frac{1}{|S^-|} \sum_{x_i \in S^-} \phi(x_i) \]

Distance from the Centroids:
The squared distance from a test point \( x \) to the positive centroid \( \phi_S^+ \) is:

\[ d^+(x) = \|\phi(x) - \phi_S^+\|^2 \]

Similarly, the squared distance from \( x \) to the negative centroid \( \phi_S^- \) is:

\[ d^-(x) = \|\phi(x) - \phi_S^-\|^2 \]

Calculate the Distances:
Using the properties of the kernel function and the centroids, we get:
\[ d^+(x) = \|\phi(x) - \phi_S^+\|^2 = \|\phi(x)\|^2 + \|\phi_S^+\|^2 - 2 \langle \phi(x), \phi_S^+ \rangle \]
\[ d^-(x) = \|\phi(x) - \phi_S^-\|^2 = \|\phi(x)\|^2 + \|\phi_S^-\|^2 - 2 \langle \phi(x), \phi_S^- \rangle \]

Substitute \( \|\phi(x)\|^2 = \kappa(x, x) \), expand the centroid norms as \( \|\phi_S^\pm\|^2 = \frac{1}{|S^\pm|^2} \sum_{i} \sum_{j} \kappa(x_i, x_j) \) over the corresponding set, and expand the inner products:

\[ \langle \phi(x), \phi_S^+ \rangle = \frac{1}{|S^+|} \sum_{x_i \in S^+} \kappa(x, x_i) \]
\[ \langle \phi(x), \phi_S^- \rangle = \frac{1}{|S^-|} \sum_{x_i \in S^-} \kappa(x, x_i) \]

So:
\[ d^+(x) = \kappa(x, x) + \frac{1}{|S^+|^2} \sum_{i=1}^{|S^+|} \sum_{j=1}^{|S^+|} \kappa(x_i, x_j) - \frac{2}{|S^+|} \sum_{i=1}^{|S^+|} \kappa(x, x_i) \]
\[ d^-(x) = \kappa(x, x) + \frac{1}{|S^-|^2} \sum_{i=1}^{|S^-|} \sum_{j=1}^{|S^-|} \kappa(x_i, x_j) - \frac{2}{|S^-|} \sum_{i=1}^{|S^-|} \kappa(x, x_i) \]

Simplify the Classification Rule:
To classify \( x \), compare \( d^+(x) \) and \( d^-(x) \). The classification rule is:
\[\begin{split} h(x) = \begin{cases} +1 & \text{if } d^-(x) > d^+(x) \\ -1 & \text{otherwise} \end{cases} \end{split}\]

We can write this compactly with the sign function as \( h(x) = \text{sgn}(d^-(x) - d^+(x)) \). Substituting the kernel expressions for \( d^+(x) \) and \( d^-(x) \), the \( \kappa(x, x) \) terms cancel, leaving:

\[ h(x) = \text{sgn} \left( \frac{2}{|S^+|} \sum_{i=1}^{|S^+|} \kappa(x, x_i) - \frac{2}{|S^-|} \sum_{i=1}^{|S^-|} \kappa(x, x_i) - \frac{1}{|S^+|^2} \sum_{i=1}^{|S^+|} \sum_{j=1}^{|S^+|} \kappa(x_i, x_j) + \frac{1}{|S^-|^2} \sum_{i=1}^{|S^-|} \sum_{j=1}^{|S^-|} \kappa(x_i, x_j) \right) \]

Final Classification Function:
To simplify, let:
\[ b = \frac{1}{2} \left( \frac{1}{|S^+|^2} \sum_{i=1}^{|S^+|} \sum_{j=1}^{|S^+|} \kappa(x_i, x_j) - \frac{1}{|S^-|^2} \sum_{i=1}^{|S^-|} \sum_{j=1}^{|S^-|} \kappa(x_i, x_j) \right) \]

Therefore:
\[ h(x) = \text{sgn} \left( \frac{1}{|S^+|} \sum_{i=1}^{|S^+|} \kappa(x, x_i) - \frac{1}{|S^-|} \sum_{i=1}^{|S^-|} \kappa(x, x_i) - b \right) \]
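The final classifier above translates almost directly into NumPy. The sketch below assumes an RBF kernel with an arbitrary illustrative bandwidth; any positive-definite kernel \( \kappa \) can be passed in its place:

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    # Gaussian (RBF) kernel: kappa(a, b) = exp(-gamma * ||a - b||^2).
    return np.exp(-gamma * np.sum((a - b) ** 2))

def centroid_classify(x, S_pos, S_neg, kernel=rbf_kernel):
    # h(x) = sgn( mean kernel to positives - mean kernel to negatives - b )
    k_pos = np.mean([kernel(x, xi) for xi in S_pos])
    k_neg = np.mean([kernel(x, xi) for xi in S_neg])
    # The offset b compares the squared norms of the two centroids in feature space.
    b = 0.5 * (np.mean([[kernel(xi, xj) for xj in S_pos] for xi in S_pos])
               - np.mean([[kernel(xi, xj) for xj in S_neg] for xi in S_neg]))
    return 1 if k_pos - k_neg - b > 0 else -1
```

Note that the double sums defining \( b \) depend only on the training data, so in practice \( b \) would be computed once rather than per test point as in this sketch.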