Classification#

Classification is a fundamental task in pattern recognition, involving the assignment of input data to predefined categories based on its features.

Mathematical Definition of Classification#

Classification can be mathematically defined as the task of learning a function \( f \) that maps input vectors to discrete output labels. This function \( f \) is learned from a set of training data \( D \).

Given a training dataset:

\[ D = \{ (\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n) \} \]

where:

  • \( \mathbf{x}_i \in \mathbb{R}^d \) represents the \( i \)-th input vector with \( d \) features.

  • \( y_i \in \mathbb{R} \) represents the corresponding output label; for classification, \( y_i \) takes values in a discrete label set.

The objective is to learn a function \( f \) such that:

\[ y = f(\mathbf{x}) \]

Here, \( f \) maps an input vector \( \mathbf{x} \) from the \( d \)-dimensional input space \( \mathbb{R}^d \) to a scalar value \( y \) in the output space \( \mathbb{R} \):

\[ f: \mathbb{R}^d \rightarrow \mathbb{R} \]

For classification, the function \( f \) usually predicts discrete class labels. The output space \( \mathbb{R} \) can be interpreted differently depending on the type of classification:

  • Binary Classification: \( y \in \{0, 1\} \) or \( y \in \{-1, 1\} \).

  • Multi-Class Classification: \( y \in \{1, 2, \ldots, K\} \), where \( K \) is the number of classes.

  • Multi-Label Classification: Each \( y \) can be a vector of binary values indicating the presence of multiple classes simultaneously.
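
To make these three conventions concrete, here is a small NumPy sketch (the specific labels are illustrative only):

```python
import numpy as np

# Binary classification: one label per sample, in {0, 1}
y_binary = np.array([0, 1, 1, 0])

# Multi-class classification: one label per sample, in {1, ..., K} (here K = 3)
y_multiclass = np.array([1, 3, 2, 1])

# Multi-label classification: a binary indicator per class,
# so a sample can belong to several classes at once
y_multilabel = np.array([[1, 0, 1],   # classes 1 and 3
                         [0, 1, 0]])  # class 2 only
```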

Example of a Classification Function#

A common example is logistic regression for binary classification. Here, the function \( f \) is defined as:

\[ f(\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) \]

where:

  • \( \mathbf{w} \in \mathbb{R}^d \) is the weight vector.

  • \( b \in \mathbb{R} \) is the bias term.

  • \( \sigma(z) = \frac{1}{1 + e^{-z}} \) is the sigmoid function that maps the linear combination of features to a probability between 0 and 1.

The predicted class label \( \hat{y} \) is then obtained by thresholding the output of \( f \):

\[\begin{split} \hat{y} = \begin{cases} 1 & \text{if } f(\mathbf{x}) \geq 0.5 \\ 0 & \text{if } f(\mathbf{x}) < 0.5 \end{cases} \end{split}\]
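
As a concrete illustration, here is a minimal NumPy sketch of this prediction rule (the weight vector and bias below are arbitrary placeholders, not fitted parameters):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b):
    """Predict a binary label by thresholding f(x) = sigma(w^T x + b) at 0.5."""
    p = sigmoid(w @ x + b)
    return 1 if p >= 0.5 else 0

# Arbitrary example parameters for d = 2 features
w = np.array([1.5, -2.0])
b = 0.5
print(predict(np.array([1.0, 0.2]), w, b))  # -> 1, since f(x) is about 0.83
```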

Learning the Function \( f \)#

The function \( f \) is learned by optimizing an objective function, such as minimizing the classification error or maximizing the likelihood of the observed data. For example, in logistic regression, the parameters \( \mathbf{w} \) and \( b \) are typically learned by minimizing the cross-entropy loss:

\[ \text{Loss} = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log(f(\mathbf{x}_i)) + (1 - y_i) \log(1 - f(\mathbf{x}_i)) \right] \]
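
A minimal sketch of this optimization using batch gradient descent (one reasonable choice among many; the learning rate and iteration count below are arbitrary, and no regularization is included):

```python
import numpy as np

def train_logistic(X, y, lr=0.1, n_iter=1000):
    """Fit w and b by batch gradient descent on the cross-entropy loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # f(x_i) for every training point
        # Gradient of the cross-entropy loss with respect to w and b
        grad_w = X.T @ (p - y) / n
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data: label is 1 only when both features are 1 (AND-like)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])
w, b = train_logistic(X, y)
```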

Classification Algorithms#

Numerous algorithms have been developed for classification tasks, each with its strengths and weaknesses. Some popular classification algorithms include:

  • Fisher's Linear Discriminant: A linear model that projects the data onto the direction that best separates the classes, maximizing the between-class variance relative to the within-class variance.

  • Decision Trees: Non-linear models that split data into subsets based on feature values, forming a tree-like structure.

  • Support Vector Machines (SVM): Powerful models that find the optimal hyperplane to separate different classes.

  • Bayesian Classifier: Uses Bayes’ rule to compute the probability that a given data point belongs to each class from the class priors and the class-conditional likelihoods of its features, and predicts the most probable class.

  • k-Nearest Neighbors (k-NN): A simple, instance-based learning algorithm that classifies a data point based on the majority class of its k nearest neighbors.

  • Neural Networks: Highly flexible models capable of capturing complex patterns, especially useful in deep learning for tasks like image and speech recognition.
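
Off-the-shelf implementations of most of these algorithms are widely available; the following quick sketch uses scikit-learn (assuming it is installed; the synthetic dataset and default hyperparameters are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

classifiers = [
    LinearDiscriminantAnalysis(),        # Fisher's linear discriminant
    DecisionTreeClassifier(),
    SVC(),                               # support vector machine
    GaussianNB(),                        # naive Bayesian classifier
    KNeighborsClassifier(n_neighbors=5), # k-NN
]
for clf in classifiers:
    clf.fit(X, y)
    print(type(clf).__name__, clf.score(X, y))  # training accuracy
```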

A Simple Algorithm for Classification#

In binary classification using kernel methods, you can classify a test point \( x \) by comparing its distance to the centers of mass (centroids) of the positive and negative examples in the feature space induced by a feature map \( \phi \), where the kernel \( \kappa(x, x') = \langle \phi(x), \phi(x') \rangle \) computes inner products in that space. Here’s a step-by-step derivation based on this kernelized distance approach:

Step-by-Step Derivation#

  1. Define the Center of Mass:

    Let \( S^+ \) and \( S^- \) be the sets of positive and negative training examples, respectively. The centers of mass (centroids) of the positive and negative examples in feature space, \( \phi_S^+ \) and \( \phi_S^- \), are:

    \[ \phi_S^+ = \frac{1}{|S^+|} \sum_{x_i \in S^+} \phi(x_i) \]
    \[ \phi_S^- = \frac{1}{|S^-|} \sum_{x_i \in S^-} \phi(x_i) \]
  2. Distance from the Centroids:

    The squared distance from a test point \( x \) to the positive centroid \( \phi_S^+ \) is:

    \[ d^+(x) = \|\phi(x) - \phi_S^+\|^2 \]

    Similarly, the squared distance from \( x \) to the negative centroid \( \phi_S^- \) is:

    \[ d^-(x) = \|\phi(x) - \phi_S^-\|^2 \]
  3. Calculate the Distances:

    Expanding the squared norms using the inner product in feature space, we get:

    \[ d^+(x) = \|\phi(x) - \phi_S^+\|^2 = \|\phi(x)\|^2 + \|\phi_S^+\|^2 - 2 \langle \phi(x), \phi_S^+ \rangle \]
    \[ d^-(x) = \|\phi(x) - \phi_S^-\|^2 = \|\phi(x)\|^2 + \|\phi_S^-\|^2 - 2 \langle \phi(x), \phi_S^- \rangle \]

    Substitute \( \|\phi(x)\|^2 = \kappa(x, x) \) and the inner products \( \langle \phi(x), \phi_S^+ \rangle \) and \( \langle \phi(x), \phi_S^- \rangle \):

    \[ \langle \phi(x), \phi_S^+ \rangle = \frac{1}{|S^+|} \sum_{x_i \in S^+} \kappa(x, x_i) \]
    \[ \langle \phi(x), \phi_S^- \rangle = \frac{1}{|S^-|} \sum_{x_i \in S^-} \kappa(x, x_i) \]

    So:

    \[ d^+(x) = \kappa(x, x) + \frac{1}{|S^+|^2} \sum_{x_i \in S^+} \sum_{x_j \in S^+} \kappa(x_i, x_j) - \frac{2}{|S^+|} \sum_{x_i \in S^+} \kappa(x, x_i) \]
    \[ d^-(x) = \kappa(x, x) + \frac{1}{|S^-|^2} \sum_{x_i \in S^-} \sum_{x_j \in S^-} \kappa(x_i, x_j) - \frac{2}{|S^-|} \sum_{x_i \in S^-} \kappa(x, x_i) \]
  4. Simplify the Classification Rule:

    To classify \( x \), compare \( d^+(x) \) and \( d^-(x) \). The classification rule is:

    \[\begin{split} h(x) = \begin{cases} +1 & \text{if } d^-(x) > d^+(x) \\ -1 & \text{otherwise} \end{cases} \end{split}\]

    We can simplify this using the sign function. Since \( \kappa(x, x) \) appears in both \( d^+(x) \) and \( d^-(x) \), it cancels in the difference \( d^-(x) - d^+(x) \), leaving only kernel evaluations:

    \[ h(x) = \text{sgn}\left( d^-(x) - d^+(x) \right) = \text{sgn} \left( \frac{2}{|S^+|} \sum_{x_i \in S^+} \kappa(x, x_i) - \frac{2}{|S^-|} \sum_{x_i \in S^-} \kappa(x, x_i) + \frac{1}{|S^-|^2} \sum_{x_i \in S^-} \sum_{x_j \in S^-} \kappa(x_i, x_j) - \frac{1}{|S^+|^2} \sum_{x_i \in S^+} \sum_{x_j \in S^+} \kappa(x_i, x_j) \right) \]
  5. Final Classification Function:

    Dividing the argument by 2 (which does not change its sign) and collecting the constant terms into an offset \( b \):

    \[ b = \frac{1}{2} \left( \frac{1}{|S^+|^2} \sum_{x_i \in S^+} \sum_{x_j \in S^+} \kappa(x_i, x_j) - \frac{1}{|S^-|^2} \sum_{x_i \in S^-} \sum_{x_j \in S^-} \kappa(x_i, x_j) \right) \]

    Therefore:

    \[ h(x) = \text{sgn} \left( \frac{1}{|S^+|} \sum_{x_i \in S^+} \kappa(x, x_i) - \frac{1}{|S^-|} \sum_{x_i \in S^-} \kappa(x, x_i) - b \right) \]
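
Putting the pieces together, here is a minimal NumPy sketch of this nearest-centroid rule, using a Gaussian (RBF) kernel as one concrete choice of \( \kappa \) (the kernel, its width `gamma`, and the toy data are illustrative assumptions, not prescribed by the derivation):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel: kappa(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def fit_offset(S_pos, S_neg, kernel=rbf_kernel):
    """Offset b = (1/2) * (mean pairwise kernel within S+ minus within S-)."""
    k_pp = np.mean([kernel(xi, xj) for xi in S_pos for xj in S_pos])
    k_nn = np.mean([kernel(xi, xj) for xi in S_neg for xj in S_neg])
    return 0.5 * (k_pp - k_nn)

def classify(x, S_pos, S_neg, b, kernel=rbf_kernel):
    """h(x) = sgn( mean kappa(x, S+) - mean kappa(x, S-) - b )."""
    k_pos = np.mean([kernel(x, xi) for xi in S_pos])
    k_neg = np.mean([kernel(x, xi) for xi in S_neg])
    return 1 if k_pos - k_neg - b > 0 else -1

# Toy data: positives clustered near (1, 1), negatives near (-1, -1)
S_pos = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1]])
S_neg = np.array([[-1.0, -1.0], [-0.8, -1.2], [-1.1, -0.9]])
b = fit_offset(S_pos, S_neg)
print(classify(np.array([0.8, 0.9]), S_pos, S_neg, b))    # -> +1
print(classify(np.array([-0.9, -1.0]), S_pos, S_neg, b))  # -> -1
```

Note that \( b \) depends only on the training data, so it can be precomputed once; each prediction then costs \( |S^+| + |S^-| \) kernel evaluations.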