Kernel method#
We break down the concept of kernel methods in simple terms.
Kernel Methods and the Kernel Trick#
Input Space and Feature Space:
Input Space (\( \mathcal{X} \)): This is the space where your original data points, \( x \) and \( y \), reside.
Feature Space (\( \mathcal{F} \)): This is a higher-dimensional space where your data points are mapped using a feature mapping function, \( \phi \). So, \( \phi(x) \) and \( \phi(y) \) are the representations of \( x \) and \( y \) in the feature space.
Feature Mapping (\( \phi \)):
The feature mapping function \( \phi: \mathcal{X} \rightarrow \mathcal{F} \) transforms data points from the input space to the feature space.
In many cases, the feature space can be very high-dimensional, even infinite-dimensional, which makes computations directly in \( \mathcal{F} \) impractical.
Inner Product in Feature Space:
When using machine learning algorithms, we often need to compute the inner product between two points in the feature space: \(\left\langle \phi(x) , \phi(y) \right\rangle \).
Kernel Function (\( k \)):
The kernel function \( k \) is defined as \( k(x, y) = \phi(x)^{T} \phi(y) \).
The key idea is that we do not need to compute \( \phi(x) \) and \( \phi(y) \) explicitly. Instead, we directly compute \( k(x, y) \) in the input space.
The Kernel Trick
The kernel trick allows us to work in the high-dimensional feature space implicitly without ever computing the coordinates of the data points in that space. Instead, we compute the inner products using the kernel function.
Why is this useful?
Efficiency:
Computing \( \phi(x) \) and \( \phi(y) \) explicitly can be computationally expensive or infeasible.
Using the kernel function \( k(x, y) \), we can perform the same calculations much more efficiently.
Flexibility:
Kernel methods allow us to use different types of kernels, each corresponding to a different feature space. This flexibility helps in capturing various data structures and relationships.
Common Kernels#
Linear Kernel:
\( k(x, y) = x^{T} y \)
This corresponds to the identity mapping (i.e., \( \phi(x) = x \)).
Polynomial Kernel:
\( k(x, y) = (x^{T} y + c)^d \)
This corresponds to a polynomial feature mapping.
Gaussian (RBF) Kernel:
\( k(x, y) = \exp \left( -\frac{\|x - y\|^2}{2\sigma^2} \right) \)
This corresponds to mapping into an infinite-dimensional feature space.
Sigmoid Kernel:
\( k(x, y) = \tanh(\alpha x^{T} y + c) \)
This corresponds to a feature mapping inspired by neural networks.
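To make these formulas concrete, here is a minimal NumPy sketch of the four kernels above (the parameter defaults `c`, `degree`, `sigma`, and `alpha` are illustrative choices, not part of any definition):

```python
import numpy as np

def linear_kernel(x, y):
    # k(x, y) = x^T y
    return np.dot(x, y)

def polynomial_kernel(x, y, c=1.0, degree=2):
    # k(x, y) = (x^T y + c)^d
    return (np.dot(x, y) + c) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, alpha=0.01, c=0.0):
    # k(x, y) = tanh(alpha x^T y + c)
    return np.tanh(alpha * np.dot(x, y) + c)

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])
print(linear_kernel(x, y), polynomial_kernel(x, y),
      gaussian_kernel(x, y), sigmoid_kernel(x, y))
```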
Finding \( \phi(x) \) From the Kernel#
Example 1: Polynomial kernel
We try to decompose the polynomial kernel \( k(x, y) = (x^T y + c)^2 \) into its feature mapping representation \( \phi(x) \) such that \( k(x, y) = \left\langle \phi(x) , \phi(y) \right\rangle \).
Work Steps
Expand the Polynomial Kernel:
Using the binomial expansion, we get:
\[ (x^T y + c)^2 = (x^T y)^2 + 2c\,(x^T y) + c^2 \]
Identify the Terms: Let’s consider \( x = [x_1, x_2, \ldots, x_d]^T \) and \( y = [y_1, y_2, \ldots, y_d]^T \). The expanded terms can be further decomposed as:
\[ (x^T y)^2 = \sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j\, y_i y_j, \qquad 2c\,(x^T y) = 2c \sum_{i=1}^{d} x_i y_i \]
Feature Mapping (\( \phi(x) \)): To express \( k(x, y) \) as \( \phi(x)^T \phi(y) \), we need to identify a feature mapping \( \phi(x) \) such that the inner product in the feature space gives us the polynomial kernel.
Let’s construct \( \phi(x) \) as follows:
\[ \phi(x) = \left[ x_1^2, \ldots, x_d^2,\ \sqrt{2}\, x_1 x_2, \ldots, \sqrt{2}\, x_{d-1} x_d,\ \sqrt{2c}\, x_1, \ldots, \sqrt{2c}\, x_d,\ c \right]^T \]
Here, \( \phi(x) \) includes:
The squared terms \( x_i^2 \) and the pairwise product terms \( \sqrt{2}\, x_i x_j \) for \( i < j \).
Linear terms \( \sqrt{2c}\, x_i \).
A constant term \( c \).
Verify \( \phi(x)^T \phi(y) \): Now, we compute the inner product \( \phi(x)^T \phi(y) \). This results in:
\[ \phi(x)^T \phi(y) = \sum_{i=1}^{d} x_i^2 y_i^2 + 2 \sum_{i < j} x_i x_j\, y_i y_j + 2c \sum_{i=1}^{d} x_i y_i + c^2 \]
Simplifying:
\[ (x^T y)^2 + 2c\,(x^T y) + c^2 = (x^T y + c)^2 \]
Thus, we have \( k(x, y) = \phi(x)^T \phi(y) \).
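As a quick sanity check, the sketch below builds this explicit degree-2 feature map for a small vector and confirms numerically that \( \phi(x)^T \phi(y) \) equals \( (x^T y + c)^2 \); the helper name `phi_poly2` is purely illustrative:

```python
import numpy as np

def phi_poly2(x, c=1.0):
    """Explicit feature map for k(x, y) = (x^T y + c)^2."""
    d = len(x)
    squares = [x[i] ** 2 for i in range(d)]                 # x_i^2 terms
    crosses = [np.sqrt(2) * x[i] * x[j]                     # sqrt(2) x_i x_j for i < j
               for i in range(d) for j in range(i + 1, d)]
    linears = [np.sqrt(2 * c) * x[i] for i in range(d)]     # sqrt(2c) x_i terms
    return np.array(squares + crosses + linears + [c])      # constant term c

c = 1.0
x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

lhs = phi_poly2(x, c) @ phi_poly2(y, c)   # inner product in feature space
rhs = (x @ y + c) ** 2                    # kernel evaluated in input space
print(lhs, rhs)                           # the two values agree
```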
Example 2: RBF kernel
The Gaussian (RBF) kernel is:
\[ k(x, y) = \exp \left( -\frac{\|x - y\|^2}{2\sigma^2} \right) \]
First, let’s rewrite the squared Euclidean distance:
\[ \|x - y\|^2 = \|x\|^2 - 2\, x^T y + \|y\|^2 \]
So the kernel can be written as:
\[ k(x, y) = \exp \left( -\frac{\|x\|^2 - 2\, x^T y + \|y\|^2}{2\sigma^2} \right) \]
This can be further split into:
\[ k(x, y) = \exp \left( -\frac{\|x\|^2}{2\sigma^2} \right) \exp \left( -\frac{\|y\|^2}{2\sigma^2} \right) \exp \left( \frac{x^T y}{\sigma^2} \right) \]
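This factorization is easy to verify numerically; a small sketch assuming \( \sigma = 1 \):

```python
import numpy as np

sigma = 1.0
x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

# Direct evaluation of the Gaussian kernel.
k_direct = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Factorized form: exp(-||x||^2 / 2s^2) * exp(-||y||^2 / 2s^2) * exp(x^T y / s^2)
k_split = (np.exp(-x @ x / (2 * sigma ** 2))
           * np.exp(-y @ y / (2 * sigma ** 2))
           * np.exp(x @ y / sigma ** 2))

print(k_direct, k_split)  # the two values agree
```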
Homework 1#
Decompose the term \( \exp\left(\frac{x^T y}{\sigma^2}\right) \) and find \( \phi(x) \)
Methods in \( \phi \) space#
Distance in \( \phi(x) \) space#
To compute \(\|\phi(x) - \phi(y)\|^2\) in the feature space and relate it to the Gaussian (RBF) kernel \(k(x, y)\), we need to follow these steps:
Define the Kernel Function: The Gaussian (RBF) kernel is given by \( k(x, y) = \exp \left( -\frac{\|x - y\|^2}{2\sigma^2} \right) \).
Feature Mapping: The Gaussian kernel can be thought of as corresponding to an infinite-dimensional feature space. In this space, the kernel function \(k(x, y)\) can be expressed as the inner product of feature vectors \(\phi(x)\) and \(\phi(y)\): \( k(x, y) = \phi(x)^T \phi(y) \).
Compute \(\|\phi(x) - \phi(y)\|^2\): To find \(\|\phi(x) - \phi(y)\|^2\), we use the fact that:
\[ \|\phi(x) - \phi(y)\|^2 = \|\phi(x)\|^2 + \|\phi(y)\|^2 - 2 \phi(x)^T \phi(y) \]
We need to compute \(\|\phi(x)\|^2\) and \(\|\phi(y)\|^2\) in the feature space.
Norm Squared of \(\phi(x)\) and \(\phi(y)\): Let’s compute \(\|\phi(x)\|^2\):
\[ \|\phi(x)\|^2 = \phi(x)^T \phi(x) = k(x, x) \]
Similarly, \(\|\phi(y)\|^2 = k(y, y)\).
For the Gaussian kernel, \( k(x, x) = \exp(0) = 1 \). Therefore, \(\|\phi(x)\|^2 = 1\) and, similarly, \(\|\phi(y)\|^2 = 1\).
Compute \(\|\phi(x) - \phi(y)\|^2\): Now, using the computed norms and the fact that \(\phi(x)^T \phi(y) = k(x, y)\):
\[ \|\phi(x) - \phi(y)\|^2 = 1 + 1 - 2\, k(x, y) = 2 - 2\, k(x, y) \]
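In other words, distances in the feature space can be computed from kernel evaluations alone. A minimal sketch of this, assuming the Gaussian kernel with \( \sigma = 1 \):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def feature_space_dist_sq(x, y, kernel):
    # ||phi(x) - phi(y)||^2 = k(x, x) + k(y, y) - 2 k(x, y)
    return kernel(x, x) + kernel(y, y) - 2 * kernel(x, y)

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

d2 = feature_space_dist_sq(x, y, rbf)
print(d2, 2 - 2 * rbf(x, y))  # for the RBF kernel these coincide
```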
Fuzzy C-Means clustering in \( \phi \) space#
Kernel Fuzzy C-Means with Kernelization of the Metric
The Kernel Fuzzy C-Means (KFCM) algorithm is an extension of the standard Fuzzy C-Means (FCM) algorithm that uses kernel methods to handle non-linearity by implicitly mapping data into a high-dimensional feature space.
Objective Function
\[ J_{KFCM} = \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ik})^m\, \|\psi(x_k) - \psi(g_i)\|^2 \]
where:
\( u_{ik} \) is the membership degree of object \( x_k \) in cluster \( i \).
\( \psi(x_k) \) and \( \psi(g_i) \) are the feature space representations of data point \( x_k \) and cluster center \( g_i \), respectively.
\( m \) is the fuzziness parameter.
\( \|\psi(x_k) - \psi(g_i)\|^2 \) represents the squared Euclidean distance between the feature space representations of \( x_k \) and \( g_i \).
Using the kernel trick, this distance can be expressed in terms of the kernel function:
\[ \|\psi(x_k) - \psi(g_i)\|^2 = K(x_k, x_k) - 2 K(x_k, g_i) + K(g_i, g_i) \]
where \( K(\cdot, \cdot) \) is the kernel function. For the Gaussian kernel, \( K(x_k, x_k) = K(g_i, g_i) = 1 \), so this becomes:
\[ \|\psi(x_k) - \psi(g_i)\|^2 = 2 \left( 1 - K(x_k, g_i) \right) \]
Substituting this into the objective function, it simplifies to:
\[ J_{KFCM} = 2 \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ik})^m \left( 1 - K(x_k, g_i) \right) \]
Gaussian Kernel
The Gaussian kernel is defined as:
\[ K(x, y) = \exp \left( -\frac{\|x - y\|^2}{2\sigma^2} \right) \]
where \( \sigma^2 \) is the width parameter of the Gaussian kernel. For the Gaussian kernel, \( K(x, x) = 1 \) for every \( x \).
Update Rules for Prototypes and Membership Degrees
Update Prototypes:
To minimize \( J_{KFCM} \) with respect to the cluster prototypes \( g_i \), we take the partial derivatives and set them to zero. The update rule for the prototypes \( g_i \) is derived as:
\[ g_i = \frac{\sum_{k=1}^{n} (u_{ik})^m K(x_k, g_i)\, x_k}{\sum_{k=1}^{n} (u_{ik})^m K(x_k, g_i)} \]
Update Membership Degrees:
To update the membership degrees \( u_{ik} \), we use the method of Lagrange multipliers with the constraints \( \sum_{i=1}^{c} u_{ik} = 1 \) and \( u_{ik} \geq 0 \). The update rule for the membership degrees \( u_{ik} \) is given by:
\[ u_{ik} = \frac{1}{\sum_{h=1}^{c} \left( \frac{1 - K(x_k, g_i)}{1 - K(x_k, g_h)} \right)^{\frac{1}{m-1}}} \]
where \( K(x_k, g_i) \) and \( K(x_k, g_h) \) are kernel evaluations between \( x_k \) and the cluster centers.
Algorithm Steps
1. Initialize the membership matrix \( U \) and the prototypes \( g_i \) randomly.
2. Repeat until convergence:
Update the prototypes \( g_i \) using:
\[ g_i = \frac{\sum_{k=1}^{n} (u_{ik})^m K(x_k, g_i) x_k}{\sum_{k=1}^{n} (u_{ik})^m K(x_k, g_i)} \]
Update the membership degrees \( u_{ik} \) using:
\[ u_{ik} = \frac{1}{\sum_{h=1}^{c} \left( \frac{1 - K(x_k, g_i)}{1 - K(x_k, g_h)} \right)^{\frac{1}{m-1}}} \]
Check for convergence
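A compact NumPy sketch of these steps is given below. It is an illustrative implementation under the assumptions above (Gaussian kernel, prototypes kept in the input space); the function name `kfcm` and the defaults \( m = 2 \), \( \sigma = 1 \) are our own choices:

```python
import numpy as np

def kfcm(X, c, m=2.0, sigma=1.0, n_iter=100, tol=1e-6, seed=0):
    """Kernel Fuzzy C-Means with kernelization of the metric (Gaussian kernel)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                      # memberships sum to 1 for each point
    G = X[rng.choice(n, c, replace=False)]  # initial prototypes drawn from the data

    def K(g, B):                            # Gaussian kernel between one prototype and all rows of B
        return np.exp(-np.sum((B - g) ** 2, axis=1) / (2 * sigma ** 2))

    for _ in range(n_iter):
        # Prototype update: g_i = sum_k u_ik^m K(x_k, g_i) x_k / sum_k u_ik^m K(x_k, g_i)
        W = np.array([U[i] ** m * K(G[i], X) for i in range(c)])
        G = W @ X / W.sum(axis=1, keepdims=True)

        # Membership update: u_ik proportional to (1 - K(x_k, g_i))^(-1/(m-1))
        D = np.array([1.0 - K(G[i], X) for i in range(c)]) + 1e-12
        U_new = D ** (-1.0 / (m - 1))
        U_new /= U_new.sum(axis=0)

        if np.max(np.abs(U_new - U)) < tol:  # convergence check on the memberships
            U = U_new
            break
        U = U_new
    return U, G

# Two well-separated blobs as toy data.
X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (20, 2)),
               np.random.default_rng(2).normal(3, 0.3, (20, 2))])
U, G = kfcm(X, c=2)
print(U.argmax(axis=0))   # hard cluster assignments
```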
Regression in \( \phi \) space#
Objective Function in Feature Space
We start with the objective function of Ordinary Least Squares (OLS) regression in the feature space \(\phi(x)\):
\[ J(w) = \sum_{i=1}^{n} \left( y_i - w^T \phi(x_i) \right)^2 \]
Expanding the squared term:
Taking the Gradient
To find the optimal \(w\), we take the gradient of \(J(w)\) with respect to \(w\):
\[ \nabla_w J(w) = -2 \sum_{i=1}^{n} \left( y_i - w^T \phi(x_i) \right) \phi(x_i) \]
Setting the Gradient to Zero
To find the optimal \(w\), set the gradient to zero:
\[ \sum_{i=1}^{n} \left( y_i - w^T \phi(x_i) \right) \phi(x_i) = 0 \]
Representing \(w\) as a Linear Combination of \(\phi(x_i)\)
We assume \(w\) can be expressed as a linear combination of the feature vectors \(\phi(x_i)\):
\[ w = \sum_{i=1}^{n} \alpha_i\, \phi(x_i) \]
Substitute this into the equation:
\[ \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{n} \alpha_j\, \phi(x_j)^T \phi(x_i) \right) \phi(x_i) = 0 \]
Using the Kernel Trick
Notice that \(\phi(x_i)^T \phi(x_j)\) is the kernel function \(k(x_i, x_j)\). Therefore, this simplifies to:
\[ \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{n} \alpha_j\, k(x_i, x_j) \right) \phi(x_i) = 0 \]
Solving for \(\alpha\)
\( y \) is the column vector of labels \( [y_1, y_2, \ldots, y_n]^T \).
Similarly, let \( \alpha \) be the column vector of coefficients \( [\alpha_1, \alpha_2, \ldots, \alpha_n]^T \). Taking the inner product of the equation above with \(\phi(x_k)\), we reach the equation:
\[ \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{n} \alpha_j\, k(x_i, x_j) \right) k(x_i, x_k) = 0 \]
Assuming the kernel matrix \( K \) (with entries \( K_{ij} = k(x_i, x_j) \)) is invertible, we simplify this to:
\[ \sum_{j=1}^{n} k(x_k, x_j)\, \alpha_j = y_k, \qquad \text{or in matrix form} \qquad K \alpha = y \]
For all \( x_k \) with \( k = 1, \ldots, n \), this gives a system of \( n \) equations; solving it yields \( \alpha \).
Making Predictions
Using the obtained \(\alpha\), the regression function \(f(x)\) for a new input \(x\) is given by:
\[ f(x) = w^T \phi(x) = \sum_{i=1}^{n} \alpha_i\, \phi(x_i)^T \phi(x) \]
By the kernel trick, we have:
\[ f(x) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x) \]
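A minimal sketch of this procedure, using an RBF kernel and adding a tiny ridge to \( K \) for numerical stability (the ridge term is an assumption for the sketch, not part of the derivation above):

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    # K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2))
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

K = rbf_kernel_matrix(X, X)
alpha = np.linalg.solve(K + 1e-8 * np.eye(len(X)), y)    # solve K alpha = y (tiny ridge for stability)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = rbf_kernel_matrix(X_test, X) @ alpha            # f(x) = sum_i alpha_i k(x_i, x)
print(y_pred)
```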
The weight vector norm#
Formulation
Here we discuss the representation of a weight vector \( w \) in terms of kernel functions and dual coefficients, and how these expressions are generally interpreted in kernel methods:
Weight Vector Representation
In kernel methods, the weight vector \( w \) in the feature space can be expressed as a linear combination of the transformed data points. If \( \phi(x_i) \) denotes the feature space transformation of \( x_i \), then:
\[ w = \sum_{i=1}^{n} \alpha_i\, \phi(x_i) \]
where \( \alpha_i \) are the coefficients in the dual space, and \( n \) is the number of training samples.
Inner Product in Feature Space
The inner product of the weight vector \( w \) with itself (its squared norm) can be expressed as:
\[ w^T w = \left( \sum_{i=1}^{n} \alpha_i\, \phi(x_i) \right)^T \left( \sum_{j=1}^{n} \alpha_j\, \phi(x_j) \right) \]
Expanding this expression:
\[ w^T w = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, \phi(x_i)^T \phi(x_j) \]
In kernel methods, the inner product \( \phi(x_i)^T \phi(x_j) \) in the feature space is replaced by the kernel function \( K(x_i, x_j) \). Thus, the weight vector’s squared norm can be written as:
\[ w^T w = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, K(x_i, x_j) \]
Kernel Matrix Representation
The kernel matrix \( K \) is defined as:
\[ K_{ij} = K(x_i, x_j) \]
Therefore, the squared norm of the weight vector can also be expressed as:
\[ w^T w = \alpha^T K \alpha \]
where \( \alpha \) is the vector of coefficients \( \alpha_i \), and \( K \) is the kernel matrix.
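This identity is easy to check numerically when an explicit feature map is available. The sketch below uses the explicit map of the homogeneous quadratic kernel \( k(x, y) = (x^T y)^2 \) in two dimensions as an illustrative example:

```python
import numpy as np

def phi(x):
    # Explicit map for the homogeneous quadratic kernel k(x, y) = (x^T y)^2 in 2-D.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
alpha = rng.normal(size=5)

# Weight vector assembled explicitly in feature space: w = sum_i alpha_i phi(x_i).
w = sum(a * phi(x) for a, x in zip(alpha, X))

# Kernel matrix computed purely in input space: K_ij = (x_i^T x_j)^2.
K = (X @ X.T) ** 2

print(w @ w, alpha @ K @ alpha)   # both give ||w||^2
```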
Normalizing#
The normalization transformation scales the feature vectors so that the kernel function is normalized with respect to the data points. This adjustment can be particularly useful in machine learning algorithms that utilize kernel methods. Let’s analyze the normalization process and derive the normalized kernel function step by step.
Normalization Transformation
Given the transformation:
\[ \hat{\phi}(x) = \frac{\phi(x)}{\|\phi(x)\|} \]
where \( \phi(x) \) is the feature space representation of \( x \), and \( \|\phi(x)\| \) denotes the norm of \( \phi(x) \) in the feature space.
For two data points \( x \) and \( z \), the normalized kernel function \( \hat{\kappa}(x, z) \) is defined as:
Substitute \( \hat{\phi}(x) \) and \( \hat{\phi}(z) \) into the inner product:
Calculating the Normalized Kernel Function
Compute the Inner Product:
Here, \( \langle \phi(x), \phi(z) \rangle \) is the dot product in the feature space, which is equal to the kernel function \( \kappa(x, z) \):
Compute Norms:
The norm of \( \phi(x) \) in the feature space is:
\[ \|\phi(x)\| = \sqrt{\langle \phi(x), \phi(x) \rangle} = \sqrt{\kappa(x, x)} \]
Similarly, \( \|\phi(z)\| = \sqrt{\kappa(z, z)} \).
Substitute Norms:
Substitute these norms into the normalized kernel function:
\[ \hat{\kappa}(x, z) = \frac{\langle \phi(x), \phi(z) \rangle}{\|\phi(x)\|\, \|\phi(z)\|} = \frac{\kappa(x, z)}{\sqrt{\kappa(x, x)}\, \sqrt{\kappa(z, z)}} \]
This can also be written as:
\[ \hat{\kappa}(x, z) = \frac{\kappa(x, z)}{\sqrt{\kappa(x, x)\, \kappa(z, z)}} \]
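In matrix form, normalization divides each entry \( K_{ij} \) by \( \sqrt{K_{ii} K_{jj}} \). A short sketch (the polynomial kernel here is just an illustrative choice):

```python
import numpy as np

def normalize_kernel(K):
    # K_hat_ij = K_ij / sqrt(K_ii * K_jj)
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = (X @ X.T + 1.0) ** 2        # polynomial kernel matrix (illustrative choice)
K_hat = normalize_kernel(K)

print(np.diag(K_hat))           # all ones: every hat-phi(x) has unit norm
```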
Norm and distance from the centre of mass#
This section calculates and interprets the center of mass (or centroid) in the feature space induced by a kernel function, and its properties. Here’s a structured explanation and derivation of the key points:
Center of Mass in Feature Space#
Given a set of data points \( \{ x_i \}_{i=1}^n \) mapped to the feature space \( \phi(x_i) \), the center of mass (or centroid) of the set \( \{\phi(x_i)\} \) is:
\[ \phi_S = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \]
Norm of the Center of Mass
To compute the norm squared of \( \phi_S \), we use \( \|\phi_S\|^2 = \langle \phi_S, \phi_S \rangle \). Substitute \( \phi_S \) into this equation and expand the inner product:
\[ \|\phi_S\|^2 = \left\langle \frac{1}{n} \sum_{i=1}^{n} \phi(x_i),\ \frac{1}{n} \sum_{j=1}^{n} \phi(x_j) \right\rangle = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \langle \phi(x_i), \phi(x_j) \rangle \]
Using the kernel function \( \kappa(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle \), we get:
\[ \|\phi_S\|^2 = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \kappa(x_i, x_j) \]
This quantity is the average of all entries in the kernel matrix \( K \).
Distance from a Point to the Center of Mass
For a data point \( x \), the squared distance from \( \phi(x) \) to \( \phi_S \) is computed using the identity \( \|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\langle a, b \rangle \). Substituting \( \|\phi(x)\|^2 = \kappa(x, x) \), \( \langle \phi(x), \phi_S \rangle = \frac{1}{n} \sum_{i=1}^n \kappa(x, x_i) \), and the expression for \( \|\phi_S\|^2 \) above:
\[ \|\phi(x) - \phi_S\|^2 = \kappa(x, x) - \frac{2}{n} \sum_{i=1}^{n} \kappa(x, x_i) + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \kappa(x_i, x_j) \]
Expected Squared Distance from Center of Mass#
For the set of points \( \{x_s\}_{s=1}^{n} \), the expected (average) squared distance from the center of mass is obtained by averaging the previous expression over the data points and rearranging:
\[ \frac{1}{n} \sum_{s=1}^{n} \|\phi(x_s) - \phi_S\|^2 = \frac{1}{n} \sum_{s=1}^{n} \kappa(x_s, x_s) - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \kappa(x_i, x_j) \]
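All three quantities depend on the data only through the kernel matrix, so they can be computed without ever forming \( \phi \) explicitly. A sketch assuming a Gaussian kernel with \( \sigma = 1 \):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

# Gaussian (RBF) kernel matrix with sigma = 1.
sq = np.sum(X ** 2, axis=1)[:, None] + np.sum(X ** 2, axis=1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / 2.0)

centre_norm_sq = K.mean()                                  # ||phi_S||^2 = average of all kernel entries
dist_sq = np.diag(K) - 2 * K.mean(axis=1) + K.mean()       # ||phi(x_s) - phi_S||^2 for every point
avg_dist_sq = np.diag(K).mean() - K.mean()                 # expected squared distance from the centre

print(centre_norm_sq, dist_sq[:3], avg_dist_sq)
```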
Support Vector Regression#
Given the initial objective function for regression (with squared loss):
and substituting the square loss with the ε-insensitive loss, the objective function becomes:
where \( e_i = y_i - \hat{y}_i = y_i - (w^T x_i + b) \).
Introducing Slack Variables
We introduce slack variables \( \xi_i \) to represent the amount by which each error \( |e_i| \) exceeds \( \epsilon \):
Thus, the objective function can be rewritten as:
Constraints
The slack variables \( \xi_i \) must satisfy the following constraints:
To handle this constraint without using the absolute value, we can split it into two separate constraints:
Additionally, we need to ensure that \( \xi_i \geq 0 \).
Primal Problem
Combining these constraints with the objective function, we get the primal problem:
subject to:
Using Two Slack Variables
To handle positive and negative deviations separately, we introduce two slack variables \( \xi_i \) and \( \xi_i^* \). The formulation is as follows:
Objective Function
Constraints
Final Formulation
Thus, the final formulation of the Support Vector Regression (SVR) with ε-insensitive loss using two slack variables is:
subject to the constraints:
SVR in Feature Space with Kernel Trick#
To implement SVR in the feature space using the kernel trick, we map each input \( x \) to \( \phi(x) \) and replace inner products \( \phi(x)^T \phi(x') \) with a kernel function \( \kappa(x, x') \):
Primal Formulation#
The primal formulation in the feature space \( \phi \) is:
subject to:
Dual Formulation#
The dual formulation of the above problem using Lagrange multipliers \( \alpha_i \) and \( \alpha_i^* \) is:
subject to:
Here, \( \kappa(x_i, x_j) \) is the kernel function that computes the dot product in the high-dimensional feature space.
Final Prediction#
The final prediction for a test point \( x \) is given by:
\[ f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, \kappa(x_i, x) + b \]
where \( \alpha_i \) and \( \alpha_i^* \) are the solutions to the dual problem and \( b \) is the bias term.
This SVR formulation in the feature space allows us to leverage the kernel trick to handle nonlinear regression tasks efficiently.
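If scikit-learn is available, its `SVR` estimator implements this ε-insensitive formulation; a minimal usage sketch with an RBF kernel (the hyperparameter values are arbitrary illustrations):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

# epsilon-insensitive SVR with a Gaussian (RBF) kernel.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5)   # gamma = 1 / (2 sigma^2)
model.fit(X, y)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(model.predict(X_test))
```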
Step-by-Step Derivation of Support Vector Regression (SVR) in Kernel Space#
Objective and Constraints in the Primal Form#
The primal problem for SVR with \(\epsilon\)-insensitive loss can be formulated as:
Objective function:
Subject to constraints:
where \(w\) is the weight vector in the feature space, \(b\) is the bias term, \(\xi_i\) and \(\xi_i^*\) are slack variables for positive and negative deviations, respectively, and \(C\) is a regularization parameter.
Lagrangian and Dual Problem#
To solve the primal problem, we first construct the Lagrangian function. Introduce Lagrange multipliers \(\alpha_i, \alpha_i^*, \eta_i, \eta_i^* \geq 0\) for the constraints:
Stationarity Conditions#
(stationary points are those where the partial derivatives of \( \Lambda \) are zero)
To find the dual problem, we need to set the partial derivatives of the Lagrangian with respect to the primal variables to zero.
With respect to \(w\):
\[ \frac{\partial \Lambda}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, \phi(x_i) \]
With respect to \(b\):
\[ \frac{\partial \Lambda}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0 \]
With respect to \(\xi_i\):
\[ \frac{\partial \Lambda}{\partial \xi_i} = 0 \;\Rightarrow\; C - \alpha_i - \eta_i = 0 \]
With respect to \(\xi_i^*\):
\[ \frac{\partial \Lambda}{\partial \xi_i^*} = 0 \;\Rightarrow\; C - \alpha_i^* - \eta_i^* = 0 \]
Substituting Back#
Substituting the stationarity conditions back into the Lagrangian, we get the dual problem:
Since \( w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \phi(x_i) \), substitute \( w \) into the Lagrangian:
This expands to:
Using the kernel trick \( \kappa(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle \):
Dual Problem#
Combining all the terms, the dual problem becomes:
Maximize:
\[ W(\alpha, \alpha^*) = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, \kappa(x_i, x_j) - \epsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} y_i (\alpha_i - \alpha_i^*) \]
Subject to:
\[ \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0, \qquad 0 \leq \alpha_i,\, \alpha_i^* \leq C \]
Prediction#
The prediction function for a new data point \( x \) is given by:
\[ f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, \kappa(x_i, x) + b \]
where \( b \) can be determined by exploiting the Karush-Kuhn-Tucker (KKT) conditions.
Determining \( b \)#
To determine \( b \), we use the support vectors (data points with \( 0 < \alpha_i < C \) or \( 0 < \alpha_i^* < C \)). For a support vector with \( 0 < \alpha_i < C \):
\[ b = y_i - \sum_{j=1}^{n} (\alpha_j - \alpha_j^*)\, \kappa(x_j, x_i) - \epsilon \]
or, for a support vector with \( 0 < \alpha_i^* < C \):
\[ b = y_i - \sum_{j=1}^{n} (\alpha_j - \alpha_j^*)\, \kappa(x_j, x_i) + \epsilon \]
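Given a dual solution \( (\alpha, \alpha^*) \), \( b \) can be recovered from any unbounded support vector. The sketch below assumes the kernel matrix `K`, the label vector `y`, and the dual vectors are already available; the helper name `compute_bias` and the averaging over all candidates are our own choices:

```python
import numpy as np

def compute_bias(K, y, alpha, alpha_star, C, epsilon, tol=1e-8):
    """Estimate b from the KKT conditions using unbounded support vectors."""
    coef = alpha - alpha_star                 # dual coefficients (alpha_i - alpha_i^*)
    f_no_bias = K @ coef                      # sum_j (alpha_j - alpha_j^*) kappa(x_j, x_i)

    # Points with 0 < alpha_i < C lie on the upper epsilon-boundary: b = y_i - f_i - epsilon
    on_upper = (alpha > tol) & (alpha < C - tol)
    # Points with 0 < alpha_i^* < C lie on the lower epsilon-boundary: b = y_i - f_i + epsilon
    on_lower = (alpha_star > tol) & (alpha_star < C - tol)

    candidates = np.concatenate([
        y[on_upper] - f_no_bias[on_upper] - epsilon,
        y[on_lower] - f_no_bias[on_lower] + epsilon,
    ])
    return candidates.mean()                  # average over candidates to reduce numerical noise
```

Averaging over all candidate support vectors, rather than picking a single one, is a common way to make the estimate of \( b \) less sensitive to numerical error in the dual solution.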