Theoretical Aspects of Regression

The generalization error risk

Let \( L: Y \times Y \to \mathbb{R}^+ \) be the loss function used to measure the magnitude of error. Given a hypothesis set \( H \) of functions mapping \( X \) to \( Y \), the regression problem consists of using a labeled sample \( S = ((x_1, y_1), \ldots, (x_m, y_m)) \), where the points \( x_i \) are drawn i.i.d. according to \( D \) and \( y_i = f(x_i) \), to find a hypothesis \( h \in H \) with small expected loss or generalization error \( R(h) \) with respect to the target \( f \):

\[ R(h) = \mathbb{E}_{x \sim D} [L(h(x), f(x))]. \]

The empirical loss or error of \( h \in H \) is denoted by \( R_S(h) \) and defined by

\[ R_S(h) = \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i). \]

In the common case where \( L \) is the squared loss, this represents the mean squared error of \( h \) on the sample set \( S \).
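As a concrete illustration, the NumPy sketch below computes the empirical squared loss \( R_S(h) \) of a hypothesis on a sample and compares it with a Monte Carlo estimate of the generalization error \( R(h) \). The distribution \( D \), the target \( f \), and the hypothesis \( h \) are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Target function (arbitrary choice for illustration).
    return 2.0 * x + 1.0

def h(x):
    # Candidate hypothesis (arbitrary choice for illustration).
    return 1.8 * x + 1.2

# Labeled sample S = ((x_1, y_1), ..., (x_m, y_m)) with x_i drawn from D and y_i = f(x_i).
m = 50
x_sample = rng.uniform(-1.0, 1.0, size=m)
y_sample = f(x_sample)

# Empirical squared loss R_S(h): mean squared error on the sample.
R_S = np.mean((h(x_sample) - y_sample) ** 2)

# Monte Carlo approximation of the generalization error R(h) = E_{x ~ D}[L(h(x), f(x))].
x_big = rng.uniform(-1.0, 1.0, size=1_000_000)
R = np.mean((h(x_big) - f(x_big)) ** 2)

print(f"empirical loss R_S(h) = {R_S:.4f}, estimated generalization error R(h) = {R:.4f}")
```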

When the loss function \( L \) is bounded by some \( M > 0 \), that is \( L(y, y') \le M \) for all \( y, y' \in Y \), or, more strictly, \( L(h(x), f(x)) \le M \) for all \( h \in H \) and \( x \in X \), the problem is referred to as a bounded regression problem. Many of the theoretical results presented in the following sections are based on this assumption. The analysis of unbounded regression problems is technically more elaborate and typically requires additional assumptions.

Theorem

Let \( L \) be a loss function bounded by \( M > 0 \). Assume that the hypothesis set \( H \) is finite. Then, for any \( \delta > 0 \), with probability at least \( 1 - \delta \) over the draw of an i.i.d. sample \( S \) of size \( m \), the following inequality holds for all \( h \in H \):

\[ R(h) \leq R_S(h) + M \sqrt{\frac{\log |H| + \log \frac{1}{\delta}}{2m}}. \]

The theorem bounds the generalization error \( R(h) \) by the empirical error \( R_S(h) \) plus a complexity term that depends on the sample size \( m \), the number of hypotheses \( |H| \) in the finite hypothesis set, the confidence parameter \( \delta \), and the upper bound \( M \) of the loss function. To tighten the bound, we can use a large number of training samples \( m \) and a loss function with a small upper bound \( M \). For instance, the correntropy loss function, which is bounded, can be used instead of the squared loss function to achieve a smaller upper bound.
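To get a feel for how the slack term scales with the sample size, the short NumPy sketch below evaluates \( M \sqrt{(\log |H| + \log \frac{1}{\delta}) / (2m)} \) for a few values of \( m \); the values of \( M \), \( |H| \), and \( \delta \) are arbitrary choices for illustration.

```python
import numpy as np

def complexity_term(m, H_size, delta, M):
    """Slack added to the empirical error in the finite-hypothesis-set bound."""
    return M * np.sqrt((np.log(H_size) + np.log(1.0 / delta)) / (2.0 * m))

# Illustrative values: loss bounded by M = 1, |H| = 1000 hypotheses, confidence 1 - delta = 0.95.
M, H_size, delta = 1.0, 1000, 0.05

for m in [100, 1_000, 10_000, 100_000]:
    print(f"m = {m:>7}: R(h) <= R_S(h) + {complexity_term(m, H_size, delta, M):.4f}")
```

As expected, the slack shrinks at the rate \( O(1/\sqrt{m}) \) as the sample size grows.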

The variance-covariance matrix of the OLS estimator

Consider the linear regression model:

\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \]

where:

  • \(\mathbf{y}\) is an \(n \times 1\) vector of observations,

  • \(\mathbf{X}\) is an \(n \times p\) matrix of predictors (design matrix),

  • \(\boldsymbol{\beta}\) is a \(p \times 1\) vector of unknown parameters,

  • \(\boldsymbol{\epsilon}\) is an \(n \times 1\) vector of errors.

Assume \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})\), meaning the errors are independently and identically distributed normal variables with mean 0 and variance \(\sigma^2\).

OLS Estimator

The OLS estimator \(\hat{\boldsymbol{\beta}}\) is given by:

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \]
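As a minimal numerical sketch of this formula (assuming NumPy; the dimensions, true \(\boldsymbol{\beta}\), and noise level are arbitrary choices made for illustration), the estimator can be computed by solving the normal equations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: n observations, p predictors (arbitrary sizes for illustration).
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
sigma = 0.7
y = X @ beta_true + rng.normal(scale=sigma, size=n)

# OLS estimator: beta_hat = (X^T X)^{-1} X^T y.
# Solving the normal equations is preferred to forming the explicit inverse for numerical stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("beta_hat =", beta_hat)
```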

Variance of \(\hat{\boldsymbol{\beta}}\)

To derive the variance of \(\hat{\boldsymbol{\beta}}\), follow these steps:

  1. Express \(\hat{\boldsymbol{\beta}}\) in terms of \(\boldsymbol{\epsilon}\):

    Since \(\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\),

    \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}) \]

    \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{X} \boldsymbol{\beta} + (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \boldsymbol{\epsilon} \]

    \[ \hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \boldsymbol{\epsilon} \]

  2. Calculate the variance:

    The variance of \(\hat{\boldsymbol{\beta}}\) is:

    \[ \text{Var}(\hat{\boldsymbol{\beta}}) = \text{Var} \left( \boldsymbol{\beta} + (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \boldsymbol{\epsilon} \right) \]

    Since \(\boldsymbol{\beta}\) is a constant vector, the variance is:

    \[ \text{Var}(\hat{\boldsymbol{\beta}}) = \text{Var} \left( (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \boldsymbol{\epsilon} \right) \]

    Let \(\mathbf{A} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T\). Then:

    \[ \text{Var}(\hat{\boldsymbol{\beta}}) = \text{Var}(\mathbf{A} \boldsymbol{\epsilon}) \]

  3. Use the properties of the variance-covariance matrix:

    If \(\mathbf{A}\) is a matrix and \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})\), then:

    \[ \text{Var}(\mathbf{A} \boldsymbol{\epsilon}) = \mathbf{A} \, \text{Var}(\boldsymbol{\epsilon}) \, \mathbf{A}^T \]

    \[ \text{Var}(\mathbf{A} \boldsymbol{\epsilon}) = \mathbf{A} \, (\sigma^2 \mathbf{I}) \, \mathbf{A}^T \]

    \[ \text{Var}(\mathbf{A} \boldsymbol{\epsilon}) = \sigma^2 \mathbf{A} \mathbf{A}^T \]

    Substitute \(\mathbf{A} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T\):

    \[ \text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{X} ((\mathbf{X}^T \mathbf{X})^{-1})^T \]

    Since \((\mathbf{X}^T \mathbf{X})^{-1}\) is symmetric:

    \[ \text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \]

    \[ \text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1} \]

Therefore, the variance-covariance matrix of the OLS estimator \(\hat{\boldsymbol{\beta}}\) is:

\[ \text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1} \]

This result shows that the variance of the OLS estimator depends only on the variance of the error terms (\(\sigma^2\)) and on the design matrix (\(\mathbf{X}\)), through the matrix \((\mathbf{X}^T \mathbf{X})^{-1}\).
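The formula can be checked by simulation: with the design matrix \(\mathbf{X}\) held fixed, redrawing \(\boldsymbol{\epsilon}\) many times and re-estimating \(\boldsymbol{\beta}\) each time should give an empirical covariance of \(\hat{\boldsymbol{\beta}}\) close to \(\sigma^2 (\mathbf{X}^T \mathbf{X})^{-1}\). Below is a minimal NumPy sketch, with an arbitrary small design chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed design matrix and true parameters (arbitrary choices for illustration).
n, p = 50, 2
X = rng.normal(size=(n, p))
beta = np.array([1.0, -1.0])
sigma = 0.5

# Redraw the errors many times and re-estimate beta each time.
n_reps = 20_000
estimates = np.empty((n_reps, p))
for i in range(n_reps):
    eps = rng.normal(scale=sigma, size=n)
    y = X @ beta + eps
    estimates[i] = np.linalg.solve(X.T @ X, X.T @ y)

# Compare the Monte Carlo covariance of beta_hat with the theoretical formula.
empirical_cov = np.cov(estimates, rowvar=False)
theoretical_cov = sigma**2 * np.linalg.inv(X.T @ X)

print("empirical covariance of beta_hat:\n", empirical_cov)
print("sigma^2 (X^T X)^{-1}:\n", theoretical_cov)
```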