Bayesian Estimation#

Bayesian estimation is a statistical technique based on Bayes’ rule. It combines prior knowledge (the prior distribution) with observed data (the likelihood) to form updated knowledge (the posterior distribution).

Bayes’ Rule#

\[ P(\theta | \mathbf{D}) = \frac{P(\mathbf{D} | \theta) P(\theta)}{P(\mathbf{D})} \]

where:

  • \( \theta \) is the parameter or hypothesis.

  • \( \mathbf{D} \) is the observed data (the data set).

  • \( P(\theta | \mathbf{D}) \) is the posterior probability, the probability of the parameter given the data.

  • \( P(\mathbf{D} | \theta) \) is the likelihood, the probability of the data given the parameter.

  • \( P(\theta) \) is the prior probability, the initial belief about the parameter before observing the data.

  • \( P(\mathbf{D}) \) is the total probability of the data.
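As a minimal numeric sketch of the rule, consider a hypothetical coin-flip example (the grid of candidate values and the observed counts below are illustrative assumptions, not from the text):

```python
import numpy as np

# Hypothetical coin-flip example: infer the heads-probability theta.
theta = np.array([0.2, 0.5, 0.8])          # candidate parameter values
prior = np.array([1/3, 1/3, 1/3])          # P(theta): uniform prior belief

heads, flips = 7, 10                        # observed data D: 7 heads in 10 flips
likelihood = theta**heads * (1 - theta)**(flips - heads)   # P(D | theta)

evidence = np.sum(likelihood * prior)       # P(D): total probability of the data
posterior = likelihood * prior / evidence   # P(theta | D) via Bayes' rule

print(posterior)                            # sums to 1; mass shifts toward 0.8
```

After seeing 7 heads in 10 flips, the posterior concentrates on the candidate closest to the empirical frequency.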

Prior Distribution#

The prior distribution \( P(\theta) \) represents our beliefs about the parameter \( \theta \) before observing any data. The choice of the prior can be subjective or objective.

Source of Prior Knowledge: The notion that prior knowledge must be received from a divine or transcendent source, such as God, touches on epistemological and metaphysical questions:

Epistemological and Metaphysical Perspective:#

From an epistemological and metaphysical standpoint, prior knowledge can be seen as originating from various sources, including:

  • intuition

  • previous experience

  • expert opinion

  • theoretical consideration

Risk minimization in the Bayesian perspective#

Steps for Bayesian Risk Minimization

Determine the Posterior Distribution: Compute the posterior distribution \(p(\theta | X)\) using Bayes’ Rule.

\[ p(\theta | X) = \frac{p(X | \theta) p(\theta)}{p(X)} \]

Define the Loss Function: Choose an appropriate loss function \(L(\theta, \theta^{*})\) based on the problem context.

Compute the Expected Posterior Loss: Integrate the loss function over the posterior distribution to get the expected loss for each possible action.

\[ R(\theta^{*} | X) = \int L(\theta, \theta^{*}) p(\theta | X) \, d\theta \]

Minimize the Expected Loss: Select the estimate \(\theta^{*}\) that minimizes the expected posterior loss.

\[ \theta^{optimal} = \arg\min_\theta R(\theta | X) \]

Likelihood#

The likelihood function \( P(\mathbf{D} | \theta) \) describes how likely the observed data \( \mathbf{D} \) is for different values of the parameter \( \theta \). It is a function of \( \theta \) given the data \( \mathbf{D} \).

Posterior Distribution#

The posterior distribution \( P(\theta | \mathbf{D}) \) combines the prior distribution and the likelihood to give the updated belief about the parameter \( \theta \) after observing the data \( \mathbf{D} \).

Marginal Likelihood (Evidence)#

The marginal likelihood \( P(\mathbf{D}) \) is the normalizing constant ensuring that the posterior distribution is a proper probability distribution. It is computed as:

\[ P(\mathbf{D}) = \int P(\mathbf{D} | \theta) P(\theta) \, d\theta \]

Bayesian Estimator#

The Bayesian estimator is a point estimate derived from the posterior distribution. Common Bayesian estimators include:

  • Maximum A Posteriori (MAP) Estimate: The mode of the posterior distribution, which maximizes \( P(\theta | \mathbf{D}) \).

\[ \hat{\theta}_{MAP} = \arg\max_\theta P(\theta | \mathbf{D}) \]

The proof is as follows.

Zero-One Loss#

Let’s substitute \( L(\theta, \theta^{*}) = 1 - \delta(\theta, \theta^{*}) \) into the integral expression for \( R(\theta^{*} | X) \).

Given:

\[ R(\theta^{*} | X) = \int L(\theta, \theta^{*}) p(\theta | X) \, d\theta \]

Substitute \( L(\theta, \theta^{*}) = 1 - \delta(\theta, \theta^{*}) \):

\[ R(\theta^{*} | X) = \int (1 - \delta(\theta, \theta^{*})) p(\theta | X) \, d\theta \]

Now, let’s break this down into two separate integrals:

\[ R(\theta^{*} | X) = \int p(\theta | X) \, d\theta - \int \delta(\theta, \theta^{*}) p(\theta | X) \, d\theta \]

The first term is the integral of the probability density function \( p(\theta | X) \) over the entire domain of \( \theta \), which is equal to 1 (since it is a probability density function):

\[ \int p(\theta | X) \, d\theta = 1 \]

The second term involves the Dirac delta function, which concentrates all of its mass at \( \theta = \theta^{*} \). By its sifting property, the integral evaluates \( p(\theta | X) \) at \( \theta = \theta^{*} \):

\[ \int \delta(\theta, \theta^{*}) p(\theta | X) \, d\theta = p(\theta^{*} | X) \]

Putting it all together:

\[ R(\theta^{*} | X) = 1 - p(\theta^{*} | X) \]


Bayesian Risk Minimization

To find the value of \(\theta^*\) that minimizes \(R(\theta^* | X)\), we can set up the optimization problem as follows:

\[ \theta^* =\arg \min_{\theta} R(\theta | X) \]

Since we have:

\[ R(\theta^* | X) = 1 - p(\theta^* | X) \]

we want to minimize \(1 - p(\theta^* | X)\). Minimizing this expression is equivalent to maximizing \(p(\theta^* | X)\) because 1 is a constant and does not affect the optimization.

Therefore, we have:

\[ \arg \min_{\theta^*} (1 - p(\theta^* | X)) = \arg \max_{\theta^*} p(\theta^* | X) \]

So the value of \(\theta^*\) that minimizes \(R(\theta^* | X)\) is the same as the value of \(\theta^*\) that maximizes \(p(\theta^* | X)\):

\[ \theta^* = \arg \max_{\theta} p(\theta | X) \]

This is often referred to as the maximum a posteriori (MAP) estimate in Bayesian inference.
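On a grid, minimizing the zero-one risk \(1 - p(\theta^{*} | X)\) indeed returns the posterior mode; the Beta-shaped posterior below is an assumed example:

```python
import numpy as np

theta = np.linspace(0, 1, 101)               # grid of candidate estimates
post = theta**7 * (1 - theta)**3             # assumed Beta(8, 4)-shaped posterior
post /= post.sum()                           # normalize over the grid

risk = 1 - post                              # R(theta* | X) = 1 - p(theta* | X)
theta_map = theta[np.argmin(risk)]           # minimizer of the zero-one risk
print(theta_map)                             # the posterior mode (MAP estimate)
```

The minimizer coincides with the grid point where the posterior is largest, i.e. the mode (7 − 1)/(8 + 4 − 2) = 0.7 of a Beta(8, 4).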

  • Posterior Mean: The expected value of the posterior distribution.

\[ \hat{\theta}_{mean} = \mathbb{E}[\theta | \mathbf{D}] = \int \theta P(\theta | \mathbf{D}) \, d\theta \]

Square Loss#

Let’s substitute \( L(\theta, \theta^{*}) = |\theta-\theta^{*}|^2 \) into the integral expression for \( R(\theta^{*} | X) \).

Given:

\[ R(\theta^{*} | X) = \int |\theta-\theta^{*}|^2 p(\theta | X) \, d\theta \]

To find the partial derivative of \( R(\theta^* | X) = \int |\theta - \theta^*|^2 p(\theta | X) \, d\theta \) with respect to \(\theta^*\), we can follow these steps:

Define the function \( R(\theta^* | X) \):

\[ R(\theta^* | X) = \int |\theta - \theta^*|^2 p(\theta | X) \, d\theta \]

Compute the partial derivative with respect to \(\theta^*\): denote the integrand as \( f(\theta, \theta^*) = |\theta - \theta^*|^2 \).

Use the Leibniz integral rule: The Leibniz rule for differentiation under the integral sign allows us to differentiate an integral with respect to a parameter:

\[ \frac{\partial}{\partial \theta^*} R(\theta^* | X) = \frac{\partial}{\partial \theta^*} \int |\theta - \theta^*|^2 p(\theta | X) \, d\theta \]

Apply the derivative inside the integral:

\[ \frac{\partial}{\partial \theta^*} R(\theta^* | X) = \int (-2\theta + 2\theta^*) p(\theta | X) \, d\theta \]

Simplify the expression:

\[ \frac{\partial}{\partial \theta^*} R(\theta^* | X) = 2 \int (\theta^* - \theta) p(\theta | X) \, d\theta \]

Thus, the partial derivative of \( R(\theta^* | X) \) with respect to \(\theta^*\) is:

\[ \frac{\partial}{\partial \theta^*} R(\theta^* | X) = 2 \int (\theta^* - \theta) p(\theta | X) \, d\theta = 0 \]

Setting the derivative to zero and using \( \int p(\theta | X) \, d\theta = 1 \) gives \( \theta^* = \int \theta \, p(\theta | X) \, d\theta \): under squared loss, the optimal estimate is the posterior mean.

\[ \hat{\theta}_{mean} = \mathbb{E}[\theta | X] = \int \theta \, p(\theta | X) \, d\theta \]
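A quick Monte Carlo check of this result (the N(2, 1.5²) posterior below is an assumed example): the squared-loss risk is minimized at the candidate closest to the posterior mean.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(2.0, 1.5, size=100_000)   # draws from an assumed posterior

# Monte Carlo estimate of R(theta* | X) = E[(theta - theta*)^2] per candidate.
candidates = np.linspace(0, 4, 81)
risks = [np.mean((samples - ts) ** 2) for ts in candidates]
best = candidates[np.argmin(risks)]

print(best, samples.mean())                    # the minimizer tracks the mean
```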

Correntropy Loss#

Let’s continue by substituting \(L(\theta, \theta^{*}) = 1 - \exp\left(\frac{-|\theta - \theta^*|^2}{2\sigma^2}\right)\) into the integral expression for \( R(\theta^* | X) \).

Given:

\[ R(\theta^* | X) = \int L(\theta, \theta^{*}) p(\theta | X) \, d\theta \]
\[ R(\theta^* | X) = \int \left(1 - \exp\left(\frac{-|\theta - \theta^*|^2}{2\sigma^2}\right)\right) p(\theta | X) \, d\theta \]

Now, let’s differentiate \( R(\theta^* | X) \) with respect to \( \theta^* \).

\[ \frac{\partial}{\partial \theta^*} R(\theta^* | X) = \frac{\partial}{\partial \theta^*} \int \left(1 - \exp\left(\frac{-|\theta - \theta^*|^2}{2\sigma^2}\right)\right) p(\theta | X) \, d\theta \]

Since \( 1 \) is a constant and does not depend on \( \theta^* \), we only need to differentiate the exponential term:

\[ \frac{\partial}{\partial \theta^*} R(\theta^* | X) = -\int \frac{\partial}{\partial \theta^*} \exp\left(\frac{-|\theta - \theta^*|^2}{2\sigma^2}\right) p(\theta | X) \, d\theta \]

Differentiate the exponential term:

To differentiate \(\exp\left(\frac{-|\theta - \theta^*|^2}{2\sigma^2}\right)\) with respect to \(\theta^*\):

\[ \frac{\partial}{\partial \theta^*} \exp\left(\frac{-|\theta - \theta^*|^2}{2\sigma^2}\right) = \exp\left(\frac{-|\theta - \theta^*|^2}{2\sigma^2}\right) \cdot \frac{\partial}{\partial \theta^*} \left(\frac{-|\theta - \theta^*|^2}{2\sigma^2}\right) \]

Differentiate the inner term:

\[ \frac{\partial}{\partial \theta^*} \left(\frac{-|\theta - \theta^*|^2}{2\sigma^2}\right) = \frac{-1}{2\sigma^2} \cdot 2(\theta^* - \theta) = \frac{-(\theta^* - \theta)}{\sigma^2} \]

Putting it all together:

\[ \frac{\partial}{\partial \theta^*} \exp\left(\frac{-|\theta - \theta^*|^2}{2\sigma^2}\right) = \exp\left(\frac{-|\theta - \theta^*|^2}{2\sigma^2}\right) \cdot \frac{-(\theta^* - \theta)}{\sigma^2} \]

Substitute back into the integral:

\[ \frac{\partial}{\partial \theta^*} R(\theta^* | X) = -\int \exp\left(\frac{-|\theta - \theta^*|^2}{2\sigma^2}\right) \cdot \frac{-(\theta^* - \theta)}{\sigma^2} p(\theta | X) \, d\theta \]

Simplify the expression:

\[ \frac{\partial}{\partial \theta^*} R(\theta^* | X) = \frac{1}{\sigma^2} \int (\theta^* - \theta) \exp\left(\frac{-|\theta - \theta^*|^2}{2\sigma^2}\right) p(\theta | X) \, d\theta \]


To simplify and solve for \(\theta^*\), define the weight function:

\[ \exp\left(\frac{-|\theta - \theta^*|^2}{2\sigma^2}\right) = w(\theta, \theta^*) \]

Treating \(w(\theta, \theta^*)\) as fixed, the partial derivative of \( R(\theta^* | X) \) with respect to \(\theta^*\) is:

\[ \frac{\partial}{\partial \theta^*} R(\theta^* | X) = \frac{1}{\sigma^2} \int (\theta^* - \theta) w(\theta, \theta^*) p(\theta | X) \, d\theta \]

Setting the derivative equal to zero to find the critical points:

\[ \frac{1}{\sigma^2} \int (\theta^* - \theta) w(\theta, \theta^*) p(\theta | X) \, d\theta = 0 \]

Since \(\frac{1}{\sigma^2}\) is a constant and \(\sigma^2 \neq 0\), we can ignore it in the equation:

\[ \int (\theta^* - \theta) w(\theta, \theta^*) p(\theta | X) \, d\theta = 0 \]

Now, factor out \(\theta^*\) from the integral:

\[ \theta^* \int w(\theta, \theta^*) p(\theta | X) \, d\theta - \int \theta w(\theta, \theta^*) p(\theta | X) \, d\theta = 0 \]

Thus,

\[ \theta^* = \frac{\int \theta w(\theta, \theta^*) p(\theta | X) \, d\theta}{\int w(\theta, \theta^*) p(\theta | X) \, d\theta} \]
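Because \(w(\theta, \theta^*)\) itself depends on \(\theta^*\), this is a fixed-point equation; a natural way to solve it is to iterate, recomputing the weights at the current estimate. A sketch on posterior samples (the two-component sample set and the kernel width \(\sigma\) are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
samples = np.concatenate([rng.normal(0.0, 0.5, 900),    # main posterior mass
                          rng.normal(8.0, 0.5, 100)])   # distant "outlier" mode
sigma = 1.0                                  # assumed kernel width

theta_star = samples.mean()                  # initialize at the posterior mean
for _ in range(50):
    # w(theta, theta*) = exp(-|theta - theta*|^2 / (2 sigma^2))
    w = np.exp(-(samples - theta_star) ** 2 / (2 * sigma ** 2))
    theta_star = np.sum(samples * w) / np.sum(w)   # weighted-mean fixed point

print(theta_star)                            # settles near the main mode at 0
```

Unlike the posterior mean, the correntropy estimate downweights the distant mode, illustrating the robustness of this loss to outlying posterior mass.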

Example: Bayesian Estimation for Gaussian Distribution#

Let’s consider a simple example where the observed data \( \mathbf{D} = \{x_1, x_2, \ldots, x_n\} \) is assumed to come from a Gaussian distribution with unknown mean \( \mu \) and known variance \( \sigma^2 \).

Prior Distribution#

Assume a normal prior for the mean \( \mu \):

\[ \mu \sim \mathcal{N}(\mu_0, \sigma_0^2) \]

Likelihood#

The likelihood of the independent observed data given \( \mu \) is:

\[ P(\mathbf{D} | \mu) = P(x_1, \ldots, x_n | \mu) = P(x_1 | \mu) \cdots P(x_n | \mu) = \prod_{i=1}^n \mathcal{N}(x_i | \mu, \sigma^2) \]

Bayes’ Rule:

\[ P(\mu | \mathbf{D}) = \frac{P(\mathbf{D} | \mu) P(\mu)}{P(\mathbf{D})} \]

Prior Distribution \(P(\mu)\):

Assume the prior distribution of \(\mu\) is normal:

\[ \mu \sim \mathcal{N}(\mu_0, \sigma_0^2) \]

So,

\[ P(\mu) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left(-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}\right) \]

Likelihood \(P(\mathbf{D} | \mu)\):

Assume the data \(\mathbf{D} = \{x_1, x_2, \ldots, x_n\}\) are i.i.d. samples from a normal distribution with mean \(\mu\) and known variance \(\sigma^2\):

\[ x_i \sim \mathcal{N}(\mu, \sigma^2) \]

The likelihood function is:

\[ P(\mathbf{D} | \mu) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2}\right) \]

Simplify further using the sample mean \(\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i\) and the identity \(\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2\): the only factor that depends on \(\mu\) is \(\exp\left(-\frac{n(\mu - \bar{x})^2}{2\sigma^2}\right)\).

Posterior Distribution \(P(\mu | \mathbf{D})\):

Combine the prior and the likelihood:

\[ P(\mu | \mathbf{D}) \propto P(\mathbf{D} | \mu) P(\mu) \]

Homework: Continue and obtain the following#

Resulting Posterior Distribution:

The resulting posterior distribution is a normal distribution with:

  • Mean: \(\mu_{\text{post}} = \frac{\sigma^2\mu_0 + n\sigma_0^2\bar{x}}{\sigma^2 + n\sigma_0^2}\)

  • Variance: \(\sigma^2_{\text{post}} = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + n\sigma_0^2}\)

Therefore, the posterior distribution of \(\mu\) given the data \(\mathbf{D}\) is:

\[ \mu | \mathbf{D} \sim \mathcal{N}\left(\frac{\sigma^2 \mu_0 + n \sigma_0^2 \bar{x}}{\sigma^2 + n \sigma_0^2}, \frac{\sigma^2 \sigma_0^2}{\sigma^2 + n \sigma_0^2}\right) \]
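A numerical sketch of these closed-form posterior parameters on synthetic data (the true mean, variances, and sample size below are all assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
mu_true, sigma = 3.0, 2.0                  # true mean, known observation std
mu0, sigma0 = 0.0, 1.0                     # prior N(mu0, sigma0^2) on mu
x = rng.normal(mu_true, sigma, size=50)    # observed data D
n, xbar = len(x), x.mean()

# Closed-form posterior parameters from the formulas above.
mu_post = (sigma**2 * mu0 + n * sigma0**2 * xbar) / (sigma**2 + n * sigma0**2)
var_post = (sigma**2 * sigma0**2) / (sigma**2 + n * sigma0**2)

print(mu_post, var_post)   # mean shrinks xbar toward mu0; variance shrinks with n
```

The posterior mean is a precision-weighted compromise between the prior mean and the sample mean, and the posterior variance shrinks toward zero as \(n\) grows.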