Ensemble Learning#

Introduction to Ensemble Learning:

Ensemble learning is a powerful technique in machine learning where multiple models, often called “weak learners,” are combined to create a stronger, more accurate model. The main idea is that by aggregating the predictions of several models, we can achieve better performance than any single model could on its own.

Two popular and simple approaches to ensemble learning are Bagging and Boosting:

Bagging (Bootstrap Aggregating):

  • Bagging involves training multiple models independently on different random subsets of the training data. The final prediction is made by averaging the predictions of these models (for regression) or by voting (for classification). This method helps to reduce the variance of the model and prevent overfitting, particularly with high-variance models like decision trees.

Boosting:

  • Boosting, in contrast, trains models sequentially, with each new model focusing on correcting the errors made by the previous ones. The models are combined in such a way that the final ensemble model is a weighted sum of the individual models, where more weight is given to models that perform better. This process reduces both bias and variance, resulting in a strong overall model.
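As a quick illustration of the two approaches, the minimal sketch below compares a bagged ensemble and a boosted ensemble of decision trees on a synthetic dataset using scikit-learn (the dataset, tree depths, and number of estimators are arbitrary choices for illustration, not part of the definitions above).

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem (arbitrary choice, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Bagging: independently trained trees on bootstrap samples, combined by voting
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: shallow trees trained sequentially, combined by weighted voting
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50,
                              algorithm='SAMME', random_state=0)

for name, model in [('Bagging', bagging), ('Boosting', boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy = {scores.mean():.3f}")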

Boosting#

Suppose we have a weak classifier denoted as $h_1$, which is not performing well. The goal of boosting is to correct the mistakes made by $h_1$. To do this, we derive a new distribution $D'$ from $D$ that emphasizes the instances where $h_1$ makes errors. Using this new distribution $D'$, we train a new classifier $h_2$.

By appropriately combining $h_1$ and $h_2$, we aim to create a better classifier. If the results are still unsatisfactory, we update the distribution and train a third classifier $h_3$. Combining $h_1$, $h_2$, and $h_3$ should result in a more accurate classifier, which is the final output of the algorithm.

Follow the algorithm below:

Input: Sample distribution $D$;
Base learning algorithm $L$;
Number of learning rounds $T$.

Process:

  1. $D_1 = D$ % Initialize the distribution.

  2. For $t = 1, \ldots, T$:

    • $h_t = L(D_t)$ % Train a weak learner from distribution $D_t$.

    • $\epsilon_t = P_{x \sim D_t}\big(h_t(x) \neq y\big)$ % Evaluate the error of $h_t$.

    • $D_{t+1} = \text{Adjust\_Distribution}(D_t, \epsilon_t)$

  3. Output: $H(x) = \text{Combine\_Outputs}\big(\{h_1(x), \ldots, h_T(x)\}\big)$

Homework: Correct the following code#

The base learner should be a Bayesian classifier, such as Gaussian Naive Bayes.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the MNIST dataset from a local .npz file
mnist_data = np.load('..//mnist.npz')

# Extract the training and test sets
x_train = mnist_data['x_train']
y_train = mnist_data['y_train']
x_test = mnist_data['x_test']
y_test = mnist_data['y_test']

# Preprocess the data
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
x_train = x_train.reshape(-1, 784)
x_test = x_test.reshape(-1, 784)

# The dataset is already for digits 0 through 9, no need to filter
# The following are placeholders to ensure the training and test sets include all digits.

def base_learner(D_t):
    # Extract features and labels from the weighted distribution
    X = np.array([x for x, y, w in D_t])
    y = np.array([y for x, y, w in D_t])
    sample_weights = np.array([w for x, y, w in D_t])
    
    # Train a Naive Bayes classifier
    clf = GaussianNB()
    clf.fit(X, y, sample_weight=sample_weights)
    
    # Return a function that makes predictions using the trained model
    return lambda x: clf.predict([x])[0], clf

def adjust_distribution(D_t, h_t, epsilon_t):
    # Adjust the distribution based on the error rate epsilon_t
    new_D = []
    for x, y, w in D_t:
        if h_t(x) != y:
            w *= np.exp(epsilon_t)
        new_D.append((x, y, w))
    
    # Normalize weights
    total_weight = sum(w for _, _, w in new_D)
    new_D = [(x, y, w/total_weight) for x, y, w in new_D]
    
    return new_D

def combine_outputs(weak_learners, x):
    # Majority voting for the final classifier
    votes = np.zeros(10)  # For 10 classes
    for h in weak_learners:
        prediction = h(x)
        votes[prediction] += 1
    return np.argmax(votes)

def boosting(D, L, T):
    # Initialize distribution with equal weights
    D_t = [(x, y, 1/len(D)) for x, y in D]
    weak_learners = []
    base_learner_accuracies = []

    for t in range(T):
        # Train a weak learner on the current distribution
        h_t, clf = L(D_t)
        weak_learners.append(h_t)
        
        # Evaluate the error of the weak learner
        epsilon_t = sum(w for (x, y, w) in D_t if h_t(x) != y)
        
        # Calculate accuracy of the current weak learner
        predictions = clf.predict(x_train)
        accuracy = accuracy_score(y_train, predictions)
        base_learner_accuracies.append(accuracy)
        print(f"Weak Learner {t + 1} Accuracy: {accuracy:.4f}")

        # Adjust the distribution based on the error
        D_t = adjust_distribution(D_t, h_t, epsilon_t)
    
    # Combine outputs of all weak learners
    return lambda x: combine_outputs(weak_learners, x), base_learner_accuracies

# Prepare the training data for boosting
D_train = [(x, y) for x, y in zip(x_train, y_train)]

# Number of boosting rounds
T = 10

# Train the boosted classifier
H, base_learner_accuracies = boosting(D_train, base_learner, T)

# Evaluate the final model on the test set
y_pred = [H(x) for x in x_test]

# Calculate the accuracy of the final model
final_accuracy = accuracy_score(y_test, y_pred)
print("Final Model Accuracy on the test set:", final_accuracy)

# Compare accuracies
print("Base Learner Accuracies:", base_learner_accuracies)
Running the cell as provided fails with a FileNotFoundError, because the file '..//mnist.npz' is not found; point the path at a local copy of the MNIST .npz file before correcting the rest of the code.

The AdaBoost method#

The general boosting procedure described above requires two key operations: Adjust_Distribution and Combine_Outputs.

The cost function we aim to minimize for a classifier involves both the parameters of the classifier and the weights of each classifier used in the combination. The cost function is given by:

$$J(\theta,\alpha)=E[l(e)]+\lambda G(\theta,\alpha)$$

where $e$ is the error, $l$ is the loss function, $\theta$ represents the parameters of the classifier, and $\alpha$ denotes the weights for combining the classifiers. $G(\theta,\alpha)$ is used for sparsification.

For a two-class classifier, the error $e$ can be represented as $e=y\,g(x)$, where $g(x)$ is the decision function. With the exponential loss, the cost function $J(\theta,\alpha)$ is expressed as:

$$J(\theta,\alpha)=\sum_{i=1}^{n}\exp\big(-y_i h_m(x_i)\big)$$

Here, $h_m(x_i)$ is the decision function at the $m$-th iteration. The decision function $H(x)$ consists of kernel functions $h(x)$ and weight combinations $\alpha$. The decision function for the $m$-th iteration is given by:

$$H^{(m)}(x)=\sum_{j=1}^{m}\alpha_j h_j(x)$$

then,

$$H^{(m)}(x)=H^{(m-1)}(x)+\alpha_m h_m(x)$$

This expression shows that $H^{(m)}(x)$ is incrementally updated in each boosting iteration.

The cost function can be further expanded as follows:

$$J(\theta,\alpha)=\sum_{i=1}^{n}\exp\big(-y_i H^{(m)}(x_i)\big)$$

Substituting $H^{(m)}(x_i)=H^{(m-1)}(x_i)+\alpha_m h_m(x_i)$ into the expression:

$$J(\theta,\alpha)=\sum_{i=1}^{n}\exp\Big(-y_i\big(H^{(m-1)}(x_i)+\alpha_m h_m(x_i)\big)\Big)$$

This can be rewritten as:

$$J(\theta,\alpha)=\sum_{i=1}^{n}\exp\big(-y_i H^{(m-1)}(x_i)\big)\,\exp\big(-y_i\alpha_m h_m(x_i)\big)$$

or:

$$J(\theta,\alpha)=\sum_{i=1}^{n}w_i^{m}\exp\big(-y_i\alpha_m h_m(x_i)\big)$$

where $w_i^{m}=\exp\big(-y_i H^{(m-1)}(x_i)\big)$ represents the weight of the sample $x_i$ at iteration $m$.

Given:

  • For correctly classified samples: $y_i h_m(x_i)=1$

  • For misclassified samples: $y_i h_m(x_i)=-1$

The cost function $J(\theta,\alpha)$ can be split into contributions from correctly classified samples $\Omega_1$ and misclassified samples $\Omega_2$.

The general formula for $J(\theta,\alpha)$ is:

$$J(\theta,\alpha)=\sum_{i=1}^{n}w_i^{m}\exp\big(-y_i\alpha_m h_m(x_i)\big)$$

For correctly classified samples $\Omega_1$:

In this case, $y_i h_m(x_i)=1$. Therefore:

$$\exp\big(-y_i\alpha_m h_m(x_i)\big)=\exp(-\alpha_m)$$

Thus, the contribution to $J(\theta,\alpha)$ from correctly classified samples is:

$$\sum_{i\in\Omega_1}w_i^{m}\exp(-\alpha_m)$$

For misclassified samples $\Omega_2$:

In this case, $y_i h_m(x_i)=-1$. Therefore:

$$\exp\big(-y_i\alpha_m h_m(x_i)\big)=\exp(\alpha_m)$$

Thus, the contribution to $J(\theta,\alpha)$ from misclassified samples is:

$$\sum_{i\in\Omega_2}w_i^{m}\exp(\alpha_m)$$

Combining both contributions:

$$J(\theta,\alpha)=\sum_{i\in\Omega_1}w_i^{m}\exp(-\alpha_m)+\sum_{i\in\Omega_2}w_i^{m}\exp(\alpha_m)$$
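Before deriving the closed-form weight in the next subsection, here is a small numerical sanity check (a minimal sketch with made-up weight totals): it locates the minimizer of $J(\alpha_m)=A\exp(-\alpha_m)+B\exp(\alpha_m)$ by grid search, where $A$ and $B$ stand for the total weight of correctly and incorrectly classified samples, and compares it with the closed form derived below.

import numpy as np

# Made-up totals: A = weight of correctly classified samples, B = weight of misclassified ones
A, B = 0.8, 0.2

# Split cost J(alpha) = A*exp(-alpha) + B*exp(alpha)
alphas = np.linspace(-2.0, 2.0, 4001)
J = A * np.exp(-alphas) + B * np.exp(alphas)

alpha_grid = alphas[np.argmin(J)]
alpha_closed = 0.5 * np.log(A / B)  # closed-form minimizer derived in the next subsection

print(f"grid-search minimizer : {alpha_grid:.4f}")
print(f"closed-form minimizer : {alpha_closed:.4f}")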

Weights of combination#

Given the definitions:

  • $B=FC^{(m)}=\sum_{i\in\Omega_2}w_i^{m}$ (the total weight of falsely classified samples)

  • $A=TC^{(m)}=\sum_{i\in\Omega_1}w_i^{m}$ (the total weight of truly classified samples)

and using the error-rate definition:

$$\epsilon_m=\frac{FC^{(m)}}{FC^{(m)}+TC^{(m)}}$$

we can substitute these into the derivative calculation. Setting $\frac{\partial J}{\partial \alpha_m}=-A\exp(-\alpha_m)+B\exp(\alpha_m)=0$ gives $A\exp(-\alpha_m)=B\exp(\alpha_m)$.

Rewrite A and B

Substitute $A$ and $B$ into the equation:

$$\exp(-2\alpha_m)=\frac{B}{A}$$

which becomes:

$$\exp(-2\alpha_m)=\frac{FC^{(m)}}{TC^{(m)}}$$

Express $\epsilon_m$ in Terms of $FC^{(m)}$ and $TC^{(m)}$

From the definition of $\epsilon_m$:

$$\epsilon_m=\frac{FC^{(m)}}{FC^{(m)}+TC^{(m)}}$$

Rearrange to express $FC^{(m)}$ in terms of $\epsilon_m$ and $TC^{(m)}$:

$$FC^{(m)}=\epsilon_m\big(FC^{(m)}+TC^{(m)}\big)$$

Solve for $FC^{(m)}$:

$$FC^{(m)}\big(1-\epsilon_m\big)=\epsilon_m\,TC^{(m)}$$
$$FC^{(m)}=\frac{\epsilon_m}{1-\epsilon_m}\,TC^{(m)}$$

So:

$$\frac{FC^{(m)}}{TC^{(m)}}=\frac{\epsilon_m}{1-\epsilon_m}$$

Substitute into the Exponential Term

Now substitute this into the exponential term:

$$\exp(-2\alpha_m)=\frac{FC^{(m)}}{TC^{(m)}}$$
$$\exp(-2\alpha_m)=\frac{\epsilon_m}{1-\epsilon_m}$$

Solve for $\alpha_m$

Taking the natural logarithm of both sides:

$$-2\alpha_m=\ln\!\left(\frac{\epsilon_m}{1-\epsilon_m}\right)$$

So:

$$\alpha_m=\frac{1}{2}\ln\!\left(\frac{1-\epsilon_m}{\epsilon_m}\right)$$

Figure: the AdaBoost weight $\alpha_m$ as a function of the weak-learner error $\epsilon_m$ (the alpha AdaBoost curve).
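The curve can be reproduced with a few lines of NumPy and Matplotlib (a minimal sketch; the plotted range of $\epsilon_m$ is chosen only for illustration).

import numpy as np
import matplotlib.pyplot as plt

# alpha_m = 0.5 * ln((1 - eps) / eps), defined for errors strictly between 0 and 1
eps = np.linspace(0.01, 0.99, 200)
alpha = 0.5 * np.log((1 - eps) / eps)

plt.plot(eps, alpha)
plt.axhline(0, color='gray', linewidth=0.8)
plt.axvline(0.5, color='gray', linestyle='--', linewidth=0.8)  # alpha changes sign at eps = 0.5
plt.xlabel(r'$\epsilon_m$')
plt.ylabel(r'$\alpha_m$')
plt.title('AdaBoost weight versus weak-learner error')
plt.show()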

$$\ell_{\exp}\big(H_{m-1}+h_m \mid D\big)=E_{x\sim D}\Big[e^{-y\big(H_{m-1}(x)+h_m(x)\big)}\Big]$$
$$=E_{x\sim D}\Big[e^{-yH_{m-1}(x)}\,e^{-yh_m(x)}\Big]$$

Using the Taylor expansion of $e^{-yh_m(x)}$, the exponential loss is approximated by:

$$\ell_{\exp}\big(H_{m-1}+h_m \mid D\big)\approx E_{x\sim D}\Big[e^{-yH_{m-1}(x)}\Big(1-yh_m(x)+\frac{y^2h_m(x)^2}{2}\Big)\Big]$$

By noticing that $y^2=1$ and $h_m(x)^2=1$, the ideal (desired) classifier $h_m(x)$ is:

$$h_m(x)=\arg\min_{h}\ \ell_{\exp}\big(H_{m-1}+h \mid D\big)$$
$$=\arg\min_{h}\ E_{x\sim D}\Big[e^{-yH_{m-1}(x)}\Big(1-yh(x)+\frac{1}{2}\Big)\Big]=\arg\max_{h}\ E_{x\sim D}\Big[e^{-yH_{m-1}(x)}\,y\,h(x)\Big]$$

By noticing that $E_{x\sim D}\big[e^{-yH_{m-1}(x)}\big]$ is a constant,

$$=\arg\max_{h}\ E_{x\sim D}\left[\frac{e^{-yH_{m-1}(x)}}{E_{x\sim D}\big[e^{-yH_{m-1}(x)}\big]}\,y\,h(x)\right]$$

Denote a distribution

$$D_m(x)=\frac{D(x)\,e^{-yH_{m-1}(x)}}{E_{x\sim D}\big[e^{-yH_{m-1}(x)}\big]}$$

Then

$$h_m(x)=\arg\max_{h}\ E_{x\sim D_m}\big[y\,h(x)\big]$$

In the expression $y\,h_m(x)=1-2\,\mathbb{I}\big(y\neq h_m(x)\big)$, where $y$ is the true label and $h_m(x)$ is the predicted label, the term $\mathbb{I}\big(y\neq h_m(x)\big)$ is an indicator function that equals 1 if the true label $y$ does not match the predicted label $h_m(x)$, and 0 if they are the same.

$1-2\,\mathbb{I}\big(y\neq h_m(x)\big)$:

  • If $y=h_m(x)$: $\mathbb{I}\big(y\neq h_m(x)\big)=0$, so the expression becomes $1-2\times 0=1$.

  • If $y\neq h_m(x)$: $\mathbb{I}\big(y\neq h_m(x)\big)=1$, so the expression becomes $1-2\times 1=-1$.

Therefore:

  • When the prediction $h_m(x)$ is correct, the expression $y\,h_m(x)$ equals 1.

  • When the prediction $h_m(x)$ is incorrect, the expression $y\,h_m(x)$ equals -1.

The ideal classifier is:

$$h_m(x)=\arg\min_{h}\ E_{x\sim D_m}\Big[\mathbb{I}\big(y\neq h(x)\big)\Big]$$

As can be seen, the ideal $h_m$ minimizes the classification error under the distribution $D_m$. Therefore, the weak learner is to be trained under $D_m$ and should have a classification error of less than 0.5 with respect to $D_m$ (so that $E_{x\sim D_m}[y\,h_m(x)]>0$, since $y\,h_m(x)=1-2\,\mathbb{I}\big(y\neq h_m(x)\big)$). Considering the relationship between $D_m$ and $D_{m+1}$, we have:

$$D_{m+1}(x)=\frac{D(x)\,e^{-yH_m(x)}}{E_{x\sim D}\big[e^{-yH_m(x)}\big]}$$
$$=\frac{D(x)\,e^{-yH_{m-1}(x)}\,e^{-y\alpha_m h_m(x)}}{E_{x\sim D}\big[e^{-yH_m(x)}\big]}$$
$$=D_m(x)\,e^{-y\alpha_m h_m(x)}\cdot\frac{E_{x\sim D}\big[e^{-yH_{m-1}(x)}\big]}{E_{x\sim D}\big[e^{-yH_m(x)}\big]}$$
$$\frac{1}{Z_m}=\frac{E_{x\sim D}\big[e^{-yH_{m-1}(x)}\big]}{E_{x\sim D}\big[e^{-yH_m(x)}\big]}$$
$$D_{m+1}(x)=\frac{D_m(x)\,e^{-y\alpha_m h_m(x)}}{Z_m}$$

Here, $\frac{1}{Z_m}$ is a normalization factor, so $D_m(x)\,e^{-y\alpha_m h_m(x)}$ is normalized with

$$Z_m=\sum_{x}D_m(x)\,e^{-y\alpha_m h_m(x)}$$
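The update from $D_m$ to $D_{m+1}$ takes only a few lines of NumPy (a minimal sketch; the labels and predictions below are made up, whereas in practice $y$ and $h_m(x)$ come from the data and the trained weak learner).

import numpy as np

# Made-up labels in {-1, +1} and weak-learner predictions for five samples
y   = np.array([ 1, -1,  1,  1, -1])
h_m = np.array([ 1, -1, -1,  1,  1])      # mismatches at the 3rd and 5th samples

D_m = np.full(len(y), 1 / len(y))         # current distribution (uniform here)

eps_m   = np.sum(D_m[y != h_m])                    # weighted error under D_m
alpha_m = 0.5 * np.log((1 - eps_m) / eps_m)        # classifier weight

# D_{m+1}(x) = D_m(x) * exp(-y * alpha_m * h_m(x)) / Z_m
unnormalized = D_m * np.exp(-y * alpha_m * h_m)
Z_m = unnormalized.sum()
D_next = unnormalized / Z_m

print("epsilon_m =", eps_m)
print("D_{m+1}   =", np.round(D_next, 3))          # misclassified samples gain weight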
Note:

If $\epsilon_m>0.5$, then

$$\alpha_m=\frac{1}{2}\ln\!\left(\frac{1-\epsilon_m}{\epsilon_m}\right)$$

Thus, $\alpha_m<0$. Consequently, there are a few possible ways to handle such a round:

  1. Set $\alpha_m=0$, i.e., discard this weak learner.

  2. Flip the weak learner's predictions, so that its error becomes $\epsilon_m \leftarrow 1-\epsilon_m<0.5$, and compute $\alpha_m$ as before:

$$\alpha_m=\frac{1}{2}\ln\!\left(\frac{1-\epsilon_m}{\epsilon_m}\right)$$
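A short numerical check of this situation (a minimal sketch; the value $\epsilon_m=0.6$ is made up): the weight is negative for an error above 0.5 and becomes positive again once the learner's predictions are flipped.

import numpy as np

eps = 0.6                                            # a weak learner worse than random guessing
print("alpha:", 0.5 * np.log((1 - eps) / eps))       # negative

eps_flipped = 1 - eps                                # error after flipping the predictions
print("alpha after flip:", 0.5 * np.log((1 - eps_flipped) / eps_flipped))  # positive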

AdaBoost Algorithm#

Input:

  • Dataset $D=\{(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\}$

  • Base learning algorithm L

  • Number of learning rounds T

Process:

  1. Initialize the weight distribution $D_1(x)=\frac{1}{m}$.

  2. For $t=1,\ldots,T$:

    • Train a classifier $h_t$ using the base algorithm $L$ on dataset $D$ under the distribution $D_t$.

    • Calculate the error $\epsilon_t$ of $h_t$, where $\epsilon_t=P_{x\sim D_t}\big(h_t(x)\neq f(x)\big)$.

    • If $\epsilon_t>0.5$, then set $\alpha_t=0$ and continue to the next round;

    • Determine the weight $\alpha_t=\frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$ of the classifier $h_t$.

    • Update the distribution for the next round: $D_{t+1}(x)=\frac{D_t(x)\exp\big(-\alpha_t f(x)h_t(x)\big)}{Z_t}$, where $Z_t$ is a normalization factor that ensures $D_{t+1}$ is a valid distribution.

  3. End the loop.

Output:
The final hypothesis is given by $H(x)=\operatorname{sign}\!\left(\sum_{t=1}^{T}\alpha_t h_t(x)\right)$.
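The algorithm above translates almost line by line into code. Below is a minimal from-scratch sketch for binary labels in $\{-1,+1\}$; the use of a depth-1 decision tree from scikit-learn as the base learner $L$, the synthetic dataset, and the helper names (adaboost_fit, adaboost_predict) are illustrative assumptions rather than part of the algorithm.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=20):
    """Train T weighted weak learners following the pseudocode above (y must be in {-1, +1})."""
    m = len(y)
    D_t = np.full(m, 1 / m)                      # step 1: uniform initial distribution
    learners, alphas = [], []
    for t in range(T):
        # Train the base learner under D_t (implemented via sample weights)
        h_t = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D_t)
        pred = h_t.predict(X)

        eps_t = np.sum(D_t[pred != y])           # error of h_t under D_t
        if eps_t > 0.5:                          # worse than random: skip this round
            continue
        eps_t = max(eps_t, 1e-10)                # guard against a perfect weak learner

        alpha_t = 0.5 * np.log((1 - eps_t) / eps_t)

        # Distribution update with normalization factor Z_t
        D_t = D_t * np.exp(-alpha_t * y * pred)
        D_t = D_t / D_t.sum()

        learners.append(h_t)
        alphas.append(alpha_t)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    """H(x) = sign(sum_t alpha_t * h_t(x))."""
    agg = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(agg)

# Small usage example on a synthetic problem (illustrative only)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
y = np.where(y == 1, 1, -1)                      # map labels to {-1, +1}
learners, alphas = adaboost_fit(X, y, T=20)
print("training accuracy:", np.mean(adaboost_predict(X, learners, alphas) == y))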

How AdaBoost works#


import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

# Generate synthetic 2D dataset
X, y = make_classification(n_samples=200, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, flip_y=0, random_state=42)

# Initialize base classifier
base_classifier = DecisionTreeClassifier(max_depth=1)

# Initialize AdaBoost classifier
adaboost = AdaBoostClassifier(base_classifier, n_estimators=5, algorithm='SAMME', random_state=42)  # 'SAMME.R' is deprecated in recent scikit-learn versions

# Fit AdaBoost classifier to the data
adaboost.fit(X, y)

# Predict using the fitted model
y_pred = adaboost.predict(X)

# Plot the decision boundary
def plot_decision_boundary(X, y, model):
    h = .02  # Step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.coolwarm)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', marker='o', cmap=plt.cm.coolwarm)
    plt.title('AdaBoost Decision Boundary')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.show()

# Plot the decision boundary
plot_decision_boundary(X, y, adaboost)
Figure: AdaBoost decision boundary on the synthetic 2D dataset, with the training points overlaid.

Homework: AdaBoost#

  • Experiment with different values of n_estimators.

  • Write AdaBoost from scratch, without using sklearn.ensemble.

What we do with AdaBoost#

What we do with AdaBoost can be explained by first reviewing the following related work:

  1. Bahraini, T., Hosseini, S. M., Ghasempour, M., & Sadoghi Yazdi, H. (2022). Density-oriented linear discriminant analysis. Expert Systems with Applications, 187, 115946.

In the study “Density-oriented linear discriminant analysis,” the authors tackle big data challenges by integrating AdaBoost with a novel base learner, DLDA (Density-Oriented Linear Discriminant Analysis). This combination enhances classification accuracy and efficiency, especially in high-dimensional and imbalanced datasets. The approach is scalable and suitable for various big data applications, offering an innovative solution to improve machine learning performance in complex environments. Future work includes optimizing DLDA and exploring integrations with other advanced techniques.

Miniproject: The LogitBoost algorithm#

Miniproject: The LPBoost algorithm#

Bagging#

Bagging, short for Bootstrap Aggregating, reduces errors by combining multiple independent base learners. It achieves this by generating different training subsets using bootstrap sampling, where each subset is created by sampling with replacement from the original dataset. Multiple base learners are trained on these subsets, and their outputs are aggregated via voting for classification or averaging for regression. Bagging is effective for both binary and multi-class classification problems.
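As a quick illustration of bootstrap sampling itself (a minimal NumPy sketch; the dataset size is arbitrary), each bootstrap subset is drawn with replacement and on average contains roughly 63% of the distinct original samples.

import numpy as np

rng = np.random.default_rng(0)
m = 1000                                   # size of the original training set (arbitrary)

# One bootstrap sample: m indices drawn with replacement from {0, ..., m-1}
bootstrap_idx = rng.integers(0, m, size=m)

unique_fraction = len(np.unique(bootstrap_idx)) / m
print(f"fraction of distinct original samples in this bootstrap set: {unique_fraction:.2f}")
# The expected fraction is 1 - (1 - 1/m)^m, which approaches 1 - 1/e ≈ 0.632 for large m.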

In the context of bagging, after combining T base classifiers, the final ensemble classifier H(x) is given by the following equation:

$$H(x)=\operatorname{sign}\!\left(\sum_{i=1}^{T}h_i(x)\right)$$

Bagging algorithm#

Input:

  • Dataset $D=\{(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\}$

  • Base learning algorithm L

  • Number of base learners T

Process:

  1. For $t=1,\ldots,T$:

    • Train the base learner: $h_t=L(D,D_{bs})$

    • where $D_{bs}$ is the bootstrap distribution (sampling with replacement)

  2. End for

Output: $H(x)=\arg\max_{y\in\mathcal{Y}}\sum_{t=1}^{T}\mathbb{I}\big(h_t(x)=y\big)$
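A minimal from-scratch sketch of this procedure follows; the choice of scikit-learn decision trees as the base learner $L$, the synthetic dataset, and the helper names (bagging_fit, bagging_predict) are illustrative assumptions rather than part of the algorithm.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, seed=0):
    """Train T base learners, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    m = len(y)
    learners = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)          # D_bs: sampling with replacement
        learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return learners

def bagging_predict(X, learners):
    """H(x) = argmax_y sum_t I(h_t(x) = y), i.e. a majority vote over the base learners."""
    votes = np.stack([h.predict(X) for h in learners])      # shape (T, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Small usage example on a synthetic problem (illustrative only)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
learners = bagging_fit(X, y, T=25)
print("training accuracy:", np.mean(bagging_predict(X, learners) == y))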

Homework: Random forest#

Homework: Adaboost for Regression#