Instead of relying on a single model, you combine several models of the same type, or entirely different models.

The final answer is based on a majority vote, or, in the case of regression, on the average of the individual outputs.

Some models have ensemble learning built in, like random forest (a collection of decision trees).
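
A minimal sketch of this idea using scikit-learn's VotingClassifier, combining three different model types and letting them vote (the dataset and the particular models are just illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy dataset, purely for illustration.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three very different models; each one votes on every prediction.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",  # majority vote; "soft" would average predicted probabilities instead
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```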

Advantages

  • Increased accuracy
  • Reduced overfitting

Flavors

The flavors below differ in how they handle the data, not in the model itself.

Bagging (Bootstrap Aggregating)

Each model gets its own subset of the original dataset, generated by random sampling with replacement (a bootstrap sample).

Each model trains on different data samples. Final prediction via majority vote (classification) or averaging (regression).

Besides already yielding different results per model, we can also validate training quality on the data points that were not drawn into a model's bootstrap sample (its out-of-bag samples). This is similar to cross-validation.

Example: Random Forest uses bagging with decision trees.
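
A minimal bagging sketch with scikit-learn's RandomForestClassifier; oob_score=True uses the out-of-bag samples mentioned above (the dataset is a toy one for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 100 trees, each trained on its own bootstrap sample of the data.
# oob_score=True evaluates every tree on the samples it never saw.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)
print(forest.oob_score_)  # accuracy estimated from the out-of-bag samples
```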

Boosting

Sequential training where each new model focuses on correcting errors of previous models.

Misclassified instances get higher weights, forcing subsequent models to pay more attention to difficult cases.

Popular algorithms:

  • AdaBoost (Adaptive Boosting)
  • Gradient Boosting
  • XGBoost

Key difference from bagging: models are trained sequentially rather than in parallel, and each model focuses on the previous models' errors rather than on a random subset.
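
A small boosting sketch with scikit-learn's AdaBoostClassifier, which by default boosts shallow decision trees (the dataset is again just illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 weak learners trained one after another; after each round the
# misclassified samples are reweighted so the next learner focuses on them.
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X_train, y_train)
print(booster.score(X_test, y_test))
```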

Cross-validation

Splitting data into training, validation, and test subsets to evaluate model performance on data the model has not seen.

K-fold CV: Data divided into K subsets. Model trained K times, each time using K-1 folds for training and 1 for validation. Average performance across folds gives final metric.
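
A 5-fold cross-validation sketch using scikit-learn's cross_val_score (the model and data here are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Train 5 times, each time holding out a different fifth of the data
# for validation; the mean of the 5 scores is the final metric.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())
```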

Purpose:

  • Helps detect overfitting to the training data
  • Better estimate of model generalization
  • Helps tune hyperparameters

Common split: 60% train, 20% validation, 20% test (or 70/15/15)
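
One way to get the 60/20/20 split with two calls to train_test_split (the percentages follow the note above; everything else is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)

# First carve off 20% for the test set, then take 25% of the remaining
# 80% for validation, which works out to a 60/20/20 split overall.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
```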