ML Algorithm Explained: Boosting

Boosting is a supervised machine learning technique that combines an ensemble of weak learners to predict values for given data points. It can be used for both classification and regression tasks, and it has many applications, including fraud detection, pricing analysis, search engines, and medical diagnosis.

Advantages:

  • It is easy to implement, and its hyperparameters can be tuned to improve fitting.
  • It has lower bias than its individual weak learners, because each iteration corrects the errors of the previous one.

Disadvantages:

  • It can require significant processing capability and time for training.
  • It is challenging to deploy in real-time applications because of its higher computational complexity.

In this blog post, we provide an overview of the Boosting algorithm along with an example implementation in code.

Basic Concepts

Boosting algorithms draw random samples from a dataset and train n weak learning algorithms sequentially. The first model trained on a sample is called the base algorithm, and every data point is initially assigned an equal weight. The boosting algorithm evaluates the output of the base algorithm and adjusts the weights based on the model's performance, increasing the weights of the points it misclassified. The output of each model is then used to train the subsequent one, so each iteration corrects the errors of the previous model and improves the accuracy of the predictions. The process continues until the accuracy reaches a certain threshold or n algorithms have been trained.
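To make the loop concrete, here is a minimal AdaBoost-style sketch of the reweighting process. It assumes binary labels encoded as -1 and +1; the number of rounds and the choice of decision stumps as weak learners are illustrative choices, not requirements of the algorithm.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=10):
    # Start with equal weights on every data point
    n_samples = X.shape[0]
    weights = np.full(n_samples, 1 / n_samples)
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)  # weak learner
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        err = np.sum(weights[pred != y])  # weighted error rate
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))  # learner's vote weight
        weights *= np.exp(-alpha * y * pred)  # upweight misclassified points
        weights /= weights.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def predict(learners, alphas, X):
    # The strong learner is a weighted vote of all weak learners
    scores = sum(a * l.predict(X) for l, a in zip(learners, alphas))
    return np.sign(scores)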

Weak and Strong Learners

Weak learners are algorithms with relatively low prediction accuracy, while strong learners are algorithms with relatively high prediction accuracy. A classic weak learner is the decision stump, a decision tree with a single split. Boosting combines an ensemble of weak learners to produce a single strong learner.
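As a quick illustration, a single stump typically scores well below a boosted ensemble of stumps. The snippet below uses a synthetic dataset and scikit-learn's AdaBoostClassifier purely for demonstration; the exact scores will vary.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)     # a single weak learner
boosted = AdaBoostClassifier(n_estimators=100)  # ensemble of boosted stumps

print(cross_val_score(stump, X_demo, y_demo).mean())    # weak learner: modest accuracy
print(cross_val_score(boosted, X_demo, y_demo).mean())  # strong learner: noticeably higher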

Algorithm Implementation with Code

Importing Libraries

For this example, we use scikit-learn, a popular machine learning library in Python. Scikit-learn is a useful tool for creating datasets, training and testing algorithms, and much more. You must import all necessary libraries beforehand.

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt  # plots
%matplotlib inline

from sklearn.datasets import load_iris  # dataset
from sklearn.model_selection import train_test_split  # split the data into training and testing sets
from sklearn.preprocessing import StandardScaler  # standardize variables
from sklearn.ensemble import GradientBoostingClassifier  # Gradient Boosting algorithm
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score  # predictions and evaluations

Dataset

To create the dataset, we use load_iris. The iris dataset is a popular dataset used in machine learning. It contains 150 observations (50 for each of three iris species) on four variables: Petal Length, Petal Width, Sepal Length, and Sepal Width.

# load the iris dataset
iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target
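A quick check confirms what was loaded: 150 observations, four features, and three target classes.

print(X.shape)             # (150, 4): 150 observations, 4 features
print(iris.feature_names)  # sepal/petal length and width, in cm
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']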

Train-Test Split

In this section, we split the dataset into training data and testing data. The objective is to train the algorithm to predict a value of y based on its associated X values. This allows us to test the performance of the algorithm against the testing data by comparing the predicted values with the actual values.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
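Note that train_test_split shuffles the data randomly, so the exact split (and the final accuracy) varies from run to run. If you want reproducible results, you can pass a fixed seed; the value 42 below is an arbitrary illustrative choice.

# Optional: fix the seed so the split is reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)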

Feature Scaling

In this section, we apply feature scaling, a preprocessing step that normalizes the range of independent variables of varying magnitude. Without feature scaling, many algorithms give variables with larger magnitudes more influence than variables with smaller ones.

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit the scaler on the training data, then transform it
X_test = sc.transform(X_test)        # apply the same scaling to the test data
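To see the effect of the scaler, you can check that each training feature now has a mean of roughly 0 and a standard deviation of roughly 1:

print(X_train.mean(axis=0).round(2))  # approximately [0. 0. 0. 0.]
print(X_train.std(axis=0).round(2))   # approximately [1. 1. 1. 1.]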

The next step is to create a boosting object, fit it to the training data, and generate predictions on the test data. There are three main families of boosting algorithms: Adaptive (AdaBoost), Gradient, and Extreme Gradient (XGBoost). In this case, we select the GradientBoostingClassifier, a sequential model that uses the output of one decision tree to train the subsequent one, and so on. By default, n_estimators is set to 100, which indicates that the algorithm will build 100 decision trees.

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)

gb.fit(X_train, y_train)

pred = gb.predict(X_test)
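For comparison, the other two families mentioned above follow almost the same pattern. AdaBoostClassifier ships with scikit-learn, while XGBClassifier comes from the separate xgboost package, so the commented lines below assume that package is installed (pip install xgboost).

from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=100)  # Adaptive Boosting
ada.fit(X_train, y_train)

# Extreme Gradient Boosting, assuming the xgboost package is installed:
# from xgboost import XGBClassifier
# xgb = XGBClassifier(n_estimators=100).fit(X_train, y_train)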

Evaluation

Finally, we evaluate the predictions using classification_report, which summarizes the precision, recall, and F1-score of a classification algorithm for each class.

print(classification_report(y_test, pred))
print("Accuracy score: ", accuracy_score(y_test, pred))
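The confusion_matrix imported earlier is also worth printing. Each row corresponds to an actual class and each column to a predicted class, so any off-diagonal entries are misclassified observations.

print(confusion_matrix(y_test, pred))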

The resulting accuracy value is 0.91; because the train-test split is random, your exact value may vary slightly between runs. In business applications, you would use datasets generated from various business processes, but the basic process of applying the algorithm remains the same.

Tags: Supervised, Ensemble, Meta-Algorithm, Non-Overfitting, Weak Classifier

AIPI3’s ML platform uses many innovative machine learning algorithms to create value for businesses. Our platform is driven by artificial intelligence & machine learning experts with extensive experience across a wide range of industries, specializations, and applications. 

Get in touch with AIPI3 to discover how we can assist you!