Computing Power to the People

The Official Qarnot Blog

How to run AutoML on a cluster to predict electricity prices?


by Rémi Bouzel - May 4, 2021 - Data science

Machine-learning algorithms use statistics to find patterns in massive amounts of data. Machine learning powers many of the services we use today: recommendation systems, search engines, social media feeds, voice assistants. The list goes on.

The classical machine learning workflow

Suppose we have a data set from which we want to build a predictive model. The traditional machine learning approach requires the following sequence of actions:

    1. Data pre-processing: This is the process of cleaning raw data. It includes dealing with missing values and duplicate data, encoding certain types of categorical or string data, scaling features (also called variables or columns), and more.
    2. Feature extraction and engineering: It is not always best to use all of the features in a machine learning problem, especially in high-dimensional problems (a large number of available features). It can be beneficial to identify the most important features and/or create new ones (feature engineering) that could have better predictive capabilities.
    3. Choosing the right learning model: In every machine learning problem, we first have to identify what we are actually trying to do, e.g. predicting a continuous numerical variable (regression) or predicting a categorical variable with two or more labels (binary or multi-class classification). This allows us to narrow the search for an adequate model.
    4. Optimization of hyperparameters: Almost all ML models have a number of parameters that have to be set by the user before training; they are called hyperparameters. The performance and behavior of a model on a given problem can vary drastically with its hyperparameters. When there are several of them and each can take different values, finding the right combination is a challenging task.
    5. Training and evaluating with optimal parameters: Once we have the best-performing set of hyperparameters, we train our model and evaluate its performance on the test data.
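To make the steps above concrete, here is a minimal sketch of the manual workflow using only the Python standard library. Everything in it is an assumption made for illustration: the data is synthetic, the model is a hand-written k-nearest-neighbours classifier, and the only hyperparameter searched is k. Feature engineering (step 2) is skipped to keep the example short.

```python
import random

def standardize(rows, means=None, stds=None):
    """Step 1: scale each feature to zero mean and unit variance."""
    if means is None:
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
                for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]
    return scaled, means, stds

def knn_predict(train_x, train_y, x, k):
    """Step 3: a binary classifier, here k-nearest neighbours with a majority vote."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))[:k]
    return 1 if 2 * sum(train_y[i] for i in nearest) > k else 0

def accuracy(train_x, train_y, eval_x, eval_y, k):
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(eval_x, eval_y))
    return hits / len(eval_y)

# Synthetic data: class 1 is shifted on both features, the second feature has a larger scale.
rng = random.Random(0)
data = []
for _ in range(300):
    label = rng.randint(0, 1)
    data.append(([rng.gauss(2 * label, 1), 50 * rng.gauss(2 * label, 1)], label))

def split_xy(rows):
    return [x for x, _ in rows], [y for _, y in rows]

train_x, train_y = split_xy(data[:180])
val_x, val_y = split_xy(data[180:240])
test_x, test_y = split_xy(data[240:])

# Step 1: fit the scaling on the training data only, then apply it everywhere.
train_x, means, stds = standardize(train_x)
val_x, _, _ = standardize(val_x, means, stds)
test_x, _, _ = standardize(test_x, means, stds)

# Step 4: pick the hyperparameter k that scores best on the validation split.
scores = {k: accuracy(train_x, train_y, val_x, val_y, k) for k in (1, 3, 5, 9, 15)}
best_k = max(scores, key=scores.get)

# Step 5: evaluate the chosen configuration on the held-out test data.
test_acc = accuracy(train_x + val_x, train_y + val_y, test_x, test_y, best_k)
print(f"best k = {best_k}, test accuracy = {test_acc:.2f}")
```

Even in this toy setting, every step is a manual decision. With several models and many hyperparameters each, the number of combinations grows quickly.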

Automating this process is the focus of Automated Machine Learning (AutoML).

AutoML

Definition and Frameworks

The essence of AutoML is to automate the above-mentioned tasks (which can take a considerable amount of time) so that data scientists can spend more time on the business problems at hand. AutoML also makes machine learning technology accessible to a wider audience, rather than to a small group of specialists.

In recent years, many excellent AutoML frameworks have emerged. Below is a brief description of the most popular ones.

  • TPOT: TPOT is a tree-based pipeline optimization tool that uses genetic algorithms to optimize machine learning pipelines. TPOT is built on top of scikit-learn and exposes its own regressor and classifier classes.
  • MLBox: MLBox is a Python-based library offering pre-processing, model optimization, and prediction features.
  • H2O: H2O is an open-source machine learning platform developed by H2O.ai. It supports the most widely used statistical and machine learning algorithms including gradient boosted machines, generalized linear models, deep learning, and more.
  • Auto-sklearn: This is the ML framework we will showcase in this article. We will go into more detail about this framework in the next section.

AutoML use cases

Companies can automate their machine learning processes for a variety of purposes. Mostly, companies want to have automated insights for better data-driven decisions and predictions. 

One example of a real-life use case of AutoML is fraud detection by PayPal. Fraud detection is the process of identifying and preventing unauthorized financial activity. This can include fraudulent credit card transactions, identity theft, cyber hacking, insurance scams, and more. PayPal used AutoML to improve its existing ML solution’s accuracy to 95% and reduced model training time to under 2 hours.

Auto-sklearn

At its core, every effective AutoML service needs to solve the fundamental problems of deciding which machine learning algorithm to use on a given dataset, whether and how to preprocess its features, and how to set all hyperparameters given a time and memory budget. This is the problem that Auto-sklearn tries to address.

What is Auto-sklearn 

Auto-sklearn is an open-source automated machine learning software package built on scikit-learn. Auto-sklearn defines AutoML as the problem of finding the best machine learning model and its hyperparameters for a dataset within a vast search space that includes many classifiers and many hyperparameters. In the figure below (from the Auto-sklearn article), you can see a representation of the different components and general workflow of Auto-sklearn.

In general, we can see that Auto-sklearn has three main components:

  • Meta-Learning
  • Bayesian optimization (BO)
  • Build ensemble

Meta-Learning

Auto-sklearn applies meta-learning to select instantiations of our given machine learning framework that are likely to perform well on a new dataset. More specifically, Auto-sklearn has been trained on a large number of datasets in order to find out which models perform best on which types of data. Once Auto-sklearn is presented with a new dataset, it computes a number of meta-features (general statistics about the data). These meta-features are used to match our new data to the previously seen datasets that resemble it the most. This way, we already know some models and ML framework instantiations that could potentially perform well on our new data. It serves as a “warm start” for the optimization process.
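The warm-start idea can be sketched in a few lines. The snippet below is only an illustration and not Auto-sklearn’s actual code: the three meta-features, the `KNOWN` knowledge base, and the configuration names are all hypothetical. It shows the matching step only: compute statistics for the new dataset, find the closest known dataset, and reuse what worked there.

```python
import math

def meta_features(X, y):
    """A few simple dataset statistics (hypothetical choice of meta-features)."""
    n_samples, n_features = len(X), len(X[0])
    positive_share = sum(y) / n_samples
    return [math.log(n_samples), math.log(n_features), positive_share]

# Hypothetical knowledge base: meta-features of previously seen datasets
# and the configuration that performed best on each of them.
KNOWN = [
    {"name": "small_balanced",
     "meta": [math.log(200), math.log(4), 0.5],
     "best_config": {"model": "k_nearest_neighbors", "n_neighbors": 5}},
    {"name": "large_imbalanced",
     "meta": [math.log(50000), math.log(8), 0.1],
     "best_config": {"model": "gradient_boosting", "learning_rate": 0.1}},
    {"name": "wide_small",
     "meta": [math.log(100), math.log(500), 0.5],
     "best_config": {"model": "linear_svm", "C": 1.0}},
]

def warm_start(X, y, k=1):
    """Return the best configurations of the k known datasets closest to (X, y)."""
    target = meta_features(X, y)
    ranked = sorted(KNOWN, key=lambda d: math.dist(target, d["meta"]))
    return [d["best_config"] for d in ranked[:k]]

# A made-up new dataset: 40,000 rows, 8 features, 12.5% positive labels.
X_new = [[0.0] * 8 for _ in range(40000)]
y_new = [1 if i % 8 == 0 else 0 for i in range(40000)]

suggestions = warm_start(X_new, y_new)
print(suggestions)
```

The suggested configurations are then evaluated first, before the optimizer starts exploring the rest of the search space.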

Bayesian optimization

Bayesian optimization is a powerful strategy for finding the extrema of objective functions that are expensive to evaluate. It is particularly useful when one does not have access to derivatives, or when the problem at hand is non-convex.

Problem definition

Let’s consider c(x), the cost function associated with the ML framework block, where x denotes the different configurations it can have. This is known as the objective function. c(x) is a black-box function for which we want to find the global minimum, i.e., the best-performing configuration of hyperparameters for the ML framework. Let’s suppose that c(x) has the following “true” shape.

Surrogate function approximation

Bayesian optimization approaches this task through a method known as surrogate optimization. A surrogate function is an approximation of the objective function. It is formed based on sampled points. Samples are equivalent to different ML framework configurations x, and their corresponding objective function scores, c(x). Note that the surrogate function will be mathematically expressed in a way that is significantly cheaper to evaluate than the true objective function. 

Based on the surrogate function, we can identify which points are promising minima. We decide to sample more from these promising regions and update the surrogate function accordingly. At each iteration, we continue to look at the current surrogate function, learn more about areas of interest by sampling, and update the function.

After a certain number of iterations, we can expect to land close to a global minimum. Note that this method makes almost no assumptions about the objective function (except that it can be optimized), and doesn’t require any derivatives. So what makes it Bayesian, exactly?

Link with Bayesian statistics

The essence of Bayesian statistics and modeling is taking into account a prior (previous) belief in light of new information to produce an updated posterior (‘after’) belief. This statement is mathematically represented by the famous Bayes theorem.

$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$

where A and B are events and $P(B)\ne0$.

  • $P(A|B)$ is a conditional probability: the probability of event $A$ occurring given that $B$ is true. It is also called the posterior probability of $A$ given $B$.
  • $P(B|A)$ is also a conditional probability: the probability of event $B$ occurring given that $A$ is true.
  • $P(A)$ and $P(B)$ are the probabilities of observing $A$ and $B$ respectively without any given conditions. $P(A)$ is known as the prior probability of $A$, and $P(B)$ as the marginal probability (or evidence).
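A small numeric example may help here. It applies Bayes’ theorem in the form P(A|B) = P(B|A) P(A) / P(B). The events and all probabilities are made up for illustration: A stands for “the price goes up” and B for “demand is high”.

```python
# Made-up numbers, for illustration only.
p_a = 0.4              # prior: P(A), the price goes up
p_b_given_a = 0.7      # likelihood: P(B|A), demand is high when the price goes up
p_b_given_not_a = 0.2  # P(B|not A), demand is high when the price does not go up

# Marginal probability of B, by the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(f"P(B) = {p_b:.2f}, P(A|B) = {p_a_given_b:.2f}")
```

Observing B raises our belief in A from the prior of 0.4 to a posterior of 0.7.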

This is exactly what surrogate optimization does: it starts from a prior belief about the objective function and updates it each time a new sample is evaluated. It can therefore be best represented through Bayesian systems, formulas, and ideas. However, in order to apply the Bayes theorem, we have to represent the surrogate function as a probability distribution. This is done using a Gaussian Process.

Gaussian Process

A Gaussian process is a probability distribution over possible functions. It can be thought of as a dice roll that returns functions fitted to given data points instead of numbers 1 to 6. The process returns several functions, which have probabilities attached to them. This creates a probability distribution for our surrogate function.

For instance, we may define the current set of data points as being 40% representable by a function a(x), 10% by function b(x), etc. By representing the surrogate function as a probability distribution, it can be updated with new information. Perhaps when new information is introduced, the data is only 20% representable by function a(x). These changes are governed by Bayesian formulas (seen above).
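The update described above can be sketched with a deliberately simplified setup. Instead of a true Gaussian process (a distribution over infinitely many functions), the snippet uses three hypothetical candidate functions with prior weights and applies Bayes’ rule after one noisy observation. The candidate functions, the observation, and the noise level are all assumptions chosen for illustration.

```python
import math

# Three hypothetical candidate functions and our prior belief in each.
candidates = {
    "a": lambda x: x,
    "b": lambda x: x ** 2,
    "c": lambda x: 1.0 - x,
}
prior = {"a": 0.4, "b": 0.1, "c": 0.5}

def update(belief, x, y, noise=0.5):
    """Bayes' rule: posterior weight is proportional to prior weight times likelihood.

    The likelihood assumes y = f(x) + Gaussian noise with standard deviation `noise`.
    """
    unnormalized = {
        name: belief[name] * math.exp(-0.5 * ((y - f(x)) / noise) ** 2)
        for name, f in candidates.items()
    }
    total = sum(unnormalized.values())
    return {name: weight / total for name, weight in unnormalized.items()}

# One new observation of the unknown function: at x = 1.5 we measured y = 2.0.
posterior = update(prior, x=1.5, y=2.0)
for name in candidates:
    print(f"{name}: prior {prior[name]:.2f} -> posterior {posterior[name]:.2f}")
```

Candidates that explain the new point poorly lose weight, and the posterior becomes the prior for the next observation.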

What follows is a visual example of Bayesian inference with Gaussian processes to better understand it. Let’s say we have an unknown function we’re trying to estimate. Our prior belief about the unknown function is visualized below. On the right are the mean and standard deviation of our Gaussian process; the mean is 0 since we don’t have any knowledge yet. On the left, each line is a sample from the distribution of functions, and our lack of knowledge is reflected in the wide range of possible functions and diverse function shapes on display.

After having seen some evidence we can use Bayes’ rule to update our belief about the function to get the posterior Gaussian process, i.e. our updated belief about the function we’re trying to fit.

The updated Gaussian process is constrained to the possible functions that fit our data which results in a narrower distribution of functions.

Once additional samples and their evaluation via the objective function c(x) have been collected, they are added to the data and the posterior is then updated. This process is repeated until the global extremum of the objective function is located, a good enough result is found, or resources are exhausted.

Summary

Bayesian optimization is primarily used to optimize expensive black-box functions. It can be performed as follows:

  1. Choose a surrogate model for modeling the true function and define it as the prior.
  2. Given the set of observations (function evaluations), use the Bayes rule to obtain the posterior.
  3. Use the posterior surrogate function to choose the next sample points.
  4. Add the newly sampled data to the set of observations and go back to step 2 until convergence or until the budget is exhausted.
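The four steps can be sketched as a short loop. This is a minimal illustration under stated assumptions, not Auto-sklearn’s optimizer: the objective is a cheap one-dimensional stand-in for c(x), the surrogate is a zero-mean Gaussian process with an RBF kernel written from scratch, and the next point is chosen with a lower-confidence-bound rule (mean minus two standard deviations). The kernel length scale, the grid, and the number of iterations are arbitrary choices.

```python
import math

def objective(x):
    """Cheap stand-in for the expensive cost function c(x) (hypothetical)."""
    return math.sin(3 * x) + 0.5 * x

def rbf(a, b, length=0.3):
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def gp_posterior(xs, ys, x_new, jitter=1e-5):
    """Step 2: posterior mean and standard deviation of the GP at x_new."""
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(k * v for k, v in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 0.0))

grid = [i / 100 for i in range(201)]  # candidate configurations in [0, 2]
xs = [0.0, 1.0, 2.0]                  # step 1: a few initial observations
ys = [objective(x) for x in xs]

for _ in range(10):
    def lower_confidence_bound(x):
        mean, std = gp_posterior(xs, ys, x)
        return mean - 2.0 * std       # low predicted cost or high uncertainty
    # Step 3: choose the most promising configuration not yet evaluated.
    x_next = min((x for x in grid if x not in xs), key=lower_confidence_bound)
    # Step 4: evaluate it and add it to the observations.
    xs.append(x_next)
    ys.append(objective(x_next))

best_y, best_x = min(zip(ys, xs))
print(f"best configuration x = {best_x:.2f} with cost {best_y:.3f}")
```

Only 13 evaluations of the objective are needed here. All the other work happens on the surrogate, which is the point of the method when each real evaluation means training a model.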

Build ensemble

Ensemble methods are machine learning techniques that combine several base models in order to produce one optimal predictive model. It is well known that ensembles often outperform individual models and that effective ensembles can be created from a library of models. They perform particularly well if the models they are based on are individually strong and make uncorrelated errors. Since this is much more likely when the individual models are different in nature, ensemble building is particularly well suited for combining strong instantiations of a flexible ML framework.
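Here is a small sketch of why uncorrelated errors matter. It builds an ensemble greedily from a library of models by repeatedly adding the model that most improves validation accuracy under a majority vote. The three models and their predictions are made up, and the procedure is a simplified illustration rather than Auto-sklearn’s exact ensemble construction.

```python
def accuracy(predicted, truth):
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

def majority_vote(members, library):
    """Combine the 0/1 predictions of the selected models (ties count as 1)."""
    n = len(next(iter(library.values())))
    return [1 if 2 * sum(library[m][i] for m in members) >= len(members) else 0
            for i in range(n)]

def greedy_ensemble(library, truth, size=3):
    """Add, with replacement, the model that gives the best ensemble accuracy."""
    members = []
    for _ in range(size):
        best = max(library,
                   key=lambda m: accuracy(majority_vote(members + [m], library), truth))
        members.append(best)
    return members

def with_errors(truth, wrong_indices):
    """Copy of the true labels with the given positions flipped."""
    return [1 - t if i in wrong_indices else t for i, t in enumerate(truth)]

# Made-up validation labels and three models that are each 80% accurate
# but make their mistakes on different examples.
truth = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
library = {
    "a": with_errors(truth, {0, 1}),
    "b": with_errors(truth, {2, 3}),
    "c": with_errors(truth, {4, 5}),
}

members = greedy_ensemble(library, truth)
ensemble_accuracy = accuracy(majority_vote(members, library), truth)
print(members, ensemble_accuracy)
```

Each model alone reaches 80% accuracy on this toy data, but because no two models are wrong on the same example, the vote of all three is correct everywhere.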

How to run Auto-sklearn on Qarnot

Use case

The data showcased in this article is the electricity data set. This data was collected from the Australian New South Wales Electricity Market where electricity prices are set every five minutes based on supply and demand.

The dataset contains 45,312 instances dated from 7 May 1996 to 5 December 1998. Each example of the dataset refers to a period of 30 minutes. Given this historical data, we have to predict whether the electricity prices will go up or down. This is called a binary classification problem and our class labels are UP and DOWN.

We want to build the best possible ensemble using Auto-sklearn in a given time frame. The best way to do so is to train multiple models in parallel, which increases our chances of building a strong ensemble. The Qarnot HPC service is well suited to this, as it can parallelize Auto-sklearn’s computation across multiple nodes in a cluster.

Initial setup

The first step is to create a Qarnot account. We offer 15€ worth of computation on your subscription, which amply covers the running cost of this example.

Next, we need to create a local input folder containing the files needed for the computation:

  • electricity-normalized.csv: electricity data set (link to download .csv later).
  • run_autosklearn.py: Python script that loads the data and launches the computation.

 

What follows are the Qarnot run script and configuration file, located at the same level as the input folder:

  • cluster_run.py: Python script for running auto-sklearn on a Qarnot cluster of multiple nodes.
  • qarnot.conf: configuration file containing the user’s API token (you will find it in the “API” section of your Qarnot account). It can be set up quickly by following these steps.

All that’s left to do is follow these steps to set up a Python virtual environment and install the Qarnot Python SDK.

Qarnot Script

Once the environment is ready with all the necessary files, you can run python3 cluster_run.py from your terminal to launch the computation on Qarnot.

In the following example, we will run Auto-sklearn on a 3-node cluster (1 master and 2 workers) for a total training time of 15 minutes, with a time limit of 5 minutes per trained model. 80% of the data will be used for training and validation, while 20% will be kept as a test set. These parameters and others can be set in cluster_run.py.
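For readers who want to see what these parameters look like in code, here is a minimal single-machine sketch of a training script. It is not the original run_autosklearn.py and it leaves out the cluster distribution entirely. The target column name `class` and the helper names are assumptions. `time_left_for_this_task` and `per_run_time_limit` are auto-sklearn’s time-budget arguments, expressed in seconds.

```python
def budget_in_seconds(total_minutes=15, per_model_minutes=5):
    """Translate the time budget from the text into auto-sklearn keyword arguments."""
    return {
        "time_left_for_this_task": total_minutes * 60,
        "per_run_time_limit": per_model_minutes * 60,
    }

def train(csv_path="electricity-normalized.csv", target="class", test_size=0.2):
    """Train on 80% of the data and return the accuracy on the remaining 20%.

    The target column name is an assumption; adapt it to your copy of the data.
    """
    # Imported here so the budget helper can be used without these packages installed.
    import pandas as pd
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from autosklearn.classification import AutoSklearnClassifier

    data = pd.read_csv(csv_path)
    X, y = data.drop(columns=[target]), data[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0)

    automl = AutoSklearnClassifier(**budget_in_seconds())
    automl.fit(X_train, y_train)
    return accuracy_score(y_test, automl.predict(X_test))

print(budget_in_seconds())
```

Calling `train()` requires pandas, scikit-learn, and auto-sklearn to be installed, and runs for the full time budget.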

Results

You can then view the details of the task on your own console or on the Qarnot console by clicking on your task. Once the task has finished running, several output files will be automatically downloaded to your computer. The figure below is one of them. It is a heatmap of the confusion matrix showing the final ensemble’s predictions across the two classes knowing their true labels.

After 15 minutes of training on 3 nodes, the final ensemble achieves a test accuracy of 94.49% (more information is available in the log_autosklearn.log file). These are very good results considering the short execution time and relative ease of use.
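As a reminder of how a confusion matrix and the test accuracy relate, here is a tiny sketch with made-up labels and predictions (not the results of the run above). Rows are true labels and columns are predicted labels.

```python
from collections import Counter

def confusion_matrix(truth, predicted, labels=("DOWN", "UP")):
    """Rows: true label, columns: predicted label."""
    counts = Counter(zip(truth, predicted))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Made-up example with 8 test samples.
truth     = ["UP", "UP",   "DOWN", "DOWN", "UP", "DOWN", "DOWN", "UP"]
predicted = ["UP", "DOWN", "DOWN", "DOWN", "UP", "UP",   "DOWN", "UP"]

matrix = confusion_matrix(truth, predicted)
# Accuracy is the share of samples on the diagonal (correct predictions).
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / len(truth)

print(matrix)    # [[3, 1], [1, 3]]
print(accuracy)  # 0.75
```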

Qarnot Benchmark

To showcase the benefits of parallelizing across multiple nodes and cores, we propose the following benchmark done on the Qarnot platform. We ran Auto-sklearn for 25 minutes on the same data using different configurations:

  • Local laptop, Intel i5-6300U Four-Core Processor
  • Qarnot AMD Ryzen 7 2700X Eight-Core Processor
    • 1 node using only 1 core
    • using all cores on 1, 3, 5, and 7 nodes respectively

Note that on a multi-node cluster, one node is the master, which handles task scheduling and Bayesian optimization, and the rest of the nodes are the workers that actually train the machine learning models.

We chose to base our benchmark on three metrics:

  • Number of attempted runs: number of ML models that auto-sklearn attempted to train
  • Number of successful runs: number of ML models that successfully converged given time and memory constraints
  • Test accuracy: prediction accuracy on the test data

We can see that the number of attempted and successful runs increases at a steady rate: the more cores we have, the more candidate models we can train. In theory, increasing the number of attempted and successfully completed runs can improve performance on any problem. The test accuracy also improves as we increase the number of nodes in the cluster, but its growth eventually slows down, which suggests that we are nearing the best performance auto-sklearn can reach on this specific problem.

Summary

In this article, we introduced the concept of AutoML as well as several of the most-used frameworks, specifically Auto-sklearn. We briefly went over how it works and how to accurately predict electricity price movements using a multi-node cluster on Qarnot.

To go further in the field of AutoML, one could look into AutoML systems for automating deep learning, like AutoKeras or H2O. Technologies like these are quite promising because fine-tuning a neural network (number of layers, depth of layers, optimizers, etc.) is no easy task and could benefit greatly from the AutoML philosophy.

Craving more machine learning articles? You could also read one of our other articles, and learn how to train your own neural network on Qarnot for weather prediction. We hope you enjoyed this tutorial! Should you have any questions, or if you wish to use our platform for heavier computations (we can provide state-of-the-art resources on demand), don’t hesitate to contact us.

 

written by Mehdi Oumnih