Qarnot Technical Team
Engineers

AutoML with Binder on Qarnot Cloud - documentation

November 3, 2021 - HPC discovery, Documentation, Machine Learning / AI

Introduction

This is a step-by-step guide on how to use AutoML, specifically Auto-sklearn, on Qarnot with minimal user intervention, using a Binder Jupyter notebook as a graphical user interface. Binder is a free Jupyter notebook/lab hosting service that lets users share notebooks with others via a simple link.

We encourage you to read the standalone AutoML documentation to get a better understanding of how this software works.

Version

Release year    Version
2021            v0.12.5

If you are interested in another version, please send us an email at qlab@qarnot.com.

Prerequisites

Before starting a calculation with the Python SDK, a few steps are required:

  • Retrieve the authentication token (here)
  • Install Qarnot’s Python SDK (here)

Note: in addition to the Python SDK, Qarnot provides C# and Node.js SDKs and a command-line interface.

Test Case

The data showcased in this tutorial is the Localization Data for Person Activity. It contains recordings of five people performing different activities. Each person wore four sensors (tags) while performing the same scenario five times. The problem consists of classifying the activity type, from 11 different types (walking, falling, sitting, etc.), for each entry given the collected sensor data. You can download the data from this link.

Unlike the AutoML tutorial linked above, this is a multi-class classification problem: each data entry can take one of 11 different values for the activity type. In a binary classification, by contrast, there are only two classes to predict (for example, classifying images as dog or cat). This is a different Machine Learning problem solved with exactly the same software.
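
To make the binary versus multi-class distinction concrete, here is a minimal sketch that counts the distinct values of a target column in a CSV file. Only the target column name Class comes from this tutorial; the helper names, the feature columns and the sample rows are made up for illustration and do not reflect the real data set.

```python
import csv
import io
from collections import Counter


def class_distribution(csv_text, target_column):
    """Count how many rows fall into each value of the target column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[target_column] for row in reader)


def problem_type(distribution):
    """Label the task from the number of distinct target values."""
    n_classes = len(distribution)
    if n_classes < 2:
        return "not a classification problem"
    return "binary" if n_classes == 2 else "multi-class"


# Made-up rows for illustration only; the real file is much larger.
sample = """x,y,z,Class
4.06,1.89,0.50,walking
4.29,1.78,0.73,walking
3.91,2.10,0.12,falling
4.00,2.01,0.95,sitting
"""

dist = class_distribution(sample, "Class")
print(dist)
print(problem_type(dist))
```

Running the same count on the downloaded file should report 11 distinct activity types if the description above holds.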

Launching the test case

Once you have downloaded the data set, all you have to do is click on the following link to get access to the Jupyter notebook hosted on Binder.

  • Note that this could take a few minutes, depending on whether the notebook was recently launched and whether there is a queue for launching Binder.
  • The page occasionally fails to load properly (especially if it has not been used for some time). If that happens, refresh the page or relaunch it, which should only take a few seconds.

Once the notebook is running, you should have this page loaded in your browser.

 

 

 

The page contains a number of fields. Here is an overview of the most important ones for this task:

Basic Parameters

  • Password field for your secret Qarnot token
  • Button to upload the data you want to run the classification on (only .csv files are supported for now). For this test case, make sure to upload the data set linked above, phpH4DHsK.csv.
  • Once you have uploaded your data file, a target column drop-down menu will become available with the dataset's column names.
    • As mentioned in the notebook, only the first parameter (target column) has to be set by the user. The rest are optional and have default values. For this test case, make sure it is set to Class in the drop-down menu.
  • Optional task and bucket names
  • Number of nodes in the cluster lets you specify the number of processors that you want to use for your Auto-sklearn training task.
  • Two of the parameters that a user might like to change are:
    • total training time: The total time (in minutes) allocated to Auto-sklearn for this training task, after which the training stops and the results are sent back.
    • per run training time: Auto-sklearn trains multiple models in parallel within the given time limit. This parameter governs the time limit for each individual model. It can be set to around 10% of the total time for longer training times (> 60 minutes) and to a larger share for shorter training times (~33% for < 60 minutes). There is no fixed rule for this, and the user should experiment with different values.
    • For this use case, you can leave them at their default values.
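
The rule of thumb for the per run training time can be sketched as a small helper. This is only an illustration of the percentages quoted above, not the notebook's actual logic: the function names are hypothetical, and treating a total of exactly 60 minutes as a "short" run is an assumption, since the guideline does not cover that case. The conversion to seconds is included because Auto-sklearn itself expresses its time limits in seconds.

```python
def suggest_per_run_minutes(total_minutes):
    """Apply the rule of thumb: ~10% above 60 minutes, ~33% otherwise."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    # Assumption: a total of exactly 60 minutes counts as a short run.
    fraction = 0.10 if total_minutes > 60 else 0.33
    # Never suggest less than one minute per model.
    return max(1, round(total_minutes * fraction))


def to_seconds(minutes):
    """Convert a duration in minutes to seconds."""
    return minutes * 60


for total in (30, 60, 120):
    per_run = suggest_per_run_minutes(total)
    print(total, per_run, to_seconds(total), to_seconds(per_run))
```

The values returned are starting points only; as noted above, experimenting with different values remains the best approach.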

Optional Parameters

  • As the name indicates, these are optional, slightly more technical parameters that already have default values and can be ignored completely. Feel free to experiment with different values if you like.
  • Number of cross validation folds dictates the number of times your data will be split and used for training/validating your models. For example, if it is set to 3, the data is split into three parts and, in each round, two thirds of your data are used for training and one third for validation (evaluating model hyper-parameter performance).
  • Maximum ensemble size indicates the maximum number of models that can compose your final ensemble.
  • Ensemble nbest is used for setting the number of models to keep from that ensemble. For example keep only the top 10 models.
  • Note about the estimators and pre-processors to include/exclude:
    • The last four fields are multiple-choice selections, i.e. you can select multiple entries using shift and/or ctrl.
    • According to the Auto-sklearn API documentation, the include and exclude parameters are incompatible with each other, meaning that only one of them should be set. For example, you cannot include the Adaboost estimator and also exclude other estimators, as they are already excluded by setting the include parameter.
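
To illustrate what the number of cross validation folds means, here is a minimal sketch of how a data set can be split into folds. It is a simplified illustration, not Auto-sklearn's own implementation: the function name is hypothetical, and real implementations typically also shuffle or stratify the data.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, validation) index lists for each of the n_folds splits."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


splits = list(k_fold_indices(n_samples=9, n_folds=3))
for train, validation in splits:
    print(len(train), len(validation), validation)
```

With 3 folds, every entry is used for validation exactly once and for training twice, which is why each round trains on two thirds of the data.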

Once all the parameters have been set, you can launch the task on Qarnot by clicking the Start Training on Qarnot! button.
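
If you script your own parameter handling, the include/exclude constraint described in the optional parameters can be checked before anything is launched. The sketch below is an assumption about how such a check might look, not the notebook's code: the function name is hypothetical, the returned key names (for example include_estimators) are assumed to mirror the argument names of this Auto-sklearn version, and the component names are used for illustration only.

```python
def selection_kwargs(kind, include=None, exclude=None):
    """Return keyword arguments for one include/exclude pair.

    Raises ValueError if both lists are set, since only one may be used.
    """
    if include and exclude:
        raise ValueError(f"Set include or exclude for {kind}, not both.")
    kwargs = {}
    if include:
        kwargs[f"include_{kind}"] = list(include)
    if exclude:
        kwargs[f"exclude_{kind}"] = list(exclude)
    return kwargs


# Valid: only the include list is set.
print(selection_kwargs("estimators", include=["adaboost"]))

# Invalid: both lists are set, so the helper refuses the combination.
try:
    selection_kwargs("estimators", include=["adaboost"], exclude=["random_forest"])
except ValueError as err:
    print(err)
```

Failing early this way avoids spending cluster time on a task that Auto-sklearn would reject.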

Results

You will see live progress updates on the different states of the task. Once training is complete, you can click on the Display outputs button to have a look at the graphs generated by the training (a confusion matrix and a plot of accuracy over time). It will look something like this.

 

If you wish, you can generate a link to download a zip file containing all the outputs of your task: mainly the graphs you see above, the trained model, and various logs with detailed performance metrics.

It is also possible to view these results from your bucket explorer in Tasq by selecting the automl-binder-out bucket.

Confusion Matrix
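
For readers unfamiliar with this plot, a confusion matrix counts, for each true class, how often each class was predicted; correct predictions sit on the diagonal. The sketch below shows the idea in plain Python. It is not the code that generates the graph in the notebook, and the labels and predictions are made up for illustration.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(matrix):
    """Share of predictions on the diagonal, i.e. correctly classified."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Made-up labels and predictions for illustration only.
labels = ["walking", "falling", "sitting"]
y_true = ["walking", "walking", "falling", "sitting", "sitting", "sitting"]
y_pred = ["walking", "falling", "falling", "sitting", "sitting", "walking"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>8}", row)
print("accuracy:", accuracy(matrix))
```

Large off-diagonal counts point to pairs of activities that the model tends to confuse.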

 

Accuracy over time graph

 

Wrapping up

That’s it! If you have any questions, please contact qlab@qarnot.com and we will be happy to help.

If you are curious and would like to learn more about this particular use case and others, you can check out our blog article How to run AutoML on a cluster to predict electricity prices?, which goes into much more detail.

 

 

 
