Computing Power to the People

The Official Qarnot Blog

< Back

Distributed machine learning model for handwritten digits recognition


by Yanik - January 27, 2017 - Data science
CC-BY / Jack73

In this article, we describe how to build a machine learning model for the automatic recognition of handwritten digits.  The machine learning model is generated from a distributed python script, executed by the Qarnot HPC service. The basis of handwritten digits we intend to recognize is the one of the MNIST database.

The problem

We assume a dataset (Train_Set) of images, each corresponding to a handwritten digit. The correspondences are encoded in a vector (labels) that given the index of an image  returns a digit from 0 to 9. The goal is to induce a model that would perfectly predict the digits on another set of images (Test_Set).

For Train_set and Test_set, we will consider the collection of handwritten digits of the MNIST database. This collection includes more than 70,000 sample images. The images are greyscaled and normalized to fit into 28 *28 pixels.  In the database, each digit is associated with several handwritten images. A short view of this diversity is illustrated in the image below.

A KNN-based solution

Input

To build a machine learning model, we need a numerical representation (an encoding) of the data from which we want to induce a model. In the case of MNIST, such a representation is already provided. Indeed, each image in the database is represented as a matrix of 28*28 integers, capturing the darkness level of each pixel. This level goes from 0 (white) to 255 (black). Below for instance, we show how a handwritten image corresponding to the digit 1 is encoded (normalized to 1.0).

The scikit-learn library implements built-in function for loading data from the MNIST database. Below, we show a script for loading the data from MNIST and creating Train_Set and Test_Set.

You can have a look at some values of X and y array. Each element of X array is a vector of 28x28=784 values between 0 and 255.

The MNIST database is classified as follows:

  • X[0] to X[5922] are 0’s
  • X[5923] to X[12664] are 1’s
  • X[5924] to X[18622] are 2’s

In the following scripts, we’ll only focus on 0’s, 1’s and 2’s within the X[:L], y[:L] subset.

Solving the problem with KNN

The k-nearest neighbors (KNN) method is one of the most intuitive method for building machine learning models.

The idea is the following. Let’s assume that we have a basis of images X_train, associated with the corresponding digits y_train. Then, if we want to know to which digit a new input x corresponds, we simply compute the similarity between x and all the other images in X_train. To define similarity, the simplest way is to use a basic distance measure. By default, the classifier will use the euclidean distance between vectors. For instance, if $x$ and $x’$ are two 784-dimensional vectors, their mutual euclidean distance is defined by:

$$D(x,x’) = \sqrt{\sum_{i=0}^{783}|x_i-x’_i|^2}$$

The K most similar images, i.e. those with the smallest distance with the candidate image, are then selected. We then pick the corresponding digits in y_train and return the most frequent.

With scikit-learn, we can simply implement this machine learning scheme with the following code.

The scores measure the average number of good predictions of the model. The higher it is, the better is the model.  In this example, we only considered the first 18622 images of the MNIST database. Otherwise the process  would have been more time consuming. By running the python script, we get:

Training set score: 0.994200
Test set score: 0.993986

The obtained score is high. But, one should notice that we did not considered the entire database. Moreover, better scores are possible. Indeed, we ran the KNN method with K=2 and in assuming that all points in each neighborhood are equally  weighted  (weight=’uniform’).  In practice, however, it might be more interesting to consider more neighbors (a greater value for K) or other definitions of the the most frequent digit.

To include these possibilities, we propose to modify the above script as follows.

The new mnist_knn2.py script proposes four possible combinations of parameters for the KNN model. The first two combinations generate a 2-NN models and the two others a 3-NN model.

smith@wesson python mnist_knn2.py 1
(n_neighbors: 2 , weigths: distance)
Training set score: 1.000000
Test set score: 0.999785

One can notice that by changing the weight policy from ‘uniform’ to ‘distance’, we get a better prediction score.  Thus, for obtaining the best KNN model, one should try all possible combinations of parameters. But, this could be time consuming, in particular if we do not restrict L to 18622.  A solution to reduce this time is to use the Qarnot HPC service for building the models in parallel.

KNN with the Qarnot HPC service

With the Qarnot SDK, we can launch build concurrently the KNN models of our problem. For this, we use the script

In this script, we first create a task with 4 frames

task = conn.create_task("mnist-digitRecognition", "docker-network", 4)

The tasks will be run with the dockerhub image whose name is  “buildo/docker-python2.7-scikit-learn”. This image includes python 2.7, scikit-learn and numpy. Then, each frame will process the command

python mnist_knn2.py ${FRAME_ID}

By the definition, the possible values for  ${FRAME_ID}  are 0, 1 , 2, 3. Thus, the frame number i will generate the KNN model whose parameters are defined in options[i]. In running this script, we obtain the results

The execution can also be monitored on Qarnot console: console.qarnot.com/tasks. Once finished, the result is available in the stdout tab:

For further information, the interested reader can have a look at the KNN parameters defined in [2] and can define a larger parallel exploration in redefining the options array. We also recommend to change the above script to use the 60 000 first images of the MNIST database for training and the remaining for testing.

Next steps!

With Qarnot HPC services, you can go well beyond what could be described in this article. You will probably come up with many more interesting use cases. So, go ahead! Try it now, and share your feedback with the community, so that we can keep improving on this product.

[1] The KNN algorithm wikipedia

[2] KNN classifier in Scikit-learn scikit-learn.org

[3] Similarity / distance measures for KNN youtube