Qarnot Technical Team
Engineers

Sparkmagic on Qarnot Cloud - documentation

November 4, 2021 - HPC discovery, Documentation, Machine Learning / AI

Introduction

Sparkmagic lets the user connect to a remote Spark cluster from a local Jupyter Notebook and interact with it through Livy, a Spark REST server, with the help of magics, a set of commands for interactively running Spark code in multiple languages.
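To make the role of Livy more concrete, here is a minimal sketch of the two HTTP requests that a client such as Sparkmagic relies on: one to create an interactive session and one to submit a statement to it. The /sessions and /sessions/{id}/statements endpoints belong to Livy's REST API. The base URL assumes Livy is reachable on localhost at its default port 8998, and the helper names are purely illustrative. The requests are only built, never sent.

```python
import json
import urllib.request

# Assumption: Livy is reachable on localhost at its default port.
LIVY_URL = "http://localhost:8998"


def build_request(path, payload):
    """Return a JSON POST request for the given Livy path (not sent)."""
    return urllib.request.Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request(kind="pyspark"):
    """Request asking Livy to start an interactive session."""
    return build_request("/sessions", {"kind": kind})


def run_statement_request(session_id, code):
    """Request submitting a code snippet to an existing session."""
    return build_request(f"/sessions/{session_id}/statements", {"code": code})


if __name__ == "__main__":
    # Print what would be sent, without touching the network.
    req = run_statement_request(0, "sc.parallelize(range(10)).sum()")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

In practice you do not write these requests yourself: the Sparkmagic magics issue them for you from the notebook.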

Here is a quick step-by-step guide to using Sparkmagic to interact with a Spark cluster running on Qarnot from your local computer.

Versions

Software     Release year   Version
Hadoop       2021           3.3.1
Spark        2021           2.4.8
Sparkmagic   2021           0.7.1
Livy         2020           0.19.1

If you are interested in another version, please send us an email at qlab@qarnot.com.

Prerequisites

Before starting a calculation with the Python SDK, a few steps are required:

  • Retrieve the authentication token (here)
  • Install Qarnot’s Python SDK (here)

Note: in addition to the Python SDK, Qarnot provides C# and Node.js SDKs and a command-line interface.

Test case

This tutorial will showcase how to count the number of words in the Iliad in a distributed way using Sparkmagic. The workflow is as follows:

  • Launch a Spark cluster of 3 nodes (1 master and 2 workers) on Qarnot
  • Connect to the cluster via a local Jupyter Notebook
  • Interact with the cluster locally using Sparkmagic
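The word count itself follows the usual split, count, and merge pattern. The sketch below reproduces that logic in plain Python on a small in-memory sample, so it can be run without a cluster. It is not the content of wordcount.ipynb, and the function names are hypothetical. On Spark, the same steps would typically be expressed with RDD operations such as flatMap, map, and reduceByKey.

```python
from collections import Counter
from functools import reduce


def count_words(lines):
    """Count whitespace-separated, lowercased words in one partition of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(partitions):
    """Merge per-partition counters, as a reduce step would."""
    return reduce(lambda a, b: a + b, partitions, Counter())


if __name__ == "__main__":
    # Two small partitions stand in for the two workers.
    worker_1 = ["Sing, O goddess, the anger", "of Achilles son of Peleus"]
    worker_2 = ["that brought countless ills", "upon the Achaeans"]
    total = merge_counts([count_words(worker_1), count_words(worker_2)])
    print(sum(total.values()), "words in total")
    print(total.most_common(3))
```

Each worker counts the words of its own share of the text, and only the small per-partition counters need to be combined at the end.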

This can be visualized by the following figure.

Before moving forward, set up your working environment so that it contains the following files:

  • input
    • iliad.txt: text file containing the Iliad to be counted on Qarnot.
  • spark-magic.py: script for starting the cluster on Qarnot (see below).
  • wordcount.ipynb: Jupyter notebook used to connect to the Spark cluster.

Both input and wordcount.ipynb can be downloaded from the following link.

Launching the test case

Before moving on, make sure to install Sparkmagic by following these simple steps. Note that it is preferable to install it in a Python virtual environment.

1. Install the library

pip install sparkmagic

Note that compiling pykerberos, a Sparkmagic dependency, can fail if the required library is not installed. If you encounter an error such as error: command 'x86_64-linux-gnu-gcc' failed with exit status 1 on Ubuntu, try running sudo apt install krb5-multidev.

2. Make sure that ipywidgets is properly installed by running

jupyter nbextension enable --py --sys-prefix widgetsnbextension

3. Launching the task on Qarnot

Once everything is set up, use the following script to launch the cluster on Qarnot. To do so, copy the following code into a Python script named spark-magic.py, placed at the same level as input and wordcount.ipynb.

  • Be sure to copy your authentication token instead of <<<MY_SECRET_TOKEN>>> in line 10.
  • Copy your public ssh key in the script in place of <<<PUBLIC_SSH_KEY>>> in line 27.

spark-magic.py

To launch this script, simply execute python3 spark-magic.py in your terminal.

By default, the script opens an SSH connection to Qarnot in a gnome-terminal window. If you do not have this terminal app installed, or wish to use another one, run python3 spark-magic.py --terminal=<<<unix-terminal-app>>>. To disable this feature and only print the SSH command so that you can run it yourself, set --terminal=off.

Once a new terminal spawns on your end, the SSH connection with the cluster has been established. You can then launch the provided notebook by running jupyter notebook wordcount.ipynb in a local terminal.

This notebook contains easy-to-follow steps to connect to the Spark cluster and complete this use case. The screenshot below shows what the notebook should look like once you have completed it.

 

You also get access to the following forwarded dashboards by typing localhost:<port> in your browser.

  • 8088: the Hadoop YARN UI
  • 8998: the Livy server UI
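If a dashboard does not load, it can help to check whether the forwarded ports are actually reachable on your machine. The following is a minimal sketch using only the Python standard library. It assumes the tunnel forwards the two ports listed above to localhost, and the helper name is illustrative.

```python
import socket

# Ports listed above, assumed to be forwarded to localhost by the SSH tunnel.
DASHBOARD_PORTS = {8088: "Hadoop YARN UI", 8998: "Livy server UI"}


def is_port_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    for port, name in DASHBOARD_PORTS.items():
        state = "reachable" if is_port_open(port) else "not reachable"
        print(f"{name} on localhost:{port} is {state}")
```

A port reported as not reachable usually means that the SSH session is closed or that the corresponding service has not finished starting on the cluster.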

The Hadoop YARN dashboard

 

Results

At any given time, you can monitor the status of your task on Tasq.

 

Once you are done with the task, type exit in the SSH terminal to close the tunnel, and make sure to abort the task from Tasq. If you do not abort the task manually, it will keep running and consuming your credits.

Wrapping up

That’s it! If you have any questions, please contact qlab@qarnot.com and we will be happy to help.
