Sparkmagic lets the user connect to a remote Spark cluster from a local Jupyter Notebook and interact with it through Livy, a Spark REST server, with the help of magics, a set of commands for interactively running Spark code in multiple languages.
Here is a quick step-by-step guide to using Sparkmagic to interact with a Spark cluster running on Qarnot from your local computer.
If you are interested in another version, please send us an email at email@example.com.
Before starting a calculation with the Python SDK, a few steps are required:
This tutorial will showcase how to count the number of words in the Iliad in a distributed way using Sparkmagic. The workflow is as follows:
- Launch a Spark cluster composed of 3 nodes, 1 master and 2 workers, on Qarnot
- Connect to the cluster via a local Jupyter Notebook
- Interact with the cluster locally using Sparkmagic
This can be visualized by the following figure.
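To make the idea of counting words "in a distributed way" more concrete, here is a minimal local sketch in plain Python. It is not Spark code and not the content of wordcount.ipynb: the two workers are only simulated by splitting the lines into chunks, and all function names are hypothetical.

```python
from typing import Iterable, List


def split_into_chunks(lines: List[str], n_chunks: int) -> List[List[str]]:
    """Distribute lines round-robin over n_chunks lists (one per simulated worker)."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def count_words(lines: Iterable[str]) -> int:
    """Map step: count the whitespace-separated words in one chunk."""
    return sum(len(line.split()) for line in lines)


def distributed_word_count(lines: List[str], n_workers: int = 2) -> int:
    """Reduce step: add up the partial counts of every simulated worker."""
    chunks = split_into_chunks(lines, n_workers)
    partial_counts = [count_words(chunk) for chunk in chunks]
    return sum(partial_counts)


# Opening lines of the Iliad, used here as a tiny stand-in for iliad.txt.
sample = [
    "Sing, O goddess, the anger of Achilles son of Peleus,",
    "that brought countless ills upon the Achaeans.",
]
print(distributed_word_count(sample, n_workers=2))
```

On the real cluster, Spark takes care of splitting the text and of sending each part to a worker, so you only describe the counting and the final sum.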
Before moving forward, you should set up your working environment to contain the following files:
- iliad.txt: text file containing the Iliad, whose words will be counted on Qarnot.
- spark-magic.py: script for starting the cluster on Qarnot (see below).
- wordcount.ipynb: Jupyter notebook to connect to the Spark cluster.
Both the input file (iliad.txt) and wordcount.ipynb can be downloaded from the following link.
Launching the test case
Before moving on, make sure to install Sparkmagic by following these simple steps. Note that it is preferable to install it in a Python virtual environment.
1. Install the library
pip install sparkmagic
Note that the compilation of pykerberos, a Sparkmagic dependency, can fail if you do not have the required library installed. If you encounter an error like
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
and you are using Ubuntu, try running
sudo apt install krb5-multidev
2. Make sure that ipywidgets is properly installed by running
jupyter nbextension enable --py --sys-prefix widgetsnbextension
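If you want to double-check the installation before going further, the following sketch may help. It only tests whether you are inside a virtual environment and whether the packages can be found by Python. The import names sparkmagic and ipywidgets are assumed to match the pip package names, and the helper names are hypothetical.

```python
import importlib.util
import sys
from typing import Iterable, List


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv or virtualenv."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level packages from `names` that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: these import names match the packages installed above.
required = ["sparkmagic", "ipywidgets"]
print("virtual environment:", in_virtualenv())
print("missing packages:", missing_packages(required))
```

An empty list of missing packages means both libraries are visible to the interpreter you ran the check with. It does not prove that the notebook extension itself is enabled.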
3. Launching the task on Qarnot
Once everything is set up, use the following script to launch the cluster on Qarnot. To do so, copy the following code into a Python script named spark-magic.py at the same level as
- Be sure to copy your authentication token instead of <<<MY_SECRET_TOKEN>>> in line 10.
- Copy your public ssh key in the script in place of <<<PUBLIC_SSH_KEY>>> in line 27.
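A forgotten placeholder is an easy mistake to make, so here is a small optional check, written as a sketch. It simply searches a piece of text for anything still written as <<<NAME>>>. The function name and the example variable names (token, ssh_key) are hypothetical and are not taken from the actual script.

```python
import re
from typing import List

# Matches placeholders written as <<<NAME>>>, the convention used in this guide.
PLACEHOLDER = re.compile(r"<<<[A-Za-z_\-]+>>>")


def unreplaced_placeholders(source: str) -> List[str]:
    """Return every <<<...>>> placeholder still present in the text, sorted."""
    return sorted(set(PLACEHOLDER.findall(source)))


# Hypothetical excerpt: the token was forgotten, the ssh key was filled in.
example = 'token = "<<<MY_SECRET_TOKEN>>>"\nssh_key = "ssh-ed25519 AAAA user@host"\n'
print(unreplaced_placeholders(example))
```

To check your own file, you could pass it the content of the script, for example unreplaced_placeholders(open("spark-magic.py").read()), and make sure the result is an empty list before launching.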
To launch this script, simply execute python3 spark-magic.py in your terminal.
By default, it will connect you to Qarnot via ssh in a gnome-terminal. If you do not have this terminal app installed or wish to use another one, you can run python3 spark-magic.py --terminal=<<<unix-terminal-app>>>. Additionally, if you want to disable this feature and only print out the command that you can run in your terminal on your own, you can set
Once a new terminal spawns on your end, the ssh connection with the cluster has been established. You can then launch the provided notebook by running jupyter notebook wordcount.ipynb in your local terminal.
This notebook contains easy-to-follow steps to connect to the Spark cluster and complete this use case! The screenshot below shows what the notebook should look like once you have completed it.
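If you are curious about what happens behind the magics, the sketch below builds the kind of REST requests that Livy accepts: POST /sessions to start an interactive session and POST /sessions/{id}/statements to run a snippet. It is an illustration and not the code of Sparkmagic or of the notebook. It assumes Livy is forwarded to http://localhost:8998, the session id 0 and the path iliad.txt in the snippet are placeholders, and the helper names are hypothetical.

```python
import json
import urllib.request

LIVY_URL = "http://localhost:8998"  # assumption: Livy is forwarded to this local port


def _post(url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def new_session_request(base_url: str = LIVY_URL, kind: str = "pyspark") -> urllib.request.Request:
    """Request asking Livy to start an interactive session of the given kind."""
    return _post(f"{base_url}/sessions", {"kind": kind})


def statement_request(session_id: int, code: str, base_url: str = LIVY_URL) -> urllib.request.Request:
    """Request submitting a code snippet to an existing session."""
    return _post(f"{base_url}/sessions/{session_id}/statements", {"code": code})


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON reply (needs the ssh tunnel to be open)."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Placeholder session id and file path, for illustration only.
req = statement_request(0, "sc.textFile('iliad.txt').flatMap(lambda l: l.split()).count()")
print(req.get_method(), req.full_url)
```

Nothing is sent until you call send(...), so the block can be run safely without a cluster. In practice you do not need any of this, since Sparkmagic issues these calls for you from the notebook cells.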
You also get access to the following forwarded dashboards by typing localhost:<port> in your browser.
- 8088: the Hadoop Yarn UI
- 8998: the Livy server UI
The Hadoop Yarn Dashboard
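If a dashboard does not load, it can help to check whether the forwarded ports answer at all. The following is a minimal sketch that only tries to open a TCP connection to each port listed above on localhost. The function name and the one-second timeout are arbitrary choices.

```python
import socket
from typing import Dict

# Ports listed in this guide and the dashboards they expose.
DASHBOARDS: Dict[int, str] = {8088: "Hadoop Yarn UI", 8998: "Livy server UI"}


def port_is_open(port: int, host: str = "localhost", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


for port, name in DASHBOARDS.items():
    status = "reachable" if port_is_open(port) else "not reachable"
    print(f"{name} on localhost:{port}: {status}")
```

A port reported as not reachable usually means the ssh terminal was closed or the cluster is not ready yet. A reachable port only shows that something is listening, not that the service behind it is healthy.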
At any given time, you can monitor the status of your task on the Console.
Once you are done with the task, just type exit in the ssh terminal to close the tunnel, and make sure to abort the task from your Console. If you do not abort the task manually, it will continue running and use your credits.
That’s it! If you have any questions, please contact firstname.lastname@example.org and we will help you with pleasure!