Spark is a fast and general engine for large-scale data processing and computing on a distributed cluster. Apache Spark provides a simple standalone deploy mode that uses its own resource manager and allows the creation of a distributed master-slave architecture.
Here is a quick step-by-step guide on how to set up a Spark Standalone Cluster on Qarnot, connect to it via SSH tunneling, and submit a Spark application that counts the number of words in the Iliad.
If you are interested in another version, please send us an email at email@example.com.
Before starting a calculation with the Python SDK, a few steps are required:
This tutorial will showcase how to start a Qarnot Spark cluster from your computer by following these simple steps:
- Start the cluster on Qarnot
- Connect to it via SSH tunneling
- Submit your app via the command line interface (CLI)
Before moving forward, you should set up your working environment to contain the following files, which can be downloaded from here:
- iliad.txt: text file containing the Iliad, whose words will be counted on Qarnot.
- word_count.py: script for counting the number of words in the Iliad.
- ssh-spark.py: script for starting the cluster on Qarnot, see below.
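For orientation, a word count in PySpark can be sketched as follows. This is not the word_count.py distributed with the tutorial, only a minimal sketch under assumptions: it computes a single total rather than per-word frequencies, it takes the input and output paths as two command-line arguments, and the names `tokenize`, `count_words`, and `main` are hypothetical.

```python
import re
import sys


def tokenize(line):
    """Split a line into lowercase words, dropping punctuation and digits."""
    return re.findall(r"[a-z']+", line.lower())


def count_words(spark, path):
    """Return the total number of words in the text file at `path`."""
    lines = spark.sparkContext.textFile(path)
    # flatMap turns each line into zero or more words; count() sums them up.
    return lines.flatMap(tokenize).count()


def main(argv):
    # Imported here so that tokenize() stays usable without Spark installed.
    from pyspark.sql import SparkSession

    input_path, output_path = argv[1], argv[2]
    spark = SparkSession.builder.appName("word_count").getOrCreate()
    try:
        total = count_words(spark, input_path)
    finally:
        spark.stop()
    with open(output_path, "w") as out:
        out.write(f"{total}\n")


# Assumed invocation: word_count.py <input file> <output file>
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

With this layout the script would be handed to Spark together with the two paths, for example iliad.txt and output_iliad.txt.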
Launching the test case
Once everything is set up, use the following script in your terminal to start the cluster on Qarnot.
Be sure to copy your authentication token and your public SSH key (in place of <<<PUBLIC_SSH_KEY>>>) into the script ssh-spark.py. By default, your public SSH key can be found in
- Be sure to copy your authentication token in place of <<<MY_SECRET_TOKEN>>> in line 10.
- Copy your public SSH key into the script in place of <<<PUBLIC_SSH_KEY>>> in line 26.
- Lastly, you can change <<<PORT>>> in line 66 to the port you want to use for SSH tunneling with Qarnot.
To launch this script, simply copy the above code into a Python script and execute python3 ssh-spark.py in your terminal.
By default, it will connect you to Qarnot via SSH in a gnome-terminal. If you do not have this terminal app installed or wish to use another one, you can run python3 run.py --terminal=<<<unix-terminal-app>>>. Additionally, if you want to disable this feature and only print out the command so that you can run it in your terminal on your own, you can set
Once a new terminal spawns on your end, it means that the SSH connection with the cluster is established. All you have to do is run the following commands in your SSH terminal:
If you see the terminal pop up and quickly disappear, it most likely means that the port you chose is currently busy and the connection cannot be established. You can change which port to use and try again.
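To rule out a busy port before retrying, you can test whether the local port is available. The snippet below is a minimal sketch using only the Python standard library; the helper name `is_port_free` is hypothetical and is not part of ssh-spark.py or the Qarnot SDK.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if `port` can currently be bound on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
        return True


if __name__ == "__main__":
    port = 8080  # example value; use the port you chose for <<<PORT>>>
    print(f"port {port} is", "free" if is_port_free(port) else "busy")
```

If the check reports the port as busy, pick another value for <<<PORT>>> and launch the script again.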
You can access the Spark UI dashboard, forwarded through the SSH tunnel, by typing localhost:<<<PORT>>> in your browser. There you can monitor your cluster's status and the different apps you have submitted.
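The forwarding behind this dashboard is standard SSH local port forwarding (`ssh -L`). The sketch below builds such a command; it is an illustration, not the command generated by ssh-spark.py. The host name, SSH port, and `root` user are placeholder values, and the remote port assumes Spark's default standalone master web UI port, 8080.

```python
import shlex

# Default web UI port of a Spark standalone master (assumed to be unchanged here).
SPARK_MASTER_UI_PORT = 8080


def build_tunnel_command(host, ssh_port, local_port, user="root",
                         remote_port=SPARK_MASTER_UI_PORT):
    """Return the argv list for an SSH session that also forwards the Spark UI."""
    return [
        "ssh",
        # Forward localhost:<local_port> on your machine to the UI port on the master.
        "-L", f"{local_port}:localhost:{remote_port}",
        "-p", str(ssh_port),
        f"{user}@{host}",
    ]


if __name__ == "__main__":
    # Hypothetical forwarding endpoint; the real one is reported by your task.
    command = build_tunnel_command("forward.example.com", 12345, 8080)
    print(shlex.join(command))
```

While such a session is open, the dashboard is reachable at the local port passed as `local_port`.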
At any given time, you can monitor the status of your task on the Console.
The output of the count is written to a text file, output_iliad.txt, which can be found in your output bucket spark-ssh-out along with the master and worker logs.
Spark master log
Once you are done with the task, just type exit in the SSH terminal to close the tunnel, and make sure to abort the task from your Console. If you do not abort the task manually, it will continue running and use your credits.
That’s it! If you have any questions, please contact firstname.lastname@example.org and we will be happy to help!