Introduction
Apache Spark is a fast, general-purpose engine for large-scale data processing on distributed clusters. It provides a simple standalone deploy mode that uses its own resource manager and allows the creation of a distributed master/worker architecture.
Here is a quick step-by-step guide on how to set up a Spark standalone cluster on Qarnot, connect to it via SSH tunneling, and submit a Spark application that counts the number of words in the Iliad.
Version
Software | Release year | Version |
---|---|---|
Hadoop | 2021 | 3.2 |
Spark | 2021 | 3.1.1 |
If you are interested in another version, please send us an email at qlab@qarnot.com.
Prerequisites
Before starting a calculation with the Python SDK, a few steps are required: you need a Qarnot account, your authentication token, and the Python SDK installed on your machine.
Note: in addition to the Python SDK, Qarnot provides C# and Node.js SDKs and a command-line interface (CLI).
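For reference, the Python SDK is available on PyPI and can be installed with pip:

```bash
# Install the Qarnot Python SDK
pip install qarnot
```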
Test case
This tutorial will showcase how to start a Qarnot Spark cluster from your computer by following these simple steps:
- Start the cluster on Qarnot
- Connect to it via SSH tunneling
- Submit your app via the command line interface (CLI)
Before moving forward, you should set up your working environment to contain the following files, which can be downloaded from here:

- `input/iliad.txt`: text file containing the Iliad to be counted on Qarnot.
- `word_count.py`: script for counting the number of words in `iliad.txt` (a sketch of such a script follows this list).
- `ssh-spark.py`: script for starting the cluster on Qarnot, see below.
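The exact `word_count.py` ships with the tutorial download; as a rough idea of what such a PySpark script looks like (an illustrative sketch, not the actual file; only the file names are taken from this tutorial):

```python
# word_count.py -- minimal PySpark word count (illustrative sketch)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iliad-word-count").getOrCreate()

# Read the Iliad, split every line on whitespace, and count the resulting words
words = spark.sparkContext.textFile("iliad.txt").flatMap(lambda line: line.split())
count = words.count()

# Write the result to the file named in the Results section below
with open("output_iliad.txt", "w") as f:
    f.write(f"Number of words in the Iliad: {count}\n")

spark.stop()
```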
Launching the test case
Once everything is set up, use the following script in your terminal to start the cluster on Qarnot.
Be sure to edit `ssh-spark.py` before running it:

- Copy your authentication token in place of `<<<MY_SECRET_TOKEN>>>` on line 10.
- Copy your public SSH key in place of `<<<PUBLIC_SSH_KEY>>>` on line 26. By default, your public SSH key can be found in `~/.ssh/<<<ssh_key>>>.pub`.
- Optionally, change `<<<PORT>>>` on line 66 to the port you want to use for SSH tunneling with Qarnot.
ssh-spark.py
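The full script is included in the download above, and the line numbers mentioned in the steps refer to it. As a heavily abbreviated sketch of its structure (the profile name, the `DOCKER_SSH` constant, and the instance count are assumptions; check the real script):

```python
#!/usr/bin/env python3
# ssh-spark.py -- abbreviated sketch of the cluster-launching script.
# The profile name ("spark"), the DOCKER_SSH constant, and the instance
# count are assumptions; refer to the full script from the download.
import qarnot

# Connect to the Qarnot API with your authentication token
conn = qarnot.Connection(client_token="<<<MY_SECRET_TOKEN>>>")

# Input bucket with the Iliad and the word-count script; output bucket for the results
input_bucket = conn.create_bucket("spark-ssh-in")
input_bucket.sync_directory("input")
input_bucket.add_file("word_count.py")
output_bucket = conn.create_bucket("spark-ssh-out")

# Create the Spark cluster task and attach the buckets
task = conn.create_task("spark-ssh", "spark", 4)
task.resources.append(input_bucket)
task.results = output_bucket

# Authorize your public SSH key on the cluster nodes
task.constants["DOCKER_SSH"] = "<<<PUBLIC_SSH_KEY>>>"

# Submit; the full script then waits for the task to start, opens an SSH
# tunnel on <<<PORT>>> and spawns a terminal (gnome-terminal by default)
task.submit()
```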
To launch this script, simply copy the full code into a Python script and execute `python3 ssh-spark.py` in your terminal.

By default, it will connect you to Qarnot via SSH in a `gnome-terminal`. If you do not have this terminal app installed or wish to use another one, you can run `python3 ssh-spark.py --terminal=<<<unix-terminal-app>>>`. Additionally, if you want to disable this feature and only print out the command so that you can run it in your own terminal, you can set `--terminal=off`.
Once a new terminal spawns on your end, it means that the SSH connection with the cluster has been established. All you have to do is run the following commands in your SSH terminal:
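The exact commands ship with the tutorial download; as an illustration, submitting the app to the standalone master typically looks like the following (the master URL and file locations are assumptions):

```bash
# Submit the word-count application to the standalone Spark master
spark-submit --master spark://localhost:7077 word_count.py
```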
If you see the terminal pop up and quickly disappear, it most likely means that the port you chose is currently busy and the connection cannot be established. You can change which port to use and try again.
You can access the forwarded Spark UI dashboard through the SSH tunnel by typing `localhost:<<<PORT>>>` in your browser, and monitor your cluster's status and the different apps you have submitted.
Results
At any given time, you can monitor the status of your task on Tasq.
The output of the count is written to a text file `output_iliad.txt` that can be found in your output bucket `spark-ssh-out`, along with the master and worker logs.
Spark master log
Once you are done with the task, just type `exit` in the SSH terminal to close the tunnel, and make sure to abort the task from Tasq. If you do not abort the task manually, it will continue running and consuming your credits.
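Alternatively, the task can also be aborted from the Python SDK, assuming you still have a handle on the `task` object created in `ssh-spark.py`:

```python
# Stop the cluster task so it no longer consumes credits
task.abort()
```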
Wrapping up
That’s it! If you have any questions, please contact qlab@qarnot.com and we will help you with pleasure!