The Official Qarnot Blog

Spark with SSH on Qarnot – documentation


by Mehdi Oumnih - October 28, 2021 - Data science

Introduction

Spark is a fast, general-purpose engine for large-scale data processing and computing on a distributed cluster. Apache Spark provides a simple standalone deploy mode that uses its own resource manager and allows the creation of a distributed master-worker architecture.

Here is a quick step-by-step guide on how to set up a Spark Standalone Cluster on Qarnot, connect to it via SSH tunneling and submit a Spark application that counts the number of words in the Iliad.

Version

Software    Release year    Version
Hadoop      2021            3.2
Spark       2021            3.1.1

If you are interested in another version, please send us an email at qlab@qarnot.com.

Prerequisites

Before starting a calculation with the Python SDK, a few steps are required:

  • Retrieve the authentication token (here)
  • Install Qarnot’s Python SDK (here)

Note: in addition to the Python SDK, Qarnot provides C# and Node.js SDKs and a command line interface (CLI).

Test case

This tutorial will showcase how to start a Qarnot Spark cluster from your computer by following these simple steps:

  • Start the cluster on Qarnot
  • Connect to it via SSH tunneling
  • Submit your app via the command line interface (CLI)

Before moving forward, you should set up your working environment so that it contains the following files, which can be downloaded from here:

  • spark-ssh-input
    • iliad.txt: text file containing the Iliad, whose words will be counted on Qarnot.
    • word_count.py: script for counting the number of words in iliad.txt
  • ssh-spark.py: script for starting the cluster on Qarnot, see below
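For reference, the kind of counting done on the cluster can be reproduced locally in plain Python. The sketch below is not the word_count.py from the archive and does not use Spark: it is a hypothetical stand-in that mimics the usual split, map and reduce stages of a word count, which can be handy for sanity-checking the result on a small sample. The function names and the whitespace-based, case-insensitive tokenization are assumptions.

```python
from collections import Counter
from typing import Iterable


def count_words(lines: Iterable[str]) -> Counter:
    """Count occurrences of each whitespace-separated word, ignoring case."""
    counts: Counter = Counter()
    for line in lines:
        # Split stage: break each line into lowercase words.
        words = line.lower().split()
        # Map and reduce stages: add one per occurrence of each word.
        counts.update(words)
    return counts


def total_words(counts: Counter) -> int:
    """Total number of words, i.e. the sum of all per-word counts."""
    return sum(counts.values())


def count_file(path: str) -> Counter:
    """Count the words of a UTF-8 text file, reading it line by line."""
    with open(path, encoding="utf-8") as handle:
        return count_words(handle)
```

Calling count_file("spark-ssh-input/iliad.txt") and then total_words on the result gives a local figure to compare with the cluster output. Punctuation is left attached to words here, so totals may differ from a script that tokenizes differently.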

Launching the test case

Once everything is set up, use the following script in your terminal to start the cluster on Qarnot.

Before running it, replace the placeholders in the script ssh-spark.py. By default, your public SSH key can be found in ~/.ssh/<<<ssh_key>>>.pub.

  • Copy your authentication token in place of <<<MY_SECRET_TOKEN>>> in line 10.
  • Copy your public SSH key in place of <<<PUBLIC_SSH_KEY>>> in line 26.
  • Lastly, you can change <<<PORT>>> in line 66 to the port you want to use for SSH tunneling with Qarnot.
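A forgotten placeholder is an easy mistake to make, so it can help to check the script before launching it. The following is a minimal sketch, not part of ssh-spark.py: the helper names are hypothetical, and it only assumes the <<<NAME>>> placeholder convention used in this guide.

```python
import re
from pathlib import Path
from typing import List

# Matches the <<<NAME>>> placeholder convention used in this guide.
PLACEHOLDER = re.compile(r"<<<([A-Za-z_-]+)>>>")


def remaining_placeholders(script_text: str) -> List[str]:
    """Return the names of placeholders still present, in order, without duplicates."""
    seen: List[str] = []
    for name in PLACEHOLDER.findall(script_text):
        if name not in seen:
            seen.append(name)
    return seen


def read_public_key(path: Path) -> str:
    """Read a public key file and return its content without surrounding whitespace."""
    return path.read_text(encoding="utf-8").strip()
```

For example, remaining_placeholders(Path("ssh-spark.py").read_text()) returns an empty list once every placeholder has been replaced, and read_public_key can be pointed at the .pub file in ~/.ssh to obtain the string to paste.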

ssh-spark.py

To launch this script, simply copy the above code into a Python script and execute python3 ssh-spark.py in your terminal.

By default, the script connects you to Qarnot via SSH in a gnome-terminal window. If you do not have this terminal application installed, or wish to use another one, you can run python3 ssh-spark.py --terminal=<<<unix-terminal-app>>>. Additionally, if you want to disable this feature and only print the command so that you can run it in a terminal yourself, you can set --terminal=off.
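To illustrate how such a --terminal option can behave, here is a minimal sketch using argparse. It is not the code of ssh-spark.py: the function names are hypothetical, and the assumption that terminals other than gnome-terminal accept "-e <command>" holds for xterm and similar applications but not necessarily for all of them.

```python
import argparse
import shlex
import subprocess
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the command line; --terminal defaults to gnome-terminal."""
    parser = argparse.ArgumentParser(description="Open an SSH session to the cluster.")
    parser.add_argument(
        "--terminal",
        default="gnome-terminal",
        help="terminal application to spawn, or 'off' to only print the command",
    )
    return parser.parse_args(argv)


def terminal_command(terminal: str, ssh_command: List[str]) -> Optional[List[str]]:
    """Return the command that opens ssh_command in a new terminal, or None if disabled."""
    if terminal == "off":
        return None
    if terminal == "gnome-terminal":
        # gnome-terminal takes the command to run after a "--" separator.
        return [terminal, "--", *ssh_command]
    # Assumption: other terminals (xterm and similar) accept "-e <command>".
    return [terminal, "-e", *ssh_command]


def launch(terminal: str, ssh_command: List[str]) -> None:
    """Spawn the terminal, or print the SSH command when the feature is off."""
    command = terminal_command(terminal, ssh_command)
    if command is None:
        print(shlex.join(ssh_command))
    else:
        subprocess.Popen(command)
```

With --terminal=off, launch only prints a shell-quoted command that can be pasted into any terminal.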

Once a new terminal opens on your end, the SSH connection with the cluster has been established. All you have to do is run the following commands in your SSH terminal:

If you see the terminal pop up and quickly disappear, it most likely means that the port you chose is currently busy and the connection cannot be established. You can change which port to use and try again.
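Whether a local port is busy can be checked before launching the script. The sketch below uses only the Python standard library; the helper names are hypothetical and it is independent of ssh-spark.py. Note that a port reported as free can still be taken by another program between the check and the actual connection.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use": another program holds the port.
            return False
    return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the operating system for an unused TCP port and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        # Binding to port 0 lets the OS pick any available port.
        probe.bind((host, 0))
        return probe.getsockname()[1]
```

If is_port_free returns False for the port you planned to use, find_free_port suggests a replacement value for <<<PORT>>>.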

You can access the Spark UI dashboard, which is forwarded through the SSH tunnel, by typing localhost:<<<PORT>>> in your browser. There you can monitor your cluster's status and the different apps you have submitted.
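The forwarding itself relies on SSH local port forwarding (the -L option of ssh). As a rough sketch of what such a command looks like, the helper below assembles one. The host name, SSH port and user are placeholders that the real script obtains from Qarnot, the function names are hypothetical, and the remote port 8080 is the default web UI port of a Spark standalone master, which the cluster image may override.

```python
from typing import List

# Default web UI port of a Spark standalone master (assumed unchanged on the cluster).
SPARK_MASTER_UI_PORT = 8080


def build_tunnel_command(
    local_port: int,
    ssh_host: str,
    ssh_port: int,
    user: str,
    remote_ui_port: int = SPARK_MASTER_UI_PORT,
) -> List[str]:
    """Build an ssh command forwarding localhost:local_port to the remote Spark UI."""
    for port in (local_port, ssh_port, remote_ui_port):
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")
    return [
        "ssh",
        # -L local_port:host:remote_port forwards a local port through the SSH session.
        "-L", f"{local_port}:localhost:{remote_ui_port}",
        "-p", str(ssh_port),
        f"{user}@{ssh_host}",
    ]


def dashboard_url(local_port: int) -> str:
    """URL to open in the browser once the tunnel is up."""
    return f"http://localhost:{local_port}"
```

While the SSH session started with such a command stays open, the dashboard is reachable at the address returned by dashboard_url.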

Results

At any given time, you can monitor the status of your task on the Console.

The output of the count is written to a text file, output_iliad.txt, which can be found in your output bucket spark-ssh-out along with the master and worker logs.

Spark master log

Once you are done with the task, just type exit in the SSH terminal to close the tunnel, and make sure to abort the task from your Console. If you do not abort the task manually, it will keep running and consume your credits.

Wrapping up

That's it! If you have any questions, please contact qlab@qarnot.com and we will be happy to help!

 

 
