This blog post introduces the concept of sequence alignment, the BLAST algorithm, and an example of how to use it on Qarnot.
Deoxyribonucleic acid (DNA) carries the basic information of every living organism. This information, also called the genetic code, orchestrates the other levels of the system, from proteins to cells, tissues and organs. This is why the analysis of DNA, RNA and protein sequences has become a key challenge in biology, and why comparing huge amounts of sequencing data was one of the first obvious problems to solve.
A sequence alignment is a bioinformatics method that arranges and compares two sequences, usually of the same kind (DNA, RNA or protein). Typically there are two input datasets, each containing one or more sequences. The first contains the query, that is, the sequence(s) we want to analyse. The second is called the reference, or database: the set of sequences the query is compared against.
The final output is generally a human-readable text file showing the mismatches or gaps between the queries and the reference sequence(s). A score is attached to each alignment result, based on the similarity and sequence complexity.
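To make the ideas of mismatches, gaps and scores concrete, here is a minimal sketch of a classic global alignment (Needleman-Wunsch). It is only an illustration: BLAST itself relies on a faster local heuristic, and the scoring values used here (match=1, mismatch=-1, gap=-2) are arbitrary assumptions, not BLAST defaults.

```python
def global_align(query, reference, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment.

    Returns (score, aligned_query, aligned_reference), where '-' marks a gap.
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(query) + 1, len(reference) + 1

    def substitution(i, j):
        # Score for aligning query[i-1] with reference[j-1].
        return match if query[i - 1] == reference[j - 1] else mismatch

    # Fill the dynamic-programming table.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # match or mismatch
                score[i - 1][j] + gap,                     # gap in the reference
                score[i][j - 1] + gap,                     # gap in the query
            )

    # Trace back from the bottom-right corner to rebuild the alignment.
    i, j = len(query), len(reference)
    aligned_query, aligned_reference = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            aligned_query.append(query[i - 1])
            aligned_reference.append(reference[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_query.append(query[i - 1])
            aligned_reference.append("-")
            i -= 1
        else:
            aligned_query.append("-")
            aligned_reference.append(reference[j - 1])
            j -= 1

    return (
        score[-1][-1],
        "".join(reversed(aligned_query)),
        "".join(reversed(aligned_reference)),
    )


if __name__ == "__main__":
    total, q, r = global_align("ACGT", "ACT")
    print(total)
    print(q)
    print(r)
```

With these settings, the query ACGT aligns to the reference ACT by opening a single gap in the reference, which costs less than forcing mismatches.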
DNA sequence alignment lets us interpret the results as point mutations, insertions or deletions, such as Single Nucleotide Polymorphisms (SNPs) or Single Nucleotide Variants (SNVs). Alignment is used with High-Throughput Sequencing (HTS) data, either to match the query sequences against a known sequence or de novo. In the same way, RNA sequence alignment can also be used to quantify gene expression. Finally, protein sequence alignment makes it possible to visualize conserved regions and motifs, giving a functional point of view based on the most representative amino acids. Another representation of this kind of alignment is a sequence logo (see example below).
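As an illustration of how an alignment can be read as point mutations, insertions and deletions, the sketch below walks through two already-aligned sequences column by column. The function name list_variants and the tuple format are hypothetical choices made for this example; real variant callers are considerably more elaborate.

```python
def list_variants(aligned_query, aligned_reference):
    """List the differences between two aligned sequences ('-' marks a gap).

    Returns tuples (kind, reference_position, detail), with 1-based positions
    counted on the ungapped reference. For an insertion, the position is the
    reference base after which the extra base appears.
    """
    if len(aligned_query) != len(aligned_reference):
        raise ValueError("aligned sequences must have the same length")

    variants = []
    ref_pos = 0
    for q, r in zip(aligned_query, aligned_reference):
        if r != "-":
            ref_pos += 1  # only real reference bases advance the position
        if q == r:
            continue
        if r == "-":
            variants.append(("insertion", ref_pos, q))
        elif q == "-":
            variants.append(("deletion", ref_pos, r))
        else:
            variants.append(("substitution", ref_pos, f"{r}>{q}"))
    return variants


if __name__ == "__main__":
    for variant in list_variants("ACGTTA", "AC-TGA"):
        print(variant)
```

Each gap column is reported separately here, so a longer insertion or deletion shows up as several consecutive single-base events.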
Basic Local Alignment Search Tool (BLAST) is best known as an online, web-based tool for finding regions of similarity between biological sequences. The program compares nucleotide sequences to sequence databases and computes the statistical significance of the matches. There are different specific tools depending on the type of sequencing data, but in this article we focus on blastn, which aligns nucleotide sequences against nucleotide sequences.
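BLAST finds regions of local similarity quickly by first looking for short exact words shared by the query and the reference, then extending those hits. The sketch below is a heavily simplified illustration of that seed-and-extend idea: the function names are hypothetical, the word size of 4 is an arbitrary assumption chosen for readability, and the extension stops at the first mismatch, whereas the real program uses a scoring scheme and statistics.

```python
from collections import defaultdict


def find_seeds(query, reference, word_size=4):
    """Return (query_index, reference_index) pairs where a word matches exactly."""
    # Index every word of the reference by its start position.
    index = defaultdict(list)
    for j in range(len(reference) - word_size + 1):
        index[reference[j:j + word_size]].append(j)

    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, reference, i, j, word_size=4):
    """Extend a seed in both directions without gaps while bases are identical.

    Returns (query_start, reference_start, length) of the extended region.
    """
    start_i, start_j = i, j
    while start_i > 0 and start_j > 0 and query[start_i - 1] == reference[start_j - 1]:
        start_i -= 1
        start_j -= 1

    end_i, end_j = i + word_size, j + word_size
    while end_i < len(query) and end_j < len(reference) and query[end_i] == reference[end_j]:
        end_i += 1
        end_j += 1

    return start_i, start_j, end_i - start_i


if __name__ == "__main__":
    query, reference = "TTACGTACGG", "GGGACGTACCC"
    for i, j in find_seeds(query, reference):
        print((i, j), extend_seed(query, reference, i, j))
```

Several seeds can land in the same shared region, in which case they all extend to the same segment; a fuller implementation would merge them.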
BLAST on Qarnot
In this part we describe a simple example of using BLAST, and more specifically the blastn tool, on Qarnot with the Python SDK. We will align a list of query DNA sequences against another list of reference DNA sequences.
Before we start, you need to create a Qarnot account; we offer €15 worth of computation with your subscription.
First, in a Qarnot_blastn_example folder, create a folder named blastn_resources and save the following data inside it; it contains two local sequences. Download the resources here.
Now, let’s use the Qarnot Python SDK to launch the distributed calculation. Save the following script as run.py in your Qarnot_blastn_example folder. To use our platform, you need to enter in this script the Qarnot token linked to your account (you can find it in the API Token section of your Qarnot account).
In the Qarnot_blastn_example folder, follow these steps to set up a Python virtual environment. Then, you can run the Python script from your terminal by typing chmod +x run.py and then ./run.py.
To summarize, the workflow provided by this project submits two consecutive tasks:
- Build a database based on a reference sequence;
- Align a sequence (query) from a fasta file to the previously built database.
The database is built and passed on temporarily through a Bucket, which makes it possible to work in a stateless mode.
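The two tasks above map onto two BLAST+ command lines, one per task. As a rough sketch, the helpers below only build those command strings; they do not submit anything to Qarnot. The file names reference.fasta and query.fasta and the database name reference_db are placeholders assumed for this example (only results.out comes from this article), and the helper names themselves are hypothetical.

```python
import shlex


def makeblastdb_command(reference_fasta, db_name):
    """Command for the first task: build a nucleotide database from the reference."""
    return shlex.join(
        ["makeblastdb", "-in", reference_fasta, "-dbtype", "nucl", "-out", db_name]
    )


def blastn_command(query_fasta, db_name, output_file="results.out"):
    """Command for the second task: align the query against the built database."""
    return shlex.join(
        ["blastn", "-query", query_fasta, "-db", db_name, "-out", output_file]
    )


if __name__ == "__main__":
    # Placeholder file names, assumed for illustration only.
    print(makeblastdb_command("reference.fasta", "reference_db"))
    print(blastn_command("query.fasta", "reference_db"))
```

Because the second command depends on the files produced by the first, the database files have to be handed over between the tasks, which is the role the Bucket plays in the workflow.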
You can then see the task details in your own console or in the Qarnot console by clicking on your task. Once it has finished, the alignment results (results.out) are downloaded to your computer.
The exercise here is straightforward and walks you through the first steps of submitting a simple bioinformatics job on Qarnot with the Python SDK. You could also use Jobs directly to manage the task dependencies, instead of handling them in the Python code introduced here. Don’t hesitate to read the documentation further, or to contact us for advice or support with your needs and projects.