Breakout room 2: BLAST reprise, and RNA sequencing pipeline

Overview

Teaching: 0 min
Exercises: 30 min
Questions
Objectives
  • Re-run containerised BLAST with a custom database location

  • Execute an RNA sequencing pipeline in a container

Goal 1: BLAST reprise

In this first exercise, you’re going to repeat some of the steps of the previous BLAST exercise, with a key difference: this time, the reference database will not be located in the current working directory, but in a distinct path elsewhere in the directory tree. You will then need to explicitly bind mount that path to be able to BLAST the sequence against it.

Start by cd-ing into the exercise directory, which contains the sequence to be BLASTed:

cd /data/bio-intro-containers/exercises/blast_2
ls *.fasta
P04156.fasta

Note that the reference database is in a distinct path, /data/bio-intro-containers/exercises/database_blast:

ls ../database_blast
zebrafish.1.protein.faa.gz

Guided preparation

You can re-use the BLAST image you already downloaded. To save typing (and reduce typos!), store its full path in a shell variable for later use:

export blast_image="/data/bio-intro-containers/exercises/blast_1/blast_2.9.0--pl526h3066fca_4.sif"

Check this all works:

ls $blast_image
/data/bio-intro-containers/exercises/blast_1/blast_2.9.0--pl526h3066fca_4.sif

Now cd into the database directory and prepare the database once again:

cd ../database_blast
gunzip zebrafish.1.protein.faa.gz
singularity exec $blast_image makeblastdb -in zebrafish.1.protein.faa -dbtype prot
Building a new DB, current time: 11/16/2019 19:14:43
New DB name:   /data/bio-intro-containers/exercises/database_blast/zebrafish.1.protein.faa
New DB title:  zebrafish.1.protein.faa
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 52951 sequences in 1.34541 seconds.

And finally cd back to the directory with the sequence file:

cd ../blast_2

Now run the alignment

This is the alignment command:

blastp -query P04156.fasta -db ../database_blast/zebrafish.1.protein.faa -out results.txt

If you run it as you did in the previous exercise, you will get a file not found error:

Error

singularity exec $blast_image blastp -query P04156.fasta -db ../database_blast/zebrafish.1.protein.faa -out results.txt
BLAST Database error: No alias or index file found for protein database [../database_blast/zebrafish.1.protein.faa] in search path [/data/bio-intro-containers/exercises/blast_2::]

Now, try to run the alignment command from the container, this time bind mounting the directory containing the database with the appropriate Singularity syntax.

Solution 1

singularity exec -B ../database_blast $blast_image blastp -query P04156.fasta -db ../database_blast/zebrafish.1.protein.faa -out results.txt

Solution 2

export SINGULARITY_BINDPATH="../database_blast"
singularity exec $blast_image blastp -query P04156.fasta -db ../database_blast/zebrafish.1.protein.faa -out results.txt
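As a side note, SINGULARITY_BINDPATH accepts a comma-separated list, so multiple directories can be bound in one go; here is a sketch with hypothetical paths:

export SINGULARITY_BINDPATH="/path/to/first_dir,/path/to/second_dir"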

More possible solutions

Note that the full path of the database directory is /data/bio-intro-containers/exercises/database_blast.
Therefore, mounting any of these would also work: /data, /data/bio-intro-containers, /data/bio-intro-containers/exercises.
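For instance, here is a minimal sketch of the same command binding the top-level /data directory instead:

singularity exec -B /data $blast_image blastp -query P04156.fasta -db ../database_blast/zebrafish.1.protein.faa -out results.txt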

The final results are stored in results.txt; you can inspect them, for instance, with a pager:
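less results.txt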

Goal 2: RNA sequencing pipeline

In this exercise, you’re going to port a small RNA sequencing pipeline to containers.
This is based on the RNAseq-NF repository by the Nextflow developers. Here is a graphical representation:

Pipeline DAG

The pipeline uses three tools: salmon, fastqc and multiqc.

By containerising it, you’re going to further apply some of the skills you have just learnt: pulling container images, executing commands in containers, and bind mounting host directories.

To save you some potentially long download times, we have pre-cached the container images required by this workshop, so that pull commands will only take a few seconds. To take advantage of this, it’s important that you use the suggested image specifications.
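As a side note, if you’re running Singularity 3.x you can inspect what is currently in the local image cache with:

singularity cache list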

Now, start by changing into the directory for this exercise:

cd /data/bio-intro-containers/exercises/pipeline/data

It contains input files and useful scripts:

ls
clean_outputs.sh ggal_gut_1.fq    ggal_gut_2.fq    images.sh        original_pipe.sh solutions

Use two terminal windows if you can

If you can, open at least two terminal windows, and connect to the VM from both of them. In this way, you can use one to edit files, and one to execute commands, thus making your workflow more efficient.

Preparation: have a look at the pipeline

Let’s have a look at the pipeline with cat:

cat original_pipe.sh

Here are its contents:

#!/bin/bash


echo "Pipeline started..."


# step 1
cd ../reference
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
salmon index -t ggal_1_48850000_49020000.Ggal71.500bpflank.fa -i out_index &>log_index
cd ../data
echo " indexing over"


# step 2
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
salmon quant --libType=U -i ../reference/out_index -1 ggal_gut_1.fq -2 ggal_gut_2.fq -o ggal_gut &>log_quant
echo " quantification over"


# step 3
mkdir -p out_fastqc
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
fastqc -o out_fastqc -f fastq -q ggal_gut_1.fq ggal_gut_2.fq &>log_fq
echo " quality control over"


# step 4
mkdir -p out_multiqc
cd out_multiqc
ln -s ../ggal_gut .
ln -s ../out_fastqc .
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
multiqc -v . &>../log_mq
cd ..
echo " multiple quality control over"


echo "Pipeline finished.  Outcomes not checked, up to you."

You can see there’s some printing here and there using echo, just to monitor progress at runtime.
Some housekeeping is also being done, such as changing directories and creating output directories.

The key execution lines are just four though, and they make use of three packages (let’s stick to these versions: they have been tested for this session and proven to work):

  • salmon 1.2.1
  • fastqc 0.11.9
  • multiqc 1.9

Pull the container images

To save time, you’re not going to search for the images on the online registry.
Instead, your first task is to download the following container images.

IMPORTANT: use only these images, as they have been pre-cached in the virtual machine to speed up the pull processes.

quay.io/biocontainers/salmon:1.2.1--hf69c8f4_0
quay.io/biocontainers/fastqc:0.11.9--0
quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0

Solution

singularity pull docker://quay.io/biocontainers/salmon:1.2.1--hf69c8f4_0
singularity pull docker://quay.io/biocontainers/fastqc:0.11.9--0
singularity pull docker://quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0

SIF images in your current directory

We’re in /data/bio-intro-containers/exercises/pipeline/data.
Check that the downloaded image files (.sif extension) are actually in the current directory, using ls.

Solution

ls *.sif
fastqc_0.11.9--0.sif          multiqc_1.9--pyh9f0ad1d_0.sif salmon_1.2.1--hf69c8f4_0.sif

Containerise the pipeline

Now it’s time to edit the pipeline script and make it work with containers!

Make a copy of the original pipeline to work with:

cp original_pipe.sh pipe.1.sh

You will need to modify ONLY FOUR LINES of this file, namely the ones marked with the comment # ONLY CHANGE THE NEXT LINE - EXECUTION LINE. Do not change anything else, nor move files around in the directory structure.
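In all four cases the general pattern is the same, sketched here with placeholder names: prefix the execution line with singularity exec, a bind mount flag if required, and a suitable relative path to the container image:

singularity exec [-B <dir>] <path/to/image.sif> <original command and arguments>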

Hints

Step 1 - salmon index

Only the relevant edited line is reported for each step.
Backslashes \ and newlines are used for the sake of code readability.

Hint: check in the script what the current directory is for this step; you’ll need to use an appropriate relative path for the container image.

Solution

singularity exec \
  ../data/salmon_1.2.1--hf69c8f4_0.sif \
  salmon index -t ggal_1_48850000_49020000.Ggal71.500bpflank.fa -i out_index &>log_index

Step 2 - salmon quant

Hint: look at the script. Where are the reference data located, compared to the working directory of this step?

Solution

singularity exec \
  -B ../reference \
  ./salmon_1.2.1--hf69c8f4_0.sif \
  salmon quant --libType=U -i ../reference/out_index -1 ggal_gut_1.fq -2 ggal_gut_2.fq -o ggal_gut &>log_quant

Step 3 - fastqc

Solution

singularity exec \
  ./fastqc_0.11.9--0.sif \
  fastqc -o out_fastqc -f fastq -q ggal_gut_1.fq ggal_gut_2.fq &>log_fq

Step 4 - multiqc

Hint: look at the script. Where are the relevant input files located, compared to the current directory?

Solution

singularity exec \
  -B .. \
  ../multiqc_1.9--pyh9f0ad1d_0.sif \
  multiqc -v . &>../log_mq

Note that the symlinks created in out_multiqc point to the parent directory (ln -s ../ggal_gut and ln -s ../out_fastqc), which is why the parent needs to be bind mounted with -B .. for multiqc to resolve them.

Once you’re happy with your updated pipeline script, run it:

./pipe.1.sh

It should take under a minute. If it runs to completion, among other outputs you should have the file out_multiqc/multiqc_report.html, about 1.2 MB in size:

ls -lh out_multiqc/multiqc_report.html 
-rw-r--r-- 1 ubuntu ubuntu 1.2M Jun  8 13:01 out_multiqc/multiqc_report.html
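If you want to re-run the pipeline from scratch, the exercise directory also provides a clean_outputs.sh script which, as the name suggests, should remove the generated outputs (have a quick look at its contents before running it):

./clean_outputs.sh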

You’ve just containerised an entire pipeline, well done!

Entire script solution

Here is the entire solution (solutions/pipe.1.sh):

Solution

#!/bin/bash


echo "Pipeline started..."


# step 1
cd ../reference
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
singularity exec \
  ../data/salmon_1.2.1--hf69c8f4_0.sif \
  salmon index -t ggal_1_48850000_49020000.Ggal71.500bpflank.fa -i out_index &>log_index
cd ../data
echo " indexing over"


# step 2
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
singularity exec \
  -B ../reference \
  ./salmon_1.2.1--hf69c8f4_0.sif \
  salmon quant --libType=U -i ../reference/out_index -1 ggal_gut_1.fq -2 ggal_gut_2.fq -o ggal_gut &>log_quant
echo " quantification over"


# step 3
mkdir -p out_fastqc
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
singularity exec \
  ./fastqc_0.11.9--0.sif \
  fastqc -o out_fastqc -f fastq -q ggal_gut_1.fq ggal_gut_2.fq &>log_fq
echo " quality control over"


# step 4
mkdir -p out_multiqc
cd out_multiqc
ln -s ../ggal_gut .
ln -s ../out_fastqc .
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
singularity exec \
  -B .. \
  ../multiqc_1.9--pyh9f0ad1d_0.sif \
  multiqc -v . &>../log_mq
cd ..
echo " multiple quality control over"


echo "Pipeline finished.  Outcomes not checked, up to you."

Bonus: alternate script with containers

If you want, have a look at this alternate solution (solutions/pipe.2.sh).

It meets the same requirements as above, with two differences that improve the clarity and readability of the script:

  • image paths are stored in variables at the beginning;
  • bind mounts are done at the beginning using the variable SINGULARITY_BINDPATH.

Solution

#!/bin/bash

salmon_image="/data/bio-intro-containers/exercises/pipeline/data/salmon_1.2.1--hf69c8f4_0.sif"
fastqc_image="/data/bio-intro-containers/exercises/pipeline/data/fastqc_0.11.9--0.sif"
multiqc_image="/data/bio-intro-containers/exercises/pipeline/data/multiqc_1.9--pyh9f0ad1d_0.sif"

export SINGULARITY_BINDPATH="/data"


echo "Pipeline started..."


# step 1
cd ../reference
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
singularity exec \
  $salmon_image \
  salmon index -t ggal_1_48850000_49020000.Ggal71.500bpflank.fa -i out_index &>log_index
cd ../data
echo " indexing over"


# step 2
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
singularity exec \
  $salmon_image \
  salmon quant --libType=U -i ../reference/out_index -1 ggal_gut_1.fq -2 ggal_gut_2.fq -o ggal_gut &>log_quant
echo " quantification over"


# step 3
mkdir -p out_fastqc
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
singularity exec \
  $fastqc_image \
  fastqc -o out_fastqc -f fastq -q ggal_gut_1.fq ggal_gut_2.fq &>log_fq
echo " quality control over"


# step 4
mkdir -p out_multiqc
cd out_multiqc
ln -s ../ggal_gut .
ln -s ../out_fastqc .
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
singularity exec \
  $multiqc_image \
  multiqc -v . &>../log_mq
cd ..
echo " multiple quality control over"


echo "Pipeline finished.  Outcomes not checked, up to you."

Key Points

  • Bind mount additional directories in the container with the flag -B, or the variable SINGULARITY_BINDPATH

  • Containerise an entire pipeline by making use of singularity exec <image> and a few other small edits