Session 1: containerising a pipeline

Overview

Teaching: 10 min
Exercises: 50 min

Get ready for the session

By now, you should be connected to the Nimbus VM you were assigned.

First, we need to download the workshop materials from GitHub:

$ cd /data
$ git clone https://github.com/PawseySC/containers-bioinformatics-workshop.git
$ cd containers-bioinformatics-workshop
$ export WORK=$(pwd)

To save you some potentially long downloads, we have cached most of the container images required by this workshop on the VM; this will greatly reduce waiting times when you execute the singularity pull commands below.
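If you're curious, you can list what Singularity has cached locally (this assumes Singularity 3.x; your output will differ):

$ singularity cache list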

Use at least two terminal tabs

We advise you to open at least two terminal tabs and connect to the VM from both. This way you can use one to edit files and the other to execute commands, making your workflow more efficient.
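For example, in each tab you would repeat the same connection command (the username and address below are placeholders; use the credentials you were given at the start of the workshop):

$ ssh <username>@<your-vm-ip>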

Goals

In this first session, we’re going to port a small RNA sequencing pipeline to containers. This is based on the RNAseq-NF repository by the Nextflow developers.
Here is a graphical representation:

Pipeline DAG

The pipeline uses three tools: Salmon, FastQC and MultiQC.

By containerising it, we're going to apply some of the skills we learnt about during the webinars: searching a registry for container images, pulling images with singularity pull, executing commands with singularity exec, and bind mounting host directories.

If you need to brush up on these concepts, have a look back at the corresponding webinar episodes.

Towards the end, we're going to build on these concepts and see how we can simplify the user experience of containerised packages.

GUIDED - Get the container images ready

Let's cd into the directory for this exercise:

$ cd exercises/pipeline/data

and have a look at the pipeline:

$ cat original_pipe.sh
#!/bin/bash

echo "Pipeline started..."

# step 1
cd ../reference
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
salmon index -t ggal_1_48850000_49020000.Ggal71.500bpflank.fa -i out_index &>log_index
cd ../data
echo " indexing completed"

# step 2
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
salmon quant --libType=U -i ../reference/out_index -1 ggal_gut_1.fq -2 ggal_gut_2.fq -o ggal_gut &>log_quant
echo " quantification completed"

# step 3
mkdir out_fastqc
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
fastqc -o out_fastqc -f fastq -q ggal_gut_1.fq ggal_gut_2.fq &>log_fq
echo " quality control completed"

# step 4
mkdir out_multiqc
cd out_multiqc
ln -s ../ggal_gut .
ln -s ../out_fastqc .
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
multiqc -v . &>../log_mq
cd ..
echo " multiple quality control completed"

echo "Pipeline finished!"

You can see there's some printing here and there using echo, to monitor progress at runtime.
Some housekeeping is also done, such as changing directories and creating output directories.

There are just four key execution lines though, making use of three packages (let's stick to these versions; they have been tested for this session and proven to work):

  • Salmon 1.2.1
  • FastQC 0.11.9
  • MultiQC 1.9

Our first task is to look for container images for this pipeline. For each package, we’re going to use the most recent image available in the registry. Let’s discuss in detail the case of Salmon.

Find a container image for Salmon

Today we're using the web registry Quay, at https://quay.io, to search for the images we need. This registry hosts all the images provided by the BioContainers project, so there's a good chance of finding what we need here. We could have gone directly to the BioContainers home page, https://biocontainers.pro, however its user interface is a bit less friendly right now.

Now let's try and find the most recent container image for Salmon version 1.2.1 by BioContainers, using the Quay website.

Solution

  • Type salmon in the Search field at the top right of the page; alternatively, use the Explore button at the top;
  • In the results list, look for biocontainers/salmon and click on it;
  • Click on the Tags icon on the left, and locate the most recent tag for version 1.2.1 at the top of the list; it should be 1.2.1--hf69c8f4_0;
  • Click on the Fetch icon at the rightmost side of the record, select Pull by Tag, and then select and copy to clipboard the string starting with quay.io/, i.e. quay.io/biocontainers/salmon:1.2.1--hf69c8f4_0.
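If you prefer the command line, Quay also exposes a search API. Here is a sketch of an equivalent query; the /api/v1/find/repositories endpoint and the response fields are assumptions based on Quay's public API, and jq is only used to trim the output:

$ curl -s "https://quay.io/api/v1/find/repositories?query=salmon" | jq '.results[].name'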

Pull the container image for Salmon

To this end, let's use the appropriate singularity command.

Solution

$ singularity pull docker://quay.io/biocontainers/salmon:1.2.1--hf69c8f4_0

At the end, a SIF image file for Salmon has been downloaded:

$ ls salmon*
salmon_1.2.1--hf69c8f4_0.sif
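As a quick sanity check, you can also print the metadata (labels) that ship with the image:

$ singularity inspect salmon_1.2.1--hf69c8f4_0.sif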

Find and pull images for FastQC and MultiQC

To save time, we won't browse Quay for these other two images.
Here are the two image repositories you need:

docker://quay.io/biocontainers/fastqc:0.11.9--0
docker://quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0

Can you download them with Singularity?

Solution

$ singularity pull docker://quay.io/biocontainers/fastqc:0.11.9--0
$ singularity pull docker://quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0
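Equivalently, when pulling several images it can be handy to loop over the URIs (a sketch using the same repositories as above):

for uri in \
    docker://quay.io/biocontainers/fastqc:0.11.9--0 \
    docker://quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0 ; do
  singularity pull $uri
done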

SIF images in your current directory

We’re in $WORK/exercises/pipeline/data.

Right now we should have these three images in it, together with the inputs and scripts for the pipeline:

Output

$ ls
clean_outputs.sh              images.sh                     solutions
fastqc_0.11.9--0.sif          multiqc_1.9--pyh9f0ad1d_0.sif wrappers
ggal_gut_1.fq                 original_pipe.sh
ggal_gut_2.fq                 salmon_1.2.1--hf69c8f4_0.sif

GUIDED - Test that the downloaded images work

It is good practice to check whether the applications in the container images you download actually work.
This can range from testing whether the help output can be displayed, to running small input cases.

In this context, to save time you'll just test ONE image, by running an application of your choice with the --help flag, e.g. multiqc --help.

Example: multiqc

Solution

$ singularity exec ./multiqc_1.9--pyh9f0ad1d_0.sif multiqc --help
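If you do want to check all three images, a small bash loop along these lines works, too (a sketch; it assumes each SIF filename starts with the command name, as ours do, and trims each help output to its first lines):

for tool in salmon fastqc multiqc ; do
  sif=$(ls ${tool}_*.sif)
  echo "== testing $sif =="
  singularity exec ./$sif $tool --help 2>&1 | head -3
done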

ROOM - Containerise the pipeline - first pass

Now it's time for you to work a bit with Singularity!

Make a copy of the original pipeline to work with:

$ cp original_pipe.sh pipe.1.sh

For this first pass, you will need to modify ONLY FOUR LINES of this file. Do not change anything else, nor move files around in the directory structure.

Hints

Step 1 - salmon index

Only the relevant edited line is reported in each solution.
Backslashes \ and newlines are used for readability.

Hint: look in the script for the current working directory at this step; you'll need to use an appropriate relative path for the container image.

Solution

singularity exec \
  ../data/salmon_1.2.1--hf69c8f4_0.sif \
  salmon index -t ggal_1_48850000_49020000.Ggal71.500bpflank.fa -i out_index &>log_index

Step 2 - salmon quant

Hint: where are the reference data, compared to the working directory of this step?

Solution

singularity exec \
  -B ../reference \
  ./salmon_1.2.1--hf69c8f4_0.sif \
  salmon quant --libType=U -i ../reference/out_index -1 ggal_gut_1.fq -2 ggal_gut_2.fq -o ggal_gut &>log_quant

Step 3 - fastqc

Solution

singularity exec \
  ./fastqc_0.11.9--0.sif \
  fastqc -o out_fastqc -f fastq -q ggal_gut_1.fq ggal_gut_2.fq &>log_fq

Step 4 - multiqc

Hint: where are the relevant input files/data compared to the current directory?

Solution

singularity exec \
  -B .. \
  ../multiqc_1.9--pyh9f0ad1d_0.sif \
  multiqc -v . &>../log_mq

When you're happy with your updated pipeline script, run it:

$ ./pipe.1.sh

It should take under a minute. If it runs to completion, among other outputs you should have the file out_multiqc/multiqc_report.html, about 1.2 MB in size:

$ ls -lh out_multiqc/multiqc_report.html 
-rw-r--r-- 1 ubuntu ubuntu 1.2M Jun  8 13:01 out_multiqc/multiqc_report.html

ROOM - Containerise the pipeline - second pass

Now, let's build on this first successful iteration of a containerised pipeline.

You're going to edit it to make wider use of environment variables.
This will make the script tidier and more readable; some patterns that were not evident at first might even emerge.

Let’s clean the outputs from the previous run:

$ ./clean_outputs.sh

and then make a new copy of the pipeline script (if you had trouble, you can pick it up from solutions/pipe.1.sh):

$ cp pipe.1.sh pipe.2.sh

Container image paths

At the beginning of the script, just after the shebang #!/bin/bash, define variables for each of the three container images we’re using. Use absolute paths, for instance:

salmon_image="$WORK/exercises/pipeline/data/salmon_1.2.1--hf69c8f4_0.sif"

Then, use these variables at the lines where you use each container.

Solution

Only relevant pieces of code are shown:

#!/bin/bash

salmon_image="$WORK/exercises/pipeline/data/salmon_1.2.1--hf69c8f4_0.sif"
fastqc_image="$WORK/exercises/pipeline/data/fastqc_0.11.9--0.sif"
multiqc_image="$WORK/exercises/pipeline/data/multiqc_1.9--pyh9f0ad1d_0.sif"
[..]

singularity exec \
  $salmon_image \
[..]

singularity exec \
  -B ../reference \
  $salmon_image \
[..]

singularity exec \
  $fastqc_image \
[..]

singularity exec \
  -B .. \
  $multiqc_image \
[..]

Bind mounted directories

Can you think of a single, common host directory to bind mount that includes all of the directories required by the pipeline?

Now get rid of the occurrences of the -B flag in the execution lines.
In their place, add an appropriate definition of the variable SINGULARITY_BINDPATH just after the image path definitions at the beginning of the script. Use the common host path you have just identified.

Solution

The path $WORK/exercises/pipeline will do the job.
Note that any parent directory of this one would work, too (just do not mount /, or the container will crash).

#!/bin/bash

salmon_image="$WORK/exercises/pipeline/data/salmon_1.2.1--hf69c8f4_0.sif"
fastqc_image="$WORK/exercises/pipeline/data/fastqc_0.11.9--0.sif"
multiqc_image="$WORK/exercises/pipeline/data/multiqc_1.9--pyh9f0ad1d_0.sif"

export SINGULARITY_BINDPATH="$WORK/exercises/pipeline"
[..]

singularity exec \
  $salmon_image \
[..]

singularity exec \
  $salmon_image \
[..]

singularity exec \
  $fastqc_image \
[..]

singularity exec \
  $multiqc_image \
[..]
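You can verify that the bind mount works by listing the pipeline directory from inside one of the containers (a quick check, assuming you export the same variable in your interactive shell):

$ export SINGULARITY_BINDPATH="$WORK/exercises/pipeline"
$ singularity exec salmon_1.2.1--hf69c8f4_0.sif ls $WORK/exercises/pipeline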

Once again, you can test your modified pipeline with:

$ ./pipe.2.sh

Then look for the output file out_multiqc/multiqc_report.html, about 1.2 MB in size.

If you had trouble, you can pick up the final pipeline script for this step from solutions/pipe.2.sh.

ROOM - Questions for Reflection

Here are a few questions designed to kick off the discussion within your breakout room:

  • Go back to https://quay.io in your web browser and look for, say, 3 packages you normally use in your research; can you find them within the biocontainers/ repository?
  • Look back at the two steps of containerisation you just went through. What was the most interesting learning outcome? Which was the hardest bit to digest?
  • Look at the final containerised pipeline. Do you see any patterns in the way it looks compared to the original one?
  • Right now, do you think using containers would save you time compared to building/installing research software yourself?

GUIDED - Streamlining the user experience of containers

Before proceeding, I'll show you the content of the pipeline's main output file, out_multiqc/multiqc_report.html, on my screen.

Now, in the last iteration of the containerised pipeline you might have noticed some patterns.
Indeed, each containerised command has the form:

singularity exec $image <COMMAND> <ARGUMENTS>

where the image variable is defined at the top of the script.

Could this become a sort of template for a bash script that executes a command through a container, without us having to type the full syntax every time?

Let's take the example of salmon, and have a look at this script, which you can find at wrappers/salmon:

#!/bin/bash

image="$WORK/exercises/pipeline/data/salmon_1.2.1--hf69c8f4_0.sif"

cmd="salmon"
args="$@"

singularity exec $image $cmd $args

For clarity, we're using variables to define each component of the template syntax we identified above.
Note how we're defining the arguments variable, args, with the value $@. In bash, $@ represents the list of arguments passed to a script. In practice, when we execute this script, any arguments we put after the script name will be passed on to the containerised command.

We can use this script in place of the more elaborate singularity syntax we explored in the previous exercises.
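One caveat: assigning args="$@" and then expanding it unquoted loses the original argument quoting, so an argument containing spaces would be split. This doesn't matter for our pipeline, but a more robust variant of the wrapper (a sketch, not the file shipped in wrappers/) passes the arguments straight through:

#!/bin/bash

image="$WORK/exercises/pipeline/data/salmon_1.2.1--hf69c8f4_0.sif"

# quoting "$@" preserves each argument exactly as typed
singularity exec "$image" salmon "$@"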

Let’s give it a go!

The directory wrappers contains scripts of this type for the three application commands we need. Each script name matches that of the corresponding command.

Let’s start again by cleaning the outputs from the previous run:

$ ./clean_outputs.sh

and then by making a copy of the original pipeline:

$ cp original_pipe.sh pipe.3.sh

Now, we're going to make only two edits: prepend the wrappers directory to PATH, and define SINGULARITY_BINDPATH as in the previous exercise.

We're making these edits right at the beginning of the script:

#!/bin/bash

PATH="$WORK/exercises/pipeline/data/wrappers:$PATH"
export SINGULARITY_BINDPATH="$WORK/exercises/pipeline"

[..]

Let’s try it:

$ ./pipe.3.sh

After about a minute, the output file out_multiqc/multiqc_report.html will be there, about 1.2 MB in size.

That’s it! We’ve run a containerised pipeline, by using a script that almost looks like the original one. Most of the additional syntax for singularity is inside the wrapper scripts for each command.

Finally, note that for a single containerised application you might need to create multiple wrapper scripts, one for each command that ships with that application (e.g. for BLAST you would have blastn, blastp and so on).
There are ways of further generalising this approach that are beyond the scope of this session.
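As a taste of one such generalisation, here is a hedged sketch of a single generic wrapper (not part of the workshop materials) that works out both the command and the image from the name it is invoked as; you would create one symlink to it per command, and it assumes each SIF filename starts with the command name, as ours do:

#!/bin/bash
# generic wrapper: symlink it as "salmon", "fastqc", "multiqc", ...

cmd=$(basename "$0")                                      # command = the name we were invoked as
image=$(ls "$WORK/exercises/pipeline/data/${cmd}"_*.sif)  # matching SIF image

singularity exec "$image" "$cmd" "$@"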
