Reproducible scientific workflows
Overview
Teaching: 5 min
Exercises: 10 min
Questions
Objectives
Get an idea of the interplay between containers and workflow engines
Scientific workflow engines
Let’s see how Singularity containers can be used in conjunction with a popular workflow engine.
Scientific workflow engines are particularly useful in data-intensive domains, including but not restricted to bioinformatics and radioastronomy, where data analysis and processing consist of a number of tasks that are executed repeatedly across large datasets. Some of the most popular engines, including Nextflow and Snakemake, provide interfaces to container engines. Combining container and workflow engines can be very effective in making science reproducible, portable, and scalable.
Now, let’s try and use Singularity and Nextflow to run a demo RNA sequencing pipeline based on RNAseq-NF.
Install Nextflow
First, if it’s not already on your system, you’ll need to install Nextflow. This requires a Java runtime and the Nextflow executable. Downloading all of the required dependencies takes a few minutes, but the process is fairly automated. A template install script for a Linux box is provided at $SC19/files/install-nextflow.sh.
If you’re running on the Pawsey Nimbus cloud, just run that script:
$ bash $SC19/files/install-nextflow.sh
If you’re running at Pawsey, e.g. on Zeus, all you need is:
$ module load nextflow
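The template install script itself is not reproduced in this text. As a rough idea of what such an install involves, here is a minimal sketch, assuming a Debian/Ubuntu box with apt and sudo access. The function name install_nextflow, the helper run, the variables DRY_RUN and INSTALL_DIR, and the choice of the default-jre package are all illustrative assumptions, not the contents of install-nextflow.sh. The sketch only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a Nextflow install on a Debian/Ubuntu box.
# NOT the workshop's install-nextflow.sh; names and packages are assumptions.

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# The body runs in a subshell so that 'set -eu' does not leak into the caller.
install_nextflow() (
  set -eu
  install_dir="${INSTALL_DIR:-$HOME/bin}"   # assumed target directory

  # 1. Java runtime (package name is an assumption for Debian/Ubuntu)
  run sudo apt-get update
  run sudo apt-get install -y default-jre

  # 2. Nextflow launcher, fetched with the bootstrap script from get.nextflow.io;
  #    it creates an executable called 'nextflow' in the current directory
  run mkdir -p "$install_dir"
  run bash -c "cd '$install_dir' && curl -fsSL https://get.nextflow.io | bash"

  # 3. Check that the executable starts
  run "$install_dir/nextflow" -version
)

install_nextflow
```

To actually perform the install rather than print the steps, the sketch would be run with DRY_RUN=0 set in the environment.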
Run a workflow using Singularity and Nextflow
Let’s cd into the appropriate directory:
$ cd $SC19/demos/10_nextflow
For convenience, the content of the pipeline RNAseq-NF is already available in this directory. There are two critical files here: main.nf, which contains the translation of the scientific pipeline into the Nextflow language, and nextflow.config, which contains several profiles for running with different software/hardware setups. Here we are going to use the profile called singularity.
It’s time to launch the pipeline with Nextflow:
$ nextflow run main.nf -profile singularity
We’ll get some information on the pipeline, along with the notice that the appropriate container is being downloaded:
N E X T F L O W ~ version 19.10.0
Pulling marcodelapierre/rnaseq-nf ...
downloaded from https://github.com/marcodelapierre/rnaseq-nf.git
Launching `marcodelapierre/rnaseq-nf` [hopeful_almeida] - revision: 91dd162c00 [master]
R N A S E Q - N F P I P E L I N E
===================================
transcriptome: /data/work/.nextflow/assets/marcodelapierre/rnaseq-nf/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa
reads : /data/work/.nextflow/assets/marcodelapierre/rnaseq-nf/data/ggal/ggal_gut_{1,2}.fq
outdir : results
WARN: Singularity cache directory has not been defined -- Remote image will be stored in the path: /data/work/sc19-test/nxf/work/singularity
Pulling Singularity image docker://nextflow/rnaseq-nf:latest [cache /data/work/sc19-test/nxf/work/singularity/nextflow-rnaseq-nf-latest.img]
It will take a few minutes to download the container image, then the pipeline will run:
[9e/a8a999] Submitted process > fastqc (FASTQC on ggal_gut)
[6a/4ec5ee] Submitted process > index (ggal_1_48850000_49020000)
[91/109c65] Submitted process > quant (ggal_gut)
[ab/081287] Submitted process > multiqc
Done! Open the following report in your browser --> results/multiqc_report.html
The final output of this pipeline is an HTML report of a quality control task, which you may want to download and open in your browser.
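The WARN line in the Nextflow output above reports that no Singularity cache directory was defined, so the image was stored under the work directory of this run. Nextflow reads the environment variable NXF_SINGULARITY_CACHEDIR for this purpose (the config setting singularity.cacheDir is the equivalent). The following is a minimal sketch of setting it before launching; the function name setup_singularity_cache and the default path are hypothetical choices.

```shell
# Create a cache directory and point Nextflow at it.
# The default path below is an assumption; any writable directory
# with enough space for container images would do.
setup_singularity_cache() {
  local cache_dir="${1:-$HOME/.nextflow-singularity-cache}"
  mkdir -p "$cache_dir" || return 1
  export NXF_SINGULARITY_CACHEDIR="$cache_dir"
  echo "Singularity images will be cached in: $NXF_SINGULARITY_CACHEDIR"
}

setup_singularity_cache

# Then launch the pipeline as before (commented out here because it
# needs Nextflow and Singularity to be installed):
# nextflow run main.nf -profile singularity
```

With a shared cache directory, later runs can reuse the downloaded image instead of pulling it again.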
However, the key question here is: how could the sole flag -profile singularity trigger the containerised execution? This is the relevant snippet from the nextflow.config file:
singularity {
  process.container = 'nextflow/rnaseq-nf:latest'
  singularity.enabled = true
  singularity.autoMounts = true
}
The image name is specified using the process.container keyword. Also, singularity.autoMounts is required to have the directory paths with the input files automatically bind mounted in the container. Finally, singularity.enabled triggers the use of Singularity.
Based on this configuration file, Nextflow is able to handle all of the relevant Singularity commands by itself, i.e. pull and exec with the appropriate flags, such as -B for bind mounting host directories. In this case, as a user you don’t need to know the Singularity syntax in detail, just the name of the container!
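To make the two automated steps concrete, here is a rough sketch of what running them by hand might look like. The image URI and file name come from the configuration snippet and log output above. The bind directory, the function name manual_singularity_commands, and the example tool invocation (fastqc --version) are assumptions for illustration. The exact commands and flags Nextflow generates are not shown in this lesson and will differ. The sketch prints the commands rather than executing them, so it does not need Singularity to be installed.

```shell
# Values taken from the config snippet and the Nextflow log above.
image_uri="docker://nextflow/rnaseq-nf:latest"
image_file="nextflow-rnaseq-nf-latest.img"
# Assumption: bind mount the current directory. Nextflow works out
# the real input and work paths by itself when autoMounts is enabled.
bind_dir="$PWD"

manual_singularity_commands() {
  # Step 1: download the image from the Docker registry into a local file.
  echo "singularity pull $image_file $image_uri"
  # Step 2: run a tool inside the container with a host directory bind mounted.
  # 'fastqc --version' is only an illustrative command.
  echo "singularity exec -B $bind_dir $image_file fastqc --version"
}

manual_singularity_commands
```

Copying the printed lines into a shell on a machine with Singularity would perform a pull and a single containerised command, which is roughly what Nextflow repeats for every task in the pipeline.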
More information on configuring Nextflow to run Singularity containers can be found in the “Singularity containers” section of the Nextflow documentation.
Key Points
Some workflow engines offer transparent APIs for running containerised applications
If you need to run data analysis pipelines, the combination of containers and workflow engines can really make your life easier!