A short introduction to Python containers

Overview

Teaching: 5 min
Exercises: 10 min

Get ready for the session

First, we need to download the workshop materials from GitHub. cd into a directory of your choice, and then:

$ git clone https://github.com/PawseySC/containers-astronomy-workshop.git
$ cd containers-astronomy-workshop/exercises
$ export EXERCISES=$(pwd)

Here we’re defining the variable EXERCISES, pointing to the subdirectory of the repository that contains inputs and scripts for the various examples.

Trick: get rid of sudo docker

Docker requires administrative rights, so in principle every command must be prefixed with sudo, as in sudo docker.

To save typing, you may want to add your user to the docker user group:

$ sudo usermod -aG docker $USER

then exit the terminal session and open a fresh one.

From now on, you can run docker without sudo.
Note that under the hood docker commands will still require admin rights.
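
You can check that the change took effect by running any Docker command without sudo, for instance using the small hello-world test image:

$ docker run hello-world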

A principle for containerised applications

When containerising sets of applications, two approaches are possible:

  1. one container per application, or
  2. one container for the whole software stack of a workflow

Often a workflow relies on a set of standalone packages, for instance a bunch of C applications in bioinformatics. In this case we typically advise the first approach, which is simpler to maintain and more modular, in that changes to one application or workflow task do not impact the others.

However, the situation is different when a workflow is built upon a set of packages written in Python (or R): very often these rely on a large set of dependencies, creating a complex dependency tree. Here, building distinct containers for each package would imply building multiple dependency trees, with potential maintenance issues and even runtime inconsistencies, which might manifest when distinct packages require incompatible versions of the same dependency.

In this context, then, we recommend creating one single container for the full set of Python (or R) packages required by a given workflow.
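
As an illustration of this approach, here is a minimal Dockerfile sketch installing a whole Python stack in a single image; the package names and versions are purely hypothetical placeholders for the actual requirements of your workflow:

FROM python:3.8-slim

# install all the Python packages of the workflow in one go,
# so that pip resolves a single, consistent dependency tree
RUN pip install --no-cache-dir \
      astropy==4.2 \
      numpy==1.20.1 \
      scipy==1.6.1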

Using Singularity and Docker for containers in HPC

On HPC clusters Docker is typically not available for running containers, mostly due to security concerns (it requires admin rights to run!). For this reason, Singularity is used at runtime instead.

However, for building containers the Docker image format and the Dockerfile specification are more popular than their Singularity counterparts, and more cross-compatible: Singularity can run Docker images, but not vice versa. Therefore, we suggest using Docker to build container images; you can do that on your laptop, on a workstation, or on a cloud instance, provided you have admin rights on that machine.
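
The resulting workflow might look like the following sketch, where myuser/mytools is a hypothetical image name and Docker Hub is used as the registry:

$ # on your laptop, workstation or cloud instance, where you have admin rights
$ docker build -t myuser/mytools:1.0 .
$ docker push myuser/mytools:1.0

$ # on the HPC cluster, where only Singularity is available
$ singularity pull docker://myuser/mytools:1.0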

Some caveats in running Python containers

  1. Clean shell environment

    As Python is often used by system services and utilities, it is common on HPC clusters to have Python-related variables defined in the shell environment. These may include PYTHONPATH, PYTHONUSERBASE, PYTHONSTARTUP and others.
    Now, Singularity by default favours integration over isolation, and thus passes the host shell environment onto the container.

    As a result of these two factors, a Python container may show unintended and uncontrolled behaviours.

    As an example, try opening a Python console using the docker://python:3.8-slim container image:

     $ singularity exec docker://python:3.8-slim python3
    
     [..]
    
     Python 3.8.5 (default, Aug  4 2020, 16:24:08) 
     [GCC 8.3.0] on linux
     Type "help", "copyright", "credits" or "license" for more information.
     >>> 
    

    This run is fine; close the console by typing exit, or pressing Ctrl-D.

    But now suppose the host variable PYTHONSTARTUP is defined:

     $ export PYTHONSTARTUP="/etc/pythonstart"
     $ singularity exec docker://python:3.8-slim python3
    
     Python 3.8.5 (default, Aug  4 2020, 16:24:08) 
     [GCC 8.3.0] on linux
     Type "help", "copyright", "credits" or "license" for more information.
     Could not open PYTHONSTARTUP
     FileNotFoundError: [Errno 2] No such file or directory: '/etc/pythonstart'
     >>> 
    

    Now you get a warning, as that path does not exist in the container.
    This one is innocuous, but interference from host Python variables is often a source of errors.

    To be safe, always run Python containers with singularity exec -e (or the long form --cleanenv), to isolate the container shell environment:

     $ export PYTHONSTARTUP="/etc/pythonstart"
     $ singularity exec -e docker://python:3.8-slim python3
    
     Python 3.8.5 (default, Aug  4 2020, 16:24:08) 
     [GCC 8.3.0] on linux
     Type "help", "copyright", "credits" or "license" for more information.
     >>> 
    

    WARNING for MPI applications: resource managers such as Slurm make heavy use of environment variables to properly spawn MPI jobs, and the -e option will break the MPI run. The workaround here is to not use -e, and instead unset any Python-related variables in the session for good. Have a look at this convenient one-liner, which lists the environment (env), keeps only the variables whose names start with PYTHON (grep), extracts the variable names (cut), and joins them on one line (xargs) for unset:

     $ unset $( env | grep ^PYTHON | cut -d = -f 1 | xargs )
     $ srun singularity exec docker://python:3.8-slim python3 <PYTHON SCRIPT FILE>
    
  2. Singularity syntax to pass shell variables

    As tip no. 1 above suggests isolating the container shell environment from the host one, how can you then set environment variables in the container when you need to?

    Well, you can use the dedicated Singularity syntax: set the required variable on the host, with its name prepended with the prefix SINGULARITYENV_, as in:

     $ export SINGULARITYENV_VARIABLE="value"
    

    Then, you get the variable in the container:

     $ singularity exec -e docker://python:3.8-slim bash -c 'echo $VARIABLE'
    
     value
    
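    For instance, if you deliberately need a given variable such as PYTHONPATH inside the container (the path below is just a placeholder), you can combine this syntax with -e, keeping the rest of the host environment isolated:

     $ export SINGULARITYENV_PYTHONPATH="/opt/mytools/lib"
     $ singularity exec -e docker://python:3.8-slim bash -c 'echo $PYTHONPATH'
    
     /opt/mytools/lib
    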
  3. Mounting a fake home for writing files

    It may happen that some packages need to write service or configuration files in the user's HOME directory. This can be an issue on HPC clusters, where users' homes might not be mounted in the container by default:

     $ singularity exec -e docker://python:3.8-slim ls $HOME
    
     ls: cannot access '/home/ubuntu': No such file or directory
    

    You might be tempted to bind mount your real home at runtime, but this is not your best option: we recommend AVOIDING mounting home, for improved security.

    Instead, you can create a service directory in the host filesystem, to be used as a fake home in your container:

     $ mkdir fake_home
     $ singularity exec -e -B fake_home:$HOME docker://python:3.8-slim touch $HOME/testfile
    

    Then, files written in the fake home will be available from the host; you can keep them for future sessions if you want/need to, or clean them up after runtime:

     $ ls fake_home
    
     testfile
    
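    Once you do not need these files any more, you can simply remove the service directory from the host:

     $ rm -r fake_home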
