Python containers: build best practices
Overview
Teaching: 20 min
Exercises: 40 min
Useful base images
Depending on the type of application you need to containerise, various public images can be an effective starting point, shipping ready-to-use utilities that have been containerised, tested and optimised by their authors. Relevant to this workshop are:
- Python images, such as python:3.8 and python:3.8-slim (a lightweight version); we’re going to use the slim version to avoid including unneeded packages in our images; see the Docker Hub repo and the Dockerfile;
- Conda images by Anaconda, such as continuumio/miniconda3:4.8.2; again, we prefer miniconda3 over anaconda3 to exclude unnecessary packages; see Docker Hub and Dockerfile;
- Intel optimised Python images, i.e. intelpython/intelpython3_core:2020.2 and intelpython/intelpython3_full:2020.2; we’re going to use the core image for the same reasons as above; see Docker Hub and Dockerfile;
- Pawsey MPI images, such as the MPICH one that we’re using here, pawsey/mpich-base:3.1.4_ubuntu18.04; see Docker Hub and Dockerfile;
- CUDA images, including nvidia/cuda:10.2-base-ubuntu18.04, nvidia/cuda:10.2-runtime-ubuntu18.04 and nvidia/cuda:10.2-devel-ubuntu18.04; we’re using the devel image as we need the full CUDA SDK for Numba; see Docker Hub and the Dockerfiles: base, runtime, devel.
Other useful base images, not directly used in this context, include:
- Jupyter images, in particular the jupyter/ repository by Jupyter Docker Stacks (unfortunately making extensive use of the latest tag); for instance, the scientific Python image jupyter/scipy-notebook:latest; see Docker Hub;
- OS images, such as ubuntu:18.04, debian:buster and centos:7.
Note that the python, miniconda3 and intelpython images are all based on Debian (currently version 10 Buster). The Pawsey mpich-base and cuda images are based on Ubuntu 18.04.
All of the mentioned images are currently hosted on Docker Hub. In addition to the image itself, it is worth having an idea of what the corresponding Dockerfile looks like, both to know how the image was created and to get tips on how to use it optimally.
Having the Dockerfiles is also useful in case one needs an image with multiple utilities. Then, intelligently merging the Dockerfiles will do the job.
Some build best practices
Always keep in mind that writing a Dockerfile is almost an art, which you can refine over time with practice. Here we don’t mean to be exhaustive; instead, we provide some good practices to start with, along with Ubuntu/Debian examples where relevant. These practices will then be applied to the examples further down, and in the next episode.
1. Condense commands into few RUN instructions

This reduces the number of image caching layers, and thus the total size of the Docker image.

Why is it so? At one extreme, imagine one RUN per command. Any edit to the same files in the container, or any deletion of unnecessary files, would create a separate layer (i.e. a snapshot), and the final container would be larger. At the other extreme, concentrating every command in one RUN would minimise image size, as there would be one single layer. As a disadvantage, readability of the Dockerfile would be reduced, and caching during the build would become less effective.

In the end, the best practice is to group together commands that relate to the same component of the installation, as in the sketch below. Readability and caching are improved this way, and you would not gain space anyway by grouping further, as these groups of commands operate on different files.

Note that this is a benefit only for images in the Docker format. When converting to the Singularity SIF format, layers are squashed into a single object anyway.
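Here is a minimal sketch of the difference (the downloaded archive is purely hypothetical):

# two RUNs, two layers: the archive deleted in the second RUN
# still takes up space in the first layer
RUN wget https://example.com/tool-2.0.tar.gz && tar xzf tool-2.0.tar.gz
RUN rm tool-2.0.tar.gz

# one RUN, one layer: the archive never ends up in the image
RUN wget https://example.com/tool-2.0.tar.gz \
      && tar xzf tool-2.0.tar.gz \
      && rm tool-2.0.tar.gz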
2. Clean the installation process

Installation files that are not required by the application at runtime can be deleted, helping to reduce the final image size (when coupled with practice 1 above).

As a first Ubuntu example, here’s how to run a clean apt installation:

RUN apt-get update \
      && apt-get -y install \
         build-essential \
         gfortran \
      && apt-get clean all \
      && apt-get purge \
      && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
and here is how to tidy up after a software compilation:

RUN mkdir -p /tmp/tool-build \
      && cd /tmp/tool-build \
      [..] <INSTALL YOUR TOOL> [..]
      && cd / \
      && rm -rf /tmp/tool-build
3. Abstract package versions, if you can

The ARG instruction can be used to define variables in the Dockerfile that only exist within the build process. This can be especially useful to specify package versions in a more general and flexible way. As a general example:

ARG TOOL_VERSION="2.0"

Then, you can use $TOOL_VERSION throughout the package build commands.

A nice feature of docker build is that you can dynamically assign these ARGs at build time using the --build-arg flag, e.g. if you want to install version 3.0:

$ docker build -t tool --build-arg TOOL_VERSION="3.0" .
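Putting it all together, here is a sketch of how ARG fits in a Dockerfile (tool name and download URL are hypothetical):

FROM ubuntu:18.04

ARG TOOL_VERSION="2.0"

RUN mkdir -p /tmp/tool-build \
      && cd /tmp/tool-build \
      && wget https://example.com/tool-${TOOL_VERSION}.tar.gz \
      && tar xzf tool-${TOOL_VERSION}.tar.gz \
      [..] <INSTALL YOUR TOOL> [..]
      && cd / \
      && rm -rf /tmp/tool-build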
4. Consider build reproducibility, if you can

A big part of using containers is runtime reproducibility. However, it can be worth reflecting on build time reproducibility, too. In other words: if I re-run an image build with the same Dockerfile over time, am I getting the same image as output? This can be important for an image curator, to be able to guarantee, to some extent, that the image they’re building and shipping is the one they mean to.

Some general advice here:
- rely on explicit versioning of the key packages in your image as much as possible (see the sketch right after this list); this may include the OS version, key library and dependency versions, and the end-user application version (this latter is possibly obvious); typically this involves a trade-off between versioning too much and too little;
- for Python/Conda, track the versioned packages to install using a requirements approach; see the examples further down;
- avoid generic package manager update commands, which make you lose control of versions; in particular, you should avoid apt-get upgrade, pip install --upgrade, conda update and so on;
- avoid downloading sources/binaries for latest versions; specify the version number instead.
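For instance, with pip (using the astropy package from the examples further down):

# version drifts from build to build
RUN pip install astropy

# explicit version: repeated builds give the same package
RUN pip install astropy==3.2.3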
5. Know and set some useful environment variables

Dockerfile installations are non-interactive by nature, i.e. no installer can ask you questions during the process. In Ubuntu/Debian, you can define a variable prior to running any apt command that informs the shell you are not able to interact, so that no questions will be asked:

ENV DEBIAN_FRONTEND="noninteractive"

Another pair of useful variables, again to be put at the beginning of the Dockerfile, are:

ENV LANG="C.UTF-8" LC_ALL="C.UTF-8"

These variables specify the language localisation, or locale; here it is set to C.UTF-8. Leaving the locale undefined can result, for some programs, in warnings or even in unintended behaviours (both at build and run time).
6. Document your Dockerfile with labels

This is a way to provide information to third-party users of your container image, to enable a more effective use. For instance, you can add a maintainer label:

LABEL maintainer="john.doe@nowhere.com"
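LABEL takes arbitrary key-value pairs, so other metadata can be recorded, too; the values below are purely illustrative:

LABEL maintainer="john.doe@nowhere.com" \
      version="1.0.0" \
      description="Tools for analysing astronomical images"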
7. Think about the default command

The Docker instruction CMD can be used to set the default command that gets executed by docker run <IMAGE> (without arguments) or singularity run <IMAGE>. If you don’t specify it, the default will be the CMD specified by the base image, or the shell if the latter was not defined.

Our suggestion for the vast majority of containers is to set it to /bin/bash, as in the example below. You might think of using an application binary instead, but in the end this is not very useful: the CMD default gets overridden by any command/argument you add to the run commands, in which case you would need to explicitly state the main command anyway. Moreover, if your container ships multiple executables, some will be left out anyway.

There are only a few exceptions to this general advice. For Python and R containers, you might set it to the python and R interpreters, respectively. Do not set CMD for rocker/ RStudio and jupyter/ Jupyter images; these come with elaborate setups that spawn the web servers, so your choice would be ignored anyway.
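Following this advice, most Dockerfiles in this episode end with the line:

CMD [ "/bin/bash" ]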
An Ubuntu example
To get started, let’s comment on a simple example, where we’re using an Ubuntu image and we want to add some developer tools using apt: build tools, gfortran, git and wget. We’re going to use it to practice with advice 1 to 3 above. The corresponding files are in the directory 2.ubuntu/.
The simplest Dockerfile, Dockerfile.1, looks like:
FROM ubuntu:18.04
RUN apt-get update \
&& apt-get install -y \
build-essential \
gfortran \
git \
wget
We can build an image called u:1 via:
$ docker build -t u:1 -f Dockerfile.1 .
Once built, interestingly we can see from docker images that the image size is 379 MB:
REPOSITORY TAG IMAGE ID CREATED SIZE
[..]
u 1 f513ff384910 28 hours ago 379MB
[..]
Let’s see if we can reduce the image size (practice no. 2), using Dockerfile.2:
FROM ubuntu:18.04
RUN apt-get update \
&& apt-get install -y \
build-essential \
gfortran \
git \
wget
RUN apt-get clean all \
&& rm -r /var/lib/apt/lists/*
Now, let’s build the new image. Can you see the process is almost instantaneous? This is layer caching in action!
However, if we have a look at the size, we can see that .. it’s still 379 MB! Well, this makes sense if you think about practice no. 1: we’re cleaning using a separate RUN command, thus creating a distinct layer, so the files being cleaned up are still snapshotted in the first layer. Let’s then consolidate the two RUNs into one (Dockerfile.3):
FROM ubuntu:18.04
RUN apt-get update \
&& apt-get install -y \
build-essential \
gfortran \
git \
wget \
&& apt-get clean all \
&& rm -r /var/lib/apt/lists/*
The size of the new image is now 350 MB, down 29 MB or about 8%. Not too bad .. but the benefits of cleaning the image become more significant as you install more packages in it.
Let’s now integrate practice no. 3 in the Dockerfile, to allow flexibility in package versions. Well, with apt we’re typically quite constrained in picking versions, so in this case let’s try and become flexible with the Ubuntu version instead; see Dockerfile.4:
ARG OS_VERSION="18.04"
FROM ubuntu:$OS_VERSION
RUN apt-get update \
&& apt-get install -y \
build-essential \
gfortran \
git \
wget \
&& apt-get clean all \
&& rm -r /var/lib/apt/lists/*
With this Dockerfile, by default we’re going to build the image starting from Ubuntu 18.04. However, we can now change this behaviour at build time, for instance if we need 20.04:
$ docker build -t u:4-20.04 -f Dockerfile.4 --build-arg OS_VERSION="20.04" .
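If we want to double check the base OS of the resulting image, we can query it directly, which should report the 20.04 release:

$ docker run --rm u:4-20.04 grep VERSION= /etc/os-release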
A Python example: astropy using pip
Let’s practice installing Python packages in a container using pip; we’ll use the base image python:3.8-slim, and the package astropy as an example (files under 2.pip/).
Here is the first, minimal iteration of the image recipe, Dockerfile.1:
FROM python:3.8-slim
RUN pip install astropy
CMD [ "/bin/bash" ]
Seems straightforward, right? Note how the Dockerfile is re-defining the default command to the bash shell; this is because python images set it to the Python console. Have a look at the Dockerfile for python:3.8-slim as a reference. If you prefer the latter setting, just delete the last line.
Let’s build the image, have a look at the output, and then check the image size with docker images:
$ docker build -t p:1 -f Dockerfile.1 .
Sending build context to Docker daemon 2.048kB
Step 1/3 : FROM python:3.8-slim
---> 38cd21c9e1a8
Step 2/3 : RUN pip install astropy
---> Running in 1873f952be21
Collecting astropy
Downloading astropy-4.0.1.post1-cp38-cp38-manylinux1_x86_64.whl (6.5 MB)
Collecting numpy>=1.16
Downloading numpy-1.19.1-cp38-cp38-manylinux2010_x86_64.whl (14.5 MB)
Installing collected packages: numpy, astropy
Successfully installed astropy-4.0.1.post1 numpy-1.19.1
Removing intermediate container 1873f952be21
---> 6e8eaaaa85dc
Step 3/3 : CMD [ "/bin/bash" ]
---> Running in cba660461191
Removing intermediate container cba660461191
---> 06cc7bee12cd
Successfully built 06cc7bee12cd
Successfully tagged p:1
A couple of notes here:
- at the time of writing, we’re installing astropy version 4.0.1.post1; astropy also depends on numpy, so the pip process installs both these packages;
- the final image size is 228 MB.
Can we reduce the image size? Sure, let’s have a look at Dockerfile.2:
FROM python:3.8-slim
RUN pip --no-cache-dir install astropy
CMD [ "/bin/bash" ]
If we build this image, we can see that pip’s --no-cache-dir option reduces the size by 22 MB, or 10%, to 206 MB.
Now, let’s try and add version control to this image. As in the Dockerfile we’re asking pip to install astropy, we can make its version explicit, as in Dockerfile.3:
FROM python:3.8-slim
ARG ASTRO_VERSION="3.2.3"
RUN pip --no-cache-dir install astropy==$ASTRO_VERSION
CMD [ "/bin/bash" ]
In this example, the default installed version is 3.2.3. This can be changed at build time with --build-arg ASTRO_VERSION=<ALTERNATE VERSION>.
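For instance, to build an image with the newer version we installed earlier:

$ docker build -t p:3 -f Dockerfile.3 --build-arg ASTRO_VERSION="4.0.1.post1" .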
This was easy enough. Now, how about build reproducibility? Or, put another way: are there other packages whose versions we need to keep explicit track of? Well, when we install Python packages, most of them come with some dependencies; in this case it’s numpy. Let’s see ways to track these when building a container. We’re going to see two examples, both of which rely on using a requirements file.
pip build reproducibility, way 1: pip freeze
We’re now going to adopt a strategy that is pretty useful when developing Dockerfiles, that is, running interactive container sessions to trial things. First, let’s write a requirements.in file specifying the package we’re after:

astropy==3.2.3
And now, let’s start an interactive session with our base image, python:3.8-slim. We need the current directory to be bind mounted in the container:
$ docker run --rm -it -v $(pwd):/data -w /data python:3.8-slim bash
Now, from inside the container let’s execute the prepare4.sh script:

#!/bin/bash
# run this from the python:3.8-slim container
# docker run --rm -it -v $(pwd):/data -w /data python:3.8-slim bash
pip install -r requirements.in
REQ_FILE="requirements4-3sep.txt"
pip freeze >$REQ_FILE
Here we’re performing a trial run of the installation we’re after, using the requirements file via pip install -r requirements.in. Then, the useful bit: we save the resulting pip configuration to a file, using pip freeze. The end result, requirements4-3sep.txt, contains all the final packages (two in this case) with explicit versions:
astropy==3.2.3
numpy==1.19.1
We can then use this file as a reference in the Dockerfile; see Dockerfile.4:
FROM python:3.8-slim
ARG REQ_FILE="requirements4-3sep.txt"
ADD requirements.in /
ADD $REQ_FILE /requirements.txt
RUN pip --no-cache-dir install --no-deps -r /requirements.txt
CMD [ "/bin/bash" ]
We’re copying the requirements files into the image using ADD, and then using the second one to run the pip installation (the former is copied just to document it in the image; it’s not strictly required). We’re using the additional --no-deps flag to make sure pip installs only the packages listed in the requirements file; this is a complete list of packages, as we got it from a real installation.
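We can build this image as usual, e.g.:

$ docker build -t p:4 -f Dockerfile.4 .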
Now, if we run this build repeatedly over time, we always end up with the same set of packages (and versions) in the container image!
pip build reproducibility, way 2: pip-tools
With pip, we’ve got an alternative way to generate our fully specified requirements file, one that does not require running a full installation interactively. It makes use of a Python package called pip-tools; see its GitHub page. We need it installed on the host machine we use to build Docker images, which we can achieve via pip install pip-tools.
Then, starting from our initial requirements.in file, we can generate the final one by simply running:
$ pip-compile -o requirements5-3sep.txt requirements.in
The list of packages is consistent with the pip freeze way, just with some extra comments:
#
# This file is autogenerated by pip-compile
# To update, run:
#
# pip-compile --output-file=requirements5-3sep.txt requirements.in
#
astropy==3.2.3 # via -r requirements.in
numpy==1.19.1 # via astropy
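Note that we can feed this file to the same Dockerfile.4 seen above, through its REQ_FILE build argument:

$ docker build -t p:5 -f Dockerfile.4 --build-arg REQ_FILE="requirements5-3sep.txt" .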
astropy using conda
For folks who prefer conda over pip, let’s see how we can get an astropy container image with this approach (files under 2.conda/).
Let’s start with the basic Dockerfile.1, which keeps track of the astropy version only:
FROM continuumio/miniconda3:4.8.2
# note, as of september 2020, "--no-update-deps" seems not to be respected
ARG ASTRO_VERSION="3.2.3"
RUN conda install -y --no-update-deps astropy==$ASTRO_VERSION
First, we’re starting from the continuumio/miniconda3:4.8.2 image; have a look at its Dockerfile if you want. Then, note how we’re using the conda install flag --no-update-deps, to ask conda not to update any package that ships with the base image. This is intended for better build reproducibility, in that these packages should be defined only by the choice of the base image itself. Unfortunately, at the time of writing this flag does not seem to work as intended.
We can build the image with:
$ docker build -t c:1 -f Dockerfile.1 .
As is typical of conda installations, the list of new/updated packages is longer than with pip: over 30 packages. Also, note the final image size is 1.6 GB.
First, let’s see how we can reduce the image size using conda clean; see Dockerfile.2:
FROM continuumio/miniconda3:4.8.2
# note, as of september 2020, "--no-update-deps" seems not to be respected
ARG ASTRO_VERSION="3.2.3"
RUN conda install -y --no-update-deps astropy==$ASTRO_VERSION \
&& conda clean -ay
The corresponding image is 1.37 GB, a reduction of 230 MB or 14%.
Now, let’s focus on build reproducibility, taking again an approach based on a requirements file. Let’s start with our specification, requirements.in:
astropy==3.2.3
Now, similar to the pip case, let’s start an interactive session:
$ docker run --rm -it -v $(pwd):/data -w /data continuumio/miniconda3:4.8.2 bash
And run this preparation script, prepare3.sh:
#!/bin/bash
# run this from the miniconda3 container
# docker run --rm -it -v $(pwd):/data -w /data continuumio/miniconda3:4.8.2 bash
conda install --no-update-deps -y --file requirements.in
REQ_LABEL="3sep"
ENV_FILE="environment-${REQ_LABEL}.yaml"
conda env export >${ENV_FILE}
REQ_FILE="requirements-${REQ_LABEL}.yaml"
cp $ENV_FILE $REQ_FILE
sed -i -n '/dependencies/,/prefix/p' $REQ_FILE
sed -i -e '/dependencies:/d' -e '/prefix:/d' $REQ_FILE
sed -i 's/ *- //g' $REQ_FILE
Here we’re running a trial installation using conda install --file requirements.in.
Then we can export the versioned packages in the active environment using conda env export. This has a caveat, though: environment export in conda creates a YAML file meant for creating a completely new environment, including information on the environment name, prefix and channels (see environment-3sep.yaml in the directory of this example). As we just want this information to install packages in the pre-existing base environment of the base image, we need to polish this file, e.g. using sed. A bunch of edits will return the final requirements-3sep.yaml (see the example directory), which only contains the list of versioned packages.
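The polished file ends up with one fully versioned package per line, in the conda syntax package=version=build; the entries below are indicative only:

astropy=3.2.3=py38h516909a_0
numpy=1.19.1=py38h8854b6b_0
[..]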
This is the requirements file we can use in the Dockerfile; see Dockerfile.3:
FROM continuumio/miniconda3:4.8.2
# note, as of september 2020, "--no-update-deps" seems not to be respected
ARG REQ_FILE="requirements-3sep.yaml"
ADD requirements.in /
ADD $REQ_FILE /requirements.yaml
RUN conda install -y --no-deps --file /requirements.yaml \
&& conda clean -ay
Note how we’re now using the option conda install --no-deps, to tell conda not to consider any package dependencies for installation, but just those packages in the requirements list. In principle, this is dangerous and can lead to broken environments, but here we’re safe, as we obtained this list by exporting a real, functional environment.
Shell variables and conda environment settings
This is one more aspect worth mentioning when dealing with conda container images: how to make sure the shell variables required by the conda environment are set. Running conda activate in a Dockerfile would not work as intended, as the variable settings it makes would only live inside the corresponding RUN layer.
Another way might be to embed the environment sourcing inside profile files, such as ~/.bashrc, ~/.profile, or even something like /etc/profile.d/conda.sh. However, these files are only sourced when bash is launched, so for instance not when running a python command directly. Also, files under the home directory, ~/, would not work with Singularity: the Docker home is root’s home, whereas Singularity runs as the host user.
In summary, the most robust way to ensure shell variables for the conda environment are set is to set them explicitly in the Dockerfile using ENV instructions.

In terms of general conda variables, the continuumio base images all set a modified PATH variable, so that conda and the binaries in the base environment are found (see the Dockerfile). Setting CONDA_PREFIX explicitly is not done in the base image, so it does not hurt to do it in our Dockerfile; see Dockerfile.4:
:
FROM continuumio/miniconda3:4.8.2
# note, as of september 2020, "--no-update-deps" seems not to be respected
ARG REQ_FILE="requirements-3sep.yaml"
ADD requirements.in /
ADD $REQ_FILE /requirements.yaml
RUN conda install -y --no-deps --file /requirements.yaml \
&& conda clean -ay
# conda activate is not robustly usable in a container.
# then, go for an environment check in a test container,
# to see if you need to set any package specific variables in the container:
#
# run this from the miniconda3 container
# docker run --rm -it -v $(pwd):/data -w /data continuumio/miniconda3:4.8.2 bash
#
# env >before
# conda install <..>
# env >after
# diff before after
# this one is always good to have
ENV CONDA_PREFIX="/opt/conda"
How about the extra comments in this Dockerfile? Well, it’s not the case for astropy but, depending on the installed packages, additional variables might be added to the shell environment. It’s possible to capture them by running another trial installation interactively, and comparing the environment before and after it’s done:
$ docker run --rm -it -v $(pwd):/data -w /data continuumio/miniconda3:4.8.2 bash
/# env >before
/# conda install <..>
/# env >after
/# diff before after
If there are any extra variables, it’s advisable to review them, and include the relevant ones in the Dockerfile using ENV instructions.
Python with MPI: mpi4py
Now, let’s have a look at a more articulated example: suppose we need a container image that is able to run MPI Python code, i.e. using the package mpi4py (files under 2.mpi.python/).
How is this different compared to the previous examples? Well, in brief:
- we know that, besides the mpi4py Python package, we also need some system utilities, such as compilers and MPI libraries/wrappers;
- compilers are available with apt, but MPI libraries are better compiled from scratch ..
- .. in fact, as we’ll see in the next episode, this configuration requires care, so that at runtime we can dynamically link mpi4py to the host MPI libraries rather than the container ones.
So, here’s our plan:
1. install compilers and build tools with apt;
2. compile an MPI library;
3. install mpi4py using pip.
Note how in our container image we need both Python and MPI utilities. We have base images for both, e.g. python:3.8-slim and pawsey/mpich-base:3.1.4_ubuntu18.04.
Can we combine them? Upon inspection, we will notice that there are no incompatible steps between the two, so .. yes, we can combine them.
How to combine them? Well, there’s no Docker instruction to achieve this from the two images, so the only option is to pick one and then install the other set of utilities explicitly in the Dockerfile.
This is when it comes in handy to have a look at the Dockerfiles of our base images of interest: python, pawsey/mpich-base.
The former is 143 lines long, the latter only 64, of which we actually only need the first 38 (after that, it’s about installing a benchmark suite, which we don’t need here) .. so it looks more convenient to embed the latter on top of the former.
As regards mpi4py, if we run a trial interactive installation we’ll discover that this package has no further pip package dependencies, so we can specify its version straight in the Dockerfile.
Let’s have a look at what the final Dockerfile looks like:
FROM python:3.8-slim
RUN apt-get update -qq \
&& apt-get -y --no-install-recommends install \
build-essential \
ca-certificates \
gdb \
gfortran \
wget \
&& apt-get clean all \
&& rm -r /var/lib/apt/lists/*
ARG MPICH_VERSION="3.1.4"
ARG MPICH_CONFIGURE_OPTIONS="--enable-fast=all,O3 --prefix=/usr"
ARG MPICH_MAKE_OPTIONS="-j4"
RUN mkdir -p /tmp/mpich-build \
&& cd /tmp/mpich-build \
&& wget http://www.mpich.org/static/downloads/${MPICH_VERSION}/mpich-${MPICH_VERSION}.tar.gz \
&& tar xvzf mpich-${MPICH_VERSION}.tar.gz \
&& cd mpich-${MPICH_VERSION} \
&& ./configure ${MPICH_CONFIGURE_OPTIONS} \
&& make ${MPICH_MAKE_OPTIONS} \
&& make install \
&& ldconfig \
&& cp -p /tmp/mpich-build/mpich-${MPICH_VERSION}/examples/cpi /usr/bin/ \
&& cd / \
&& rm -rf /tmp/mpich-build
ARG MPI4PY_VERSION="3.0.3"
RUN pip --no-cache-dir install --no-deps mpi4py==${MPI4PY_VERSION}
CMD [ "/bin/bash" ]
This makes sense, doesn’t it?
We can build the image using:
$ docker build -t m:1 .
We’re going to use this image again in the next episode.
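As a quick sanity check, we can also run the compiled MPICH example program, cpi, which the Dockerfile copied into /usr/bin (here as a single process, so no mpirun is required); it should print an approximation of pi:

$ docker run --rm m:1 cpi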