Supplement: Creating Docker Images for Workflows
Overview
Teaching: 10 min
Exercises: 1 min
Questions
How do I create Docker images from scratch?
What are some best practices for Docker images?
Objectives
Understand how to get started writing Dockerfiles
This lesson has migrated to https://doc.arvados.org/rnaseq-cwl-training/08-supplement-docker/index.html
Common Workflow Language supports running tasks inside software containers. Software container systems (such as Docker) create an execution environment that is isolated from the host system, so that software installed on the host system does not conflict with the software installed inside the container.
Programs running inside a software container get a different (and generally restricted) view of the system than processes running outside the container. One of the most important and useful features is that the containerized program has a different view of the file system. A program running inside a container, searching for libraries, modules, configuration files, data files, etc, only sees the files defined inside the container.
This means that, usually, a given file path refers to different actual files depending on whether it is resolved inside or outside the container. It is also possible to have a file from the host system appear at some location inside the container, meaning that the same file appears at different paths depending on whether it is viewed from inside or outside the container.
The complexity of translating between the container and its host environment is handled by the Common Workflow Language runner. As a workflow author, you only need to worry about the environment inside the container.
What are Docker images?
The Docker image describes the starting conditions for the container. Most importantly, this includes the starting layout and contents of the container’s file system. This file system is typically a lightweight POSIX environment, providing a standard set of POSIX utilities like sh, ls, cat, etc., and organized into standard POSIX directories like /bin and /lib.
The image is made up of multiple “layers”. Each layer modifies the layer below it by adding, removing or modifying files to produce a new layer. This allows lower layers to be re-used.
Writing a Dockerfile
In this example, we will build a Docker image containing the Burrows-Wheeler Aligner (BWA) by Heng Li. This is just for demonstration; in practice you should prefer to use existing containers from BioContainers, which includes bwa.
Each line of the Dockerfile consists of a COMMAND in all caps, followed by the parameters of that command.
The first line of the file will specify the base image that we are going to build from. As mentioned, images are divided up into “layers”, so this tells Docker what to use for the first layer.
FROM debian:10-slim
This starts from the lightweight (“slim”) Debian 10 Docker image.
Docker images have a special naming scheme.
A bare name like “debian” or “ubuntu” means it is an official Docker image. It has an implied prefix of “library”, so you may see the image referred to as “library/debian”. Official images are published on Docker Hub.
A name with two parts separated by a slash is published on Docker Hub by someone else. For example, amazon/aws-cli is published by Amazon. These can also be found on Docker Hub.
A name with three parts separated by slashes means it is published on a different container registry. For example, quay.io/biocontainers/subread is hosted on the quay.io registry.
Following the image name, separated by a colon, is the “tag”. This is typically the version of the image. If not provided, the default tag is “latest”. In this example, the tag is “10-slim”, indicating the slim variant of Debian release 10.
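The naming rules above can be sketched as a small shell function. This is only an illustration of the simplified rules in this lesson, not Docker's full reference grammar: the function name describe_image is hypothetical, and it does not handle digests or registries that include a port number. It uses docker.io, the hostname of Docker Hub, as the registry for one-part and two-part names.

```shell
#!/bin/sh
# Simplified sketch of the image naming rules described in this lesson.
# Assumption: no digests (@sha256:...) and no registry ports (host:5000/name).
describe_image() {
    ref=$1

    # Everything after the last colon is the tag; the default is "latest".
    case "$ref" in
        *:*) name=${ref%:*}; tag=${ref##*:} ;;
        *)   name=$ref;      tag=latest ;;
    esac

    # Count the slash-separated parts of the name.
    case "$name" in
        */*/*) registry=${name%%/*}; repo=${name#*/} ;;      # three parts: other registry
        */*)   registry=docker.io;   repo=$name ;;           # two parts: Docker Hub
        *)     registry=docker.io;   repo=library/$name ;;   # bare name: official image
    esac

    echo "registry=$registry repository=$repo tag=$tag"
}

describe_image debian:10-slim
describe_image amazon/aws-cli
describe_image quay.io/biocontainers/subread
```

For the three names used in this lesson, the function reports library/debian with tag 10-slim on docker.io, amazon/aws-cli with tag latest on docker.io, and biocontainers/subread with tag latest on quay.io.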
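The Dockerfile should also include a MAINTAINER (this is purely metadata; it is stored in the image but not used for execution). Note that current Docker releases deprecate MAINTAINER in favor of the LABEL command, although MAINTAINER still works.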
MAINTAINER Peter Amstutz <peter.amstutz@curii.com>
Next is the default user inside the image, which is set with the USER command. By choosing root, we can change anything inside the image (but not outside).
The body of the Dockerfile is a series of RUN commands.
Each command is run with /bin/sh inside the Docker container.
Each RUN command creates a new layer.
The RUN command can span multiple lines by using a trailing backslash.
For the first command, we use apt-get to install some packages that will be needed to compile bwa. The build-essential package installs gcc, make, etc., and zlib1g-dev provides the zlib headers that bwa is compiled against.
RUN apt-get update -qy && \
    apt-get install -qy build-essential wget unzip zlib1g-dev
Now we do everything else: download the source code of bwa, unzip it, make it, copy the resulting binary to /usr/bin, and clean up.
# Install BWA 0.7.17
RUN wget https://github.com/lh3/bwa/archive/v0.7.17.zip && \
    unzip v0.7.17.zip && \
    cd bwa-0.7.17 && \
    make && \
    cp bwa /usr/bin && \
    cd .. && \
    rm -rf bwa-0.7.17
Because each RUN command creates a new layer, having the build and clean up in separate RUN commands would mean creating a layer that includes the intermediate object files from the build. These would then be carried around as part of the container image forever, despite being useless. By doing the entire build and clean up in one RUN command, only the final state of the file system, with the binary copied to /usr/bin, is committed to a layer.
To build a Docker image from a Dockerfile, use docker build.
Use the -t option to specify the name of the image. Use -f if the file isn’t named exactly Dockerfile. The last part is the directory where it will find the Dockerfile and any files that are referenced by COPY (described below).
docker build -t training/bwa -f Dockerfile.single-stage .
Exercise
Create a Dockerfile based on this lesson and build it for yourself.
Solution
FROM debian:10-slim
MAINTAINER Peter Amstutz <peter.amstutz@curii.com>
RUN apt-get update -qy
RUN apt-get install -qy build-essential wget unzip zlib1g-dev
# Install BWA 0.7.17
RUN wget https://github.com/lh3/bwa/archive/v0.7.17.zip && \
    unzip v0.7.17 && \
    cd bwa-0.7.17 && \
    make && \
    cp bwa /usr/bin && \
    cd .. && \
    rm -rf bwa-0.7.17
Adding files to the image during the build
Using the COPY command, you can copy files from the source directory (the directory where your Dockerfile is located) into the image during the build. For example, if you have a requirements.txt next to your Dockerfile:
COPY requirements.txt /tmp/
RUN pip install --requirement /tmp/requirements.txt
Multi-stage builds
As noted, it is good practice to avoid leaving files in the Docker image that were required to build the program but not to run it, as those files are simply useless bloat. Docker offers a more sophisticated way to create clean builds by separating the build steps from the creation of the final container. These are called “multi-stage” builds.
A multi-stage build has multiple FROM lines. Each FROM line is a separate container build. The last FROM in the file describes the final container image that will be created.
The key benefit is that the different stages are independent, but you can copy files from one stage to another.
Here is an example of the bwa build as a multi-stage build. It is a little bit more complicated, but the outcome is a smaller image, because the “build-essential” tools are not included in the final image.
# Build the base image. This is the starting point for both the build
# stage and the final stage.
# the "AS base" names the image within the Dockerfile
FROM debian:10-slim AS base
MAINTAINER Peter Amstutz <peter.amstutz@curii.com>
# Install libz, because the bwa binary will depend on it.
# As it happens, this is already included in the base Debian distribution
# because lots of things use libz specifically, but it is good practice
# to explicitly declare that we need it.
RUN apt-get update -qy
RUN apt-get install -qy zlib1g
# This is the builder image. It has the commands to install the
# prerequisites and then build the bwa binary.
FROM base AS builder
RUN apt-get install -qy build-essential wget unzip zlib1g-dev
# Install BWA 0.7.17
RUN wget https://github.com/lh3/bwa/archive/v0.7.17.zip
RUN unzip v0.7.17
RUN cd bwa-0.7.17 && \
    make && \
    cp bwa /usr/bin
# Build the final image. It starts from base (where we ensured that
# libz was installed) and then copies the bwa binary from the builder
# image. The result is that the final image has only the compiled bwa
# binary, without the clutter from build-essential or from compiling
# the program.
FROM base AS final
# This is the key command. We use the COPY command described earlier,
# but instead of copying from the host, the --from option copies from
# the builder image.
COPY --from=builder /usr/bin/bwa /usr/bin/bwa
Best practices for Docker images
Docker has published guidelines on building efficient images.
Some additional considerations when building images for use with workflows:
Store Dockerfiles in git, alongside workflow definitions
Dockerfiles are scripts and should be managed with version control just like other kinds of code.
Be specific about software versions
Instead of blindly installing the latest version of a package, or checking out the master branch of a git repository and building from that, be specific in your Dockerfile about what version of the software you are installing. This will greatly aid the reproducibility of your Docker image builds.
Similarly, be as specific as possible about the version of the base image you want to use in your FROM command. If you don’t specify a tag, the default tag is called “latest”, which can change at any time.
Tag your builds
Use meaningful tags on your own Docker image so you can tell versions of your Docker image apart as it is updated over time. These can reflect the version of the underlying software, or a version you assign to the Dockerfile itself. These can be manually assigned version numbers (e.g. 1.0, 1.1, 1.2, 2.0), timestamps (e.g. YYYYMMDD like 20220126) or the hash of a git commit.
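As a minimal sketch of the three tagging options above, the following shell fragment computes one candidate tag of each kind. The image name training/bwa is the one used earlier in this lesson; the variable names and the build_tagged helper are hypothetical, and the helper is defined but not called, so nothing is built until you invoke it yourself.

```shell
#!/bin/sh
IMAGE=training/bwa

# Three candidate tags, one for each option described above.
version_tag=0.7.17            # version of the packaged software
date_tag=$(date +%Y%m%d)      # timestamp in YYYYMMDD form
# Abbreviated hash of the current git commit; falls back to "unknown"
# when not run inside a git checkout (or when git is not installed).
git_tag=$(git rev-parse --verify --short HEAD 2>/dev/null || echo unknown)

# Hypothetical helper: build the Dockerfile in the current directory
# and give the resulting image the tag passed as the first argument.
build_tagged() {
    docker build -t "$IMAGE:$1" .
}

echo "candidate tags: $version_tag $date_tag $git_tag"
```

For example, running build_tagged "$date_tag" would build the image and name it training/bwa followed by a colon and today's date.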
Avoid putting reference data in Docker images
Bioinformatics tools often require large reference data sets to run. These should be supplied externally (as workflow inputs) rather than added to the container image. This makes it easy to update reference data instead of having to rebuild a new Docker image every time, which is much more time consuming.
Small scripts can be inputs, too
If you have a small script, e.g. a self-contained single-file Python script which imports Python modules installed inside the container, you can supply the script as a workflow input. This makes it easy to update the script instead of having to rebuild a new Docker image every time, which is much more time consuming.
Don’t use ENTRYPOINT
The ENTRYPOINT Dockerfile command modifies the command line that is executed inside the container. This can produce confusion when the command line that is supplied to the container and the command that actually runs are different.
Be careful about the build cache
Docker build has a useful feature: if it has a record of the exact RUN command against the exact base layer, it can re-use the layer from cache instead of re-running it every time. This is a great time-saver during development, but can also be a source of frustration, because build steps often download files from the Internet. If the file being downloaded changes without the command used to download it changing, Docker will reuse the cached step with the old copy of the file instead of re-downloading it. If this happens, use --no-cache to force it to re-run the steps.
Episode solution
Key Points
Docker images contain the initial state of the filesystem for a container
Docker images are made up of layers
Dockerfiles consist of a series of commands to install software into the container.