Docker containers

This section expands on the interoperability recommendation to use Docker containers for executing tools from CWL workflows.

What are containers?

In short, a container is a set of processes executed as an isolated part of the operating system, with its own network and file system hierarchy; that is, / and localhost inside the container are separate from the host's. The file system is initialized from a container image, which typically contains a miniature Linux distribution with the required binaries and their dependencies pre-installed.

Containers are implemented using a set of Linux kernel features and can be considered an evolution of Solaris Zones and FreeBSD Jails, although with a different focus.

The most popular container technology is Docker, which adds command line tools and daemons to manage and build containers, as well as to download (“pull”) pre-built container images from online repositories.

In dev-ops environments Docker is often used to combine networked microservices (one per container) or to deploy long-running server applications (as a lightweight VM alternative). For CWL CommandLineTool descriptions, however, containers are used for individual executions of command line tools that exchange files. In CWL you'll typically wrap each tool binary as a separate container image, and each step invocation creates a new temporary container that is removed after the workflow finishes.
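As a sketch, a minimal CommandLineTool wrapping a single binary in a container might look like the following (the md5sum wrapper, file names, and the debian image tag are illustrative assumptions):

cwlVersion: v1.2
class: CommandLineTool
baseCommand: md5sum
hints:
  DockerRequirement:
    dockerPull: debian:12-slim
inputs:
  file:
    type: File
    inputBinding:
      position: 1
outputs:
  checksum:
    type: stdout
stdout: checksum.txt

Each invocation of this tool would run in a fresh container started from the declared image.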

It is possible to build your own Docker images using a reproducible Dockerfile recipe, with shell script commands that apply file system changes on top of a base image. The built image can be kept locally, or deposited in a repository like Docker Hub or Quay.io.
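As a sketch, a Dockerfile for a curl image on a Debian base might look like the following (the base image and package choice are assumptions):

FROM debian:12-slim
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["curl"]

The image could then be built locally with docker build and deposited in a repository with docker push.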

Containers in workflows

For workflows the use of containers provides a consistent runtime environment for individual tools, ensuring that all required dependencies and configurations are included and in predictable paths. Containers also provide a level of isolation, which means that your workflow can combine tools that could otherwise have conflicting dependency requirements.

On macOS and Windows desktops, containers transparently execute tools in a Linux virtual machine, ensuring workflow tools run in the same environment across host operating systems. Care should be taken to ensure the VM has sufficient memory for the workflow's tools.

How the workflow engine uses containers

The CWL engine will handle staging of input/output files into the container, as long as they are formally declared (see Avoiding off-workflow data flows below). The engine will transparently compose the right docker commands, typically using bind mounts to expose only the required working directory of that particular step.
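For example, formally declaring the files in the tool description is what lets the engine know what to stage into the container and what to collect afterwards (the names and glob pattern here are illustrative assumptions):

inputs:
  reads:
    type: File
    inputBinding:
      position: 1
outputs:
  report:
    type: File
    outputBinding:
      glob: "*.html"

Files that the tool reads or writes outside these declarations would not be visible to the engine, and may not exist inside the container at all.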

Because the container is otherwise isolated from the host, the use of containers also ensures that all file dependencies of the tool are either included in the container image or declared as explicit input in CWL.

Although workflows can mix container and non-container steps, it is recommended to transition all steps to use containers so that the whole workflow is portable.

cwltool can take the option --default-container debian:9, which ensures all command line tools run in a container with the given base image. It is recommended to transition workflows to use explicit images with DockerRequirement, even for “regular” POSIX tools like grep and sed.
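The equivalent explicit declaration in each tool description would be a hint like:

hints:
  DockerRequirement:
    dockerPull: debian:9

This keeps the workflow portable even when executed by an engine or user that does not pass the --default-container option.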

Docker in Docker?

Note that the CWL engine itself does not normally run in a container, as it is responsible for coordinating the workflow and the other containers by calling the docker command line.

If a tool uses containers internally, it can be tricky or insecure to execute that tool from an outer container, as Docker-in-Docker requires privileged mode, which in effect gives the container full root access to the host.

Although it may be possible to achieve nested containers more securely using Singularity, CWL's DockerRequirement does not currently support privilege options.

Finding container images

See finding tools for how to find existing container images for many bioinformatics tools.

The CWL example below assumes we want to run https://hub.docker.com/r/curlimages/curl from Docker Hub, whose image name is curlimages/curl:

hints:
  DockerRequirement:
    dockerPull: curlimages/curl

For using images from other repositories like https://quay.io/, the dockerPull must be qualified with a hostname:

hints:
  DockerRequirement:
    dockerPull: quay.io/opencloudio/curl

It is recommended to use the official Docker image from a tool's project when one is available, although there can be many reasons to pick a different image:

  • The container has not been updated for the latest release
  • Desired plugins or compile options were not enabled
  • The upstream container image is unnecessarily large; an alternative based on alpine and multi-stage builds might reduce the Docker image size
  • The upstream image does not have desired hardware optimizations or support, for example MPI or CUDA

It is not recommended to reference private Docker repositories or use the DockerRequirement options dockerLoad/dockerImport, except for proprietary software that can't be distributed in public repositories.

Similarly, the dockerFile option should only be used with a local Dockerfile if customizations are needed in order to run the tool from CWL. If a desired open source tool does not exist as a container image, you can use dockerFile as an experiment before publishing the image to Docker Hub under your user or organization's own namespace, replacing dockerFile with the updated reference in dockerPull:

hints:
  DockerRequirement:
    #dockerFile: curl-Dockerfile
    dockerPull: stain/curl
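During the experimental stage, before publishing, the hint might have looked like this sketch instead, where the $include directive inlines the contents of the local Dockerfile and dockerImageId (here curl-local, an assumed name) labels the locally built image:

hints:
  DockerRequirement:
    dockerFile:
      $include: curl-Dockerfile
    dockerImageId: curl-local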

Which container engine?

In order for the CWL engine to be able to execute tools in a container, the execution node(s) will need to have a container system installed. The most popular choice is the open source Docker Engine.

It is generally recommended to use the latest stable version of Docker Engine rather than a version provided by the operating system, although with recent distributions either should work.

For Windows and macOS desktops, Docker also provides Docker Desktop, which additionally helps manage the virtual machine. This install is a bit more extensive and includes components that are not open source. The stable version is recommended for this use.

Docker on Windows 10 also supports running the Windows OS inside containers, avoiding the need for virtual machines. While this in theory would work with CWL, it would make the CWL workflow incompatible with other operating systems.

It is not possible to run Linux and Windows containers concurrently on the same node, and most bioinformatics tools found as Docker images assume Linux x64 containers.

The Windows 10 Subsystem for Linux can also support Docker, but this configuration is currently not recommended for CWL users; bioinformatics tools in such containers may crash due to subtle differences between the Linux kernel and the WSL2 subsystem, particularly for POSIX file system access.

Alternative container engines

All major CWL engines support Docker, and some also support alternative container systems, although their configuration varies slightly:

  • cwltool: Docker by default, or cwltool --singularity or cwltool --user-space-docker-cmd=udocker or cwltool --user-space-docker-cmd=nvidia-docker
  • Toil: Docker by default, or toil-cwl-runner --singularity
  • Cromwell: Docker by default, Singularity through configuration
  • Arvados: Only Docker supported
  • CWL-Airflow: Only Docker supported
  • REANA: Only Docker supported

Singularity is popular in HPC centres as it permits more fine-grained access control and is considered more secure than Docker. Singularity can use its own container images or load Docker images; however, with CWL DockerRequirement only loading of Docker images is supported.

nvidia-docker is a wrapper around docker that handles bindings and permissions for NVIDIA GPUs. It can be used together with CUDA-optimized containers like gromacs/gromacs.
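For instance, a tool description might declare the CUDA-optimized image like this sketch, to be combined with the cwltool --user-space-docker-cmd=nvidia-docker option listed above:

hints:
  DockerRequirement:
    dockerPull: gromacs/gromacs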

udocker is a lightweight user-space alternative to Docker, with a largely compatible command line interface. By not requiring root privileges it can be easier to install, but instead of using Linux kernel features for containers, it intercepts and rewrites library calls so that the tool appears to run inside a container.

An alternative to containers that does not require root privileges is to use Conda packages.

Tips and caveats

When executing on cloud nodes or a local cluster, the chosen container technology will need to be installed on each node or be part of the node's base image. It is recommended to keep the same version on the workflow head node as on the compute nodes.

Even if a workflow does not have any DockerRequirement hints, containers may still be used by the CWL engine to evaluate JavaScript expressions if no compatible version of node is installed.
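For instance, a tool description with an expression like the following sketch would need node (or a node container) for evaluation, even without any DockerRequirement (the output name and glob are illustrative assumptions):

requirements:
  InlineJavascriptRequirement: {}
outputs:
  file_size:
    type: long
    outputBinding:
      glob: output.txt
      outputEval: $(self[0].size)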