Conda & BioConda

An alternative to containers is to use a packaging system like Conda, particularly as the BioConda initiative has wrapped more than 7000 bioinformatics packages, which are frequently updated.

Like a container, a Conda environment is a file hierarchy with its own bin and lib, but activating a Conda environment is less intrusive and requires no root permissions, as it only modifies shell environment variables like PATH. Conda environments do not gain the safety from isolation that containers do, but they typically do include most of the system dependencies, like libstdc++.so.

Conda in CWL

To indicate in CWL that a package is available from Conda, we add a SoftwareRequirement item to the hints of the CommandLineTool object.

We could place the SoftwareRequirement under requirements instead, but this would prevent workflow execution if Conda were not available, even if the command line tool was already available on PATH. Similarly, we usually also place DockerRequirement under hints, so that the workflow engine can try both options.

In CWL, Conda dependencies are identified by their URL in a https://anaconda.org/ channel, typically conda-forge or bioconda (the two channels used in this section). For example, for curl from conda-forge:

hints:
  SoftwareRequirement:
    packages:
      curl:
        specs: 
          - https://anaconda.org/conda-forge/curl
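
A fuller sketch of how these hints could sit inside a complete tool description follows. The tool itself (its inputs, outputs and file names) is hypothetical, and the Docker image name curlimages/curl is an assumption that does not come from this guide; substitute an image you trust. Both requirements are placed under hints, so an engine may use whichever option it supports:

```yaml
cwlVersion: v1.0
class: CommandLineTool
# Hypothetical tool: fetch a URL with curl and capture the response body
baseCommand: [curl, --location]

hints:
  # Conda package, identified by its URL in an anaconda.org channel
  SoftwareRequirement:
    packages:
      curl:
        specs:
          - https://anaconda.org/conda-forge/curl
  # Container alternative; the image name is an assumption for illustration
  DockerRequirement:
    dockerPull: curlimages/curl

inputs:
  url:
    type: string
    inputBinding:
      position: 1

outputs:
  downloaded:
    type: stdout

# Hypothetical file name for the captured standard output
stdout: downloaded.txt
```

Because hints are not mandatory, an engine with neither Conda nor Docker available is still free to run the tool with whatever curl is already on PATH.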

Using Conda in the shell

One advantage of Conda is that it can also easily be used on the command line, to experiment with the same version of the tool that your workflow will eventually use:

(base) stain@biggie:~$ conda create -n bam bamtools
 (...)
(base) stain@biggie:~$ conda activate bam

(bam) stain@biggie:~$ type bamtools 
bamtools is hashed (/home/stain/miniconda3/envs/bam/bin/bamtools)

(bam) stain@biggie:~$ bamtools  --version
bamtools 2.5.1

The binaries will mainly reference libraries within the Conda environment /home/stain/miniconda3/envs/bam/ but will also reference some core libraries from the main operating system, as ldd shows:

(bam) stain@biggie:~$ ldd /home/stain/miniconda3/envs/bam/bin/bamtools
	linux-vdso.so.1 (0x00007ffdcc54d000)
	libz.so.1 => /home/stain/miniconda3/envs/bam/bin/../lib/libz.so.1 (0x00007f5c13290000)
	libstdc++.so.6 => /home/stain/miniconda3/envs/bam/bin/../lib/libstdc++.so.6 (0x00007f5c1311b000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f5c12fb4000)
	libgcc_s.so.1 => /home/stain/miniconda3/envs/bam/bin/../lib/libgcc_s.so.1 (0x00007f5c12fa0000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f5c12dae000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f5c133fe000)

Conda tips and caveats

If a BioConda package is outdated or missing, contributing a recipe is usually welcomed by the community, and contributions can often be accepted within a couple of hours.

One disadvantage of Conda is that it may take a long time to download and initialize from an empty environment. It is also slightly more fragile than containers, as installing multiple tools in one environment, or applying later updates, could cause library dependency version conflicts. CWL engines using Conda will create a new environment for each CWL CommandLineTool.

It is possible to list multiple Conda packages in CWL, although usually the main recipe will already include all the required dependencies.
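
As a sketch of what listing several packages might look like, the following hints name two BioConda packages. The pairing of bamtools with samtools is only an illustration, and the optional version field is part of the CWL SoftwareRequirement schema rather than something shown elsewhere in this guide; the value 2.5.1 is the bamtools version from the shell session above:

```yaml
hints:
  SoftwareRequirement:
    packages:
      bamtools:
        specs:
          - https://anaconda.org/bioconda/bamtools
        # Optional: list of known compatible versions
        version: ["2.5.1"]
      # Illustrative second package; not taken from the original text
      samtools:
        specs:
          - https://anaconda.org/bioconda/samtools
```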

Conda can be used to install the CWL engines conda-forge/cwltool and bioconda/toil, and they can be used to run CWL workflows that use Conda or Docker.

As bioconda/toil depends on a particular version of cwltool, install that first. If you want a newer cwltool, create a separate Conda environment for it. Install conda-forge/cwltool, not the outdated bioconda/cwltool!

Conda packages are compiled for each operating system and may have subtle differences. While Conda runs on Linux, macOS and Windows, note that BioConda packages are only available for macOS and Linux.

Conda can be used to get a consistent/updated set of GNU coreutils, sed, Python etc. on macOS and Windows.