Writing incremental tests

Providing tests for workflows is important: they enable new users to confirm that a workflow behaves as expected, and they give developers a reference to ensure that changes made to the tools don't break the workflow. Setting up tests can be daunting and time-consuming at the start, but doing so will pay off in the long term. Below we discuss some of the details of how to do this.

Find/create test data for each step

If the data is open, then partial data taken from earlier steps of the workflow is the easiest to use.

Intermediate data for a workflow can be generated by calling the CWL reference implementation cwl-runner with the --cachedir option. This will cache all step outputs, which you can then copy to create test data for distribution. A perhaps more tedious alternative is to add workflow-level outputs mapped from every step's outputs and copy the test data manually.
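For example, assuming the workflow is saved as workflow.cwl with an input object inputs.yml (hypothetical file names), the cached intermediate outputs can be generated with:

cwl-runner --cachedir cache/ workflow.cwl inputs.yml

Each step's outputs will then appear under cache/, ready to be copied into your test data.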

The --cachedir option, sometimes combined with --target, is also useful for incremental testing of a workflow, as it will reuse old outputs as long as a tool's configuration and inputs exactly match a previous execution. Note that this means the step outputs are assumed to be reusable - for tools with mutable state (e.g. wget of the current weather or a database lookup) add the WorkReuse hint with enableReuse: false to force re-execution.
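As a sketch, a tool wrapping such a command (the command and output names here are hypothetical) could disable reuse like this:

cwlVersion: v1.2
class: CommandLineTool
baseCommand: fetch-weather    # hypothetical command with mutable state
hints:
  WorkReuse:
    enableReuse: false        # always re-execute; never serve from the cache
inputs: []
outputs:
  report:
    type: stdout

Note that the WorkReuse hint requires cwlVersion v1.1 or later.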

Structure workflow files to support testing

CWL workflows can be nested. When writing a complex workflow, it can be better to break it into several scripts, so that each can be tested independently.

This also allows you to create sibling variants of the parent workflow that tweak data handling, while keeping most of the reusable steps inside the nested workflows.
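As a minimal sketch (file and parameter names are hypothetical), the parent workflow runs the nested workflow as an ordinary step, with the SubworkflowFeatureRequirement enabled:

cwlVersion: v1.2
class: Workflow
requirements:
  SubworkflowFeatureRequirement: {}
inputs:
  raw_data: File
outputs:
  result:
    type: File
    outputSource: analysis/result
steps:
  analysis:
    run: analysis-subworkflow.cwl   # nested workflow, testable on its own
    in:
      input_data: raw_data
    out: [result]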

Setting up automated testing with GitHub Actions

If you are using GitHub to host your git repository, then you can make use of GitHub Actions to automate your testing process. Setting up the tests can be a (relatively) straightforward process if you follow a few simple rules and approach it in an iterative manner.

The minimum requirements are:

  • GitHub Actions script
  • package dependency file for your script

Your dependency file describes the packages needed to run CWL (in the first instance - you can add other dependencies later). If you use conda to install cwl_runner, then this could be as simple as:

name: cwlrunner
channels:
  - conda-forge
  - defaults
dependencies:
  - cwl_runner

This should be saved in your git repository - in this example the file is called env.yml.

The GitHub Actions script should be saved in the .github/workflows/ folder of your git repository. Its basic layout is:

name: CI testing

on: [push, pull_request]

jobs:

  workflow_validation:
  
    runs-on: ubuntu-latest
    
    steps:
    
    - name: Checkout repository
      uses: actions/checkout@v2
    
    - name: Set up Conda
      uses: conda-incubator/setup-miniconda@v2
      with:
        auto-update-conda: true
        python-version: 3.8
        activate-environment: cwlrunner
        environment-file: env.yml
     
    - name: Validate Script
      run: |
        conda run -n cwlrunner cwltool --validate script.cwl

This will validate the CWL script script.cwl whenever a push is made to the GitHub repository, or a pull request is opened against it.

The first two steps, Checkout repository and Set up Conda, use the checkout and setup-miniconda actions from the GitHub marketplace (these are, generally, a lot simpler to use than writing your own scripts, and should work on most of the available testing operating systems).

The last step, Validate Script, runs a simple shell command (the shell is not a login shell, so conda cannot be activated normally and conda run -n [env] must be used instead) that carries out the validation of the workflow script.

Just like in computational workflows, it is important to lock down the versions of GitHub Actions you rely on, as shown above with @v2 and python-version.

If your repository uses git submodules for loading libraries, then these can be loaded by adapting the checkout step to enable submodules:

    - name: Checkout repository
      uses: actions/checkout@v2
      with:
        submodules: recursive

Once you have a script which successfully runs the validation test (even if not all steps validate successfully), you can start adding other tests as extra steps in the script. These will require inputs for your CWL workflow, which can be gathered as described above and stored in a tests/ directory, along with the configuration file for the tests (sketched below). Each test can then be added as an extra step in your script, e.g.:

    - name: Run Testcase Script
      run: |
        conda run -n cwlrunner cwltool script.cwl tests/testcase.yml
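The configuration file is a plain CWL input object describing the test inputs; a minimal sketch (the parameter names and fixture file are hypothetical) might be:

input_file:
  class: File
  path: example-input.txt   # small test fixture, resolved relative to this file
threshold: 0.5              # hypothetical numeric parameter of script.cwl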

Defensive CWL programming

It is good practice to specify a format for all File objects passed into or out of your tools. These should, where possible, refer to an existing formal vocabulary description, such as the EDAM PDB format, e.g.:

inputs:
  input_file:
    type: File
    format: http://edamontology.org/format_1476

or (if your script uses namespaces):

$namespaces:
  edam: http://edamontology.org/
inputs:
  input_file:
    type: File
    format: edam:format_1476

Doing this will prompt the user to check the format of their input file. Note that cwltool checks only the workflow annotation, not the file contents themselves, so it is of course possible for a user to simply label an incorrect input file as if it were the correct format in order to try to run the workflow. But adding this format check will assist them in debugging once they see that this does not work.

You can specify your own, local, format for the File objects - this allows for quick development of new tools and workflows, at the cost of some traceability. If you do this, it is recommended to include the format information in the documentation that you provide with the tool and, in the long term, to submit your format for inclusion in an ontology or vocabulary (such as EDAM).
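For instance, a local format can be declared under your own namespace (the URL and format name below are hypothetical):

$namespaces:
  local: https://example.com/my-formats/
inputs:
  input_file:
    type: File
    format: local:mytool-input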