Ensuring your workflow is scalable

When you have developed your workflow to the point where it gives promising results, verified by running it with your test data, it is time to scale up.

One of the advantages of using a workflow management system with reproducibility and interoperability features is that a workflow running locally can be transferred to run on a larger compute infrastructure without requiring significant changes.

This allows further development of the same workflow to stay in sync across deployments. A workflow where all dataflows are explicit and tools run in independent containers also allows job-level “embarrassingly parallel” upscaling without needing to handle job queues or file moving.

However, simply executing a workflow on a larger compute infrastructure will not always give it an automatic boost; the way you structure your workflow and annotate it with execution hints will help the engine utilize your compute resources more efficiently.
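One such hint is CWL's standard ResourceRequirement, which tells the engine how many cores and how much memory each tool invocation needs so it can be scheduled sensibly. The sketch below uses a hypothetical tool name and illustrative resource values that you would tune to your own tool:

cwlVersion: v1.0
class: CommandLineTool
# 'my_tool' is a hypothetical command; the resource values are
# illustrative and should be adjusted to the real tool's needs.
baseCommand: my_tool
requirements:
  ResourceRequirement:
    coresMin: 4        # reserve at least 4 CPU cores per invocation
    ramMin: 8192       # minimum RAM, in mebibytes (8 GiB)
    tmpdirMin: 10240   # minimum scratch space, in mebibytes (10 GiB)
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1
outputs:
  results:
    type: stdout

Engines that schedule onto clusters or clouds can use these values to pack invocations onto nodes without over- or under-allocating.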

Choosing compute infrastructure and workflow engine

The optimizations appropriate for your workflow will partially depend on which particular compute infrastructure, storage solution and workflow engine you have settled on.

While the reference engine cwltool aims to support all of CWL's features, it is by design a single-node local executor that focuses on correctness and workflow debugging. By default, cwltool runs one tool invocation at a time and will “fail fast”, aborting the whole workflow on any error. This can be tweaked with --parallel and --on-error continue, which, together with --cachedir on repeated invocations, may provide significant speedups when testing a workflow locally. Nevertheless, for larger computational workflows it will be necessary to execute across multiple nodes on a local cluster or cloud infrastructure.
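For example, a local test run with these options could look roughly like this (the workflow, job file and cache directory names are placeholders):

# run steps in parallel, keep going past failed steps, and reuse cached results
cwltool --parallel --on-error continue --cachedir /tmp/cwl-cache \
    my_workflow.cwl my_job.yml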

For the Common Workflow Language, multiple implementations are available, with support for many different computational backends. These engines vary in their complexity, features and available documentation for different scenarios, and no single engine is always the best choice.

The BioExcel Best Practice Guide “How to choose which CWL engine to deploy” covers the main choices of workflow engines for running CWL workflows, but you may also want to do your own research and trials depending on your particular setup and requirements.

If you have followed the interoperability advice, your CWL workflow should work in all compliant engines, but that does not mean you should compare their performance based only on a workflow that ran sufficiently well on a desktop computer.

The rest of this page shows additional hints and techniques that can help your workflow engine execute your workflow efficiently.

Handling large-scale scattering

Considerations for data handling on cloud execution

See the Arvados documentation on CWL extensions: https://doc.arvados.org/user/cwl/cwl-extensions.html

Where to scatter? Structuring iterations with nested workflows

The best principle to follow when scattering workflows is to create the longest continuous workflows possible. A scatter instruction should enclose a subworkflow in which each step relies only on global outputs from steps before the scatter, or on local outputs from other steps inside the same scattered subworkflow. Creating a workflow where each step is scattered individually is inefficient, as every instance of a step must finish before the next scattered step can start, so each step takes as long as its slowest individual process.

Scattering is best organised using nested workflows: create a single inner workflow that covers all tasks within the scattered process, then scatter that inner workflow as one step, which keeps the dependencies clear.
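For contrast, here is a sketch of the pattern to avoid, in which each step is scattered separately (process_one.cwl and process_two.cwl are the same hypothetical tool descriptions used in the examples below). Because step_two scatters over the complete array produced by step_one, every instance of step_one must finish before any instance of step_two can begin:

cwlVersion: v1.0
class: Workflow

requirements:
  ScatterFeatureRequirement: {}

inputs:
  step_input_files:
    type:
      type: array
      items: File
  input_config: string

outputs:
  output_file:
    type:
      type: array
      items: File
    outputSource: step_two/final_file

steps:
  step_one:
    run: process_one.cwl
    scatter: file_prepare
    in:
      file_prepare: step_input_files
      prep_config: input_config
    out: [intermediate_file]

  step_two:
    run: process_two.cwl
    scatter: file_process
    in:
      file_process: step_one/intermediate_file
    out: [final_file]

The recommended alternative below instead wraps both steps in an inner workflow and scatters that workflow as a single step, so each input file can flow through both steps independently.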

An example outer workflow:

cwlVersion: v1.0
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  ScatterFeatureRequirement: {}

inputs:
  step_input_files:
    type:
      type: array
      items: File
  input_config: string

outputs:
  output_file:
    type:
      type: array
      items: File   # the scattered step produces one file per input file
    outputSource: step_outer/return_file

steps:
  step_outer:
    run: inner_script.cwl
    scatter: input_file
    in:
      input_file: step_input_files
      config_information: input_config
    out: [return_file]

And the example inner workflow, inner_script.cwl:

cwlVersion: v1.0
class: Workflow

inputs:
  input_file: File
  config_information: string

outputs:
  return_file:
    type: File
    outputSource: step_two/final_file

steps:
  step_one:
    run: process_one.cwl
    in:
      file_prepare: input_file
      prep_config: config_information
    out: [intermediate_file]

  step_two:
    run: process_two.cwl
    in:
      file_process: step_one/intermediate_file
    out: [final_file]

In this example, because step_two depends only on the single intermediate_file generated by step_one in the same instance (and not on the intermediate_file generated by any of the other scattered instances), we enclose both steps within the same scattered inner workflow.
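To show how this would be run, a hypothetical job input file for the outer workflow could look like this (the file names and configuration string are placeholders):

step_input_files:
  - class: File
    path: sample_a.dat
  - class: File
    path: sample_b.dat
  - class: File
    path: sample_c.dat
input_config: "example configuration"

Each listed file becomes one scattered instance of inner_script.cwl; because the instances do not depend on each other, the engine is free to run them concurrently.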

Handling many files