8  Review of Workflow Frameworks

A lot of different workflow frameworks exist, and there are a lot of factors to consider when choosing the right one for your project. Wratten, Wilm, and Göke (2021) conducted a review of popular workflow managers for bioinformatics, evaluating them based on several key aspects, including ease of use, expressiveness, portability, scalability, and learning resources (Table 8.1).

Table 8.1: Overview of workflow managers for bioinformatics (Wratten, Wilm, and Göke 2021).
Tool Class Ease of use Expressiveness Portability Scalability Learning resources Pipeline initiatives
Galaxy Graphical ●●● ●○○ ●●● ●●● ●●● ●●○
KNIME Graphical ●●● ●○○ ○○○ ●●◐ ●●● ●●○
Nextflow DSL ●●○ ●●● ●●● ●●● ●●● ●●●
Snakemake DSL ●●○ ●●● ●●◐ ●●● ●●○ ●●●
GenPipes DSL ●●○ ●●● ●●○ ●●○ ●●○ ●●○
bPipe DSL ●●○ ●●● ●●○ ●●◐ ●●○ ●○○
Pachyderm DSL ●●○ ●●● ●○○ ●●○ ●●● ○○○
SciPipe Library ●●○ ●●● ○○○ ○○○ ●●○ ○○○
Luigi Library ●●○ ●●● ●○○ ●●◐ ●●○ ○○○
Cromwell + WDL Execution + workflow specification ●○○ ●●○ ●●● ●●◐ ●●○ ●●○
cwltool + CWL Execution + workflow specification ●○○ ●●○ ●●◐ ○○○ ●●● ●●○
Toil + CWL/WDL/Python Execution + workflow specification ●○○ ●●● ●◐○ ●●● ●●○ ●●○

Even more interesting is the accompanying GitHub repository (GoekeLab/bioinformatics-workflows), which contains a Proof of Concept (PoC) RNA-seq workflow implemented in the different workflow frameworks. These implementations were contributed and reviewed by the developers of the respective frameworks themselves!

Looking at these implementations, at first glance, one would think that the differences between the frameworks are minimal, and that the choice of framework is mostly a matter of personal preference.

8.1 Comparing PoC Workflows to Community-Made Modules

However, comparing the POC workflows (left) to community-made modules (right), it becomes clear that creating production-ready components requires a lot more than specifying a command’s input and output files.

8.1.1 Nextflow

Wratten et al. 2021 PoC (Source):

process FASTQC {
  publishDir params.outdir

  input:
    path index
    path left
    path right
  output:
    path 'qc'

  """
    mkdir qc && fastqc --quiet '${params.left}' '${params.right}' --outdir qc
  """
}

nf-core (Source):

channels:
  - conda-forge
  - bioconda
dependencies:
  - bioconda::fastqc=0.12.1
process FASTQC {
    tag "$meta.id"
    label 'process_medium'

    conda "${moduleDir}/environment.yml"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/fastqc:0.12.1--hdfd78af_0' :
        'biocontainers/fastqc:0.12.1--hdfd78af_0' }"

    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("*.html"), emit: html
    tuple val(meta), path("*.zip") , emit: zip
    path  "versions.yml"           , emit: versions

    when:
    task.ext.when == null || task.ext.when

    script:
    def args = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"
    // Make list of old name and new name pairs to use for renaming in the bash while loop
    def old_new_pairs = reads instanceof Path || reads.size() == 1 ? [[ reads, "${prefix}.${reads.extension}" ]] : reads.withIndex().collect { entry, index -> [ entry, "${prefix}_${index + 1}.${entry.extension}" ] }
    def rename_to = old_new_pairs*.join(' ').join(' ')
    def renamed_files = old_new_pairs.collect{ old_name, new_name -> new_name }.join(' ')

    // The total amount of allocated RAM by FastQC is equal to the number of threads defined (--threads) time the amount of RAM defined (--memory)
    // https://github.com/s-andrews/FastQC/blob/1faeea0412093224d7f6a07f777fad60a5650795/fastqc#L211-L222
    // Dividing the task.memory by task.cpu allows to stick to requested amount of RAM in the label
    def memory_in_mb = MemoryUnit.of("${task.memory}").toUnit('MB') / task.cpus
    // FastQC memory value allowed range (100 - 10000)
    def fastqc_memory = memory_in_mb > 10000 ? 10000 : (memory_in_mb < 100 ? 100 : memory_in_mb)

    """
    printf "%s %s\\n" $rename_to | while read old_name new_name; do
        [ -f "\${new_name}" ] || ln -s \$old_name \$new_name
    done

    fastqc \\
        $args \\
        --threads $task.cpus \\
        --memory $fastqc_memory \\
        $renamed_files

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        fastqc: \$( fastqc --version | sed '/FastQC v/!d; s/.*v//' )
    END_VERSIONS
    """

    stub:
    def prefix = task.ext.prefix ?: "${meta.id}"
    """
    touch ${prefix}.html
    touch ${prefix}.zip

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        fastqc: \$( fastqc --version | sed '/FastQC v/!d; s/.*v//' )
    END_VERSIONS
    """
}
name: fastqc
description: Run FastQC on sequenced reads
keywords:
  - quality control
  - qc
  - adapters
  - fastq
tools:
  - fastqc:
      description: |
        FastQC gives general quality metrics about your reads.
        It provides information about the quality score distribution
        across your reads, the per base sequence content (%A/C/G/T).
        You get information about adapter contamination and other
        overrepresented sequences.
      homepage: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
      documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/
      licence: ["GPL-2.0-only"]
input:
  - meta:
      type: map
      description: |
        Groovy Map containing sample information
        e.g. [ id:'test', single_end:false ]
  - reads:
      type: file
      description: |
        List of input FastQ files of size 1 and 2 for single-end and paired-end data,
        respectively.
output:
  - meta:
      type: map
      description: |
        Groovy Map containing sample information
        e.g. [ id:'test', single_end:false ]
  - html:
      type: file
      description: FastQC report
      pattern: "*_{fastqc.html}"
  - zip:
      type: file
      description: FastQC report archive
      pattern: "*_{fastqc.zip}"
  - versions:
      type: file
      description: File containing software versions
      pattern: "versions.yml"
authors:
  - "@drpatelh"
  - "@grst"
  - "@ewels"
  - "@FelixKrueger"
maintainers:
  - "@drpatelh"
  - "@grst"
  - "@ewels"
  - "@FelixKrueger"
nextflow_process {

    name "Test Process FASTQC"
    script "../main.nf"
    process "FASTQC"

    tag "modules"
    tag "modules_nfcore"
    tag "fastqc"

    test("sarscov2 single-end [fastq]") {

        when {
            process {
                """
                input[0] = Channel.of([
                    [ id: 'test', single_end:true ],
                    [ file(params.modules_testdata_base_path + 'genomics/sarscov2/illumina/fastq/test_1.fastq.gz', checkIfExists: true) ]
                ])
                """
            }
        }

        then {
            assertAll (
                { assert process.success },
                // NOTE The report contains the date inside it, which means that the md5sum is stable per day, but not longer than that. So you can't md5sum it.
                // looks like this: <div id="header_filename">Mon 2 Oct 2023<br/>test.gz</div>
                // https://github.com/nf-core/modules/pull/3903#issuecomment-1743620039
                { assert process.out.html[0][1] ==~ ".*/test_fastqc.html" },
                { assert process.out.zip[0][1] ==~ ".*/test_fastqc.zip" },
                { assert path(process.out.html[0][1]).text.contains("<tr><td>File type</td><td>Conventional base calls</td></tr>") },
                { assert snapshot(process.out.versions).match() }
            )
        }
    }

    /* The rest of the tests are omitted */
}

8.1.2 Snakemake

Wratten et al. 2021 PoC (Source):

rule fastqc:
    input:
        get_fastqs,
    output:
        directory("results/fastqc/{sample}"),
    log:
        "logs/fastqc/{sample}.log",
    conda:
        "envs/fastqc.yaml"
    params:
        "--quiet --outdir",
    shell:
        "mkdir {output}; fastqc {input} {params} {output} 2> {log}"

snakemake-wrappers (Source):

channels:
  - conda-forge
  - bioconda
  - nodefaults
dependencies:
  - fastqc =0.12.1
  - snakemake-wrapper-utils =0.6.2
name: fastqc
description: |
  Generate fastq qc statistics using fastqc.
url: https://github.com/s-andrews/FastQC
authors:
  - Julian de Ruiter
input:
  - fastq file
output:
  - html file containing statistics
  - zip file containing statistics
"""Snakemake wrapper for fastqc."""

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"


from os import path
import re
from tempfile import TemporaryDirectory
from snakemake.shell import shell
from snakemake_wrapper_utils.snakemake import get_mem

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
# Define memory per thread (https://github.com/s-andrews/FastQC/blob/master/fastqc#L201-L222)
mem_mb = int(get_mem(snakemake, "MiB") / snakemake.threads)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)
    # Remove file extension(s) (similar to the internal fastqc approach)
    base = re.sub("\\.gz$", "", base)
    base = re.sub("\\.bz2$", "", base)
    base = re.sub("\\.txt$", "", base)
    base = re.sub("\\.fastq$", "", base)
    base = re.sub("\\.fq$", "", base)
    base = re.sub("\\.sam$", "", base)
    base = re.sub("\\.bam$", "", base)

    return base


# If you have multiple input files fastqc doesn't know what to do. Taking silently only first gives unapreciated results

if len(snakemake.input) > 1:
    raise IOError("Got multiple input files, I don't know how to process them!")

# Run fastqc, since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell(
        "fastqc"
        " --threads {snakemake.threads}"
        " --memory {mem_mb}"
        " {extra}"
        " --outdir {tempdir:q}"
        " {snakemake.input[0]:q}"
        " {log}"
    )

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path:q} {snakemake.output.html:q}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path:q} {snakemake.output.zip:q}")
rule fastqc:
    input:
        "reads/{sample}.fastq"
    output:
        html="qc/fastqc/{sample}.html",
        zip="qc/fastqc/{sample}_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
    params:
        extra = "--quiet"
    log:
        "logs/fastqc/{sample}.log"
    threads: 1
    resources:
        mem_mb = 1024
    wrapper:
        "master/bio/fastqc"

8.1.3 Toil + WDL

Wratten et al. 2021 PoC (Source):

task FastQCone {
  input {
     File reads
  }

  command {
     zcat "${reads}" | fastqc stdin:readsone
  }

  output {
     File fastqc_res = "readsone_fastqc.html"
  }
  
  runtime {
     docker: 'pegi3s/fastqc'
  }
}

BioWDL (Source):

version 1.0

# ... license ...

task Fastqc {
    input {
        File seqFile
        String outdirPath
        Boolean casava = false
        ## ... other arguments ...

        # Set javaXmx a little high.
        String javaXmx="1750M"
        Int threads = 1
        String memory = "2GiB"
        Int timeMinutes = 1 + ceil(size(seqFile, "G")) * 4
        String dockerImage = "quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0"

        Array[File]? noneArray
        File? noneFile
    }

    # Chops of the .gz extension if present.
    String name = basename(sub(seqFile, "\.gz$",""))
    # This regex chops of the extension just as fastqc does it.
    String reportDir = outdirPath + "/" + sub(name, "\.[^\.]*$", "_fastqc")

    # We reimplement the perl wrapper here. This has the advantage that it
    # gives us more control over the amount of memory used.
    command <<<
        set -e
        mkdir -p "~{outdirPath}"
        FASTQC_DIR="/usr/local/opt/fastqc-0.12.1"
        export CLASSPATH="$FASTQC_DIR:$FASTQC_DIR/sam-1.103.jar:$FASTQC_DIR/jbzip2-0.9.jar:$FASTQC_DIR/cisd-jhdf5.jar"
        java -Djava.awt.headless=true -XX:ParallelGCThreads=1 \
        -Xms200M -Xmx~{javaXmx} \
        ~{"-Dfastqc.output_dir=" + outdirPath} \
        ~{true="-Dfastqc.casava=true" false="" casava} \
        # ... other arguments ...
        ~{"-Dfastqc.kmer_size=" + kmers} \
        ~{"-Djava.io.tmpdir=" + dir} \
        uk.ac.babraham.FastQC.FastQCApplication \
        ~{seqFile}
    >>>

    output {
        File htmlReport = reportDir + ".html"
        File reportZip = reportDir + ".zip"
        File? summary = if extract then reportDir + "/summary.txt" else noneFile
        File? rawReport = if extract then reportDir + "/fastqc_data.txt" else noneFile
        Array[File]? images = if extract then glob(reportDir + "/Images/*.png") else noneArray
    }

    runtime {
        cpu: threads
        memory: memory
        time_minutes: timeMinutes
        docker: dockerImage
    }

    parameter_meta {
        # inputs
        seqFile: {description: "A fastq file.", category: "required"}
        outdirPath: {description: "The path to write the output to.", catgory: "required"}
        # ... other arguments ...
        dockerImage: {description: "The docker image used for this task. Changing this may result in errors which the developers may choose not to address.", category: "advanced"}

        # outputs
        htmlReport: {description: "HTML report file."}
        reportZip: {description: "Source data file."}
        summary: {description: "Summary file."}
        rawReport: {description: "Raw report file."}
        images: {description: "Images in report file."}
    }

    meta {
        WDL_AID: {
            exclude: ["noneFile", "noneArray"]
        }
    }
}
version 1.0

# Copyright (c) 2017 Leiden University Medical Center
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

task Fastqc {
    input {
        File seqFile
        String outdirPath
        Boolean casava = false
        Boolean nano = false
        Boolean noFilter = false
        Boolean extract = false
        Boolean nogroup = false

        Int? minLength
        String? format
        File? contaminants
        File? adapters
        File? limits
        Int? kmers
        String? dir

        # Set javaXmx a little high. Equal to fastqc default with 7 threads.
        # This is because some fastq files need more memory. 2G per core
        # is a nice cluster default, so we use all the rest of the memory for
        # fastqc so we should have as little OOM crashes as possible even with
        # weird edge case fastq's.
        String javaXmx="1750M"
        Int threads = 1
        String memory = "2GiB"
        Int timeMinutes = 1 + ceil(size(seqFile, "G")) * 4
        String dockerImage = "quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0"

        Array[File]? noneArray
        File? noneFile
    }

    # Chops of the .gz extension if present.
    # The Basename needs to be taken here. Otherwise paths might differ
    # between similar jobs.
    String name = basename(sub(seqFile, "\.gz$",""))
    # This regex chops of the extension and replaces it with _fastqc for
    # the reportdir.
    # Just as fastqc does it.
    String reportDir = outdirPath + "/" + sub(name, "\.[^\.]*$", "_fastqc")

    # We reimplement the perl wrapper here. This has the advantage that it
    # gives us more control over the amount of memory used.
    command <<<
        set -e
        mkdir -p "~{outdirPath}"
        FASTQC_DIR="/usr/local/opt/fastqc-0.12.1"
        export CLASSPATH="$FASTQC_DIR:$FASTQC_DIR/sam-1.103.jar:$FASTQC_DIR/jbzip2-0.9.jar:$FASTQC_DIR/cisd-jhdf5.jar"
        java -Djava.awt.headless=true -XX:ParallelGCThreads=1 \
        -Xms200M -Xmx~{javaXmx} \
        ~{"-Dfastqc.output_dir=" + outdirPath} \
        ~{true="-Dfastqc.casava=true" false="" casava} \
        ~{true="-Dfastqc.nano=true" false="" nano} \
        ~{true="-Dfastqc.nofilter=true" false="" noFilter} \
        ~{true="-Dfastqc.unzip=true" false="" extract} \
        ~{true="-Dfastqc.nogroup=true" false="" nogroup} \
        ~{"-Dfastqc.min_length=" + minLength} \
        ~{"-Dfastqc.sequence_format=" + format} \
        ~{"-Dfastqc.threads=" + threads} \
        ~{"-Dfastqc.contaminant_file=" + contaminants} \
        ~{"-Dfastqc.adapter_file=" + adapters} \
        ~{"-Dfastqc.limits_file=" + limits} \
        ~{"-Dfastqc.kmer_size=" + kmers} \
        ~{"-Djava.io.tmpdir=" + dir} \
        uk.ac.babraham.FastQC.FastQCApplication \
        ~{seqFile}
    >>>

    output {
        File htmlReport = reportDir + ".html"
        File reportZip = reportDir + ".zip"
        File? summary = if extract then reportDir + "/summary.txt" else noneFile
        File? rawReport = if extract then reportDir + "/fastqc_data.txt" else noneFile
        Array[File]? images = if extract then glob(reportDir + "/Images/*.png") else noneArray
    }

    runtime {
        cpu: threads
        memory: memory
        time_minutes: timeMinutes
        docker: dockerImage
    }

    parameter_meta {
        # inputs
        seqFile: {description: "A fastq file.", category: "required"}
        outdirPath: {description: "The path to write the output to.", catgory: "required"}
        casava: {description: "Equivalent to fastqc's --casava flag.", category: "advanced"}
        nano: {description: "Equivalent to fastqc's --nano flag.", category: "advanced"}
        noFilter: {description: "Equivalent to fastqc's --nofilter flag.", category: "advanced"}
        extract: {description: "Equivalent to fastqc's --extract flag.", category: "advanced"}
        nogroup: {description: "Equivalent to fastqc's --nogroup flag.", category: "advanced"}
        minLength: {description: "Equivalent to fastqc's --min_length option.", category: "advanced"}
        format: {description: "Equivalent to fastqc's --format option.", category: "advanced"}
        contaminants: {description: "Equivalent to fastqc's --contaminants option.", category: "advanced"}
        adapters: {description: "Equivalent to fastqc's --adapters option.", category: "advanced"}
        limits: {description: "Equivalent to fastqc's --limits option.", category: "advanced"}
        kmers: {description: "Equivalent to fastqc's --kmers option.", category: "advanced"}
        dir: {description: "Equivalent to fastqc's --dir option.", category: "advanced"}
        javaXmx: {description: "The maximum memory available to the program. Should be lower than `memory` to accommodate JVM overhead.", category: "advanced"}
        threads: {description: "The number of cores to use.", category: "advanced"}
        memory: {description: "The amount of memory this job will use.", category: "advanced"}
        timeMinutes: {description: "The maximum amount of time the job will run in minutes.", category: "advanced"}
        dockerImage: {description: "The docker image used for this task. Changing this may result in errors which the developers may choose not to address.", category: "advanced"}

        # outputs
        htmlReport: {description: "HTML report file."}
        reportZip: {description: "Source data file."}
        summary: {description: "Summary file."}
        rawReport: {description: "Raw report file."}
        images: {description: "Images in report file."}
    }

    meta {
        WDL_AID: {
            exclude: ["noneFile", "noneArray"]
        }
    }
}

8.2 Limitations of the study

However, the Supplementary Table shows that the comparison in Table 8.1 was rather limited, since the score of each category was only based on a single criterion. Of the following categories, only “Scalability” was determined by more than one criterion:

  • Ease of Use: Graphical interface with execution environment (score of 3), programming interface with in-built execution environment (score of 2), separated development and execution environment (score of 1).
  • Expressiveness: Based on an existing programming language (3) or a new language or restricted vocabulary (2), primary interaction with graphical user interface (1).
  • Portability: Integration with three or more container and package manager platforms (3), two platforms are supported (2), one platform is supported (1).
  • Scalability: Considers cloud support, scheduler and orchestration tool integration, and executor support. Please refer to Supplementary Table 1 - Sheet 2 (Scalability).
  • Learning resources: Official tutorials, forums and events (3), tutorials and forums (2), tutorials or forums (1).
  • Pipelines Initiatives: Community and curated (3), community or curated (2), not community or curated (1).

By comparing the example code of the respective workflow frameworks, it also becomes clear that we need not only look at example code of POC workflows, but actual production-ready workflows and pipelines. Such code often require a lot more functionality, including:

  • Error handling
  • Logging
  • Data provenance
  • Parameterization
  • Testing
  • Documentation
  • Containerization
  • Resource management