Create a Workflow by Composing Tools
Overview
Teaching: 30 min
Exercises: 10 minQuestions
What is the syntax of CWL?
What are the key components of a workflow?
Objectives
Write a workflow based on the source shell script, making use of existing tool wrappers.
This lesson has migrated to https://doc.arvados.org/rnaseq-cwl-training/02-workflow/index.html
Source shell script
In this lesson, we will develop an initial workflow inspired by the following shell script.
rnaseq_analysis_on_input_file.sh
#!/bin/bash
# Based on
# https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html
#
# This script takes a fastq file of RNA-Seq data, runs FastQC and outputs a counts file for it.
# USAGE: sh rnaseq_analysis_on_input_file.sh <name of fastq file>
set -e
# initialize a variable with an intuitive name to store the name of the input fastq file
fq=$1
# grab base of filename for naming outputs
base=`basename $fq .subset.fq`
echo "Sample name is $base"
# specify the number of cores to use
cores=4
# directory with genome reference FASTA and index files + name of the gene annotation file
genome=rnaseq/reference_data
gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
# make all of the output directories
# The -p option means mkdir will create the whole path if it
# does not exist and refrain from complaining if it does exist
mkdir -p rnaseq/results/fastqc
mkdir -p rnaseq/results/STAR
mkdir -p rnaseq/results/counts
# set up output filenames and locations
fastqc_out=rnaseq/results/fastqc
align_out=rnaseq/results/STAR/${base}_
counts_input_bam=rnaseq/results/STAR/${base}_Aligned.sortedByCoord.out.bam
counts=rnaseq/results/counts/${base}_featurecounts.txt
echo "Processing file $fq"
# Run FastQC and move output to the appropriate folder
fastqc $fq
# Run STAR
STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within
# Create BAM index
samtools index $counts_input_bam
# Count mapped reads
featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
CWL Syntax
CWL documents are written using a format called “YAML”. Here is a crash-course in YAML:
Data fields are written with the name, followed by a colon :
, a space,
and then the value.
fieldName: value
The value is the remaining text to the end of the line.
Special characters in YAML include :
, {
, }
[
, ]
, #
, !
and %
. If your text begins with any of these characters, you must
surround the string in single or double quotes.
fieldName: "#quoted-value"
You can write multi-line text by putting |-
and writing an indented
block. The leading whitespace will be removed from the actual value.
fieldName: |-
This is a multi-
line string.
Horray!
Nested sections are indented:
section1:
field1: value1
field2: value2
Nested sections can also be wrapped in curly brackets. In this case, fields must be comma-separated.
section1: {field1: value1, field2, value2}
When each item is on its own line starting with a dash -
, it is a list.
section2:
- value1
- value2
List can also be wrapped in square brackets. In this case, values must be comma-separated.
section2: [value1, value2]
Comments start with #
.
# This is a comment about field3
field3: stuff
field4: stuff # This is a comment about field4
Finally, YAML is a superset of JSON. Valid JSON is also valid YAML, so you may sometimes see JSON format being used instead of YAML format for CWL documents.
Workflow header
Create a new file “main.cwl”
Let’s start with the header.
cwlVersion: v1.2
class: Workflow
label: RNAseq CWL practice workflow
- cwlVersion - Every file must include this. It declares the version of CWL in use.
- class - This is the type of CWL document. We will see other types in future lessons.
- label - Optional title of your workflow.
Workflow Inputs
The purpose of a workflow is to consume some input parameters, run a series of steps, and produce output values.
For this analysis, the input parameters are the fastq file and the reference data required by STAR.
In the source shell script, the following variables are declared:
# initialize a variable with an intuitive name to store the name of the input fastq file
fq=$1
# directory with genome reference FASTA and index files + name of the gene annotation file
genome=rnaseq/reference_data
gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
In CWL, we will declare these variables in the inputs
section.
The inputs section lists each input parameter and its type. Valid
types include File
, Directory
, string
, boolean
, int
, and
float
.
In this case, the fastq and gene annotation file are individual files. The STAR index is a directory. We can describe these inputs in CWL like this:
inputs:
fq: File
genome: Directory
gtf: File
Workflow Steps
A workflow consists of one or more steps. This is the steps
section.
Now we need to describe the first step of the workflow. In the source
script, the first step is to run fastqc
.
# Run FastQC and move output to the appropriate folder
fastqc $fq
A workflow step consists of the name of the step, the tool to run
,
the input parameters to be passed to the tool in in
, and the output
parameters expected from the tool in out
.
The value of run
references the tool file. The tool file describes
how to run the tool (we will discuss how to write tool files in lesson
4). If we look in bio-cwl-tools
(which you should have imported
when setting up a practice repository in the initial setup
instructions) we find bio-cwl-tools/fastqc/fastqc_2.cwl
.
Next, the in
block is mapping of input parameters to the tool and
the workflow parameters that will be assigned to those inputs. We
need to know what input parameters the tool accepts.
Let’s open up the tool file and take a look:
Find the inputs
section of bio-cwl-tools/fastqc/fastqc_2.cwl
:
inputs:
reads_file:
type:
- File
inputBinding:
position: 50
doc: |
Input bam,sam,bam_mapped,sam_mapped or fastq file
Now we know we need to provide an input parameter called reads_file
.
Next, the out
section is a list of output parameters from the tool
that will be used later in the workflow, or as workflow output. We
need to know what output parameters the tool produces. Find the
outputs
section of bio-cwl-tools/fastqc/fastqc_2.cwl
:
outputs:
zipped_file:
type:
- File
outputBinding:
glob: '*.zip'
html_file:
type:
- File
outputBinding:
glob: '*.html'
summary_file:
type:
- File
outputBinding:
glob: |
${
return "*/summary.txt";
}
Now we know to expect an output parameter called html_file
.
Putting this all together, the fastq
step consists of a run
, in
and out
subsections, and looks like this:
steps:
fastqc:
run: bio-cwl-tools/fastqc/fastqc_2.cwl
in:
reads_file: fq
out: [html_file]
Running alignment with STAR
The next step is to run the STAR aligner.
# Run STAR
STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within
We will go through the same process as the first section. We find
there is bio-cwl-tools/STAR/STAR-Align.cwl
. We will open the file
and look at the inputs
section to determine what input parameters
correspond to the command line parmeters from our source script.
Exercise
Look at
STAR-Align.cwl
and identify the input parameters that correspond to the command line arguments used in the source script:--runThreadN
,--genomeDir
,--outSAMtype
, and--outSAMunmapped
. Also identify the name of the output parameter.Solution
--runThreadN
→RunThreadN
--genomeDir
→GenomeDir
--readFilesIn
→ForwardReads
--outSAMtype
→OutSAMtype
,SortedByCoordinate
--outSAMunmapped
→OutSAMunmapped
output parameter name:
alignment
Command line flags
Command line flags generally appear appear in either the
arguments
field, or theprefix
field of theinputBinding
section of an input parameter declaration. For example, this section ofSTAR-Align.cwl
tells us that theGenomeDir
input parameter corresponds to the--genomeDir
command line parameter.GenomeDir: type: Directory inputBinding: prefix: "--genomeDir"
Default values
Sometimes we want to provide input values to a step without making them as workflow-level inputs. We can do this with
{default: N}
. For example:in: RunThreadN: {default: 4}
Exercise
Using the input and output parameters identified in the previous exercise, write the
run
,in
andout
sections of the STAR step.Solution
STAR: run: bio-cwl-tools/STAR/STAR-Align.cwl in: RunThreadN: {default: 4} GenomeDir: genome ForwardReads: fq OutSAMtype: {default: BAM} SortedByCoordinate: {default: true} OutSAMunmapped: {default: Within} out: [alignment]
Running samtools
The third step is to generate an index for the aligned BAM.
# Create BAM index
samtools index $counts_input_bam
For this step, we need to use the output of a previous step as input
to this step. We refer the output of a step by with name of the step
(STAR), a slash, and the name of the output parameter (alignment), e.g. STAR/alignment
This creates a dependency between steps. This means the samtools
step will not run until the STAR
step has completed successfully.
samtools:
run: bio-cwl-tools/samtools/samtools_index.cwl
in:
bam_sorted: STAR/alignment
out: [bam_sorted_indexed]
featureCounts
# Count mapped reads
featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
As of this writing, the subread
package that provides
featureCounts
is not available in bio-cwl-tools
(and if it has been
added since then, let’s pretend that it isn’t there.) We will go over
how to write a CWL wrapper for a command line tool in lesson 4. For
now, we will leave off the final step.
Workflow Outputs
The last thing to do is declare the workflow outputs in the outputs
section.
For each output, we need to declare the type of output, and what parameter has the output value.
Output types are the same as input types, valid types include File
,
Directory
, string
, boolean
, int
, and float
.
The outputSource
field refers the a step output in the same way that
the in
block does, the name of the step, a slash, and the name of
the output parameter.
For our final outputs, we want the results from fastqc and the aligned, sorted and indexed BAM file.
outputs:
qc_html:
type: File
outputSource: fastqc/html_file
bam_sorted_indexed:
type: File
outputSource: samtools/bam_sorted_indexed
Episode solution
Key Points
CWL documents are written using a syntax called YAML.
The key components of the workflow are: the header, the inputs, the steps, and the outputs.