Analyzing Multiple Samples
Overview
Teaching: 30 min
Exercises: 30 min
Questions
How can you run the same workflow over multiple samples?
Objectives
Modify the workflow to process multiple samples, then perform a joint analysis.
This lesson has migrated to https://doc.arvados.org/rnaseq-cwl-training/05-scatter/index.html
In the previous lesson, we finished converting the original shell script into CWL. This lesson shows how to change the workflow so that it can analyze multiple samples in parallel.
Subworkflows
In addition to running command line tools, a workflow step can also execute another workflow.
First, copy main.cwl to alignment.cwl.
Next, open main.cwl for editing. We are going to replace the steps and outputs sections.
Remove all the steps and replace them with a single alignment step, which invokes the alignment.cwl we just copied.
steps:
  alignment:
    run: alignment.cwl
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
In the outputs section, all the output sources are from the alignment step:
outputs:
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
We also need to add “SubworkflowFeatureRequirement” to tell the workflow runner that we are using subworkflows:
requirements:
  SubworkflowFeatureRequirement: {}
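For orientation, the fragments above can be assembled into a single file. The following is a minimal sketch of the outer main.cwl at this point, not the lesson's official solution. The cwlVersion value and the single-file inputs section are assumptions, taken to be unchanged from the main.cwl of the previous lesson.

```yaml
# Sketch of the outer "wrapper" workflow (main.cwl).
cwlVersion: v1.2        # assumption: use the version your existing main.cwl declares
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}   # needed because a step runs another workflow

# Assumption: inputs are unchanged from the previous lesson (one fastq file).
inputs:
  fq: File
  genome: Directory
  gtf: File

steps:
  alignment:
    run: alignment.cwl   # the copy of the original workflow
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]

# Every output is simply passed through from the alignment step.
outputs:
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
```

The order of the top-level sections does not matter to the workflow runner; they are grouped here only for readability.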
Running the workflow
Run this workflow. You should get exactly the same results as before, because all we have done so far is wrap the inner workflow in an outer workflow.
Part 1 solution
Scattering
The “wrapper” step lets us do something useful. We can modify the outer workflow to accept a list of files, and then invoke the inner workflow step for every one of those files. We will need to modify the inputs, steps, outputs, and requirements sections.
First we change the fq parameter to expect a list of files:
inputs:
  fq: File[]
  genome: Directory
  gtf: File
Next, we add scatter to the alignment step. This means we want to run alignment.cwl once for each value in the list in the fq parameter.
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
Because the scatter produces multiple outputs, each output parameter becomes a list as well:
outputs:
  qc_html:
    type: File[]
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File[]
    outputSource: alignment/featurecounts
We also need to add “ScatterFeatureRequirement” to tell the workflow runner that we are using scatter:
requirements:
  SubworkflowFeatureRequirement: {}
  ScatterFeatureRequirement: {}
Part 2 solution
Input parameter lists
The fq parameter needs to be a list. You write a list in YAML by starting each list item with a dash. Example main-input.yaml:
fq:
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_3.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_3.subset.fq
    format: http://edamontology.org/format_1930
genome:
  class: Directory
  location: hg19-chr1-STAR-index
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
Running the workflow
Run this workflow. You will now get results for each one of the input fastq files.
Combining results
Each instance of the alignment workflow produces its own featurecounts.tsv file. However, to compare results easily, we would like a single file with all the results.
We can modify the workflow to run featureCounts once at the end of the workflow, taking all the BAM files listed on its command line.
We will need to change a few things.
First, we need to modify featureCounts.cwl to accept either a single BAM file or a list of BAM files:
inputs:
  gtf: File
  counts_input_bam:
    - File
    - File[]
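The two-item list under counts_input_bam is a union type: the parameter accepts a value matching either alternative. As a sketch, the same declaration should be expressible in CWL's long-form type syntax, which spells out the array that File[] abbreviates:

```yaml
# Long-form equivalent of the shorthand input declaration.
inputs:
  gtf: File
  counts_input_bam:
    type:
      - File              # alternative 1: a single BAM file
      - type: array       # alternative 2: a list of BAM files
        items: File
```

Either spelling should let a workflow step pass a single File or an array of File to counts_input_bam.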
Second, we need to remove the featureCounts step from alignment.cwl, as well as its featurecounts output parameter.
Third, in main.cwl we need to remove featurecounts from the alignment step outputs, and add a new step:
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed]
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed
      gtf: gtf
    out: [featurecounts]
Last, we modify the featurecounts output parameter. Instead of a list of files produced by the alignment step, it is now a single file produced by the new featureCounts step.
outputs:
  ...
  featurecounts:
    type: File
    outputSource: featureCounts/featurecounts
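Putting the pieces from this lesson together, the final main.cwl might look like the sketch below. It is assembled only from the fragments shown in this lesson. The cwlVersion value is an assumption, and the two outputs elided by the ... above are assumed to be unchanged from the scatter section.

```yaml
# Sketch of the final main.cwl: scatter the alignment, then combine the counts.
cwlVersion: v1.2        # assumption: use the version your existing main.cwl declares
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}   # a step runs another workflow
  ScatterFeatureRequirement: {}       # a step uses scatter

inputs:
  fq: File[]            # one fastq file per sample
  genome: Directory
  gtf: File

steps:
  # Runs alignment.cwl once per element of fq.
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed]

  # Runs once, receiving the whole array of BAM files from the scatter.
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed
      gtf: gtf
    out: [featurecounts]

outputs:
  # Assumed unchanged from the scatter section: one entry per sample.
  qc_html:
    type: File[]
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  # A single combined table from the featureCounts step.
  featurecounts:
    type: File
    outputSource: featureCounts/featurecounts
```

The featureCounts step has no scatter of its own, so it waits for every alignment run to finish and consumes the resulting array in one invocation.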
Running the workflow
Run this workflow. You will still have separate results from FastQC and STAR, but now you will only have a single featurecounts.tsv file with a column for each BAM file.
Episode solution
Key Points
Separate the part of the workflow that you want to run multiple times into a subworkflow.
Use a scatter step to run the subworkflow over a list of inputs.
The result of a scatter is an array, which can be used in a combine step to get a single result.