Liang-Bo Wang
Taipei Bioinformatics Omnibus
2022-09-25
To process multiple samples on different computing platforms
To reproduce my analysis steps and results
 | Data processing | Reproducible analysis |
---|---|---|
Inputs | Lots of samples | Mostly the same inputs |
Run type | Once per sample | Multiple times |
Workflow stability | Stable | Subject to change |
Computing environment | Known and multiple (local, HPC, and cloud) | Known and limited* (mostly local) |
The two scenarios have quite a few completely opposite properties, which are listed in the table.
For the reproducible analysis scenario, the computing environment is known and limited only when we try to reproduce it on our own machines. In my experience, clear and simple instructions make it easier for other researchers to work with the analysis than, say, a fully automated and dockerized setup.
Snakemake is a Python-based workflow management tool.
A Snakemake workflow contains a set of rules:
Trace the workflow backwards. Snakemake starts with the rule that generates the final outputs:
The task dependency graph is a directed acyclic graph (DAG). Snakemake constructs this graph backwards from the final outputs, which is also known as the pull-based approach.
Reading the pipeline backwards can be a bit counterintuitive at first. My tip is to always start with the final outputs (usually at the end of the file) and trace the workflow backwards.
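As a minimal sketch of this pull-based resolution (the rules and file names here are hypothetical, not part of the actual pipeline), asking Snakemake for summary/a.txt matches the summarize rule first, which then pulls in the count rule to produce the missing intermediate file:

# Toy rules: the final output depends on an intermediate file
rule summarize:
    output: 'summary/{sample}.txt'
    input: 'counts/{sample}.txt'
    shell: 'wc -l {input} > {output}'

# Produces the intermediate file from a raw input that must already exist
rule count:
    output: 'counts/{sample}.txt'
    input: 'raw/{sample}.txt'
    shell: 'sort {input} | uniq -c > {output}'

Running snakemake -j1 summary/a.txt therefore schedules count before summarize, even though we only asked for the final output.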
Calculate gene read counts from RNA-seq BAMs
Full pipeline here
I implemented two RNA-seq pipelines in Snakemake that are loosely based on GDC's mRNA analysis pipeline:
These pipelines were developed a while ago, when some of my computing environments were quite old and limited, so I took a relatively conservative approach by using minimal features and very few dependencies (just the Python standard library). These pipelines can certainly be improved to better conform to the community's best practices. With the now ubiquitous conda and Docker support and newer Snakemake features, some possible improvements include:
Replacing {WORKFLOW_ROOT} with workflow.source_path()
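A hedged sketch of what such improvements could look like (the rule, environment file, and script names below are made up; per-rule conda environments require running Snakemake with --use-conda):

rule some_step:
    output: 'some_step/{sample}.tsv'
    input: 'inputs/{sample}.bam'
    # Per-rule software environment instead of one big shared environment
    conda: 'envs/subread.yaml'
    params:
        # Locate a script bundled with the workflow relative to the Snakefile,
        # instead of prefixing it with a {WORKFLOW_ROOT} variable
        script=workflow.source_path('scripts/summarize_counts.py')
    shell: 'python {params.script} {input} > {output}'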
rule readcount:
    output: count_tsv='readcount/{sample}.tsv'
    input: bam='inputs/{sample}.bam',
           bai='inputs/{sample}.bam.bai'
    log: 'logs/readcount/{sample}.log'
    params: gtf=GENE_GTF_PTH
    threads: 16
    shell:
        'featureCounts -T {threads} -a {params.gtf} ... '
        '-o {output.count_tsv} {input.bam} 2> {log}'
{sample} is a wildcard whose value is determined by matching the output patterns.
If you have written a Makefile before, Snakemake rules share the same spirit as a Makefile's rules, and they run in a similar way.
Patterns under input: and output: also use the {sample} wildcard, and they can be named with key='...' when there are multiple patterns. They are just Python strings and thus can be referred to elsewhere by rules.myrule.output.key. For shell: and the rest of the rule's parts, check out the documentation.
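For instance, a hypothetical downstream rule (not part of the actual pipeline) could reuse the readcount rule's named output via rules.readcount.output.count_tsv instead of repeating the path pattern:

rule preview_counts:
    output: 'preview/{sample}.head.tsv'
    # Refer to another rule's output by its name, so the path pattern
    # only lives in one place
    input: counts=rules.readcount.output.count_tsv
    shell: 'head -n 20 {input.counts} > {output}'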
Due to the way Snakemake executes, it is easier to resolve the output first by realizing all the wildcard variables, and then expand the rest of the rule (inputs, parameters, etc.).
For example, to generate readcount/my_sample_a.tsv using this task, {sample} expands to my_sample_a in the rest of the rule.
So Snakemake will look for the corresponding inputs and run the following shell commands:
featureCounts -T 16 -a /path/to/genes.gtf \
-o readcount/my_sample_a.tsv \
inputs/my_sample_a.bam \
2> logs/readcount/my_sample_a.log
To make the pipeline more extensible, use a run-all rule (snakemake all) and a config file to process all samples of a batch.
As shown here, we obtain the list of samples to process from the config file by reading the file at config['sample_list']. In this case, SAMPLES is a Python list of ["my_sample_a", "my_sample_b", ..., "my_sample_d"].
In the all rule, we use the Snakemake function expand() to generate all the output targets by "filling in" the wildcards of the output pattern with every value of SAMPLES.
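Concretely, expand() just substitutes every value into the wildcard (using the hypothetical sample names from above):

expand('readcount/{sample}.tsv', sample=['my_sample_a', 'my_sample_b'])
# -> ['readcount/my_sample_a.tsv', 'readcount/my_sample_b.tsv']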
When simple string pattern matching is not enough, Snakemake also allows many parts of the rule to be determined by Python functions, including input, params, and resources. These functions are known as input functions in Snakemake's documentation. They make the rule more flexible.
As shown here, different samples may have their input files at different locations.
The same task will ask for more memory if it has failed before.
unpack() is a helper function that automatically unpacks the return value of the input function, so the function can return multiple key-value pairs. file_map_df is a pandas DataFrame loaded from the tabular file specified in the config (config["file_map"]).
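A minimal sketch of how file_map_df might be loaded near the top of the Snakefile, assuming the file map is a CSV indexed by the sample name (the exact column layout is an assumption):

import pandas as pd

# Load the file map once at parse time and index it by sample, so that
# find_bam_path() can look up paths with .at[sample, column]
file_map_df = pd.read_csv(config['file_map'], index_col='sample')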
$ snakemake -j 32 readcount/my_sample_{a,b,c,d}.tsv
# featureCounts ... inputs/my_sample_a.bam
# featureCounts ... inputs/my_sample_b.bam
# ... (two jobs in parallel)
Alternatively, add a run-all rule and run snakemake all:
with open(config['sample_list']) as f:
    SAMPLES = f.read().splitlines()

rule all:
    input: expand(rules.readcount.output.count_tsv,
                  sample=SAMPLES)
Other powerful features that extend the wildcard file pattern matching:
rule readcount:
    output: count_tsv='readcount/{sample}.tsv'
    input: unpack(find_bam_path)  # provides input.bam and input.bai
    # Dynamic resource allocation: retry with more memory on each attempt
    resources:
        mem_mb=lambda wildcards, attempt: 16000 + 16000 * (attempt - 1)
    ...

def find_bam_path(wildcards):
    # Retrieve local paths using the file map from config (a DataFrame)
    bam_pth = file_map_df.at[wildcards.sample, "rna_genomic_bam_pth"]
    return {'bam': bam_pth,
            'bai': bam_pth + '.bai'}
I want to mention that the dry run mode (snakemake -n) is very useful too.
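For example, to preview all the jobs Snakemake would schedule for the batch without executing anything:

# Dry run: print the planned jobs but do not execute them
snakemake -n -j 1 all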
I actually had a better experience simply putting all the dependencies together in one big Docker image or conda environment. This setup lives outside Snakemake, so Snakemake doesn't take care of pulling the Docker image or creating the conda environment. This comes with some advantages.
However, this approach goes against the best practice because it reduces the reusability of the rules, and sometimes different rules have incompatible dependencies. One should consider switching to rule-based environments/images later on.
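A hedged sketch of this setup with a made-up image name, assuming the pipeline source is baked into the image at /pipeline_src and everything (Snakemake, featureCounts, and friends) is installed inside:

# Run the whole pipeline inside the all-in-one image, mounting the batch folder
docker run --rm \
    -v "$PWD":/work -w /work \
    my-rnaseq-pipeline:latest \
    snakemake -s /pipeline_src/Snakefile -j 16 all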
pipeline_src/
├── scripts/...
└── Snakefile
batch_yyyy-mm-dd/ # from a template
├── config.json
├── samples.list
├── output/...
└── output_manifest.csv
# Under the batch folder
snakemake --configfile=config.json \
    -s /pipeline_src/Snakefile \
    -j 50 --restart-times 3 \
    --resources io_heavy=4 -- \
    all
// config.json
{
"sample_list": "samples.list",
"file_map": "/path/to/file_map.csv",
"gtf": "/path/to/my_annotation.gtf",
...
}
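Inside the Snakefile, these values are available through the config dict; this is presumably where GENE_GTF_PTH in the readcount rule comes from:

# Values from config.json (passed via --configfile) end up in the config dict
GENE_GTF_PTH = config['gtf']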
# file_map.csv
sample,file_type,local_path,uuid,md5_checksum,...
sample_A,rna_bam,/path/to/sample_A.bam,5415b...
sample_B,rna_bam,/path/to/sample_B.bam,289c7...
# output_manifest.csv
sample,file_type,local_path,md5_checksum,...
sample_A,tsv,/path/to/readcount/sample_A.tsv,...
sample_B,tsv,/path/to/readcount/sample_B.tsv,...
Batch folders keep the pipeline source code and the outputs in different locations, which is good for version tracking and for reusing the pipeline. We can also create a template folder for new batches. A config file is a natural way to specify the batch-specific parameters.
An input file map is useful when the upstream files are tracked externally (say, in a database or some centralized system). It also retains the file metadata (e.g., checksums, UUIDs, experiment details, and other cross-references). For example, our lab maintains a file catalog to track all CPTAC data on GDC, which also tracks the files we have downloaded to the different storage systems in our lab. Thus, our example pipeline looks up this catalog to find the input BAM files.
Not sure if the file is the same? Verify its checksum!
There are quite a few possible reasons that can lead to an input change or corruption:
In general, it's a good idea to keep checksums all the time. Here is a snippet to generate MD5 checksums of everything under a folder in parallel and verify the content using the generated checksums:
fd --type file --exclude 'README*' --exclude '*CHECKSUMS' \
| sort \
| parallel --keep-order -j4 --bar 'md5sum {}' \
> MD5_CHECKSUMS
md5sum -c MD5_CHECKSUMS
09_xxx.Rmd
)