Snakemake—simple data processing for researchers

Liang-Bo Wang, 2022-09-25.

  • Snakemake—simple data processing
    for researchers

    Liang-Bo Wang
    Taipei Bioinformatics Omnibus

    Creative Commons Attribution 4.0 International License

    This talk is for researchers who need to automate things in a handy way

    DALL·E generated art using the prompt 'a hard-working bee working on multiple tasks, digital illustration'
    A bee-sy dry-lab researcher

    How to learn Snakemake from scratch using official materials?

    1. Go through the official tutorial and best practices
    2. You are probably ready to implement your own pipeline
    3. If you need to run Snakemake on HPC or a cloud, go through the executor tutorial and the documentation about the cluster/cloud execution
    4. Check out the rest of the documentation once you get a better understanding of Snakemake
    5. Keep an eye on Snakemake's changelog for new features

    About me

    My portrait

    My top needs for a pipeline/workflow

    To process multiple samples on different computing platforms

    DALL·E generated art using the prompt 'four repeats in a two by two grid of a robot assembling car parts on a conveyor belt, digital art'

    To reproduce my analysis steps and results

    DALL·E generated art using the prompt 'a robot arm handling multiple sheets of papers with scientific charts and illustrations on a wood table, digital art'

    Scenario 1: A pipeline to process multiple samples on different computing platforms

    Illustration of the scenario of a multi-sample data processing pipeline. DALL·E generated art.

    Scenario 2: A workflow to reproduce my analysis steps and results

    Illustration of the scenario of a reproducible analysis workflow. DALL·E generated art.
                           Data processing          Reproducible analysis
    Inputs                 Lots of samples          Mostly the same inputs
    Run type               Once per sample          Multiple times
    Workflow stability     Stable                   Subject to change
    Computing environment  Known and multiple       Known and limited*
                           (local, HPC, and cloud)  (mostly local)

    These two scenarios have quite a few completely opposite properties, as listed in the table.

    For the reproducible analysis scenario, the computing environment is known and limited only when we try to reproduce it on our own machines. In my experience, clear and simple instructions make it easier for other researchers to work with the analysis than, say, a fully automated and dockerized setup.

    Logo of Snakemake

    What is Snakemake?

    Snakemake is a Python-based workflow management tool.
    A Snakemake workflow contains a set of rules:

    How Snakemake works in general

    Example job dependency graph
    An example task dependency graph

    Snakemake traces the workflow backwards, starting with the rule that generates the final outputs:

    1. If all the inputs of this rule exist, it's ready to run its commands (create a new task)
    2. If some inputs do not exist, find other rules to generate the required input files (create a dependent task)
    3. Repeat the process until all outputs can be generated from existing files (complete a task dependency graph)
    4. Snakemake will start running the corresponding commands/tasks

    The task dependency graph is a directed acyclic graph (DAG). Snakemake constructs this graph backwards from the final outputs, an approach known as pull-based.

    Reading the pipeline backwards can be a bit counterintuitive at first. My tip is to always start with the final outputs (usually at the end of the file) and trace the workflow backwards.
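    The backward-tracing steps above can be sketched in a few lines of Python. This is a toy illustration, not Snakemake's actual implementation; the rule table and file names are made up:

```python
# Toy rule table: each output maps to the inputs its rule needs.
# These rule/file names are hypothetical.
rules = {
    'readcount/a.tsv': ['inputs/a.bam'],
    'fpkm/a.tsv': ['readcount/a.tsv'],
}

def resolve(target, existing, plan):
    """Plan tasks by tracing backwards from the final output."""
    if target in existing:
        return  # the file already exists; nothing to do
    for inp in rules[target]:
        resolve(inp, existing, plan)  # create dependent tasks first
    plan.append(target)  # all inputs covered; schedule this task

plan = []
resolve('fpkm/a.tsv', existing={'inputs/a.bam'}, plan=plan)
print(plan)  # tasks in dependency order
```

    Because the recursion visits dependencies before appending the target, the plan comes out in the order the tasks must run.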

    Screenshot of the README of the example pipeline
    Screenshot of the pipeline README

    A simple RNA-seq pipeline in action

    Calculate gene read counts from RNA-seq BAMs

    1. Find the local path of the sample BAMs
    2. Generate read counts (featureCounts)
    3. Clean up and compress the outputs
    4. Calculate FPKM and FPKM-UQ values

    Full pipeline here

    I implemented two RNA-seq pipelines in Snakemake that are loosely based on GDC's mRNA analysis pipeline:

    These pipelines were developed a while ago, when some of my computing environments were quite old and limited, so I took a relatively conservative approach by using minimal features and very few dependencies (just the Python standard library). These pipelines can certainly be improved to better conform to the community's best practices. With ubiquitous conda and Docker support and new Snakemake features, some possible improvements include:

    rule readcount:
        output: count_tsv='readcount/{sample}.tsv'
        input: bam='inputs/{sample}.bam'
        log: 'logs/readcount/{sample}.log'
        params: gtf=GENE_GTF_PTH
        threads: 16
        shell:
            'featureCounts -T {threads} -a {params.gtf} ... '
            '-o {output.count_tsv} {input.bam} 2> {log}'

    {sample} is a wildcard whose value is determined by matching the output patterns

    If you have written a Makefile before, Snakemake rules share the same spirit as Makefile rules, and they run in a similar way.

    Snakemake syntax explained

    {sample}
    Known as a wildcard; its value is determined by matching the output file pattern against the requested files.
    input:, output:
    File patterns for inputs and outputs. Use key-value pairs key='...' when there are multiple patterns. They are just Python strings and thus can be referred to elsewhere by rules.myrule.output.key.
    shell:
    The command that will run in a bash shell to generate the desired outputs.

    Check out the documentation for the rest of a rule's parts.

    How Snakemake creates a task

    Because of the way Snakemake executes, it's easier to resolve the output first by realizing all the wildcard values, and then expand the rest of the rule (inputs, parameters, etc.).

    For example, to generate readcount/my_sample_a.tsv using this rule, {sample} expands to my_sample_a in the rest of the rule. Snakemake then looks for the corresponding inputs and runs the following shell command:

    featureCounts -T 16 -a /path/to/genes.gtf \
    	-o readcount/my_sample_a.tsv \
    	inputs/my_sample_a.bam \
    	2> logs/readcount/my_sample_a.log
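    This wildcard matching can be emulated with a regular expression. Below is a simplified sketch (Snakemake's real matching is more elaborate); the pattern and file names come from the example rule above:

```python
import re

def match_wildcards(pattern, path):
    """Match a '{name}'-style pattern against a requested file path."""
    # Turn each '{name}' placeholder into a named regex group
    regex = re.sub(r'\{(\w+)\}', r'(?P<\1>.+)', pattern)
    m = re.fullmatch(regex, path)
    return m.groupdict() if m else None

wc = match_wildcards('readcount/{sample}.tsv', 'readcount/my_sample_a.tsv')
print(wc)  # {'sample': 'my_sample_a'}

# The realized wildcard values then expand the rest of the rule:
input_bam = 'inputs/{sample}.bam'.format(**wc)
print(input_bam)  # inputs/my_sample_a.bam
```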

    Process multiple samples using snakemake all

    To make the pipeline more extensible, use a run-all rule and a config file to process all samples of a batch. As shown here, we obtain the list of samples to process by reading the file at config['sample_list']. In this case, SAMPLES is the Python list ["my_sample_a", "my_sample_b", ..., "my_sample_d"]. In the all rule, we use the Snakemake function expand() to generate all the output targets by "filling in" the wildcards of the output pattern with every value of SAMPLES.
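    A simplified stand-in shows what expand() does (the sample names follow the example above; real expand() supports more features, such as multiple wildcards):

```python
# Simplified sketch of Snakemake's expand() helper for a single wildcard:
# fill the pattern once per value in the list.
def expand(pattern, **wildcards):
    (name, values), = wildcards.items()  # assume exactly one wildcard
    return [pattern.format(**{name: v}) for v in values]

SAMPLES = ['my_sample_a', 'my_sample_b', 'my_sample_c', 'my_sample_d']
targets = expand('readcount/{sample}.tsv', sample=SAMPLES)
print(targets[0])  # readcount/my_sample_a.tsv
```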

    Other commonly used features

    When simple string pattern matching is not enough, Snakemake also allows many parts of a rule to be determined by Python functions, including input, params, and resources. These functions are known as input functions in Snakemake's documentation, and they make rules more flexible. As shown here, different samples may have their input files at different locations, and the same task will ask for more memory if it has failed before.

    unpack() is a helper that automatically unpacks the return value of an input function, so the function can return multiple key-value pairs.

    file_map_df is a pandas DataFrame object loaded from the tabular file specified by the config (config["file_map"]).

    $ snakemake -j 32 readcount/my_sample_{a,b,c,d}.tsv
    # featureCounts ... inputs/my_sample_a.bam
    # featureCounts ... inputs/my_sample_b.bam
    # ... (two jobs in parallel)

    Other powerful features that extend the wildcard file pattern matching:

    rule readcount:
        output: count_tsv='readcount/{sample}.tsv'
        input: unpack(find_bam_path)  # provides input.bam and input.bai
        resources:  # dynamic resource allocation
            mem_mb=lambda wildcards, attempt: \
                16000 + 16000 * (attempt - 1)  # retry with more memory

    def find_bam_path(wildcards):
        # Retrieve local paths using the file map from config (a DataFrame)
        bam_pth = file_map_df.loc[wildcards.sample, "rna_genomic_bam_pth"]
        return {'bam': bam_pth,
                'bai': bam_pth + '.bai'}
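    The same file-map lookup can be sketched without pandas, using only the csv module from the standard library. The CSV content and paths below are made up for illustration; the column name follows the example rule:

```python
import csv
import io

# A tiny in-memory stand-in for the file map CSV (made-up content)
file_map_csv = io.StringIO(
    "sample,rna_genomic_bam_pth\n"
    "my_sample_a,/data/bams/my_sample_a.bam\n"
)
# Index the rows by sample name, mimicking file_map_df.loc[sample, col]
file_map = {row['sample']: row for row in csv.DictReader(file_map_csv)}

def find_bam_path(sample):
    """Look up the BAM (and its index) for a sample in the file map."""
    bam_pth = file_map[sample]['rna_genomic_bam_pth']
    return {'bam': bam_pth, 'bai': bam_pth + '.bai'}

print(find_bam_path('my_sample_a'))
```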

    Things I like about Snakemake

    I want to mention that the dry-run mode (snakemake -n) is very useful too.

    Things I don't like about Snakemake

    Illustration of the scenario of a reproducible analysis workflow. DALL·E generated art.

    My best practices for data processing

    How to set up the conda environment or the Docker image?

    I actually had a better experience simply putting all the dependencies together in one big fat Docker image or conda environment. This setup lives outside Snakemake, so Snakemake doesn't take care of pulling the Docker image or creating the conda environment. This comes with some advantages:

    However, this approach goes against the best practices because it reduces the reusability of the rules, and sometimes different rules have incompatible dependencies. One should consider switching to per-rule environments/images later on.

    Sample batches, config, and file map

    pipeline_src/
    ├── scripts/...
    └── Snakefile

    batch_yyyy-mm-dd/  # from a template
    ├── config.json
    ├── samples.list
    ├── output/...
    └── output_manifest.csv

    # Under the batch folder
    snakemake --configfile=config.json \
      -s /pipeline_src/Snakefile \
      -j 50 --restart-times 3 \
      --resources io_heavy=4 --

    // config.json
    {
      "sample_list": "samples.list",
      "file_map": "/path/to/file_map.csv",
      "gtf": "/path/to/my_annotation.gtf"
    }
    # file_map.csv
    # output_manifest.csv

    Advantages of using batch folders, config, file map, and output manifest

    Batch folders keep the pipeline source code and the outputs in different locations, which is good for version-tracking and reusing the pipeline. We can also create a template folder for new batches. A config file is a natural way to specify batch-specific parameters.

    Input file map is useful when the upstream files are tracked externally (say, in a database or some centralized system). It also retains the file metadata (e.g. checksums, UUID, experiment details, and other cross-references). For example, our lab maintains a file catalog to track all CPTAC data on GDC. It also tracks the files we have downloaded to the different storage systems in our lab. Thus, our example pipeline looks up this catalog to find the input BAM files.

    Checksum for both inputs and outputs

    Not sure if the file is the same? Verify its checksum!

    Advantages of using checksums everywhere

    There are quite a few possible reasons for an input file to change or become corrupted:

    In general, it's a good idea to keep checksums all the time. Here is a snippet to generate MD5 checksums of everything under a folder in parallel, and to verify the content later using the generated checksums:

    fd --type file --exclude 'README*' --exclude '*CHECKSUMS' \
    	| sort \
    	| parallel --keep-order -j4 --bar 'md5sum {}' \
    	> MD5_CHECKSUMS

    # Verify later
    md5sum -c MD5_CHECKSUMS
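    Where fd and GNU parallel are unavailable, a pure-Python sketch of the same checksum generation (writing an md5sum-compatible file, though serially rather than in parallel) could look like this:

```python
import hashlib
from pathlib import Path

def write_checksums(folder, out_name='MD5_CHECKSUMS'):
    """Write md5sum-compatible checksums for every file under folder."""
    folder = Path(folder)
    lines = []
    for path in sorted(folder.rglob('*')):
        # Skip directories and the checksum file itself
        if path.is_file() and path.name != out_name:
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            # md5sum format: '<digest>  <relative path>' (two spaces)
            lines.append(f'{digest}  {path.relative_to(folder)}')
    (folder / out_name).write_text('\n'.join(lines) + '\n')
```

    The resulting file can still be verified with the standard md5sum -c MD5_CHECKSUMS.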
    Illustration of the scenario of a reproducible analysis workflow. DALL·E generated art.

    My best practices for reproducible analysis

    Snakemake vs CWL/WDL (and Nextflow)

    CWL, WDL, and Nextflow ...

    Did someone mention Airflow?

    Tweet screenshot by @BenSiranosian
    From @BenSiranosian's tweet and his blog post.

    When to use Snakemake or others?


    DALL·E generated art using the prompt 'gray concrete wall and floor, multiple colorful anacondas in pipe-like shape in parallel and interconnected, wide shot, 24mm, 4k, photo realistic, golden hour'
    Snakemake can be our great buddy of data processing pipelines and reproducible analysis workflows

    Thank You