Introduction to Snakemake

By Bobo (Liang²) under CC 4.0 BY license

Esc to overview
← → to navigate

What is Snakemake?

Snakemake is a Python-based workflow management tool.

A workflow contains a set of rules
A rule describes how to create output files/patterns
from input files/pattern
The workflow file runs like a Python script

Can import external Python libraries
Can write one's own functions

How does Snakemake work?

Given the final output files, snakemake will find the rule that generates them
The rule will specify the required input files
If the input files exist, then we are good to run the commands
If some input does not exist but there are rules to generate them, use those rules to find the required input files
Repeat the process until all outputs can be generated from existing files
Snakemake will start running the corresponding commands

Run snakemake in action

Write RNA-seq pipeline in Snakemake to assess transcript abundance:

Build the alignment index
Map sample reads to genome
Assess sample's transcript abundance

Tools to use: HISAT2 + StringTie (new Tuxedo pipeline)

Setup

Setup the environment

We will use Python 3.6 and install all the tools using bioconda.
Before continue, make sure one has a working:

Conda (any miniconda or anaconda 2/3)
Docker
Access to Google Cloud Console and create VM instances.

Working directory and example data

Everything will be under ~/snakemake_example

Example data from Griffith Lab’s RNA-seq tutorial: 2 conditions, 3 repeats.

cd ~/snakemake_example
wget https://storage.googleapis.com/lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr.tar.gz
tar xf griffithlab_brain_vs_uhr.tar.gz

Setup Bioconda

Bioconda is a conda channel providing many bioinfomatics software.

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

Create a conda env

Conda env can isolate all the tools so one can run multiple pipelines without worrying about version conflict of a same tool.
Create an env called new-tuxedo to install all the required tools for this pipeline.

conda create -n new_tuxedo \
    python=3.6 snakemake hisat2 stringtie samtools

Use the conda env

$ source activate new_tuxedo       # Use the conda env
(new_tuxedo) $ hisat2 --version
~/miniconda3/envs/new_tuxedo/bin/hisat2-align-s version 2.1.0
(new_tuxedo) $ deactivate           # Exit the env

# The tool is not available outside
$ hisat2 --version
bash: hisat2: command not found

Pipeline
(Write Snakefile)

Pipeline overview

Build HISAT2 genome reference index for alignment (we are here)
Align sample reads to the genome by HISAT2
Assess transcript by StringTie

The pipeline will be written to a file Snakefile

How to build HISAT2 index?

hisat2_extract_splice_sites.py genes.gtf > genes.ss
hisat2_extract_exons.py genes.gtf > genes.exon
hisat2-build -p 4 \
	--ss genes.ss --exon genes.exon \
	index_prefix

Index files will be: index_prefix.1.ht2, ..., index_prefix.8.ht2
Last comand depends on the output of the first two commands

Snakemake rules for the first two commands

rule extract_genome_splice_sites:
    input: "griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf"
    output: "hisat2_index/chr22_ERCC92.ss"
	shell: "hisat2_extract_splice_sites.py {input} > {output}"

rule extract_genome_exons:
    input: "griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf"
    output: "hisat2_index/chr22_ERCC92.exon"
	shell: "hisat2_extract_exons.py {input} > {output}"

Try run snakemake

snakemake [OPTIONS] [RULE or OUTPUT]
  -n --dryrun   Don't actually run the pipeline
  -p            Print the command of each job

snakemake hisat2_index/chr22_ERCC92.ss
rm hisat2_index/chr22_ERCC92.ss

For the third command

rule build_hisat_index:
    input:
        genome_fa="griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/chr22_ERCC92.fa",
        splice_sites="hisat2_index/chr22_ERCC92.ss",
        exons="hisat2_index/chr22_ERCC92.exon",
    output: expand("hisat2_index/chr22_ERCC92.{ix}.ht2", ix=range(1, 9))
    shell:
        "hisat2-build -p 4 {input.genome_fa} "
        "--ss {input.splice_sites} --exon {input.exons} "
        "hisat2_index/chr22_ERCC92"

expand() is a Snakemake utility function. Run snakemake again by:

snakemake -p build_hisat_index

Move on to alignment

Build HISAT2 genome reference index for alignment
Align sample reads to the genome by HISAT2 (we are here)
Assess transcript by StringTie

How to use HISAT2 for alignment?

For each sample, supply their pair-end FASTQs and index, and pipe the output to samtools. Output will be a BAM file per sample.

hisat2 -p 4 --dta -x index_prefix \
    -1 sample_r1.fastq -2 sample_r2.fastq | \
	samtools sort -@ 4 -o sample.bam

How to apply a rule for different samples?

rule align_hisat:
    input:
        fq1="griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/{sample}.read1.fastq.gz",
        fq2="griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/{sample}.read2.fastq.gz",
        hisat2_index=expand("hisat2_index/chr22_ERCC92.{ix}.ht2", ix=range(1, 9)),
    output: "align_hisat2/{sample}.bam"
    threads: 4
    shell:
        "hisat2 -p {threads} --dta -x hisat2_index/chr22_ERCC92 "
        "-1 {input.fq1} -2 {input.fq2} | "
		"samtools sort -@ {threads} -o {output}"

threads entry specifies how many CPUs the rule will use.
{sample} is a wildcard so similar outputs can share a rule.

More on wildcard

rule simple:
    input:
        fq1=".../{sample}.read1.fastq.gz",
        fq2=".../{sample}.read2.fastq.gz",
    output: "out/{sample}.bam"
    threads: 4
    shell:
		"mycmd process {input.fq1} {input.fq2} > {output}"

While run snakemake out/A.bam, it matches the output pattern and knows sample = A and replace all the wildcard

But how to apply to all samples? or to get all the name of the samples?

To get all the samples' name

SAMPLES, *_ = glob_wildcards(
    'griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/'
	'{sample}.read1.fastq.gz')

rule align_all_samples:
    input: expand("align_hisat2/{sample}.bam", sample=SAMPLES)

One can always hard code SAMPLES = ['A', 'B', ...]

snakemake -j 8 -p align_all_samples

How Snakemake resolves jobs

When Snakemake is figuring out what input are required to generate the desired output, a DAG is created along the way.

Each node in the DAG represent a job. Job that has its dependency ready can be run in parallel, say, no connected edges.

Move on to transcript assessment

Build HISAT2 genome reference index for alignment
Align sample reads to the genome by HISAT2
Assess transcript by StringTie (we are here)

Download the full Snakefile to save some keystrokes.

Run the full pipeline by snakemake -j 8 -p quant_all_samples

More features of Snakemake

Use function as a input
log: to store log files
Config and constants
Use run: instead of shell: for running custom Python snippet
Use external R/Python script instead of shell command

Docker

Run Snakemake in Docker

We need to pack Snakemake and all the tools inside a Docker image
Map the Snakefile and working directory inside Docker
Rest of the setup is the same

Make a Docker image

Use my Docker image or a minimal working Dockerfile:

FROM continuumio/miniconda3
RUN conda config --add channels conda-forge \
    && conda config --add channels bioconda \
    && conda install -y python=3.6 nomkl \
        stringtie samtools hisat2 snakemake \
        google-cloud-storage

cd ~/snakemake_example
docker run -t                       \
    -v $(pwd):/analysis             \
    lbwang/snakemake-conda-rnaseq   \
    snakemake -j 2 --timestamp      \
        -s /analysis/Snakefile --directory /analysis \
        quant_all_samples

-v ... maps the current directory to /analysis inside Docker
-s ... lets Snakemake find the mapped Snakefile
--directory ... and working dir

Google Cloud

Google Cloud Setup

A writable Google Cloud Storage (GS) bucket/path
Say, gs://dinglab/lbwang/snakemake_demo/
A Google Cloud Compute (GCE) VM instance

10GB disk space
Has write access to that GS bucket/path
Docker installed

An empty working directory at ~/snakemake_example

Let Snakemake use remote files

from snakemake.remote.GS import RemoteProvider \
	as GSRemoteProvider
GS = GSRemoteProvider()

rule xxx:
    input: GS.remote("readonlybucket/path/to/file")
	output: GS.remote("writablebucket/path/to/file")

Make all the input files on GS

I've uploaded a copy on Google Cloud Storage already at
gs://lbwang-playground/snakemake_rnaseq.

rule align_hisat:
    input:
        fastq1=GS.remote("lbwang-playground/.../{sample}.read1.fastq.gz"),
        fastq2=GS.remote("lbwang-playground/.../{sample}.read2.fastq.gz")
    output: "align_hisat2/{sample}.bam"
    shell: ...

Only input/output entries are affected. One can update all paths to be on GS.

All input files on my readonly bucket

gsutil ls -r gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/:

gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/:
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/chr22_ERCC92.fa
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf

gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/:
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz

Point all the paths under GS

Snakemake allow user to pass --default-remote-provider and --default-remote-prefix so all the local path will be on GS.

For example, when the default remote prefix being
dinglab/lbwang/snakemake_demo
Then the output path align_hisat2/{sample}.bam becomes
dinglab/lbwang/snakemake_demo/align_hisat2/{sample}.bam

Modify our Snakefile

The modified Snakefile can be found here. Changes worth noting:

Host all the input source files on GS using GS.remote(...)
Modify FULL_HISAT2_INDEX_PREFIX to be {WRITABLE_BUCKET_PATH}/hisat2_index/chr22_ERCC92

A special case of path changing because HISAT2 command doesn't accept the path to the indices but just the prefix.

docker run -t -i                           \
    -v $(pwd):/analysis                         \
    -v ~/.config/gcloud:/root/.config/gcloud    \
    lbwang/snakemake-conda-rnaseq               \
    snakemake -j 4 --timestamp --verbose -p --keep-remote   \
        -s /analysis/Snakefile --directory /analysis        \
		--default-remote-provider GS                        \
		--default-remote-prefix "{WRITABLE_BUCKET_PATH}"    \
        quant_all_samples

~/snakemake_example
├── lbwang-playground/
│   └── snakemake_rnaseq/
│       └── griffithlab_brain_vs_uhr/
│           ├── GRCh38_Ens87_chr22_ERCC/
│           └── HBR_UHR_ERCC_ds_10pc/
├── {WRITABLE_BUCKET_PATH}/
│   ├── align_hisat2/
│   ├── hisat2_index/
│   └── stringtie/
└── Snakefile

About Me

Liang-Bo Wang, aka Bobo or Liang²
Computational and Systems Biology PhD student in Ding Lab at WUSTL
Speak Python and R

Introduction to Snakemake