Introduction to Snakemake

Liang-Bo Wang (Bobo), 2017-09-05

Introduction to Snakemake

By Bobo (Liang2) under CC 4.0 BY license

Esc to overview
to navigate

What is Snakemake?

Snakemake is a Python-based workflow management tool.

How does Snakemake work?

  1. Given the final output files, snakemake will find the rule that generates them
  2. The rule will specify the required input files
  3. If the input files exist, then we are good to run the commands
  4. If some input does not exist but there are rules to generate them, use those rules to find the required input files
  5. Repeat the process until all outputs can be generated from existing files
  6. Snakemake will start running the corresponding commands

Run snakemake in action

Write RNA-seq pipeline in Snakemake to assess transcript abundance:

  1. Build the alignment index
  2. Map sample reads to genome
  3. Assess sample's transcript abundance

Tools to use: HISAT2 + StringTie (new Tuxedo pipeline)

Setup

Setup the environment

We will use Python 3.6 and install all the tools using bioconda.
Before continue, make sure one has a working:

Working directory and example data

Everything will be under ~/snakemake_example

Example data from Griffith Lab’s RNA-seq tutorial: 2 conditions, 3 repeats.

cd ~/snakemake_example
wget https://storage.googleapis.com/lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr.tar.gz
tar xf griffithlab_brain_vs_uhr.tar.gz

Setup Bioconda

Bioconda is a conda channel providing many bioinfomatics software.

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
		

Create a conda env

Conda env can isolate all the tools so one can run multiple pipelines without worrying about version conflict of a same tool.
Create an env called new-tuxedo to install all the required tools for this pipeline.

conda create -n new_tuxedo \
    python=3.6 snakemake hisat2 stringtie samtools

Use the conda env

$ source activate new_tuxedo       # Use the conda env
(new_tuxedo) $ hisat2 --version
~/miniconda3/envs/new_tuxedo/bin/hisat2-align-s version 2.1.0
(new_tuxedo) $ deactivate           # Exit the env

# The tool is not available outside
$ hisat2 --version
bash: hisat2: command not found

Pipeline
(Write Snakefile)

Pipeline overview

  1. Build HISAT2 genome reference index for alignment (we are here)
  2. Align sample reads to the genome by HISAT2
  3. Assess transcript by StringTie

The pipeline will be written to a file Snakefile

How to build HISAT2 index?

hisat2_extract_splice_sites.py genes.gtf > genes.ss
hisat2_extract_exons.py genes.gtf > genes.exon
hisat2-build -p 4 \
	--ss genes.ss --exon genes.exon \
	index_prefix

Snakemake rules for the first two commands

rule extract_genome_splice_sites:
    input: "griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf"
    output: "hisat2_index/chr22_ERCC92.ss"
	shell: "hisat2_extract_splice_sites.py {input} > {output}"

rule extract_genome_exons:
    input: "griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf"
    output: "hisat2_index/chr22_ERCC92.exon"
	shell: "hisat2_extract_exons.py {input} > {output}"

Try run snakemake

snakemake [OPTIONS] [RULE or OUTPUT]
  -n --dryrun   Don't actually run the pipeline
  -p            Print the command of each job

snakemake hisat2_index/chr22_ERCC92.ss
rm hisat2_index/chr22_ERCC92.ss

For the third command

rule build_hisat_index:
    input:
        genome_fa="griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/chr22_ERCC92.fa",
        splice_sites="hisat2_index/chr22_ERCC92.ss",
        exons="hisat2_index/chr22_ERCC92.exon",
    output: expand("hisat2_index/chr22_ERCC92.{ix}.ht2", ix=range(1, 9))
    shell:
        "hisat2-build -p 4 {input.genome_fa} "
        "--ss {input.splice_sites} --exon {input.exons} "
        "hisat2_index/chr22_ERCC92"

expand() is a Snakemake utility function. Run snakemake again by:

snakemake -p build_hisat_index

Move on to alignment

  1. Build HISAT2 genome reference index for alignment
  2. Align sample reads to the genome by HISAT2 (we are here)
  3. Assess transcript by StringTie

How to use HISAT2 for alignment?

For each sample, supply their pair-end FASTQs and index, and pipe the output to samtools. Output will be a BAM file per sample.

hisat2 -p 4 --dta -x index_prefix \
    -1 sample_r1.fastq -2 sample_r2.fastq | \
	samtools sort -@ 4 -o sample.bam

How to apply a rule for different samples?

rule align_hisat:
    input:
        fq1="griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/{sample}.read1.fastq.gz",
        fq2="griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/{sample}.read2.fastq.gz",
        hisat2_index=expand("hisat2_index/chr22_ERCC92.{ix}.ht2", ix=range(1, 9)),
    output: "align_hisat2/{sample}.bam"
    threads: 4
    shell:
        "hisat2 -p {threads} --dta -x hisat2_index/chr22_ERCC92 "
        "-1 {input.fq1} -2 {input.fq2} | "
		"samtools sort -@ {threads} -o {output}"
		

More on wildcard

rule simple:
    input:
        fq1=".../{sample}.read1.fastq.gz",
        fq2=".../{sample}.read2.fastq.gz",
    output: "out/{sample}.bam"
    threads: 4
    shell:
		"mycmd process {input.fq1} {input.fq2} > {output}"
		

While run snakemake out/A.bam, it matches the output pattern and knows sample = A and replace all the wildcard

But how to apply to all samples? or to get all the name of the samples?

To get all the samples' name

SAMPLES, *_ = glob_wildcards(
    'griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/'
	'{sample}.read1.fastq.gz')

rule align_all_samples:
    input: expand("align_hisat2/{sample}.bam", sample=SAMPLES)
		

One can always hard code SAMPLES = ['A', 'B', ...]

snakemake -j 8 -p align_all_samples

How Snakemake resolves jobs

When Snakemake is figuring out what input are required to generate the desired output, a DAG is created along the way.

Each node in the DAG represent a job. Job that has its dependency ready can be run in parallel, say, no connected edges.

Move on to transcript assessment

  1. Build HISAT2 genome reference index for alignment
  2. Align sample reads to the genome by HISAT2
  3. Assess transcript by StringTie (we are here)

Download the full Snakefile to save some keystrokes.

Run the full pipeline by snakemake -j 8 -p quant_all_samples

More features of Snakemake

Docker

Run Snakemake in Docker

Make a Docker image

Use my Docker image or a minimal working Dockerfile:

FROM continuumio/miniconda3
RUN conda config --add channels conda-forge \
    && conda config --add channels bioconda \
    && conda install -y python=3.6 nomkl \
        stringtie samtools hisat2 snakemake \
        google-cloud-storage
cd ~/snakemake_example
docker run -t                       \
    -v $(pwd):/analysis             \
    lbwang/snakemake-conda-rnaseq   \
    snakemake -j 2 --timestamp      \
        -s /analysis/Snakefile --directory /analysis \
        quant_all_samples

Google Cloud

Google Cloud Setup

Let Snakemake use remote files

from snakemake.remote.GS import RemoteProvider \
	as GSRemoteProvider
GS = GSRemoteProvider()

rule xxx:
    input: GS.remote("readonlybucket/path/to/file")
	output: GS.remote("writablebucket/path/to/file")
		

Make all the input files on GS

I've uploaded a copy on Google Cloud Storage already at
gs://lbwang-playground/snakemake_rnaseq.

rule align_hisat:
    input:
        fastq1=GS.remote("lbwang-playground/.../{sample}.read1.fastq.gz"),
        fastq2=GS.remote("lbwang-playground/.../{sample}.read2.fastq.gz")
    output: "align_hisat2/{sample}.bam"
    shell: ...
		

Only input/output entries are affected. One can update all paths to be on GS.

All input files on my readonly bucket

gsutil ls -r gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/:

gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/:
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/chr22_ERCC92.fa
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf

gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/:
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
		

Point all the paths under GS

Snakemake allow user to pass --default-remote-provider and --default-remote-prefix so all the local path will be on GS.

Modify our Snakefile

The modified Snakefile can be found here. Changes worth noting:

docker run -t -i                           \
    -v $(pwd):/analysis                         \
    -v ~/.config/gcloud:/root/.config/gcloud    \
    lbwang/snakemake-conda-rnaseq               \
    snakemake -j 4 --timestamp --verbose -p --keep-remote   \
        -s /analysis/Snakefile --directory /analysis        \
		--default-remote-provider GS                        \
		--default-remote-prefix "{WRITABLE_BUCKET_PATH}"    \
        quant_all_samples
~/snakemake_example
├── lbwang-playground/
│   └── snakemake_rnaseq/
│       └── griffithlab_brain_vs_uhr/
│           ├── GRCh38_Ens87_chr22_ERCC/
│           └── HBR_UHR_ERCC_ds_10pc/
├── {WRITABLE_BUCKET_PATH}/
│   ├── align_hisat2/
│   ├── hisat2_index/
│   └── stringtie/
└── Snakefile
		

Further reading

About Me

Q&A