Liang-Bo Wang (Bobo), 2017-09-05
By Bobo (Liang2) under CC 4.0 BY license
Esc to overview
← → to navigate
Snakemake is a Python-based workflow management tool.
Write RNA-seq pipeline in Snakemake to assess transcript abundance:
Tools to use: HISAT2 + StringTie (new Tuxedo pipeline)
We will use Python 3.6 and install all the tools using bioconda.
Before continue, make sure one has a working:
Everything will be under ~/snakemake_example
Example data from Griffith Lab’s RNA-seq tutorial: 2 conditions, 3 repeats.
cd ~/snakemake_example
wget https://storage.googleapis.com/lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr.tar.gz
tar xf griffithlab_brain_vs_uhr.tar.gz
Bioconda is a conda channel providing many bioinfomatics software.
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
Conda env can isolate all the tools so one can run multiple pipelines without worrying about version conflict of a same tool.
Create an env called new-tuxedo
to install all the required tools for this pipeline.
conda create -n new_tuxedo \
python=3.6 snakemake hisat2 stringtie samtools
$ source activate new_tuxedo # Use the conda env
(new_tuxedo) $ hisat2 --version
~/miniconda3/envs/new_tuxedo/bin/hisat2-align-s version 2.1.0
(new_tuxedo) $ deactivate # Exit the env
# The tool is not available outside
$ hisat2 --version
bash: hisat2: command not found
The pipeline will be written to a file Snakefile
hisat2_extract_splice_sites.py genes.gtf > genes.ss
hisat2_extract_exons.py genes.gtf > genes.exon
hisat2-build -p 4 \
--ss genes.ss --exon genes.exon \
index_prefix
rule extract_genome_splice_sites:
input: "griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf"
output: "hisat2_index/chr22_ERCC92.ss"
shell: "hisat2_extract_splice_sites.py {input} > {output}"
rule extract_genome_exons:
input: "griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf"
output: "hisat2_index/chr22_ERCC92.exon"
shell: "hisat2_extract_exons.py {input} > {output}"
snakemake [OPTIONS] [RULE or OUTPUT]
-n --dryrun Don't actually run the pipeline
-p Print the command of each job
snakemake hisat2_index/chr22_ERCC92.ss
rm hisat2_index/chr22_ERCC92.ss
rule build_hisat_index:
input:
genome_fa="griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/chr22_ERCC92.fa",
splice_sites="hisat2_index/chr22_ERCC92.ss",
exons="hisat2_index/chr22_ERCC92.exon",
output: expand("hisat2_index/chr22_ERCC92.{ix}.ht2", ix=range(1, 9))
shell:
"hisat2-build -p 4 {input.genome_fa} "
"--ss {input.splice_sites} --exon {input.exons} "
"hisat2_index/chr22_ERCC92"
expand()
is a Snakemake utility function. Run snakemake again by:
snakemake -p build_hisat_index
For each sample, supply their pair-end FASTQs and index, and pipe the output to samtools. Output will be a BAM file per sample.
hisat2 -p 4 --dta -x index_prefix \
-1 sample_r1.fastq -2 sample_r2.fastq | \
samtools sort -@ 4 -o sample.bam
How to apply a rule for different samples?
rule align_hisat:
input:
fq1="griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/{sample}.read1.fastq.gz",
fq2="griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/{sample}.read2.fastq.gz",
hisat2_index=expand("hisat2_index/chr22_ERCC92.{ix}.ht2", ix=range(1, 9)),
output: "align_hisat2/{sample}.bam"
threads: 4
shell:
"hisat2 -p {threads} --dta -x hisat2_index/chr22_ERCC92 "
"-1 {input.fq1} -2 {input.fq2} | "
"samtools sort -@ {threads} -o {output}"
threads
entry specifies how many CPUs the rule will use.{sample}
is a wildcard so similar outputs can share a rule.
rule simple:
input:
fq1=".../{sample}.read1.fastq.gz",
fq2=".../{sample}.read2.fastq.gz",
output: "out/{sample}.bam"
threads: 4
shell:
"mycmd process {input.fq1} {input.fq2} > {output}"
While run snakemake out/A.bam
, it matches the output pattern and knows sample = A
and replace all the wildcard
But how to apply to all samples? or to get all the name of the samples?
SAMPLES, *_ = glob_wildcards(
'griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/'
'{sample}.read1.fastq.gz')
rule align_all_samples:
input: expand("align_hisat2/{sample}.bam", sample=SAMPLES)
One can always hard code SAMPLES = ['A', 'B', ...]
snakemake -j 8 -p align_all_samples
When Snakemake is figuring out what input are required to generate the desired output, a DAG is created along the way.
Each node in the DAG represent a job. Job that has its dependency ready can be run in parallel, say, no connected edges.
Download the full Snakefile to save some keystrokes.
Run the full pipeline by snakemake -j 8 -p quant_all_samples
log:
to store log filesrun:
instead of shell:
for running custom Python snippetUse my Docker image or a minimal working Dockerfile
:
FROM continuumio/miniconda3
RUN conda config --add channels conda-forge \
&& conda config --add channels bioconda \
&& conda install -y python=3.6 nomkl \
stringtie samtools hisat2 snakemake \
google-cloud-storage
cd ~/snakemake_example
docker run -t \
-v $(pwd):/analysis \
lbwang/snakemake-conda-rnaseq \
snakemake -j 2 --timestamp \
-s /analysis/Snakefile --directory /analysis \
quant_all_samples
-v ...
maps the current directory to /analysis inside Docker-s ...
lets Snakemake find the mapped Snakefile--directory
... and working dirgs://dinglab/lbwang/snakemake_demo/
~/snakemake_example
from snakemake.remote.GS import RemoteProvider \
as GSRemoteProvider
GS = GSRemoteProvider()
rule xxx:
input: GS.remote("readonlybucket/path/to/file")
output: GS.remote("writablebucket/path/to/file")
I've uploaded a copy on Google Cloud Storage already at
gs://lbwang-playground/snakemake_rnaseq
.
rule align_hisat:
input:
fastq1=GS.remote("lbwang-playground/.../{sample}.read1.fastq.gz"),
fastq2=GS.remote("lbwang-playground/.../{sample}.read2.fastq.gz")
output: "align_hisat2/{sample}.bam"
shell: ...
Only input/output entries are affected. One can update all paths to be on GS.
gsutil ls -r gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/:
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/:
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/chr22_ERCC92.fa
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/:
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
Snakemake allow user to pass --default-remote-provider
and --default-remote-prefix
so all the local path will be on GS.
dinglab/lbwang/snakemake_demo
align_hisat2/{sample}.bam
becomesdinglab/lbwang/snakemake_demo/align_hisat2/{sample}.bam
The modified Snakefile can be found here. Changes worth noting:
GS.remote(...)
FULL_HISAT2_INDEX_PREFIX
to be {WRITABLE_BUCKET_PATH}/hisat2_index/chr22_ERCC92
docker run -t -i \
-v $(pwd):/analysis \
-v ~/.config/gcloud:/root/.config/gcloud \
lbwang/snakemake-conda-rnaseq \
snakemake -j 4 --timestamp --verbose -p --keep-remote \
-s /analysis/Snakefile --directory /analysis \
--default-remote-provider GS \
--default-remote-prefix "{WRITABLE_BUCKET_PATH}" \
quant_all_samples
~/snakemake_example
├── lbwang-playground/
│ └── snakemake_rnaseq/
│ └── griffithlab_brain_vs_uhr/
│ ├── GRCh38_Ens87_chr22_ERCC/
│ └── HBR_UHR_ERCC_ds_10pc/
├── {WRITABLE_BUCKET_PATH}/
│ ├── align_hisat2/
│ ├── hisat2_index/
│ └── stringtie/
└── Snakefile