Genomics in R by examples

Genomics in R by Examples

Liang-Bo Wang
Shared under CC 4.0 BY license

Esc to overview
← → to navigate

R has a powerful genomic toolbox

Query annotations like genome sequence, transcript, and array probes
Process genomic ranges (from annotations)
Store experiment results with annotations and metadata

Plus all the R packages built on the strong foundation:

Visualization

Statistical analysis

Dataset import

Get a taste by examples

Conversion between gene/transcript IDs and symbols
Extract and process the genomic location of a gene
Plot whole genome/exome sequencing coverage of a gene
~~Query TCGA omics using TCGAbiolinks~~

Check out additional readings for a thorough introduction

R setup

install.packages(c(
	"tidyverse",   # dplyr, ggplot2, tibble, stringr, ...
	"BiocManager"  # Bioconductor installer
))
BiocManger::install(c(
	"ensembldb", "EnsDb.Hsapiens.v86",
	"gtrellis", "TCGAbiolinks"
))

Example: Find the gene symbol of a transcript

EnsDb stores the Ensembl annotation of a specific release of an organism.

library(ensembldb)
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

ensembldb::select() maps the IDs between columns (keytypes).

ensembldb::select(
    edb,
    keys = c("ENST00000396884"),
    keytype = "TXID",
    columns = c("SYMBOL")
)
#              TXID SYMBOL
# 1 ENST00000396884  SOX10

keytypes(edb)
# ENTREZID
# PROTEINID
# SEQNAME
# SYMBOL
# TXBIOTYPE
# TXID
# ...

Annotation sources in Bioconductor

Object type	Example packages
TxDb	EnsDb.Hsapiens.v86, TxDb.Hsapiens.UCSC.hg38.knownGene
OrgDb	org.Hs.eg.db
BSgenome	BSgenome.Hsapiens.UCSC.hg38
Misc.	Organism.dplyr, AnnotationHub, biomaRt, all packages under AnnotationData

Ref: Bioconductor annotation workflow—Genomic Annotation Resources.

What about RefSeq transcript ID?

Try another annotation source like org.Hs.eg.db.

biomaRt can convert IDs between different annotation sources.

Use GenomicFeatures::makeTxDbFromGFF to build a new TxDB from RefSeq's GFF file. This method applies to all customized GTFs.

Example: Locate my transcript

			txs <- transcripts(edb, columns = c('symbol')); txs
# GRanges object with 216741 ranges and 1 metadata columns:
#                   seqnames            ranges strand |      symbol
#                      <Rle>         <IRanges>  <Rle> | <character>
#   ENST00000456328        1       11869-14409      + |     DDX11L1
#   ENST00000450305        1       12010-13670      + |     DDX11L1
#   ENST00000488147        1       14404-29570      - |      WASH7P
#               ...      ...               ...    ... .         ...
#   ENST00000435945        Y 26594851-26634652      - |     PARP4P1
#   ENST00000435741        Y 26626520-26627159      - |     FAM58CP
#   ENST00000431853        Y 56855244-56855488      + |     CTBP2P1
#   -------
#   seqinfo: 357 sequences from GRCh38 genome

How to read this output? What is a GRanges object?

`GRanges` overview

GRanges and GRangesList store genomic ranges in R
Provided by GenomicRanges package
Coordinate system:
- 1-based: First base of the chromosome starts with coordinate 1
- Closed: Start and end coordinates are included in the range
- Left-most: Start ≤ end regardless of the strand

`GRanges` cheatsheet

Ref: Genomic Ranges For Genome-Scale Data And Annotation, Martin Morgan, 2015.10

Find the location of all SOX10 transcripts:

txs[txs$symbol == 'SOX10', ]
# GRanges object with 6 ranges and 1 metadata columns:
#                   seqnames            ranges strand |      symbol
#                      <Rle>         <IRanges>  <Rle> | <character>
#   ENST00000446929       22 37970686-37983414      - |       SOX10
#   ENST00000396884       22 37972300-37984537      - |       SOX10
# ...

Calculate their promoter region:

promoters(subset(txs, symbol == 'SOX10'), upstream=2000, downstream=200)
# GRanges object with 6 ranges and 1 metadata columns:
#                   seqnames            ranges strand |      symbol
#                      <Rle>         <IRanges>  <Rle> | <character>
#   ENST00000446929       22 37983215-37985414      - |       SOX10
#   ENST00000396884       22 37984338-37986537      - |       SOX10
# ...

GRanges/GRangesList use cases

Gene model (exons of a transcript)
Location of microarray probes and other annotations
Store experiment results:
sequencing depth, copy number change, ChIP-seq peaks

Example: Plot the whole exome sequencing depth around SOX10

library(rtracklayer)
library(GenomicRanges)
library(gtrellis)
library(EnsDb.Hsapiens.v86)
library(tidyverse)

# Read sequencing depth BED file as a GRanges object
seq_depth_gr <- import.bedGraph('wxs_normal.subset.bed.gz')
# GRanges object with 76574 ranges and 1 metadata column:
#           seqnames            ranges strand |     score
#       [1]    chr22 23174700-23174716      * |         1
#       [2]    chr22 23174717-23174725      * |         2
#       [3]    chr22 23174726-23174735      * |         3
# ...

# Find the genomic range of SOX10
edb <- EnsDb.Hsapiens.v86
seqlevelsStyle(edb) <- 'UCSC'
sox10_region <- range(transcripts(edb, filter = ~ symbol == 'SOX10')) + 500
# GRanges object with 1 range and 0 metadata columns:
#       seqnames            ranges strand
#   [1]    chr22 37970186-37987922      -

Finally, use gtrellis to plot. gtrellis visualizes tracks of data using chromosomes as x-axis. Useful for CNV, sequencing depths, DNA methylation, and etc.

gtrellis_layout(
    data = sox10_region,  # specify the genomic region of interest
    track_ylim = c(0, 750)
)

# Add the track for WXS sequencing depth
add_lines_track(
    seq_depth_gr,
    seq_depth_gr$score,
    area = TRUE,
    gp = gpar(fill = "gray", col = NA)
)

Store -omics data with its metadata

Keep measurements of all samples with annotations, sample info, and metadata
Use cases: gene expression, microarray data
SummarizedExperiment provides SummarizedExperiment and RangedSummarizedExperiment

Ref: SummarizedExperiment's vignette

Summary

Query various annotations by R using TxDb and EnsDB
Store genomic ranges by GRanges and GRangesList
Plot data tracks along genome by gtrellis
Store multi-sample experiments by SummarizedExperiment

Extra example: Access TCGA data by TCGAbiolinks

Visualize the relationship between DNA methylation of MGMT promoter and its gene expression.

Query the gene expression of TCGA GBM primary tumors and store it as a SummarizedExperiment object
Use the sample information in colData() to stratify GBM samples by their MGMT methylation status

Thank you

Additional readings

Bioconductor Annotation Workflow covers AnnotationHub, OrgDb, TxDb, BSgenome and biomaRt
My post and my slides about the Bioconductor ecosystem in general
My another post focusing on Ensembl annotations
PH525x series - Biomedical Data Science
Bioconductor for Genomic Data Science

About Me

Liang-Bo Wang
Computational and Systems Biology PhD student at Washington University in St. Louis (WashU; WUSTL)
Advisor: Li Ding
Research interest: cancer genomics, proteomics, single cell imaging