Liang-Bo Wang's Blog

About | Talks | Archives

Store GDC genome as a Seqinfo object

Genomic Data Commons (GDC) hosted by NCI is the place to harmonize past and future genomic data, such as TCGA, TARGET, and CPTAC projects. GDC has its own genome reference, GRCh38.d1.vd1, which has 2,779 “chromosomes” including decoys and virus sequences. That said, the canonical chromosomes of GRCh38.d1.vd1 (e.g., chr1 to chr22, chrM, chrX, and chrY) are identical to that of hg38 and GRCh38. So all these three genome references can be used interchangeably.

Anyway, I was trying to correctly store the full GRCh38.d1.vd1 genome information in the GRanges and GRangesList R objects, which can be done by creating a Seqinfo object representing all its chromosomes. It was also fun to get familiar with the genomic data structures in R.

Build GDC’s Seqinfo

First, we need the length and the name of all chromosomes in GRCh38.d1.vd1. I used samtools to extract the information as a .dict file from the genome reference FASTA file.

export GDC_REF_FA_URL='https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834'
curl -Lo GRCh38.d1.vd1.fa.tar.gz $GDC_REF_FA_URL
samtools dict \
    -a 'GRCh38.d1.vd1' -s 'Homo sapiens' \
    -u $GDC_REF_FA_URL \
    GRCh38.d1.vd1.fa.tar.gz > GRCh38.d1.vd1.dict

head -n 3 GRCh38.d1.vd1.dict
# @HD   VN:1.0  SO:unsorted
# @SQ   SN:chr1 LN:248956422    M5:6aef897c3d6ff0c78aff06ac189178dd UR:https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834 AS:GRCh38.d1.vd1    SP:Homo sapiens
# @SQ   SN:chr2 LN:242193529    M5:f98db672eb0993dcfdabafe2a882905c UR:https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834 AS:GRCh38.d1.vd1    SP:Homo sapiens

Seqinfo also requires the information of whether a chromosome is circular. In GDC’s case, mitochondria chromosome and all viruses sequences are circular. Combining all the information together, we can construct the Seqinfo object describing the genome GRCh38.d1.vd1.

library(tidyverse)
library(GenomeInfoDb)

gdc_simple_tbl <- read_tsv(
    './GRCh38.d1.vd1.dict', 
    skip = 1,  # Skip the first line (@HQ ...)
    col_names = c('SQ', 'chrom', 'length', 'md5sum', 'URI', 'assembly', 'species')
) %>%
    select(chrom, length) %>%
    mutate(chrom = str_sub(chrom, start = 4), 
           length = as.integer(str_sub(length, start = 4)),
           circular = case_when(
               chrom == 'chrM' ~ TRUE,
               chrom == 'chrEBV' ~ TRUE,
               startsWith(chrom, 'HPV') ~ TRUE,
               TRUE ~ FALSE
           ))

gdc_seqinfo <- Seqinfo(
    seqnames = gdc_simple_tbl$chrom,
    seqlengths = gdc_simple_tbl$length,
    isCircular = gdc_simple_tbl$circular,
    genome = 'GRCh38.d1.vd1'
)

Now we can supply it to any GRanges object coming out from any GDC’s sequencing data.

> gdc_seqinfo
Seqinfo object with 2779 sequences (191 circular) from GRCh38.d1.vd1 genome:
  seqnames   seqlengths isCircular        genome
  chr1        248956422      FALSE GRCh38.d1.vd1
  chr2        242193529      FALSE GRCh38.d1.vd1
  chr3        198295559      FALSE GRCh38.d1.vd1
  chr4        190214555      FALSE GRCh38.d1.vd1
  chr5        181538259      FALSE GRCh38.d1.vd1
  ...               ...        ...           ...
  HPV-mKN2         7299       TRUE GRCh38.d1.vd1
  HPV-mKN3         7251       TRUE GRCh38.d1.vd1
  HPV-mL55         7177       TRUE GRCh38.d1.vd1
  HPV-mRTRX7       7731       TRUE GRCh38.d1.vd1
  HPV-mSD2         7300       TRUE GRCh38.d1.vd1

I store the gdc_seqinfo as a RDS file (link here) so I can re-use it easily.

gdc_seqinfo <- readRDS('seqinfo_GRCh38.d1.vd1.rds')