Access gene annotation using gffutils
Recently, I had to access gene annotations in multiple versions from multiple sources such as Ensembl, GENCODE, and UCSC. I used to rely on the R/Bioconductor ecosystem to query the coordinates of a gene annotation. There are existing Bioconductor packages ready for Ensembl and UCSC annotations (more info in my previous posts: Ensembl and UCSC), and one can create a new customized TxDb given a GTF/GFF file. However, the project I was working on was written in Python, so I went on searching for similar alternatives in Python.
That’s how I found gffutils, a Python package to access gene annotations from GTF/GFF files. gffutils
first imports the annotations from the GTF/GFF file into a SQLite database. The package also provides some abstraction on top of the database schema, so user can retrieve an annotation without talking to the database directly using repetitive SQL commands. Database enables fast random access to any gene annotation.
I will use GENCODE v19, an annotation used by many TCGA GRCh37/hg19 projects, as an example to demo the usage of gffutils. My project requires the coordinates of UTRs and exons of all the transcripts in use.
Usage example
To use gffutils to query GENCODE annotation, we need to create the database first. The comprehensive gene annotation GTF can be downloaded from the GENCODE website (URL to the GTF). The database creation is handled by gffutils’s create_db
function. It will take a few minutes to run and the database will be at gencode_v19.db
.
import gffutils
db = gffutils.create_db(
'./gencode.v19.annotation.gtf.gz',
dbfn='gencode_v19.db',
verbose=True,
merge_strategy='error',
disable_infer_transcripts=True,
disable_infer_genes=True,
)
# INFO - Committing changes: 2619000 features
# INFO - Populating features table and first-order relations: 2619443 features
# INFO - Creating relations(parent) index
# INFO - Creating relations(child) index
# INFO - Creating features(featuretype) index
# INFO - Creating features (seqid, start, end) index
# INFO - Creating features (seqid, start, end, strand) index
# INFO - Running ANALYSE features
Once the database is created, we don’t have to repeat the same process but load the database directly as a FeatureDB object:
db = gffutils.FeatureDB('./gencode_v19.db')
Single feature access
One can then access the annotations of a gene or transcript by its ID. Using a transcript of TP53 as an example,
>>> gene = db['ENSG00000141510.11']; gene
<Feature gene (chr17:7565097-7590856[-]) at 0x7fac828deeb8>
>>> tx = db['ENST00000269305.4']; tx
<Feature transcript (chr17:7571720-7590856[-]) at 0x7fac828f8080>
We can then access the details of the transcript:
>>> tx.featuretype, tx.source
('transcript', 'HAVANA')
>>> tx.chrom, tx.start, tx.end, tx.strand
('chr17', 7571720, 7590856, '-')
>>> tx.attributes.items()
[('gene_id', ['ENSG00000141510.11']),
('transcript_id', ['ENST00000269305.4']),
('gene_type', ['protein_coding']),
('gene_status', ['KNOWN']),
('gene_name', ['TP53']),
('transcript_type', ['protein_coding']),
('transcript_status', ['KNOWN']),
('transcript_name', ['TP53-001']),
('level', ['2']),
('protein_id', ['ENSP00000269305.4']),
('tag', ['basic', 'appris_principal', 'CCDS']),
('ccdsid', ['CCDS11118.1']),
('havana_gene', ['OTTHUMG00000162125.4']),
('havana_transcript', ['OTTHUMT00000367397.1'])]
Gene model coordinates of a transcript
To find the coordinates of its exons and UTRs, we use FeatureDB.children()
which takes an Feature object or its ID and retrieves all the features belong to this feature. TP53 is on the reverse strand of the chromosome, so we can further sort the features by their end position:
>>> list(db.children(tx, order_by='-end'))
[<Feature transcript (chr17:7571720-7590856[-]) at 0x7fac828922e8>,
<Feature UTR (chr17:7590695-7590856[-]) at 0x7fac82892208>,
<Feature exon (chr17:7590695-7590856[-]) at 0x7fac828922b0>,
<Feature UTR (chr17:7579913-7579940[-]) at 0x7fac828925c0>,
<Feature exon (chr17:7579839-7579940[-]) at 0x7fac828928d0>,
<Feature CDS (chr17:7579839-7579912[-]) at 0x7fac82892c18>,
<Feature start_codon (chr17:7579910-7579912[-]) at 0x7fac82892f28>,
...
<Feature CDS (chr17:7572930-7573008[-]) at 0x7fac828277b8>,
<Feature exon (chr17:7571720-7573008[-]) at 0x7fac82827b38>,
<Feature UTR (chr17:7571720-7572929[-]) at 0x7fac82827eb8>,
<Feature stop_codon (chr17:7572927-7572929[-]) at 0x7fac828fca90>]
We have retrieved the UTRs, CDSs and exons of the transcript. Note that UTR is considered a part of an exon in gene annotation terminology. We should use CDSs as the exons that will be translated to amino acids. FeatureDB.children()
provides a way to subset the feature type it returns:
>>> list(db.children(tx, order_by='-end', featuretype=['CDS', 'UTR']))
[<Feature UTR (chr17:7590695-7590856[-]) at 0x7fac8283d7f0>,
<Feature UTR (chr17:7579913-7579940[-]) at 0x7fac8283d710>,
<Feature CDS (chr17:7579839-7579912[-]) at 0x7fac8283d7b8>,
...
<Feature CDS (chr17:7572930-7573008[-]) at 0x7fac82846470>,
<Feature UTR (chr17:7571720-7572929[-]) at 0x7fac828467b8>]
Now the gene model of TP53 becomes clearly visible.
Feature selection
To select all the transcripts in the database, there is a FeatureDB.all_features()
function. Here we want to select only the basic GENOCODE transcripts and count the number of different gene types:
from collections import Counter
# All the transcripts of basic GENCODE v19
all_basic_txs = (
tx for tx in db.all_features(featuretype='transcript')
if 'tag' in tx.attributes and 'basic' in tx.attributes['tag']
)
Counter(tx.attributes['gene_type'][0] for tx in all_basic_txs).most_common(5)
# [('protein_coding', 67186),
# ('antisense', 9160),
# ('lincRNA', 7121),
# ('miRNA', 3055),
# ('misc_RNA', 2034)]
Direct operation on the database
Since gffutils is just a abstraction layer on top of the database, we can always talk to the underlying SQLite database directly by writing SQL commands. The database schema is available on the gffutils’s documentation. Under the hood, FeatureDB object maintains a SQLite connection at FeatureDB.conn
and a helper function to run a single SQL command via FeatureDB.execute()
.
For example, GENCODE stores the full version of a transcript ID but in many occasion, such information is not available. Say if we only know the TP53 transcript ID is ENST00000269305
, then we can write a SQL query to find the matching ID:
>>> db.conn
<sqlite3.Connection at 0x7fac89423490>
>>> cur = db.execute(
... "SELECT id FROM features "
... "WHERE featuretype='transcript' AND id LIKE 'ENST00000269305.%';")
>>> cur.fetchone()[0]
'ENST00000269305.4'
We can even tweak the SQLite behavior by setting the PRAGMA
statements. gffutils has already added default pragma to optimize database query, including less database integrity and large memory size:
>>> db.pragmas
{'synchronous': 'NORMAL',
'journal_mode': 'MEMORY',
'main.page_size': 4096,
'main.cache_size': 10000}
>>> db.execute('PRAGMA temp_store=MEMORY')
>>> db.execute('PRAGMA cache_size=-1000000') # Use 1GB memory
Discussions
gffutils provides a SQLite-based gene annotation storage in Python. Though it may not be as feature complete as what user may get in R, it is highly customizable and can be easily integrated with other Python functions. Like the Bioconductor packages GenomicFeatures and EnsDb, they all use a SQLite database under the hood. As shown in another post, we can actually connect to those databases built by R packages directly, so user can access information from other sources such as UniProt isoforms and gene names.
In my opinion, all the approaches mentioned above are always better than trying to bake one’s own from scratch. Those packages are backed by numerous tests and are built from reliable or the original data sources. Besides multiple existing solutions in R and Python, one can always access the databases built by those packages from different languages, so it is quite unlikely to build something from scratch anyway.