ID cross reference with exact protein sequence identity using UniParc

TL;DR: Many existing ID mappings between different biological ID systems (RefSeq/Ensembl/UniProt) don’t consider if the IDs have the same exact protein sequence. When the exact sequence is needed, UniParc can be used to cross-reference the IDs. I will demonstrate how to use UniParc to map RefSeq …

Identify the Ensembl release from versioned IDs

I often received data that was annotated by an unknown Ensembl release/version.

It could be the Ensembl IDs in a gene expression matrix, a VEP annotated MAF file, or even a customized GTF. The documentation of those files wasn’t always clear about the annotation in use. However, it …

Generate Venn diagrams easily

I find myself generating Venn diagrams quite often. While there are many available Venn diagram plotting libraries available, they don’t always fit my need. My inputs of the diagram are the set sizes rather than lists of observations. And after drawing the Venn diagram, I often edit them to …

Store GDC genome as a Seqinfo object

Genomic Data Commons (GDC) hosted by NCI is the place to harmonize past and future genomic data, such as TCGA, TARGET, and CPTAC projects. GDC has its own genome reference, GRCh38.d1.vd1, which has 2,779 “chromosomes” including decoys and virus sequences. That said, the canonical chromosomes of GRCh38 …

Build EnsDb from a local Ensembl MySQL database

In some occasions, I need to access the older version of Ensembl human transcripts. For example, the mutation calls generated by the NCI’s Genomic Data Common pipeline are annotated by Ensembl v84. To programmatically query the Ensembl annotations, I use the EnsDb SQLite database created by ensembldb, which is …

Make Firefox fullscreen borderless on macOS

Firefox fullscreen on macOS by default contains the address bar and the tab bar. I usually don’t really need the full vertical space for web page, so those bars aren’t a problem. But when I access a RStudio Server on Firefox, I always want to have more vertical …

GPG Key Transition

I am transiting my GPG key again. However, for this time, I expect to use the new GPG master key longer and will start building this identity unless there is a concern about the key strength or I accidentally lose the key.

Back in my GPG key transition in 2016 …

Access gene annotation using gffutils

Recently, I had to access gene annotations in multiple versions from multiple sources such as Ensembl, GENCODE, and UCSC. I used to rely on the R/Bioconductor ecosystem to query the coordinates of a gene annotation. There are existing Bioconductor packages ready for Ensembl and UCSC annotations (more info in …

Read UniProtKB in XML format

UniProt Knowledge Base (UniProtKB) provides various methods to access their data. I settled on their XML format since no additional parsing code is required and the format is well defined, which comes with a schema. Plus, it turns out that databases such as PDB also provide their data export in …

Ad hoc bioinformatic analysis in database

Recently I’ve found that bioinformatic analysis in a database is not hard at all and the database set up wasn’t as daunting as it sounds, especially when the data are tabular. I used to start my analysis with loading everything into R or Python, and then figuring out …