Use DuckDB in ensembldb to query Ensembl's genome annotations

I have been using ensembldb to query genome annotations locally; it stores the Ensembl annotations in an offline SQLite database. Replacing the database engine with DuckDB makes genome-wide queries faster, with only a small impact on gene-specific queries (depending on the usage). The DuckDB database file is also smaller, and it can be made even smaller by offloading the tables to external Parquet files.

Thesis in LaTeX

A few months ago, I finished my PhD thesis in LaTeX. The WUSTL LaTeX template I could find has a long history, a relatively lengthy implementation, and unnecessary code. I didn’t feel comfortable building on top of it.

I ended up rewriting the template …

Fix Fira Code font ligatures and features

Fira Code has been my programming font of choice for a while. It’s also the default monospace font of my blog. I like its ligatures such as >= and connected lines ====== ------. It even renders the progress bar nicely. It makes my plain text documents look neat.

That said, I …

Change the blog commenting system to utterances

My blog is statically generated, so it needs an external service for commenting. I chose Disqus when I started my blog because it was a popular choice, and it is free and easy to set up. However, there’s been increasing concern about its extensive user tracking, ads, and therefore a …

ID cross reference with exact protein sequence identity using UniParc

TL;DR: Many existing ID mappings between different biological ID systems (RefSeq/Ensembl/UniProt) don’t consider whether the IDs share the exact same protein sequence. When the exact sequence is needed, UniParc can be used to cross-reference the IDs. I will demonstrate how to use UniParc to map RefSeq …

Identify the Ensembl release from versioned IDs

I often received data annotated with an unknown Ensembl release/version.

It could be the Ensembl IDs in a gene expression matrix, a VEP annotated MAF file, or even a customized GTF. The documentation of those files wasn’t always clear about the annotation in use. However, it …

Generate Venn diagrams easily

I find myself generating Venn diagrams quite often. While there are many Venn diagram plotting libraries available, they don’t always fit my needs. My inputs to the diagram are the set sizes rather than lists of observations. And after drawing the Venn diagrams, I often edit them to …

Store GDC genome as a Seqinfo object

Genomic Data Commons (GDC) hosted by NCI is the place to harmonize past and future genomic data, such as TCGA, TARGET, and CPTAC projects. GDC has its own genome reference, GRCh38.d1.vd1, which has 2,779 “chromosomes” including decoys and virus sequences. That said, the canonical chromosomes of GRCh38 …

Build EnsDb from a local Ensembl MySQL database

On some occasions, I need to access an older version of Ensembl human transcripts. For example, the mutation calls generated by the NCI’s Genomic Data Commons pipeline are annotated with Ensembl v84. To programmatically query the Ensembl annotations, I use the EnsDb SQLite database created by ensembldb, which is …

Make Firefox fullscreen borderless on macOS

EDIT 2021-06-01: In Firefox 89+, there’s a default option “Hide Toolbar” in the fullscreen mode that automatically hides the toolbar. So the customization is no longer needed.

Firefox fullscreen on macOS by default contains the address bar and the tab bar. I usually don’t really need the full …