Liang-Bo Wang's Blog

Use DuckDB in ensembldb to query Ensembl's genome annotations

2022-10-05T00:00:00-05:00

To query genome annotations locally, ensembldb has been my go-to approach. While I’ve already said many good things about this R package (1, 2, 3), here’s a summary of my favorite features:

I can use the same Ensembl version throughout my project (as a SQLite database)
I can query the genome-wide annotations and their locations easily and offline
Nice integration with R’s ecosystem: I can easily combine the extracted annotations with my data and other annotations using GenomicRanges and SummarizedExperiment
The database itself is language agnostic (say, I can query the same db in Python)
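As a rough sketch of that last point, the snippet below queries a gene table with Python's built-in sqlite3 module. It runs against an in-memory toy table rather than the real EnsDb file; the column names follow the gene table of the ensembldb database, while the rows and the helper name genes_on are made up for illustration.

```python
import sqlite3

# Toy stand-in for the real EnsDb SQLite file: an in-memory database with a
# `gene` table. The rows below are placeholders, not real annotations.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (gene_id TEXT, gene_name TEXT, gene_biotype TEXT, "
    "gene_seq_start INTEGER, gene_seq_end INTEGER, seq_name TEXT, "
    "seq_strand INTEGER)"
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("GENE_A", "geneA", "protein_coding", 100, 500, "1", 1),
        ("GENE_B", "geneB", "lncRNA", 800, 900, "1", -1),
        ("GENE_C", "geneC", "protein_coding", 50, 250, "X", 1),
    ],
)


def genes_on(conn, seq_name):
    """Return (gene_id, gene_name, start, end) for genes on one chromosome."""
    return conn.execute(
        "SELECT gene_id, gene_name, gene_seq_start, gene_seq_end "
        "FROM gene WHERE seq_name = ? ORDER BY gene_seq_start",
        (seq_name,),
    ).fetchall()


chr1_genes = genes_on(conn, "1")
print(chr1_genes)
```

With the real database, replacing ":memory:" with the path to the SQLite file (and dropping the CREATE/INSERT statements) should be all that is needed.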

Since DuckDB is designed for analytical query workloads (aka OLAP), I decided to convert ensembldb’s SQLite database to DuckDB and try it in some of my common analysis scenarios. DuckDB has a similar look-and-feel to SQLite. It also uses columnar storage and supports querying external Apache Parquet and Apache Arrow tables. I tried out some of these user-friendly features in this exercise.

Convert ensembldb’s database to DuckDB through Parquet
- Load Parquet tables to DuckDB
- Database file size comparison
Use DuckDB in ensembldb
Benchmark the databases
Summary

Convert ensembldb’s database to DuckDB through Parquet

The first step is to convert ensembldb’s SQLite database to DuckDB¹. I decided to export the SQLite tables as individual Parquet files and then load them back into DuckDB. This way, we can also test DuckDB’s ability to query external Parquet files directly.

For this exercise, I used the latest Ensembl release (v107). We can download the corresponding SQLite database from AnnotationHub’s web interface (its object ID is AH104864):

curl -Lo EnsDb.Hsapiens.v107.sqlite \
    https://annotationhub.bioconductor.org/fetch/111610

We use SQLAlchemy to fetch the schema of all the tables:

from sqlalchemy import MetaData, create_engine

engine = create_engine('sqlite:///EnsDb.Hsapiens.v107.sqlite')
metadata = MetaData()
metadata.reflect(bind=engine)

We can then list all the tables and their column data types:

db_tables = metadata.sorted_tables
for table in db_tables:
    print(f'{table.name}: ', end='')
    print(', '.join(f'{c.name} ({c.type})' for c in table.columns))
# chromosome: seq_name (TEXT), seq_length (INTEGER), is_circular (INTEGER)
# gene: gene_id (TEXT), gene_name (TEXT), gene_biotype (TEXT),
#   gene_seq_start (INTEGER), gene_seq_end (INTEGER),
#   seq_name (TEXT), seq_strand (INTEGER),...
# ...

With the correct data type mapping, we can export all the tables as Parquet using pandas and PyArrow. Since there are quite a few text columns, I also used zstandard to compress the Parquet files (with a higher compression level):

import pyarrow as pa
import pandas as pd
import pyarrow.parquet as pq

sqlite_to_pyarrow_type_mapping = {
    'TEXT': pa.string(),
    'INTEGER': pa.int64(),
    'REAL': pa.float64(),
}

# Read each SQLite table as an Arrow table
arrow_tables = dict()
with engine.connect() as conn:
    for table in metadata.sorted_tables:
        # Construct the corresponding pyarrow schema
        schema = pa.schema([
            (c.name, sqlite_to_pyarrow_type_mapping[str(c.type)])
            for c in table.columns
        ])
        arrow_tables[table.name] = pa.Table.from_pandas(
            pd.read_sql_table(table.name, conn, coerce_float=False),
            schema=schema,
            preserve_index=False,
        )

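# Write each Arrow table to a zstd compressed Parquet
# (create the output folder first, in case it doesn't exist yet)
import os

os.makedirs('ensdb_v107', exist_ok=True)
for table_name, table in arrow_tables.items():
    pq.write_table(
        table,
        f'ensdb_v107/{table_name}.parquet',
        compression='zstd',
        compression_level=9,
    )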

Load Parquet tables to DuckDB

Finally, we can load the exported Parquet tables to DuckDB. Here I tested a few approaches:

Create views to the external Parquet files (no content loaded to the db)
Load the full content
Load the full content and index the tables (same as the original SQLite db)

Since DuckDB has native support for Parquet files, the syntax is straightforward:

-- Install and activate the extension
INSTALL parquet; LOAD parquet;

-- To create views to external Parquet
CREATE VIEW <table> AS SELECT * FROM './ensdb_v107/<table>.parquet';
...

-- To load the full content from external Parquet
CREATE TABLE <table> AS SELECT * FROM './ensdb_v107/<table>.parquet';
...

-- To index the table (use .schema to get the original index definitions)
CREATE UNIQUE INDEX gene_gene_id_idx on gene (gene_id);
CREATE INDEX gene_gene_name_idx on gene (gene_name);
CREATE INDEX gene_seq_name_idx on gene (seq_name);
...

Note that I didn’t try to “optimize” the table indices for my queries. I simply mirrored the same index definition from the original SQLite database.
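To avoid typing these statements by hand for every table, the script could be generated from the SQLite database itself. Below is a minimal sketch, assuming the index definitions stored in sqlite_master can be replayed as-is in DuckDB (as the statements above suggest). The helper name build_duckdb_script and the toy database are made up for illustration; this is not necessarily how create_duckdb.sql was written.

```python
import sqlite3


def build_duckdb_script(sqlite_conn, parquet_dir, with_indices=True):
    """Return SQL that loads each table from Parquet and mirrors the indices."""
    tables = [
        name
        for (name,) in sqlite_conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' "
            "AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    statements = [
        f"CREATE TABLE {table} AS SELECT * FROM '{parquet_dir}/{table}.parquet';"
        for table in tables
    ]
    if with_indices:
        # `sql` is NULL for indices that SQLite creates implicitly; skip those.
        index_rows = sqlite_conn.execute(
            "SELECT sql FROM sqlite_master WHERE type = 'index' "
            "AND sql IS NOT NULL ORDER BY name"
        )
        statements.extend(f"{sql};" for (sql,) in index_rows)
    return "\n".join(statements)


# Toy database standing in for the real EnsDb SQLite file
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (gene_id TEXT, gene_name TEXT, seq_name TEXT)")
conn.execute("CREATE UNIQUE INDEX gene_gene_id_idx on gene (gene_id)")
conn.execute("CREATE INDEX gene_gene_name_idx on gene (gene_name)")

script = build_duckdb_script(conn, "./ensdb_v107")
print(script)
```

Passing with_indices=False gives the script for the "load the full content" variant without indices.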

DuckDB’s command-line interface works like SQLite’s, and it keeps the database in a single file too. The full conversion, including the Parquet step, took about 10 seconds to complete.

duckdb -echo ensdb_v107.duckdb < create_duckdb.sql
duckdb -readonly ensdb_v107.duckdb

Database file size comparison

Here are the file sizes of the databases created with different settings:

Database	File size	Relative size (%)
SQLite no indexed	243MB	57.9
SQLite (original)	420MB	100.0
DuckDB with external Parquets	37.6MB	9.0
DuckDB	169MB	40.2
DuckDB indexed	528MB	125.7

DuckDB with external Parquets yields the smallest file (~9% of the original size). It’s probably due to a lot of text columns in the database, and zstd compression works really well for the plain text. This approach could make the ensembldb database more portable. Say, it’s possible to commit it directly into the analysis project’s GitHub repo.

By loading the actual data into DuckDB (without indices), the file grows considerably since the data is no longer zstd-compressed, though it is still smaller than its unindexed SQLite counterpart (169MB vs. 243MB). I wonder if this is due to the columnar storage being more space efficient than row storage. After indexing the DuckDB database, it surprisingly grows to be much larger than SQLite. I don’t know DuckDB’s indexing methods well enough to understand what happened here. Since DuckDB is still actively developing its indexing algorithm, I suppose this could be optimized in the future.
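For reference, the percentage column in the table is just each file size divided by the size of the original (indexed) SQLite database. A small sketch that recomputes it from the sizes listed above:

```python
# File sizes in MB, copied from the table above
sizes_mb = {
    "SQLite no indexed": 243,
    "SQLite (original)": 420,
    "DuckDB with external Parquets": 37.6,
    "DuckDB": 169,
    "DuckDB indexed": 528,
}

# Express every size relative to the original SQLite database
baseline = sizes_mb["SQLite (original)"]
relative = {
    name: round(100 * size / baseline, 1) for name, size in sizes_mb.items()
}

for name, pct in relative.items():
    print(f"{name}: {pct}%")
```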

Now we have the databases ready. Let’s see how they perform.

Use DuckDB in ensembldb

It’s painless to tell ensembldb to use DuckDB instead. DuckDB’s R client already implements R’s DBI interface, and ensembldb accepts a DBI connection to create an EnsDb object. So we already have everything we need:

library(duckdb)
library(ensembldb)

edb_sqlite = EnsDb('EnsDb.Hsapiens.v107.sqlite')

conn = dbConnect(duckdb(), dbdir = "ensdb_v107.duckdb", read_only = TRUE)
edb_duckdb = EnsDb(conn)
dbDisconnect(conn, shutdown = TRUE)  # disconnect after usage

All the downstream usage of ensembldb is the same from here.

Benchmark the databases

Now we have the original SQLite database and three DuckDB databases constructed with various settings ready to use in ensembldb. Here I tested two scenarios: a genome-wide annotation query and a gene-specific lookup.

To make the query more realistic and complicated, I also applied a filter to all queries to select annotations only from the canonical chromosomes and remove all LRG genes:

standard_filter = AnnotationFilter(
    ~ seq_name %in% c(1:22, 'X', 'Y', 'MT') &
        gene_biotype != 'LRG_gene'
)
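For readers more used to SQL, the filter roughly corresponds to a WHERE clause on the gene table. The sketch below expresses it with Python's sqlite3 module on a toy table. It is only an approximation of the idea, not the SQL that ensembldb actually generates, and the rows (including the scaffold name) are placeholders.

```python
import sqlite3

# Canonical chromosomes, matching c(1:22, 'X', 'Y', 'MT') in the R filter
CANONICAL = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]


def canonical_genes(conn):
    """Select genes on canonical chromosomes, excluding LRG genes."""
    placeholders = ", ".join("?" for _ in CANONICAL)
    query = (
        "SELECT gene_id, seq_name, gene_biotype FROM gene "
        f"WHERE seq_name IN ({placeholders}) "
        "AND gene_biotype != 'LRG_gene' ORDER BY gene_id"
    )
    return conn.execute(query, CANONICAL).fetchall()


# Toy table with placeholder rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (gene_id TEXT, seq_name TEXT, gene_biotype TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [
        ("g1", "7", "protein_coding"),
        ("g2", "7", "LRG_gene"),  # removed by the biotype condition
        ("g3", "SOME_SCAFFOLD", "protein_coding"),  # not a canonical chromosome
        ("g4", "MT", "Mt_tRNA"),
    ],
)

kept = canonical_genes(conn)
print(kept)
```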

I use microbenchmark to benchmark the same query from different databases. It works like this:

mbm = microbenchmark(
    "sqlite_noidx" = { <some query> },
    "sqlite" = { ... },
    "duckdb_parquet" = { ... },
    "duckdb" = { ... },
    "duckdb_idx" = { ... },
    times = 20  # 50 times for faster queries
)
summary(mbm)

Genome-wide annotation query

The first genome-wide query finds the 5’UTR genomic ranges of all the transcripts. This is one of the most computationally intensive built-in queries I know, involving some genomic range arithmetic and queries over multiple tables.

five_utr_per_tx = fiveUTRsByTranscript(edb, filter = standard_filter)
five_utr_per_tx |> head()
## GRangesList object of length 6:
## $ENST00000000442
## GRanges object with 2 ranges and 4 metadata columns:
##       seqnames            ranges strand |   gene_biotype    seq_name
##          <Rle>         <IRanges>  <Rle> |    <character> <character>
##   [1]       11 64305524-64305736      + | protein_coding          11
##   [2]       11 64307168-64307179      + | protein_coding          11
##               exon_id exon_rank
##           <character> <integer>
##   [1] ENSE00001884684         1
##   [2] ENSE00001195360         2
##   -------
##   seqinfo: 25 sequences (1 circular) from GRCh38 genome
##
## ...

Here are the microbenchmark results from running the same query in all databases:

Benchmark results of extracting genome-wide 5'UTR locations per transcript.

There is a huge performance increase for all DuckDB databases, since this query pretty much scans over the full table. Overall, DuckDB runs 3+ times faster than SQLite.

In each database, there are always a few runs that take significantly more time. This trend is quite consistent when I re-run the benchmarks multiple times. While I haven’t investigated these outliers, I think this is due to the first run(s) being un-cached. Surprisingly, DuckDB with indices runs much slower than DuckDB without indices (especially the first run). Though the index might be useless in sequential scans, I guess the slowdown could be due to the bigger file (which takes longer to cache) or the query planner accidentally traversing the indices.

Another genome-wide annotation query

The other genome-wide query finds the transcripts of all the genes.

tx_per_gene = transcriptsBy(edb, by = "gene", filter = standard_filter)
tx_per_gene |> head()
## GRangesList object of length 6:
## $ENSG00000000003
## GRanges object with 5 ranges and 12 metadata columns:
##       seqnames              ranges strand |           tx_id
##          <Rle>           <IRanges>  <Rle> |     <character>
##   [1]        X 100633442-100639991      - | ENST00000494424
##   [2]        X 100627109-100637104      - | ENST00000612152
##   [3]        X 100632063-100637104      - | ENST00000614008
##   [4]        X 100627108-100636806      - | ENST00000373020
##   [5]        X 100632541-100636689      - | ENST00000496771
##                 tx_biotype tx_cds_seq_start tx_cds_seq_end         gene_id
##                <character>        <integer>      <integer>     <character>
##   [1] processed_transcript             <NA>           <NA> ENSG00000000003
##   [2]       protein_coding        100630798      100635569 ENSG00000000003
##   [3]       protein_coding        100632063      100635569 ENSG00000000003
##   [4]       protein_coding        100630798      100636694 ENSG00000000003
##   [5] processed_transcript             <NA>           <NA> ENSG00000000003
## ...

Similarly, here are the benchmark results:

Benchmark of extracting genome-wide gene isoforms.

This query tells more or less the same story with only one notable difference. In this case, fully loaded DuckDB with and without indices share the same performance. Interestingly, all the DuckDB runtimes are in a bimodal distribution. I don’t know why.

Gene-specific lookup

My other main scenario is to look up the annotations of a specific gene. Let’s simulate this kind of query by retrieving all the transcripts of the gene EGFR:

egfr_tx = transcripts(edb, filter = AnnotationFilter(~ gene_name == 'EGFR'))
egfr_tx
## GRanges object with 14 ranges and 12 metadata columns:
##                   seqnames            ranges strand |           tx_id
##                      <Rle>         <IRanges>  <Rle> |     <character>
##   ENST00000344576        7 55019017-55171037      + | ENST00000344576
##   ENST00000275493        7 55019017-55211628      + | ENST00000275493
##   ENST00000455089        7 55019021-55203076      + | ENST00000455089
##   ENST00000342916        7 55019032-55168635      + | ENST00000342916
##         LRG_304t1        7 55019032-55207338      + |       LRG_304t1
##               ...      ...               ...    ... .             ...
##   ENST00000450046        7 55109723-55211536      + | ENST00000450046
##   ENST00000700145        7 55163753-55205865      + | ENST00000700145
##   ENST00000485503        7 55192811-55200802      + | ENST00000485503
##   ENST00000700146        7 55198272-55208067      + | ENST00000700146
##   ENST00000700147        7 55200573-55206016      + | ENST00000700147
## ...

Benchmark of extracting the annotations of a specific gene.

SQLite with indices undoubtedly has the best performance. Understandably, it’s been fine tuned for this very use case (extracting a few rows using indices). And SQLite without an index takes the most time to complete, so it’s necessary to always index the tables.
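One way to see why the index matters so much for this lookup in SQLite is to compare the query plans before and after creating it. The sketch below does so on a toy table with Python's sqlite3 module. The index name mirrors the one in the original database, but the table content is made up, and the exact plan wording varies between SQLite versions.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute(f"EXPLAIN QUERY PLAN {sql}", params).fetchall()
    return [row[-1] for row in rows]


# Toy table with placeholder gene names
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (gene_id TEXT, gene_name TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [(f"id{i}", f"name{i}") for i in range(1000)],
)

lookup = "SELECT gene_id FROM gene WHERE gene_name = ?"

# Without an index, SQLite has to scan the whole table
plan_without_index = query_plan(conn, lookup, ("name42",))

# With the index, SQLite can search it for the matching rows directly
conn.execute("CREATE INDEX gene_gene_name_idx on gene (gene_name)")
plan_with_index = query_plan(conn, lookup, ("name42",))

print(plan_without_index)
print(plan_with_index)
```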

The performance of all three DuckDB databases falls in between the two extremes of the SQLite dbs. Unlike SQLite, indexed DuckDB only speeds up the query a little bit (21.0ms vs 22.4ms on average). Given the worse performance of one of the genome-wide queries above using the indexed DuckDB db, I think it’s optional to create indices for ensembldb’s DuckDB dbs.

Summary

Here is an overview of the benchmarking results together with each db’s file size. The table below displays the performance as the average speed-up ratio (and the worst-case ratio) over the original SQLite db (the higher the ratio, the better):

Database	Size (%)	Genome I	Genome II	Gene-specific lookup
SQLite no indexed	57.9	0.61 (0.78)	0.88 (1.02)	0.044 (0.12)
SQLite (original)	100.0	1.00 (1.00)	1.00 (1.00)	1.00 (1.00)
DuckDB w. ext. Parquets	9.0	3.36 (3.69)	4.14 (4.63)	0.15 (0.35)
DuckDB	40.2	4.70 (4.76)	6.30 (6.73)	0.61 (0.81)
DuckDB indexed	125.7	3.66 (1.87)	6.29 (6.33)	0.65 (1.80)
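As a sketch of how such ratios can be derived from raw timings: the average speed-up divides the mean runtime of the original SQLite db by the mean runtime of the other db. For the worst case, the code below assumes the ratio of the slowest runs, which is my reading of "worst case" and may differ from the exact calculation behind the table. The timings are hypothetical, not the measurements from this post.

```python
from statistics import mean


def speedup(baseline_ms, candidate_ms):
    """Return (average, worst-case) speed-up of candidate over baseline."""
    average = mean(baseline_ms) / mean(candidate_ms)
    # Assumption: "worst case" compares the slowest run of each database
    worst = max(baseline_ms) / max(candidate_ms)
    return round(average, 2), round(worst, 2)


# Hypothetical runtimes in milliseconds
sqlite_ms = [100.0, 110.0, 120.0]
duckdb_ms = [20.0, 25.0, 60.0]

ratios = speedup(sqlite_ms, duckdb_ms)
print(ratios)
```

A ratio above 1 means the candidate is faster than the original SQLite db, and a ratio below 1 means it is slower.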

Overall, DuckDB shows an impressive performance increase for genome-wide queries. It uses less storage too (as long as the tables are not indexed). While DuckDB is slower than SQLite when it comes to gene-specific lookups, we are talking about tens of milliseconds per query, so unless we are running thousands of these queries, the performance impact is minimal. On the other hand, genome-wide queries save seconds per query.

As the benchmark results show, we could replace the original ensembldb database with a DuckDB database by loading the tables and leaving out the indices. If the user is willing to sacrifice some performance in gene-specific lookups, DuckDB with external Parquet files uses < 10% of the original disk space while still running faster for genome-wide queries.

While the default indices copied from SQLite are not very helpful, I didn’t tune the indices to maximally speed up the gene-specific lookups. We can probably also tune the Parquet compression ratio to find a better balance between the decompression speed and file size. Note that DuckDB’s file format is not stabilized yet, so the database needs to be re-created in newer DuckDB versions.

All in all, I think DuckDB advertises itself accurately when it comes to analytical query workloads. It shows good performance when it queries a large portion of its content. With a similar interface to SQLite and clients in popular languages (R, Python, etc.), it’s easy to change an existing SQLite use case to use DuckDB. My small exercise with ensembldb has convinced me to try out DuckDB in more scenarios too.

There is an official extension sqlite_scanner currently under development that lets a DuckDB attach directly to a SQLite database. So in the future, it could be much easier to convert SQLite to DuckDB. ↩

Thesis in LaTeX

2022-03-12T00:00:00-06:00

A few months ago, I finished my PhD thesis in LaTeX. The WUSTL LaTeX template I could find has a long history¹, a relatively lengthy implementation, and some unnecessary code. I didn’t feel comfortable building on top of it.

I ended up rewriting the template based on the memoir package, which is designed for long documents like books and theses. Overall I enjoyed my experience writing in LaTeX, so I wanted to share my LaTeX setup and my experience of editing a long document.

Writing thesis in LaTeX is easy and rewarding (once it’s set up)
My LaTeX setup
Package recommendation to start a new LaTeX template
Why I didn’t use LaTeX a lot during my PhD
Types of documents I will write in LaTeX in future

Writing thesis in LaTeX is easy and rewarding (once it’s set up)

The advantages of LaTeX and its ecosystem² over WYSIWYG editors (e.g., Word and Google Docs) are more obvious when the document gets longer, as WYSIWYG editors become slow to work with, say, when changing the figure style across the whole document or swapping some figures and sections. On the other hand, the LaTeX workflow remains the same. While the compilation takes longer, it usually runs in the background and I have adapted to only checking the output periodically. I appreciate its reliability and modularity when my document gets really long. I can just focus on the writing.

The hardest part is the setup of a LaTeX document.

What are the “recommended” packages? What are all these parameters of the packages and commands? How to set up the folder structure? How do I make a figure block whose caption overflows to the next page (and any other visual components)? Not to mention the time I spent fixing errors and debugging when the code doesn’t work as expected.

Now, after some time going over the documentation, I wanted to write down my setup for future use.

My LaTeX setup

I prefer the combination of LuaLaTeX (XeLaTeX) and BibLaTeX/Biber. Both LaTeX engines can utilize system fonts and support Unicode. As for the bibliography management, BibLaTeX is more customizable.

Bibliography management with Zotero and Better BibTeX

I manage all my references in Zotero. My Zotero setup (Zotero + Zotfile) has been the same over the past 6 years, which is surprisingly stable in the software world. Zotero has built-in support to export the references in BibTeX and BibLaTeX formats.

To further integrate Zotero into the LaTeX workflow, the Better BibTeX for Zotero plugin comes in handy. The plugin can customize the citation key generation, and it can automatically export the .bib file when the references change.

I currently have the following rule for the citation key generation:

[auth+initials:lower]_[authorLast+initials:lower][>1]:[shorttitle2_2][year]
|[Auth+initials:lower:(unknown)]:[Title:capitalize:substring=1,64][year]

The key will try to extract the first and last author of the paper, the first two words of the title, and the publication year. If the reference is not an article (mostly websites), it will try to use the author, title, and year when available.

Under this rule, for example, the citation key of the famous CRISPR paper (doi:10.1126/science.1225829):

Jinek, M., Chylinski, K., Fonfara, I., Hauer, M., Doudna, J.A., and Charpentier, E. (2012). A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 337, 816–821.

becomes jinekm_charpentiere:ProgrammableDualRNAguided2012.
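To make the rule more concrete, here is a simplified Python sketch that reproduces the key above. It only imitates the behavior described in this section (first and last author, two title words, year). It is not Better BibTeX's actual implementation, it ignores the fallback pattern after the |, and the skip-word list and function names are assumptions for illustration.

```python
import re

# Assumed list of words to skip at the start of a title (illustrative only)
SKIP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "and"}


def author_part(last, initials):
    """Last name followed by initials, lowercased, letters only."""
    return re.sub(r"[^a-z]", "", (last + initials).lower())


def short_title(title, n_words=2):
    """First n_words meaningful title words, punctuation removed, capitalized."""
    words = [w for w in title.split() if w.lower() not in SKIP_WORDS]
    cleaned = [re.sub(r"[^A-Za-z0-9]", "", w) for w in words[:n_words]]
    return "".join(w[:1].upper() + w[1:] for w in cleaned)


def citation_key(authors, title, year):
    """Build a key like firstauthor_lastauthor:ShortTitleYear."""
    key = author_part(*authors[0])
    if len(authors) > 1:  # only add the last author when there is more than one
        key += "_" + author_part(*authors[-1])
    return f"{key}:{short_title(title)}{year}"


authors = [
    ("Jinek", "M."), ("Chylinski", "K."), ("Fonfara", "I."),
    ("Hauer", "M."), ("Doudna", "J.A."), ("Charpentier", "E."),
]
title = (
    "A programmable dual-RNA-guided DNA endonuclease "
    "in adaptive bacterial immunity"
)
key = citation_key(authors, title, 2012)
print(key)
```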

Better BibTeX also allows fixing the citation key, which is useful to keep a nickname of some commonly referred publications, or to maintain backward compatibility.

Overflowing legend of large figures

A feature I found very useful is to allow the figure caption/legend to overflow to the next page, which is common in many journals. Usually the figure is large, takes up the full page, and has multiple panels combined into one file, so the remaining figure legend needs to be put on the next page. Here’s an example of the desired behavior:

Example of the figure legend overflow.

To create “fake” panels for subcaption to keep track of the reference, I followed the suggestion on LaTeX Stack Exchange to create a hidden anchor for the label using \phantomsubcaption.

To allow the figure legend/caption to overflow, two figure environments are constructed side by side. Then I used a quotation helper function \sourceatright by memoir to place the “(legend continued on next page)” right aligned to the end of legend.

Here are the final helper functions in the preamble:

% Allowing subcaptions when all figure panels are combined
% into one source image. Require subcaption package.
% Based on https://tex.stackexchange.com/a/255790
\newcommand{\phantomlabel}[1]{%
    \parbox{0pt}{\phantomsubcaption\label{#1}}%
}

% Note for figure caption spanning multiple pages
\newcommand{\legendcontdnote}{\sourceatright[2em]{%
        \footnotesize\itshape(legend continued on next page)%
}}
\newcommand{\legendcontdref}[1]{\emph{(\fref{#1} continued)}}

And the following shows an example usage:

\begin{figure}[p]  % usually large and will need a full page
    \centering
    \phantomlabel{fig:panel-a}  % hidden label of each figure panel
    \phantomlabel{fig:panel-b}
    \phantomlabel{fig:panel-c}
    ...
    \includegraphics{figures/myfigure.pdf}
    \caption{%
        Overview of the whole figure.
        \subref{fig:panel-a}
        Some description about panel A.
        \legendcontdnote
    }
    \label{fig:myfigure}
\end{figure}
\begin{figure}[t]  % place at the top of the very next page
    \centering
    \legend{%
        \legendcontdref{fig:myfigure}
        \subref{fig:panel-b}
        Some description about panel B.
        \subref{fig:panel-c}
        Some description about panel C.
        ...
    }
\end{figure}

Folder structure

I have a very typical folder structure of a thesis:

wustlthesis.cls     # document class
thesis.tex          # main file (structure and settings)
abstract.tex
acknowledgments.tex
chapters/           # text per chapter
  ├── XX_name.tex
  └── ...
figures/            # figures per chapter
  ├── chapXX_name/
  └── ...
fonts/              # External fonts
.github/workflows/  # Autobuild GitHub workflows
README.md           # Instructions to build the file
latexmkrc           # latexmk settings
references.bib      # Bibliography in BibLaTeX

The main goal of the folder structure is to organize the materials by chapters. I also keep all the related files together so the project is standalone and reproducible.

GitHub online editing workflow

Another pain point of the LaTeX setup is the installation of the toolchains (TeX Live, MacTeX, etc.). Sometimes I want to work on the document on someone else’s laptop for just a while, to fix an obvious typo or to write down some ideas. Also, a working LaTeX toolchain that is always ready to use will help others adopt the workflow much more easily.

It turns out I can tap into the continuous integration and continuous delivery (CI/CD) service provided by the online code repositories. On GitHub, this service is GitHub Actions, which allows users to provide a custom Docker image and run arbitrary code.

Luckily, someone has already laid the groundwork by packaging a TeX Live environment into a GitHub Action: https://github.com/xu-cheng/latex-action. Below is an example of creating an online workflow to build the document using LuaLaTeX:

# .github/workflows/build_latex.yml
name: Build LaTeX PDF
on: push
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - name: Check out Git Repository
      uses: actions/checkout@v2

    - name: Build LaTeX files
      uses: xu-cheng/latex-action@v2
      with:
        root_file: thesis.tex
        latexmk_use_lualatex: true

    - name: Check if PDF file is generated
      run: |
        file thesis.pdf | grep -q ' PDF '

    - name: Upload PDF
      uses: actions/upload-artifact@v2
      with:
        name: PDF
        path: thesis.pdf

The workflow is triggered on every git push. Here is an example of my WUSTL thesis template (link):

Overview of all the online LaTeX document compilation jobs using GitHub actions.

Not only does the workflow record the output messages, which are useful for debugging, but the final PDF output is also stored as an artifact. I no longer need to install LaTeX locally.

Auto-generated PDF output of the document.
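The "Check if PDF file is generated" step in the workflow relies on the file command to recognize a PDF. The same idea can be sketched in Python by checking the magic bytes at the start of the file. This is only an illustration of what the check does, not part of the workflow; the function name and the throwaway demo files are made up.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF magic bytes '%PDF-'."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


# Demo on two throwaway files (stand-ins for a real thesis.pdf)
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    bad = os.path.join(tmp, "bad.pdf")
    with open(good, "wb") as handle:
        handle.write(b"%PDF-1.5\n%%EOF\n")
    with open(bad, "wb") as handle:
        handle.write(b"This is not a PDF\n")
    results = (looks_like_pdf(good), looks_like_pdf(bad))

print(results)
```

Like the grep in the workflow, this only confirms that the output is a PDF at all; it does not validate the content.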

By putting everything in a GitHub repo, I can also use GitHub’s online editor to work on my document in a web browser anywhere. Together with GitHub Actions, it can be an alternative to online LaTeX platforms like Overleaf; hacky but more flexible.

Online editor of a LaTeX project in a GitHub repository.

Currently, the online workflow is not fully optimized. My full thesis takes about 6–12 minutes to complete on GitHub, while locally, a full run without cache takes about 3–5 minutes. And incremental local builds can be much faster with caches. By caching the environment and the intermediate outputs, the online workflow can be faster.

Package recommendation to start a new LaTeX template

I looked into a few package alternatives while updating the thesis template.

For a new template/project, start with memoir, seriously.

Memoir is certainly a giant package and has very lengthy documentation (~600 pages). But it covers > 90% of the formatting I can think of. The first few chapters of its documentation (up to the “Paragraphs and Lists” chapter) are very useful to get the main components and structures in place.

While memoir is powerful and covers almost everything, I do wish certain aspects of it were better. The chapter title styling is complicated and difficult. My other issue is the lack of examples. I understand the current documentation is already crazily long. But memoir has so many options and features that I sometimes find it difficult to comprehend how to use a specific feature. Maybe more examples or code snippets could be added in separate documentation.

With memoir, very few additional packages are required. Here is a short list of packages:

microtype: final touch on the typography. pure black magic in my view
enumitem: list environment customization
threeparttable: pretty table styling and in-table notes
subcaption: subfloat references
hyperref: link and PDF metadata
graphicx: external graphics
fontspec: custom system fonts
csquotes: quotes
babel: localization

Each package is powerful in its specific usage. Their documentation is worth reading to fully utilize the packages.

Why I didn’t use LaTeX a lot during my PhD

I didn’t really have many chances to use LaTeX during my PhD. It’s usually not worth the effort to use LaTeX for documents less than 50 pages.

The most common documents I need to produce are presentations and manuscripts, and I don’t think LaTeX is useful here. My presentations are figure heavy. Many figures are clipped from webpages, tool outputs, and other papers. They are messy and require post-editing like cropping and overlays, which are easier to do interactively; PowerPoint wins. Moreover, I need to finish my slides in a short time. LaTeX is not the right tool.

As for the manuscripts, the figures and text are managed separately, so there is no need to work on how to insert the figures at the right location in text. The main text editing is about citations. While LaTeX is good at them, Zotero can manage citations and bibliography easily on Google Docs, Word, and LibreOffice Writer. Another main hassle of manuscript editing is collaboration. While it’s possible for LaTeX, the existing services have a high entry bar with little benefit.

Types of documents I will write in LaTeX in future

If I need to write a report longer than 50 pages, I will write in LaTeX and use my thesis setup.

I have also been updating my CV in LaTeX. While I was playing with LaTeX for my thesis, I also applied the new things I learned to my CV. In fact, I applied a different theme and rewrote many parts of the theme to utilize the awesome packages. I like my new CV to be clean, minimalist, and easy to extend. I don’t know yet if I want to create my one-page resume in LaTeX.

So is LaTeX worth learning if I am not going to write a book or a long report in future?

It’s a bad investment in time for sure, since I have been putting hours in learning LaTeX. But I have since been paying more attention to typography and layout of my documents in general. I appreciate certain aesthetics of the printings. And by trying to fulfill the desired look and feel in my documents, I’ve also improved my editing skills in all WYSIWYG editors (even Adobe InDesign) and webpages (CSS). It’s hard to beat that artistic satisfaction of getting the style right.

I am glad that I learned LaTeX.

The history of the existing template goes way back to 1995. I tried to summarize the history I could find from the file comments here. ↩
Here are the advantages of LaTeX I appreciate the most:
- Precise control of the layout
- Automatic positioning of float objects (figures and tables)
- Programmable and reusable visual components, especially those provided by the packages
- Excellent bibliography management
- Excellent equation display
↩

Fix Fira Code font ligatures and features

2022-03-01T00:00:00-06:00

Fira Code has been my programming font of choice for a while. It’s also the default monospace font of my blog. I like its ligatures such as >= and connected lines ====== ------. It even renders progress bars nicely. It makes my plain text documents look neat.

That said, I don’t like the default ampersands & and the at signs @. I find them harder to read than their traditional looks. To change their looks, we can enable the alternative ligatures and features of the font using different OpenType features (see also the guide on MDN). In this case, ss03 and ss05 enable the traditional looks of ampersands and at signs, respectively. Most modern editors and word processors are able to configure the feature sets in use.

Comparison of the Fira Code rendering with and without the features fixed (ss01, ss03, ss05, and ss08).

Unfortunately, I have encountered programs that are unable to configure the font features. While tools like pyftfeatfreeze (OpenType Feature Freezer) are able to swap specific glyphs by directly editing the font file, ligatures of those glyphs may fail. For example, the ss08 feature (which affects ligatures such as ==, !=, and ===) won’t be permanently enabled using this approach.

Permanently fix the font features using the official build script (update in 2022-03)

Thanks to the pull request by @Daxtorim a few days after this post was published, we now are able to permanently fix the font features using the official build script:

docker run -it --rm \
    -v $PWD:/opt/FiraCode \
    tonsky/firacode:latest \
    ./FiraCode/script/build.sh \
    -f "ss01,ss03,ss05,ss08" \
    -n "Fira Code ss01 ss03 ss05 ss08"

# Rename the generated TTFs
parallel 'mv {} {.}.ss1358_enabled.ttf' ::: distr/ttf/'Fira Code ss01 ss03 ss05 ss08'/*.ttf

And that’s it! This is the easiest solution and it works straight out of the box. Praise the open source community :) I still kept the original instructions below to manually create the patches since that’s what happens behind the scenes.

Permanently fix the font features

By changing the source code of the font generation, it should be possible to permanently fix any font features (aka patching). As mentioned by the original author (@tonsky):

There is probably a simpler approach to patching the font, just concat whatever code there is in ssXX and add it to the end of calt feature. That should work on the current version of the font, but you’ll need to do your own research on which scripts to use for that

And that’s exactly my patch for FiraCode.glyphs. Copy all the content of the features (say, ss01, ss03, ss05, and ss08) to calt. So the original code:

# FiraCode.glyphs
{
code = "lookup less_bar_greater ...
... underscores;\012";
name = calt;
},

becomes:

{
code = "lookup less_bar_greater ...
... underscores;\012
# ss01\012  sub r by r.ss01;
# ss03\012  sub ampersand by ampersand.ss03;...
# ss05\012  sub at by at.ss05;\012sub asciitilde.spacer'; ...
# ss08\012  sub equal_equal.liga by equal_equal.ss08;...
";
name = calt;
},
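Conceptually, the patch is just string concatenation: take the code of each ssXX feature and append it to the end of calt. Here is a toy sketch of that idea on a plain Python dictionary. It does not parse or write the real FiraCode.glyphs file (which stores newlines as \012 escapes), and the calt content is a placeholder; the two substitution rules are the ones shown in the patched snippet above.

```python
def patch_calt(features, to_freeze):
    """Return a copy of `features` with the chosen ssXX code appended to calt."""
    patched = dict(features)
    extra = "".join(f"# {name}\n{features[name]}\n" for name in to_freeze)
    patched["calt"] = features["calt"].rstrip("\n") + "\n" + extra
    return patched


# Placeholder feature code; the real entries in FiraCode.glyphs are much longer
features = {
    "calt": "# existing calt rules go here\n",
    "ss03": "sub ampersand by ampersand.ss03;",
    "ss05": "sub at by at.ss05;",
}

patched = patch_calt(features, ["ss03", "ss05"])
print(patched["calt"])
```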

Then build the font from the .glyphs file using the Docker image:

docker run -it --rm \
    -v $PWD:/opt/FiraCode \
    tonsky/firacode ./FiraCode/script/build.sh

To set a different font name for the feature fixed fonts, I use pyftfeatfreeze:

parallel \
    pyftfeatfreeze \
        --suffix --usesuffix="'ss01 ss03 ss05 ss08'" \
        -v -n \
        '{}' 'features_enabled/{/.}.ss1358_enabled.ttf' \
        ::: ttf/*.ttf

Change the blog commenting system to utterances

2022-02-20T00:00:00-06:00

My blog is statically generated, so it needs an external service for commenting. I chose Disqus when I started my blog because it was a popular choice, and it is free and easy to set up. However, there’s been increasing concern about its extensive user tracking and ads, and the resulting toll on page loading performance¹. Heck, I don’t even load Disqus myself when I check my own blog:

What my blog looks like from my end, where Disqus is blocked by Privacy Badger by default.

Recently, I was finally able to look into the alternatives to Disqus. I landed on utterances, a commenting widget based on GitHub Issues. I like it for a few reasons:

Free and open source
No tracking or ads (for now at least)
Comments are tied to the blog’s code repository
Moderation using existing GitHub tools/interface

Switching to a new commenting system can be hard due to the loss of the old comments. But since there are only 75 comments in total on my blog, I don’t have a lot to miss :) I did back up the old comments because I enjoyed the discussions, and they are one of the main motivations that keep me going. Disqus offers a way to export all the comments, and here is the frequency of the comments over time:

Number of comments on my blog over time

Starting with this post, my blog uses the new utterances comment widget. For comparison, I attached a screenshot of the old interface using Disqus below:

Old commenting widget using Disqus on my blog.

There are already many summaries on these issues. For example: Replacing Disqus with Comments (Discussion on Hacker News) and Disqus, the dark commenting system (Discussion on Hacker News). ↩

ID cross reference with exact protein sequence identity using UniParc

2020-07-24T00:00:00-05:00

TL;DR: Many existing ID mappings between different biological ID systems (RefSeq/Ensembl/UniProt) don’t consider if the IDs have the same exact protein sequence. When the exact sequence is needed, UniParc can be used to cross-reference the IDs. I will demonstrate how to use UniParc to map RefSeq human proteins to UniProt and Ensembl at scale.

You can skip to the solution if you already know what’s the problem I want to tackle.

Camps of biological IDs
Challenges to map IDs with exact sequence identity
Why mapping with exact sequence identity?
UniParc comes to the rescue for protein sequence identity
Programmatic UniParc access using its XML
Summary

Camps of biological IDs

There are a few “camps” of biological IDs that are used by many (human) databases and datasets: Ensembl, RefSeq (plus NCBI/Entrez Gene), and UniProt. Each ID camp is comprehensive independently, containing gene-level, transcript-level, and protein-level information using their own systems of IDs. Unfortunately, all three ID systems/camps are useful in their own way, making the choice of the “favorite” ID system really divided for different databases and datasets.

To get a sense of these “ID camps” and how information is connected through them and across them, this great illustration from bioDBnet sums it all up (it’s huge):

Best illustration of the complex ID crossref: bioDBnet Network Diagram (source)

Challenges to map IDs with exact sequence identity

It’s usually straightforward to cross reference within each ID camp, as long as one has the versioned ID and a copy of that camp’s ID system. For example, to know the gene symbol of ENSP00000368632.3, I can easily use ensembldb, a lite copy of Ensembl’s ID system, to find out it is translated from transcript ENST00000379328.9, which is one of the transcripts of gene ENSG00000107485.18, whose gene symbol is GATA3. Easy peasy. The story goes the same for RefSeq and UniProt (albeit this one is more protein centric).

However, things get messy when one wants to cross reference across ID camps. While both official and third-party services (e.g., bioDBnet and DAVID) exist, they don’t guarantee sequence identity match. In this post, I will focus on the protein sequence. For example, bioDBnet says ENSP00000368632 can be mapped to (note the lack of version):

RefSeq: NP_002042, NP_001002295, …
UniProt: P23771, …

But when requiring identical sequences, ENSP00000368632.3 should only match NP_001002295.1 in RefSeq and the non-canonical UniProt isoform P23771-2 (sequence version 1). The other IDs don’t have the exact same protein sequence because of a 1-aa deletion. To complicate things further, many mappings don’t handle ID versions, and sequences often change across ID versions. Without tracking the versioned ID, it’s impossible to say which IDs have the same sequence.

Let me be clear that there is nothing wrong with these services. They’re built so people can map IDs at a high level and increase the number of mappable IDs, which is extremely useful in its own way. In fact, a lot of IDs simply cannot be mapped when considering exact sequence identity.

For transcripts the situation isn’t any better because many mapped transcripts of different ID camps have different UTR sequences. I will probably write another post to touch on transcript ID mapping. The short answer is that the MANE project started by RefSeq and Ensembl is working on the problem.

Why mapping with exact sequence identity?

Well, when do I need ID mapping with exact protein sequence identity? My use case is to map post-translational modifications (PTMs) from RefSeq to UniProt. CPTAC detected loads of PTMs (mostly phosphosites) using RefSeq as the peptide spectral library (peptide search database). But a lot of what we know about a protein is from UniProt. So I need a reliable way to map a specific amino acid of one RefSeq protein to its UniProt counterpart. I’ve been using existing services for the job, but they are not perfect.

For example, we found a phosphosite NP_001317366.1 (PTPN11) p.Y546. PTPN11 corresponds to Q06124 in UniProt reviewed proteome (the current canonical isoform is Q06124-2 seq. ver3). But you won’t find anything (e.g. antibody) about this site at 546aa because sequences of NP_001317366.1 and Q06124-2 don’t match. This phosphosite actually maps to Q06124-2 p.Y542.

While one can argue that I should re-run the peptide search using UniProt, this solution only works around the problem. The problem comes back when I want to map the Ensembl-based mutations to UniProt. I am also aware that two proteins with different sequences can be biologically different, and we shouldn’t just blindly integrate their annotations. I totally agree, so the integration should be further validated. Due to the nature of shotgun proteomics, as long as the peptide sequence of the PTM site can be found in both proteins, it’s quite possible that the site can be mapped to both. This topic has been on my mind for a while. I’ll write about it once I figure out the details. Anyway, mapping PTMs between different protein sequences is my next step, and it goes beyond the scope of this post.
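
To make the idea of mapping a site through its surrounding peptide concrete, here is a minimal Python sketch. It is only an illustration: the function name map_site_by_flank is hypothetical, the two sequences are toy strings rather than real PTPN11 isoforms, and a real workflow would still need to validate every hit.

```python
def map_site_by_flank(src_seq, src_pos, dst_seq, flank=7):
    """Map a 1-based residue position from src_seq to dst_seq.

    The residue is located in dst_seq by exact match of the peptide
    window spanning `flank` residues on each side of the site.
    Returns all matching 1-based positions (empty list if none).
    """
    idx = src_pos - 1
    start = max(0, idx - flank)
    end = min(len(src_seq), idx + flank + 1)
    window = src_seq[start:end]
    offset = idx - start  # where the site sits inside the window

    hits = []
    pos = dst_seq.find(window)
    while pos != -1:
        hits.append(pos + offset + 1)
        pos = dst_seq.find(window, pos + 1)
    return hits


# Toy sequences: dst lacks the four G residues upstream of the Y site,
# so the site shifts by four positions.
src = "AAAAGGGGCCCCYDDDDEEEE"   # Y at position 13
dst = "AAAACCCCYDDDDEEEE"       # Y at position 9
print(map_site_by_flank(src, 13, dst, flank=3))  # [9]
```

An empty result means the window is not shared, and more than one hit means the window is too short to be unique; both cases call for manual inspection.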

UniParc comes to the rescue for protein sequence identity

The UniProt Archive (UniParc) is a comprehensive and non-redundant database that contains most of the publicly available protein sequences in the world. Proteins may exist in different source databases and in multiple copies in the same database. UniParc avoided such redundancy by storing each unique sequence only once and giving it a stable and unique identifier (UPI) making it possible to identify the same protein from different source databases. A UPI is never removed, changed or reassigned. UniParc contains only protein sequences.

(source: UniParc help page)

UniParc is a non-redundant archive of protein sequences. The UniParc ID and its sequence are permanently stable, but the cross-references associated with a UniParc entry may change over time. These properties make UniParc a perfect identifier for mapping across ID camps. UniParc IDs can be queried using the CRC64-ISO checksum of the protein sequence.

For example, let’s find the UniParc ID of NP_001317366.1. First, we obtain its protein sequence in FASTA from NCBI:

$ export refseq_id="NP_001317366.1"
$ curl -Lo "$refseq_id".fasta \
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=$refseq_id&rettype=fasta&retmode=text"

$ head -n 3 "$refseq_id".fasta
>NP_001317366.1 tyrosine-protein phosphatase non-receptor type 11 isoform 3 [Homo sapiens]
MTSRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGDFTLSVRRNGAVTHIKIQNTGDYYDLYGGEK
FATLAELVQYYMEHHGQLKEKNGDVIELKYPLNCADPTSERWFHGHLSGKEAEKLLTEKGKHGSFLVRES

Then calculate the CRC64 checksum (a few packages can do this, or use EBI’s checksum calculator):

>>> from pysam import FastaFile
>>> from crc64iso import crc64iso
>>> fa = FastaFile('NP_001317366.1.fasta')
>>> seq = fa.fetch(fa.references[0])
>>> crc64iso.crc64(seq)
'37E8BFC7ECA2D03F'
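
If installing a package is inconvenient, the checksum can also be sketched in pure Python. The version below assumes the variant in question is the reflected CRC-64 with polynomial 0xD800000000000000 and a zero initial value, which is my understanding of what crc64iso computes; the function name crc64_iso is hypothetical, and the result should be checked against the package or EBI’s calculator before relying on it.

```python
def _build_table():
    """Precompute the 256-entry lookup table for the reflected CRC-64."""
    poly = 0xD800000000000000  # assumed CRC-64/ISO polynomial (reflected form)
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ poly
            else:
                crc >>= 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc64_iso(seq):
    """Return the checksum of an ASCII sequence as 16 uppercase hex digits."""
    crc = 0  # assumed initial value
    for byte in seq.encode("ascii"):
        crc = (crc >> 8) ^ _TABLE[(crc ^ byte) & 0xFF]
    return f"{crc:016X}"


print(crc64_iso("MTSRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGDFTLSVRRNGAVTHIKIQNTGDYYDLYGGEK"))
```

The printed value is for the first FASTA line only, so it will not equal the full-sequence checksum shown above.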

Searching checksum:37E8BFC7ECA2D03F on UniParc gives a unique entry¹ UPI000041C017: https://www.uniprot.org/uniparc/?query=checksum%3A37E8BFC7ECA2D03F&sort=score&direct=yes.

UniParc entry UPI000041C017 and all of its human ID cross references with exact sequence identity.

All the external IDs listed here have the identical protein sequence to NP_001317366.1, which of course includes itself. UniParc also marks the IDs inactive if they are superseded by a newer version or become obsolete, which is quite useful for data forensics.

Programmatic UniParc access using its XML

To extract UniParc’s cross references, it’s easiest to parse its XML, which is also easy to download in bulk. Continuing with UPI000041C017 (NP_001317366.1) as the example:

$ curl -LO https://www.uniprot.org/uniparc/UPI000041C017.xml
$ head -n 10 UPI000041C017.xml
<?xml version='1.0' encoding='UTF-8'?>
<uniparc xmlns="http://uniprot.org/uniparc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniparc http://www.uniprot.org/docs/uniparc.xsd" version="2020_03">
<entry dataset="uniparc">
<accession>UPI000041C017</accession>
<dbReference type="UniProtKB/Swiss-Prot" id="Q06124" version_i="2" active="N" version="2" created="2005-12-20" last="2019-11-13">
<property type="NCBI_GI" value="84028248"/>
<property type="NCBI_taxonomy_id" value="9606"/>
<property type="protein_name" value="Tyrosine-protein phosphatase non-receptor type 11"/>
<property type="gene_name" value="PTPN11"/>
</dbReference>

While I don’t find XML easy to read, I’ve figured out a way before to parse XMLs as a JSON-like dictionary given its schema. Let’s define a function to flatten the nested structure and only select the information we want:

def parse_dbref(entry):
    """
    Parse Ensembl/UniProt/RefSeq IDs of a UniParc entry.

    Keep both active and inactive IDs.
    """
    ref_ids = {}
    for db_type in ["Ensembl", "UniProt", "RefSeq"]:
        ids = set()
        for d in entry['dbReference']:
            if d["@type"].startswith(db_type):
                # Skip non-human entries
                ncbi_taxid = next(
                    (p['@value'] for p in d['property'] if p['@type'] == 'NCBI_taxonomy_id'),
                    None
                )
                if ncbi_taxid != '9606':
                    continue

                # Make versioned ID
                if '@version' not in d:
                    # Use the UniParc internal version (for UniProt)
                    id_str = f"{d['@id']}.{d['@version_i']}"
                else:
                    id_str = f"{d['@id']}.{d['@version']}"
                ids.add(id_str)
        ref_ids[db_type] = sorted(ids)
    return ref_ids

xmlschema makes it really easy to parse an XML with schema:

>>> import xmlschema
>>> from pprint import pprint
>>> xs = xmlschema.XMLSchema('https://www.uniprot.org/docs/uniparc.xsd')
>>> data, errors = xs.to_dict('UPI000041C017.xml', validation='lax')
>>> pprint(parse_dbref(data['entry'][0]))
{'Ensembl': ['ENSP00000489597.1'],
 'RefSeq': ['NP_001317366.1', 'XP_006719589.1'],
 'UniProt': ['Q06124-1.1', 'Q06124.2']}

Voilà, we can now map across ID camps with confidence!
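
As an aside, when installing xmlschema is not an option, the same cross references can be read with the standard library’s ElementTree. This is a different technique from the schema-based parsing above, shown only as a sketch: the function name human_xrefs is hypothetical, and the sample XML is trimmed from the excerpt shown earlier with the closing tags added.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <uniparc> root element.
NS = {"up": "http://uniprot.org/uniparc"}

# Trimmed from the UPI000041C017.xml excerpt, closing tags added.
SAMPLE = """<uniparc xmlns="http://uniprot.org/uniparc">
<entry dataset="uniparc">
<accession>UPI000041C017</accession>
<dbReference type="UniProtKB/Swiss-Prot" id="Q06124" version_i="2" active="N" version="2" created="2005-12-20" last="2019-11-13">
<property type="NCBI_GI" value="84028248"/>
<property type="NCBI_taxonomy_id" value="9606"/>
<property type="gene_name" value="PTPN11"/>
</dbReference>
</entry>
</uniparc>"""


def human_xrefs(xml_text):
    """Return (db type, versioned ID, is_active) for each human cross reference."""
    root = ET.fromstring(xml_text)
    rows = []
    for ref in root.iterfind("up:entry/up:dbReference", NS):
        taxids = {
            prop.get("value")
            for prop in ref.iterfind("up:property", NS)
            if prop.get("type") == "NCBI_taxonomy_id"
        }
        if "9606" not in taxids:
            continue  # skip non-human entries
        # Fall back to the UniParc internal version when none is given
        version = ref.get("version") or ref.get("version_i")
        rows.append((ref.get("type"), f"{ref.get('id')}.{version}", ref.get("active") == "Y"))
    return rows


print(human_xrefs(SAMPLE))
# [('UniProtKB/Swiss-Prot', 'Q06124.2', False)]
```

Unlike the schema-based approach, this does no validation, so it silently ignores elements it does not recognize.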

This method can be applied to a large number of queries efficiently. By reading in a FASTA of protein sequences of interest, we can build URLs to UniParc XML per protein entry using its checksum, and pass the URLs as an aria2 input file, xml.links:

https://www.uniprot.org/uniparc/?query=checksum%3A{crc64_checksum}&format=xml
    out={protein_id}.xml

...

And download all the links in batch:

aria2c -c -j5 --max-overall-download-limit=10M -i xml.links
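
Generating xml.links is a few lines of Python. The sketch below assumes the checksums have already been computed and are given as a dict from protein ID to checksum; the function name aria2_input_lines is hypothetical, and the URL pattern is the one shown above.

```python
from urllib.parse import quote


def aria2_input_lines(checksums):
    """Build the content of an aria2 input file.

    checksums -- dict mapping protein ID to its CRC64 checksum.
    Each entry becomes a URL line followed by an indented out= option.
    """
    lines = []
    for protein_id, checksum in checksums.items():
        query = quote(f"checksum:{checksum}", safe="")  # ":" becomes %3A
        lines.append(f"https://www.uniprot.org/uniparc/?query={query}&format=xml")
        lines.append(f"    out={protein_id}.xml")
    return "\n".join(lines) + "\n"


content = aria2_input_lines({"NP_001317366.1": "37E8BFC7ECA2D03F"})
print(content)
# To save it: open("xml.links", "w").write(content)
```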

Summary

We solve the ID mapping with exact protein sequence identity between Ensembl/RefSeq/UniProt camps through UniParc.

Note that the version of UniProt entries is a bit confusing. For example, Q06124.2 means the sequence version 2 of Q06124. But finding UniProt’s sequence version is not that straightforward, and the UniProt isoforms, unlike the canonical isoform, lack version tracking. As a result, when processing UniProt-associated annotations, I will always add UniParc IDs or keep the protein sequence for future reference.

RefSeq and Ensembl protein IDs are always versioned. Thus it’s highly recommended to keep the versioned ID for these two systems.

It’s possible that two different protein sequences have the same checksum, though very unlikely. So always double check when doing this in batch. ↩

Identify the Ensembl release from versioned IDs

2020-06-08T00:00:00-05:00

I often receive data annotated with an unknown Ensembl release/version.

It could be the Ensembl IDs in a gene expression matrix, a VEP annotated MAF file, or even a customized GTF. The documentation of those files wasn’t always clear about the annotation in use. However, it’s sometimes necessary to know the exact Ensembl release. Say, I want to reproduce the result, or to pass the output to an extended downstream workflow. While the tiny differences between adjacent releases are only annoying, thousands of changes per release quickly add up to obvious inconsistencies when using releases from different years.

It’s possible to pinpoint the Ensembl release using the ID versions. For example, ENSG00000119772.15 (DNMT3A) only existed in Ensembl releases 79 and 80; ENST00000275493.7 (EGFR) has remained current since release 96. By checking the ID history on https://ensembl.org, I can identify the possible Ensembl releases my data uses. A handy URL shortcut to accompany the investigation is ensembl.org/id/<ensembl_id>, which redirects to a different page tab depending on the ID type (e.g., ENSG to genes and ENST to transcripts).

ID history of ENSG00000119772 (DNMT3A) (source)

With Ensembl Tark, I wrote a Python script to automate the query. Given a list of versioned Ensembl IDs¹, the script will identify the possible Ensembl releases. The IDs can be genes, transcripts, or proteins. Based on my testing, the script can identify the Ensembl release range with fewer than 30 IDs.

Here is an example list of IDs:

ENST00000426449.4
ENST00000434817.4
ENSP00000222254.6
ENSP00000477864.1
ENSG00000170266.14
ENSG00000238009.5
ENSG00000173705.7

And the output of the script:

$ python check_possible_ensembl_releases.py ensembl_ids.list
[2020-06-09 13:00:48][INFO   ] Querying for 7 Ensembl IDs
[2020-06-09 13:00:49][INFO   ] Only Ensembl releases 75–99 are supported by Ensembl Tark. IDs outside the range may not be identified.
ENSP00000477864.1 in Ensembl releases 76–99
ENSG00000173705.7 in Ensembl releases 79–80
ENSG00000238009.5 in Ensembl releases 79–80
ENST00000434817.4 in Ensembl releases 79–80
ENSP00000222254.6 in Ensembl releases 75–99
ENSG00000170266.14 in Ensembl releases 79–80
ENST00000426449.4 in Ensembl releases 79–80
Possible Ensembl releases are: 79, 80

In this case, Ensembl releases 79 and 80 have the same gene model.

The script basically uses Tark’s REST APIs to get the release range for each ID and finds the intersection of all the ranges. It runs on Python 3.8 and uses aiohttp 3.6 for concurrent API calls. Tark currently has records from release 75 (2014) to 99 (2020), so this approach will fail if the IDs are too old (or too new, but I think it will include the latest r100 soon). I limited the number of concurrent calls to at most 5 so I don’t overwhelm the Tark service.
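
The intersection step is simple enough to sketch on its own. The snippet below is not the script itself; it assumes each ID has already been resolved to one contiguous (first, last) release range, uses the ranges from the example output above, and the function name possible_releases is hypothetical.

```python
def possible_releases(ranges):
    """Intersect inclusive (first, last) release ranges.

    ranges -- dict mapping a versioned Ensembl ID to its release range.
    Returns the releases shared by every ID (empty list if none).
    """
    lo = max(first for first, _ in ranges.values())
    hi = min(last for _, last in ranges.values())
    return list(range(lo, hi + 1)) if lo <= hi else []


# Ranges copied from the example output above
ranges = {
    "ENSP00000477864.1": (76, 99),
    "ENSG00000173705.7": (79, 80),
    "ENSG00000238009.5": (79, 80),
    "ENST00000434817.4": (79, 80),
    "ENSP00000222254.6": (75, 99),
    "ENSG00000170266.14": (79, 80),
    "ENST00000426449.4": (79, 80),
}
print(possible_releases(ranges))  # [79, 80]
```

An empty result suggests the IDs come from mixed releases or from a release outside the range Tark covers.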

Tark is a great website/service that compares the transcripts between different versions or even different sources (Ensembl vs RefSeq)! It can tell you what exon or UTR was changed, something quite tricky to set up because one has to import databases for every Ensembl release and even RefSeq releases. Tark is currently in beta, but I hope it becomes stable and remains updated.

This is yet another reason why it’s usually a good idea to include versioned IDs in the data.

Now I can go back to digging into the history of my data :)

It’s probably possible to apply the same approach without Ensembl ID version (e.g., ENSG00000119772), but one might need a lot of them because IDs are much more stable. ↩

Generate Venn diagrams easily

2019-04-20T00:00:00-05:00

I find myself generating Venn diagrams quite often. While there are many Venn diagram plotting libraries available, they don’t always fit my needs. My inputs are the set sizes rather than lists of observations. And after drawing the Venn diagram, I often edit it to integrate with other figures, so I prefer a vector format like SVG, which not all the libraries offer.

So I made an Observable Notebook that allows me to interactively modify the Venn diagram and download the output as an SVG file. It’s built on the venn.js library, which does all the heavy lifting.

Here is the screenshot of the diagram drawing interface (see it live on the notebook):

Screenshot of the Observable Notebook

The set sizes can be easily tweaked by editing the sets variable. The colors of the two sets can be configured by clicking on the color blocks. There is a button to download the generated Venn diagram. Finally, every change is reflected interactively on the diagram.

What’s even cooler about the Observable Notebook is that I can simply modify the code to change the output. If I want to add a new set to have a three-way Venn diagram, I just need to update the sets. For example, copy and paste the following into the notebook:

sets = [
  {sets: [0], label: 'A', size: 1700, fill: set0_color},
  {sets: [1], label: 'B', size: 1350, fill: set1_color},
  {sets: [2], label: 'C', size: 700, fill: 'green'},
  {sets: [0, 1], size: 1200},
  {sets: [0, 2], size: 500},
  {sets: [1, 2], size: 450},
  {sets: [0, 1, 2], size: 350}
]

And I will get the following new Venn diagram in SVG:

I hope now I will spend less time figuring out how to draw a Venn diagram.

Store GDC genome as a Seqinfo object

2019-02-26T00:00:00-06:00

Genomic Data Commons (GDC) hosted by NCI is the place to harmonize past and future genomic data, such as TCGA, TARGET, and CPTAC projects. GDC has its own genome reference, GRCh38.d1.vd1, which has 2,779 “chromosomes” including decoys and virus sequences. That said, the canonical chromosomes of GRCh38.d1.vd1 (e.g., chr1 to chr22, chrM, chrX, and chrY) are identical to those of hg38 and GRCh38. So all three genome references can be used interchangeably.

Anyway, I was trying to correctly store the full GRCh38.d1.vd1 genome information in the GRanges and GRangesList R objects, which can be done by creating a Seqinfo object representing all its chromosomes. It was also fun to get familiar with the genomic data structures in R.

Build GDC’s Seqinfo

First, we need the length and the name of all chromosomes in GRCh38.d1.vd1. I used samtools to extract the information as a .dict file from the genome reference FASTA file.

export GDC_REF_FA_URL='https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834'
curl -Lo GRCh38.d1.vd1.fa.tar.gz $GDC_REF_FA_URL
samtools dict \
    -a 'GRCh38.d1.vd1' -s 'Homo sapiens' \
    -u $GDC_REF_FA_URL \
    GRCh38.d1.vd1.fa.tar.gz > GRCh38.d1.vd1.dict

head -n 3 GRCh38.d1.vd1.dict
# @HD   VN:1.0  SO:unsorted
# @SQ   SN:chr1 LN:248956422    M5:6aef897c3d6ff0c78aff06ac189178dd UR:https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834 AS:GRCh38.d1.vd1    SP:Homo sapiens
# @SQ   SN:chr2 LN:242193529    M5:f98db672eb0993dcfdabafe2a882905c UR:https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834 AS:GRCh38.d1.vd1    SP:Homo sapiens

Seqinfo also requires the information of whether a chromosome is circular. In GDC’s case, the mitochondrial chromosome and all virus sequences are circular. Combining all the information together, we can construct the Seqinfo object describing the genome GRCh38.d1.vd1.

library(tidyverse)
library(GenomeInfoDb)

gdc_simple_tbl <- read_tsv(
    './GRCh38.d1.vd1.dict', 
    skip = 1,  # Skip the first line (@HD ...)
    col_names = c('SQ', 'chrom', 'length', 'md5sum', 'URI', 'assembly', 'species')
) %>%
    select(chrom, length) %>%
    mutate(chrom = str_sub(chrom, start = 4), 
           length = as.integer(str_sub(length, start = 4)),
           circular = case_when(
               chrom == 'chrM' ~ TRUE,
               chrom == 'chrEBV' ~ TRUE,
               startsWith(chrom, 'HPV') ~ TRUE,
               TRUE ~ FALSE
           ))

gdc_seqinfo <- Seqinfo(
    seqnames = gdc_simple_tbl$chrom,
    seqlengths = gdc_simple_tbl$length,
    isCircular = gdc_simple_tbl$circular,
    genome = 'GRCh38.d1.vd1'
)

Now we can supply it to any GRanges object derived from GDC’s sequencing data.

> gdc_seqinfo
Seqinfo object with 2779 sequences (191 circular) from GRCh38.d1.vd1 genome:
  seqnames   seqlengths isCircular        genome
  chr1        248956422      FALSE GRCh38.d1.vd1
  chr2        242193529      FALSE GRCh38.d1.vd1
  chr3        198295559      FALSE GRCh38.d1.vd1
  chr4        190214555      FALSE GRCh38.d1.vd1
  chr5        181538259      FALSE GRCh38.d1.vd1
  ...               ...        ...           ...
  HPV-mKN2         7299       TRUE GRCh38.d1.vd1
  HPV-mKN3         7251       TRUE GRCh38.d1.vd1
  HPV-mL55         7177       TRUE GRCh38.d1.vd1
  HPV-mRTRX7       7731       TRUE GRCh38.d1.vd1
  HPV-mSD2         7300       TRUE GRCh38.d1.vd1

I store the gdc_seqinfo as an RDS file (link here) so I can re-use it easily.

gdc_seqinfo <- readRDS('seqinfo_GRCh38.d1.vd1.rds')
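
For use outside R, the same name, length, and circularity table can be sketched in Python directly from the .dict file. This assumes the tab-separated @SQ lines shown earlier and the same circularity rule as the R code (chrM, chrEBV, and names starting with HPV); the function name parse_sq_line is hypothetical.

```python
def parse_sq_line(line):
    """Parse one line of a samtools .dict file.

    Returns (name, length, is_circular) for @SQ lines and None otherwise.
    """
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@SQ":
        return None  # e.g. the @HD header line
    # Each field is TAG:value; split once because URLs contain colons
    tags = dict(field.split(":", 1) for field in fields[1:])
    name = tags["SN"]
    # Same rule as the R code: chrM, chrEBV, and HPV* are circular
    circular = name in ("chrM", "chrEBV") or name.startswith("HPV")
    return name, int(tags["LN"]), circular


line = (
    "@SQ\tSN:chr1\tLN:248956422\tM5:6aef897c3d6ff0c78aff06ac189178dd\t"
    "UR:https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834\t"
    "AS:GRCh38.d1.vd1\tSP:Homo sapiens"
)
print(parse_sq_line(line))  # ('chr1', 248956422, False)
```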

Build EnsDb from a local Ensembl MySQL database

2019-01-08T00:00:00-06:00

On some occasions, I need to access an older version of Ensembl human transcripts. For example, the mutation calls generated by the NCI’s Genomic Data Commons pipeline are annotated by Ensembl v84. To programmatically query the Ensembl annotations, I use the EnsDb SQLite database created by ensembldb, which is an R package I enjoy using (see my previous post for its usage).

The EnsDbs of the recent versions of Ensembl (v87+) are available on AnnotationHub. However, the older versions are not available, and they don’t get updated when ensembldb introduces a new feature. For example, newer EnsDbs now include the transcript and gene ID version (github issue).

In my case, I need to build an EnsDb of Ensembl v84. The ensembldb’s documentation describes how to build one from the public Ensembl MySQL server. However, this method will take more than a day to complete. I started to look for other methods. After some trial and error, I managed to create my EnsDb fast by connecting to a local Ensembl database that I built. Surprisingly the setup wasn’t difficult at all, and it only took about an hour to build the EnsDb.

Here are my notes on how to create the EnsDb from a local Ensembl MySQL database. I use macOS but the steps can be easily modified to work on other OSes.

Ensembl VM
Build a local Ensembl v84 MySQL database
Build EnsDb from the local MySQL database
Remove MySQL

Ensembl VM

To create an EnsDb from an Ensembl MySQL database, we need the Ensembl Perl APIs. The easiest setup is an Ensembl virtual machine. We just need to import the VM image using VirtualBox and install the ensembldb R package inside the VM, then it is ready to build the EnsDb. I recommend giving the VM more memory than the default 1GB since more memory helps build the R packages and the EnsDb.

Build a local Ensembl v84 MySQL database

Ensembl provides the MySQL database dump to allow easy import of their data of any version. Assuming the working directory is ~/Documents/Ensembl_MySQL_mirror/, we first copy the database dump by:

cd ~/Documents/Ensembl_MySQL_mirror

# Download the db dump
rsync -a rsync://ftp.ensembl.org/ensembl/pub/release-84/mysql/homo_sapiens_core_84_38 .

# MySQL doesn't accept compressed db dump files so we decompress them
gunzip *.txt.gz

While downloading the data, we also need to install the MySQL server. I install the same (or a similar) version of MySQL that Ensembl currently uses, which is 5.6 at the time of writing. On macOS, Homebrew can specify the version of MySQL to be installed:

brew install mysql@5.6
# And launch the MySQL server
/usr/local/opt/mysql@5.6/bin/mysql.server start

First we create a database whose name matches the Ensembl version (v84):

CREATE DATABASE homo_sapiens_core_84_38;

Then we load the table schema and Ensembl data:

/usr/local/opt/mysql@5.6/bin/mysql -u root \
    homo_sapiens_core_84_38 < homo_sapiens_core_84_38.sql

/usr/local/opt/mysql@5.6/bin/mysqlimport \
    -u root \
    --fields-terminated-by='\t' --fields_escaped_by=\\  \
    homo_sapiens_core_84_38 -L *.txt

Finally, we modify the MySQL config at /usr/local/etc/my.cnf to accept remote database connections, so our VM can access the database on the host machine. I don’t use MySQL for anything else, so I simply let MySQL bind to all the possible IP addresses my machine has:

[mysqld]
bind-address = *

Note that this is not a secure configuration. To be secure, there should be a designated MySQL user with limited permission and a stricter connection setting. Restart MySQL to load the new config:

/usr/local/opt/mysql@5.6/bin/mysql.server restart

Write down a (local) IP address of our host machine.

Build EnsDb from the local MySQL database

Now we can come back to the VM and build the EnsDb v84. Run the following R script to build the EnsDb:

library(ensembldb)
fetchTablesFromEnsembl(
    84, species = "human",
    user = 'root', host = '<our host IP>', port = 3306
)
DBFile <- makeEnsemblSQLiteFromTables()

The EnsDb SQLite database will be available under the working directory. We can test the new EnsDb by:

edb <- EnsDb(DBFile)

Remove MySQL

If there is no other need of MySQL, we can uninstall it and remove all its databases by:

brew remove mysql@5.6
rm -rf /usr/local/var/mysql

Make Firefox fullscreen borderless on macOS

2018-12-13T00:00:00-06:00

EDIT 2021-06-01: In Firefox 89+, there’s a default option “Hide Toolbar” in the fullscreen mode that automatically hides the toolbar. So the customization is no longer needed.

Firefox fullscreen on macOS by default contains the address bar and the tab bar. I usually don’t need the full vertical space for web pages, so those bars aren’t a problem. But when I access an RStudio Server on Firefox, I always want to have more vertical space. As shown in the screenshot below, the address bar and the tab bar of Firefox are unnecessary, and they may be quite distracting. If those bars are hidden and only show up upon request when Firefox enters fullscreen, the vertical space can be saved and the interface will remain clean.

It turns out that Firefox controls its user interface styling using CSS. So we can set the shape of the window tabs, the height of the address bar, and more by adding a CSS file at ~/Library/Application Support/Firefox/Profiles/<profile>/chrome/userChrome.css. In Firefox 69+, we need to set toolkit.legacyUserProfileCustomizations.stylesheets=true in about:config to enable the CSS styling (more details here).

My modification was based on this answer on Stack Exchange:

#navigator-toolbox[inFullscreen] {
    height: 0.5rem;
    margin-bottom: -0.5rem;
    opacity: 0;
    overflow: hidden;
    z-index: 1;
}

#navigator-toolbox[inFullscreen]:hover,
#navigator-toolbox[inFullscreen]:focus-within {
    /*
     * Add some padding between the navbar and the top screen edge
     * to be more visible while the macOS hidden menu bar shows up.
     * The macOS menubar will hide after a few seconds.
     */
    padding-top: 1.5rem;
    height: auto;
    margin-bottom: 0rem;
    opacity: 1;
    overflow: visible;
}

Note that Firefox needs to be restarted for the styling to take effect. Here is what the Firefox fullscreen looks like after applying the userChrome.css above.

Now the RStudio Server web page feels like a native app, similar to what RStudio Desktop offers. Both address and tab bars are hidden by default, and when the mouse hovers to the top, they get visible again.

Switch tabs in the borderless fullscreen of Firefox.

Those top bars will also show up when they receive focus via keyboard shortcuts. For example, ⌘ + L focuses the address bar. It is useful when I want to launch a quick search in a new tab.

Address bar is shown automatically when it is focused using a keyboard shortcut.

In userChrome.css, I added a small padding between the bars and the top border to make them more accessible by mouse. When the mouse moves to the top, macOS’s menu bar will pop up as well, and both Firefox and macOS will overlap. The macOS one will go away first, but the mouse has to stay on the Firefox bars so they don’t disappear either.

The overlapping of Firefox bars and macOS menubar is still a bit annoying, which will require some practice to navigate between them by mouse. I will probably rely more on keyboard shortcuts instead. Anyway, I now have more vertical space and the modification of userChrome.css works fine for now.

For more information about modifying the Firefox user interface, there is a website that introduces userChrome.css in depth.

Notes for screencast encoding

I modified the command from my previous post to shrink the file size of the original QuickTime screencasts using FFmpeg. More encoding parameters can be found at FFmpeg’s wiki (VP9 and H.264).

# VP9 (WEBM)
ffmpeg -i fullscreen_switch_tabs.mov \
    -vcodec libvpx-vp9 -b:v 200K \
    -pass 1 -an -r 24 -f webm /dev/null
ffmpeg -i fullscreen_switch_tabs.mov \
    -vcodec libvpx-vp9 -b:v 200K \
    -pass 2 -an -r 24 fullscreen_switch_tabs.webm

# H.264 (MP4)
ffmpeg -i fullscreen_switch_tabs.mov \
    -vcodec h264 \
    -strict -2 -crf 40 -preset slow -r 24 \
    fullscreen_switch_tabs.mp4

EDIT 2020-06-13: Added extra about:config settings in Firefox 69+ and fixed the styling. I also increased the top margin since the new address bar is taller.

GPG Key Transition

2018-10-20T00:00:00-05:00

I am transitioning my GPG key again. However, this time, I expect to use the new GPG master key longer and will start building this identity unless there is a concern about the key strength or I accidentally lose the key.

Back in my GPG key transition in 2016, I created the subkeys for daily usage and isolated the master key into a secret offline place. I learned more about PGP throughout the years; sadly though, I still seldom have a chance to use it extensively in my daily life.

This time, I am moving the subkeys to a YubiKey. I found drduh’s guide on GitHub very informative to set up both the GPG key and the YubiKey, as well as get my hands on the various possible applications. Another notable change is that I no longer set an expiration date on my master key.

I will revoke my old keys once the transition is done.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

I am transitioning GPG keys from an old 4096-bit RSA key to a new
4096-bit RSA key.  The old key will continue to be valid for some
time, but I prefer all new correspondance to be encrypted in the new
key, and will be making all signatures going forward with the new key.

This transition document is signed with both keys to validate the
transition.

The old key, which I am transitioning away from, is:

  pub   rsa4096/0x44D10E44730992C4 2016-12-04 [SC] [expires: 2018-12-04]
        Key fingerprint = 85DF A3EB 72CD DE7D 3F2A  127C 44D1 0E44 7309 92C4

The new key, to which I am transitioning to, is:

  pub   rsa4096/0x69BAE333BC4DC4BA 2018-10-17 [SC]
        Key fingerprint = 978B 49B8 EFB7 02F3 3B3F  F2E5 69BA E333 BC4D C4BA

To fetch the full new key from a public key server using GnuPG, run:

  gpg --recv-key 0x69BAE333BC4DC4BA

If you have already validated my old key, you can then validate that
the new key is signed by my old key:

  gpg --check-sigs 0x69BAE333BC4DC4BA

Please contact me via e-mail at <me@liang2.tw> if you have any
questions about this document or this transition.

                                            Liang-Bo Wang
                                            (liang2, ccwang002)
                                            me@liang2.tw
                                            Oct 20, 2018
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEExQ3ldlbp0O+Zc1vXBS+kcT+kX08FAlvLdGkACgkQBS+kcT+k
X092fw/8DFGIIMHtf3JAt8nGth8Y94oyTgrorqzu7TXCxvUiGk6Qd/WBLiMvh//9
7mVounqtPSGuYHiWm6gtDlT+YRoFoJH3IZ9aboMzd8e/p4TspKhXAsF5bhp6U9ED
HCds+VRoVBGPtO3Ogizl5wpynYp7OzwqgwTteFlJ167mmqL05n/xLHnsvii3UFO4
MWwWMxVmwvEpJINsYOJ+mFxOXeD23ckKt3GQh2NF2Dpa+apEvq5l9vjmk4Vnqau2
WKXygbz1Rm0b629dblV8vU9iwgIsSlXz8oopkETadpkaGk9s/p8AYPupmqfCiwHx
/gYlAgIV0DmePfOjYGZ0RJTVHFnG5ong3kUyoeuAwCMWLC/QQ6pCIL+pRnXOT1/q
oXCTK7HUUZjWpRyHf/ptHpQOeYH+AuYAYJNwQA3xjzv57kkGT/gR3Q4APmA6l5lN
swHVL63QYjcKCNm+fR7pr2ka50WSrerFyLWinAT9G9w+h/31Zcl0cBqEM45nlaye
B5fYaU/G6ehFxWgVEjS7z7FSLQVNJu7aAXVZy9QlKcduNXBUZsDkcDI1iyKVhvl4
41SNhLXMO2Jk69kHtuE803/aVgw8N3Y/E5zjlBNXWSpPMPFeKZl9W6eqnl50kQcd
fKbZGV8bxFl40iBi4mfIrdOXrBQ9Oohp7UVWTHG1qYMruVRmW3aJAjMEAQEKAB0W
IQS66Rw2AWA1VmCbiTP4R473V2p7VwUCW8t0bgAKCRD4R473V2p7V8DqD/9YpnFP
usbOW9p3GwUgbvqPdefLszFZZb5LNsgL9eTSKwMUTn5AGldFquDM3hbjiZ+e/nbD
TwFKbKRt//48R5UTYsYJVxcNVW3CXxMtl+8B5PJORfUbz0/HSEsnKTMlHP1M4ybw
gZCI45sP4wT1prU3ngkZGHJJY9ojNOCrzHA+DVEp+vROn/zyg6AbLcr0+/yjCHO0
pxDbGUgEOV93GFKXl85u7qCXUTrIt2fkeFeEQoh248oBJQHPjD9WyOV/O3QNdT7d
6g7lJcSSpwevtTsWFaWCxRM2IwHlXJiWU/9bA2Jrb8E07mJGlmil7xCe4rPFD69F
/y9MBfMG0KVjAFgs6vAR5zHnN865d18JCunQ4OhY0tDnhi4q3O0OdehGqyX92pwL
hMKtHRHLuhYoDud2kdxizmtov1bHghO0kKSTlDGhZvs9Fpod4MKQHaZx4VbpA8np
OpmPBkX7+34AmnLnJP2GOA4UhsRpX1iyqTaGePjhtA0gqz5285bL02JGx/m7HDtX
MYI4yoc3rdZz37axnRinWmW7Lu5JmQGeVLJZ7Z2b83BEHVnapXPW2kFp8PcqbNnA
Lbb9XLkbHNOaiC4EFm07uFmMVkV6aKW3xV1YIlDeovRfNC0cyZDnUNFqaDGtZAeO
UifqHyDqNjBJX0a8miUMZDOXsZFD2jzm3pjS/Q==
=JRPp
-----END PGP SIGNATURE-----

Access gene annotation using gffutils

2018-06-22T00:00:00-05:00

Recently, I had to access gene annotations in multiple versions from multiple sources such as Ensembl, GENCODE, and UCSC. I used to rely on the R/Bioconductor ecosystem to query the coordinates of a gene annotation. There are existing Bioconductor packages ready for Ensembl and UCSC annotations (more info in my previous posts: Ensembl and UCSC), and one can create a new customized TxDb given a GTF/GFF file. However, the project I was working on was written in Python, so I went on searching for similar alternatives in Python.

That’s how I found gffutils, a Python package to access gene annotations from GTF/GFF files. gffutils first imports the annotations from the GTF/GFF file into a SQLite database. The package also provides some abstraction on top of the database schema, so users can retrieve an annotation without talking to the database directly through repetitive SQL commands. The database enables fast random access to any gene annotation.

I will use GENCODE v19, an annotation used by many TCGA GRCh37/hg19 projects, as an example to demo the usage of gffutils. My project requires the coordinates of UTRs and exons of all the transcripts in use.

Usage example
Direct operation on the database
Discussions

Usage example

To use gffutils to query GENCODE annotation, we need to create the database first. The comprehensive gene annotation GTF can be downloaded from the GENCODE website (URL to the GTF). The database creation is handled by gffutils’s create_db function. It will take a few minutes to run and the database will be at gencode_v19.db.

import gffutils

db = gffutils.create_db(
    './gencode.v19.annotation.gtf.gz',
    dbfn='gencode_v19.db',
    verbose=True,
    merge_strategy='error',
    disable_infer_transcripts=True,
    disable_infer_genes=True,
)
# INFO - Committing changes: 2619000 features
# INFO - Populating features table and first-order relations: 2619443 features
# INFO - Creating relations(parent) index
# INFO - Creating relations(child) index
# INFO - Creating features(featuretype) index
# INFO - Creating features (seqid, start, end) index
# INFO - Creating features (seqid, start, end, strand) index
# INFO - Running ANALYSE features

Once the database is created, we don’t have to repeat the same process. Instead, we load the database directly as a FeatureDB object:

db = gffutils.FeatureDB('./gencode_v19.db')

Single feature access

One can then access the annotations of a gene or transcript by its ID. Using a transcript of TP53 as an example,

>>> gene = db['ENSG00000141510.11']; gene
<Feature gene (chr17:7565097-7590856[-]) at 0x7fac828deeb8>
>>> tx = db['ENST00000269305.4']; tx
<Feature transcript (chr17:7571720-7590856[-]) at 0x7fac828f8080>

We can then access the details of the transcript:

>>> tx.featuretype, tx.source
('transcript', 'HAVANA')
>>> tx.chrom, tx.start, tx.end, tx.strand
('chr17', 7571720, 7590856, '-')
>>> tx.attributes.items()
[('gene_id', ['ENSG00000141510.11']),
 ('transcript_id', ['ENST00000269305.4']),
 ('gene_type', ['protein_coding']),
 ('gene_status', ['KNOWN']),
 ('gene_name', ['TP53']),
 ('transcript_type', ['protein_coding']),
 ('transcript_status', ['KNOWN']),
 ('transcript_name', ['TP53-001']),
 ('level', ['2']),
 ('protein_id', ['ENSP00000269305.4']),
 ('tag', ['basic', 'appris_principal', 'CCDS']),
 ('ccdsid', ['CCDS11118.1']),
 ('havana_gene', ['OTTHUMG00000162125.4']),
 ('havana_transcript', ['OTTHUMT00000367397.1'])]

Gene model coordinates of a transcript

To find the coordinates of its exons and UTRs, we use FeatureDB.children(), which takes a Feature object or its ID and retrieves all the features that belong to this feature. TP53 is on the reverse strand of the chromosome, so we can further sort the features by their end position in descending order:

>>> list(db.children(tx, order_by='-end'))             
[<Feature transcript (chr17:7571720-7590856[-]) at 0x7fac828922e8>,
 <Feature UTR (chr17:7590695-7590856[-]) at 0x7fac82892208>, 
 <Feature exon (chr17:7590695-7590856[-]) at 0x7fac828922b0>,
 <Feature UTR (chr17:7579913-7579940[-]) at 0x7fac828925c0>, 
 <Feature exon (chr17:7579839-7579940[-]) at 0x7fac828928d0>,
 <Feature CDS (chr17:7579839-7579912[-]) at 0x7fac82892c18>, 
 <Feature start_codon (chr17:7579910-7579912[-]) at 0x7fac82892f28>,
 ...
 <Feature CDS (chr17:7572930-7573008[-]) at 0x7fac828277b8>, 
 <Feature exon (chr17:7571720-7573008[-]) at 0x7fac82827b38>,
 <Feature UTR (chr17:7571720-7572929[-]) at 0x7fac82827eb8>,       
 <Feature stop_codon (chr17:7572927-7572929[-]) at 0x7fac828fca90>]

We have retrieved the UTRs, CDSs and exons of the transcript. Note that UTR is considered a part of an exon in gene annotation terminology. We should use CDSs as the exons that will be translated to amino acids. FeatureDB.children() provides a way to subset the feature type it returns:

>>> list(db.children(tx, order_by='-end', featuretype=['CDS', 'UTR']))
[<Feature UTR (chr17:7590695-7590856[-]) at 0x7fac8283d7f0>,
 <Feature UTR (chr17:7579913-7579940[-]) at 0x7fac8283d710>,
 <Feature CDS (chr17:7579839-7579912[-]) at 0x7fac8283d7b8>,
 ...
 <Feature CDS (chr17:7572930-7573008[-]) at 0x7fac82846470>,
 <Feature UTR (chr17:7571720-7572929[-]) at 0x7fac828467b8>]

Now the gene model of TP53 becomes clearly visible.

Feature selection

To iterate over all the features of a given type in the database, there is a FeatureDB.all_features() function. Here we want to select only the basic GENCODE transcripts and count the transcripts of each gene type:

from collections import Counter
# All the transcripts of basic GENCODE v19
all_basic_txs = (
    tx for tx in db.all_features(featuretype='transcript') 
    if 'tag' in tx.attributes and 'basic' in tx.attributes['tag']
)

Counter(tx.attributes['gene_type'][0] for tx in all_basic_txs).most_common(5)
# [('protein_coding', 67186),
#  ('antisense', 9160),
#  ('lincRNA', 7121),
#  ('miRNA', 3055),
#  ('misc_RNA', 2034)]

Direct operation on the database

Since gffutils is just an abstraction layer on top of the database, we can always talk to the underlying SQLite database directly by writing SQL commands. The database schema is available in gffutils’s documentation. Under the hood, a FeatureDB object maintains a SQLite connection at FeatureDB.conn and provides a helper function, FeatureDB.execute(), to run a single SQL command.

For example, GENCODE stores the full version of a transcript ID, but on many occasions the version is not available. Say we only know that the TP53 transcript ID is ENST00000269305; we can then write a SQL query to find the matching ID:

>>> db.conn
<sqlite3.Connection at 0x7fac89423490>
>>> cur = db.execute(
...    "SELECT id FROM features "
...    "WHERE featuretype='transcript' AND id LIKE 'ENST00000269305.%';")
>>> cur.fetchone()[0]
'ENST00000269305.4'
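
The same lookup can be wrapped into a small helper that passes the ID as a query parameter instead of pasting it into the SQL string. The following is only a sketch: find_versioned_ids is a name I made up, and the in-memory table is a toy stand-in that keeps just the two columns used by the query above (the last row is a made-up ID), not the full gffutils schema.

```python
import sqlite3


def find_versioned_ids(conn, unversioned_id, featuretype='transcript'):
    """Return all feature IDs of the form '<unversioned_id>.<version>'."""
    cur = conn.execute(
        "SELECT id FROM features "
        "WHERE featuretype = ? AND id LIKE ? "
        "ORDER BY id",
        (featuretype, unversioned_id + '.%'),
    )
    return [row[0] for row in cur]


# Toy stand-in for the features table (assumption: only id and featuretype)
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT)")
conn.executemany(
    "INSERT INTO features VALUES (?, ?)",
    [
        ('ENSG00000141510.11', 'gene'),
        ('ENST00000269305.4', 'transcript'),
        ('ENST99999999999.1', 'transcript'),  # made-up ID for illustration
    ],
)

print(find_versioned_ids(conn, 'ENST00000269305'))
```

With a real gffutils database, the same function should work on FeatureDB.conn, since it only relies on the features table and the two columns shown in the query above.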

We can even tweak the SQLite behavior by setting PRAGMA statements. gffutils already sets some default pragmas to speed up database queries, trading weaker database integrity guarantees for speed and using a larger memory cache:

>>> db.pragmas
{'synchronous': 'NORMAL',
 'journal_mode': 'MEMORY',
 'main.page_size': 4096,
 'main.cache_size': 10000}
>>> db.execute('PRAGMA temp_store=MEMORY') 
>>> db.execute('PRAGMA cache_size=-1000000')  # Use 1GB memory
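
To double check what a PRAGMA actually did, one can read the value back. Here is a minimal sketch using the standard sqlite3 module on an in-memory database (an assumption for illustration; with gffutils the same statements would go through db.execute). A negative cache_size is a size in KiB, so -1000000 is roughly 1GB.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA temp_store=MEMORY')
conn.execute('PRAGMA cache_size=-1000000')  # negative value: size in KiB

# Running a PRAGMA without a value returns the current setting
temp_store = conn.execute('PRAGMA temp_store').fetchone()[0]
cache_size = conn.execute('PRAGMA cache_size').fetchone()[0]
cache_gib = abs(cache_size) * 1024 / 2**30

print(temp_store)   # 2 stands for MEMORY
print(cache_size)
print(round(cache_gib, 2))
```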

Discussions

gffutils provides a SQLite-based gene annotation storage in Python. Though it may not be as feature complete as what users get in R, it is highly customizable and can be easily integrated with other Python functions. Like gffutils, the Bioconductor packages GenomicFeatures and EnsDb use a SQLite database under the hood. As shown in another post, we can actually connect to those databases built by R packages directly, so users can access information from other sources such as UniProt isoforms and gene names.

In my opinion, all the approaches mentioned above are better than trying to bake one’s own from scratch. Those packages are backed by numerous tests and are built from reliable or original data sources. Besides the multiple existing solutions in R and Python, one can always access the databases built by those packages from other languages, so there is rarely a need to build something from scratch anyway.

Read UniProtKB in XML format

2018-01-28T00:00:00-06:00

UniProt Knowledge Base (UniProtKB) provides various methods to access their data. I settled on their XML format since no additional parsing code is required and the format is well defined, which comes with a schema. Plus, it turns out that databases such as PDB also provide their data export in XML format and the corresponding schema so the method can be applied elsewhere.

Here I will show how to read XML with its schema in Python using xmlschema.

Other ways to read UniProtKB data in bulk
XML and XML schema
Read UniProt XML by xmlschema
Summary

XML and XML schema

XML data are structured. For example, this is what entry P51587 looks like in XML format:

<?xml version="1.0" encoding="UTF-8"?>
<uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
<entry dataset="Swiss-Prot" created="1996-10-01" modified="2017-12-20" version="201">
<accession>P51587</accession>
<accession>O00183</accession>
<accession>O15008</accession>
<accession>Q13879</accession>
<accession>Q5TBJ7</accession>
<name>BRCA2_HUMAN</name>
<protein>
<recommendedName>
<fullName>Breast cancer type 2 susceptibility protein</fullName>
</recommendedName>
<alternativeName>
<fullName>Fanconi anemia group D1 protein</fullName>
</alternativeName>
</protein>
<!-- ...  -->
</entry>
</uniprot>

The file is available at https://www.uniprot.org/uniprot/P51587.xml. Basically, all the information about this entry should be available in this file, as long as one knows how to query the XML via XPath. However, I find an XML file hard to read on its own, especially without any guide to how the file was constructed.
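
For comparison, this is roughly what the XPath route looks like with Python’s built-in xml.etree.ElementTree. It is only a sketch on a trimmed copy of the snippet above (embedded as a string so it runs offline); the up prefix is an arbitrary name I picked for the UniProt namespace.

```python
import xml.etree.ElementTree as ET

# Arbitrary prefix mapped to the namespace declared in the UniProt XML
NS = {'up': 'http://uniprot.org/uniprot'}

# Trimmed copy of the P51587 snippet shown above
XML = """<uniprot xmlns="http://uniprot.org/uniprot">
<entry dataset="Swiss-Prot" created="1996-10-01" modified="2017-12-20" version="201">
<accession>P51587</accession>
<accession>O00183</accession>
<name>BRCA2_HUMAN</name>
<protein>
<recommendedName>
<fullName>Breast cancer type 2 susceptibility protein</fullName>
</recommendedName>
</protein>
</entry>
</uniprot>
"""

root = ET.fromstring(XML)

# Every element name in the path needs the namespace prefix
accessions = [el.text for el in root.findall('./up:entry/up:accession', NS)]
full_name = root.findtext(
    './up:entry/up:protein/up:recommendedName/up:fullName', namespaces=NS)

print(accessions)
print(full_name)
```

It works, but one has to know the element paths and the namespace in advance, which is exactly the kind of knowledge a schema provides.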

UniProt XML is constructed based on its XML schema, available as an XSD file at http://www.uniprot.org/support/docs/uniprot.xsd. The schema not only helps one understand the XML content, it can also be used to check whether an XML file is valid. In other words, since all UniProt XMLs are validated against the schema, one can expect to parse all their data exactly as the schema defines. XML Schema is also a W3C standard and widely used.

Read UniProt XML by xmlschema

I use xmlschema to read XML with its schema in Python. Instead of using XPath, one can actually convert the XML content into a dictionary-like format, which can be easily passed to other Python functions.

Using the same entry P51587 as an example,

>>> import xmlschema
>>> schema = xmlschema.XMLSchema('https://www.uniprot.org/docs/uniprot.xsd')
>>> entry_dict = schema.to_dict('./P51587.xml')
>>> entry_dict.keys()
dict_keys(['@xsi:schemaLocation', 'entry', 'copyright'])
>>> content = entry_dict['entry'][0]
>>> list(content)[:6]
['@dataset', '@created', '@modified', '@version', 'accession', 'name']

I don’t need any custom code to read the XML content structurally. For example, to get all the accession IDs of this entry,

>>> content['accession']
['P51587', 'O00183', 'O15008', 'Q13879', 'Q5TBJ7']

To get the protein names,

>>> content['protein']
{'alternativeName': [{'fullName': 'Fanconi anemia group D1 protein'}],
 'recommendedName': {'fullName': 'Breast cancer type 2 susceptibility protein'}}

One can compare the dictionary converted result with the original XML. I’d like to end the demo with a more complicated example that finds all the sequence variants:

>>> seq_variants = filter(
...     lambda d: d['@type'] == 'sequence variant' and 'variation' in d,
...     content['feature'])
>>> [(d['location']['position']['@position'], 
...   d['original'], d['variation'][0])
...   for d in seq_variants][:10]
[(25, 'G', 'R'),
 (31, 'W', 'C'),
 (31, 'W', 'R'),
 (32, 'F', 'L'),
 (42, 'Y', 'C'),
 (53, 'K', 'R'),
 (60, 'N', 'S'),
 (64, 'T', 'I'),
 (75, 'A', 'P'),
 (81, 'F', 'L')]

Summary

With UniProt’s XML and its schema, one can read all the data in a structured fashion without a custom parser. Once the XML files of interest are downloaded, one can query everything locally, which is very helpful for retrieving substantial information from UniProt, say, extracting all the citations for a certain protein feature.

The XML schema really helps users understand the data structure, and it also helps the database developers validate their data export. I hope someday all databases will enforce this kind of validation.

However, one may find XML format tedious and not human-friendly to read. JSON has been popular and used heavily by RESTful APIs. The specification of JSON schema exists, but it is not a W3C standard yet.

SPARQL and RDF, part of the Semantic Web effort, can be a universal query interface that solves the same problem more elegantly, though the barrier to entry is a bit high and the learning resources are limited.

For now, reading bulk data in XML with its schema seems to be the mature way to go with abundant support.

Ad hoc bioinformatic analysis in database

2018-01-20T00:00:00-06:00

Recently I’ve found that bioinformatic analysis in a database is not hard at all and the database set up wasn’t as daunting as it sounds, especially when the data are tabular. I used to start my analysis with loading everything into R or Python, and then figuring out all the filtering and grouping commands with my favorite R or Python packages. However, the data size would be bound by memory and the analysis might be slow unless additional optimization was applied. On the other hand, databases have already solved the problems by mapping the data to disk and indexing. Therefore I’d like to share my recent experience on using databases for bioinfo analysis.

Note that if one is interested in the actual tips of using databases for analysis, feel free to skip the whole background section.

Background
- Reading tabular data in bioinformatics
- Database
Tabular data IO in database
Benchmark
Conclusion

Background

Reading tabular data in bioinformatics

Tabular data are everywhere in bioinformatics. To record gene expressions, variants or cross reference IDs between different annotation systems or databases, data are stored in various tabular-like formats, such as BED, GTF, MAF, and VCF, which can usually be normalized to the standard CSV and TSV files. Starting with the raw data, we apply different kinds of filtering and grouping to pick up the records of interest. For example, we might subset the data within a genomic region, select transcripts above an expression threshold, or group the data by the same transcript across multiple samples.

Researchers have developed numerous tools to select the data of interest. In Python, numpy and pandas dominate the analysis; in R, data.frame, tibble, and data.table are all widely used. However, all the tools above only work if the data fit into memory. Unfortunately, bioinformatics data can easily go beyond 10GB these days, so it has become difficult to analyze everything in memory. Even on a powerful server with a few hundred GB of memory, loading all the data into memory can be time-consuming. To make things worse, when joining multiple datasets together, the magnitude of the issues above is multiplied.

One might argue that in Python there are packages like xarray and dask capable of handling out-of-memory multi-dimensional array. But they are only useful for handling numerical data. In bioinformatics, metadata are frequently used and consist of many text columns, where numpy doesn’t have the same computing advantage as numerical columns. For example, gene expression only makes sense if it comes with the gene symbol, the transcript id, and the sample id.

Database

Databases have been solving the out-of-memory data analysis for decades, and they come with several advantages. First, the language databases use is standardized, known as Structured Query Language (SQL). SQL is declarative, which means instead of writing how to load or query the data, one writes what the data or the query look like. Databases also support concurrent reads, enabling queries in parallel. Second, one can speed up the queries by setting up indexes. Different types of indexes and different combinations of columns can be added to boost the query. Lastly, databases are persistent, so one only needs to load the data once.

I mainly use two databases: SQLite and PostgreSQL. SQLite’s database is just a single file on disk and it doesn’t need any configuration to run. In fact, SQLite ships with Python, available as the sqlite3 module. SQLite works very well in my case.

PostgreSQL is a more feature-rich database and has better concurrency support such as multiple writers at the same time. Its advanced indexing and data types might be helpful for genomic range query. The downside is that it requires some configurations and its installation is not as easy as SQLite. Though the basic PostgreSQL setup is actually just a few commands on Debian Linux, one probably needs to go through some documentation to understand what they are about and how to tweak the config.

The most annoying thing I found using a database in the past was to load my data, where I had to create the table by CREATE TABLE ... and insert all my data by multiple INSERT INTO ... VALUES ... statements. But recently I found that many databases have some built-in utilities to make the process easy and fast. Also, it is not hard to programmatically generate the statements through packages like SQLAlchemy. Therefore, I will share some experience of using databases here.

Tabular data IO in database

SQLite

For SQLite, use .mode csv with .import statement to load in data. SQLite will create the table automatically by using the first row as the column names if the table doesn’t exist. One can create the table before the loading to define each column’s data type, otherwise, columns are just TEXT type. .separator controls the delimiter character SQLite uses between columns.

.mode csv
-- Use a tab as the column separator for TSV files
.separator \t
.import '/path/to/tsv' table_name

To export data, use .once statement followed by the query:

-- Include the column names in the output
.header on
.once '/path/to/output.tsv'
SELECT * FROM table_name;  -- Export all data in the table

Commands above can be scripted into SQLite like:

sqlite3 mydb.sqlite < load_data.sql
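
The same import and export can also be done from Python with the built-in csv and sqlite3 modules. Below is a minimal sketch that mimics the behavior described above: the first row becomes the column names and all columns are TEXT. import_tsv and export_tsv are hypothetical helper names, and the two data rows are toy values.

```python
import csv
import io
import sqlite3


def import_tsv(conn, table, tsv_file):
    """Load a TSV stream with a header row into a new all-TEXT table."""
    reader = csv.reader(tsv_file, delimiter='\t')
    header = next(reader)
    columns = ', '.join('"{}" TEXT'.format(name) for name in header)
    marks = ', '.join('?' for _ in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    # The remaining rows of the reader are inserted one by one
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, marks), reader)
    conn.commit()


def export_tsv(conn, query, out_file):
    """Write the query result as TSV, with the column names as header."""
    cur = conn.execute(query)
    writer = csv.writer(out_file, delimiter='\t', lineterminator='\n')
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)


conn = sqlite3.connect(':memory:')
import_tsv(conn, 'table_name', io.StringIO('gene\tcount\nTP53\t3\nBRCA2\t5\n'))

out = io.StringIO()
export_tsv(conn, 'SELECT * FROM table_name ORDER BY rowid', out)
print(out.getvalue())
```

For real files, replace the io.StringIO objects with open('/path/to/tsv', newline='') and an output file handle. The built-in .import is likely faster for large files, so this is mostly useful when the loading has to live inside a Python script.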

PostgreSQL

For PostgreSQL, the built-in solution is to use the COPY statement or the \copy metacommand to import or export data. COPY runs faster than the equivalent INSERT statements. Besides built-in commands, an external tool pgloader has been very helpful for the data loading, whose loading process is more flexible.

In this post, I won’t dive into details of their usage. There will be an example in the benchmark section.

Loading compressed data with named pipe

Many tabular data files are compressed by gzip or bgzip to save disk space. To decompress a file and load it into the database without storing the uncompressed file somewhere first, one can use a named pipe.

The idea is to decompress the file to a named pipe and read the data in a database from the named pipe. A named pipe can be created by mkfifo. For example,

mkfifo mypipe
gunzip -c mydata.tsv.gz > mypipe &

The trailing & makes the decompression command run in the background, which keeps everything in one shell session. Then read the data in SQLite as if the pipe were a file:

.import mypipe mytable

The trick here can be further expanded to any preprocessing in any language. One can simply preprocess the file and write the output to a named pipe. The database can read from the named pipe without storing the full intermediate output on disk. Plus, by piping between commands more CPU cores are utilized.
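
To show the moving parts, here is a sketch of the same named pipe trick written entirely in Python (Unix only, since it relies on os.mkfifo). A background thread plays the role of gunzip -c ... > mypipe &, and the main thread reads the pipe as if it were a regular TSV file. load_gz_tsv_via_fifo is a hypothetical helper and the demo file holds toy values.

```python
import csv
import gzip
import os
import shutil
import sqlite3
import tempfile
import threading


def load_gz_tsv_via_fifo(gz_path, conn, table):
    """Stream a gzipped TSV (with header) into SQLite through a named pipe."""
    workdir = tempfile.mkdtemp()
    fifo = os.path.join(workdir, 'mypipe')
    os.mkfifo(fifo)  # Unix only

    def decompress():
        # Same role as `gunzip -c mydata.tsv.gz > mypipe &`
        with gzip.open(gz_path, 'rb') as src, open(fifo, 'wb') as dst:
            shutil.copyfileobj(src, dst)

    worker = threading.Thread(target=decompress)
    worker.start()
    try:
        # Opening the pipe blocks until the writer side has opened it too
        with open(fifo, newline='', encoding='utf-8') as pipe:
            reader = csv.reader(pipe, delimiter='\t')
            header = next(reader)
            columns = ', '.join('"{}" TEXT'.format(name) for name in header)
            marks = ', '.join('?' for _ in header)
            conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
            conn.executemany(
                'INSERT INTO "{}" VALUES ({})'.format(table, marks), reader)
        conn.commit()
    finally:
        worker.join()
        os.remove(fifo)
        os.rmdir(workdir)


# Demo with a tiny gzipped TSV of toy values
demo_dir = tempfile.mkdtemp()
gz_path = os.path.join(demo_dir, 'mydata.tsv.gz')
with gzip.open(gz_path, 'wt', encoding='utf-8') as f:
    f.write('gene\tn\nTP53\t3\nBRCA2\t5\n')

conn = sqlite3.connect(':memory:')
load_gz_tsv_via_fifo(gz_path, conn, 'mytable')
rows = conn.execute('SELECT gene, n FROM mytable ORDER BY rowid').fetchall()
print(rows)
```

In pure Python one could simply iterate over gzip.open() and skip the pipe. The named pipe pays off when the reader is an external program that expects a file path, such as the sqlite3 shell above.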

Benchmark

To give an idea of the data processing time in databases, I used all the somatic variants from TCGA MC3 as a demonstration. The goal here is to count the number of variants by different transcript and its mutation type. So the output result will be something like the following:

Transcript ID	Mutation type	Count
ENST00000000233	3’UTR	20
ENST00000000233	Frame_Shift_Del	1
ENST00000000233	Intron	6
…	…	…

After filtering out all the silent mutations, there are about 2.8 million variants in total, taking up 614MB of disk space.

I used three methods to load and group the variants: pandas, SQLite, and PostgreSQL. Their code is shown below.

pandas (Python)

Standard pandas IO code.

import numpy as np
import pandas as pd


df = pd.read_table(
    'mc3_filtered.tsv',
    header=None,
    names=[
        'chrom', 'start', 'end', 'strand', 'mutation_type',
        'ref_allele', 'alt_allele', 'transcript_id',
        'hgvs_c', 'hgvs_p', 'cdna_start', 'cdna_end',
        'p_start', 'p_end', 'normal_id', 'tumor_id'
    ],
    dtype={
        'chrom': str, 'start': np.int64, 'end': np.int64,
        'strand': np.int32, 'cdna_start': str, 'cdna_end': str,
        'p_start': str, 'p_end': str,
    },
    engine='c',
)

grp_df = df.groupby(['transcript_id', 'mutation_type'])['alt_allele'].count().reset_index()
grp_df.to_csv('out.pandas.tsv', index=False, sep='\t')

SQLite

I set some PRAGMA ... statements at the beginning to control some of the SQLite settings. They tell SQLite to use more cache, create temporary tables in memory, and disable all the transaction recovery settings. By default, SQLite writes a journal to disk before changing the actual database content, so if the program fails or any exception occurs, it can roll back incomplete transactions properly. In our case, we don’t care about the integrity of the database.

PRAGMA cache_size=-4192000;  -- Use about 4GB RAM as cache (negative value is in KiB)
PRAGMA temp_store=MEMORY;
PRAGMA synchronous=OFF;
PRAGMA journal_mode=OFF;
PRAGMA locking_mode=EXCLUSIVE;

.mode csv
.separator \t
CREATE TABLE mc3 (
    chrom       TEXT,
    "start"     INT,
    "end"       INT,
    strand      INT,
    mutation_type   TEXT,
    ref_allele  TEXT,
    alt_allele  TEXT,
    transcript_id   TEXT,
    hgvs_c      TEXT,
    hgvs_p      TEXT,
    cdna_start  INT,
    cdna_end    INT,
    p_start     INT,
    p_end       INT,
    normal_id   TEXT,
    tumor_id    TEXT
);
.import mc3_filtered.tsv mc3
-- Create an index to speed up grouping on the same columns
CREATE INDEX mc3_idx ON mc3 (transcript_id, mutation_type);

-- Output
.once out.sqlite.tsv
SELECT transcript_id, mutation_type, COUNT(alt_allele) AS c
FROM mc3
GROUP BY transcript_id, mutation_type;

PostgreSQL

I used pgloader to load the data into a local PostgreSQL database test_mc3. pgloader can take a script of its own mini-language.

LOAD CSV
    FROM 'mc3_filtered.tsv'
    INTO postgresql:///test_mc3?mc3
    WITH fields terminated by '\t',
         fields not enclosed,
         drop indexes
    BEFORE LOAD DO
    $$ DROP TABLE IF EXISTS mc3; $$,
    $$ CREATE TABLE mc3 (
            chrom       TEXT,
            "start"     BIGINT,
            "end"       BIGINT,
            strand      SMALLINT,
            mutation_type   TEXT,
            ref_allele  TEXT,
            alt_allele  TEXT,
            transcript_id   TEXT,
            hgvs_c      TEXT,
            hgvs_p      TEXT,
            cdna_start  INT,
            cdna_end    INT,
            p_start     INT,
            p_end       INT,
            normal_id   TEXT,
            tumor_id    TEXT
        );
    $$,
    $$ CREATE INDEX mc3_idx ON mc3 (transcript_id, mutation_type); $$
;

To do the grouping analysis, I used the built-in COPY command:

COPY (
    SELECT transcript_id, mutation_type, COUNT(alt_allele) AS c
    FROM mc3
    GROUP BY transcript_id, mutation_type
) TO '/private/tmp/mc3/MC3/out.psql.tsv' WITH (FORMAT TEXT);

Result

I didn’t run the benchmark systematically, but a few repeats showed similar numbers.

Method	Read data (sec)	Group-by analysis (sec)
Pandas	10.7	0.9
SQLite	27.7	4.0
PostgreSQL	82.6	13.5

In this case, all data can be loaded into memory easily, so pandas gave the best performance here. The grouping took less than a second to complete.

All databases loaded the data much more slowly than pandas. PostgreSQL ran a lot slower than SQLite, which I suspect has something to do with my server configuration, say, not enough cache or not enough working memory for the group-by operation. I feel like PostgreSQL can be faster, but this is the result I have so far. Note that all the databases are stored on a PCIe SSD. If they were on a normal hard drive, the database creation would take much longer.

However, after the data are loaded into the database, the query alone takes only seconds (4.0 seconds in SQLite). With pandas, one cannot skip the step of reading the data (10.7 seconds here) in every run, so if the analysis is on a frequently used dataset, a database like SQLite can yield better overall performance. Once the data get larger than the memory capacity, special care will be needed to make the pandas approach work, whereas a database can scale up with little fuss.

Conclusion

This post provides a different way to work with tabular data: working in a database. In-memory approaches like pandas work very efficiently on a small dataset, but one has to code the “how-tos” to scale to a larger dataset that cannot fit into memory (or whose loading overhead is too high). On the other hand, databases can easily scale to a few hundred GB in size and the query is fast. For analysis on a frequently used dataset, loading the data into a database first might be a good idea.

Another good thing about databases is that SQL makes joining across tables easy. One can join across multiple tables, say, to expand the gene annotation, without having to worry about how to implement the join. With indexing, the joining can be fast. In pandas, one generates many objects representing the joining results, but those objects cannot be easily shared between scripts. When one relies on storing the intermediate objects on disk, the accumulated overhead might be significant. Projects like Apache Arrow might ultimately solve the in-memory object passing, but its development is still in an early phase. As for databases, one can define reusable views for the joining logic and filtering results. This post didn’t really touch on this part, so I probably need another benchmark or post to back my thoughts.
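
To make the idea of a reusable view a bit more concrete, here is a minimal sketch in SQLite. The mc3 table is cut down to three columns, and the tx_annotation table, the mc3_by_gene view, and all the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
-- Cut-down variant table and a hypothetical transcript annotation table
CREATE TABLE mc3 (transcript_id TEXT, mutation_type TEXT, alt_allele TEXT);
CREATE TABLE tx_annotation (transcript_id TEXT PRIMARY KEY, gene_name TEXT);

INSERT INTO mc3 VALUES
    ('ENST00000269305', 'Missense_Mutation', 'T'),
    ('ENST00000269305', 'Missense_Mutation', 'A'),
    ('ENST00000269305', 'Nonsense_Mutation', 'A');
INSERT INTO tx_annotation VALUES ('ENST00000269305', 'TP53');

-- The joining and grouping logic is stored once in the database
CREATE VIEW mc3_by_gene AS
SELECT a.gene_name, m.mutation_type, COUNT(m.alt_allele) AS c
FROM mc3 AS m
JOIN tx_annotation AS a USING (transcript_id)
GROUP BY a.gene_name, m.mutation_type;
""")

# Any script connecting to the database can now query the view like a table
rows = conn.execute(
    'SELECT * FROM mc3_by_gene ORDER BY mutation_type').fetchall()
print(rows)
```

Since the view lives in the database file, other scripts (in Python, R, or the sqlite3 shell) get the same joining logic without re-implementing it.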

If one is analyzing variants, using databases or SQL in general has been backed by many practical projects. People at the Quinlan Lab have been building vcf2db to load variants into databases for downstream annotation and analysis. To scale way up to terabytes or petabytes of variant data, Google Cloud Genomics provides an interface to store and query variants in BigQuery, where users use standard SQL to select the variants of interest.

However, working in pandas gives users great room for flexibility. For example, one can iterate over rows and do some complex transformation of the value. Maybe it would be the optimal solution to use pandas.read_sql to run a query in a database.

It seems to me that many people rely too much on the features of special file formats such as bgzip and tabix and have forgotten the generic yet flexible approach of using databases. Those formats often optimize random access for a given genomic range query by indexing. In databases, such an index is analogous to an index on (chrom, start, -end), or even a GiST index on a range type in PostgreSQL. The query might be slower in databases, but aside from the performance, one can keep querying the records in the same way as any other data. For special formats, the functionality is much more limited.
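
As a rough sketch of such a genomic range query in SQLite, the snippet below builds a composite index and selects the features overlapping a region. For simplicity it indexes (chrom, start, end) rather than (chrom, start, -end). The table layout and the overlapping helper are assumptions for illustration; the chr17 coordinates are the TP53 transcript and UTRs from the gffutils post, and the chr13 row is made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE features (name TEXT, chrom TEXT, "start" INT, "end" INT)')
conn.executemany(
    'INSERT INTO features VALUES (?, ?, ?, ?)',
    [
        ('tx', 'chr17', 7571720, 7590856),
        ('utr5', 'chr17', 7590695, 7590856),
        ('utr3', 'chr17', 7571720, 7572929),
        ('other', 'chr13', 7571720, 7590856),  # made-up row
    ],
)
# Composite index so the chromosome and start position can narrow the search
conn.execute(
    'CREATE INDEX features_range_idx ON features (chrom, "start", "end")')


def overlapping(conn, chrom, start, end):
    """Names of features overlapping the closed interval [start, end]."""
    cur = conn.execute(
        'SELECT name FROM features '
        'WHERE chrom = ? AND "start" <= ? AND "end" >= ? '
        'ORDER BY "start", "end"',
        (chrom, end, start),
    )
    return [row[0] for row in cur]


print(overlapping(conn, 'chr17', 7572000, 7573000))
```

Two intervals overlap when each one starts before the other ends, which is what the WHERE clause expresses. The result can be joined or grouped further like any other query.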

Now I will give the database approach a try before writing my own data wrangling script.

EDIT 2018-01-28: Add real world examples of using databases to store variant data.

Using EnsDb's annotation database in Python

2017-11-17T00:00:00-06:00

I found that there isn’t a systematic way to query and convert genomic annotation IDs in Python. At least there isn’t one as good as what R/Bioconductor currently has. If you’ve never heard of R/Bioconductor annotation tool stack before, check out the official workflow or my post in 2016 specific for querying Ensembl annotations.

Although I enjoy using R for genomic annotation conversion, a few days ago I wanted to do the same thing inside my text processing script in Python. I might be able to rewrite the script in R, but I feel like R is not really the right tool for this task and, on top of that, I don’t know how to write efficient text processing code in R¹.

Knowing that all annotations in R are stored in single-file SQLite databases, I should be able to connect to the database directly from Python or any other language and write SQL queries to retrieve the same information. So my question now becomes how to find the path to the databases. It turns out that many new Bioconductor annotation packages are hosted via AnnotationHub, and users can search for the annotation objects and retrieve them locally by their ID. For example, all the recent Ensembl releases, e.g., EnsDb.Hsapiens.vXX, are available on AnnotationHub.

After digging around a bit, I was able to query AnnotationHub, download the correct EnsDb SQLite database file, and make SQL queries for the annotation ID conversion without any R package. I will share the details in the rest of the post.

AnnotationHub web interface
Manual query in AnnotationHub
Manual query in EnsDB
Summary

But before we start with the details, I want to clarify that it isn’t my intention to persuade people away from the current R ecosystem. The current R ecosystem is great and I recommend sticking with it as much as you can. I am pretty sure I would hit a lot of issues if I wanted to do more complex analyses or queries without the help of what the R packages provide.

AnnotationHub web interface

EDIT 2019-01-29
Now AnnotationHub has a nice web interface. With the new API, we can search and download all the EnsDb annotation objects on AnnotationHub by visiting https://annotationhub.bioconductor.org/package2/AHEnsDbs:

The web query interface of AnnotationHub

The following section is the old way to navigate through AnnotationHub’s database.

Manual query in AnnotationHub

When one wants to use the R package AnnotationHub, the common usage is

library(AnnotationHub)
ah <- AnnotationHub()
## snapshotDate(): 2017-10-27

query(ah, c("EnsDb", "Homo sapiens"))

The function call AnnotationHub() will download the latest version of the metadata of all available annotation objects. The subsequent query(...) function will talk to the local metadata database.

Now let’s do it manually without any R function calls.

The default AnnotationHub is at https://annotationhub.bioconductor.org/. By visiting the page we can find several relevant endpoints:

/metadata/annotationhub.sqlite3
/fetch/:id # id => rdatapaths.id

So as long as we get the rdatapaths.id of the EnsDb using the metadata, we can download it via the /fetch/:id endpoint.

After downloading the metadata database https://annotationhub.bioconductor.org/metadata/annotationhub.sqlite3, we can inspect it in SQLite3 by connecting to it directly:

sqlite3 annotationhub.sqlite3

Some useful commands to inspect a foreign database (or the ultimate help command .help):

sqlite> .header on
sqlite> .mode column
sqlite> .tables
biocversions       rdatapaths         schema_info        test
input_sources      recipes            statuses           timestamp
location_prefixes  resources          tags
sqlite> .schema rdatapaths
CREATE TABLE `rdatapaths`(`id` integer DEFAULT (NULL) NOT NULL PRIMARY KEY , `rdatapath` varchar(255) DEFAULT (NULL) NULL, `rdataclass` varchar(255) DEFAULT (NULL) NULL, `resource_id` integer DEFAULT (NULL) NULL, `dispatchclass` varchar(255) DEFAULT (NULL) NULL, CONSTRAINT `rdatapaths_ibfk_1` FOREIGN KEY (`resource_id`) REFERENCES `resources`(`id`));
CREATE INDEX `rdatapaths_resource_id` ON `rdatapaths` (`resource_id`);

So let’s make a SQL query to find all the human EnsDb releases:

SELECT r.ah_id, rdp.id AS rdatapaths_id, rdp.rdatapath, r.title
FROM resources AS r
JOIN rdatapaths AS rdp
ON r.id = rdp.resource_id
WHERE r.title LIKE '%EnsDb for Homo Sapiens%';
-- ah_id       rdatapaths_id  rdatapath                               title
-- ----------  -------------  --------------------------------------  ---------------------------------
-- AH53211     59949          AHEnsDbs/v87/EnsDb.Hsapiens.v87.sqlite  Ensembl 87 EnsDb for Homo Sapiens
-- AH53715     60453          AHEnsDbs/v88/EnsDb.Hsapiens.v88.sqlite  Ensembl 88 EnsDb for Homo Sapiens
-- AH56681     63419          AHEnsDbs/v89/EnsDb.Hsapiens.v89.sqlite  Ensembl 89 EnsDb for Homo Sapiens
-- AH57757     64495          AHEnsDbs/v90/EnsDb.Hsapiens.v90.sqlite  Ensembl 90 EnsDb for Homo Sapiens

All the Ensembl releases 87+ are available! I will use release 90 as an example. We can download it by its rdatapaths ID:

wget -O EnsDb.Hsapiens.v90.sqlite https://annotationhub.bioconductor.org/fetch/64495
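
Since the goal is to stay in Python, the lookup above can also be scripted with the built-in sqlite3 module. The following is a minimal sketch: it runs the same JOIN and turns each rdatapaths ID into a /fetch/:id URL. To keep it self-contained, it uses an in-memory stand-in holding only the columns the query needs and a single row copied from the output above (the resources.id value of 1 is made up); with the real file, one would pass sqlite3.connect("annotationhub.sqlite3") instead.

```python
import sqlite3

FETCH_URL = "https://annotationhub.bioconductor.org/fetch/{id}"


def find_ensdb(conn, title_pattern="%EnsDb for Homo Sapiens%"):
    """Return (ah_id, rdatapaths_id, title, fetch_url) for each matching resource."""
    rows = conn.execute(
        "SELECT r.ah_id, rdp.id, r.title "
        "FROM resources AS r JOIN rdatapaths AS rdp ON r.id = rdp.resource_id "
        "WHERE r.title LIKE ? ORDER BY rdp.id",
        (title_pattern,),
    )
    return [
        (ah_id, rdp_id, title, FETCH_URL.format(id=rdp_id))
        for ah_id, rdp_id, title in rows
    ]


# In-memory stand-in for annotationhub.sqlite3 (only the columns used here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, ah_id TEXT, title TEXT)")
conn.execute(
    "CREATE TABLE rdatapaths (id INTEGER PRIMARY KEY, rdatapath TEXT, resource_id INTEGER)"
)
# resources.id = 1 is a placeholder; the other values come from the query output.
conn.execute(
    "INSERT INTO resources VALUES (1, 'AH57757', 'Ensembl 90 EnsDb for Homo Sapiens')"
)
conn.execute(
    "INSERT INTO rdatapaths VALUES (64495, 'AHEnsDbs/v90/EnsDb.Hsapiens.v90.sqlite', 1)"
)

results = find_ensdb(conn)
for ah_id, rdp_id, title, url in results:
    print(ah_id, rdp_id, title, url)
```

The resulting URL can be handed to wget as above, or to any HTTP client.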

For older Ensembl releases, one may need to build the SQLite database by following the instructions from ensembldb. For the last GRCh37 release, Ensembl release 75, one can download the source of the Bioconductor annotation package EnsDb.Hsapiens.v75 and extract it. The database will be under inst/extdata.

Manual query in EnsDB

EnsDb SQLite databases are Ensembl annotation databases created by the R package ensembldb.

Here I will show how to find a transcript’s gene name, its genomic location, and all its exon locations given its Ensembl transcript ID.

First, connect to the database with sqlite3 EnsDb.Hsapiens.v90.sqlite. Its table design is very straightforward:

sqlite> .tables
chromosome      exon            metadata        protein_domain  tx2exon
entrezgene      gene            protein         tx              uniprot

So it didn’t take me long to figure out how to join the transcript and gene information:

SELECT tx.tx_id, tx.gene_id, gene.gene_name, seq_name, seq_strand
FROM tx JOIN gene ON tx.gene_id = gene.gene_id
WHERE tx_id='ENST00000358731';
-- tx_id            gene_id          gene_name   seq_name    seq_strand
-- ---------------  ---------------  ----------  ----------  ----------
-- ENST00000358731  ENSG00000145734  BDP1        5           1
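
The same join can be issued from Python, which was the original motivation. Below is a minimal sketch with the built-in sqlite3 module and a parameterized query. For the example to run without the downloaded file, it builds a tiny in-memory stand-in with only the columns used in the query and the single row shown above; placing seq_name and seq_strand in the gene table is an assumption of this stand-in. With the real database, one would connect to EnsDb.Hsapiens.v90.sqlite instead.

```python
import sqlite3


def tx_to_gene(conn, tx_id):
    """Return (tx_id, gene_id, gene_name, seq_name, seq_strand) or None."""
    return conn.execute(
        "SELECT tx.tx_id, tx.gene_id, gene.gene_name, seq_name, seq_strand "
        "FROM tx JOIN gene ON tx.gene_id = gene.gene_id "
        "WHERE tx.tx_id = ?",
        (tx_id,),
    ).fetchone()


# In-memory stand-in for the EnsDb file, reduced to the columns queried here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (gene_id TEXT, gene_name TEXT, seq_name TEXT, seq_strand INTEGER)"
)
conn.execute("CREATE TABLE tx (tx_id TEXT, gene_id TEXT)")
conn.execute("INSERT INTO gene VALUES ('ENSG00000145734', 'BDP1', '5', 1)")
conn.execute("INSERT INTO tx VALUES ('ENST00000358731', 'ENSG00000145734')")

record = tx_to_gene(conn, "ENST00000358731")
print(record)
```

Passing the ID as a query parameter rather than formatting it into the SQL string avoids quoting problems when the IDs come from a text file.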

And for the genomic ranges of a transcript’s exons (here using another transcript, ENST00000380139):

SELECT tx_id, exon_idx, exon_seq_start, exon_seq_end
FROM tx2exon JOIN exon ON tx2exon.exon_id = exon.exon_id
WHERE tx_id = 'ENST00000380139'
ORDER BY exon_idx;
-- tx_id            exon_idx    exon_seq_start  exon_seq_end
-- ---------------  ----------  --------------  ------------
-- ENST00000380139  1           32427904        32428133
-- ENST00000380139  2           32407645        32407772
-- ENST00000380139  3           32407250        32407338
-- ENST00000380139  4           32404203        32404271
-- ENST00000380139  5           32400723        32403200

All the coordinates are 1-based and the ranges are inclusive.
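
The coordinate convention matters as soon as the ranges are passed to other tools. As a small worked example, the sketch below converts a 1-based inclusive range into a 0-based half-open one (the convention used by BED files) and computes the exon length, using the first exon from the query output above. The helper names are made up for this illustration.

```python
def to_zero_based_half_open(start, end):
    """Convert a 1-based inclusive range into a 0-based half-open range."""
    return start - 1, end


def range_length(start, end):
    """Length of a 1-based inclusive range."""
    return end - start + 1


# First exon of ENST00000380139 from the query output above.
first_exon = (32427904, 32428133)

bed_interval = to_zero_based_half_open(*first_exon)
length = range_length(*first_exon)
print(bed_interval, length)
```

Both conventions give the same length: end minus start plus one for the inclusive range, and plain end minus start for the half-open one.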

Summary

By downloading the underlying annotation database, one can run the same annotation queries outside of R, which can sometimes be helpful. Instead of trying to come up with my own layout of annotation mappings across multiple sources, I find it more reliable to use an official build. After all, it is very hard to get the annotation mapping correct, and there are tons of corner cases that require careful and systematic decisions. So I don’t recommend building one’s own mapping in the first place. The method here should make annotation queries outside of R a bit easier.

Potentially one could replicate the full R infrastructure in other languages on top of the same underlying database to provide the same experience, but it would require substantial work to get the infrastructure done and correct.

EDIT 2017-12-13: Add instructions of using older Ensembl release.
EDIT 2019-01-29: Add the web interface of AnnotationHub.

Based on my impression, my R expert friends would probably recommend that I write it with Rcpp, which I think would be overkill for such a small task. But my impression can be wrong. Feel free to share your thoughts! ↩

Use Snakemake on Google cloud

2017-08-10T00:00:00-05:00

TL;DR Run an RNA-seq pipeline using Snakemake locally and later port it to Google Cloud. Snakemake can parallelize the jobs of a pipeline, even across machines.

Snakemake has been my favorite workflow management system for a while. I came across it while writing my master’s thesis, and from the first look it already appeared to be extremely flexible and powerful. I got some time to play with it during my lab rotation and now, after joining the lab, I am using it in many of my research projects. With more and more projects in the lab relying on virtualization like Docker, package management like bioconda, and cloud computing like Google Cloud, I would like to continue using Snakemake in those scenarios as well. Hence this post, to write down all the details.

The post will introduce Snakemake by writing the pipeline locally, then gradually move towards Docker and more Google Cloud products, e.g., Google Cloud Storage, Google Compute Engine (GCE), and Google Container Engine (GKE). The Snakemake tutorial is a good place to start to understand how Snakemake works.

RNA-seq dataset and pipeline for demonstration
Installation of snakemake and all related tools
Snakemake local pipeline execution
Snakemake on Google Cloud
- Move input files to the cloud (from Google Cloud Storage)
- Store output files on the cloud
Dockerize the environment
- Use Google Cloud Storage in Docker image
Google Container Engine (GKE)
- Potential issues of using GKE with Snakemake
Summary

RNA-seq dataset and pipeline for demonstration

In this example, I will use ~/snakemake_example to store all the files and output. Make sure you change all the paths to be relative to the actual folder on your machine.

The demo pipeline will be an RNA-seq pipeline for transcript-level expression analysis, often called the new Tuxedo pipeline, involving HISAT2 and StringTie. The RNA-seq dataset is from Griffith Lab’s RNA-seq tutorial which,

… consists of two commercially available RNA samples: Universal Human Reference (UHR) and Human Brain Reference (HBR). The UHR is total RNA isolated from a diverse set of 10 cancer cell lines. The HBR is total RNA isolated from the brains of 23 Caucasians, male and female, of varying age but mostly 60-80 years old.

(From the wiki page “RNA-seq Data” of the tutorial)

Our RNA-seq raw data are the 10% downsampled FASTQ files for these samples. For the human genome reference, only the chromosome 22 from GRCh38 is used. The gene annotation is from Ensembl Version 87. Let’s download all the samples and annotations.

$ cd ~/snakemake_example
$ wget https://storage.googleapis.com/lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr.tar.gz
$ tar xf griffithlab_brain_vs_uhr.tar.gz

Now you should have the following file structure:

~/snakemake_example
├── griffithlab_brain_vs_uhr/
│   ├── GRCh38_Ens87_chr22_ERCC/
│   │   ├── chr22_ERCC92.fa
│   │   └── genes_chr22_ERCC92.gtf
│   └── HBR_UHR_ERCC_ds_10pc/
│       ├── HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
│       ├── HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
│       ├── ...
│       ├── UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
│       └── UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
└── griffithlab_brain_vs_uhr.tar.gz

Installation of snakemake and all related tools

After installing conda and setting up bioconda, the installation is simple. All the dependencies are kept in a conda environment called new_tuxedo.

$ conda create -n new_tuxedo \
    python=3.6 snakemake hisat2 stringtie samtools
$ source activate new_tuxedo        # Use the conda env
(new_tuxedo) $ hisat2 --version     # Tools are available in the env
/Users/liang/miniconda3/envs/new_tuxedo/bin/hisat2-align-s version 2.1.0
...
(new_tuxedo) $ source deactivate    # Exit the env
$ hisat2 --version                  # Tools are isolated in the env
bash: hisat2: command not found

All the following steps should be run inside this conda environment unless it’s specified otherwise.

Snakemake local pipeline execution

The RNA-seq pipeline largely consists of the following steps:

Build HISAT2 genome reference index for alignment
Align sample reads to the genome by HISAT2
Assemble per-sample transcripts by StringTie
Merge per-sample transcripts by StringTie
Quantify transcript abundance by StringTie

To give a taste of how to write a Snakemake pipeline, I will implement it gradually by breaking it into three major parts: genome reference index build, alignment, and transcript assessment.

Genome reference index build (How to write snakemake rules)

To build the genome reference, we need to extract the splice sites and exons by two of the HISAT2 scripts, hisat2_extract_splice_sites.py and hisat2_extract_exons.py. Then we call hisat2-build to build the index. Create a new file at ~/snakemake_example/Snakefile with the following content:

GENOME_FA = "griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/chr22_ERCC92.fa"
GENOME_GTF = "griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf"
HISAT2_INDEX_PREFIX = "hisat2_index/chr22_ERCC92"

rule extract_genome_splice_sites:
    input: GENOME_GTF
    output: "hisat2_index/chr22_ERCC92.ss"
    shell: "hisat2_extract_splice_sites.py {input} > {output}"

rule extract_genome_exons:
    input: GENOME_GTF
    output: "hisat2_index/chr22_ERCC92.exon"
    shell: "hisat2_extract_exons.py {input} > {output}"

rule build_hisat_index:
    input:
        genome_fa=GENOME_FA,
        splice_sites="hisat2_index/chr22_ERCC92.ss",
        exons="hisat2_index/chr22_ERCC92.exon",
    output: expand(f"{HISAT2_INDEX_PREFIX}.{{ix}}.ht2", ix=range(1, 9))
    log: "hisat2_index/build.log"
    threads: 8
    shell:
        "hisat2-build -p {threads} {input.genome_fa} "
        "--ss {input.splice_sites} --exon {input.exons} {HISAT2_INDEX_PREFIX} "
        "2>{log}"

Overall, a Snakefile is Python-based, so one can define variables and functions, import Python libraries, and use all the string operations as one does in Python source code. Here I defined some constants for the genome reference files (GENOME_FA and GENOME_GTF) and the output index prefix (HISAT2_INDEX_PREFIX) because they will get quite repetitive, and specifying them at the front makes future modifications easier.

In case one hasn’t read the Snakemake Tutorial, here is an overview of the Snakemake pipeline execution. A Snakemake rule is similar to a Makefile rule. In a rule, one can specify the input pattern and the output pattern, as well as the command to run. When snakemake runs, all the outputs the user wants to generate are translated into a set of rules to be run. Based on the desired output, Snakemake finds the rule that can generate it (by matching the rule’s output pattern) and the input that rule requires. The search continues rule after rule, since some input of a rule may depend on the output of another rule, until all the inputs are available. Then Snakemake starts to generate the output by running the command each rule gives.
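
The backward search described above can be pictured with a few lines of plain Python. This is only a toy sketch, not how Snakemake is implemented: rules are reduced to a dictionary from one output file to its inputs, the file names are shortened placeholders, and there is no pattern matching.

```python
# Toy rules: each output file maps to the input files it needs.
RULES = {
    "index.ss": ["genes.gtf"],
    "index.exon": ["genes.gtf"],
    "index.ht2": ["genome.fa", "index.ss", "index.exon"],
}
# Files assumed to be already on disk.
EXISTING_FILES = {"genes.gtf", "genome.fa"}


def plan(target, rules, existing, order=None):
    """Return the outputs to build, dependencies first, ending with target."""
    if order is None:
        order = []
    if target in existing or target in order:
        return order  # nothing to do: already available or already planned
    if target not in rules:
        raise ValueError(f"No rule produces {target!r} and it does not exist")
    for dependency in rules[target]:
        plan(dependency, rules, existing, order)
    order.append(target)
    return order


build_order = plan("index.ht2", RULES, EXISTING_FILES)
print(build_order)
```

Asking for index.ht2 pulls in the two extraction steps first, mirroring how requesting build_hisat_index triggers the two extract rules in the Snakefile.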

Now we can look at the three rules in our current Snakefile.

The first rule extract_genome_splice_sites extracts the genome splice sites. The input file is GENOME_GTF which is the Ensembl gene annotation. The output is a file at hisat2_index/chr22_ERCC92.ss. The command to generate the output from the given input is a shell command. The command contains some variables, {input} and {output}, which Snakemake fills in with the specified input and output. So when the first rule is activated, Snakemake will let the Bash shell run:

hisat2_extract_splice_sites.py \
    griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf \
    > hisat2_index/chr22_ERCC92.ss

The second rule extract_genome_exons is quite similar to the first one, but extracts the genome exons and stores them in hisat2_index/chr22_ERCC92.exon.

The third rule build_hisat_index builds the actual index. Input can be multiple files; in this case there are three entries: the chromosome sequence, the splice sites, and the exons. One can later refer to a specific input by its entry name. For example, {input.genome_fa} means the chromosome sequence FASTA file.

The output of the third rule is expand(f"{HISAT2_INDEX_PREFIX}.{{ix}}.ht2", ix=range(1, 9)), where expand(...) is a Snakemake function that expands a string pattern into a list of strings. In this case the generated index files are <index_prefix>.1.ht2, …, <index_prefix>.8.ht2. Instead of specifying the output eight times, we use expand and pass a variable ix to iterate from 1 to 8. The double curly brackets escape the f"..." f-string interpolation (see the Python documentation). So the whole process to interpret the output is:

expand(f"{HISAT2_INDEX_PREFIX}.{{ix}}.ht2", ix=range(1, 9))
expand("hisat2_index/chr22_ERCC92.{ix}.ht2", ix=range(1, 9))
"hisat2_index/chr22_ERCC92.1.ht2", "hisat2_index/chr22_ERCC92.2.ht2", ..., "hisat2_index/chr22_ERCC92.8.ht2"

For the rest of the entries, such as threads and log, one can find more information in the Snakemake documentation about Rules.
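
To try the two-step interpolation of the index file names outside of Snakemake, here is a minimal sketch in plain Python. The function expand_pattern is a hypothetical stand-in written for this example; it only mimics the basic behavior of Snakemake's expand (every combination of the given values) using str.format.

```python
from itertools import product


def expand_pattern(pattern, **wildcards):
    """Fill the pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


HISAT2_INDEX_PREFIX = "hisat2_index/chr22_ERCC92"

# Step 1: the f-string fills in the prefix; {{ix}} is left behind as {ix}.
pattern = f"{HISAT2_INDEX_PREFIX}.{{ix}}.ht2"
# Step 2: the remaining {ix} placeholder is expanded over 1..8.
index_files = expand_pattern(pattern, ix=range(1, 9))

print(pattern)
print(index_files)
```

Without the doubled brackets, the f-string would try to look up a Python variable named ix in step 1 and fail with a NameError.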

Run Snakemake

Let’s build the genome reference index.

$ snakemake -j 8 -p build_hisat_index
Provided cores: 8
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   build_hisat_index
    1   extract_genome_exons
    1   extract_genome_splice_sites
    3

rule extract_genome_exons:
    input: griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf
    output: hisat2_index/chr22_ERCC92.exon
    jobid: 1

hisat2_extract_exons.py griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf > hisat2_index/chr22_ERCC92.exon
...
3 of 3 steps (100%) done

The command snakemake -j 8 -p build_hisat_index means:

-j 8: Use 8 cores
-p: Print the actual command of each job
build_hisat_index: The rule or certain output to be generated

If one runs it again, one will find that snakemake won’t do anything since all the outputs are present and up to date.

$ snakemake -j 8 -p build_hisat_index
Nothing to be done.

Sample alignment (How to write a general rule)

Let’s write the rule to do the sample alignment. Append the following content to the Snakefile:

SAMPLES, *_ = glob_wildcards('griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/{sample}.read1.fastq.gz')

rule align_hisat:
    input:
        hisat2_index=expand(f"{HISAT2_INDEX_PREFIX}.{{ix}}.ht2", ix=range(1, 9)),
        fastq1="griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/{sample}.read1.fastq.gz",
        fastq2="griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/{sample}.read2.fastq.gz",
    output: "align_hisat2/{sample}.bam"
    log: "align_hisat2/{sample}.log"
    threads: 4
    shell:
        "hisat2 -p {threads} --dta -x {HISAT2_INDEX_PREFIX} "
        "-1 {input.fastq1} -2 {input.fastq2} 2>{log} | "
        "samtools sort -@ {threads} -o {output}"

rule align_all_samples:
    input: expand("align_hisat2/{sample}.bam", sample=SAMPLES)

There are two rules here but only align_hisat does the real work. The rule looks familiar but there is something new. There is an unresolved variable {sample} in the input, output and log entries, such as fastq1=".../{sample}.read1.fastq.gz". So this rule will apply to all outputs that match the pattern align_hisat2/{sample}.bam. For example, given an output align_hisat2/mysample.bam, Snakemake will look for the inputs griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/mysample.read1.fastq.gz and mysample.read2.fastq.gz in the same folder, where sample = "mysample" in this case.

To get the names of all the samples, we use glob_wildcards(...) which finds all the files that match the given string pattern, and collects the possible values of the variables in the string pattern as a list. Hence all the sample names are stored in SAMPLES. The other rule, align_all_samples, lists every sample’s BAM file as its input, so requesting it triggers the alignment of all samples.
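
The idea behind collecting wildcard values can be sketched in plain Python with a regular expression. This is a simplified illustration rather than Snakemake's glob_wildcards itself: it handles a single placeholder, works on a given list of paths instead of scanning the disk, and uses shortened, hypothetical file names.

```python
import re


def collect_wildcard(pattern, paths, name="sample"):
    """Return the value of the {name} placeholder for every matching path."""
    prefix, suffix = pattern.split("{" + name + "}")
    regex = re.compile(re.escape(prefix) + r"(.+)" + re.escape(suffix))
    values = []
    for path in paths:
        match = regex.fullmatch(path)
        if match:
            values.append(match.group(1))
    return values


# Hypothetical, shortened file names standing in for the FASTQ folder.
paths = [
    "fastq/HBR_Rep1.read1.fastq.gz",
    "fastq/HBR_Rep1.read2.fastq.gz",
    "fastq/UHR_Rep1.read1.fastq.gz",
    "fastq/UHR_Rep1.read2.fastq.gz",
]

samples = collect_wildcard("fastq/{sample}.read1.fastq.gz", paths)
print(samples)
```

Matching only the read1 files gives each sample name once, which is why the pattern in the Snakefile ends with .read1.fastq.gz.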

Now run Snakemake again with a different rule target:

snakemake -j 8 -p align_all_samples

This time, pay attention to the CPU usage (say, using htop). One should find that snakemake runs jobs in parallel and tries to use as many cores as possible.

Transcript assessment

Let’s complete the whole pipeline by adding all StringTie steps to Snakefile:

from pathlib import Path

rule stringtie_assemble:
    input:
        genome_gtf=GENOME_GTF,
        bam="align_hisat2/{sample}.bam"
    output: "stringtie/assembled/{sample}.gtf"
    threads: 4
    shell:
        "stringtie -p {threads} -G {input.genome_gtf} "
        "-o {output} -l {wildcards.sample} {input.bam}"

rule stringtie_merge_list:
    input: expand("stringtie/assembled/{sample}.gtf", sample=SAMPLES)
    output: "stringtie/merged_list.txt"
    run:
        with open(output[0], 'w') as f:
            for gtf in input:
                print(Path(gtf).resolve(), file=f)

rule stringtie_merge:
    input:
        genome_gtf=GENOME_GTF,
        merged_list="stringtie/merged_list.txt",
        sample_gtfs=expand("stringtie/assembled/{sample}.gtf", sample=SAMPLES)
    output: "stringtie/merged.gtf"
    threads: 4
    shell:
        "stringtie --merge -p {threads} -G {input.genome_gtf} "
        "-o {output} {input.merged_list}"

rule stringtie_quant:
    input:
        merged_gtf="stringtie/merged.gtf",
        sample_bam="align_hisat2/{sample}.bam"
    output:
        gtf="stringtie/quant/{sample}/{sample}.gtf",
        ctabs=expand(
            "stringtie/quant/{{sample}}/{name}.ctab",
            name=['i2t', 'e2t', 'i_data', 'e_data', 't_data']
        )
    threads: 4
    shell:
        "stringtie -e -B -p {threads} -G {input.merged_gtf} "
        "-o {output.gtf} {input.sample_bam}"

rule quant_all_samples:
    input: expand("stringtie/quant/{sample}/{sample}.gtf", sample=SAMPLES)

Most rules are similar to the previous ones except for stringtie_merge_list. In this step, a file is generated that lists the paths to all the samples’ GTF files. Instead of running a shell command (there is no shell entry), a run entry is used to write a Python code snippet that generates the file.

Another thing to be noted is the output entry ctabs=... of stringtie_quant. The following lines are equivalent:

# Before expansion
ctabs=expand(
    "stringtie/quant/{{sample}}/{name}.ctab",
    name=['i2t', 'e2t', 'i_data', 'e_data', 't_data']
)
# After expansion
ctabs="stringtie/quant/{sample}/i2t.ctab",
    "stringtie/quant/{sample}/e2t.ctab",
    ...,
    "stringtie/quant/{sample}/t_data.ctab"

The full Snakefile can be found here.

Job dependencies and DAG

Now with the pipeline complete, we can take a further look at how all the rules are chained together. Snakemake has a command to generate the job dependency graph (a DAG):

snakemake --dag quant_all_samples | dot -Tsvg > dag.svg

Snakemake job dependency graph.

Snakemake generates this DAG before execution, where each node represents a job. As long as two nodes are not connected to each other and their inputs exist, they can be executed in parallel. This is a powerful feature for pipeline management, since it allows the resources to be used at a fine granularity.
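
To see why a DAG makes parallel execution straightforward, here is a small sketch using graphlib from the Python standard library (Python 3.9+). The job names describe a hypothetical two-sample version of the pipeline, and the grouping into waves is a simplification: a real scheduler can start a job as soon as its own dependencies finish, without waiting for the whole wave.

```python
from graphlib import TopologicalSorter

# job -> set of jobs it depends on (hypothetical two-sample pipeline)
DEPENDENCIES = {
    "build_index": set(),
    "align_A": {"build_index"},
    "align_B": {"build_index"},
    "assemble_A": {"align_A"},
    "assemble_B": {"align_B"},
    "merge": {"assemble_A", "assemble_B"},
}


def parallel_batches(dependencies):
    """Group jobs into waves; jobs within one wave can run at the same time."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every job whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(DEPENDENCIES)
for number, jobs in enumerate(batches, start=1):
    print(number, jobs)
```

The two alignment jobs land in the same wave because neither depends on the other, which is the situation where Snakemake can use several cores at once.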

A simpler graph that shows rules instead of jobs can be generated by:

snakemake --rulegraph quant_all_samples | dot -Tsvg > ruledag.svg

Snakemake rule dependency graph.

Snakemake on Google Cloud

Now we start to move our Snakemake pipeline to Google Cloud. To complete all the following steps, one needs a Google account and a bucket on Google Cloud Storage with write access, so that the output can be uploaded back to Google Cloud Storage. For Snakemake to be able to download and upload files from the cloud, one needs to set up the Google Cloud SDK on the local machine and create the default application credentials:

gcloud auth application-default login

Also, install the necessary Python package to give Snakemake access to the storage API:

conda install google-cloud-storage

Snakemake actually supports remote files from many more providers. More details can be found in the Snakemake documentation.

Note that although one can run this section on a local machine, this step will be significantly faster if one runs it on a Google Compute Engine (GCE) instance. It also saves extra bandwidth and fees.

Move input files to the cloud (from Google Cloud Storage)

Let’s modify the Snakefile to use the reference and FASTQ files from Google Cloud Storage. Replace those file paths with the following:

from pathlib import Path
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
GS = GSRemoteProvider()

GS_PREFIX = "lbwang-playground/snakemake_rnaseq"
GENOME_FA =  GS.remote(f"{GS_PREFIX}/griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/chr22_ERCC92.fa")
GENOME_GTF = GS.remote(f"{GS_PREFIX}/griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf")
HISAT2_INDEX_PREFIX = "hisat2_index/chr22_ERCC92"

SAMPLES, *_ = GS.glob_wildcards(GS_PREFIX + '/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/{sample}.read1.fastq.gz')

# rule extract_genome_splice_sites:
# ...

rule align_hisat:
    input:
        hisat2_index=expand(f"{HISAT2_INDEX_PREFIX}.{{ix}}.ht2", ix=range(1, 9)),
        fastq1=GS.remote(GS_PREFIX + "/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/{sample}.read1.fastq.gz"),
        fastq2=GS.remote(GS_PREFIX + "/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/{sample}.read2.fastq.gz"),
    # ...

Now all the file paths are on Google Cloud Storage under the bucket lbwang-playground. For example, GENOME_FA points to gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/chr22_ERCC92.fa.

One could launch Snakemake again:

snakemake --timestamp -p --verbose --keep-remote -j 8 quant_all_samples

Store output files on the cloud

Although we could wrap all the file paths in GS.remote(...), there is a simpler way to replace every path through command line options. On top of that, we need to add a FULL_HISAT2_INDEX_PREFIX variable to reflect the path change, since the index now lives under the writable bucket. Replace all {WRITABLE_BUCKET_PATH} with a writable Google Cloud Storage bucket.

HISAT2_INDEX_PREFIX = "hisat2_index/chr22_ERCC92"
FULL_HISAT2_INDEX_PREFIX = "{WRITABLE_BUCKET_PATH}/hisat2_index/chr22_ERCC92"

rule build_hisat_index:
    # ...
    shell:
        "hisat2-build -p {threads} {input.genome_fa} "
        "--ss {input.splice_sites} --exon {input.exons} {FULL_HISAT2_INDEX_PREFIX} "
        "2>{log}"

rule align_hisat:
    # ...
    shell:
        "hisat2 -p {threads} --dta -x {FULL_HISAT2_INDEX_PREFIX} "
        "-1 {input.fastq1} -2 {input.fastq2} 2>{log} | "
        "samtools sort -@ {threads} -o {output}"

The full Snakefile can be found here. Now run Snakemake with the following options:

snakemake --timestamp -p --verbose --keep-remote -j 8 \
        --default-remote-provider GS \
        --default-remote-prefix {WRITABLE_BUCKET_PATH} \
        quant_all_samples

To understand how the remote files work, here is the folder structure after the execution:

~/snakemake_example
├── lbwang-playground/
│   └── snakemake_rnaseq/
│       └── griffithlab_brain_vs_uhr/
│           ├── GRCh38_Ens87_chr22_ERCC/
│           └── HBR_UHR_ERCC_ds_10pc/
├── {WRITABLE_BUCKET_PATH}/
│   ├── align_hisat2/
│   ├── hisat2_index/
│   └── stringtie/
└── Snakefile

So Snakemake simply downloads or generates the files locally under the same full path they have on the remote storage.
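
The mapping suggested by the folder structure above can be written down as a tiny helper. This is only a sketch of the observed layout, not Snakemake's own code; the function name and the bucket path my-bucket/run1 are hypothetical.

```python
from pathlib import PurePosixPath


def remote_locations(remote_prefix, relative_path):
    """Return (gs_url, local_mirror_path) for a path under a remote prefix."""
    full_path = PurePosixPath(remote_prefix) / relative_path
    # The local copy sits under the working directory at the same full path.
    return f"gs://{full_path}", str(full_path)


# "my-bucket/run1" is a placeholder for {WRITABLE_BUCKET_PATH}.
gs_url, local_path = remote_locations("my-bucket/run1", "align_hisat2/sample1.bam")
print(gs_url)
print(local_path)
```

This is why the working directory ends up with one top-level folder per bucket after a run with --keep-remote.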

Dockerize the environment

Although bioconda has made the package installation very easy, it would be easier to just isolate the whole environment at the operating system level. One common approach is to use Docker.

A minimal working Dockerfile would be:

FROM continuumio/miniconda3
RUN conda install -y python=3.6 nomkl \
        stringtie samtools hisat2 snakemake google-cloud-storage \
    && conda clean -y --all

However, some details required extra care at the time of writing, so I’ve created a Docker image for this pipeline on Docker Hub, lbwang/snakemake-conda-rnaseq. One should be able to run snakemake by:

cd ~/snakemake_example
docker run -t                       \
    -v $(pwd):/analysis             \
    lbwang/snakemake-conda-rnaseq   \
    snakemake -j 2 --timestamp      \
        -s /analysis/Snakefile --directory /analysis \
        quant_all_samples

Use Google Cloud Storage in Docker image

To use Google’s Cloud products in a Docker image, one needs to install Google Cloud SDK inside the Docker image. Refer to Google’s Dockerfile with Cloud SDK for details. lbwang/snakemake-conda-rnaseq has the Cloud SDK installed.

sudo docker run -t -i                           \
    -v $(pwd):/analysis                         \
    -v ~/.config/gcloud:/root/.config/gcloud    \
    lbwang/snakemake-conda-rnaseq               \
    snakemake -j 4 --timestamp --verbose -p --keep-remote   \
        -s /analysis/Snakefile --directory /analysis        \
        --default-remote-provider GS --default-remote-prefix "{WRITABLE_BUCKET_PATH}" \
        quant_all_samples

To run Docker on a GCE VM instance, the host machine (the VM instance) needs to have Docker installed. One may refer to Docker’s official installation guide to install it. VM instances by default inherit the user’s permissions (via the automatically created service account), thus the command above should apply to the GCE instance as well.

Google Container Engine (GKE)

To scale up the pipeline execution across multiple machines, Snakemake can use Google Container Engine (GKE, implemented on top of Kubernetes). This method is built on Docker: each node pulls down the given Docker image to load the environment. After some discussions about how to specify a user-provided image ¹, on Snakemake 4.1+ one is able to specify the Docker image the Kubernetes nodes use by --container-image <image>.

To install the master branch of Snakemake, run:

pip install git+https://bitbucket.org/snakemake/snakemake.git@master

Following Snakemake’s GKE guide, extra packages need to be installed to talk to the GKE (Kubernetes) cluster:

pip install kubernetes
gcloud components install kubectl
# or Debian on GCE:
# sudo apt-get install kubectl

First we create the GKE cluster by:

export CLUSTER_NAME="snakemake-cluster"
export ZONE="us-central1-a"
gcloud container clusters create $CLUSTER_NAME \
    --zone=$ZONE --num-nodes=3 \
    --machine-type="n1-standard-4" \
    --scopes storage-rw
gcloud container clusters get-credentials --zone=$ZONE $CLUSTER_NAME

This will launch 3 GCE VM instances using the n1-standard-4 machine type (4 CPUs). Therefore the cluster has a total of 12 CPUs available for computation. Modify the variables to fit one’s setting.

Note that some rules may specify a number of CPUs that no node in the cluster has; say, the rule build_hisat_index specifies 8 threads. In this case, the cluster cannot find a node with enough free CPUs to forward the job to a pod and the cluster will halt. Therefore, make sure to lower the threads to a reasonable number (or use configfile to apply it to multiple samples). We will continue to use the same Docker image lbwang/snakemake-conda-rnaseq as the Kubernetes container image.

By default, Snakemake only re-runs a job when its output files are missing or outdated, that is, older than their input files. To force it to re-run the pipeline, one might need to remove the generated output before calling Snakemake again:
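
A quick sanity check of the thread settings against the cluster size can be done with a few lines of Python before launching. The sketch below uses the thread counts from the Snakefile in this post and the cluster created above; the helper names are made up, and it compares against the nominal CPU count only, whereas a real cluster may reserve part of each node for its own services.

```python
# Thread requests per rule, copied from the Snakefile in this post.
RULE_THREADS = {
    "build_hisat_index": 8,
    "align_hisat": 4,
    "stringtie_assemble": 4,
    "stringtie_merge": 4,
    "stringtie_quant": 4,
}
NUM_NODES = 3
CPUS_PER_NODE = 4  # nominal CPU count of n1-standard-4


def rules_exceeding_node(rule_threads, cpus_per_node):
    """Return the rules asking for more threads than a single node offers."""
    return sorted(name for name, n in rule_threads.items() if n > cpus_per_node)


def cap_threads(rule_threads, cpus_per_node):
    """Return a copy with every request lowered to at most cpus_per_node."""
    return {name: min(n, cpus_per_node) for name, n in rule_threads.items()}


total_cpus = NUM_NODES * CPUS_PER_NODE
too_large = rules_exceeding_node(RULE_THREADS, CPUS_PER_NODE)
capped = cap_threads(RULE_THREADS, CPUS_PER_NODE)

print(total_cpus, too_large, capped["build_hisat_index"])
```

Note that the total of 12 CPUs does not help build_hisat_index: a single job has to fit on a single node, so the per-node count is the limit that matters.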

gsutil -m rm -r gs://{WRITABLE_BUCKET_PATH}/{align_hisat2,hisat2_index,stringtie}
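
For intuition, an "outdated" check of the kind mentioned above can be pictured as a comparison of file modification times, as Make-like tools commonly do. The sketch below is an illustration under that assumption, not Snakemake's actual implementation; the file names are placeholders created in a temporary folder.

```python
import os
import tempfile


def is_outdated(output_path, input_paths):
    """True if the output is missing or older than any of its inputs."""
    if not os.path.exists(output_path):
        return True
    output_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(path) > output_mtime for path in input_paths)


with tempfile.TemporaryDirectory() as tmp:
    reads = os.path.join(tmp, "reads.fastq")
    bam = os.path.join(tmp, "aligned.bam")
    for path in (reads, bam):
        open(path, "w").close()

    # Output newer than input: nothing to do.
    os.utime(reads, (1000, 1000))
    os.utime(bam, (2000, 2000))
    up_to_date_case = is_outdated(bam, [reads])

    # Input touched after the output was made: the job must run again.
    os.utime(reads, (3000, 3000))
    stale_case = is_outdated(bam, [reads])

    # Output removed (or never made): the job must run.
    missing_case = is_outdated(os.path.join(tmp, "missing.bam"), [reads])

print(up_to_date_case, stale_case, missing_case)
```

Deleting the outputs, as the gsutil command does, puts every job in the third case.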

Then we are able to run the pipeline again.

snakemake                                            \
    --timestamp -p --verbose --keep-remote           \
    -j 12 --kubernetes                               \
    --container-image lbwang/snakemake-conda-rnaseq \
    --default-remote-provider GS                     \
    --default-remote-prefix {WRITABLE_BUCKET_PATH}   \
    quant_all_samples

Note that since we change the container image, we have to make sure the versions of Snakemake in the Docker image and on the machine starting the pipeline match. An easy way to ensure that the versions match is to start the workflow inside the same Docker image.

To connect to the Kubernetes cluster from inside Docker, we need to pass kubectl’s config file as well, which is at ~/.kube/config. So the full command becomes:

sudo docker run -t -i                           \
    -v $(pwd):/analysis                         \
    -v ~/.config/gcloud:/root/.config/gcloud    \
    -v ~/.kube/config:/root/.kube/config        \
    lbwang/snakemake-conda-rnaseq               \
    snakemake                                           \
        -s /analysis/Snakefile --directory /analysis    \
        --timestamp -p --verbose --keep-remote          \
        -j 12 --kubernetes                              \
        --container-image lbwang/snakemake-conda-rnaseq \
        --default-remote-provider GS                    \
        --default-remote-prefix {WRITABLE_BUCKET_PATH}  \
        quant_all_samples

After running our pipeline, make sure to delete the GKE cluster by:

gcloud container clusters delete --zone=$ZONE $CLUSTER_NAME

Potential issues of using GKE with Snakemake

I still encountered the following issue while running the whole pipeline on Kubernetes. It is likely not Snakemake’s fault, but I couldn’t find enough time to dig into the details at the time of writing:

HISAT2 cannot build its index on Kubernetes, so the step build_hisat_index failed for an unknown reason. The error message from HISAT2 looks like this:

...
Wrote 8912688 bytes to secondary GFM file: {WRITABLE_BUCKET_PATH}/snakemake_demo/hisat2_index/chr22_ERCC92.6.ht2
Index is corrupt: File size for {WRITABLE_BUCKET_PATH}/snakemake_demo/hisat2_index/chr22_ERCC92.6.ht2 should have been 8912688 but is actually 0.
Please check if there is a problem with the disk or if disk is full.
Total time for call to driver() for forward index: 00:01:18
Error: Encountered internal HISAT2 exception (#1)

Summary

Snakemake is a flexible pipeline management tool that can be run locally and on the cloud. Although it is able to run on Kubernetes such as Google Container Engine, this is a relatively new feature and will take some time to stabilize. Currently, if one wants to run everything (both the computing and the data) on the cloud, using Google Compute Engine and Google Cloud Storage will be the way to go.

Using a 4-core (n1-standard-4) GCE instance, the total time to finish the pipeline locally and via Google Cloud Storage was 3.2 mins and 5.8 mins respectively. So there is some overhead in transferring files from/to the storage.
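To put those two timings side by side, here is a quick back-of-the-envelope calculation. Only the 3.2 and 5.8 minute figures come from my measurements; the variable names are just for illustration.

```python
# Timings reported above, in minutes.
local_min = 3.2   # pipeline with local files
gcs_min = 5.8     # pipeline reading from / writing to Google Cloud Storage

# Extra time spent when the data lives in the bucket.
overhead_min = gcs_min - local_min
# How many times longer the run takes compared with local files.
slowdown = gcs_min / local_min

print(f"Transfer overhead: {overhead_min:.1f} min")
print(f"Slowdown factor:   {slowdown:.1f}x")
```

For a demo dataset this small, the fixed cost of moving files likely dominates, so the ratio may look different on a full-size run.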

Docker and bioconda have made the deployment a lot easier. Bioconda truly saves a lot of duplicated effort in figuring out tool compilation. Docker provides OS-level isolation and a deployment ecosystem. With more tools such as Singularity continuing to come out, virtualization seems to be an inevitable trend.

Other than Google cloud products, Snakemake also supports AWS, S3, LSF, SLURM and many other cluster settings. It seems to me that the day when one Snakefile works for all platforms might be around the corner.

EDIT 2017-08-15: Add a section about using Google Cloud in Docker. Update summary with some time measurements. Add links to the full Snakefiles.
EDIT 2017-09-07: Snakemake has added the support of custom Kubernetes container image. Thus update the GKE section to use the official parameter to pass image.
EDIT 2017-11-17: Add instructions to run Snakemake on Kubernetes inside Docker. Also list the issues of using GKE.

In the discussion, Snakemake’s author, Johannes, mentioned the possibility of using Singularity so each rule can run in a different virtual environment. Singularity support comes with Snakemake 4.2+. ↩

Variants, eQTL, MPRA

2017-06-20T00:00:00-05:00

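Computational Biology and Bioinformatics may be hard to tell apart these days, and this post does not intend to dig into the two definitions, but they roughly represent two broad kinds of research that apply computer science and programming to biology.

During my master's, my lab always encouraged us to come up with new algorithms that make some prediction better or faster, or that draw on more data sources; to build new tools; and to assemble new databases. All of these applications have research value and require heavy technical investment, even though those details rarely make it into the publication. This kind of research leans toward Bioinformatics.

Before coming to WashU, I expected myself to keep going deeper into Bioinformatics. However, over the past few months, even though I was still working on data analysis and tool development, I spent another large share of my time in discussions about models, or rather about "how to answer important biological questions", which inspired me in ways my master's training had not. This other kind of research leans toward Computational Biology.

This post looks at so-called "modeling" from another angle. The content mainly comes from my notes on several lectures given by Professor Barak Cohen on the topic of Coding and Noncoding Variant. My biology background is limited, so please let me know if there are any mistakes in the notes.

Conflict of Interest: The Cohen Lab developed CRE-seq (cis-regulatory element by sequencing), one kind of MPRA (Massively Parallel Reporter Assay) technology.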

Coding vs noncoding variants
eQTL
MPRA
Conclusion

Coding vs noncoding variants

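Let's start with coding and noncoding variants. In class, the professor had us freely debate the "pros and cons" of studying each, or in other words, which one you would rather study as a PI and what advantages and difficulties you would face.

Coding variants are easy to understand: they are sequence changes within a gene's coding region. One usually looks first at nonsynonymous variants, where the variant changes an amino acid, affects the protein structure, and in turn affects its function. Synonymous variants do not change the amino acid, but in model organisms people discuss how different codons for the same amino acid may favor different tRNAs, which may affect gene expression. On the other hand, they may also affect transcription factor (TF) binding: some TFs prefer particular DNA sequences (motifs) when binding, so even if the protein sequence is unchanged, a change in TF binding can affect the regulation of other genes.

In general, however, coding variants are mostly considered in terms of nonsynonymous changes. The resulting effect is very large, so it cannot explain subtle variation such as complex traits or the level of gene expression.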
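To make the synonymous versus nonsynonymous distinction concrete, here is a minimal sketch that classifies a single codon substitution using the standard genetic code. The function name and the example codons are my own choices for illustration and were not part of the lectures.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Label a codon substitution by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        # A premature stop codon, reported separately from other
        # amino acid changes.
        return "nonsense"
    return "nonsynonymous"


for ref, alt in [("GAA", "GAG"), ("GAA", "GTA"), ("TGG", "TGA")]:
    print(ref, "->", alt, classify_codon_change(ref, alt))
```

The sketch only looks at the protein sequence, so it says nothing about the codon usage or TF binding effects that a synonymous change may still have.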

Noncoding elements

Noncoding variants are far more complicated by comparison. Before discussing them, let's list the noncoding elements we know of:

Introns
Promoters
Regulatory elements (REs)
- cis-regulatory elements (CREs): promoters, enhancers
- trans-regulatory elements
miRNAs
Retroviruses, satellites, centromeres, telomeres
Structural elements
- Matrix Attachment Regions (MARs)
- Lamina Associated Domains (LADs)
- CTCF/Cohesin
- Topologically associating domains (TADs)
- 3D genome¹
Methylation

Wait, did we forget histone modification? Barak is deeply skeptical about these epigenetic markers; he thinks they are merely markers rather than the ultimate regulatory elements:

“Something you can measure does not mean it is interesting.”

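He then recommended that we read a paper criticizing ENCODE², which he rated as the most scathing of the past decade and which also has a very amusing title.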

Noncoding variants

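From the non-coding elements we can see that gene expression can be controlled from many angles, so when discussing non-coding variants there are many different mechanisms that affect gene expression. Below is a simple diagram for enhancers (REs) and promoters: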

     TF1  TF2  TF3         TF4  [RNA PolII]
----[  enhancer  ]-------[promoter]--[gene body]-------------------------
  <-- Topologically Associating Domain, TAD ----->   <-- Another TAD -->

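RNA Polymerase II (RNA PolII) is responsible for gene transcription. A promoter is a stretch of sequence, not a fixed one, upstream of the gene body that attracts RNA PolII to raise the gene transcription rate, which will likely raise gene expression. TFs may recognize specific sequences on the promoter, and a bound TF attracts PolII more strongly. Besides promoters, an enhancer's distance from the gene body is even less certain; it may be 10kb or 100kb away, yet it can be very close in 3D space, and it can itself recruit TFs and increase Pol affinity. All of this can also be thought of in terms of inhibition and competition to produce negative regulation.

An enhancer's influence has no directionality; that is, genes both upstream and downstream can be regulated by the same enhancer. Hence the concept of the TAD: it makes the chromosome form a loop that confines these upstream and downstream interactions in 3D space, so that only REs and genes within the same TAD can interact. This idea can be extended to the 3D genome, which considers interactions between different chromosomes. TAD boundaries are determined by certain motifs (e.g., CTCF), but how TADs are established and regulated is still unclear.

The flip side of these complex non-coding interactions is that each interaction likely only changes the level of gene expression rather than acting as a drastic on/off switch. But this also means they do not necessarily have a strong effect on the organism, so a change does not necessarily mean it is functional. Still, different cell types can regulate a whole set of genes through the presence or absence of TFs instead of simply adding more genes. Therefore, in many cases, understanding the effects of non-coding variants is very interesting.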

Endophenotypes

How do we look at non-coding variants? First, we need to understand that there are many layers between genotype and phenotype:

DNA (genotype) →  RNA →  Proteins →  Metabolites →  Phenotype

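Each step in between may pass its effect on to the next stage, or may not. But at the DNA and RNA levels we have a very good tool, sequencing, which lets us look at a great many genes or regions genome-wide at once. So in most cases we only see endophenotypes, and we must keep firmly in mind that these differ from the real phenotype.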

eQTL

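An eQTL is one kind of endophenotype. A QTL (quantitative trait locus) is a chromosome region that can be associated with changes in some quantitative value (i.e., locations that map to some quantitative measures), and an eQTL is an expression QTL, which concerns how variants in a region affect gene expression.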
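As a rough illustration of what "associating a locus with a quantitative value" can look like, here is a minimal sketch that regresses expression on allele dosage for one variant. The data are made up, and real eQTL analyses add covariates and multiple-testing control that are not shown here.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: one variant measured in eight individuals.
genotype_dosage = [0, 0, 1, 1, 1, 2, 2, 2]  # copies of the alternative allele
expression = [5.0, 5.2, 6.1, 5.9, 6.0, 7.1, 6.9, 7.0]  # e.g. log2 expression

slope, intercept = fit_line(genotype_dosage, expression)
print(f"Estimated change in expression per allele copy: {slope:.2f}")
```

A slope clearly different from zero suggests the region is associated with expression, but it does not say which variant in the region is responsible.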

Common eQTL studies in the past include:

Linkage study
Family tree
GWAS on two groups

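All of these use multiple samples from multiple people to look at eQTLs. But for non-coding variants they involve too many confounding factors, so why not start from a single individual and a single sample, that is, allelic imbalance? However, a single sample runs into the inherent problem of eQTLs: it is hard to narrow a region down to the variant or variants that are the decisive factors (causal variants).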
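One common way to think about allelic imbalance, sketched here under my own assumptions rather than taken from the lectures, is to count RNA-seq reads carrying each allele at a heterozygous site and ask whether the split departs from 50/50 with an exact binomial test. The read counts below are hypothetical.

```python
from math import comb


def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial test: sum the probabilities of all
    outcomes that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # tolerance for floating-point ties
    return min(1.0, sum(prob for prob in pmf if prob <= threshold))


# Hypothetical read counts at one heterozygous site in one sample.
ref_reads, alt_reads = 70, 30
total_reads = ref_reads + alt_reads
p_value = binom_two_sided_p(ref_reads, total_reads)
print(f"ref fraction = {ref_reads / total_reads:.2f}, p = {p_value:.2g}")
```

Because both alleles sit in the same cell and the same trans environment, a skew like this points to a cis effect, though mapping bias can also produce it.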

MPRA

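So we can design experiments to further explain an eQTL. Experiments can be designed from two directions: necessity and sufficiency. Necessity can be tested by using CRISPR with a series of tiled gRNAs to delete an eQTL region piece by piece. To parallelize this experiment, growth selection and single cell sequencing can read out which gRNAs have the greatest effect.

For sufficiency, we can design a reporter assay to answer the question: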

-----------------------------[weak promoter]--[GFP]--
---[cis RE, CRE]-------------[weak promoter]--[GFP]--

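A reporter assay can be delivered into the target cell on a plasmid, but how do we parallelize it and look at many genes at once? This is where MPRA (Massively Parallel Reporter Assay) comes in. We can use DNA synthesis to make each cis-regulatory element (CRE) together with a barcode and build a CRE library; RNA-seq then lets us see the gene expression change caused by different CREs at the same time. Of course there are details such as normalizing for DNA amount and barcode efficiency, but we can use MPRA to analyze CREs.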
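The normalization idea can be sketched as follows: sum the barcode counts per CRE, take the RNA/DNA ratio, and compare it with the promoter-only construct. This is a simplified illustration with made-up counts and names, not the Cohen Lab's actual analysis.

```python
from collections import defaultdict
from math import log2

# Hypothetical barcode counts: (CRE name, barcode, DNA count, RNA count).
# "basal" stands for the promoter-only construct without any CRE.
barcode_counts = [
    ("basal", "AAAC", 100, 100),
    ("basal", "AAGT", 200, 220),
    ("CRE_1", "ACCA", 150, 600),
    ("CRE_1", "ACGG", 100, 380),
    ("CRE_2", "AGTC", 120, 60),
]


def cre_activity(rows):
    """Sum counts per CRE, then return each CRE's RNA/DNA ratio."""
    dna = defaultdict(int)
    rna = defaultdict(int)
    for cre, _barcode, dna_count, rna_count in rows:
        dna[cre] += dna_count
        rna[cre] += rna_count
    return {cre: rna[cre] / dna[cre] for cre in dna}


activity = cre_activity(barcode_counts)
for cre, ratio in activity.items():
    # Express each CRE relative to the promoter-only construct.
    print(cre, round(log2(ratio / activity["basal"]), 2))
```

Dividing by the DNA count corrects for how much of each plasmid went into the cells, and pooling several barcodes per CRE averages out barcode-specific effects.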

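What are the drawbacks of the CRE-seq mentioned here? It is plasmid based, has no histone modification, and has copy number issues; in addition, its genome context is only local (TADs, for example, are not considered). How to improve it is the latest direction of research in the Barak Lab.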

Conclusion

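I find it very interesting to look at things from this angle, integrating many concepts from a systems perspective and proposing new experiments and models. How to model the interactions between enhancers and TFs, for example, is a fascinating topic. His course is very inspiring and a lot of fun.

Speaking of the 3D genome, this paper is worth mentioning:
Adrian and Suhas et al., Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes, PNAS, 2015.
It uses the fractal globule to explain what kind of chromosome folding results from CTCF forming TADs, and how such folding produces long distance interactions, since the loci may be close in 3D space. The model is used to explain Hi-C data. The mathematics in this paper is so abstract and complex that Shing-Tung Yau was even invited as a reviewer. ↩
A very famous polemic against ENCODE's claim that 80% of the genome is functional:
Dan et al., On the Immortality of Television Sets: “Function” in the Human Genome According to the Evolution-Free Gospel of ENCODE, Genome Biol Evol, 2013. ↩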

Changing login shell without chsh

2017-01-23T00:00:00-06:00

For my daily terminal life, I use fish shell. Fish shell can be largely described by the headline on its official website:

Finally, a command line shell for the 90s
fish is a smart and user-friendly command line shell for macOS, Linux, and the rest of the family.

Among all of its features, I particularly enjoy how the autocompletions are widely available and easy to use for nearly all common commands and operations. I think I’ve been spoiled by the autocompletion so much that sometimes I get lazy about typing the full commands. Undoubtedly, fish is my login shell, replacing the ubiquitous Bash shell.

Most of the time, one can change the login shell with the command chsh. In order to let chsh accept the new fish shell, it must be added to the list of all accepted shells, which requires root permission. However, on many occasions, including working on a large shared server, one may not have the permission to add a new shell, and thus the options for the login shell are often limited.

Replacing login shell by `exec`

An alternative solution is to call the new shell when the current shell starts at login. A POSIX-compliant shell¹ should always read the .profile configuration file upon login, so placing the following command there executes fish and replaces the process of the currently running shell (usually bash).

exec -l $SHELL -l

-l tells the shell to act like a login shell. More explanation about login and non-login (as well as (non-)interactive) shells can be found in this StackOverflow answer. By pointing $SHELL to the desired shell binary, one can achieve behavior similar to chsh.

However, putting exec in the login profile comes with a risk: if the new shell executable crashes (e.g. a broken symlink, an erroneous compilation, or failed dynamic library linking), one cannot establish a proper shell connection. I’ve experienced these catastrophic failures a couple of times. Since the new shell crashes while the original shell is replacing its process, one cannot set up a proper terminal session, or simply put, one will fail to log in. It was not fun at all to recover.

Fail-safe shell changing

To provide a fail-safe mechanism, I use the following code in my ~/.profile to change the shell. Only login shells will read this file so it won’t be executed when one runs a bash script.

FISH_BIN="$HOME/.linuxbrew/bin/fish"

# The replacement is only done in non-fish login interactive shell in
# SSH connection and fish executable exists.
if [                                                            \
     "$SHELL" != "$FISH_BIN" -a -n "$SSH_TTY" -a -x "$FISH_BIN" \
] ; then
    # we first check whether fish can be executed, otherwise the
    # replacement will cause immediate crash at login (not fun)
    if "$FISH_BIN" -c 'echo "Test fish running" >/dev/null' ; then
        export SHELL="$FISH_BIN"
        echo "One can launch the fish shell by 'exec -l \$SHELL -l'"
        # exec -l $SHELL -l   # launch the fish login shell
    else
        echo "Failed to launch fish shell. Go check its installation!"
        echo "Fall back to default shell $SHELL ..."
    fi
fi

Basically, we check that the fish executable works by running echo '...' before we change the shell. By uncommenting the exec ... line, one will be automatically directed to the fish shell. But the safest option is to run the shell change oneself. This kind of “shell swapping” will only happen when we log in to the server by ssh.
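The same "try it before you swap" check can be sketched in Python, for example if the login logic lives in a helper script. The function name, the timeout and the example paths are my own assumptions; only the idea of running a trivial command through the candidate shell comes from the snippet above.

```python
import subprocess


def shell_is_usable(shell_path, timeout=5):
    """Return True if `shell_path -c <trivial command>` exits with status 0."""
    try:
        result = subprocess.run(
            [shell_path, "-c", "echo test"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary, broken symlink, no execute permission, or a hang.
        return False
    # A crash such as an illegal instruction gives a non-zero return code.
    return result.returncode == 0


print(shell_is_usable("/bin/sh"))
print(shell_is_usable("/nonexistent/fish"))
```

Running a real command matters here because it exercises the shell's main code path, not just its startup.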

The actual setting is quite straightforward. The backstory, however, is that I messed up a few times and was lucky enough to keep another session alive. At first I only checked fish --version, but that was not sufficient, since it didn’t actually execute fish’s main code. I got an illegal instruction error after changing to fish even though printing its version was fine.

Interestingly, fish is not a POSIX-compliant shell so it won’t read ~/.profile configuration file. ↩

GPG Key Transition

2016-12-06T00:00:00-06:00

I started using a GPG key as one of my small experiments in March 2015. Throughout the setup, I made some mistakes, which I revoked later, and explored several usage scenarios. As the post I’m giving up on PGP says, I don’t really use encryption in daily email communication, but it is still good to have an online identity.

A year later, which is 3 months before my experimental key expires, I think now is a good time to roll out a new one. I followed Alex’s post Creating the Perfect GPG Keypair to create a signing subkey for daily usage and keep my master key separately in a safe place. The following is my transition statement.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

I am transitioning GPG keys from an old 4192-bit RSA key to a new
4096-bit RSA key.  The old key will continue to be valid for some
time, but I prefer all new correspondance to be encrypted in the new
key, and will be making all signatures going forward with the new key.

This transition document is signed with both keys to validate the
transition.

If you have signed my old key, I would appreciate signatures on my new
key as well, provided that your signing policy permits that without
reauthenticating me.

The old key, which I am transitioning away from, is:

  pub   4096R/30A45011B233544E 2015-03-21 [expires: 2017-03-22]
      Key fingerprint = 6ECD C5B8 235C D44D 2471  866E 30A4 5011 B233 544E

The new key, to which I am transitioning to, is:

  pub   4096R/44D10E44730992C4 2016-12-04 [expires: 2018-12-04]
      Key fingerprint = 85DF A3EB 72CD DE7D 3F2A  127C 44D1 0E44 7309 92C4

To fetch the full new key from a public key server using GnuPG, run:

  gpg --keyserver hkps://hkps.pool.sks-keyservers.net --recv-key 44D10E44730992C4

If you have already validated my old key, you can then validate that
the new key is signed by my old key:

  gpg --check-sigs 44D10E44730992C4

If you then want to sign my new key, a simple and safe way to do that
is by using caff (shipped in Debian as part of the "signing-party"
package) as follows:

  caff 44D10E44730992C4

Please contact me via e-mail at <me@liang2.tw> if you have any
questions about this document or this transition.

                                            Liang-Bo Wang
                                            (liang2, ccwang002)
                                            me@liang2.tw
                                            Dec 06, 2016

-----BEGIN PGP SIGNATURE-----

iQIcBAEBCgAGBQJYR5q3AAoJEPhHjvdXantXlzEP/iEgSd2NcfcBThmrY84U+MXR
UOLED3Ax6YvDUv/nInkMAH74SyqujeF7E7+ZuZmDEWRCVS6pQtpuLTvKBviDPyWx
W/hS03AU5nV9llSYZ4I/FzQdVtdY5PBBNCHxK34LoqJQVr3LPdAQOO2m9g8M11z0
+7FjmyNOjvZIxqhU+PK7VNEcZQ9X30ndjgkwCZQFE/8Wz9FnPt5QdwZoxNRBfx7Q
tMtHpMxKNTHV1t3lCcOubf5zQLFQ07SZv2f2rmfDPlsrvzp4bzq8QWEEo/XvClRc
hyFpbM55FqlJE+5lg/Dj/XC5AN0LS8HNd9x7UBWZnNQhg00elc+CXxgNc+wMVj+N
uPXYP2n1oZ4T4Fr45eFg9nJagBpIUsu+M5hoNGXtCdLcYPzJeERPebt/VZJQybse
T60hO8K15A5WCkZYnw1mXu/JangNGY8Bxq3xm7VzXHLJktf33gIIMUQC2ZCn/I0N
MQIjkMrARFpsGfb1DEglyWk/QdK4A8Cy3eKsNaKvz4A+PMr4Eskn8zhO05yJQlux
3IkXt1lnn2YTF5fjInYQPo22bNCuub4qoJhthOoySy2Zv04NHnOtmYPBZMv14aZe
MPcv4Kvn8szNeBRDT/zKqCWoHmxIbxIs7ZvSIfvj/NpugtkSkJVtvr7gpXmqObcc
L7kJSEXjkLiqvFrq6MkxiQIcBAEBCgAGBQJYR5q3AAoJEDCkUBGyM1ROlzEQAJ3i
gpH6z/rHrAVCNxru7ATLmVZYGF0uxLvth0hUnOvmhWb7a60v4KwRgTBFJ9vdUB24
MW0T0BdxN8zJPrN6hGj9RxML5UpzH///oeL1gINM8IEhZWaG1/th7bx5f/ip8xN+
dbkA8Hp3LW/LAB09uJOITLbLaPa+N2Umcsu6stPXL+Z/06JSUYIliDRDkzzpb/qw
/OZD6sj1oI25A7KYEUPiNn+FxtBmNiFetDqwhCJSglEF3SBl8ZlrbgMDxIudZX/5
+ihTn2Za5q59c7u2ESMmInP1n8/lFxYxi/DWE2n8vrw84PwQ5lG5zdiiYQf78QeB
j77giQzYibzvRHZlslJEM0lSeNLQ72svT5SIFB+45wqtfVIAfZCxTppv35MkpyDw
gtYW/zL6U+Qx+chPgVpBLkpC7LbBvrJozIU0oHw8V837IByaeqBPu9rm+F3M++Mo
taLmkzNvhX6wozw9Tj0gnW6e8ytH7Xi8K8IYO7xSSOGih/oKF2PrWPd8gufMiIML
lOtcuwZOCQqAB2yAQ2BHliwrm78XELARZXM1sbWJTpXBJPAZ+ZbvnNFK6fUwnclK
H35TsvRJK7hH+4d10EdURleyRj7d0EcXlHqki4urKlwSzRebLzq365vADzXEjFYp
DmfC2ISS64uLqHgJ3HHxhSmTLdc8KSJqzFi90ZUu
=JS7v
-----END PGP SIGNATURE-----

PhD Life in St. Louis

2016-10-02T00:00:00-05:00

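Before I knew it, 4 months had passed since my last post. In these 4 months, I served on the PyCon TW 2016 staff, finished my master's thesis, graduated from a school where I had studied for 7 years, and then went to Washington University in St. Louis in the US to pursue a PhD in Bioinformatics.

Apart from continuing to get to know the new environment and new friends, life has largely settled down. With all kinds of school events and informal student gatherings at the start of the semester, plus adapting to new courses and a new lab, my own projects have not made much progress. Still, here is a brief update.

About my school and the city

Location of the city of St. Louis.

I guess most people are not familiar with the city of St. Louis (STL) or the school Washington University in St. Louis (WashU or WUSTL). First, St. Louis is a mid-sized city in the American Midwest. It is more rural than cities on the East and West Coasts, but the area around the school is not endless farmland either. During the westward expansion of the US in the 19th century, it served as an important land and river transportation hub for heading west, and the city's landmark is a very tall arch. STL hosted the 1904 Summer Olympics and the World's Fair, so much of the city's infrastructure was completed or renewed at that time, including quite a few of our school's buildings and facilities.

Busch Stadium, home field of the MLB Cardinals baseball team. The arch in the background is the St. Louis landmark.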

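The climate has four distinct seasons. Summer can reach 37°C (100°F). It is now early autumn, with large temperature swings between morning and evening (15 to 25°C). The coming winter will be colder (-5 to 10°C) with a few days of snow, though I can probably only describe it properly after living through one. As for daily life, there is basically nothing inconvenient, but compared with big cities there are somewhat fewer entertainment and leisure activities, and the local Chinese community is smaller. On the other hand, prices are lower and housing is affordable. Before coming, I worried that the internet would not be fast enough, since most places have no fiber service for residents, but my broadband is currently 100MBps and actual transfers sustain that speed, occasionally even faster, so it is acceptable for now. For a geek, a fast enough connection means being connected to his whole world.

An outdoor concert held by the STL Symphony in Forest Park, which is a family outing here: people picnic on the grass while listening to the performance.

Next, about my school, Washington University in St. Louis, abbreviated WashU or WUSTL. Many schools are named after Washington; for example, Seattle has a public school, UW (University of Washington), but the two are unrelated. Saying WashU often causes confusion, so people often write WUSTL to tell them apart, which is also the school's domain name. Below is the location of our school:

Location of the WUSTL campuses. Ref: OpenStreetMap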

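The school is split into two main campuses by a large park in the middle. The west one is the main campus, where most of the schools are located. The east one is the medical campus, which includes a number of hospitals and the departments related to biomedical research. My program is on the east medical campus. My impression is that the school is about the same size as NTU's Gongguan campus. The large park in the middle, Forest Park, measures 2.5 x 1.6 km and was the main site of the World's Fair. It now houses museums, a botanical garden, a zoo and more; many outdoor events are held there, and it is also a place to exercise.

The buildings on the west main campus are Collegiate Gothic and very beautiful, just what one imagines an old university to look like; in fact the school was founded more than 160 years ago. Although the school is not well known in Taiwan, it is a research-oriented private university ranked in the "world top 100", with 25 Nobel laureates. It may look old, but the school is constantly renovating or constructing new buildings while keeping the style of the existing ones.

Brookings Hall, the entrance building of the west main campus (Danforth Campus).

The medical school, by comparison, is a more modern campus; the picture below, for example, shows the building where my lab is located.

McKinley Research Building, completed in 2015.

As for research, I will not list it here and will share more once I am more familiar with the environment. If you are interested in the PhD application process, I wrote a post about preparing PhD applications on ptt.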

What’s Next?

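Having spent some time settling in, I now have quite a lot of personal time here; apart from the first-year courses and homework, I still have some time of my own. I have accumulated 6 drafts so far, mostly about Django, but after finishing my master's I rarely had a chance to build websites, until I recently made a proposal review system for the lab and touched it again. I will tidy them up when I get the chance.

In Taiwan there is also the Python documentation translation project. Unfortunately I could not finish it before leaving, but it is my goal for this year. It is currently left to develop on its own, but I want to wrap up the tutorial part before the new Python 3.6 release at the end of the year. Besides this project, I have been relearning C and reading a bit about Rust to practice the fundamentals. I have not joined a local in-person open source community yet, partly because going out at night is somewhat more dangerous here than in Taiwan. But being in a US time zone, I can easily follow discussions on IRC and mailing lists, so I do not feel cut off from new knowledge. My goal is to attend next year's PyCon US.

By the way, the blog has a new theme that looks more like Medium and should be easier to read. I have not had time to polish the details, so for now I leave things as they are as long as they do not look too off. Hopefully I can get back to posting regularly soon.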

Deploying Django with conda env

2016-05-24T00:00:00-05:00

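Just a few days ago I deployed Django and documented it in "Deploying Django with uWSGI, nginx, and systemd". Today I deployed another project. The deployment setup is the same as last time: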

nginx -- unix socket -- uWSGI -- Django

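As before, I write a systemd unit named PROJ.service to manage starting the website (uWSGI). From here on, replace PROJ with your own project name and USER with the account that runs the website.

conda
uWSGI and $PATH
Using environment variables in a systemd unit
Conclusion

conda

conda is a package management system for Python. Its advantage is that when a package needs external libraries, conda installs and manages those dependencies as well, and it can also manage different Python versions. Think of it as an enhanced pip + venv. conda is compatible with pip.

This Django project uses many packages such as numpy and pandas. For ease of maintenance, I decided to install them with conda. I use miniconda3, which installs under ~/miniconda3 by default, and virtual environments live under ~/miniconda3/envs/.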

$ conda create -n VENV python=3.5 numpy pandas django
$ source activate VENV
(VENV) $ pip install uwsgi

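uWSGI is not available in conda, so I install it with pip. As we know from the previous post, there is no need to install it system-wide.

uWSGI and $PATH

In theory, the rest just follows the previous post, but I ran into a problem with uWSGI: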

$ sudo /home/USER/miniconda3/envs/VENV/bin/uwsgi --ini PROJ.ini
[uWSGI] getting INI configuration from PROJ.ini
*** Starting uWSGI 2.0.13.1 (64bit) on [Wed May 25 08:04:23 2016] ***
compiled with version: 5.3.1 20160413 on 25 May 2016 01:35:28
os: Linux-4.4.0-22-generic #40-Ubuntu SMP Thu May 12 22:03:46 UTC 2016
nodename: s66
machine: x86_64
clock source: unix
detected number of CPU cores: 24
current working directory: /etc/uwsgi/vassals
detected binary path: /home/USER/miniconda3/envs/VENV/bin/uwsgi
……
chdir() to /path/to/PROJ/
your processes number limit is 514650
your memory page size is 4096 bytes
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to UNIX address /run/PROJ/django.sock fd 3
Python version: 3.5.1 |Continuum Analytics, Inc.| (default, Dec  7 2015, 11:16:01)  [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Set PythonHome to /home/USER/miniconda3/envs/VENV
Failed to import the site module
Traceback (most recent call last):
  File "/usr/lib/python3.5/site.py", line 580, in <module>
    main()
  …… 
  File "/usr/lib/python3.5/_sysconfigdata.py", line 6, in <module>
    from _sysconfigdata_m import *
ImportError: No module named '_sysconfigdata_m'

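But the steps were so simple that I could not figure out what was wrong, and searching online turned up nothing relevant. I was stuck here for a long time.

Eventually I noticed from the traceback that uWSGI went and read /usr/lib/python3.5/site.py, which means some environment setting must be wrong for it to find a Python environment that is not the one we want. It should have found /home/USER/miniconda3/envs/VENV/lib/python3.5/site.py instead.

After some experimenting, I found that it works as soon as the $PATH environment variable is modified.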

$ sudo -i
# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
# export PATH=/home/USER/miniconda3/envs/VENV/bin:$PATH
# /home/USER/miniconda3/envs/VENV/bin/uwsgi --ini PROJ.ini

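Using environment variables in a systemd unit

According to systemd.exec(5), regarding the $PATH environment variable:

Colon-separated list of directories to use when launching executables. Systemd uses a fixed value of /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin.

By default only the paths above are included. To modify an environment variable, use Environment=, so I added one more line to the systemd unit. The rest of the settings are the same.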

[Unit]
Description=PROJ Django server by uWSGI
After=syslog.target

[Service]
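# systemd does not expand $PATH inside Environment=, so the default value is spelled out
Environment="PATH=/home/USER/miniconda3/envs/VENV/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"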
ExecStart=/home/USER/miniconda3/envs/VENV/bin/uwsgi --ini /etc/uwsgi/vassals/PROJ.ini
Restart=always
KillSignal=SIGQUIT
Type=notify
StandardError=syslog
NotifyAccess=all

[Install]
WantedBy=multi-user.target

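Conclusion

To manage packages with conda instead, just add one line in the systemd unit that modifies $PATH to include the virtual environment's directory of executables; everything else is the same as with a regular Python virtual environment. That's it. But this problem cost me more than an hour...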

Ensembl Genomic Reference in Bioconductor

2016-05-21T18:00:00-05:00

TL;DR I gave a talk Genomics in R about querying genomic annotations and references in R/Bioconductor. In this post, we re-visit all the operations in my talk using Ensembl references instead of UCSC/NCBI ones.

This post is part of the “Genomic Data Processing in Bioconductor” series. In that post, I mentioned several topics critical for genomic data analysis in Bioconductor:

Annotation and genome reference (OrgDb, TxDb, OrganismDb, BSgenome)
Experiment data storage (ExpressionSets)
Operations on genome (GenomicRanges)
Genomic data visualization (Gviz, ggbio)

A few days ago at a local R community meetup, I gave a talk Genomics in R covering the “Annotation and genome reference” part and a quick glance through “Operations on genome”, which should be sufficient for daily usage such as searching annotations within some genomic ranges. You can find the slides, the meetup screencast (in Chinese) and the accompanying source code online. I don’t think a write-up is needed for the talk. But if anyone is interested, feel free to drop your reply below. :)

Fundamental Bioconductor packages

Some Bioconductor packages are the building blocks for genomic data analysis. I put a table here containing all the classes covered in the rest of the post. If you are not familiar with these classes and their methods, go through the talk slides first, or at least follow the annotation workflow on Bioconductor.

R Class	Description
`OrgDb`	Gene-based information for Homo sapiens; useful for mapping between gene IDs, Names, Symbols, GO and KEGG identifiers, etc.
`TxDb`	Transcriptome ranges for the known gene track of Homo sapiens, e.g., introns, exons, UTR regions.
`OrganismDb`	Collection of multiple annotations for a common organism and genome build.
`BSgenome`	Full genome sequence for Homo sapiens.
`AnnotationHub`	Provides a convenient interface to annotations from many different sources; objects are returned as fully parsed Bioconductor data objects or as the name of a file on disk.

Ensembl genome browser and its ecosystem

In Bioconductor, most annotations are built against the NCBI and UCSC naming systems, which are also used in my talk. However, there is another naming system maintained by Ensembl, whose IDs are very recognizable with the prefixes “ENSG” and “ENST” for genes and transcripts respectively.

I particularly enjoy the Ensembl genome browser. The information is well organized and structured. For example, take a look at the description page of gene MAPK1,

Gene information page of MAPK1 on Ensembl Genome Browser release 84 (link)

The gene tree tab shows its homologs and paralogs. The variant table tab shows various kinds of SNPs within MAPK1’s transcript region. SNPs are annotated with their sources, different levels of supporting evidence, and SIFT/PolyPhen predictions of protein function change. Finally, there is an external references tab which links the Ensembl IDs with NCBI CCDS and NCBI RefSeq IDs. There are many ways to explore different aspects of this gene, and it seems everything at multiple biological levels is simply connected.

I always think of the Ensembl ecosystem as a decent learning portal, so it would be a pity if one could not easily use its information in R/Bioconductor. After some quick research, I found that using Ensembl annotations is quite straightforward even though the required files do not ship with Bioconductor. Also, there were some topics I failed to mention in the talk, such as AnnotationHub and genomic coordinate system conversion (e.g., from hg19 to hg38). I am going to cover these topics in this post.

Fundamental Bioconductor packages
Ensembl genome browser and its ecosystem
OrgDb
TxDb
BSgenome and AnnotationHub
biomaRt
- Compatibility with AnnotationDb’s interface
- biomaRt’s original interface
Conversion between genomic coordinate systems
Summary

OrgDb

The same OrgDb object for human (org.Hs.eg.db) can be used. It relates different gene IDs, including Entrez and Ensembl gene IDs. Judging from its metadata, the human OrgDb gets updated frequently; most of its data sources were fetched this March. So one should be able to use it for both hg19 and hg38 human references.

library(org.Hs.eg.db)
human <- org.Hs.eg.db

mapk_gene_family_info <- select(
    human,
    keys = c("MAPK1", "MAPK3", "MAPK6"),
    keytype = "SYMBOL",
    columns = c("ENTREZID", "ENSEMBL", "GENENAME")
)
mapk_gene_family_info

SYMBOL	ENTREZID	ENSEMBL	GENENAME
MAPK1	5594	ENSG00000100030	mitogen-activated protein kinase 1
MAPK3	5595	ENSG00000102882	mitogen-activated protein kinase 3
MAPK6	5597	ENSG00000069956	mitogen-activated protein kinase 6

Here comes a small pitfall for Ensembl annotation. We cannot sufficiently map Ensembl’s gene ID to its transcript ID,

select(
    human,
    keys = mapk_gene_family_info$ENSEMBL[[1]],
    keytype = "ENSEMBL",
    columns = c("ENSEMBLTRANS")
)
# 'select()' returned 1:1 mapping between keys and columns
#           ENSEMBL ENSEMBLTRANS
# 1 ENSG00000100030         <NA>

We got no Ensembl transcript ID for MAPK1, which is impossible. Therefore, to find the real Ensembl transcript IDs, we need to find other references.

TxDb

There is no pre-built Ensembl TxDb object available on Bioconductor. But with the help of ensembldb, we can easily build the TxDb ourselves.

Following the instructions in ensembldb’s vignette, we can build the TxDb object from the latest Ensembl release, which is release 84 (Mar, 2016) at the time of writing. Ensembl releases all human transcript records as a GTF file, which can be found at ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz.

After processing the GTF file via ensDbFromGtf(), the generated data for creating the TxDb object will be stored in a SQLite3 database file Homo_sapiens.GRCh38.84.sqlite in the R working directory. Building the TxDb is then just one command away, EnsDb(). Putting the two commands together, the script for building the Ensembl TxDb is listed below. To avoid rebuilding the TxDb every time the script is executed, we first check if the SQLite file exists,

library(ensembldb)

# xxx_DB in the vignette is just a string to the SQLite db file path
ens84_txdb_pth <- './Homo_sapiens.GRCh38.84.sqlite'
# Build the SQLite file from the GTF only when it does not exist yet
if (!file.exists(ens84_txdb_pth)) {
    ens84_txdb_pth <- ensDbFromGtf(gtf="Homo_sapiens.GRCh38.84.gtf.gz")
}
txdb_ens84 <- EnsDb(ens84_txdb_pth)
txdb_ens84  # Preview the metadata

The filtering syntax for finding desired genes or transcripts is different from that of the built-in TxDb object,

transcripts(
    txdb_ens84,
    filter=GeneidFilter(mapk_gene_family_info$ENSEMBL[[1]])
)
# GRanges object with 4 ranges and 5 metadata columns:
#                   seqnames               ranges strand |           tx_id           tx_biotype
#                      <Rle>            <IRanges>  <Rle> |     <character>          <character>
#   ENST00000215832       22 [21754500, 21867629]      - | ENST00000215832       protein_coding
#   ENST00000491588       22 [21763984, 21769428]      - | ENST00000491588 processed_transcript
#   ENST00000398822       22 [21769040, 21867680]      - | ENST00000398822       protein_coding
#   ENST00000544786       22 [21769204, 21867440]      - | ENST00000544786       protein_coding
#                   tx_cds_seq_start tx_cds_seq_end         gene_id
#                          <numeric>      <numeric>     <character>
#   ENST00000215832         21769204       21867440 ENSG00000100030
#   ENST00000491588             <NA>           <NA> ENSG00000100030
#   ENST00000398822         21769204       21867440 ENSG00000100030
#   ENST00000544786         21769204       21867440 ENSG00000100030
#   -------
#   seqinfo: 1 sequence from GRCh38 genome

So filtering is done by passing special filter functions to filter=. Likewise, there are TxidFilter, TxbiotypeFilter, and GRangesFilter for filtering on the respective columns.

tx_gr <- transcripts(
    txdb_ens84,
    filter=TxidFilter("ENST00000215832")
)
tx_gr
# GRanges object with 1 range and 5 metadata columns:
#                   seqnames               ranges strand |           tx_id     tx_biotype tx_cds_seq_start tx_cds_seq_end         gene_id
#                      <Rle>            <IRanges>  <Rle> |     <character>    <character>        <numeric>      <numeric>     <character>
#   ENST00000215832       22 [21754500, 21867629]      - | ENST00000215832 protein_coding         21769204       21867440 ENSG00000100030
#   -------
#   seqinfo: 1 sequence from GRCh38 genome

Check the result with the online Ensembl genome browser. Note that Ensembl release 84 uses GRCh38 (hg38).
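Since the EnsDb is stored as a plain SQLite file, the same kind of lookup can in principle be done outside R. Below is a minimal Python sketch. The table and column names (a tx table with tx_id, tx_biotype, tx_seq_start, tx_seq_end and gene_id) are my assumption about the schema, so the example builds a tiny in-memory stand-in instead of opening the real file. The two rows reuse the coordinates shown in the output above.

```python
import sqlite3

# Stand-in for the EnsDb SQLite file; with a real file one would call
# sqlite3.connect("Homo_sapiens.GRCh38.84.sqlite") instead.
conn = sqlite3.connect(":memory:")
# Assumed schema: a `tx` table keyed by transcript ID with a gene_id column.
conn.execute(
    "CREATE TABLE tx (tx_id TEXT, tx_biotype TEXT, "
    "tx_seq_start INTEGER, tx_seq_end INTEGER, gene_id TEXT)"
)
conn.executemany(
    "INSERT INTO tx VALUES (?, ?, ?, ?, ?)",
    [
        ("ENST00000215832", "protein_coding",
         21754500, 21867629, "ENSG00000100030"),
        ("ENST00000491588", "processed_transcript",
         21763984, 21769428, "ENSG00000100030"),
    ],
)


def transcripts_of_gene(connection, gene_id):
    """Return (tx_id, biotype, start, end) rows for one Ensembl gene ID."""
    query = (
        "SELECT tx_id, tx_biotype, tx_seq_start, tx_seq_end "
        "FROM tx WHERE gene_id = ? ORDER BY tx_seq_start"
    )
    return connection.execute(query, (gene_id,)).fetchall()


rows = transcripts_of_gene(conn, "ENSG00000100030")
for row in rows:
    print(row)
```

Before running this against a real EnsDb file, it is worth checking the actual table layout first, for example with the SQLite .schema command.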

BSgenome and AnnotationHub

We can load the sequence from BSgenome.Hsapiens.UCSC.hg38. However, we can also obtain the genome (chromosome) sequence from Ensembl using AnnotationHub. References of non-model organisms can be found on AnnotationHub, many of which are extracted from Ensembl. They can be downloaded directly as Bioconductor objects, so they should be easier to use.

First we create an AnnotationHub instance, which caches the metadata of all available annotations locally for us to query.

library(AnnotationHub)
ah <- AnnotationHub()
query(ah, c("Homo sapiens", "release-84"))
# AnnotationHub with 5 records
# # snapshotDate(): 2016-05-12
# # $dataprovider: Ensembl
# # $species: Homo sapiens
# # $rdataclass: TwoBitFile
# # additional mcols(): taxonomyid, genome, description, tags, sourceurl, sourcetype
# # retrieve records with, e.g., 'object[["AH50558"]]'
#
#             title
#   AH50558 | Homo_sapiens.GRCh38.cdna.all.2bit
#   AH50559 | Homo_sapiens.GRCh38.dna.primary_assembly.2bit
#   AH50560 | Homo_sapiens.GRCh38.dna_rm.primary_assembly.2bit
#   AH50561 | Homo_sapiens.GRCh38.dna_sm.primary_assembly.2bit
#   AH50562 | Homo_sapiens.GRCh38.ncrna.2bit

From the search results, the human hg38 genome sequences are available in TwoBit format. Having multiple results is confusing at first. After checking Ensembl’s genome DNA assembly readme, what we should use here is the full DNA assembly without any masking (or you can decide based on your application).

# There are plenty of query hits. Description of the different file suffixes:
# GRCh38.dna.*.2bit     genome sequence
# GRCh38.dna_rm.*.2bit  hard-masked genome sequence (masked regions are replaced with N's)
# GRCh38.dna_sm.*.2bit  soft-masked genome sequence (masked regions are lower cased)
ens84_human_dna <- ah[["AH50559"]]

Then we can use it to obtain the DNA sequence of the desired genomic range (as a GRanges object).

getSeq(ens84_human_dna, tx_gr)

#   A DNAStringSet instance of length 1
#      width seq                              names
# [1] 113130 TTTATAGAGAAAA...CTCGGACCGATTGCCT ENST00000215832

biomaRt

BioMart is a collection of databases that can be accessed through the same API, including Ensembl, UniProt and HapMap. biomaRt provides an R interface to these database resources. We will use BioMart for ID conversion between Ensembl and RefSeq. Its vignette contains solutions to common scenarios, so it should be a good starting point to get familiar with it.

You can first explore which Marts are currently available with listMarts(),

library(biomaRt)
listMarts()
#                biomart               version
# 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 84
# 2     ENSEMBL_MART_SNP  Ensembl Variation 84
# 3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 84
# 4    ENSEMBL_MART_VEGA               Vega 64

Here we will use Ensembl’s biomart. Each mart contains multiple datasets, usually separated by different organisms. In our case, human’s dataset is hsapiens_gene_ensembl. For other organisms, you can find their dataset by listDatasets(ensembl).

ensembl <- useMart("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl", mart=ensembl)
# or equivalently
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

Compatibility with AnnotationDb’s interface

The way to query the ensembl Mart object is slightly different from how we query an AnnotationDb object. The major difference is the terminology. Luckily, the Mart object provides a compatibility layer so we can still call functions such as select(db, ...), keytypes(db), keys(db) and columns(db), which we frequently do¹ when using NCBI/UCSC references.

A Mart can have hundreds of keys and columns, so we pick out a subset of them with grep(),

grep("^refseq", keytypes(ensembl), value = TRUE)
# [1] "refseq_mrna" "refseq_mrna_predicted" ...
grep("^ensembl", keytypes(ensembl), value = TRUE)
# [1] "ensembl_exon_id" "ensembl_gene_id" ...
grep("hsapiens_paralog_", columns(ensembl), value=TRUE)
# [1] "hsapiens_paralog_associated_gene_name"
# [2] "hsapiens_paralog_canonical_transcript_protein"
# [3] "hsapiens_paralog_chrom_end"
# ...

We start by finding MAPK1’s RefSeq transcript IDs and their corresponding Ensembl transcript IDs, which is something we cannot do with either our locally built Ensembl TxDb or the human OrgDb.

select(
    ensembl,
    keys = c("MAPK1"),
    keytype = "hgnc_symbol",
    columns =  c(
        "refseq_mrna", "ensembl_transcript_id",
        "hgnc_symbol", "entrezgene",
        "chromosome_name",
        "transcript_start", "transcript_end", "strand"
    )
)
#   refseq_mrna ensembl_transcript_id hgnc_symbol entrezgene
# 1   NM_002745       ENST00000215832       MAPK1       5594
# 2                   ENST00000491588       MAPK1       5594
# 3   NM_138957       ENST00000398822       MAPK1       5594
# 4                   ENST00000544786       MAPK1       5594
#   chromosome_name transcript_start transcript_end strand
# 1              22         21754500       21867629     -1
# 2              22         21763984       21769428     -1
# 3              22         21769040       21867680     -1
# 4              22         21769204       21867440     -1

So some of the MAPK1 Ensembl transcripts do not have RefSeq identifiers. This is common, since RefSeq is more conservative about including new transcripts. Anyway, we can now translate our analysis results between a wider range of naming systems.

Moreover, what’s awesome about BioMart is that almost all the information on the Ensembl genome browser can be retrieved through BioMart. For example, getting the paralogs and the mouse homolog of MAPK1,

# Get paralog of MAPK1
select(
    ensembl,
    keys = c("ENSG00000100030"),
    keytype = "ensembl_gene_id",
    columns = c(
        "hsapiens_paralog_associated_gene_name",
        "hsapiens_paralog_orthology_type",
        "hsapiens_paralog_ensembl_peptide"
    )
)
#   hsapiens_paralog_associated_gene_name hsapiens_paralog_orthology_type hsapiens_paralog_ensembl_peptide
# 1                                 MAPK3          within_species_paralog                  ENSP00000263025
# 2                                 MAPK6          within_species_paralog                  ENSP00000261845
# 3                                 MAPK4          within_species_paralog                  ENSP00000383234
# 4                                   NLK          within_species_paralog                  ENSP00000384625
# 5                                 MAPK7          within_species_paralog                  ENSP00000311005

# Get homolog of MAPK1 in mouse
select(
    ensembl,
    keys = c("ENSG00000100030"),
    keytype = "ensembl_gene_id",
    columns = c(
        "mmusculus_homolog_associated_gene_name",
        "mmusculus_homolog_orthology_type",
        "mmusculus_homolog_ensembl_peptide"
    )
)
#   mmusculus_homolog_associated_gene_name mmusculus_homolog_orthology_type mmusculus_homolog_ensembl_peptide
# 1                                  Mapk1                 ortholog_one2one                ENSMUSP00000065983

biomaRt’s original interface

The select() function we used is not biomaRt’s original interface. In fact, keys and columns are interpreted as BioMart’s filters and attributes respectively. To find all available filters and attributes,

filters = listFilters(ensembl)
attributes = listAttributes(ensembl)

Each of the commands returns a data.frame that contains each filter’s or attribute’s name and description.

Behind the scenes, the arguments of select(db, ...) are converted to a getBM(mart, ...) call. The same example of finding RefSeq and Ensembl transcript IDs can be rewritten as

getBM(
    attributes = c(
        "refseq_mrna", "ensembl_transcript_id",
        "chromosome_name",
        "transcript_start", "transcript_end", "strand",
        "hgnc_symbol", "entrezgene", "ensembl_gene_id"
    ),
    filters = "hgnc_symbol",
    values = c("MAPK1"),
    mart = ensembl
)

Conversion between genomic coordinate systems

Sometimes we need to convert between different versions of the reference. For example, say we’d like to convert a batch of genomic locations from reference hg38 to hg19, so we can compare our new research with previous studies. It is a non-trivial task that can currently be handled by the following tools:

CrossMap (used by Ensembl)
liftOver (used by UCSC)

Frankly I don’t have experience with such conversion in a real study (the converted result still leaves me with a sense of unease), but here I follow the guide in the PH525x series. In Bioconductor, we can use UCSC’s chain file with the liftOver() method provided by the rtracklayer package. To convert regions from hg38 to hg19, we need the hg38ToHg19.over.chain file, which can be found at ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/.

We still use MAPK1 as the example for conversion. First extract MAPK1’s genomic ranges in hg38 and hg19 respectively,

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
tx38 <- TxDb.Hsapiens.UCSC.hg38.knownGene
tx19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
MAPK1_hg38 <- genes(tx38, filter=list(gene_id="5594"))
MAPK1_hg19 <- genes(tx19, filter=list(gene_id="5594"))

Then we convert MAPK1_hg38 to use the hg19 coordinate system.

library(rtracklayer)
ch <- import.chain("./hg38ToHg19.over.chain")
MAPK1_hg19_lifted <- liftOver(MAPK1_hg38, ch)

MAPK1_hg19
# GRanges object with 1 range and 1 metadata column:
#        seqnames               ranges strand |     gene_id
#           <Rle>            <IRanges>  <Rle> | <character>
#   5594    chr22 [22113947, 22221970]      - |        5594
#   -------
#   seqinfo: 93 sequences (1 circular) from hg19 genome

MAPK1_hg19_lifted
# GRangesList object of length 1:
# $5594
# GRanges object with 2 ranges and 1 metadata column:
#       seqnames               ranges strand |     gene_id
#          <Rle>            <IRanges>  <Rle> | <character>
#   [1]    chr22 [22113947, 22216652]      - |        5594
#   [2]    chr22 [22216654, 22221970]      - |        5594
#
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths

So the conversion worked as expected, though it created a gap in the range (missing a base at 22216653). I haven’t looked into the results further. To ensure the correctness of the conversion, a comparison with CrossMap may be needed.
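
The one-base gap can also be found programmatically. The following is a minimal Python sketch, not part of the Bioconductor workflow above: `find_gaps` is a hypothetical helper, and the coordinates are copied by hand from the MAPK1_hg19_lifted output (1-based, inclusive ranges).

```python
def find_gaps(ranges):
    """Return the (start, end) gaps between 1-based inclusive ranges."""
    if not ranges:
        return []
    ordered = sorted(ranges)
    gaps = []
    covered_end = ordered[0][1]
    for start, end in ordered[1:]:
        if start > covered_end + 1:
            # Bases strictly between the covered region and the next range
            gaps.append((covered_end + 1, start - 1))
        covered_end = max(covered_end, end)
    return gaps


# The two lifted hg19 ranges of MAPK1 printed above
lifted = [(22113947, 22216652), (22216654, 22221970)]
gaps = find_gaps(lifted)
print(gaps)
```

A check like this would make it easy to compare the gaps reported by liftOver and CrossMap across many genes instead of inspecting them one by one.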

Summary

We skimmed through OrgDb and TxDb again using the Ensembl references, including how to build the TxDb for Ensembl locally and obtain external annotations from AnnotationHub.

BioMart is an abundant resource to query across various types of databases and references, which can be used in conversion between different naming systems.

Finally, we know how to convert between different versions of the reference. Though the correctness of the conversion requires further examination (which does not mean it is wrong), at least the conversion by liftOver works as expected.

Starting here, you should have no trouble dealing with annotations in R anymore. For the next post, I plan to further explore the way to read sequencing analysis results in R.

Note that the select(mart, ...) compatibility does not apply to all existing filters (keys) and attributes (columns) of the given Mart. ↩

Deploy Django with uWSGI, nginx, and systemd

2016-05-19T00:00:00-05:00

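The last time I seriously documented a Django deployment was in the post "Setting Up an Auto-Update Server for the Chinese Translation of the Official Python Documentation". As it happens, my graduation project also needs a Django site, so I went through the deployment setup once more. The old post covered a lot of other things too; this one focuses only on deploying Django.

The deployment setup here avoids root privileges wherever possible. The overall connection flow is: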

nginx -- unix socket -- uWSGI -- Django

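We write a systemd unit named PROJ.service to manage whether the site is running. From here on, replace PROJ with your own project name and USER with the account that runs the site.

Operating system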
PostgreSQL
Django PROJ
tmpfiles.d
nginx
uWSGI
systemd
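Verification and summary

Operating system

We use Ubuntu 16.04 LTS. I have no particular love for Ubuntu, but since so many people use it, it should still be possible to find someone to maintain the server after I graduate. It is much like Debian, so there is little difference from the old post. Ubuntu 16.04 ships with Python 3.5, so there is no need to install it separately; PostgreSQL is now at version 9.5.

Use unattended-upgrades to periodically update security-related packages. By default it checks once a day, and the update logs are kept in the /var/log/unattended-upgrades directory.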

PostgreSQL

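Follow the post "Installing PostgreSQL 9 on Debian Jessie / OSX". Create a PostgreSQL role with the same name as the OS user and grant it permission to create databases, which makes development easier. No password is needed.

Django PROJ

Use the built-in venv to create a virtual environment named VENV somewhere under your home directory: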

python3.5 -m venv VENV

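For the deployment settings (i.e. settings.py), use django-environ to keep the secret key, database connection info, SMTP server for sending mail, and similar settings in a separate file, so that the local and production environments each read their own configuration. For a concrete example, see how the PyCon Taiwan 2016 website manages its settings.

Connect to PostgreSQL through a local connection (Unix-domain socket), that is, as the role with the same name as the OS user.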

DATABASE_URL=postgres:///TABLE_NAME
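
To see why this URL ends up using the local socket, it can help to split it into its parts. The sketch below uses only the Python standard library and is not how django-environ itself parses the value; `describe_database_url` is a hypothetical helper. The point is that the URL carries no user and no host, so the PostgreSQL client typically falls back to the current OS user and the Unix-domain socket, and the path component is the database name.

```python
from urllib.parse import urlparse


def describe_database_url(url):
    """Split a DATABASE_URL into the parts a database backend cares about."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,          # None -> fall back to the OS user name
        "host": parts.hostname,          # None -> no TCP host, use the local socket
        "name": parts.path.lstrip("/"),  # database name
    }


info = describe_database_url("postgres:///TABLE_NAME")
print(info)
```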

tmpfiles.d

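The socket that nginx and uWSGI use to talk to each other lives under /run/PROJ. This also means the /run/PROJ directory disappears after a reboot, so we use tmpfiles.d¹ to recreate it. Apart from naming the directory after the project, the configuration is the same as in the old post.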

nginx

The nginx configuration is the same as in the old post. Put it in /etc/nginx/sites-available/PROJ.conf:

# Upstream Django setting; the socket nginx connects to
upstream django {
    server unix:///run/PROJ/django.sock;
}

server {
    listen      80;
    listen      443 default ssl;

    server_name 123.123.123.123
                ;
    charset     utf-8;

    client_max_body_size 10M;  # max upload size
    keepalive_timeout 15;

    location /static {
        alias /path/to/PROJ/assets;
    }

    # Finally, send all non-media requests to the Django server.
    location / {
        uwsgi_pass  django;
        include     /etc/nginx/uwsgi_params;
    }
}

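/path/to/PROJ/assets is the path of Django’s STATIC_ROOT. Once you have run python manage.py collectstatic, you can test whether nginx serves /static/…/ even before uWSGI is configured.

To enable the site, symlink the file into /etc/nginx/sites-enabled/ and reload the nginx configuration: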

cd /etc/nginx/sites-enabled/
sudo ln -s ../sites-available/PROJ.conf .
sudo systemctl reload nginx

uWSGI

The biggest difference from the old post is that uWSGI only needs to be installed inside VENV, and emperor mode is not used. Put the configuration in /etc/uwsgi/vassals/PROJ.ini:

[uwsgi]
chdir        = /path/to/PROJ
# Django's wsgi file
module       = PROJ.wsgi:application
env          = DJANGO_SETTINGS_MODULE=PROJ.settings.production
# the virtualenv (full path)
home         = /path/to/VENV

# process-related settings
# master
master       = true
# maximum number of worker processes
processes    = 4
# the socket (use the full path to be safe)
socket       = /run/PROJ/django.sock
# ... with appropriate permissions - may be needed
chmod-socket = 664
uid          = USER
gid          = www-data
# clear environment on exit
vacuum       = true

Once configured, run the following command and the site should be up.

sudo /path/to/VENV/bin/uwsgi --ini /etc/uwsgi/vassals/PROJ.ini

systemd

Apart from the command that runs uWSGI, this is the same as in the old post. On Debian-based systems, the systemd system unit file goes in /etc/systemd/system/PROJ.service:

[Unit]
Description=PROJ's Django server by uWSGI
After=syslog.target

[Service]
ExecStart=/path/to/VENV/bin/uwsgi --ini /etc/uwsgi/vassals/PROJ.ini
Restart=always
KillSignal=SIGQUIT
Type=notify
StandardError=syslog
NotifyAccess=all

[Install]
WantedBy=multi-user.target

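This unit restarts the service automatically (for example after an error) and sends stderr to syslog. Next, enable and start the PROJ.service service:

sudo systemctl enable PROJ
sudo systemctl start PROJ
sudo systemctl status PROJ

You can inspect uWSGI’s runtime and connection logs with sudo journalctl -xe -u PROJ.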

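Verification and summary

Reboot the system once. If the site is still alive afterwards, everything is configured correctly. Overall it is not too complicated, but permission errors can send you around in circles, so be patient.

You could also create the runtime directory with RuntimeDirectory=PROJ, as described in systemd.exec(5). However, because the user of PROJ.service has to be root, the man page recommends tmpfiles.d in this case. I think the need for root privileges could be solved, but I was too lazy, so it stays this way for now. ↩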

Jupyter Notebook Progress Bar

2016-03-23T02:00:00-05:00

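I believe many people already run their analyses in Jupyter (IPython) Notebook. As the amount of data grows, a single run can take tens of minutes or even hours. Waiting without a progress bar is rather boring, and a progress bar makes you feel like you are getting somewhere, so why not?

For example, when I introduced aiohttp last year I used progress bars in both the notebook and the console. However, with the architectural changes in Jupyter Notebook 4+ over the past few months, that code may no longer work. Someone happened to bring this up at yesterday’s Taipei.py meetup, so let me write it up.

Introduction to ipywidgets
- Installation
Using the progress bar
Progress bar example in action
Misc. Screenshots and screencasts
- Transcoding with FFmpeg

Introduction to ipywidgets

The notebook progress bar is implemented with a widget from ipywidgets. Widgets define two-way communication between the notebook client and server, and bundle the related CSS / JS together. The Widgets Basics example of ipywidgets mentions the possible uses: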

You can use widgets to build interactive GUIs for your notebooks.
You can also use widgets to synchronize stateful and stateless information between Python and JavaScript.

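So besides one-way communication from Python code to the notebook (HTML) front end, as with a progress bar, you can also build interfaces that send values from the front end back to Python code.

...But we only want a progress bar, so that is more than enough background. For more, see the official ipywidgets examples.

Installation

The demo uses Python 3.5. Besides notebook itself, you also need to install the ipywidgets package.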

pip install notebook ipywidgets

Then start the notebook with jupyter notebook.

Using the progress bar

from ipywidgets import IntProgress
from IPython.display import display

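IntProgress is the progress bar, and display is IPython’s function for displaying all kinds of Python objects. We need it here to render the widget as HTML and keep it linked to the Python code.

Creating a progress bar is simple. Create an IntProgress widget object, then display it: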

p = IntProgress()
display(p)

By default the progress bar has 100 units and starts at 0. Its current value and maximum are stored in the .value and .max attributes:

>>> p.value, p.max
(0, 100)

Whenever you modify p.value, the progress bar displayed earlier updates automatically (no need to rerun display(p)):

p.value = 50

p.value = 100

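Naturally, changes to the maximum are reflected immediately as well. You can also give the progress bar a label through .description. Here is a complete example from scratch: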

p2 = IntProgress(max=56)
p2.value += 10
p2.description = 'Running'
display(p2)

That is all the code there is; it is very convenient to use.

Progress bar example in action

To simulate a realistic situation, we usually have a pile of tasks waiting to be done. Let’s call them todo_tasks here.

import time
todo_tasks = ['task %02d' % i for i in range(50)]

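Each task is just a string, but we use time.sleep(sec) to simulate doing some work.

Combined with the progress bar, the actual behavior looks like the animation below.

# Initialize a progress bar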
progress = IntProgress()
progress.max = len(todo_tasks)
progress.description = '(Init)'
display(progress)
time.sleep(0.25)

# Simulating task execution
for task in todo_tasks:
    progress.value += 1
    time.sleep(0.05)
    progress.description = task
progress.description = '(Done)'

Progressbar in action

Pretty convenient, isn’t it?
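
The update logic in the loop above can be factored into a small reusable generator. This is only a sketch: `track` and `FakeBar` are hypothetical names, not part of ipywidgets. The helper touches nothing but the .max, .value and .description attributes, so in a notebook you would pass an IntProgress() that has already been shown with display(); the stand-in class is there so the sketch also runs outside a notebook.

```python
import time


def track(tasks, bar, label="{}"):
    """Yield each task while keeping a progress-bar-like object in sync.

    `bar` only needs .max, .value and .description attributes, so an
    ipywidgets IntProgress works, and so does a plain stand-in object.
    """
    tasks = list(tasks)
    bar.max = len(tasks)
    bar.value = 0
    bar.description = "(Init)"
    for task in tasks:
        bar.description = label.format(task)
        yield task
        bar.value += 1        # count the task only after it has finished
    bar.description = "(Done)"


class FakeBar:
    """Stand-in for IntProgress so the sketch runs outside a notebook."""
    max = 100
    value = 0
    description = ""


bar = FakeBar()
for task in track(["task %02d" % i for i in range(5)], bar):
    time.sleep(0.01)          # pretend to do some work
print(bar.value, bar.max, bar.description)
```

Unlike the loop above, this variant increments the value after each task completes, so the bar never shows more progress than has actually been made.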

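Misc. Screenshots and screencasts

Most of the time spent writing this post went into taking screenshots and making the animation XD

For browser screenshots, Firefox (45) can now capture just a selected DOM node, which is very handy.

Recording the final animation took quite a while. At first I considered GIF, but it is 2016, why would anyone still use GIF! Although one could tweak the GIF palette so the picture is not too ugly while keeping the file small, H.264 / VP9 felt much simpler.

I used QuickTime screen capture, which lets you select just a region of the screen when you start recording. On my 13” Retina display this produced a 1636x736 H.264 .mov file. I did not think the resolution needed to be that high, so the final output is 480p (1148x480), with some of the white border cropped off along the way.

With HTML5 <video>, MP4 / WebM files can be used like animations: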

<video loop autoplay>
    <source src="vid.webm" type="video/webm">
    <source src="vid.mp4" type="video/mp4">
    Your browser doesn't support HTML5 video in WebM with VP8
    or MP4 with H.264. You can still download the
    <a href="vid.mp4">screencast</a> and view it locally.
</video>

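For browser support, see caniuse.com: WebM, MP4

Transcoding with FFmpeg

These are just notes; I did not seriously tune the parameters to minimize the output size. For the VP9 part, see FFmpeg¹.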

# H.264 MP4
ffmpeg -i Untitled.mov \
    -vcodec h264 \
    -strict -2 -crf 22 -preset slow -r 24 \
    -vf "crop=iw:ih-52:0:10, scale=-1:480" \
    out.mp4

# VP9 WebM
ffmpeg -i Untitled.mov \
    -vcodec libvpx-vp9 \
    -b:v 150K -r 24 \
    -vf "crop=iw:ih-52:0:10, scale=-1:480" \
    out.webm

The 4-second clip ends up at about 60 KB, which is quite good. Many of my PNG screenshots are far larger.

$ du -sh ./* | gsort -rh
744K    ./Untitled.mov
 60K    ./out.mp4
 52K    ./out.webm

libvpx needs to be installed additionally, for example: brew install ffmpeg --with-libvpx ↩

HTTPS with Let's Encrypt for the Chinese Python Documentation Server

2016-02-21T14:00:00-06:00

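The Chinese translation site of the official Python documentation is now reachable at https://docs.python.org.tw.

For the server setup itself, see the previous post. Adding HTTPS requires setting up a certificate, and I chose Let’s Encrypt as the certificate issuer.

Let’s Encrypt (LE) uses the ACME (Automatic Certificate Management Environment) protocol to verify that you control the domain you want a certificate for. The official site has an illustrated explanation. In short, LE asks your server to place a specific file at a specific path; if you can do that, you are considered to control the domain. You register with the LE server the first time, and afterwards the certificate has to be renewed at least once every 90 days.

References
Let’s Encrypt Certificate
Renewing the certificate periodically with a systemd timer
- Systemd service
- Systemd Timer
nginx HTTP redirect to HTTPS
Testing the HTTPS setup
Thoughts
Misc.

References

I am not a network security expert; the configuration is compiled from instructions found online. The LE certificate setup follows the article How to Secure Your Web App Using HTTPS With Letsencrypt by Rob McLarty.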

Let’s Encrypt Certificate

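I did not use LE’s official client but acme-tiny, written by Daniel Roesler. It is a Python script of fewer than 200 lines, so you can check for yourself that it does nothing strange. The acme-tiny README also has a setup tutorial, which should be much the same.

I basically followed How to Secure Your Web App Using HTTPS With Letsencrypt, with the following adjustments:

1. Disable password login when creating the letsencrypt account, i.e. adduser --disabled-password ...
2. Manage the acme-tiny script with git.
3. Control nginx through systemd instead of service, i.e. systemctl reload nginx
4. Use systemd timers instead of crontab.
5. Redirect HTTP connections to HTTPS.

Points 4 and 5 involve more configuration, so they are covered below.

Renewing the certificate periodically with a systemd timer

This part follows the article RHEL7: How to use Systemd timers.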

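A systemd timer is conceptually like crontab, the difference being that a timer is integrated with systemd. <unit>.timer periodically runs <unit>.service, so the results and status of each run show up in systemd, and you also get the logging features of journald.

The commands that renew the certificate live in /etc/letsencrypt/renew_cert.sh. The content of the shell script is the same as in the referenced article.

Systemd service

First create a renew_cert service. On Debian, for example, put it in /etc/systemd/system/renew_cert.service,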

[Unit]
Description=Renew Let's Encrypt cert
After=syslog.target

[Service]
User=letsencrypt
Group=letsencrypt
ExecStart=/etc/letsencrypt/renew_cert.sh
Type=simple
StandardError=syslog

[Install]
WantedBy=multi-user.target

To renew the certificate manually, just start this service,

systemctl start renew_cert

We do not need to enable it, otherwise it would run on every boot. To check the result or the logs,

systemctl status renew_cert
journalctl -e -u renew_cert

Systemd Timer

Create /etc/systemd/system/renew_cert.timer:

[Unit]
Description=Update Let's Encrypt certificate every two months

[Timer]
OnCalendar=*-1/2-1 16:00:00
Unit=renew_cert.service

[Install]
WantedBy=multi-user.target

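The only important part is the [Timer] section. Unit= names the service to start. OnCalendar=¹ makes the timer fire at the specified points in time (in UTC²).

The time used here, *-1/2-1 16:00:00, means the certificate is renewed at 16:00 UTC on the 1st day of month 1+2n of every year, that is, at midnight on the 2nd of January, March, May, ... in Taiwan time.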
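
The time zone arithmetic is easy to get wrong, so here is a small Python sketch that enumerates the next firing times of this particular expression and converts them to Taiwan time. It does not parse OnCalendar syntax in general; `next_renewals` is a hypothetical helper that hard-codes the rule "1st day of every odd month at 16:00", and it assumes the timer is evaluated in UTC as described above.

```python
from datetime import datetime, timedelta, timezone

# Taiwan has no daylight saving time, so a fixed UTC+8 offset suffices
TAIPEI = timezone(timedelta(hours=8))


def next_renewals(after, count=3):
    """Next `count` firings of OnCalendar=*-1/2-1 16:00:00 (UTC) after `after`."""
    runs = []
    year = after.year
    while len(runs) < count:
        for month in range(1, 13, 2):  # 1/2 -> months 1, 3, 5, ..., 11
            run = datetime(year, month, 1, 16, 0, tzinfo=timezone.utc)
            if run > after and len(runs) < count:
                runs.append(run)
        year += 1
    return runs


now = datetime(2016, 2, 21, tzinfo=timezone.utc)
runs = next_renewals(now)
for run in runs:
    print(run.isoformat(), "->", run.astimezone(TAIPEI).isoformat())
```

The first entry should agree with the NEXT column that systemctl list-timers reports for the timer.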

Enable the timer. It has to be enabled so that it is activated again after a reboot.

systemctl enable renew_cert.timer
systemctl start renew_cert.timer

You can check the next run time with systemctl list-timers:

$ sudo systemctl list-timers renew_cert.timer
NEXT                         LEFT                LAST PASSED UNIT             ACTIVATES
Tue 2016-03-01 16:00:00 UTC  1 weeks 2 days left n/a  n/a    renew_cert.timer renew_cert.service

nginx HTTP redirect to HTTPS

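(I am not confident about this configuration. If you know a better way, please let me know > <)

The problem to solve: the ACME challenge goes over HTTP, while all other connections are redirected to HTTPS.

In nginx, remove listen 80; and the ACME challenge part from the main configuration, and move them into a new server block: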

server {
    listen 80;
    server_name docs.python.org.tw;

    # For Let's Encrypt ACME challenge files
    location /.well-known/acme-challenge/ {
        alias /var/www/challenges/;
        try_files $uri =404;
    }

    location / {
        return 301 https://$host$request_uri;
    }
}

Testing the HTTPS setup

I tested it with securityheaders.io and SSL Labs, and the results look acceptable:

Report from securityheaders.io (Live Report)

Report from SSL Labs (Live Report)

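Thoughts

In summary, using Let’s Encrypt is not hard, but not exactly trivial either.

If you are willing to give it root and access to your private key, letsencrypt-auto from the official client takes fewer steps, and with Apache it claims to configure everything automatically. If the acme-tiny commands feel too complicated, its author also wrote the Get HTTPS for free! service, which walks you through the whole registration process in a web page.

Note that during the current public beta, certificates for the same domain can only be issued 5 times within 7 days, so do not get carried away while testing or you will have to wait a week.

Misc.

When creating the CSR (Certificate Signing Request) file, you can include your own email address; otherwise the default is webmaster@<domain>: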

openssl req -new -sha256 \
    -key /etc/letsencrypt/private/domain.key \
    -subj "/CN=docs.python.org.tw/emailAddress=me+pydoctw@liang2.tw" \
    > /etc/letsencrypt/private/domain.csr

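Besides OnCalendar there are many other ways to schedule a timer, such as OnUnitActiveSec. However, those other schemes all depend on whether the machine was running, which affects how the time is calculated. ↩
The Calendar Events section of systemd.time(7) in Debian Jessie offers no way to specify a time zone, so adding one causes a parse error. Newer versions of systemd seem to support time zones, though. In any case, use systemctl list-timers to confirm the actual run time. ↩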

Add code block language name into CSS classes in Pelican Markdown

2016-02-19T15:00:00-06:00

I use Pelican and its Markdown plugin to render my blog posts.

Recently I was playing with the Python Official Documentation, which has a decent code syntax highlighter powered by Pygments.

What’s more, the output of code examples can be toggled. That is, a code example:

>>> print('Hello World')
Hello World
>>> 6 * 7
42

can be toggled to:

print('Hello World')

6 * 7

which is very convenient for code copy-pasting.

However, this functionality (provided by copybutton.js) is currently broken on the official Python docs because a jQuery update changed the previous API behavior. I’ve filed issue 26246 on the Python issue tracker for this problem. (EDIT 2016-02-27: the patch has been merged.)

Code output toggle in Pelican

After I fixed the copybutton.js, I wished to add this functionality to my blog.

Code highlighting in Pelican Markdown files is handled by the CodeHilite extension. To my surprise, I found that CodeHilite does not expose the language name specified for each code block.

What I expected was

<div class="highlight-python3">
    <div class="highlight">
        <pre>
            <!-- ... -->
        </pre>
    </div>
</div>

but the actual output was

<div class="highlight">
    <pre>
        <!-- ... -->
    </pre>
</div>

So there is no way to find the language name a code block used, nor the lexer alias Pygments guessed when no language name was specified.

A quick dig into the source code showed that it is relatively easy to fix. Here is the diff:

diff --git a/extensions/codehilite.py b/extensions/codehilite_updated.py
index 0657c37..4fad7c5 100644
--- a/extensions/codehilite.py
+++ b/extensions/codehilite_updated.py
@@ -75,7 +75,8 @@ class CodeHilite(object):

     def __init__(self, src=None, linenums=None, guess_lang=True,
                  css_class="codehilite", lang=None, style='default',
-                 noclasses=False, tab_length=4, hl_lines=None, use_pygments=True):
+                 noclasses=False, tab_length=4, hl_lines=None, use_pygments=True,
+                 wrap_by_lang=True):
         self.src = src
         self.lang = lang
         self.linenums = linenums
@@ -86,6 +87,7 @@ class CodeHilite(object):
         self.tab_length = tab_length
         self.hl_lines = hl_lines or []
         self.use_pygments = use_pygments
+        self.wrap_by_lang = wrap_by_lang

     def hilite(self):
         """
@@ -114,13 +116,22 @@ class CodeHilite(object):
                         lexer = get_lexer_by_name('text')
                 except ValueError:
                     lexer = get_lexer_by_name('text')
+            lang = lexer.aliases[0]
             formatter = get_formatter_by_name('html',
                                               linenos=self.linenums,
                                               cssclass=self.css_class,
                                               style=self.style,
                                               noclasses=self.noclasses,
                                               hl_lines=self.hl_lines)
-            return highlight(self.src, lexer, formatter)
+            hilited_html = highlight(self.src, lexer, formatter)
+            if self.wrap_by_lang and self.lang:
+                return '<div class="%(class)s-%(lang)s">%(html)s</div>\n' % {
+                    'class': self.css_class,
+                    'lang': lang.replace('+', '-'),
+                    'html': hilited_html,
+                }
+            else:
+                return hilited_html
         else:
             # just escape and build markup usable by JS highlighting libs
             txt = self.src.replace('&', '&amp;')

I’m happy with the patched codehilite output. I can now add the output toggle to code blocks of specific languages.
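
Stripped of the surrounding class, the wrapping step in the diff boils down to a few lines. The function below is a standalone sketch for illustration only, not the CodeHilite API: `wrap_by_lang` as a free function is a hypothetical name, and it takes the already highlighted HTML as a plain string so it runs without Pygments installed.

```python
def wrap_by_lang(hilited_html, lang, css_class="codehilite"):
    """Wrap highlighted HTML in a <div> whose class carries the language alias."""
    if not lang:
        # No language was specified: keep the output unchanged
        return hilited_html
    # Aliases such as 'html+jinja' become 'html-jinja', as in the diff above,
    # so the class name stays easy to use in CSS selectors
    safe_lang = lang.replace("+", "-")
    return '<div class="%s-%s">%s</div>\n' % (css_class, safe_lang, hilited_html)


html = '<div class="codehilite"><pre>...</pre></div>'
wrapped = wrap_by_lang(html, "python3")
print(wrapped)
```

A script can then target, say, every div.codehilite-python3 element to attach the toggle button only to Python examples.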

However, I am quite busy these days, so it may take a while to submit a proper pull request (e.g. fix any broken unit tests, write new tests, and tune the API as well as the new behavior). Moreover, my site currently does not use jQuery, so I am missing a huge dependency. Rewriting it in vanilla JS seems to require considerable work, and the very thing I don’t have at hand is time :(

I’ve decided to leave this improvement for future development. But if your site uses Pelican Markdown and imports jQuery, the diff will add the code language back.

Setting Up an Auto-Update Server for the Chinese Translation of the Official Python Documentation

2016-02-14T21:00:00-06:00

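TL;DR Visit http://docs.python.org.tw to see the automatically updated Chinese documentation and the build server live.

EDIT 2016-02-16: Added the language code, git ssh config, and swap settings; polished the wording.
EDIT 2016-02-20: Added the tmpfiles.d settings.

The Python documentation Chinese translation project

Lately I have been preparing the Python documentation Chinese translation project. The translation itself has not been progressing very actively yet, but after the last few Taipei.py Projects On sprints, quite a few people have joined in. Everyone has their own topic to translate; I started with the Tutorial.

How Sphinx handles multilingual documentation

Let me first introduce the structure of the CPython Documentation (pydoc below) and how it is translated. pydoc is a standard Sphinx project, so the translation relies on Sphinx’s built-in internationalization (i18n or intl) feature to convert the content into other languages.

As in projects like Django, i18n goes through gettext. For each rst file, Sphinx emits a po file with the same name. Every text paragraph in the rst file maps to one entry in the po file, while irrelevant blocks such as code examples are skipped. The generated po files are placed at the corresponding path, for example locale/<lang>/LC_MESSAGES/xxx.po.

The po format is simple. Skipping the assorted headers, the actual content looks like this: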

#: ../../tutorial/appetite.rst:50
msgid ""
"Python enables programs to be written compactly and readably.  Programs "
"written in Python are typically much shorter than equivalent C,  C++, or "
"Java programs, for several reasons:"
msgstr ""
"Python 讓程式寫得精簡並易讀。用 Python 實作的程式長度往往遠比用 "
"C、C++、Java 實作的短。這有以下幾個原因："

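In practice Sphinx first writes a clean po template (called a pot file) to locale/pot/, which is basically a po file containing only the original text. For each new language, a po file is created from the pot file under its own locale/<lang>/ directory, and translators simply edit that po file.

Once translation is done, Sphinx first calls gettext to compile the po files into mo files, which speeds up looking up translated strings. To build the translated documentation, just set a different language: Sphinx looks for that language’s mo files and replaces the original strings with their content, and you get the Chinese documentation.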
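
To make the entry format concrete, here is a toy Python parser for a single po entry. It is a sketch for illustration, not how gettext or Sphinx read po files: `parse_po_entry` is a hypothetical helper that handles only msgid and msgstr (no plural forms or msgctxt) and relies on the quoted fragments being valid Python string literals, which holds for the simple escapes used here.

```python
import ast


def parse_po_entry(entry):
    """Return (msgid, msgstr) from the text of a single po entry."""
    fields = {"msgid": [], "msgstr": []}
    current = None
    for line in entry.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and source comments
            continue
        keyword, _, rest = line.partition(" ")
        if keyword in fields:                  # 'msgid "..."' starts a new field
            current, line = keyword, rest
        if current is not None and line.startswith('"'):
            # Each line holds one quoted fragment; fragments are concatenated
            fields[current].append(ast.literal_eval(line))
    return "".join(fields["msgid"]), "".join(fields["msgstr"])


entry = '''
#: ../../tutorial/appetite.rst:50
msgid ""
"Python enables programs to be written compactly and readably.  Programs "
"written in Python are typically much shorter than equivalent C,  C++, or "
"Java programs, for several reasons:"
msgstr ""
"Python 讓程式寫得精簡並易讀。用 Python 實作的程式長度往往遠比用 "
"C、C++、Java 實作的短。這有以下幾個原因："
'''
msgid, msgstr = parse_po_entry(entry)
print(msgid)
print(msgstr)
```

The multi-line layout is only a wrapping convention: the fragments join into one msgid string that is looked up and replaced by the joined msgstr.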

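Transifex: an online service for collaboratively translating po files

That is the whole Sphinx translation workflow, so translating only requires editing the Chinese (lang code: zh-Hant¹) po files. Writing po files directly is still too high a barrier, though, which is why sites like Transifex exist. Upload the po/pot files, edit the translations online, and download the results again in po format. I think this is currently the lowest-barrier way to take part in gettext-based po translation; the Japanese translation at least does the same. So anyone who wants to join the pydoc translation just needs to log in to Transifex and start editing.

Transifex has additional benefits. For example, its POS tagging can mark technical terms and define a unified translation for each; these are collected in the glossary, and a hint pops up automatically whenever such a term appears during translation. Similar source sentences are listed as suggestions, which keeps wording and grammar consistent across translations. There are also edit history, sanity checks (e.g. required markup missing from the translation), comments, and issues.

Improving the translation experience

By now the conventions for terminology, workflow, and so on have taken shape, and all of it can be found in the project wiki. So I started thinking about how to make it easier for everyone to take part and to see the results of their translation.

I found that taking part in the translation is no longer difficult, and people have few questions. Transifex issues and comments work well for keeping terminology consistent and for discussing translations. Overall we can keep an extremely distributed way of working.

The most common problems are rst syntax errors, missing required whitespace, text that no longer reads smoothly once the surrounding code examples are included, and translations that distort or misread the original. I think translators can spot these themselves by reading the translated pydoc page and glancing at the build log, without me having to explain much.

Besides, not being able to see the result of your own translation is unrewarding, and I am afraid people would lose motivation after a while.

So what we need is a continuously updated build of the translation. The steps for building the docs yourself are all in the wiki, of course, but there are many of them, they are not all that simple, and any error or problem would probably end up with me, which defeats the distributed division of labor.

Why not build an autobuild server?

That is how the idea came about. But it really is a big undertaking, and for a long time I could only think about it. Over the Lunar New Year holiday I finally found time to build a prototype, which felt quite rewarding.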

PyDoc Autobuild Server

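A brief list of requirements:

The URLs of the built PyDoc mirror the upstream https://docs.python.org/. For example, /3/ is the latest Python 3.x, and /3.5/ currently redirects to /3/².
Every page has an update-translation link; clicking it fetches the latest translation from Transifex and rebuilds the output.
The command output of every page update is kept, to make it easy to check for rst syntax errors and the like.
Translation updates go through a queue, so that the autobuild server does not blow up when many people work at once.
All documents are rebuilt daily and the updates are committed to the CPython-tw git repo. This process is logged as well.
All of the above can be set up easily on a local machine.

Implementation

The goal is to meet the requirements above. pydoc is basically a static site, so it is enough to have nginx host the static files at the right paths. The pydoc Sphinx project uses Jinja2 for its HTML templates, so adding a few variables is enough to control the page output and add the extra links when building on the autobuild server. The autobuild server itself is a task queue. Its functionality is simple, but for ease of maintenance, and because it has to work in both local and production environments, I chose Django as the base. The only parts Django really manages are the views that show the task queue, show task results, and accept rebuild requests.

Sphinx documentation

I did not want to make the Sphinx side too complicated, so every page simply gets its own dedicated link; requesting that URL adds a task to the autobuild server that updates the page³.

Adding the dedicated link during autobuild only requires modifying the Sphinx doc template. When building, Sphinx accepts -A <name=value> to add variables to the Jinja2 templates, which lets us control how the template renders: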

{# <cpython-src>/Doc/tools/templates/layout.html #}
{%- if autobuildi18n %}
<a href="/_build/update/?source_path={{ pagename }}">Update Translation</a>
{%- endif %}

With sphinx-build -A autobuildi18n=1, this Jinja2 block is included and the Update Translation link appears.
{{ pagename }} is the rst path of each page.
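
For illustration, the link and the way a server might read it back can be sketched with the Python standard library. Both helper names are hypothetical, and the server-side half is an assumption, not the actual Django view of the autobuild server. Note that, unlike the template above, this sketch percent-encodes the page name, which is the safer choice if a path ever contains characters that are special in URLs.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def update_url(pagename):
    """Build the per-page update link, mirroring the template above."""
    return "/_build/update/?" + urlencode({"source_path": pagename})


def source_path_from(url):
    """Recover the page name from the link (hypothetical server-side step)."""
    return parse_qs(urlsplit(url).query)["source_path"][0]


url = update_url("tutorial/appetite")
print(url)
print(source_path_from(url))
```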

Autobuild Django server

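The Django server’s job is to accept task requests and show task results. A standard task queue provides exactly these features.

There are many task queue options for Django. The Workers, Queues, and Tasks grid on Django Packages lists several that are actively maintained and widely used:

After excluding the packages that do not support Python 3⁴, the choices left are django-celery, django-RQ, and django-Q. The most popular and most established of these is django-celery. It integrates with Celery and is full-featured and stable; I have used it and found it very good. Its drawbacks are that it has so many features that it is somewhat complex, and that different message queues require a lot of configuration, so it takes a while to get started. Celery is commonly paired with RabbitMQ or Redis, which is indeed necessary when there are many tasks. Our doc builds, however, may happen only a dozen or so times a day, and since the build environment is not isolated, only one worker can run at a time, so performance is not a concern. I therefore prefer to use the same database as Django and avoid any additional non-Python dependency, which keeps local development simple.

In the end I chose django-Q. It is quite new, but the author maintains it diligently, and its workers need nothing beyond Python’s built-in multiprocessing. It is simple yet complete: it includes a monitor, integrates with the Django admin, and supports scheduling. Starting the django-Q cluster only takes one extra command,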

python manage.py qcluster

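and that is it, which is very convenient.

How to use django-Q is beyond the scope of this post. I will probably submit a talk to PyCon TW or Taipei.py and write it up as a separate post then. The django-Q documentation is very clear; reading through it should be enough to get going.

Deploying the autobuild server

(The real point of this post is deployment; who knew the background could get this long)

There are countless ways to deploy, some better than others. But you should know at least one, so here is one of them:

nginx <-> uwsgi <-> Django

This is a fairly popular combination. More completely, a request passes through: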

web client <-> nginx web server <-> socket <-> uwsgi <-> Django server

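The basic configuration and tutorial come from Setting up Django and your web server with uWSGI and nginx on the uWSGI website, together with uWSGI and Systemd for integrating with systemd.

This is also the current pydoc production setup, recorded here to ease future maintenance.

Operating system

The operating system is Debian Jessie, hosted on Amazon EC2 on a t2.nano instance⁵.

Python web deployments install packages into a virtual environment to avoid conflicts between projects or with the system. On Debian, apt build-dep python3-<pkg> installs the headers and libraries a Python package needs, which is very simple.

Python 3.5 and APT-pinning

My code uses subprocess.run, an API only available in Python 3.5+. Jessie only ships Python 3.4, but I find the API so convenient that I have no desire to rewrite the code to be compatible with older versions.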
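
As a rough idea of why subprocess.run is convenient for a build server that has to keep the command output, here is a minimal sketch. It is not the autobuild server's actual code: `run_and_log` is a hypothetical helper, and the command is a harmless stand-in for a real build command such as sphinx-build.

```python
import subprocess
import sys


def run_and_log(cmd):
    """Run a command, returning (exit code, combined stdout and stderr)."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # interleave stderr into the same log
        universal_newlines=True,    # decode bytes to str (available on 3.5)
    )
    return result.returncode, result.stdout


# Stand-in for a real build command
code, log = run_and_log([sys.executable, "-c", "print('build ok')"])
print(code, log)
```

One call returns both the exit status and the full log, which is what a task result needs to store so that translators can check for rst errors later.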

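So I need to install the latest Python 3.5 from the Debian testing channel. This does raise security concerns, because only the stable channel has security support, but compiling it myself would be even more problematic, so multi-version tools like pyenv are out of the question.

So I use APT pinning to let only the Python 3.5 related packages install from testing. First add the testing channel to /etc/apt/sources.list: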

deb http://cloudfront.debian.net/debian testing main
deb-src http://cloudfront.debian.net/debian testing main
deb http://security.debian.org/ testing/updates main
deb-src http://security.debian.org/ testing/updates main

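Then edit /etc/apt/preferences to make sure we never accidentally install packages from testing, and give the Python 3.5 related packages a priority >= 990 so that they can be installed automatically.

# Specify * rules first so later package-specific rules can override them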
Package: *
Pin: release a=testing
Pin-Priority: -10

Package: python3.5* libpython3.5*
Pin: release a=testing
Pin-Priority: 990

Use sudo apt-cache policy <pkg-name> to check which version the current rules would install.

$ sudo apt-get update 
$ sudo apt-get install python3.5 python3.5-venv python3.5-dev

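This way only the Python 3.5 related packages come from testing.

Database: PostgreSQL

The database is PostgreSQL 9.4, set up following my earlier post "Installing PostgreSQL 9 on Debian Jessie / OSX".

Swap

I only noticed shortly after going live that EC2 has no swap space by default. I am poor, so the production server has just 512 MB of RAM, and I observed that a doc build sometimes fills it completely, so adding swap gives some peace of mind.

Since Amazon EBS SSD does not charge extra for I/O operations (I think?), I created the swap file on the main disk.

There are many tutorials on setting up swap; here I follow the Arch Wiki. I chose to put it at /var/swap.1, sized at twice the RAM, i.e. 1 GB.

First create the file and change its permissions to 600.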

sudo /bin/dd if=/dev/zero of=/var/swap.1 bs=1M count=1024
# or faster with fallocate
sudo fallocate -l 1G /var/swap.1

sudo chmod 600 /var/swap.1

Then format the file as swap and enable it,

sudo /sbin/mkswap /var/swap.1
sudo /sbin/swapon /var/swap.1

Edit fstab so the swap is set up on every boot,

# /etc/fstab
/var/swap.1 none swap defaults 0 0

Check with free -h or cat /proc/meminfo; there should now be 1 GB of swap.
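
If you would rather check this from a script, for example in a deployment sanity check, the SwapTotal line of /proc/meminfo can be parsed directly. A minimal sketch follows: `swap_total_kib` is a hypothetical helper, and the sample text is illustrative rather than output captured from the real server. On the server you would pass it the contents of /proc/meminfo.

```python
def swap_total_kib(meminfo_text):
    """Extract SwapTotal (in KiB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            return int(line.split()[1])   # e.g. 'SwapTotal: 1048572 kB'
    return 0


# Illustrative sample, not output captured from the real server
sample = "MemTotal:  500000 kB\nSwapTotal: 1048572 kB\nSwapFree: 1048572 kB\n"
total = swap_total_kib(sample)
print(total / 1024 ** 2, "GiB")
```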

Git repo ssh config

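Next is syncing and updating the code. The autobuild server repo only needs to pull source code updates, but the cpython-tw source needs new translations committed regularly, so the deploy server has permission to modify that git repo.

You should not use your personal SSH key. The deploy server should have dedicated deploy keys, of which the cpython-tw deploy key has write access (i.e. it can push commits).

After some searching, making different git repos use different SSH keys turns out not to be complicated. For this example, first edit ~/.ssh/config to add two new hosts that use different SSH keys: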

Host github-pydoc_autobuild
  HostName github.com
  User git
  IdentityFile /home/pydoc/.ssh/id_rsa.pydoc_autobuild

Host github-cpython_tw
  HostName github.com
  User git
  IdentityFile /home/pydoc/.ssh/id_rsa.cpython_tw

Create the corresponding SSH key pairs:

ssh-keygen -t rsa -f ~/.ssh/id_rsa.pydoc_autobuild
ssh-keygen -t rsa -f ~/.ssh/id_rsa.cpython_tw

Replace the host in each repo's remote URL:

git remote set-url origin git@github-pydoc_autobuild:python-doc-tw/pydoc_autobuild.git

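Now the two repos connect through their given SSH keys. GitHub shows when each key was last used, so checking that time confirms whether the setup is right (and if the host alias is misconfigured, the connection should simply fail).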
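To make the URL rewrite concrete, here is a small sketch that swaps the host part of an scp-style git URL for an alias defined in ~/.ssh/config. The helper name is hypothetical; it only manipulates strings and does not touch git.

```python
def with_host_alias(ssh_url, alias):
    """Swap the host of an scp-style git URL (user@host:path) for an SSH config alias."""
    user_host, sep, path = ssh_url.partition(":")
    if not sep or "@" not in user_host:
        raise ValueError("expected an scp-style URL like git@github.com:owner/repo.git")
    user, _, _host = user_host.partition("@")
    return f"{user}@{alias}:{path}"


print(with_host_alias(
    "git@github.com:python-doc-tw/pydoc_autobuild.git",
    "github-pydoc_autobuild",
))
```

The result is the value passed to git remote set-url origin.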

tmpfiles.d

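The socket that nginx and uWSGI use to talk to each other will live at /run/django/xxxx.sock ⁶. Since it only needs non-root permissions, I edited the tmpfiles.d settings so that this directory is created automatically at boot. Add the config file /etc/tmpfiles.d/pydoc_autobuild.conf: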

d /run/django 0755 pydoc www-data

Django Stack – nginx + uWSGI

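In local development I always start Django with python manage.py runserver. In production, however, the built-in runserver cannot serve many users at once, so tools like nginx and uWSGI are needed.

I followed the uWSGI article "Setting up Django and your web server with uWSGI and nginx" and the deployment article Day 27 - Deploy to Ubuntu server from TP's series《為程式人寫的 Django Tutorial》.

The autobuild server has a settings file written specifically for production; switching only requires setting it to settings.production. For the Django settings, I recommend making every path absolute (including executables). Otherwise you will have to adjust many environment variables when setting up systemd. systemd does not pass in the user's PATH either, so not using absolute paths is a hassle and error-prone.

nginx configuration

nginx accepts incoming HTTP requests and, when it needs to talk to the Django server, connects to the UNIX socket opened by uWSGI.

Let's assume the uWSGI part works and configure nginx itself first. Since static files are served directly by nginx without passing through uWSGI, the pydoc documentation itself is online as soon as nginx is configured. Use this to test that the configuration is correct.

For this site, /static maps to the Django staticfiles; /3/ and /3.5/ map to the path of the pydoc HTML build; all other paths are handed to Django. In addition, links under /3.5/* are redirected to /3/*.

Putting these requirements together, write an nginx config file at /etc/nginx/sites-available/pydoc_autobuild.conf: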

# Upstream Django setting; the socket nginx connects to
upstream django {
    server unix:///run/django/pydoc_autobuild.sock;
}

server {
    listen      80;
    listen      443 default ssl;

    server_name docs.python.org.tw
                52.69.170.26
                ;
    charset     utf-8;

    client_max_body_size 10M;  # max upload size
    keepalive_timeout 15;

    location /static {
        alias /path/to/code/pydoc_autobuild/assets;
    }

    location /3 {
        alias /path/to/code/cpython-tw/Doc/build/html;
    }

    location ~ /3\.5/(.*) {
        return 302 /3/$1;
    }

    # Finally, send all non-media requests to the Django server.
    location / {
        uwsgi_pass  django;
        include     /etc/nginx/uwsgi_params;
    }
}
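To reason about which location block handles a given path, here is a rough Python sketch of the matching order for this config. It assumes nginx's usual rule that a matching regex location takes precedence over plain prefix locations (no ^~ modifier is used here), and otherwise the longest matching prefix wins. The handler labels are made up for illustration.

```python
import re

# Prefix locations from the config above, mapped to an illustrative label.
PREFIXES = {"/static": "static files", "/3": "pydoc HTML", "/": "django"}
# The regex location; like nginx, it is unanchored.
REDIRECT = re.compile(r"/3\.5/(.*)")


def route(path):
    # A regex location is used whenever it matches.
    m = REDIRECT.search(path)
    if m:
        return ("redirect", "/3/" + m.group(1))
    # Otherwise fall back to the longest matching prefix.
    best = max((p for p in PREFIXES if path.startswith(p)), key=len)
    return (PREFIXES[best], path)


for p in ["/3.5/library/os.html", "/3/library/os.html", "/static/css/site.css", "/admin/"]:
    print(p, "->", route(p))
```

Because the regex is unanchored, any path that contains /3.5/ anywhere would also be redirected, which is worth keeping in mind when adding new URLs.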

Then soft link the file into /etc/nginx/sites-enabled/ and reload the nginx configuration:

cd /etc/nginx/sites-available/
sudo ln -s pydoc_autobuild.conf ../sites-enabled/
sudo systemctl reload nginx

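Once pydoc is confirmed online, we can focus on uWSGI.

uWSGI configuration

uWSGI also has to be installed outside the VENV. I think pip is still the simpler way, although it means keeping an eye on uWSGI updates yourself: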

sudo python3.5 -m pip install uwsgi

Save the uWSGI settings as pydoc_autobuild_uwsgi.ini and, when testing, always run:

sudo uwsgi --ini pydoc_autobuild_uwsgi.ini

This mimics how it will actually run, so switching to systemd later won't throw up a pile of permission problems. The contents of the config file:

[uwsgi]
chdir        = /path/to/code/pydoc_autobuild
# Django's wsgi file
module       = pydoc_autobuild.wsgi:application
env          = DJANGO_SETTINGS_MODULE=pydoc_autobuild.settings.production
# the virtualenv (full path)
home         = /path/to/VENV

# process-related settings
# master
master       = true
# maximum number of worker processes
processes    = 4
# the socket (use the full path to be safe)
socket       = /run/django/pydoc_autobuild.sock
# ... with appropriate permissions - may be needed
chmod-socket = 664
uid          = pydoc
gid          = www-data
# clear environment on exit
vacuum       = true

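Getting the permissions right may take some time. nginx runs as www-data/www-data, and the socket must be readable and writable by nginx, but my code lives under the pydoc user's path, so running as www-data may cause permission problems. I recommend setting both uid and gid explicitly.

Along the way, nginx's error log makes debugging easier: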
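A quick way to confirm that chmod-socket = 664 gives the www-data group what it needs is to inspect the permission bits. This is a minimal sketch using only the standard library; the helper name is hypothetical.

```python
import stat


def group_can_read_write(mode):
    """True if the permission bits give the owning group read and write access."""
    return bool(mode & stat.S_IRGRP) and bool(mode & stat.S_IWGRP)


print(group_can_read_write(0o664))  # the mode set by chmod-socket above
print(group_can_read_write(0o644))
```

On the server, the mode of the live socket can be obtained with os.stat("/run/django/pydoc_autobuild.sock").st_mode.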

sudo less +F /var/log/nginx/error.log

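Once it works, switch to uWSGI's Emperor mode and drop the config file into a directory (called the vassals directory). In Emperor mode uWSGI automatically loads and runs every config file in the vassals directory.

Here the vassals directory is /etc/uwsgi/vassals/. Because uid and gid are set in the config, they don't need to be given again when running: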

sudo uwsgi --emperor /etc/uwsgi/vassals

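With this, all the Django views should work. Next, hand the uWSGI startup over to the system.

Systemd services

The autobuild server has two parts: the Django server and the Django-Q cluster. So there are two services when written as systemd services.

On Debian, system services go under /etc/systemd/system/, so create uwsgi.service and qcluster.service to manage uWSGI Emperor mode and the Django-Q cluster respectively.

uwsgi.service follows the settings in the uWSGI documentation's Django and Systemd article: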

[Unit]
Description=uWSGI Emperor
After=syslog.target

[Service]
ExecStart=/usr/local/bin/uwsgi --emperor /etc/uwsgi/vassals
RuntimeDirectory=uwsgi
Restart=always
KillSignal=SIGQUIT
Type=notify
StandardError=syslog
NotifyAccess=all

[Install]
WantedBy=multi-user.target

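qcluster.service is something I hand-wrote to mimic running python manage.py qcluster, so all the environment variables have to be set (of course, absolute paths everywhere would also work; I just find the long executable paths in the build log ugly xd).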

[Unit]
Description=Django-Q Cluster for site pydoc_autobuild
After=syslog.target
Wants=uwsgi.service

[Service]
User=pydoc
Group=www-data
Environment=VIRTUAL_ENV=/path/to/VENV
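# systemd does not expand $PATH inside Environment=, so list the directories explicitly
Environment=PATH=/path/to/VENV/bin:/usr/local/bin:/usr/bin:/bin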
Environment=DJANGO_SETTINGS_MODULE=pydoc_autobuild.settings.production
WorkingDirectory=/path/to/code/pydoc_autobuild
ExecStart=/path/to/VENV/bin/python manage.py qcluster
Restart=always
KillSignal=SIGQUIT
Type=simple
NotifyAccess=none
StandardError=syslog

[Install]
WantedBy=multi-user.target

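This kind of unit file probably isn't the systemd convention; I'm still wondering whether it should be rewritten as a user service (but I don't know how).

Once added to systemd, management is simple. Enable the two services so they start at boot (run sudo systemctl start <name> as well to start them right away):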

sudo systemctl enable uwsgi
sudo systemctl enable qcluster

Check their status:

sudo systemctl status uwsgi
sudo systemctl status qcluster

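Viewing their logs is easy too, because their stderr is captured. A benefit of systemd is that it takes care of rotation and so on for you, and it offers many ways to view logs.

For example, to view uWSGI's connection records from the last hour and keep following the log as new connections arrive: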

sudo journalctl -xef -u uwsgi --since '1 hour ago'

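Summary

This post introduced the Python documentation translation project, the architecture of the online documentation autobuild server built on Django and Django-Q, and its deployment on Debian with nginx, uWSGI, and systemd.

When I was researching, I found there weren't many articles yet, only a few such as How to Set Up Django with Nginx, uWSGI & systemd on Debian/Ubuntu; piecing the rest together yourself still takes some time. I'm also recording all the pydoc server deployment settings here so that rebuilding it in the future is easier.

As for the documentation translation itself, I'll probably write another post covering the whole project properly.

(By the way, if anyone actually read this from start to finish, please leave some feedback > <)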

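Trivia: is the language code (or locale identifier) for Taiwan Traditional Chinese zh_TW, zh-Hant, zh-Hant-TW, zh-Hant_TW, zh_Hant, or zh_Hant_TW? This question could fill a post on its own.

Going by the international standard BCP 47, of these only zh-Hant and zh-Hant-TW qualify; for more on the standard and its definitions, see Understanding the New Language Tags, W3C.

The current state of things is rather curious, though. According to OSX's definition of Language and Locale IDs, it should be zh_TW, zh-Hant, or zh-Hant_TW. On Debian, all supported locales are listed in /usr/share/i18n/SUPPORTED, which contains only zh_TW; Debian only uses language[_country][.charset], so the script subtag Hant never appears, even though using an underscore in the locale differs from the BCP 47 definition. Sphinx handles locales through Babel, which does not allow - in a locale, so only zh_Hant or zh_Hant_TW can be considered. Even more amusing, locales are supposed to be case-insensitive, so letter case doesn't matter XD ↩
In fact, on https://docs.python.org/, /3/ and /3.5/ are different copies of the documentation; even with the same version number they are updated at different times. I was quite surprised by this. We don't need to make it that complicated, though; a redirect is enough. ↩
During development I used GET all along, i.e. a dedicated link as described in the text. But I found robots / crawlers hitting these paths, so in the end I switched to POST, storing {{ pagename }} via data-*, i.e. <a href="#" data-pagename="{{ pagename }}">...</a>, and then binding a click listener with jQuery. ↩
The master branches of huey and jobtastic have py3k commits, but they look recent; this remains to be seen. ↩
A bit of a rant: the t2.nano vCPU really is sometimes fast and sometimes slow. Sometimes building the docs takes a few minutes, sometimes tens of minutes. One day it was extremely slow and then a web crawler found the site, sending the task queue into an endless hell of timeout, restart, timeout... ↩
/var/run = /run. This path is a tmpfs, so it is emptied on every reboot and the directory must be recreated. ↩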

Installing PostgreSQL 9 on Debian Jessie / OSX

2016-01-25T17:00:00-06:00

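SQLite is what I use most, but PostgreSQL has many handy features, and every time I need it I can't remember how to install it. So here are my notes on the relevant settings.

The examples use Debian Jessie (Debian 8.3) and OSX Homebrew. On OSX you probably won't keep PostgreSQL running for no reason, so the focus is on the Debian setup. PostgreSQL is currently at 9.5 but Debian stable ships 9.4; the basic configuration should be exactly the same.

Installing PostgreSQL
- Installing on OSX
- Installing on Debian Jessie
Initializing a personal database
psql commands
Dropping users and databases
Advanced topics
- Creating user accounts and databases through psql
- PostgreSQL Logging
  - Logging managed by PostgreSQL itself
  - Logging through systemd
Reference

Installing PostgreSQL

Installing on OSX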

brew install postgresql

Start the PostgreSQL server manually when you need it:

postgres -D /usr/local/var/postgres

For PostgreSQL configuration, refer to the Debian version.

Installing on Debian Jessie

$ sudo apt-get install postgresql-9.4

System services are now all managed by systemd, so checking whether PostgreSQL is running can be done with the systemctl command.

# systemctl status postgresql.service
● postgresql.service - PostgreSQL RDBMS
   Loaded: loaded (/lib/systemd/system/postgresql.service; enabled)
   Active: inactive (dead) since Mon 2016-01-25 17:26:08 CST; 4s ago
  Process: 913 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
 Main PID: 913 (code=exited, status=0/SUCCESS)

This service shows little runtime information, though; it is actually a dummy service that triggers possibly many PostgreSQL database clusters. By default there is only one cluster, main.

# systemctl status postgresql@9.4-main.service
● postgresql@9.4-main.service - PostgreSQL Cluster 9.4-main
   Loaded: loaded (/lib/systemd/system/postgresql@.service; disabled)
   Active: active (running) since Mon 2016-01-25 17:26:30 CST; 4min 7s ago
  Process: 9641 ExecStop=/usr/bin/pg_ctlcluster -m fast %i stop (code=exited, status=0/SUCCESS)
  Process: 9717 ExecStart=postgresql@%i %i start (code=exited, status=0/SUCCESS)
 Main PID: 9723 (postgres)
   CGroup: /system.slice/system-postgresql.slice/postgresql@9.4-main.service
           ├─9723 /usr/lib/postgresql/9.4/bin/postgres -D /var/lib/postgresql/9.4/main -c config_file=/etc/postgr...
           ├─9725 postgres: checkpointer process
           ├─9726 postgres: writer process
           ├─9727 postgres: wal writer process
           ├─9728 postgres: autovacuum launcher process
           └─9729 postgres: stats collector process

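Initializing a personal database

On OSX, the user who installed PostgreSQL via homebrew gets superuser privileges. Since it's local development this doesn't matter, and creating databases and so on is simpler.

On Debian, the user with superuser privileges is postgres, so the default user (vm in this example) cannot connect.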

$ psql
psql: FATAL:  role "vm" does not exist

Switch to the postgres identity via sudo and you can connect to the database with psql (PostgreSQL's REPL shell). Use \q to quit psql.

$ sudo -u postgres psql
[sudo] password for vm:
psql (9.4.5)
Type "help" for help.

postgres=# \q
$

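Connecting to the database as the postgres superuser isn't very safe, though; it's a good habit to use a personal account from the start. So next we will:

Create a PostgreSQL account with the same name as the system user
Create a database with the same name as the account

Create a PostgreSQL account with the same name as the system user

On Debian, $USER gives the account of the currently logged-in user. In a command such as sudo -u postgres echo $USER, the shell expands $USER before sudo runs, so it still yields your own account name. (A trick I picked up from the Ubuntu wiki.)

If you're worried, just type the account name wherever $USER appears. Let's check first: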

vm@vm-debian:~$ echo $USER
vm
vm@vm-debian:~$ sudo -u postgres echo $USER
vm

Users are created with the createuser command. Since this is a regular user account, don't give it many privileges.

$ sudo -u postgres createuser --interactive $USER
Shall the new role be a superuser? (y/n) n
Shall the new role be allowed to create databases? (y/n) n
Shall the new role be allowed to create more new roles? (y/n) n

Now psql shows one more user.

-- Run with command `sudo -u postgres psql`
postgres=# \du
                             List of roles
 Role name |                   Attributes                   | Member of
-----------+------------------------------------------------+-----------
 postgres  | Superuser, Create role, Create DB, Replication | {}
 vm        |                                                | {}

Create a database with the same name as the account

Use the createdb command, setting the owner of the database named after the account to that account.

$ sudo -u postgres createdb --owner=$USER $USER

Creating additional databases for this account is no problem either, for example a database named vm_database:

$ sudo -u postgres createdb --owner=$USER vm_database

Connecting to psql with the user account

Running psql now works.

$ psql
psql (9.4.5)
Type "help" for help.

vm=> \conninfo
You are connected to database "vm" as user "vm" via socket in "/var/run/postgresql" at port "5432".

The prompt changes from =# to =>, which means the connected user is not a superuser. The psql commands \l or \l+ list all current databases:

vm=> \l
                                   List of databases
    Name     |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges
-------------+----------+----------+-------------+-------------+-----------------------
 postgres    | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0   | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
             |          |          |             |             | postgres=CTc/postgres
 template1   | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
             |          |          |             |             | postgres=CTc/postgres
 vm          | vm       | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 vm_database | vm       | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
(5 rows)

Connecting to the other database, vm_database, is also simple:

$ psql -d vm_database
psql (9.4.5)
Type "help" for help.

vm_database=>

psql commands

psql has many commands; \? shows the list. For the full version, see the introduction to psql meta-commands on the official site. Here are a few common ones:

\l          # list all databases
\d          # list tables in current database
\du         # list roles
\conninfo   # show current SQL connection
\q          # quit
help        # print a short overview of the available help

Dropping users and databases

There is one command for each.

$ dropuser <usr>
$ dropdb <db>

Advanced topics

Creating user accounts and databases through psql

Ref: http://www.cyberciti.biz/faq/howto-add-postgresql-user-account/

-- Run with command `sudo -u postgres psql -d template1`
template1=# CREATE USER <usr> WITH PASSWORD '<pwd>';
template1=# CREATE DATABASE <db>;
template1=# GRANT ALL PRIVILEGES ON DATABASE <db> TO <usr>;

PostgreSQL Logging

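The settings are in /etc/postgresql/9.4/main/postgresql.conf. Different ways of managing logs require different log_destination values.

Managed by PostgreSQL itself: stderr, …
Through systemd: syslog

I haven't looked into it deeply, though; the conf has many settings that can be tuned. After changing the settings, restart the PostgreSQL cluster: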

sudo systemctl restart postgresql.service

Logging managed by PostgreSQL itself

# At /etc/postgresql/9.4/main/postgresql.conf
log_destination = 'stderr' 
logging_collector = on

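Log files are written to /var/lib/postgresql/9.4/main/pg_log/ by default.

Logging through systemd

I think one of systemd's advantages is centralized log management: as long as you follow its rules, you can manage logging the same way everywhere, which is quite convenient.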

# At /etc/postgresql/9.4/main/postgresql.conf
log_destination = 'syslog'  
logging_collector = off   # with on, it would only report that output is redirected to syslog

Restart the service now, and systemctl status will show the recent logs.

# systemctl status postgresql@9.4-main
● postgresql@9.4-main.service - PostgreSQL Cluster 9.4-main
   Loaded: loaded (/lib/systemd/system/postgresql@.service; disabled)
   Active: active (running) since Mon 2016-01-25 17:52:02 CST; 1min 13s ago
  Process: 14632 ExecStop=/usr/bin/pg_ctlcluster -m fast %i stop (code=exited, status=0/SUCCESS)
  Process: 14641 ExecStart=postgresql@%i %i start (code=exited, status=0/SUCCESS)
 Main PID: 14648 (postgres)
   CGroup: /system.slice/system-postgresql.slice/postgresql@9.4-main.service
           ├─14648 /usr/lib/postgresql/9.4/bin/postgres -D /var/lib/postgresql/9.4/main -c config_file=/etc/postg...
           ├─14650 postgres: checkpointer process
           ├─14651 postgres: writer process
           ├─14652 postgres: wal writer process
           ├─14653 postgres: autovacuum launcher process
           └─14654 postgres: stats collector process

Jan 25 17:52:00 vm-debian postgres[14648]: [1-1] 2016-01-25 17:52:00 CST [14648-1] LOG:  ending log output to stderr
Jan 25 17:52:00 vm-debian postgres[14648]: [1-2] 2016-01-25 17:52:00 CST [14648-2] HINT:  Future log output ...log".
Jan 25 17:52:00 vm-debian postgres[14649]: [2-1] 2016-01-25 17:52:00 CST [14649-1] LOG:  database system was...9 CST
Jan 25 17:52:00 vm-debian postgres[14649]: [3-1] 2016-01-25 17:52:00 CST [14649-2] LOG:  MultiXact member wr...abled
Jan 25 17:52:00 vm-debian postgres[14648]: [2-1] 2016-01-25 17:52:00 CST [14648-3] LOG:  database system is ...tions
Jan 25 17:52:00 vm-debian postgres[14653]: [2-1] 2016-01-25 17:52:00 CST [14653-1] LOG:  autovacuum launcher started
Jan 25 17:52:00 vm-debian postgres[14658]: [3-1] 2016-01-25 17:52:00 CST [14658-1] [unknown]@[unknown] LOG: ...acket
Jan 25 17:52:38 vm-debian postgres[14793]: [3-1] 2016-01-25 17:52:38 CST [14793-1] root@root FATAL:  role "r...exist
Hint: Some lines were ellipsized, use -l to show in full.

Or use journalctl, systemd's standard way to view logs:

# journalctl -u postgresql@9.4-main 
-- Logs begin at Mon 2016-01-25 16:46:25 CST, end at Mon 2016-01-25 19:22:07 CST. --
Jan 25 17:47:06 vm-debian postgres[13699]: [1-1] 2016-01-25 17:47:06 CST [13699-1] LOG:  redirecting log output to logging collector process
Jan 25 17:47:06 vm-debian postgres[13699]: [1-2] 2016-01-25 17:47:06 CST [13699-2] HINT:  Future log output will appear in directory "pg_log".
Jan 25 17:47:06 vm-debian postgres[13699]: [2-1] 2016-01-25 17:47:06 CST [13699-3] LOG:  ending log output to stderr
Jan 25 17:47:06 vm-debian postgres[13699]: [2-2] 2016-01-25 17:47:06 CST [13699-4] HINT:  Future log output will go to log destination "syslog".
Jan 25 17:47:06 vm-debian postgres[13701]: [3-1] 2016-01-25 17:47:06 CST [13701-1] LOG:  database system was shut down at 2016-01-25 17:47:05 CST
Jan 25 17:47:06 vm-debian postgres[13701]: [4-1] 2016-01-25 17:47:06 CST [13701-2] LOG:  MultiXact member wraparound protections are now enabled
Jan 25 17:47:06 vm-debian postgres[13699]: [3-1] 2016-01-25 17:47:06 CST [13699-5] LOG:  database system is ready to accept connections
Jan 25 17:47:06 vm-debian postgres[13705]: [3-1] 2016-01-25 17:47:06 CST [13705-1] LOG:  autovacuum launcher started
Jan 25 17:47:07 vm-debian postgres[13710]: [4-1] 2016-01-25 17:47:07 CST [13710-1] [unknown]@[unknown] LOG:  incomplete startup packet
Jan 25 17:49:30 vm-debian postgres[14170]: [4-1] 2016-01-25 17:49:30 CST [14170-1] root@root FATAL:  role "root" does not exist
...

Reference

How to Install and Use PostgreSQL 9.4 on Debian 8 by Digital Ocean
PostgreSQL on Arch Wiki
PostgreSQL on Debian wiki
PostgreSQL on Ubuntu wiki

Coding Beginner's Guide Appendix - Bioinfo Practices using Python

2016-01-21T23:30:00-06:00

Last Edited: Jan, 2016 (If anything is wrong, you can leave a comment or let me know through any channel)

We are going to walk through a series of practice problems created by the Rosalind Team.

Once you register an account at Rosalind, you can use their judging system to work through all the problems. However, in that case you cannot arbitrarily skip the easy levels, which is annoying, so I'm not going to force you to use the system. Luckily, each problem comes with one set of example data and expected output, which we can use to check our answers.

Note: Their code assumes Python 2 but everything I mention here is Python 3.

Python Basics
Bioinfo First Try

Other articles in the Coding Beginner's Guide series:

Introduction
Chapter 1 – Linux
Chapter 2 – Text Editing (Markdown, Text Editor)
Chapter 3 – Version Control (Git)
Chapter 4 – Python
Appendix 1 – OSX Development Environment
Appendix 2 – Python in Bioinformatics

Alternatively, the labcoding tag will find all the articles.

Python Basics

Do their Python Village problem sets. If you run into a topic you don't know, go read your Python reference.

They should be very easy.

Bioinfo First Try

Q DNA: Counting DNA Nucleotides

Link: http://rosalind.info/problems/dna/

Hint: use collections.Counter provided by Python’s stdlib
More Hint: use ' '.join and list comprehension to output the answer
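As one possible reading of the two hints, here is a minimal sketch (not the official solution; the function name is my own). It assumes the expected output is the counts of A, C, G, T in that order, separated by spaces.

```python
from collections import Counter


def count_nucleotides(dna):
    """Return the counts of A, C, G and T as a space-separated string."""
    counts = Counter(dna)
    # A Counter returns 0 for missing keys, so absent bases are handled for free.
    return " ".join(str(counts[base]) for base in "ACGT")


print(count_nucleotides("AACGT"))
```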

Q REVC: The Secondary and Tertiary Structures of DNA

Link: http://rosalind.info/problems/revc/

Hint: reversed for any sequence object and a dict for nucleotide code mapping
More Hint: done in a list comprehension
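The hints can be combined into a short sketch like the following (again my own naming, not the official solution): reverse the sequence, then map each base through a dict.

```python
# Mapping from each nucleotide to its complement.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return "".join([COMPLEMENT[base] for base in reversed(dna)])


print(reverse_complement("AAAACCCGGT"))
```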

Q GC: Computing GC Content

Link: http://rosalind.info/problems/gc/

This is the first more complicated problem, where some abstraction should help you come up with the solution. Try writing some reusable code blocks, for example functions and class definitions.

Don't worry about the computational complexity.

Walkthrough

You should implement it yourself before looking at my solution. Also, I haven't seen their official solution, so mine may differ a lot from theirs.

Intuitively, we need to implement a FASTA file parser. FASTA contains a series of sequence reads, each with a unique ID. From an object-oriented viewpoint, we create the classes Read for reads and Fasta for FASTA files.

Read is easy to design and understand,

class Read:
    def __init__(self, id, seq):
        self.id = id
        self.seq = seq

Since we need to compute their GC content, add a method to Read.

class Read:
    # ... skipped
    def gc(self):
        """Compute the GC content (in %) of the read."""
        # put the logic here (think of problem Q DNA)
        gc_percent = f(self.seq)
        return gc_percent
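The placeholder f above is left for the reader to fill in. One way it could look, written as a standalone function with a name of my choosing and reusing the Counter idea from Q DNA:

```python
from collections import Counter


def gc_percent(seq):
    """GC content of seq as a percentage (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    return 100.0 * (counts["G"] + counts["C"]) / len(seq)


print(gc_percent("ATGC"))
```

Inside Read.gc this would be called as gc_percent(self.seq).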

Then we have to implement the FASTA parser, which reads all the read entries and converts them into Read objects. In the real world we would be dealing with files like myfasta.fa, but here the input is a string.

class Fasta:
    def __init__(self, raw_str):
        """Parse a FASTA formatted string."""
        self.raw_str = raw_str
        # convert string into structured reads.
        self.reads = list(self.parse())

    def parse(self):
        """Parse the string and yield read in Read class."""
        # though we have no idea right now, the code structure
        # should be something like the following.
        raw_lines = self.raw_str.splitlines()
        for line in raw_lines:
            yield Read(...)

Here I use yield Read(...), which may be unfamiliar to Python beginners. It turns the parse(self) function into a generator. A generator lets you focus on the incoming data: once a piece of data is parsed and converted, the result is immediately handed out by yield, and we don't have to care about how to collect all the results. In our case, we gather all the results into a list with list(...).

So how should we read a FASTA file? A simple rule in this case is that every read consists of two consecutive lines. Also, the first line is always the ID of the first read.

All we need is to read two lines at a time. Here a Pythonic idiom is introduced. The following code reads the lines in non-overlapping pairs:

for first_line, second_line in zip(*[iter(raw_lines)]*2):
    yield Read(id=first_line, seq=second_line)

With the zip(*[iter(s)]*n) magic, we are very close to a full parser. You can find a lot of explanations of this idiom.

The read ID line is preceded by a > sign, so we can strip it with something like first_line[1:], or first_line[len('>'):] to be explicit.

Then sorting the reads in a FASTA file by GC% is easy.
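To see why the idiom works, it may help to spell it out: the list holds the same iterator twice, so zip pulls two consecutive items from it for every tuple. A small demonstration with made-up lines:

```python
raw_lines = [">read1", "ACGT", ">read2", "GGCC"]

# Longhand: one iterator passed to zip twice.
it = iter(raw_lines)
pairs = list(zip(it, it))

# The compact idiom does exactly the same thing.
same_pairs = list(zip(*[iter(raw_lines)] * 2))

print(pairs)
print(pairs == same_pairs)
```

Note that zip stops at the shortest input, so a trailing unpaired line would be silently dropped.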

fasta = Fasta('...')
sorted_reads = sorted(fasta.reads, key=lambda r: r.gc())  # note 1
top_gc_read = sorted_reads[-1]  # note 2
print(
    '>{0:s}\n'
    '{1:.6f}'  # note 3, 4
    .format(top_gc_read.id, top_gc_read.gc())
)

The numbered notes in the code above mean:

1. sorted(list, key=key_func) sorts the list based on the return value of key_func applied to each element.
2. Or use top_gc_read = sorted(..., reverse=True)[0].
3. Two string literals with no operator in between are joined automatically. In this case the result is exactly >{0:s}\n{1:.6f}. This is useful for tidying up a very long string.
4. '...'.format() fills the string with the given values. See doc.

In real cases a FASTA sequence can span multiple lines, and the file we parse may well be broken. How could we modify this parser to handle these situations?
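One possible direction, sketched as a standalone generator rather than as the Fasta class above (the function name and the choice of error handling are my own assumptions): collect sequence lines until the next > header, and complain if data appears before any header.

```python
def parse_fasta(raw_str):
    """Yield (read_id, sequence) pairs; a sequence may span several lines."""
    read_id, chunks = None, []
    for line in raw_str.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if read_id is not None:
                yield read_id, "".join(chunks)
            read_id, chunks = line[1:], []
        elif read_id is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if read_id is not None:
        yield read_id, "".join(chunks)


print(list(parse_fasta(">a\nAC\nGT\n\n>b\nGG\n")))
```

Each pair can then be wrapped as Read(id=read_id, seq=sequence).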

Q: (next?)

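I’m super tired now so I’ll leave the rest to you. Try the problems whose correct ratio falls in the yellow range.

Coding Beginner's Guide Appendix - OSX Development Environment

2016-01-21T23:00:00-06:00

Last Edited: Jan, 2016 (If anything is wrong, you can leave a comment or let me know through any channel)

The settings below are all quite subjective, and opinions will differ. Anyway, I'm sharing my environment.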

Terminal
Homebrew, Git and Python
Text Editors
Terminal Multiplexers
Git GUI
Documentation Searcher
Misc.

Terminal

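OSX has a built-in Terminal.app that can be used just like on Linux. It works fine in practice, but if you want to tweak colors or have more customization, many people install iTerm2.

Homebrew, Git and Python

OSX has no official package manager, so the community developed one called Homebrew. You can follow this tutorial to install Homebrew.

After installing, you can use the following commands to see how to operate it: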

$ brew --help
$ man brew    # for full documentation

Although OSX ships with git and python, we can use homebrew to install more standard (newer) versions:

$ brew install git python3

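If homebrew has problems, brew doctor can diagnose them. Googling the error message usually finds a solution.

Text Editors

The editor I use most is Vim. OSX has it built in, but it can also be installed with homebrew.

Besides console-based Vim, OSX also has MacVim, which is like gVim. It can be installed with homebrew as well.

$ brew info macvim  # see which install options MacVim offers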
$ brew install macvim --override-system-vim --custom-icons
$ brew linkapps macvim

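When using MacVim, besides vim you can also run mvim to open MacVim.

Terminal Multiplexers

You may have heard of screen or tmux. The former is built into OSX, but the version is old and has problems displaying colors, so a newer one can be installed via homebrew. Because screen duplicates what the system provides, it isn't in homebrew's default repo, so first add the extra repo: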

brew tap homebrew/dupes
brew install screen tmux

Git GUI

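When first learning Git, you may be unfamiliar with the commands and often not know where you are in git log. A graphical tool makes this easier to understand. Git has a built-in one, gitk, but it is rather bare-bones.

On OSX, consider SourceTree.

Documentation Searcher

Constantly looking things up on the official Python site can be a hassle, and later, once you've learned HTML, CSS, and other languages or various Python packages, looking things up becomes time-consuming. So someone developed an offline documentation browser called Dash. It is paid but has a free version, which apparently keeps popping up reminder messages.

Misc.

Alfred App: an extended Spotlight. It finds applications quickly and can integrate with Dash to make looking up docs easier.
Macdown: a markdown editor on OSX.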

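Coding Beginner's Guide - Python

2016-01-21T22:50:00-06:00

Last Edited: Jun, 2017 (If anything is wrong, you can leave a comment or let me know through any channel)

Python is an object-oriented, interpreted programming language with nearly twenty years of development history. It includes a full-featured standard library that makes many common tasks easy.

(From Wikipedia)

Choosing Python as the first language to study in depth has many benefits. Its syntax resembles English, and where other languages often use ;{}() to delimit different parts of the syntax, Python mainly uses whitespace and indentation.

Python can be used interactively (read–eval–print loop, REPL); developing by trying things as you go suits beginners well.

The built-in standard library is rich: networking, text processing, file handling, and even GUI interfaces can all be done with it. On top of that, there are many third-party packages, which are easy to install on Linux, so you can do almost anything with Python.

The "I heard that..." series
- I heard Python is slow. Does that mean it can't be used for computation / analysis / large files?
- Python 2 or Python 3? My friend says ... is better?
Related resources
Learning goals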

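The "I heard that..." series

(You need some exposure to Python to understand this part.)

I heard Python is slow. Does that mean it can't be used for computation / analysis / large files?

Python is indeed less efficient at runtime than compiled languages (e.g. C/C++, Java), but that is probably not the main reason your program is slow, so it doesn't mean Python can't handle computation-heavy work.

When a program runs slower than expected, go through these steps:

Which lines of the program are actually slow?
Is this the best algorithm?
Is this the most efficient way to write it in Python?

If things still haven't improved after the last step, you can start rewriting those parts in another language, such as C. Python integrates easily with C. Moreover, common C-accelerated routines are already available as Python packages, such as Numpy, so most of the time heavy computation can be done without leaving Python.

During my internship I often ran into problems that needed optimization. With Python I could easily (within a day) distribute the work across 4 machines and 64 cores. The approach may not be the most efficient, but compare it with spending several days rewriting the Python in C/C++, implementing a more refined and efficient algorithm (multithreaded, too), and carefully handling possible corner cases: after parallelizing, a computation that used to take three or four days was finished in 2 hours.

More importantly, this experiment was only run twice.

Compared with computation time, development time is far more precious to an engineer. Especially in a lab, what matters most is whether a method works at all; there are many ways to deal with a slow program, such as parallelization. The point is solving the problem, and needing a bit more resources doesn't really matter.

And if you ask whether Python or Matlab is faster? There's a serious Python vs Matlab comparison here. If you start with Python and it's slow, there are many paths to take, but with Matlab? meh

So is Python fast or not? On its own it has limits, but it has many happy companions. O’Reilly's High Performance Python is worth a read.

Python 2 or Python 3? My friend says ... is better?

As time goes by, with each passing day I can say more confidently: "please learn Python 3". Those still using Python 2 are mostly on 2.7, and porting 3.3+ code back to 2.7 isn't hard either.

EDIT 2017-06: Official support for Python 2.7 is confirmed to end in 2020. This doesn't mean Python 2.7 will vanish that year; many companies around the world will keep maintaining their internal Python 2.x code, but new projects default to 3.5+. Books on the market are now all written for Python 3.x as well, so the former obstacle to learning in Chinese has disappeared.

Learning goals

Open Python 3 on your own Linux and follow the learning references hands-on. Run Python programs both ways: in the REPL and as scripts.
Learn to use pip and venv (virtualenv) to manage Python packages and environments.
- Hint: the official Python site is your friend. You can find tutorials for both here (pip) and here (venv).
youtube-dl is a tool for downloading videos from major streaming sites such as Youtube and Crunchyroll. Besides being installable through Linux package managers, it is actually a package written in Python. To avoid conflicts with the Linux system environment, create a Python virtual environment and install it there with pip.
- Note: besides simply downloading streams, youtube-dl also supports video processing such as transcoding, muxing, and post-processing, which requires either libav or ffmpeg. On Debian-family Linux, libav is a bit easier to install.
Use Python to solve some Bioinfo problems you'll encounter in the lab. The website Rosalind offers a series of problems, and I picked some for you to practice; see Appendix 2.

EDIT 2016-05-22: Updated with the content of my ptt post, adding some new books and Chinese translations; reordered the recommendations.
EDIT 2017-06-20: Updated book information and the 2/3 comparison.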
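For the first and third questions, the standard library's timeit module gives a quick measurement. A minimal sketch with two made-up implementations of the same sum; absolute timings will vary by machine, so none are shown here.

```python
import timeit


def total_loop(n):
    """Sum 0..n-1 with an explicit Python loop."""
    total = 0
    for i in range(n):
        total += i
    return total


def total_builtin(n):
    """Sum 0..n-1 with the built-in sum, which runs the loop in C."""
    return sum(range(n))


for func in (total_loop, total_builtin):
    seconds = timeit.timeit(lambda: func(10_000), number=20)
    print(f"{func.__name__}: {seconds:.4f} s")
```

For finding which lines of a larger program are slow, the cProfile module from the standard library is the usual next step.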

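Coding Beginner's Guide - Version Control

2016-01-21T22:40:00-06:00

Last Edited: Jan, 2016 (If anything is wrong, you can leave a comment or let me know through any channel)

Version control is like save points in a game: you save your progress before a boss fight, so that if you fail you can restore the saved state and fight again. Applied to managing code, it lets you return to a previously saved state when you blow up your code.

Why use version control?

During software development, code is produced every day, and the following situations arise:

Files get overwritten by others or by yourself, or even lost
You want to restore the version written a few days ago
You want to know what changed since yesterday
Who changed this piece of code, and why?
A release requires separate maintenance and development versions

Therefore we want a mechanism that helps us:

Undo changes at any time and return to a previous version
Collaborate without overwriting other people's work
Keep a history of changes for later lookup
Manage different versions conveniently at release time

(From: Git 教學研究站)

Many tools can do version control, but the current mainstream is Git.

Why use version control?
Git (Version Control)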

Git (Version Control)

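Git is a version control tool.

Git creates a .git folder under your project (repo) directory¹ to manage these "save points", and doesn't touch anything in the project's other paths.

These save points can be pushed to a server. When others download them, they get not only the current code but also the past development history; and after someone else uploads their updated save points, pulling them gives you their changes. This is the idea of "syncing": multiple people can share and update each other's work.

A server that handles Git sync operations is called a git server. Github is a company that provides free git servers for syncing public Git projects. Many Linux tools use git for collaborative development, and quite a few have moved their git server to Github. A great many people use it, so I suggest getting a GitHub account.

Although Git is most often used to manage code, it can effectively manage any plain-text file, and binary files can be added to a repo as well.

(You may need some experience with git to understand the terms below.)

Practical advice

Make many small save points

Whenever you finish a set of changes, add and commit right away. It's annoying at first, but it's a good habit.

Later, when you know git better, you'll learn advanced commands (such as git rebase -i) that can squash several commits into one. Splitting a big commit apart, however, is more complicated.

Commit Style

A typical commit message is a one-liner. If the change needs explanation, I suggest the following format:

First line under 50 characters
Second line left blank
From the third line on, free format, but no line over 75 characters
Make good use of bullet points

Here is an example (from the Git Book):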

Short (50 chars or less) summary of changes

More detailed explanatory text, if necessary.  Wrap it to
about 72 characters or so.  In some contexts, the first
line is treated as the subject of an email and the rest of
the text as the body.  The blank line separating the
summary from the body is critical (unless you omit the body
entirely); tools like rebase can get confused if you run
the two together.

Further paragraphs come after blank lines.

  - Bullet points are okay, too

  - Typically a hyphen or asterisk is used for the bullet,
    preceded by a single space, with blank lines in
    between, but conventions vary here
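The format rules listed earlier are mechanical enough to check with a few lines of code. A minimal sketch, with a hypothetical function name and the limits from the list (50 and 75 characters) as defaults:

```python
def check_commit_message(message, subject_max=50, body_max=75):
    """Return a list of style problems found in a commit message."""
    problems = []
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["missing subject line"]
    if len(lines[0]) > subject_max:
        problems.append(f"subject longer than {subject_max} characters")
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line is not blank")
    # Body lines start at line 3 of the message.
    for number, line in enumerate(lines[2:], start=3):
        if len(line) > body_max:
            problems.append(f"line {number} longer than {body_max} characters")
    return problems


print(check_commit_message("Fix typo\n\nExplain why the fix is needed."))
```

An empty list means the message follows the rules.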

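P.S. You can find plenty of funny commit messages, for example: complaints.

Common problems

Conflict

When you develop alone on a single branch, conflicts are unlikely. But with multiple people developing together, or when merging several branches, conflicts will occur.

The most common cause of a conflict is two people each modifying nearby content in the same file. When git tries to combine the two sets of changes, it cannot tell whose change to use, so it can't resolve this automatically.

Search for "resolve git conflict" to find ways to resolve them.

Push fail

This usually happens when the save points on the server are newer than those on your machine, so you must first sync the server's updates down. If it's all on the same branch, you can try git pull --rebase to avoid an extra merge.

Learning goals

Use Git to manage the notes from these exercises (continuing from the Text Editors exercises)
- Try some git commands on it:
  - git status
  - git log --oneline --graph
Create dotfiles and dotvim repos to manage your environment config files.
dotfiles stores the .xxx files, such as .bashrc, .screenrc, .tmux.conf, .gitconfig, and so on, which usually live at ~/.xxx or ~/.config/xxx. The benefit of version control is that settings can be kept in sync across different servers.
dotvim holds the Vim configuration in ~/.vim. These config files can be soft linked back to where they are supposed to be.

Warning! Never put a private key under version control!
- Hint: searching for dotfiles turns up many examples (e.g. mine)
Create your own Github account and sync (push) the dotfiles / dotvim repos to Github.
- Hint: set up an ssh key pair and upload over ssh. Github has a complete tutorial.

The project directory is the directory where you ran git init. ↩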

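Coding Beginner's Guide - Text Editing

2016-01-21T22:30:00-06:00

Last Edited: Jan, 2016 (If anything is wrong, you can leave a comment or let me know through any channel)

This chapter introduces Markdown, a very simple plain-text format that makes organizing notes easy. I also hope everyone learns a terminal-based text editor, which will be handy later when working in a server environment.

Markdown
- Related resources
  - Markdown syntax
- Learning goals
Text Editor
Regular Expressions (Regex)
- Regex syntax flavors
- Related resources
  - Regex One
  - Regex 101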

Markdown

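It's a lightweight syntax. The idea is that a few simple marks in a plain-text file give you headings of various levels, bold and italic, hyperlinks, tables, syntax-highlighted code, and so on.

If you know HTML, the format of web pages, markdown syntax maps directly onto HTML, which is why the format is so popular on the web. Its file extension is .md, and many modern programs' READMEs are written in markdown (e.g. README.md).

Learning goals

This series of notes is written in markdown; you can find the source files here.
Try recording the Linux commands you've learned, or command combinations you use often, in markdown.

Text Editor

In the Linux world many things are plain-text files, with some prescribed syntax added to form a new format. Markdown above is one example. Even many executable programs are just script files that you can open and read in an ordinary editor. You can try: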

# cd is a shell script
nano `which cd`       # thanks TP's idea

常見的 editor 有：nano、vi、vim、emacs。到底什麼是最好用的文字編輯器，這是一場永無止盡的戰爭，近年來又有 Notepad++(GUI)、Sublime Text (GUI)、Neovim 的加入，這話題將不會有結論。對初學者而言，至少學會一個 editor 是必要的。

雖然一開始都說是介紹文字編輯器，但後來會開始學程式設計，所以最後大家在討論的都是「程式碼的編輯器」。

當在編輯一些設定檔、程式碼時，為了避免打錯關鍵字但難以查覺，多數人會把程式碼的關鍵字上色。按照程式碼不同的屬性、功能上色之後，多數人發現能更好的理解程式的結構，因此 editor 大多帶有語法上色（syntax highlighting）。

除了語法上色，這些 editor 都有自己的設定檔規範，可以讓使用者自行修改 editor 的行為。把自己常見的編輯器改得合乎自己習慣，是長期生活在 terminal 世界的第一步，大家可以參考（抄）別人的範本開始。

讓自己的編輯器有家的感覺。

除了設定檔之外，功能多的編輯器還會有「外掛」的功能，可以讓使用者增加自己的套件。這也等大家熟悉環境之後再自行玩玩吧。

Nano

這是一個操作簡單好懂的編輯器，~~沒有語法上色~~¹。多數的系統都有內建，所以到一個新的環境時幾乎都能使用。

鳥哥有教。其實直接執行它 nano 它的指令都會顯示在編輯畫面中。

Emacs

抱歉，我不會。但它是一個很好的編輯器。（誠徵大大補全）

Vim

一個老字號但維持穩定開發的編輯器。他有個特色是編輯器的模式，有些模式能編輯文字，有些不行，但能做選取、搜尋等動作。還有特有的指令合成方式（像連續技、buff 這樣）

初學者通常會難以習慣，初期不熟模式、指令記不住的話會很難操作。所以建議一開始先記住最基本的指令，隨時掌握自己在的模式，日後再慢慢加深對 vim 的了解。

如果真的很沒概念，鳥哥也有寫介紹。

Vim 相關資源

Open Vim

互動式的線上學習網站，很短，跟著操作完能會 Vim 基本動作、存檔。

http://www.openvim.com/

學習 Vim 的心法與攻略 (ptt)

了解最常用的 normal 與 insert 模式及最基本的指令。這篇的內容理解之後，就能用 vim 處理文字編輯了。

https://www.ptt.cc/bbs/Editor/M.1264056747.A.885.html

Vim adventure

如果很難學習 hjkl、wb 移動的話，這是個要用 vim 指令控制的小遊戲。

http://vim-adventures.com/

Vim 本身的使用手冊

可以使用 vimtutor 指令，或者在 vim normal 模式時鍵入 :help

學習目標

能在 terminal 中編修一個文字檔名為 foo.txt
- Hint: try nano
- nano foo.txt
搭配 root 權限修改系統的設定檔（你在鳥哥可能有經驗了）
- Hint: try sudo
能在 console 中編寫程式碼。用 1. 的方案也可，但建議再試試看另外一個
- Hint: try vi, vim or emacs
修改 editor 設定讓它更符合自己的習慣。
- Hint: for vim, try editing ~/.vimrc; for emacs, try editing ~/.emacs
用 terminal editor 使用 markdown 格式記錄這些練習的筆記與答案。

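Regular Expressions (Regex)

In normal mode, Vim can search for a string in the text with /{pattern}. Besides writing the exact string you want in the pattern, you can design rules that match different results fitting the same pattern. Such rules are called regular expressions (regex).

Whenever you need complicated string matching, consider whether regex can do it

Wherever string matching is needed, tools usually support regex, for example grep and sed. Vim and Python provide regex support as well.

Regex syntax flavors

Since regex is a set of rules for string matching, its syntax has to be specified. There are two main families of regex syntax: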

BRE (basic regex)
- Ex. [:alnum:]
ERE (extended regex)
- Ex. \w

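Among Linux commands, different regex syntaxes often lead to separate commands². For example, grep uses BRE while egrep uses ERE.

Tools related to text editing, such as Vim, Python, and Perl³, each have their own way of writing regex, but they all resemble the two families above to some degree, and you should check their syntax before use. In Vim, see :help regex.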
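As a small illustration of matching by rule rather than by exact string, here is a minimal sketch using Python's built-in re module. The file names are made up for the example, and the pattern uses Python's own regex syntax, which is close to but not identical to ERE.

```python
import re

# Hypothetical file names, used only for illustration.
lines = [
    "sample_01.fastq.gz",
    "sample_2.fastq",
    "notes.txt",
    "sample_13.fastq.gz",
]

# "sample_", one or more digits, ".fastq", and an optional ".gz" suffix.
pattern = re.compile(r"^sample_(\d+)\.fastq(\.gz)?$")

# Keep only the names that fit the rule, then pull out the captured digits.
matched = [line for line in lines if pattern.search(line)]
sample_ids = [int(pattern.search(line).group(1)) for line in matched]

print(matched)
print(sample_ids)
```

One pattern covers all three FASTQ names even though none of them is the same string, which is exactly the case where a plain string search falls short.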

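Beginner's Guide to Coding: Linux

2016-01-21T21:30:00-06:00

Last Edited: Jan, 2016 (If anything is wrong, you can leave a comment or let me know through any channel.)

Learning to use Linux is the first big hurdle, because you run into a great many new things in a short time. Most of what follows relates to Linux in one way or another, and the hard part of Linux is starting to operate "the whole computer" from a terminal, which feels unintuitive to people used to windowed interfaces. Fortunately, the mainstream Linux distributions of recent years all have good graphical interfaces (properly called desktop environments), so you can ease into terminal work gradually.

For development on the lab server, being able to get things done in a terminal is a must.

Linux, Unix, BSD, *nix
A brief introduction to distros
Desktop environments: GNOME, KDE, XFCE, LXDE
Resources
Learning goals

Other posts in the Beginner's Guide to Coding series: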

Introduction

Chapter 1 – Linux

Chapter 2 – Text Editing (Markdown, Text Editor)

Chapter 3 – Version Control (Git)

Chapter 4 – Python

Appendix 1 – OSX Development Environment

Appendix 2 – Python in Bioinformatics

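Alternatively, you can find all the posts under the labcoding tag.

Linux, Unix, BSD, *nix

Linux and Unix are different, but beginners will hardly notice the difference. Their terminal commands are very similar, hence the umbrella term *nix. Introductory Linux books usually tell the full history¹, so pay attention to that part if you enjoy stories of software development.

An incomplete list of the main groups: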

Redhat, CentOS, Fedora
Debian, Ubuntu, Linux Mint
ArchLinux
openSUSE
FreeBSD, OpenBSD

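These are the distributions (distros). There are actually a great many Linux and BSD (or Unix) systems², but newcomers to Linux should pick a popular distro so that information is easy to find.

The way the list is laid out is meaningful: I put similar distros on the same row, and once you have learned one, the others on that row are easy to pick up. The first two rows are the two big families, reflecting two ecosystems. Our lab's main server runs CentOS, but in recent years I have been gradually moving my own machines to Debian.

A brief introduction to distros

What follows is my subjective, take-no-responsibility introduction. You can tell from the length of each part that I lean toward Debian.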

Redhat / CentOS / Fedora

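Redhat is the commercial version, and CentOS is its community-maintained open-source counterpart. It is known for being conservative and stable, but that means newer software has to be installed by yourself, which is a downside for a lab that mostly uses fairly new tools. Packages are managed with yum xxx. Fedora ships newer software, but nobody in our lab uses it, so I don't recommend it.

Debian / Ubuntu

Debian is the head of the other big family. Even as the head, it has kept up steady development, and its documentation and wiki are of good quality. It has stable, testing, and unstable releases, each mapped to a version number and codename. As of 2016.01, for example, stable is jessie (8) and testing is stretch (9); unstable always maps to sid. As the names suggest, they indicate how new the packages (software) are. Tools on stable are therefore older and less suited to lab use, but testing fits quite well, and I personally recommend it. Packages are managed with apt-get xxx.

Within the Debian family, Ubuntu is extremely popular, with lots of tutorials online and a company backing it. Basically, Debian's strengths carry over to Ubuntu. Although Ubuntu's packages are ported from Debian, it has its own version numbers and releases a new version every six months.

ArchLinux

You can look into the rest on your own if interested, but I want to single out ArchLinux. It is a very do-it-yourself system, not at all suited to newcomers or lazy people. But it has a thorough, well-written wiki. When you want to learn a new package or don't know how to configure something and turn to Google, read their wiki first.

When looking things up, besides StackOverflow and the Ubuntu forums, make good use of the high-quality Arch Linux and Debian wikis.

After all that, I still haven't given a clear choice, and most people will find it hard to decide. So if you are a beginner, I suggest installing Ubuntu; at the moment pick the 15.10 or 14.04 LTS Desktop version, because it has the most resources online.

I am not very fond of Ubuntu, though, so once you are able to look up Linux operations on your own, I suggest trying other distros (for example Debian's testing channel, which I recommend).

Desktop environments: GNOME, KDE, XFCE, LXDE

A graphical user interface (GUI) needs more than the user's applications; it also needs core packages for system support and management. Such a set of GUI packages is called a desktop environment.

Windows and OSX install one automatically with the system, so there is only one choice of desktop environment. On Linux, however, installing a GUI is optional, and the system is fully usable with only a plain terminal interface (for example, when you install Ubuntu Server). Many servers skip the desktop environment for performance and security reasons.

On Linux a desktop environment can be installed afterwards, and there are different flavors to choose from. Users are also free to remove them (though things may well blow up). Common choices are GNOME, KDE, XFCE, and LXDE³. For example, to add the GNOME family of graphical interfaces after installing Ubuntu Server: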

sudo apt-get install ubuntu-gnome-desktop

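I won't go into how these implementations differ. In short, GNOME is the most popular. XFCE uses fewer system resources and is often what gets installed on lab servers. Ubuntu defaults to Unity, which is derived from GNOME.

For your first install, just go with the default. Switching to a different desktop environment afterwards requires a good understanding of the package manager (e.g., apt, yum) and of system configuration.

Chapter	Title	Important parts
0	Introduction to computers	Start from the beginning if you have never heard of CPU, RAM, or the MB and GB units; otherwise read data representation (3) and how software runs (4)
5	First login and online help (man page)	Issuing commands in text mode (2), man page and info page (3), nano (4)
6	Linux file permissions and directory layout	All
7	Linux file and directory management	Everything except hidden and special file attributes (4)
8	Linux disk and filesystem management	Basic filesystem operations (2)
9	File and filesystem compression and archiving	Uses and techniques of file compression (1), archiving commands (3)
10⁵	The vim editor	Locale and encoding conversion (4.3)
11	Getting to know BASH	All, though 2.4-2.8 and 6.4 can be skipped as needed
12⁶	Regular expressions and text formatting	Preface (1)
13	Learning shell scripts	All (read it when you need it)
22	Software installation: source code and tarballs	All (understand the workflow and recognize the keywords)
23	Software installation: RPM, SRPM, and YUM	Ubuntu uses APT⁷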

Chp. No	Chp. Name	Highlights
1	GNU/Linux tutorials	Everything except for 1.3 Midnight Commander
2	Debian package management	Read 2.2 Basic package management operations
10	Data management	Read 10.1 Sharing, copying, and archiving

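Learning goals

Since "Linux" here covers quite a broad range, it is easy to lose your way when you start. So I list a few very important concepts that you should come across early in learning Linux:

Understand how $PATH relates to where programs are run from
- Why typing ls finds the program named ls
Know stdin, stdout, and stderr, and how to use pipelines
Know what environment variables are and how to change them
Understand files, directories, and relative paths, as well as permission settings
Use <cmd> -h, <cmd> --help, or man <cmd> to look up what a command does and which options it takes
- <cmd> = any command on Linux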
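To make the $PATH item concrete, here is a minimal Python sketch of the lookup a shell performs when you type a command name. The helper find_in_path is a hypothetical name written for this note; the standard library's shutil.which does the same job and is what you would normally use.

```python
import os

def find_in_path(name, path=None):
    """Return the first executable called `name` found in `path`, else None."""
    if path is None:
        path = os.environ.get("PATH", "")
    # $PATH is a list of directories separated by ":" (os.pathsep) on *nix.
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        # The first match wins, which is why the order of $PATH matters.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

print(find_in_path("ls"))
```

Putting a directory earlier in $PATH therefore makes its programs shadow same-named ones that appear later.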

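If you have spent a week and still have never heard of (or hardly used) any of the above, your way of learning Linux is probably quite different from what I have in mind, so please send me an email first. These concepts are also learned gradually; having only heard of them without really understanding them after a week is perfectly normal.

Install a Linux system from scratch by yourself (a VM is fine).
Use it regularly for more than a week (i.e., get familiar with basic commands such as cd and ls).
Connect to a remote Linux machine with ssh. (The ssh port must be open.)
- Bonus: connect over ssh without typing a password.
- Bonus hint: look up authorized_keys. You will need to create an ssh user identity keypair, which you will also use when pushing to GitHub.
Install a system monitor called htop and use it to check system resource usage
- Bonus:
  - Adjust the column layout
  - Turn on Tree View
  - Show only the processes of a single user (very old versions of htop may lack this)
Install aria2, a download tool that supports resuming and can fetch HTTP(S), FTP, and even BT with multiple connections. Say you want to download the Debian Jessie netinst image; download it with 2 connections at once.
- Hint: check the man page of aria2c.
Learn to check disk usage on the system and the size of every file in the current directory (definitely not with ls -l)
- Hint: df and du
scp is a command that transfers one or more files over ssh. Try using it to copy file(s) from your computer to the server.
- Bonus:
  - Use the special characters * and ? in the path to transfer multiple files
  - There is a more sophisticated transfer tool called rsync; try using it instead.
Use a remote GUI. There are many technologies for this, with VNC and RDP the most common. RDP works more smoothly when connecting to Windows; VNC is less efficient at transmitting the screen, which puts a noticeable load on the server and lags easily. There is a newer protocol called NX whose screen compression keeps the graphical interface usable even on a slow network.
Try a remote desktop connection to the server with X2go, a program that implements the NX protocol.
- Hint: you need to install X2go on both the server and the client (usually your own computer), and it uses your SSH connection settings.
Live on Linux alone for more than a week (including Chinese input, web browsing, and so on)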

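Origins of Linux distros: http://en.wikipedia.org/wiki/Linux_distribution ↩
Distro Watch introduces all kinds of Linux and BSD systems; you can read about each distro there. ↩
LXDE was originally written by PCMan, and quite a few Taiwanese developers maintain it. ↩
It's fine if you don't understand LVM; if interested, see chapter 15 of VBird and the Arch Wiki introduction. ↩
There are other resources for learning vim; see 2 Text Editing. ↩
Regular expressions (regex) are important, but they can feel complicated when you are new to Linux, so you can skip them. You will meet vim's regex again in 2 Text Editing and Python's regex in 4 Python, and you can come back to sed, egrep, and similar commands then. ↩
For APT tutorials, see the official Ubuntu site and notes written by people online. ↩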

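Beginner's Guide to Coding: Overview

2016-01-21T21:00:00-06:00

Last Edited: Jan, 2016

(If anything is wrong, you can leave a comment or let me know through any channel.)

A lab is not a formal place for writing software, and most people don't take programming seriously. But as the data and the number of samples to analyze keep growing, to the point that your own computer can't handle them and even the server takes a long time, the importance of programming shows. On the other hand, research today emphasizes reproducibility: if you want your analysis to be reproducible a year from now, or by other researchers around the world, you need basic programming skills.

This series aims to let new lab members, with or without a CS background, understand today's software development workflow and basic skills. Software development requires some background knowledge before you can communicate normally with developers. This background includes:

Being comfortable working on a server (or using Linux)
What a widely used software tool looks like
How to share your code with others
Developing together with other people
How everyone else writes code

School courses offer little training in these, and people with an EE background tend to be even more casual about writing code, but they are essential for a long-running software project. I hope everyone can build these habits.

How proficient you need to be in each topic is a matter of opinion. Dig deeper and any one of them could take months, which is impractical for a lab project or a software project you need to finish; at the very least, your boss doesn't care at all. So in my view, the minimum standard is that when you hit something you don't understand in a topic, you know which keywords to search for and can understand what you find.

The series originally lived on GitHub Gist, but now that I have my own blog, I am moving it here and updating it along the way.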

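How to learn
- Q: Why not teach everyone hands-on, step by step?
Proficiency comes from everyday practice
For Windows users
For OSX users
Table of contents

How to learn

Each post covers one topic and lists some resources under it. Each topic ends with learning goals so you can gauge how far you have come. A learning goal gives a concrete task, and I try to tie it to (geeky) daily life. Usually finishing the first one or two goals is enough. This isn't homework, so you don't have to show me, but if you don't mind doing so, I will share my subjective suggestions. Most tasks have no single correct answer; anything that solves the problem is a good approach.

In short, you don't have to read every resource or do every task. Decide for yourself how much time to spend on each topic.

Dig as deep as you can into what you like; for what doesn't click, skim it until you can get by

I will try to order things by difficulty and include resources in both Chinese and English.

Q: Why not teach everyone hands-on, step by step?

The short answer: no time. The serious answer: everyone learns at a different pace and starts from a different level, so learning in lockstep would just waste your time.

I probably can't walk everyone through every item. Many of the links below only offer a window into a topic, and really learning it takes a fair amount of time. So don't go in thinking "once I finish reading these posts I will know ○○○".

Put plainly, this background knowledge is like a skill tree in a game: basic skills must be learned before advanced ones. You could max out the basic skills before leveling up, but grinding levels forever is unnecessary, not everyone enjoys it, and real life won't let you grind without doing quests.

At first you may need to look things up for every small problem, or follow four or five links across many pages before your question is even partly answered. This is completely normal. If you can get through this frustrating early phase, teaching yourself software later on will basically be no problem.

Proficiency comes from everyday practice

To learn programming quickly, I recommend practicing it in everyday life. For example, download files from the command line; turn your laptop into a Linux desktop and practice compiling software and resolving all sorts of installation issues yourself. Make your computer a software development environment you are comfortable with and use it often, and programming will feel less unfamiliar and less overwhelming.

The above may be a bit hard, so for a lighter load you can start with "thought exercises": thinking about how you would write a program to control things in everyday life. How would I design an elevator system? How does Facebook display everyone's feed? A rough idea is enough; nothing happens if you can't figure it out, and there is no need to look anything up. After a while, your sense for programming will improve too.

If you want a more hardcore approach, take a look at the source code of the software installed on your computer, and join the projects behind a few programs you use regularly to make them less buggy and more capable (commonly called contributing). You can also look at the source code of tools used in the lab, such as sratoolkit and cutadapt, and see whether you can read other people's code.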

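For Windows users

I suggest finding a way to install Linux (or using a Mac). If you don't want to replace your Windows environment, you can install VirtualBox and run a virtual Linux, or set up a virtual host on a VPS such as Amazon.

Because Windows has such a well-designed graphical interface (GUI), it also doesn't make it easy for users to work in command-line mode (terminal, console). Windows does have environments like command prompt and Powershell, but it is hard to control the whole system with them. It is also fundamentally different from Linux, so the relevant commands are hard to find in online tutorials, and most Windows developers don't like using a terminal either.

On the other hand, most people think of Visual Studio as paid software¹. It is the most complete and most pleasant development environment on Windows, so if you don't want to pay, programming may seem "full of obstacles". Besides, VS is a graphical editor, so you won't see how things work underneath. Most open source software is developed on *nix and has poor Windows support, which deepens the obstacle ("I want to install it myself, but there are lots of restrictions and it fails easily").

The downside of not using a terminal is that it is hard to picture why the software on your system works. Writing a program with a GUI seems to require great skill and to be nothing like the programming you have learned. In essence there is not much difference; it is just that finishing a windowed application that can be installed on a system takes many more steps, and typical small projects never reach that stage.

For OSX users

Mac OSX users show the same tendency, but since the lower layers of OSX are very similar to FreeBSD, and FreeBSD is similar to Linux, its terminal environment is complete. Many software developers use OSX nowadays, so there are plenty of resources online for both Linux and OSX, and experience usually transfers naturally between the two.

For how to develop on OSX, see my notes in Appendix 0. They are very subjective, though, and not everyone will work the way I do.

Table of contents

The content grew quite long as I wrote, so I split it into several posts. Please read them in numerical order:

Alternatively, you can find all the posts under the labcoding tag.

Visual Studio has had a Community edition since 2013; it is free and largely the same as the Professional edition, so getting a 32/64-bit C/C++ compiler will be easier in the future. But most open source software has not caught up yet, so many projects still use older VS versions (which must be paid for), and this will continue for a while. Universities all have licenses. ↩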

Numpy Indexing

2016-01-18T02:00:00-06:00

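A few days ago when I needed to write some numpy, I suddenly realized that its indexing behaves rather differently from pandas. I'm sure I will forget this in the future, so I'm writing it down first.

Let's use current events as the example: grab the table of each party's nominations for Taiwan's 2016 legislative election from Wikipedia. The code that processes the raw data is at the end of the post; the result looks roughly like this: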

	區域	原住民	不分區
中國國民黨	72	5	33
民主進步黨	60	2	34
台灣團結聯盟	2	0	15
親民黨	6	1	16
無黨團結聯盟	0	1	7
民國黨	13	1	10
綠黨社會民主黨聯盟	11	0	6
中華統一促進黨	14	0	10
時代力量	12	0	6
大愛憲改聯盟	12	0	6

Pandas indexing

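Pandas indexing is so fancy that one or two pages are not enough to cover it.

But today I only want to talk about indexing along two or more dimensions. Say we want to look at the district (區域) and party-list (不分區) nominations of the KMT (中國國民黨), the DPP (民主進步黨), and the New Power Party (時代力量):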

df.iloc[
    [0, 1, 8], [0, -1]
]
df.loc[
    ['中國國民黨', '民主進步黨', '時代力量'], 
    ['區域', '不分區']
]

Both approaches above return a part of the table.

	區域	不分區
中國國民黨	72	33
民主進步黨	60	34
時代力量	12	6

Numpy indexing

I instinctively assumed numpy indexing would work the same way; after all, a numpy array sits underneath pandas.

>>> arr = df.values
>>> arr[:5]
array([[72,  5, 33],
       [60,  2, 34],
       [ 2,  0, 15],
       [ 6,  1, 16],
       [ 0,  1,  7]])
>>> arr[[0, 1, 8], [0, -1]]
...
IndexError: shape mismatch: indexing arrays could not be broadcast 
together with shapes (3,) (2,)

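Only after going back to the official docs did I remember that here numpy behaves as if given (x, y) coordinates, picking elements out one by one.

>>> arr[[0, 1, 8], [0, 1, 2]]
array([72,  2,  6])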
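To spell out the pairing, here is a small self-contained sketch. The array is the nomination table typed in by hand, and the names picked and by_hand are mine, used only for this illustration.

```python
import numpy as np

# The nomination table from above, rows in the same party order.
arr = np.array([
    [72, 5, 33], [60, 2, 34], [2, 0, 15], [6, 1, 16], [0, 1, 7],
    [13, 1, 10], [11, 0, 6], [14, 0, 10], [12, 0, 6], [12, 0, 6],
])

rows = [0, 1, 8]
cols = [0, 1, 2]

# Integer-array indexing pairs the indices up: (0, 0), (1, 1), (8, 2).
picked = arr[rows, cols]

# The same selection written out one coordinate at a time.
by_hand = np.array([arr[r, c] for r, c in zip(rows, cols)])

print(picked)
print(np.array_equal(picked, by_hand))
```

This is also why the earlier call failed: three row indices cannot be paired with two column indices.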

The simple way is to select in two steps,

arr[[0, 1, 8], :][:, [0, 2]]

But then numpy returns a copy¹ twice, which is inefficient when the data is large. So what should we do?
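The copy behaviour can be checked directly. This is a minimal sketch on a stand-in array (not the nomination table), using np.shares_memory to tell a view from a copy.

```python
import numpy as np

arr = np.arange(30).reshape(10, 3)   # stand-in 10 x 3 array

basic = arr[0:9:4, :]       # start:end:step slicing, rows 0, 4, 8
fancy = arr[[0, 4, 8], :]   # integer-array indexing, same rows

# A view shares the original buffer; a copy does not.
print(np.shares_memory(arr, basic))
print(np.shares_memory(arr, fancy))
```

Both selections hold the same values, but only the sliced one avoids allocating new memory.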

Following the answers on Stack Overflow, any of the approaches below works. The simplest is numpy.ix_(),

arr[np.ix_([0, 1, 8], [0, 2])]

If you understand numpy's broadcasting mechanism,

# indices must be numpy arrays
rows = np.array([0, 1, 8])
cols = np.array([0, 2])
arr[rows[:, np.newaxis], cols]

# np.newaxis is None
arr[rows[:, None], cols]

Or build all the index values explicitly,

indices = np.meshgrid(
    [0, 1, 8], [0, 2], 
    indexing='ij'
)
# meshgrid may return a list; convert it to a tuple so each array indexes its own axis
arr[tuple(indices)]
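As a sanity check, the sketch below runs the three approaches on a stand-in array and compares each against the two-step selection. The variable names are mine.

```python
import numpy as np

arr = np.arange(30).reshape(10, 3)   # stand-in for the 10 x 3 table
rows, cols = [0, 1, 8], [0, 2]

expected = arr[rows, :][:, cols]                     # two-step selection
via_ix = arr[np.ix_(rows, cols)]                     # open mesh from np.ix_
via_broadcast = arr[np.array(rows)[:, None], np.array(cols)]
via_meshgrid = arr[tuple(np.meshgrid(rows, cols, indexing="ij"))]

# Every approach should give the same 3 x 2 block.
for result in (via_ix, via_broadcast, via_meshgrid):
    assert result.shape == (3, 2)
    assert np.array_equal(result, expected)

print(expected)
```

np.ix_ and the broadcasting form build the same pair of (3, 1) and (1, 2) index arrays, while meshgrid materializes the full (3, 2) grids, so the first two are slightly lighter on memory.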

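I'm putting this together because I tend to forget it whenever I haven't used it for a while.

Processing the raw Wikipedia data

The raw Wikipedia data comes from here, and this is where pandas gets to show off its processing power. Newer versions offer more string-handling features, which made me forget that numpy underneath doesn't really support unicode all that well XD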

import pandas as pd
import numpy as np
from urllib.parse import quote_plus

dfs = pd.read_html(
    'https://zh.wikipedia.org/wiki/%s' 
    % quote_plus('2016年中華民國立法委員選舉')
)
df = next(
    df for df in dfs 
    if '立法委員政黨提名名額' in str(df.iloc[0, 0])
)

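# Data cleaning
df.columns = df.iloc[1, :].values
# strip footnote marks such as [註 1] or [2]; recent pandas needs regex=True
df['政黨'] = df['政黨'].str.replace(
    r'\[(註 |)\d+\]', '', regex=True
)
df.index = df['政黨']
df = df.replace('－', 0)
# np.int was removed from recent numpy; the built-in int does the same job
df = df.iloc[2:-1, 1:-1].astype(int)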

In this case the data comes back as a copy, whereas simple indexing of the form start:end:step returns only a view. ↩

Plot Sequencing Depth with Gviz

2016-01-15T23:50:00-06:00

TL;DR Plot exome sequencing depth and coverage with genome annotation using Gviz in R, then fine-tune how the Gviz annotation track is displayed.

This post extends Genomic Data Processing in Bioconductor, though I haven’t finished reading all the references in that post. It assumes a basic understanding of how to deal with annotations and genome references in Bioconductor/R. If you haven’t dealt with genome annotations in R before, you should find some time to learn it anyway; it is truly a life saver.

Convert sequencing depth to BedGraph format
Plot depth in Gviz
Plot fine tune
Summary
Supplementary - Plot BAM files directly
- Fancier alignment display

I got the chance to try new tricks today when other lab members and I were analyzing our human cancer exome sequencing data. The results were a bunch of BAM files aligned by BWA-MEM against the hg19 reference.

We wanted to see the sequencing depth and the coverage of all the exons designed to be sequenced. Roughly, this can be done in a genome viewer such as IGV.

Visualize sequencing depth in IGV

IGV is good for daily research, but when it comes to customization, there aren’t many options. And if the visualization is meant for publication, one might want the figure to be vectorized and, more importantly, reproducible.

Therefore, combining with what I learnt in Genomic Data Processing in Bioconductor, I tried to plot the sequencing depth in R with Gviz. I thought learning Gviz would be demanding, since its vignette has 80 pages and the function documentation reads like scarily long spells. But both turned out to be really helpful and informative, especially when tuning its behavior. Figures produced by Gviz are aesthetically pleasing, and Gviz has many features as well (I’m still exploring them). I’m glad that I gave it a shot.

If you want to follow the code yourself, any human BAM alignment files will do. For example, the GEO dataset GSE48215 contains exome sequencing of breast cancer cell lines.

Convert sequencing depth to BedGraph format

After a quick search, I found that Gviz’s DataTrack accepts the BedGraph format. This format can hold any numerical value over chromosome ranges, as shown below:

chromosome	start	end	value
chr1	10,051	10,093	2
chr1	10,093	10,104	5
…	…	…	…

So we need to convert the alignment results to BedGraph format, which can be done with BEDTools’ genomecov command. BEDTools’ documentation notes that the BAM file should be sorted.

bedtools genomecov -bg -ibam myseq.bam > myseq.bedGraph

The plain text BedGraph can be huge; piping it through gzip reduces the file size to around 30% of the original.

bedtools genomecov -bg -ibam myseq.bam | gzip > myseq.bedGraph.gz
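For a quick sanity check of the file outside R, the BedGraph can be read with a few lines of Python. This is a minimal sketch using only the standard library; read_bedgraph and mean_depth are hypothetical helper names written for this note, not part of BEDTools or Gviz.

```python
import gzip

def read_bedgraph(path):
    """Yield (chrom, start, end, value) tuples from a plain or gzip'd BedGraph."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip blank lines and optional header or comment lines.
            if not line.strip() or line.startswith(("track", "browser", "#")):
                continue
            chrom, start, end, value = line.split()[:4]
            yield chrom, int(start), int(end), float(value)

def mean_depth(records):
    """Length-weighted mean depth over the intervals present in the file."""
    total_bases = 0
    total_depth = 0.0
    for _chrom, start, end, value in records:
        total_bases += end - start
        total_depth += (end - start) * value
    return total_depth / total_bases if total_bases else 0.0
```

Calling mean_depth(read_bedgraph("myseq.bedGraph.gz")) gives the average depth over the listed intervals only. As far as I understand, genomecov -bg leaves out zero-coverage regions, so this is not a genome-wide average.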

Plot depth in Gviz

The R packages for human genome annotations (Homo.sapiens) and Gviz itself are required. Also, data.table reads text tables at an impressive speed, so I recommend it. During the analysis, I learned that data.table can read gzip’d files through a pipe, which makes it even more awesome.

First Gviz track

We start by reading our sequencing depth in BedGraph format and plotting it.

library(data.table)
library(Gviz)

bedgraph_dt <- fread(
    './myseq.bedGraph',
    col.names = c('chromosome', 'start', 'end', 'value')
)

# Specify the range to plot
thechr <- "chr17"
st <- 41176e3
en <- 41324e3

bedgraph_dt_one_chr <- bedgraph_dt[chromosome == thechr]
dtrack <- DataTrack(
    range = bedgraph_dt_one_chr,
    type = "a",
    genome = 'hg19',
    name = "Seq. Depth"
)
plotTracks(
    list(dtrack),
    from = st, to = en
)

So we read the sequencing depth data, create a Gviz DataTrack holding the subset of our data on chr17, then plot the tracks with plotTracks (though we only made one here) within a given chromosome region. Here is what we got.

Add genome axis

The figure is a bit odd and uninformative without the genomic location.

Gviz can add the genomic location automatically through a new track, GenomeAxisTrack. We’d also like to show which region of the chromosome we are in. This can be done by adding another track, IdeogramTrack, to show the chromosome ideogram. Note that the latter track downloads cytoband data from UCSC, so the given genome must have a valid name.

itrack <- IdeogramTrack(
    genome = "hg19", chromosome = thechr
)
gtrack <- GenomeAxisTrack()

plotTracks(
    list(itrack, gtrack, dtrack),
    from = st, to = en
)

Better now :)

Add annotation

Since we are using exome sequencing, the curve of sequencing depth only makes sense when combined with the transcript annotations.

Gviz has GeneRegionTrack to extract annotations from R annotation packages. The Homo.sapiens package includes a gene annotation package built on the UCSC knownGene database. Add this new track and we have annotations on our plot.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

grtrack <- GeneRegionTrack(
    txdb,
    chromosome = thechr, start = st, end = en,
    showId = TRUE,
    name = "Gene Annotation"
)

plotTracks(
    list(itrack, gtrack, dtrack, grtrack),
    from = st, to = en
)

The plot should now be as informative as what we can get from IGV. In fact, Gviz can plot the alignment result too. It can read the BAM file directly and show a more detailed coverage that matches what IGV can do. I’ll leave that part to the end of this post.

So far we’ve shown the sequencing depth of a chromosome region with annotation. However, there is still something to be desired, mostly about the annotation:

Can we show only the annotation of certain genes?
knownGene’s identifiers are hardly meaningful; can we show the gene symbol instead?

So here comes the second part, annotation fine tuning.

Plot fine tune

Say we only care about the gene BRCA1. We need its location, or specifically, the genomic range that covers all BRCA1 isoforms. In the following example, I will demonstrate Gviz’s annotation fine-tuning.

Genome annotation query in Bioconductor/R

If you are not familiar with how to query annotations in Bioconductor, it’s easier to think by breaking our goal of finding BRCA1’s ranges into two steps:

Get the transcript IDs
Query the transcript locations by their IDs

Getting the transcript IDs for a given gene symbol is a select() on an OrganismDb object,

# Get all transcript IDs of gene BRCA1
BRCA1_txnames <- select(
    Homo.sapiens,
    keys = "BRCA1", keytype = "SYMBOL",
    columns = c("ENTREZID", "TXNAME")
)$TXNAME

> BRCA1_txnames
 [1] "uc010whl.2" "uc002icp.4" "uc010whm.2" "uc002icu.3"
 [5] "uc010cyx.3" "uc002icq.3" "uc002ict.3" "uc010whn.2"
 [9] "uc010who.3" "uc010whp.2" "uc010whq.1" "uc002idc.1"
[13] "uc010whr.1" "uc002idd.3" "uc002ide.1" "uc010cyy.1"
[17] "uc010whs.1" "uc010cyz.2" "uc010cza.2" "uc010wht.1"

Looks like it has plenty of isoforms!

via `transcripts()`

For the transcript locations, the easiest way is to query the TxDb via transcripts(),

BRCA1_txs <- transcripts(
    Homo.sapiens,
    vals=list(tx_name = BRCA1_txnames),
    columns=c("TXNAME","SYMBOL", "EXONID")
)

> BRCA1_txs
GRanges object with 20 ranges and 3 metadata columns:
       seqnames               ranges strand   |                   EXONID          TXNAME          SYMBOL
          <Rle>            <IRanges>  <Rle>   |            <IntegerList> <CharacterList> <CharacterList>
   [1]    chr17 [41196312, 41276132]      -   | 227486,227485,227482,...      uc010whl.2           BRCA1
   [2]    chr17 [41196312, 41277340]      -   | 227487,227486,227485,...      uc002icp.4           BRCA1
   [3]    chr17 [41196312, 41277340]      -   | 227487,227464,227463,...      uc010whm.2           BRCA1
   [4]    chr17 [41196312, 41277468]      -   | 227489,227486,227485,...      uc002icu.3           BRCA1
   [5]    chr17 [41196312, 41277468]      -   | 227489,227486,227482,...      uc010cyx.3           BRCA1
   ...      ...                  ...    ... ...                      ...             ...             ...
  [16]    chr17 [41243452, 41277340]      -   | 227487,227486,227485,...      uc010cyy.1           BRCA1
  [17]    chr17 [41243452, 41277468]      -   | 227489,227486,227485,...      uc010whs.1           BRCA1
  [18]    chr17 [41243452, 41277500]      -   | 227488,227486,227485,...      uc010cyz.2           BRCA1
  [19]    chr17 [41243452, 41277500]      -   | 227488,227486,227485,...      uc010cza.2           BRCA1
  [20]    chr17 [41243452, 41277500]      -   |            227488,227474      uc010wht.1           BRCA1
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Then get the genomic range of these transcripts with the seqnames(), start() and end() functions on the GRanges object,

thechr <- as.character(unique(
    seqnames(BRCA1_txs)
))
st <- min(start(BRCA1_txs)) - 2e4
en <- max(end(BRCA1_txs)) + 1e3

Some space is added at both ends so the plot won’t fit the transcripts too tightly and leaves some room for the transcript names.

> c(thechr, st, en)
[1] "chr17"    "41176312" "41323420"

via `exonsBy()`

Another way to obtain the genomic range is to get the exact ranges of the exons (coding regions and UTRs) of each transcript via exonsBy().

BRCA1_cds_by_tx <- exonsBy(
    Homo.sapiens, by="tx", use.names=TRUE
)[BRCA1_txnames]

The function returns a GRangesList object, a list of GRanges in which each GRanges object corresponds to one transcript.

> BRCA1_cds_by_tx
GRangesList object of length 20:
$uc010whl.2
GRanges object with 22 ranges and 3 metadata columns:
       seqnames               ranges strand   |   exon_id   exon_name exon_rank
          <Rle>            <IRanges>  <Rle>   | <integer> <character> <integer>
   [1]    chr17 [41276034, 41276132]      -   |    227486        <NA>         1
   [2]    chr17 [41267743, 41267796]      -   |    227485        <NA>         2
   [3]    chr17 [41258473, 41258550]      -   |    227482        <NA>         3
   [4]    chr17 [41256885, 41256973]      -   |    227481        <NA>         4
   [5]    chr17 [41256139, 41256278]      -   |    227480        <NA>         5
   ...      ...                  ...    ... ...       ...         ...       ...
  [18]    chr17 [41209069, 41209152]      -   |    227462        <NA>        18
  [19]    chr17 [41203080, 41203134]      -   |    227461        <NA>        19
  [20]    chr17 [41201138, 41201211]      -   |    227459        <NA>        20
  [21]    chr17 [41199660, 41199720]      -   |    227458        <NA>        21
  [22]    chr17 [41196312, 41197819]      -   |    227457        <NA>        22

...
<19 more elements>
-------
seqinfo: 93 sequences (1 circular) from hg19 genome

A GRangesList is not merely an R list: it correctly propagates GRanges-related functions to all the GRanges it contains.

> start(BRCA1_cds_by_tx)
IntegerList of length 20
[["uc010whl.2"]] 41276034 41267743 41258473 ... 41201138 41199660 41196312
[["uc002icp.4"]] 41277199 41276034 41267743 ... 41201138 41199660 41196312
...

Here we only care about the widest range, so the hierarchical structure is not useful. It is better to flatten the GRangesList first,

> BRCA1_cds_flatten <- unlist(BRCA1_cds_by_tx)
> BRCA1_cds_flatten
GRanges object with 284 ranges and 3 metadata columns:
             seqnames               ranges strand   |   exon_id   exon_name exon_rank
                <Rle>            <IRanges>  <Rle>   | <integer> <character> <integer>
  uc010whl.2    chr17 [41276034, 41276132]      -   |    227486        <NA>         1
  uc010whl.2    chr17 [41267743, 41267796]      -   |    227485        <NA>         2
  uc010whl.2    chr17 [41258473, 41258550]      -   |    227482        <NA>         3
  uc010whl.2    chr17 [41256885, 41256973]      -   |    227481        <NA>         4
  uc010whl.2    chr17 [41256139, 41256278]      -   |    227480        <NA>         5
         ...      ...                  ...    ... ...       ...         ...       ...
  uc010cza.2    chr17 [41249261, 41249306]      -   |    227477        <NA>         7
  uc010cza.2    chr17 [41247863, 41247939]      -   |    227476        <NA>         8
  uc010cza.2    chr17 [41243452, 41246877]      -   |    227474        <NA>         9
  uc010wht.1    chr17 [41277288, 41277500]      -   |    227488        <NA>         1
  uc010wht.1    chr17 [41243452, 41246877]      -   |    227474        <NA>         2
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Now that we have the BRCA1 genomic region, the rest of the plotting is the same.

Show only the annotations of certain genes

Before we create our own annotation subset, let’s first take a look at what Gviz generated. The GeneRegionTrack stores its annotation data in the range slot.

> grtrack@range
GRanges object with 459 ranges and 7 metadata columns:
        seqnames               ranges strand   |     feature          id         exon  transcript        gene      symbol   density
           <Rle>            <IRanges>  <Rle>   | <character> <character>  <character> <character> <character> <character> <numeric>
    [1]    chr17 [41177258, 41177364]      +   |        utr5     unknown uc002icn.3_1  uc002icn.3        8153  uc002icn.3         1
    [2]    chr17 [41177365, 41177466]      +   |         CDS     unknown uc002icn.3_1  uc002icn.3        8153  uc002icn.3         1
    [3]    chr17 [41177977, 41178064]      +   |         CDS     unknown uc002icn.3_2  uc002icn.3        8153  uc002icn.3         1
    [4]    chr17 [41179200, 41179309]      +   |         CDS     unknown uc002icn.3_3  uc002icn.3        8153  uc002icn.3         1
    [5]    chr17 [41180078, 41180212]      +   |         CDS     unknown uc002icn.3_4  uc002icn.3        8153  uc002icn.3         1
    ...      ...                  ...    ... ...         ...         ...          ...         ...         ...         ...       ...
  [455]    chr17 [41277294, 41277468]      -   |        utr5     unknown uc010cyx.3_1  uc010cyx.3         672  uc010cyx.3         1
  [456]    chr17 [41277294, 41277468]      -   |        utr5     unknown uc002idc.1_1  uc002idc.1         672  uc002idc.1         1
  [457]    chr17 [41277294, 41277468]      -   |        utr5     unknown uc010whr.1_1  uc010whr.1         672  uc010whr.1         1
  [458]    chr17 [41277294, 41277468]      -   |        utr5     unknown uc010whs.1_1  uc010whs.1         672  uc010whs.1         1
  [459]    chr17 [41322143, 41322420]      -   |        utr5     unknown uc010whp.2_1  uc010whp.2         672  uc010whp.2         1
  -------
  seqinfo: 1 sequence from hg19 genome; no seqlengths

So we filter out unrelated ranges by checking if the value of metadata column transcript is one of BRCA1’s transcript IDs,

BRCA_only_range <- grtrack@range[
    mcols(grtrack@range)$transcript %in% BRCA1_txnames
]
grtrack@range <- BRCA_only_range

or, in a less hacky way, use the new range to construct another GeneRegionTrack,

grtrack_BRCA_only <- GeneRegionTrack(
    BRCA_only_range,
    chromosome = thechr, start = st, end = en,
    showId = TRUE,
    name = "Gene Annotation (BRCA1 only)"
)
plotTracks(
    list(itrack, gtrack, dtrack, grtrack_BRCA_only),
    from = st, to = en
)

Display gene symbols at annotation track

It’s now clearer how Gviz stores the annotation. All we need to do is replace the symbol with whatever we want.

First, we extract the metadata of the GeneRegionTrack, and query for their gene symbols. Using either the transcript ID or Entrez ID will do.

grtrack_range <- grtrack@range
range_mapping <- select(
    Homo.sapiens,
    keys = mcols(grtrack_range)$symbol,
    keytype = "TXNAME",
    columns = c("ENTREZID", "SYMBOL")
)

> head(range_mapping)
      TXNAME SYMBOL ENTREZID
1 uc002icn.3   RND2     8153
2 uc002icn.3   RND2     8153
3 uc002icn.3   RND2     8153
4 uc002icn.3   RND2     8153
5 uc002icn.3   RND2     8153
6 uc002icn.3   RND2     8153

Then we concatenate the information of transcript ID and gene symbol using stringr.

library(stringr)
new_symbols <- with(
    range_mapping,
    str_c(SYMBOL, " (", TXNAME, ")", sep = "")
)

> head(unique(new_symbols))
[1] "RND2 (uc002icn.3)" "NBR2 (uc002idf.3)" "NBR2 (uc010czb.2)"
[4] "NBR2 (uc002idg.3)" "NBR2 (uc002idh.3)" "NBR1 (uc010czd.3)"

As when we extracted the BRCA1-only annotations, we construct a new GeneRegionTrack.

grtrack_symbol <- GeneRegionTrack(
    grtrack@range,
    chromosome = thechr, start = st, end = en,
    showId = TRUE,
    name = "Gene Annotation w. Symbol"
)
symbol(grtrack_symbol) <- new_symbols
plotTracks(
    list(itrack, gtrack, dtrack, grtrack_symbol),
    from = st, to = en
)

Summary

So we’ve learnt how to plot using Gviz. You should go explore other data tracks or try to combine the sequencing depth of multiple samples. I found the design of Gviz clean and easy to modify. I think I’ll use Gviz whenever genome-related plots are needed.

Really glad I’ve tried it :)

Supplementary - Plot BAM files directly

We will start by replacing DataTrack with AlignmentsTrack. Also we select a smaller region this time so the read mapping can be clearly seen.

st <- 41.196e6L
en <- 41.202e6L
gtrack <- GenomeAxisTrack(cex = 1)  # set the font size larger
altrack <- AlignmentsTrack(
    "myseq.bam", isPaired = TRUE, col.mates = "deeppink"
)
plotTracks(
    list(gtrack, altrack, grtrack),
    from = st, to = en
)

To plot only the coverage, set the type as coverage.

altrack <- AlignmentsTrack(
    "myseq.bam", type = "coverage"
)

Fancier alignment display

Spend some time reading the documentation, and the alignment display can be much fancier.

For example, when looking at a much smaller genome region, we may want to see the sequence and the read mismatches. This can be done by adding a new track, SequenceTrack, to include the genome sequence,

small_st <- 41267.735e3L
small_en <- 41267.805e3L

library(BSgenome.Hsapiens.UCSC.hg19)
strack <- SequenceTrack(
    Hsapiens,
    chromosome = thechr, from = small_st, to = small_en,
    cex=0.8
)

We tweak the other tracks as well so the figure won’t be overloaded with information. Gene annotations are collapsed into a single line. Also, the height of aligned reads is increased to fit individual letters (e.g., ATCG).

grtrack_small <- GeneRegionTrack(
   grtrack@range,
   chromosome = thechr,
   start = small_st, end = small_en,
   stacking = "dense",
   name = "Gene Annotation"
)
altrack <- AlignmentsTrack(
    "myseq.bam",
    isPaired = TRUE,
    min.height = 12, max.height = 15, coverageHeight = 0.15, size = 50
)
plotTracks(
    list(gtrack, altrack, grtrack_small, strack),
    from = small_st, to = small_en
)

We found a C>T SNP here!

Jupyter Notebook Theme

2016-01-07T00:00:00-06:00

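Jupyter Notebook, formerly IPython Notebook, is probably the tool many people use to record their experiment steps and results when doing data analysis in Python.

IPython (v4.0+) has now returned to its roots as an Interactive Python Shell: it is just a package that extends the built-in Python REPL, and its extra dependencies have been cleaned out. The original IPython Notebook mainly provided an environment like the Mathematica Notebook; it has too many features to list here. It can run with a web or Qt interface.

Later it began integrating many languages, so that Julia, R, Lua, and others could all use the same Notebook architecture. That is how Jupyter was born, with the original IPython becoming just one of the possible language kernels. A notebook itself can be in R or Julia.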

Jupyter Notebook
Custom Theme on v4.1+
Custom Theme Before v4.1

Jupyter Notebook

Installing it with Python is very simple,

$ pip install jupyter

Jupyter uses the web interface by default and runs a tornado server, by default at http://localhost:8888.

$ jupyter notebook
[I 23:50:58.449 NotebookApp] Serving notebooks from local directory: /Users/liang
[I 23:50:58.449 NotebookApp] 0 active kernels
[I 23:50:58.450 NotebookApp] The IPython Notebook is running at: http://localhost:8888/
[I 23:50:58.450 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

If a browser is installed, a window opens automatically.

Jupyter Notebook Hub

Notebooks are .ipynb files by default. Typical content looks like this:

Jupyter Notebook Example

Custom Theme on v4.1+

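EDIT: 2016-01-11 Jupyter Notebook 4.1 has been released, and its config paths all default to ~/.jupyter.

The point of today's post is changing the theme, so let's get to it.

You can find where the config files should go with jupyter --config-dir or jupyter --paths.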

$ jupyter --config-dir
~/.jupyter
$ jupyter notebook --generate-config  # create the Notebook config file
Writing default config to: ~/.jupyter/jupyter_notebook_config.py

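If you only want to change the theme, there is no need to touch the Notebook config file.

The theme I currently use comes from dunovank, who has collected at least a dark and a light one, which should be plenty. Tweaking CSS on top of someone else's work is also relatively easy; I rewrote a little of it myself (I forget what). dunovank wrote a package for installing themes, but you can do without it: all you need is the CSS. I will use the Grade3 theme as the demo.

Put custom.css at ~/.jupyter/custom/custom.css. The config directory then looks like this: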

$ tree ~/.jupyter -F -L 3
~/.jupyter
├── custom/
│   └── custom.css -> /path/to/custom_light.css
└── jupyter_notebook_config.py
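The symlink layout above can also be set up from Python. This is a minimal sketch using pathlib; install_theme is a hypothetical helper name, and the config directory is passed in explicitly rather than looked up from Jupyter.

```python
from pathlib import Path

def install_theme(css_file, config_dir):
    """Link `css_file` as <config_dir>/custom/custom.css and return the link path."""
    custom_dir = Path(config_dir) / "custom"
    custom_dir.mkdir(parents=True, exist_ok=True)
    target = custom_dir / "custom.css"
    if target.is_symlink() or target.exists():
        target.unlink()   # replace any previously installed theme
    target.symlink_to(Path(css_file).resolve())
    return target

# Example call (not run here):
# install_theme("/path/to/custom_light.css", Path.home() / ".jupyter")
```

Using a symlink rather than a copy means switching between a light and a dark theme is just a matter of pointing the link at another CSS file.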

Custom Theme Before v4.1

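Because of the transition from IPython to Jupyter, the settings are currently rather scattered. They used to live in ~/.ipython, and with Jupyter the relevant settings are in ~/.jupyter. If the settings act strangely, check both paths.

Just put this CSS at ~/.ipython/profile_default/custom.css and restart Jupyter Notebook¹. The result:

Jupyter Notebook Theme Grade3 Demo

With all the toolbars toggled on, and what tables look like.

Personally I find that lower contrast is easier on the eyes over long sessions. A dark background is nice too, but plots often come with their own white background, so the overall look isn't great; you would probably have to change the matplotlib theme as well XD

This path doesn't fit Jupyter's cross-kernel design. ~~I feel the path will change in the future~~ The integration was completed in version 4.1+, and the settings are now separate from IPython's. ↩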

Blog defaults to HTTPS

2016-01-06T00:00:00-06:00

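In short, the blog now uses https. Plain http connections are redirected to https.

The blog has always been hosted on GitHub Pages, which supports https by default, but https naturally stopped working once I switched to a custom domain. There is an open issue on GitHub asking them to add HTTPS support for custom domains, but for now you have to find your own way. With services like Let’s Encrypt becoming popular, GitHub will probably start looking seriously for a suitable solution.

CloudFlare SSL and CDN
Disqus
justfont
Hinet URL forwarding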

CloudFlare SSL and CDN

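In that issue you can find other people's solution using CloudFlare. Conceptually you add a CloudFlare CDN layer in front, and its CDN provides an https certificate. CloudFlare's own introduction on the Crypto page explains it faster:

source: CloudFlare one-click SSL

So CloudFlare uses https when caching the GitHub pages, and https again on the way to the user. What remains is whether you trust CloudFlare.

For CloudFlare settings, see Keanu’s Blog. Some key notes:

Switch to CloudFlare's DNS servers
Under Crypto, set the SSL option to Full (not Strict, which GitHub doesn't support yet)
In Page Rules, force all http links to use https (e.g., http://blog.liang2.tw/*)

HTTPS and DNS settings both take a while to propagate; wait a few hours or watch for a day before turning off http.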

Pelican 發佈設定 publishconf.py 管網址的 SITE_URL 能設成 //blog.liang.tw 不用帶 protocol（這麼重要的資訊沒寫在文件裡啊），這樣就能同時 serve http(s)。
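To see why a protocol-less site URL works for both schemes, here is a small sketch using urljoin from Python's standard library, which resolves links roughly the way a browser does. The article path /posts/example/ is a made-up placeholder, not a real page on this blog.

```python
from urllib.parse import urljoin

# Protocol-relative site root, as set in the publish settings.
SITEURL = "//blog.liang2.tw"


def resolve(page_url, link):
    """Resolve a link relative to the page it appears on, like a browser does."""
    return urljoin(page_url, link)


# Hypothetical article path, used only for illustration.
article_link = SITEURL + "/posts/example/"

print(resolve("http://blog.liang2.tw/", article_link))
print(resolve("https://blog.liang2.tw/", article_link))
```

The link inherits whichever scheme the current page was loaded with, so the same generated HTML works before and after the switch to https.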

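That is really all it takes. Unexpectedly, a few small problems remained:

The justfont web font service only supports HTTPS on the Business Plan.
The Disqus comment system treats HTTP and HTTPS as different comment threads, and they have to be merged manually.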

Disqus

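The only solution seems to be redirecting everything to https. Even that cannot be changed directly in the Disqus settings; you have to use its URL Mapper to download a CSV of every URL that has a comment thread and edit it by hand.

It feels rather crude. But the site doesn't have many comments, so there wasn't much to change, and they were synced to the new location quickly.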

justfont

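I backed the Jin Xuan (金萱) font project earlier and received two years of the Business Plan. I emailed support and they updated the setting within a day. After that, though, I'll have to pay more.

Hinet URL forwarding service

I don't run any server of my own because I'm too lazy to maintain one. I'm also too lazy to type. Since no other subdomain is in use, I set up Hinet so that http://liang2.tw redirects to http://blog.liang2.tw, which is then redirected to https.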

CloudFlare DNS setting

That's about it. I hope to keep running this blog without hosting my own server.

Overview of Genomic Data Processing in Bioconductor

2015-12-29T20:28:00-06:00

Sorry for the late update. In the past two months, I finished my Ph.D. applications (hoping to hear good news in the next two months) and was busy preparing for PyCon Taiwan 2016. Also, a year-long website development project finally came to an end.

Now that most things are settled, I can get back to writing my blog.

Since September, at least 5 drafts have accumulated and I don’t know when I can finish them, so I think I have to change my writing strategy. I will first publish things as soon as the information gathering is done, and deeper reviews will follow in later posts. Right now I will focus on Bioconductor (and general bioinformatics topics) and Django.

Bioconductor
Summary

Bioconductor

Bioconductor is indeed a rich resource for R in terms of both data and tools. I realized that I had yet to spend time seriously understanding the whole ecosystem, which I believe can drastically lighten the load of daily analysis.

Bioconductor’s website is informative. If you are familiar with R, you already know that one of the best ways to understand how to use a package is to read its vignettes. Packages on Bioconductor generally have vignettes, which is really helpful, and the website makes them accessible. On top of that, they have Courses & Conferences and Workflows. The former section collects conference materials from the past few years, including package hands-on sessions, analysis tutorials, and advanced R topics. It’s a hidden gem to me, since I found numerous materials worth reading after just a glance. The latter should be well known; it gives examples of typical analysis workflows.

I’m interested in the following topics in Bioconductor:

Annotation and genome reference (OrgDb, TxDb, OrganismDb, BSgenome)
Experiment data storage (ExpressionSets)
Operations on genome (GenomicRanges)
Genomic data visualization (Gviz, ggbio)

Keywords in Bioconductor for each topic are given in parentheses, mostly package names. For each topic, I list the related resources I collected in the following sections.

Before the listing: I found that the PH525x series maintained by Rafael Irizarry and Michael Love from Harvard serves as a comprehensive entry point for almost every related topic. The site hosts the accompanying resources for their edX classes. Both are worth a look.

Annotation and Genome Reference

Annotating phenotypes and molecular function from PH525x series gives a good overview and a taste of the powerful ecosystem Bioconductor provides.
Annotation Resources from BioC 2015 gives a more extensive introduction to all available types of references, from genome sequences to transcriptome and gene info.

For example, human comes with

Experiment Data Storage

ExpressionSet helps store expression experiment data, combining the expression values and phenotypes of the same samples. Additionally, experiment metadata (such as the description of a GEO dataset) can be attached as well.

The ExpressionSet container from the PH525x series gives an intro. It should be sufficient for using ExpressionSet in daily work.
The ExpressionSet Introduction, a vignette from its package Biobase, gives a detailed explanation.

Operations on Genome

I haven’t gone into the details, but operations on genomic ranges are often tricky and, more importantly, badly optimized.

IRanges and GRanges and GRanges operations from the PH525x series give an overview of using the GenomicRanges package.
An Introduction to Genomic Ranges Classes, a GenomicRanges vignette, gives a detailed view.
Also, their paper, “Software for Computing and Annotating Genomic Ranges” (PLOS Computational Biology), should be another overview source for the package.
data.table’s foverlaps function is worth comparing, since I already use it and know it is blazingly fast. foverlaps handles overlaps of integer ranges, so it can be applied to genomic operations. Its code is quite complex, so its mechanism is still a mystery to me. I’d like to see how it compares with using a database like SQLite.
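For a rough idea of what the SQLite side of that comparison could look like, here is a minimal sketch of an interval-overlap join using Python's built-in sqlite3 module. The table names, column names, and coordinates are all made up for illustration, and it says nothing about speed; it only shows the overlap condition itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genes (name TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER);
CREATE TABLE peaks (id INTEGER, chrom TEXT, start_pos INTEGER, end_pos INTEGER);
""")
# Made-up coordinates, for illustration only.
conn.executemany("INSERT INTO genes VALUES (?, ?, ?, ?)", [
    ("geneA", "chr1", 100, 200),
    ("geneB", "chr1", 500, 800),
    ("geneC", "chr2", 100, 200),
])
conn.executemany("INSERT INTO peaks VALUES (?, ?, ?, ?)", [
    (1, "chr1", 150, 160),
    (2, "chr1", 190, 510),
    (3, "chr2", 300, 400),
])


def find_overlaps(conn):
    """Return (peak id, gene name) pairs whose closed intervals overlap."""
    return conn.execute("""
        SELECT p.id, g.name
        FROM peaks AS p
        JOIN genes AS g
          ON p.chrom = g.chrom
         AND p.start_pos <= g.end_pos
         AND g.start_pos <= p.end_pos
        ORDER BY p.id, g.name
    """).fetchall()


print(find_overlaps(conn))
```

Two closed intervals on the same chromosome overlap exactly when each one starts no later than the other ends, which is all the join condition encodes.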

Genomic data visualization

Basically I can find two packages:

Gviz
ggbio

I don’t know how they differ yet. Both can produce polished figures. I have some experience with ggbio, which was a bit tricky to use, so for now I will go with Gviz.

Visualizing genomic features with the Gviz package, given at Bioc Europe 2012, is a decent introduction to Gviz.
The Gviz User Guide looks very comprehensive and also covers usage with expression and alignment results.

Summary

These resources should be enough for weeks of exploring. It’s exciting to find so many useful tools.

So, good luck to me with my Ph.D. applications, PyCon Taiwan 2016, and posting to the blog more often.

Customize Django User Model

2015-11-04T18:23:00-06:00

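The fields of a Django account are defined in User in django.contrib.auth. From the user's point of view they are username*, first_name, last_name, email, and password*. For developers there is also:

Assigning Groups and Permissions
Whether the account is staff or superuser
Account activation and last login time

The built-in account features are practical and secure, so in general nobody changes them.

If you only want to add a profile to User, with fields such as a birthday or a home planet, you don't need to rewrite User either. Following Extending the existing User model in the official docs, all you need is a one-to-one relationship pointing to User: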

from django.conf import settings
from django.db import models

class UserProfile(models.Model):
    user = models.OneToOneField(settings.AUTH_USER_MODEL)
    birth = models.DateField()
    orig_planet = models.CharField(max_length=255)

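But Django logs users in by username by default. What if you want to log in by email instead?

Logging in with email as the account

Because User is such an important model, you have to watch out for compatibility when rewriting it. The official docs do have a guide, Specifying a custom User model, but it is much longer than the previous one.

Someone, @jcugat, has already built a package, django-custom-user, which implements EmailUser, i.e. login with email as the account. All the hard work is done, so if you want to add your own fields, you can inherit from its AbstractEmailUser.

In fact, once you have read about custom User models, writing the User model itself is not hard; the more complicated parts are creating and editing users and the admin configuration. Besides reading this package's source code, this Stack Overflow thread also covers different implementations. Django's source code for this part is quite readable and worth a look too.

Since I will do email verification later, I will probably use django-allauth. It feels like a long time since my last post; I should split posts into shorter ones XD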

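Anger in the Digital Age

2015-10-10T00:00:00-05:00

In the digital age, being really, forcefully angry is hard.

First, the medium does not encourage it. You may want to fire a broadside on FB and curse something out, but because the post is so heated, FB shows it to few people, or pushes it only to those who share your preferences, so the goal of "telling someone off" never seems to be met. Most social networks have this "huddling for warmth" function to some degree.

Next, setting aside the target of the anger, whether it is a group or a few people in private, anger in the digital age is still very different. Start with ordinary emotional expression: digital tools have greatly boosted the expressiveness of everyday emotions. From kaomoji, which traditional writing could more or less use, to emoji, to stickers and animated stickers, people are increasingly able to express even the smallest everyday mood swing in a hugely exaggerated way.

For example, on finding that the garbage truck came ten minutes early today:

The garbage truck actually came ten minutes early today (angry)
The garbage truck actually came ten minutes early today Σ(°Д°;
The garbage truck actually came ten minutes early today !!!!!!!! (with an animated image of someone clutching their head and sobbing)

But what about extremely negative emotions?

Woke up today and found that the president and the largest party in the legislature are both still the KMT

That only states a fact; it does not convey the anger inside.

Lying in bed glaring, unable to get up, gritting my teeth and punching bitterly at the ceiling, alternating hands until my arms gave out too.

...None of anyone's business, and the action is so long it reads like a novel, and probably exceeds some media's character limit. Kaomoji don't seem to hit home either. Use （╯‵□′）╯︵┴─┴ (table flip) to express this? It only feels "informal" and rather mocking, but truly angry people do not joke.

Thinking it over, it seems we can only fall back on the ways of the traditional letter-writing era, plus physical actions: cutting off contact, unfriending, unfollowing, or switching to more radical means and vocabulary. Swearing is very effective.

But this conclusion makes me curious about how people used to argue with each other.

Actually, what made me notice how anger is expressed across media was not the situation above. It started when I watched an older person argue over FB messages. Say a young person missed an appointment; the elder would angrily pull out... a phone to ping that person. I guess what he wanted to say was

Everyone is waiting just for you! ... (fifty more angry words omitted)

but the message finally sent was

Waiting for you mm

Way too gentle. The reason is simple: older people struggle to type more than ten correct characters in a row. In a rage, quietly picking five characters (by handwriting input) is already an achievement, and that "mm" was just a slip of the finger. So you see someone spitting out a torrent of angry words aloud, while what gets typed and sent on the phone amounts to nothing.

Surprisingly, for older people this medium acts as a decent emotional buffer. However angry they were, after sending the message and getting a reply from the other end, they somehow calm down a lot.

And young people?

I don't have enough observations yet (please don't flame me like that > <), but I suppose that if everyone votes for Hung Hsiu-chu (柱柱姐) next year, I will be able to collect enough samples (lol)

EDIT 2015-10-10: We may not even get the chance to vote for Hung Hsiu-chu. The world changes too fast.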

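The title came to me during a trip.

At first I thought of the Chinese title 「數位時代的生氣」, but it didn't feel cool enough, so I changed it to dichotomized affection. dicho- is a word I picked up while writing papers, and it is roughly the same idea as binarize. But on reflection, this post is not about finding a threshold on some unit of emotion to split it into 0/1, so I changed it to polarized affection. On further reflection, the situation is more like a signal that cannot be transmitted intact, so it should be a low-pass filter... What am I even doing? Chinese is crisply ambiguous, so let's leave it at that :P

This is an old piece, written on 2015-08-26.

Today I saw that Facebook is rolling out more varied reaction emoji beyond just Like, which reminded me of it. I'm also curious what response a post like this gets on the blog.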

Today, Facebook is taking the wraps off what form the new Like may take. It is rolling out “Reactions,” a new set of six emoji that will sit alongside the original thumbs-up to let users quickly respond with love, laughter, happiness, shock, sadness and anger. (Source: TechCrunch)

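Back then, someone replied that I should consider the case of swearing.

Swear words are probably the most effective vocabulary for conveying emotion, but my mom taught me not to swear, so they were out of scope xd

For example (hypothetically)

They clearly said I could graduate this year, but my advisor won't sign the leaving form, damn it
Screw it, they clearly said I could leave this f-ing hellhole this year, but that damned old man won't sign the leaving form, what kind of crap situation is this

I still find this less effective than emoji; put differently, the emotional density is not high enough, and to reach a deeper level of rage(?), people usually just pile on more swear words.

But I'm not sure how people usually read. Beyond the many anonymous forums, nobody online has to say things to someone's face, so people speak without restraint. After reading this for long enough, you automatically filter such words out, or else it becomes hard to tell how angry the other person really is.

For me, a change in how I perceive someone is far more shocking than the swear words they use. "Wow, so this person is this scary when angry (noted)", as if a line that must never be crossed had been crossed. But that turns into a discussion of "the bottom line in the digital age" or "image in the digital age", far beyond what I'm currently thinking about.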

Building a Lottery Website with Django and SQLite

2015-10-04T14:55:00-05:00

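Previously

I finished both seasons of LoveLive! The growth of μ’s in the first season is so moving. \Maki is the best/

...Uh, okay. Last time I covered building a lottery site with Flask. But our final goal is Django, so next comes a rewrite. It's also a chance to compare the design concepts of the two frameworks (~~such as how verbose Django is at the start~~, ~~how verbose Flask is by the end~~).

From Flask to Django

To make the transition without pulling in every Django feature at once, the intermediate steps use many "unusual ways of writing things". If you want to write Django best practice directly, see TP's "Django Tutorial for Programmers" (《為程式人寫的 Django Tutorial 》), which builds a meal-ordering system over 30 units.

Many Django APIs are used along the way; where one is unexplained, look it up in the official docs. I also found that tracing Django's execution flow with a debugger helps understanding; if you want a polished debugger, install an IDE such as PyCharm.

The overall plan is to add Django features gradually, in roughly this order:

Django View, Template
Django Model, ORM
Django Form
(Django Admin is not used)

The front page of the Django docs is divided into the same parts, though this post won't cover every concept.

Also, the rewrite skips raw SQL, because going without the ORM entirely makes it hard to connect with the rest of Django. If you're interested, see Details after the Model section.

Initial Django setup

As before, create a Python virtual environment (this is where it pays off: it isolates the packages of different projects).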

pip install django pytz ipython pyyaml

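pytz was introduced in the previous post; it handles time zones. IPython stands for Interactive Python; it is also a Python shell but adds many features, the most used probably being auto-completion. PyYAML handles YAML objects and is optional; without it, just use JSON in the later examples.

Our project root directory is demo_django_draw_member. Since Django has many settings, first use django-admin in this directory to generate the basic skeleton. We create a project named draw_site.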

(VENV) $ django-admin startproject draw_site

After it runs, a bunch of files should appear, structured as follows. Note the two levels of draw_site.

demo_django_draw_member/
└── draw_site/
    ├── draw_site/
    │   ├── __init__.py
    │   ├── settings.py
    │   ├── urls.py
    │   └── wsgi.py
    └── manage.py*

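The working directory from now on is actually demo_django_draw_member/draw_site/, the level that contains manage.py, and all later paths are relative to it. Here is what each file does.

manage.py replaces django-admin from now on. The biggest difference between them is that manage.py knows the project's settings.
draw_site/settings.py holds Django's settings, such as the secret key, database, template engine, and apps.
draw_site/urls.py holds the URL dispatching configuration, i.e. which function handles which path.
draw_site/wsgi.py: WSGI is the standard that governs Python web servers. This file is rarely touched, so I won't go into it. Flask and Django are both WSGI-compatible implementations.

A Django site consists of one project and many apps. Each app focuses on one feature of the site and bundles the database schema, templates, and view logic it needs. The benefit is that the same feature needn't be rewritten, and on a large site this structure helps manage the logic.

Django server

Let's get Django running first.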

$ python manage.py runserver
...
Django version 1.8.5, using settings 'draw_site.settings'
Starting development server at http://127.0.0.1:8000/

Django Hello World

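This is Django's built-in welcome page, shown when no URL has been configured. Seeing it means at least the basic settings work. Like Flask, Django's built-in server reloads when the source code changes, so you can leave it running.

The first Django app

Our site needs only one app. Create it and name it draw_member.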

python manage.py startapp draw_member

demo_django_draw_member/
└── draw_site/
    ├── draw_member/
    │   ├── __init__.py
    │   ├── admin.py
    │   ├── migrations/
    │   ├── models.py
    │   ├── tests.py
    │   └── views.py
    ├── draw_site/
    │   └── ...
    └── manage.py*

As you can see, an app is structured differently from a project.

To add the new app to the project, edit draw_site/settings.py.

# draw_site/settings.py

INSTALLED_APPS = (
    'draw_member',    # add this line
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
)

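Quite a few apps are installed by default. Ignore what they are for now.

Django settings

A brief introduction to draw_site/settings.py. Besides INSTALLED_APPS, which we just used, here are a few settings relevant to this post.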

# Database
# https://docs.djangoproject.com/en/1.8/ref/settings/#databases

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
    }
}

# Internationalization
# https://docs.djangoproject.com/en/1.8/topics/i18n/

LANGUAGE_CODE = 'en-us'
TIME_ZONE = 'UTC'
USE_TZ = True

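DATABASES defines the database in use. By default it is the SQLite database db.sqlite3.

Next are the language and time zone settings. The default is UTC with time zone support on, which means the server records all times in UTC.

Database Migration

Before writing any code, let me introduce a database concept: migration.

From the previous example, we first design what the database should store and how the site's flow uses that data; together these form the table schema. Over time, though, the site may gain new features, so it is hard to promise the schema will never change.

Changing a schema is not simple. On a production site, the database holds data accumulated since launch, and you can't just throw it away when the schema changes, right? Also, during development, different versions of the code (or code by different developers) may have different schemas. Keeping the code and the database state in sync is the job of migration.

...This complicated right from the start? Fine, our example doesn't use most of what migrations offer; we only use them to initialize the database. The built-in apps each have their own database schema, and migrate creates their tables.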

$ python manage.py migrate
Operations to perform:
  Synchronize unmigrated apps: messages, staticfiles
  Apply all migrations: sessions, auth, contenttypes, admin
Synchronizing apps without migrations:
  Creating tables...
    Running deferred SQL...
  Installing custom SQL...
Running migrations:
  Rendering model states... DONE
  Applying contenttypes.0001_initial... OK
  Applying auth.0001_initial... OK
  Applying admin.0001_initial... OK
  Applying contenttypes.0002_remove_content_type_name... OK
  Applying auth.0002_alter_permission_name_max_length... OK
  Applying auth.0003_alter_user_email_max_length... OK
  Applying auth.0004_alter_user_username_opts... OK
  Applying auth.0005_alter_user_last_login_null... OK
  Applying auth.0006_require_contenttypes_0002... OK
  Applying sessions.0001_initial... OK

Migration adjusts the database step by step until it matches the current code. These adjustments are recorded under <app>/migrations/, as we'll see shortly.
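The idea of "adjust the database step by step and remember which steps were applied" can be sketched with plain sqlite3. This is not how Django implements migrations; the MIGRATIONS list and the applied_migrations table are names made up for this illustration.

```python
import sqlite3

# Hypothetical migrations: (name, SQL) pairs applied in order.
MIGRATIONS = [
    ("0001_initial",
     "CREATE TABLE member (id INTEGER PRIMARY KEY, name TEXT)"),
    ("0002_add_group",
     "ALTER TABLE member ADD COLUMN group_name TEXT DEFAULT ''"),
]


def migrate(conn):
    """Apply the migrations not yet recorded; return the names applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_migrations (name TEXT PRIMARY KEY)")
    done = {row[0] for row in conn.execute("SELECT name FROM applied_migrations")}
    applied = []
    for name, sql in MIGRATIONS:
        if name in done:
            continue  # this step already ran against this database
        conn.execute(sql)
        conn.execute("INSERT INTO applied_migrations (name) VALUES (?)", (name,))
        applied.append(name)
    conn.commit()
    return applied


conn = sqlite3.connect(":memory:")
print(migrate(conn))  # first run applies every pending step
print(migrate(conn))  # second run finds nothing left to do
```

Because the applied steps are stored in the database itself, running the function again is harmless, which is the property that lets code and database state be brought back in sync at any time.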

URL dispatcher

Next we change the front page, replacing Django's default / page with Hello World.

Flask URL routing uses a decorator written directly on the view function. A quick recap:

@app.route('/')
def index():
    return "<p>Hello World!</p>"

In Django, views and URLs are separate. First the view:

# draw_member/views.py
from django.shortcuts import render  # keep for now
from django.http import HttpResponse

def home(request):
    return HttpResponse("<p>Hello World!</p>")

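Structurally they are much alike (partly thanks to the WSGI specification).

Next is the URL configuration. We first add the URL to the project settings. Having settings scattered like this may feel odd; we'll move it into the app shortly.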

# draw_site/urls.py
"""draw_site URL Configuration

The `urlpatterns` list routes URLs to views. For more information please see:
    https://docs.djangoproject.com/en/1.8/topics/http/urls/
...
"""
from django.conf.urls import include, url
from django.contrib import admin
from draw_member.views import home

urlpatterns = [
    url(r'^$', home, name="home"),
    url(r'^admin/', include(admin.site.urls)),
]

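The concept is simple: import the view function you need from the app (so the app directory is a Python package with an __init__.py in it), give a path expressed as a regex, followed by the handler function and an optional name. The name stands for this URL path and can be used for reverse lookup later.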
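The mechanism can be sketched in a few lines of plain Python. This is only an illustration of regex-based dispatch and name-based reverse lookup, not Django's resolver; the URLPATTERNS list, resolve, reverse, and the extra history view are all stand-ins invented for the sketch.

```python
import re


def home(request):
    return "<p>Hello World!</p>"


def history(request):
    return "<p>History</p>"


# (regex, view, name) triples, checked from top to bottom.
URLPATTERNS = [
    (r"^$", home, "home"),
    (r"^history/$", history, "history"),
]


def resolve(path):
    """Return the first view whose regex matches the path (leading slash removed)."""
    for regex, view, _name in URLPATTERNS:
        if re.search(regex, path.lstrip("/")):
            return view
    raise LookupError("no URL pattern matches %r" % path)


def reverse(name):
    """Look a path up by its name; only handles patterns without capture groups."""
    for regex, _view, pattern_name in URLPATTERNS:
        if pattern_name == name:
            return "/" + regex.lstrip("^").rstrip("$")
    raise LookupError("no URL pattern named %r" % name)


print(resolve("/")(None))
print(reverse("history"))
```

Giving each pattern a name means templates and views can ask for "history" instead of hard-coding /history/, so the path can change in one place.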

Test it to confirm the settings are correct.

$ curl -XGET "localhost:8000"
<p>Hello World!</p>

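Look at draw_site/urls.py again: Django ships an /admin entry by default, followed by include(admin.site.urls), which means every URL starting with admin/ is handed to admin.site.urls to resolve. This makes apps easy to reuse across sites, since the mount path may differ but URL handling within an app stays consistent.

Let's rewrite it right away. First add a urls.py under the draw_member app.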

# draw_member/urls.py
from django.conf.urls import include, url
from .views import home  # explicit relative import

urlpatterns = [
    url(r'^$', home, name="home"),
]

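The format basically copies what was already there. Since it now lives in the same app, the view can be imported with an explicit relative import (note: this is not an implicit relative import).

The original urls.py now "dispatches" URL handling to this app, like so.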

# draw_site/urls.py
from django.conf.urls import include, url
from django.contrib import admin


urlpatterns = [
    url(r'^admin/', include(admin.site.urls)),
    url(r'^', include('draw_member.urls')),
]

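r'^' means this app manages everything from the root, which is why more specific paths such as /admin must come first. Passing a string means the module is imported only at run time; if you prefer, drop the string and import the app's module instead.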
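Here is a rough sketch of how prefix-based "include" dispatch behaves and why the order matters. It is plain Python rather than Django's resolver; the pattern lists use strings in place of view functions, and the greedy catch-all pattern is made up purely to show the ordering problem.

```python
import re

# Hypothetical pattern lists; strings stand in for view functions.
ADMIN_URLS = [(r"^$", "admin index")]
APP_URLS = [
    (r"^$", "home"),
    (r"^.+/$", "catch-all page"),  # made-up greedy pattern to show why order matters
]


def dispatch(path, patterns):
    """Return the first matching target, descending into included lists."""
    for regex, target in patterns:
        match = re.match(regex, path)
        if match is None:
            continue
        if isinstance(target, list):
            # Included list: drop the matched prefix and resolve the remainder.
            try:
                return dispatch(path[match.end():], target)
            except LookupError:
                continue
        return target
    raise LookupError("no pattern matches %r" % path)


specific_first = [(r"^admin/", ADMIN_URLS), (r"^", APP_URLS)]
greedy_first = [(r"^", APP_URLS), (r"^admin/", ADMIN_URLS)]

print(dispatch("admin/", specific_first))
print(dispatch("admin/", greedy_first))
```

With the specific prefix first, admin/ reaches the admin patterns; with the root include first, the app's greedy pattern swallows it before the admin entry is ever tried.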

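That is the most basic URL dispatching.

Django Model and ORM

Next, the database. You can certainly write raw SQL in Django, but here is another idea: Object-relational Mapping (ORM). An ORM organizes data in an object-oriented way and leaves the SQL, table, and database details to the ORM engine to translate. There are plenty of introductions elsewhere, so let's jump to the implementation.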

    ┌─────────────────────┐
    │ members             │
    ├─────────────────────┤
    │ id          INTEGER │ <─┐
    │ name           TEXT │   │
    │ group_name     TEXT │   │
    └─────────────────────┘   │
                              │
    ┌─────────────────────┐   │
    │ draw_histories      │   │ foreign
    ├─────────────────────┤   │ key
    │ memberid    INTEGER │ ──┘
    │ time       DATETIME │
    └─────────────────────┘

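Recall our schema design. Thinking in ORM terms, we have two main models: Member and draw History. Member records the name and the group; History records the time and which member the draw belongs to.

In Django, models are defined in models.py. Let's write them.

# draw_member/models.py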
from django.db import models
from django.utils.timezone import now


class Member(models.Model):
    name = models.CharField(max_length=256)
    group_name = models.CharField(max_length=256)

    def __str__(self):
        return '%s of %s' % (self.name, self.group_name)


class History(models.Model):
    member = models.ForeignKey(Member, related_name="draw_histories")
    # now() will return datetime.utcnow()
    time = models.DateTimeField(default=now)

    def __str__(self):
        return '%s at %s' % (self.member.name, self.time)

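Each attribute in a class maps to a Field, which has a type and database-level constraints (for example, a CharField has a maximum length, while a TextField does not need one). See the official docs for field types.

Member only has strings, so both fields are CharField. History is slightly more complex. The time field uses DateTimeField, so the value comes back as a Python datetime object; member uses ForeignKey, a relationship field, to say which member the draw belongs to. The related_name that follows enables reverse lookup, i.e. finding all histories of a member.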
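Underneath, a foreign key is just a column holding the other table's id, and the reverse lookup is a query in the opposite direction. A minimal sqlite3 sketch, assuming simplified table names (member, history) rather than the ones Django generates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE member (
    id INTEGER PRIMARY KEY,
    name TEXT,
    group_name TEXT
);
CREATE TABLE history (
    id INTEGER PRIMARY KEY,
    member_id INTEGER NOT NULL REFERENCES member (id),
    time TEXT
);
INSERT INTO member (id, name, group_name) VALUES (1, '高坂 穂乃果', 'μ''s');
INSERT INTO history (member_id, time) VALUES (1, '2015-10-05 15:17:32');
INSERT INTO history (member_id, time) VALUES (1, '2015-10-05 17:36:41');
""")


def member_of(conn, history_id):
    """Forward direction: from a history row to the member's name."""
    row = conn.execute(
        "SELECT m.name FROM history AS h "
        "JOIN member AS m ON m.id = h.member_id WHERE h.id = ?",
        (history_id,)).fetchone()
    return row[0]


def histories_of(conn, member_id):
    """Reverse direction: every draw time recorded for one member."""
    rows = conn.execute(
        "SELECT time FROM history WHERE member_id = ? ORDER BY time",
        (member_id,))
    return [time for (time,) in rows]


print(member_of(conn, 1))
print(histories_of(conn, 1))
```

The ORM's history.member and member.draw_histories play the roles of these two helper functions.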

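We also define __str__ on both classes up front, so the objects are easy to recognize in the Python shell later.

Migration, the tracker of model changes

Enough talk; let's try it.

...Wait, did you remember migration? Every change to a database model requires a migration to keep the code and database state consistent.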

$ python manage.py makemigrations draw_member
Migrations for 'draw_member':
  0001_initial.py:
    - Create model History
    - Create model Member
    - Add field member to history

Django is smart enough to notice the two newly defined models and the database column types their fields map to. This information is written into a migration file:

# draw_member/migrations/0001_initial.py
class Migration(migrations.Migration):

    dependencies = [
    ]

    operations = [
        migrations.CreateModel(
            name='History',
            fields=[
                ('id', models.AutoField(serialize=False, primary_key=True, verbose_name='ID', auto_created=True)),
                ('time', models.DateTimeField(default=django.utils.timezone.now)),
            ],
        ),
        migrations.CreateModel(
            name='Member',
            fields=[
                ('id', models.AutoField(serialize=False, primary_key=True, verbose_name='ID', auto_created=True)),
                ('name', models.CharField(max_length=256)),
                ('group_name', models.CharField(max_length=256)),
            ],
        ),
        migrations.AddField(
            model_name='history',
            name='member',
            field=models.ForeignKey(to='draw_member.Member', related_name='draw_histories'),
        ),
    ]

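Note that Django's ORM automatically added id as a primary key, which we'll use soon. The details inside a migration will become clear as you get more familiar with Django.

With a new migration in place, sync the database state: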

$ python manage.py migrate
...
Running migrations:
  Rendering model states... DONE
  Applying draw_member.0001_initial... OK

ORM queries in shell

Next, let's play with the ORM.

$ python manage.py shell

This opens a Python shell, or an IPython shell if IPython is installed. How does it differ from a normal one? It carries the Django project's settings. From a normal shell, run the following first to get the same effect.

$ DJANGO_SETTINGS_MODULE="draw_site.settings" python
>>> import django
>>> django.setup()

In [1]: from draw_member.models import Member, History
In [2]: m1 = Member(name="高坂 穂乃果", group_name="μ's")
In [3]: m2 = Member(name="平沢 唯", group_name="K-ON!")
In [4]: m1, m2
Out[4]: (<Member: 高坂 穂乃果 of μ's>, <Member: 平沢 唯 of K-ON!>)
In [5]: m1.save()
In [6]: m2.save()
In [7]: h1 = History(member=m1)
In [8]: h1.save()

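You operate on data as objects, just as the name ORM suggests. Note that nothing is stored in the database until .save() is called. Using an unsaved object in a database operation raises an exception.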

>>> h_failed = History(member=Member(name='FF', group_name='f'))
>>> h_failed.save()
Traceback (most recent call last):
...
IntegrityError: NOT NULL constraint failed: draw_member_history.member_id

If that feels tedious, Model.objects.create() does it in one step. Once saved correctly, the database has data. We can confirm it in SQLite first.

-- sqlite3 db.sqlite3
sqlite> .header on
sqlite> SELECT * FROM draw_member_member;
id|name|group_name
1|高坂 穂乃果|μ's
2|平沢 唯|K-ON!
sqlite> SELECT * FROM draw_member_history;
id|time|member_id
1|2015-10-05 15:17:32.061384|1

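With object operations like those above, we can build the same database as with handwritten SQL. Columns such as id and member_id are created automatically by the ORM engine; they can be customized, but the default behavior is easy to understand.

How do we fetch data through the ORM, as we just did with SQL?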

>>> from draw_member.models import Member, History
>>> Member.objects.all()
[<Member: 高坂 穂乃果 of μ's>, <Member: 平沢 唯 of K-ON!>]
>>> History.objects.all()
[<History: 高坂 穂乃果 at 2015-10-05 15:17:32.061384+00:00>]

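Data is queried through the Model.objects Manager; for details, read Django's Making queries. A query returns a QuerySet, which does not actually hit the database right away; it stores the instructions and evaluates them only when the values are needed, i.e. lazy evaluation.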
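Lazy evaluation is easy to imitate in a toy class. The sketch below is not Django's QuerySet (it does not cache results, for one); LazyQuery and its log list are invented here just to make the moment of execution visible.

```python
import sqlite3


class LazyQuery:
    """Collects filters; SQL only runs when the object is iterated."""

    def __init__(self, conn, log, conditions=()):
        self.conn = conn
        self.log = log  # list recording every SQL string actually executed
        self.conditions = tuple(conditions)

    def filter(self, **equals):
        # Build a new query object; nothing touches the database here.
        extra = tuple(sorted(equals.items()))
        return LazyQuery(self.conn, self.log, self.conditions + extra)

    def __iter__(self):
        sql = "SELECT name FROM member"
        if self.conditions:
            sql += " WHERE " + " AND ".join(
                "%s = ?" % column for column, _ in self.conditions)
        self.log.append(sql)
        params = [value for _, value in self.conditions]
        return iter([name for (name,) in self.conn.execute(sql, params)])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE member (name TEXT, group_name TEXT)")
conn.executemany("INSERT INTO member VALUES (?, ?)",
                 [("平沢 唯", "K-ON!"), ("高坂 穂乃果", "μ's")])

executed = []
query = LazyQuery(conn, executed).filter(group_name="K-ON!")
print(executed)     # still empty: filter() only described the query
print(list(query))  # iterating is what triggers the SELECT
print(executed)
```

Because filter() returns a new object instead of running anything, filters can be chained freely and the database is asked only once, at the point the values are used.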

A QuerySet has many query methods that map to SQL statements, such as QuerySet.all(), which returns all objects and was used above, or QuerySet.filter() for filtering:

>>> Member.objects.filter(group_name='K-ON!')
[<Member: 平沢 唯 of K-ON!>]
>>> Member.objects.filter(group_name__contains='!')
[<Member: 平沢 唯 of K-ON!>]

Here <field>__contains is the Django ORM field lookup that corresponds to the SQL LIKE operator.
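To make the mapping concrete, here is a simplified translator from lookup keywords to a WHERE clause. It is only a sketch: lookup_to_sql is a made-up helper, it supports just exact and contains, and unlike the real ORM it does not escape % or _ inside the value.

```python
import sqlite3


def lookup_to_sql(**lookups):
    """Translate field lookups into a WHERE clause and its parameters."""
    clauses, params = [], []
    for key, value in sorted(lookups.items()):
        field, _, operator = key.partition("__")
        if operator in ("", "exact"):
            clauses.append("%s = ?" % field)
            params.append(value)
        elif operator == "contains":
            clauses.append("%s LIKE ?" % field)
            params.append("%" + value + "%")
        else:
            raise ValueError("unsupported lookup: %r" % operator)
    return " AND ".join(clauses), params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE member (name TEXT, group_name TEXT)")
conn.executemany("INSERT INTO member VALUES (?, ?)",
                 [("高坂 穂乃果", "μ's"), ("平沢 唯", "K-ON!")])

where, params = lookup_to_sql(group_name__contains="!")
print(where, params)
print(conn.execute("SELECT name FROM member WHERE " + where, params).fetchall())
```

The double underscore separates the field name from the operator, which is why a field called group_name (single underscores) is not mistaken for a lookup.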

A few related points. Every Model has a primary key pk, which by default points to the Model.id field. QuerySet.get() returns a single object, and this is where the universal pk comes in handy.

>>> Member.objects.get(pk=1)
<Member: 高坂 穂乃果 of μ's>

Following a relation is just as simple:

>>> h1 = History.objects.get(pk=1)
>>> h1.member
<Member: 高坂 穂乃果 of μ's>
>>> h1.member.name
'高坂 穂乃果'

Remember that we set related_name="draw_histories", which lets us look up a member's histories from Member:

>>> m1 = Member.objects.get(pk=1)
>>> m1.draw_histories.all()
[<History: 高坂 穂乃果 at 2015-10-05 15:17:32.061384+00:00>]

Finally, let's delete data:

>>> Member.objects.all().delete()
>>> History.objects.all().delete()

Of course, at this early stage we could brutally delete db.sqlite3 and run python manage.py migrate again to recreate all the tables, but that only works for SQLite. The proper way to "empty the database" is the flush command:

$ python manage.py flush
You have requested a flush of the database.
This will IRREVERSIBLY DESTROY all data currently in the 'draw_site/db.sqlite3' database,
and return each table to an empty state.
Are you sure you want to do this?

    Type 'yes' to continue, or 'no' to cancel: yes
Installed 0 object(s) from 0 fixture(s)
Installed 0 object(s) from 0 fixture(s)

Data in ORM and fixtures

Let's load the data from members.csv into the database. No need to go into detail here.

In [1]: import csv
In [2]: with open('../../draw_member/members.csv', newline='') as f:
   ...:    csv_reader = csv.DictReader(f)
   ...:    members = [
   ...:    (row['名字'], row['團體'])
   ...:    for row in csv_reader
   ...:    ]
In [3]: from draw_member.models import Member
In [4]: for m in members:
   ...:     Member(name=m[0], group_name=m[1]).save()
   ...:

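Check for yourself that all 14 people were written to the database.

But there's a problem: later we may wipe and rebuild the database often, or need to load this data (or data from many sources) into it. Re-reading and re-writing each time works, but is there a way to save the data first?

This is where Django fixtures come in. They can store the database's data as JSON, YAML (requires PyYAML), and other formats.

Fixtures usually live under <app>/fixtures/. Remember to create the directory first.

mkdir draw_member/fixtures

To create fixtures from the database contents, use the dumpdata command: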

python manage.py dumpdata \
    --format=yaml \
    --indent=4 \
    --output draw_member/fixtures/anime_members.yaml \
    draw_member.Member

# draw_member/fixtures/anime_members.yaml
-   fields: {group_name: "\u03BC's", name: "\u9AD8\u5742 \u7A42\u4E43\u679C"}
    model: draw_member.member
    pk: 1
-   fields: {group_name: "\u03BC's", name: "\u7D62\u702C \u7D75\u91CC"}
    model: draw_member.member
    pk: 2
# ...

JSON output works too; just change to --format=json.

[
{
  "model": "draw_member.member",
  "pk": 1,
  "fields": {
    "name": "\u9ad8\u5742 \u7a42\u4e43\u679c",
    "group_name": "\u03bc's"
  }
},

We can clear the database with python manage.py flush to simulate loading the data.

$ python manage.py loaddata anime_members.yaml
Installed 14 object(s) from 1 fixture(s)
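The dump-then-load round trip can be imitated with json and sqlite3 from the standard library. This is a sketch of the idea only, not what dumpdata and loaddata do internally; dump_members, load_members, and the simplified member table are assumptions made for the example.

```python
import json
import sqlite3


def dump_members(conn):
    """Serialize every member row into a fixture-like JSON string."""
    rows = conn.execute("SELECT id, name, group_name FROM member ORDER BY id")
    return json.dumps([
        {"model": "draw_member.member", "pk": pk,
         "fields": {"name": name, "group_name": group}}
        for pk, name, group in rows
    ], indent=2)


def load_members(conn, text):
    """Insert the rows described by a fixture string; return how many were loaded."""
    records = json.loads(text)
    conn.executemany(
        "INSERT INTO member (id, name, group_name) VALUES (?, ?, ?)",
        [(r["pk"], r["fields"]["name"], r["fields"]["group_name"])
         for r in records])
    conn.commit()
    return len(records)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE member (id INTEGER PRIMARY KEY, name TEXT, group_name TEXT)")
conn.executemany("INSERT INTO member (name, group_name) VALUES (?, ?)",
                 [("高坂 穂乃果", "μ's"), ("平沢 唯", "K-ON!")])

fixture = dump_members(conn)
conn.execute("DELETE FROM member")  # stand-in for flush
print(load_members(conn, fixture))
```

json.dumps escapes non-ASCII characters by default, which is why the fixture shows names as \uXXXX sequences; they decode back to the original text on load.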

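That about covers storing and loading data. For more details, see the model layer section of the official docs.

Django Template

Before going further, confirm that our directory structures match.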

demo_django_draw_member/
└── draw_site/
    ├── db.sqlite3
    ├── draw_member/
    │   ├── __init__.py
    │   ├── admin.py
    │   ├── fixtures/
    │   │   ├── anime_members.json
    │   │   └── anime_members.yaml
    │   ├── migrations/
    │   │   ├── 0001_initial.py
    │   │   └── __init__.py
    │   ├── models.py
    │   ├── tests.py
    │   ├── urls.py
    │   └── views.py
    ├── draw_site/
    │   ├── __init__.py
    │   ├── settings.py
    │   ├── urls.py
    │   └── wsgi.py
    └── manage.py*

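Django templates live under <app>/templates/ by default. To avoid name clashes across apps, we wrap them in one more folder named after the app.

mkdir -p draw_member/templates/draw_member

They look very similar at first glance to the Jinja2 templates Flask uses (Jinja2 is modeled on Django templates). The biggest difference is that Jinja2 lets you call Python functions freely, while Django relies on template tags and filters. For our example the two hardly differ.

As before, create base.html and home.html first. We also write the Form now, using GET for the time being.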

{# draw_member/templates/draw_member/base.html #}
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width">
  <title>{% block title %}抽籤系統{% endblock title %}</title>
</head>
<body>
{% block content %}{% endblock content %}
<hr>
<h3>功能列</h3>
<ul>
  <li><a href="{% url 'home' %}">首頁（抽籤）</a></li>
  <li><a href="{% url 'history' %}">歷史記錄</a></li>
</ul>
</body>
</html>

{# draw_member/templates/draw_member/home.html #}
{% extends 'draw_member/base.html' %}

{% block content %}
  <h1>來抽出快樂的夥伴吧！</h1>
  <p>選擇要被抽的團體</p>
  <form action="{% url 'draw' %}" method="get">
    <label for="group_name">團隊名稱：</label>
    <input type="radio" name="group_name" value="μ's">μ's
    <input type="radio" name="group_name" value="K-ON!">K-ON!
    <input type="radio" name="group_name" value="ALL" checked>（全）
    <input type="submit" value="Submit">
  </form>
{% endblock content %}

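The overall idea should be easy to follow. {% url 'xxxx' %} is the URL resolver. Remember the name argument given in urls.py? It is used here to return the correct URL.

Also update the URLs and add these views now, or runserver will later complain that the URLs cannot be found.

# draw_member/urls.py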
from django.conf.urls import include, url
from .views import home, draw, history

urlpatterns = [
    url(r'^$', home, name="home"),
    url(r'^draw/$', draw, name="draw"),
    url(r'^history/$', history, name="history")
]

# draw_member/views.py
from django.shortcuts import render
from django.http import HttpResponse


def home(request):
    return HttpResponse("<p>Hello World!</p>")


def draw(request):
    return HttpResponse("<p>Draw</p>")


def history(request):
    return HttpResponse("<p>History</p>")

Next, rewrite the home page so it uses home.html.

def home(request):
    return render(request, 'draw_member/home.html')

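Home page with the template applied

For more on templates, see the template layer section of the official docs.

More on Django’s model, template and view (MTV)

Let's implement the most important feature, the draw.

What you need to know here is that Django stores GET / POST parameters in dict-like objects, request.GET / request.POST, and that @require_GET restricts the view to GET requests.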
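For a feel of what ends up in that dict-like object, here is a sketch that parses a query string with the standard library. get_param is a hypothetical helper, not a Django function; it just mimics looking a key up with a default.

```python
from urllib.parse import parse_qs, urlsplit


def get_param(url, key, default=None):
    """Return the last value given for key in the URL's query string."""
    query = parse_qs(urlsplit(url).query)
    values = query.get(key)
    return values[-1] if values else default


print(get_param("http://localhost:8000/draw/?group_name=K-ON!", "group_name", "ALL"))
print(get_param("http://localhost:8000/draw/", "group_name", "ALL"))
```

A key can appear several times in a query string, so the parsed result maps each key to a list; taking the last value and falling back to a default is the behavior the view below relies on when it reads group_name.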

The rest of the logic is copied from before.

import random
from django.shortcuts import render
from django.http import HttpResponse, Http404
from django.views.decorators.http import require_GET
from .models import Member, History

@require_GET
def draw(request):
    # Retrieve all related members
    group_name = request.GET.get('group_name', 'ALL')
    if group_name == 'ALL':
        valid_members = Member.objects.all()
    else:
        valid_members = Member.objects.filter(group_name=group_name)
    # Raise 404 if no members are found given the group name
    if not valid_members.exists():
        raise Http404("No member in group '%s'" % group_name)
    # Lucky draw
    lucky_member = random.choice(valid_members)
    # Update history
    draw_history = History(member=lucky_member)
    draw_history.save()

    return HttpResponse(
        "<p>{0.name}（團體：{0.group_name}）</p>"
        .format(lucky_member)
    )

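The ORM version is much cleaner than raw SQL, though you do have to memorize the corresponding functions at first. Test it right away; again we're lazy and skip the template for now.

$ curl -XGET "localhost:8000/draw/?group_name=ALL"
<p>小泉 花陽（團體：μ's）</p>

If you click through from the home page, watch how the URL changes, e.g. http://localhost:8000/draw/?group_name=K-ON!. The form's choice is written right into the address bar. This is the biggest difference between GET and POST.

Next, write the history part and add all the templates.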

{# draw_member/templates/draw_member/history.html #}
{% extends 'draw_member/base.html' %}

{% block title %}抽籤歷史{% endblock title %}

{% block content %}
  <h1>抽籤歷史（最近 10 筆）</h1>
  <table>
    <thead>
    <tr>
      <th>名字</th>
      <th>團體</th>
      <th>抽中時間</th>
    </tr>
    </thead>
    <tbody>
    {% for history in recent_histories %}
      <tr>
        <td>{{ history.member.name }}</td>
        <td>{{ history.member.group_name }}</td>
        <td>{{ history.time|date:"r"}}</td>
      </tr>
    {% endfor %}
    </tbody>
  </table>
{% endblock content %}

history.html differs from the Flask version in its use of the date:"r" filter; a filter's argument follows the :. Update the corresponding view too:

def history(request):
    recent_draws = History.objects.order_by('-time').all()[:10]
    return render(request, 'draw_member/history.html', {
        'recent_histories': recent_draws,
    })

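You can see the default time zone is UTC; time zone conversion details are left to the end of the post. We can change the displayed time zone in the view: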

from django.utils.timezone import activate

def history(request):
    activate('Asia/Taipei')
    # ...
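What the conversion amounts to can be shown with the standard library alone. In this sketch a fixed +08:00 offset stands in for the 'Asia/Taipei' zone, which is an assumption made to avoid needing a time zone database; to_taipei is a made-up helper.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed +08:00 offset stands in for the 'Asia/Taipei' zone.
TAIPEI = timezone(timedelta(hours=8), "Asia/Taipei")


def to_taipei(moment):
    """Convert an aware datetime to Taipei time without changing the instant."""
    return moment.astimezone(TAIPEI)


stored = datetime(2015, 10, 5, 15, 17, 32, tzinfo=timezone.utc)
local = to_taipei(stored)
print(local.isoformat())
```

The stored value stays in UTC; only the representation shown to the user shifts, which is why the activation belongs in the view rather than in the model.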

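And the basic features are done! For details, again see the view layer section of the official docs.

Django Form

Writing the form directly in the template works, but sometimes a form is closely tied to a model, and once the form has many inputs, reading and writing every field yourself is unintuitive. Validating user input makes it more complicated still.

Hence Django Form. Let's see what using it looks like.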

# draw_member/forms.py
from django import forms

class DrawForm(forms.Form):
    GROUP_CHOICES = [
        ("μ's", "μ's"),
        ("K-ON!", "K-ON!"),
        ("ALL", "（全）"),
    ]
    group = forms.ChoiceField(
        choices=GROUP_CHOICES,
        label='團隊名稱',
        label_suffix='：',
        widget=forms.RadioSelect,
        initial='ALL'
    )

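We created a new form class which, like a Model, specifies the properties of each field. Here there is only one, group, a single-choice ChoiceField. choices is a list of two-item tuples: the first item is the internal value and the second is the displayed text. The rest are fine-tuning.

Use this form in the view: create a form object, form, and pass the variable form into the template.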

from .forms import DrawForm

def home(request):
    form = DrawForm()
    return render(request, 'draw_member/home.html', {
        'form': form,
    })

Then edit the template. We no longer write the form's contents ourselves; with {{ form }} Django generates them automatically.

{# draw_member/home.html #}
{% extends 'draw_member/base.html' %}

{% block content %}
  <h1>來抽出快樂的夥伴吧！</h1>
  <p>選擇要被抽的團體</p>
  <form action="{% url 'draw' %}" method="get">
    {{ form }}
    <input type="submit" value="Submit">
  </form>
{% endblock content %}

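But this doesn't look like our original form. Fortunately Django forms are flexible: how a form is rendered to HTML can be adjusted in detail; see Form rendering options in the official docs. I'll just give the tuned result.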

  <form action="{% url 'draw' %}" method="get">
    {{ form.group.label_tag }}
    {% for radio in form.group %}
      {{ radio.tag }}{{ radio.choice_label }}
    {% endfor %}
    <input type="submit" value="Submit">
  </form>

Compare the output to see which HTML element each {{ ... }} part corresponds to.

More Django form in view

Forms can do more than this. You can pass request.GET directly when creating the DrawForm.

# draw_member/views.py
def draw(request):
    # Retrieve all related members
    form = DrawForm(request.GET)
    if form.is_valid():
        group_name = form.cleaned_data['group']
        if group_name == 'ALL':
            valid_members = Member.objects.all()
        else:
            valid_members = Member.objects.filter(group_name=group_name)
    else:
        # Raise 404 if the submitted group is not a valid choice
        raise Http404("No member in group '%s'" %
                      form.data.get('group', ''))
    # Lucky draw
    lucky_member = random.choice(valid_members)
    # ...

form.is_valid() checks whether each field's data is valid.
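For a choice field, "valid" essentially means the submitted value is one of the allowed choices. A plain-Python sketch of that check follows; clean_group and its return shape are invented for illustration and are not part of Django's form API.

```python
GROUP_CHOICES = [("μ's", "μ's"), ("K-ON!", "K-ON!"), ("ALL", "（全）")]


def clean_group(data):
    """Return (is_valid, cleaned_data, errors) for a dict of submitted values."""
    allowed = {value for value, _label in GROUP_CHOICES}
    value = data.get("group")
    if value is None or value == "":
        return False, {}, {"group": "This field is required."}
    if value not in allowed:
        return False, {}, {"group": "%r is not a valid choice." % value}
    return True, {"group": value}, {}


print(clean_group({"group": "K-ON!"}))
print(clean_group({"group": "AKB48"}))
```

Keeping the allowed values in one list means the same definition drives both what the page displays and what the server accepts.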

Let's also add a template for /draw.

{# draw_member/draw.html #}
{% extends 'draw_member/base.html' %}

{% block title %}抽籤結果{% endblock title %}

{% block content %}
<h1>抽籤結果</h1>
<p>{{ lucky_member.name }}（團體：{{ lucky_member.group_name }}）</p>
{% endblock content %}

# draw_member/views.py
def draw(request):
    # ...
    return render(request, 'draw_member/draw.html', {
        'lucky_member': lucky_member
    })

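For more on Forms, again see the official docs.

Summary

The finished product is on GitHub; follow the README to set up the environment.

That covers a first pass through Django's main parts: Model, View, Template, and Form. You can sense that Django offers far more than Flask, which also means it takes more time to learn. By the end of the rewrite our own code is very short; there may be more code for structure than for logic.

Of course this doesn't mean you've learned Django. Here are a few resources for continuing:

"Django Tutorial for Programmers" (《為程式人寫的 Django Tutorial 》) is a true zero-to-one 30-day study plan (though it took me several months T___T). With this lottery program in mind, a second read should make Django's overall design clearer. Author: Tzu-ping Chung (@uranusjr)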
Mastering Django: Core, the successor to The Django Book last updated in 2009, is the definitive guide to Django targeting the latest Django version 1.8 at the time of writing.

For more branches of the Django skill tree, see lesson 30 of TP's tutorial.

Details

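As in the Flask post, the sections below collect details and improvements that were moved here to keep the main text from getting too long (it already is).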

Raw SQL

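When introducing Django models we went straight to the ORM, but Django can run raw SQL too, and there is even a "smart" flavor of raw SQL that returns the corresponding model objects. Let's see how it works.

Start with the smart flavor: use Model.objects.raw to fetch every member whose group name starts with K-ON.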

>>> list(Member.objects.raw("""
... SELECT id, name, group_name
... FROM draw_member_member
... WHERE group_name LIKE 'K-ON%%'
... """))
[<Member: 平沢 唯 of K-ON!>,
 <Member: 秋山 澪 of K-ON!>,
 <Member: 田井中 律 of K-ON!>,
 <Member: 琴吹 紬 of K-ON!>,
 <Member: 中野 梓 of K-ON!>]

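This returns a RawQuerySet whose items are still Member objects. Django maps rows to objects by primary key, so a raw() query must always select the primary key. Note that the % has to be escaped as %%, because a raw() query can take parameters (much like Python's built-in str %-formatting).

But how do we know which table Member is stored in? The default is <app>_<model>; the actual name lives in the meta option db_table, which can also be overridden.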

>>> Member._meta.db_table
'draw_member_member'

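Member has fields such as name and group_name, and a query does not have to list all of them in SELECT. Fields left out are deferred: Django only asks the database for them when their value is actually accessed. In everyday use you won't notice the difference.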

>>> m = list(Member.objects.raw(
...     "SELECT id FROM draw_member_member"
... ))[0]
>>> type(m)
draw_member.models.Member_Deferred_group_name_name
>>> m.get_deferred_fields()
{'group_name', 'name'}

But I just don't want the ORM: it's slow and it can't express complex queries (fighting words, I know). That takes us back to the most traditional concepts, a database connection and a cursor, just like Flask without SQLAlchemy.

>>> from django.db import connection
>>> c = connection.cursor()
>>> list(c.execute("""
... SELECT name
... FROM draw_member_member
... WHERE group_name LIKE %s
... """, ["K-ON%"]))
[('平沢 唯',), ('秋山 澪',), ('田井中 律',), ('琴吹 紬',), ('中野 梓',)]
>>> list(c.execute("""
... SELECT member_id, time
... FROM draw_member_history
... LIMIT 3
... """))
[(8, datetime.datetime(2015, 10, 5, 17, 36, 41, 608078, tzinfo=<UTC>)),
 (11, datetime.datetime(2015, 10, 5, 17, 37, 26, 164830, tzinfo=<UTC>)),
 (11, datetime.datetime(2015, 10, 5, 17, 37, 37, 483697, tzinfo=<UTC>))]

Here you go.

Better QuerySet

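Having seen raw SQL, let's think about improving our use of the ORM. Assembling each query on the spot, as if writing SQL, works, but the point of an ORM is to wrap up such implementation details. Examples: the most recent n draws, or the list of all member group names (currently hard-coded in DrawForm).

Common queries can be turned into methods, so the 10 most recent draws become just History.objects.recent(10).

There are several ways to do this: write a classmethod, override the default Manager, or override the default QuerySet. Which is best? It has been discussed on StackOverflow and on the mailing list. All three achieve the same effect, but the latter two are preferred: a Manager (or a QuerySet, for Django 1.7+) handles operations at the level of the database table behind the model, whereas a classmethod should handle operations on a single model object already taken out of a table row. If code of the same nature is to be kept together, Manager (QuerySet) is the place.

TP also gave his verdict on Gitter, so that settles it (case closed). Let's rewrite the models.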

# draw_member/models.py
class MemberQuerySet(models.QuerySet):

    def unique_groups(self):
        return self.values_list('group_name', flat=True).distinct()


class HistoryQuerySet(models.QuerySet):

    def recent(self, n):
        return self.order_by('-time')[:n]


class Member(models.Model):
    # ...
    objects = MemberQuerySet.as_manager()


class History(models.Model):
    # ...
    objects = HistoryQuerySet.as_manager()

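On Member we defined unique_groups, which returns all group names; on History we defined recent, which returns the n most recent records by time. The manager created by QuerySet.as_manager() replaces the original Model.objects.

Next, rewrite the view to replace the query we wrote earlier.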

#draw_member/views.py
def history(request):
    # ...
    recent_draws = History.objects.recent(10)
    # ...

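That's a bit tidier. While we're here, let's make the form more flexible so that group names are no longer hard-coded.

# draw_member/forms.py
from django import forms

from .models import Member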


def member_group_choices():
    valid_groups = Member.objects.unique_groups()
    choices = []
    for grp in valid_groups:
        choices.append((grp, grp))
    choices.append(('ALL', '（全）'))
    return choices


class DrawForm(forms.Form):
    group = forms.ChoiceField(
        choices=member_group_choices,
        # ...
    )

Timezone

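I seem to be writing about time zones a lot lately. Recording every time on the server in UTC avoids most problems, but in the end the time still has to be shown in the user's time zone.

The catch is that HTTP headers carry no such information. You can guess with GeoIP, or detect the time zone with a bit of JavaScript when the page loads; either way it is something you have to record separately. The official docs cover the details.

The main text used activate('Asia/Taipei') to switch the time zone to UTC+8. Here is another way, done inside the template.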

{# draw_member/templates/draw_member/history.html #}
{% block content %}
  {% load tz %}
  <h1>抽籤歷史（最近 10 筆）</h1>
  <table>
  {# ... #}
    <tbody>
    {% timezone 'Asia/Taipei' %}
    {% for history in recent_histories %}
      <tr>{# ... #}</tr>
    {% endfor %}
    {% endtimezone %}
    </tbody>
  </table>
{% endblock content %}

POST form and CSRF

I forgot to mention: our form currently uses method="get". Switching back to POST is easy: just replace GET with POST.

# draw_site/views.py
from django.views.decorators.http import require_POST

@require_POST
def draw(request):
    # Retrieve all related members
    form = DrawForm(request.POST)
    # ...

{# draw_site/templates/home.html #}
  <form action="{% url 'draw' %}" method="post">

Let's try it right away.

POST form without CSRF token

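We get a 403 Forbidden: "CSRF verification failed." CSRF (Cross-Site Request Forgery) is described more fully on Wikipedia. It is an attack in which, after a user has logged in to a site (so the session is authenticated), an attacker forges a form identical to the site's and submits it to impersonate the user. Take a ticketing system: without a check, I could ride on the user's session to buy tickets at will, and the site would believe the user did it.

The CSRF token guards against this forgery. When the site generates a form it also generates a value for an extra field; the value keeps changing, so the site can confirm that the form really came from itself. Django implements CSRF protection as middleware, a concept we haven't covered, which tells you it sits at a lower level. The upside is that our view code doesn't need to change: in this example we only add {% csrf_token %} to the form.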
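
To make the idea concrete, here is a minimal sketch of a per-session token check using only the standard library. It is not Django's implementation (the middleware does considerably more); the function names issue_token and is_valid_submission and the dict standing in for a session are all made up for illustration.

```python
import hmac
import secrets


def issue_token(session):
    """Create a random token, remember it in the session, and return it
    so it can be embedded in the form as a hidden field."""
    token = secrets.token_urlsafe(32)
    session['csrf_token'] = token
    return token


def is_valid_submission(session, submitted_token):
    """Accept the form only if it carries the token stored in the session."""
    expected = session.get('csrf_token')
    if expected is None or submitted_token is None:
        return False
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected, submitted_token)


session = {}                      # stand-in for a real server-side session
token = issue_token(session)      # the site renders this into its own form
genuine = is_valid_submission(session, token)
forged = is_valid_submission(session, 'guessed-by-attacker')
```

A form forged on another site cannot read the token, so its submission fails the check while the site's own form passes.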

{# draw_site/templates/home.html #}
  <form action="{% url 'draw' %}" method="post">
    {# ... #}
    {% csrf_token %}
    <input type="submit" value="Submit">
  </form>

Datetime in SQLite and Python

2015-09-28T12:00:00-05:00

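Handling time correctly is not easy. In our earlier example we simply stored the time in SQLite as a formatted string. That has several problems.

First, time zones. We stored the server's local time directly in the database; Taipei's zone is Asia/Taipei (UTC+8). If the server moves to another zone, say Tokyo, Asia/Tokyo (UTC+9), the database then holds times from two zones, and because they are plain strings you cannot tell them apart.

Storing times as strings has its own problems. One is sorting: our format happens to sort correctly, but a different format (such as %H:%M:%S %Y-%m-%d) might not. Any later processing of the times also gets more complicated. Part of this is because SQLite has no dedicated date/time data type, whereas PostgreSQL, for example, understands them natively.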
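
The sorting point is easy to check. The sketch below formats the same two moments (arbitrary example values) in both layouts and sorts the resulting strings:

```python
from datetime import datetime

earlier = datetime(2015, 9, 25, 16, 30, 0)
later = datetime(2015, 9, 28, 9, 5, 0)

iso_fmt = '%Y-%m-%d %H:%M:%S'      # year first: text order matches time order
odd_fmt = '%H:%M:%S %Y-%m-%d'      # hour first: text order follows the clock time

iso_sorted = sorted([later.strftime(iso_fmt), earlier.strftime(iso_fmt)])
odd_sorted = sorted([later.strftime(odd_fmt), earlier.strftime(odd_fmt)])

print(iso_sorted)   # the earlier moment comes first
print(odd_sorted)   # the later moment comes first, because 09:05 < 16:30
```

With the hour-first layout the string comparison never gets as far as the date, so the order is wrong as soon as two records fall on different days.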

So handling time correctly comes down to a few points:

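Times stored in the database should be expressed in UTC
When working with times (sorting, display, time zone handling), convert them to a proper data type (such as datetime)
Convert to the time zone of the user (or server) only when presenting them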

Below is a more correct way to handle time.

Time zones
Datetime in SQLite, again
- Built-in timezone support in Python 3
Making Python's built-in SQLite adapter time zone aware
Summary

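Time zones

We haven't dealt with time zones at all yet. In Python's built-in datetime a time zone is only a "concept": you can attach one (stored in datetime.tzinfo), and Python will then compare and compute correctly for datetimes that carry it. But it does not know what Taipei's or New York's zone actually is.

Why leave out something as important as regional time zones? Because the rules change quickly, and daylight saving time can differ from year to year; the time zone data would be updated many times before the next Python release even shipped.

So in practice time zone handling in Python relies on the third-party package pytz. Like Flask, it installs with pip install pytz.

Let's try it.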

>>> from datetime import datetime
>>> datetime.now()      # local time
datetime.datetime(2015, 9, 29, 16, 33, 39, 537111)
>>> datetime.utcnow()   # UTC time
datetime.datetime(2015, 9, 29, 8, 33, 39, 538745)

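First, datetime provides two functions, now() and utcnow(), for getting the current time. Taipei is UTC+8, so its wall-clock time reads 8 hours ahead of UTC. Note that neither returned datetime object carries any time zone information.

As a rule, treat UTC as the baseline. We store the current UTC time in a variable utcnow and handle it with pytz. After importing pytz we define two zones: UTC and TPE (Taipei time).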

>>> import pytz
>>> utc = pytz.utc
>>> tpe = pytz.timezone('Asia/Taipei')
>>> utcnow = datetime.utcnow()
>>> utcnow
datetime.datetime(2015, 9, 29, 8, 38, 14, 738241)
>>> utc.fromutc(utcnow)
datetime.datetime(2015, 9, 29, 8, 38, 14, 738241, tzinfo=<UTC>)
>>> tpe.fromutc(utcnow)
datetime.datetime(2015, 9, 29, 16, 38, 14, 738241, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>)
>>> utc.fromutc(utcnow) == tpe.fromutc(utcnow)
True

Once a datetime has passed through a pytz zone it carries tzinfo, and times in different zones compare correctly.

What about an arbitrary time, say New Year's Day 2016 in Taipei?

>>> str(datetime(2016, 1, 1, 0, 0))
'2016-01-01 00:00:00'
>>> tpe_2016_newyear = tpe.localize(datetime(2016, 1, 1, 0, 0))
>>> tpe_2016_newyear
datetime.datetime(2016, 1, 1, 0, 0, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>)
>>> utc.normalize(tpe_2016_newyear)
datetime.datetime(2015, 12, 31, 16, 0, tzinfo=<UTC>)

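Use .localize(<datetime>) to attach a zone to a datetime that has none. Once it has a zone, convert between zones with .normalize(<datetime>), as here, or with the standard datetime method .astimezone(<tz>).

We can also check what time it is on the US East Coast when 2016 begins in Taipei.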

>>> est = pytz.timezone('US/Eastern')
>>> est.normalize(tpe_2016_newyear)
datetime.datetime(2015, 12, 31, 11, 0, tzinfo=<DstTzInfo 'US/Eastern' EST-1 day, 19:00:00 STD>)

No more confusion about when a game broadcast or a big announcement starts.

Datetime in SQLite, again

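Now that we can handle datetimes and time zones, let's rewrite how times are stored in SQLite. Python's sqlite3 module can in fact convert datetime values to and from SQLite; the example is taken from the Python module documentation.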

>>> from datetime import datetime
>>> import sqlite3
>>> import pytz
>>> utc = pytz.utc
>>> tpe = pytz.timezone('Asia/Taipei')

>>> db = sqlite3.connect(
...     "test.db",
...     detect_types=sqlite3.PARSE_DECLTYPES|sqlite3.PARSE_COLNAMES
... )
>>> db.execute('CREATE TABLE test(dt timestamp)')
<sqlite3.Cursor object at 0x10a59b960>

The column is declared as timestamp, and PARSE_DECLTYPES and PARSE_COLNAMES are set on the connection; their effect will show shortly. Let's store some times.

>>> utcnow = datetime.utcnow()
>>> utcnow
datetime.datetime(2015, 9, 29, 12, 48, 16, 671538)
>>> with db:
...     db.execute(
...         'INSERT INTO test(dt) VALUES (?)',
...         (utcnow, )
...     )
...
<sqlite3.Cursor object at 0x1082380a0>
>>> tpe_2016_newyear = tpe.localize(datetime(2016, 1, 1, 0, 0))
>>> utc.normalize(tpe_2016_newyear)
datetime.datetime(2015, 12, 31, 16, 0, tzinfo=<UTC>)
>>> utc_dt = utc.normalize(tpe_2016_newyear).replace(tzinfo=None)
>>> with db:
...     db.execute(
...         'INSERT INTO test(dt) VALUES (?)',
...         (utc_dt, )
...     )
...
<sqlite3.Cursor object at 0x10a59b960>

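We stored two times: the current UTC time, and New Year 2016 in Taipei expressed in UTC. Note that the UTC tzinfo was stripped from both, because in some situations the datetime adapter between SQLite and Python cannot parse a time zone (this is bug #19065).

Viewed from SQLite, both times appear in UTC. SQLite's built-in datetime('<UTC time>', 'localtime') can still convert a UTC time string to the computer's local time. Storing times this way stays compatible with other applications.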

-- sqlite3 test.db
sqlite> .schema
CREATE TABLE test(dt timestamp);
sqlite> SELECT dt FROM test;
2015-09-29 12:48:16.671538
2015-12-31 16:00:00
sqlite> SELECT datetime(dt, 'localtime') FROM test;
2015-09-29 20:48:16
2016-01-01 00:00:00

Reading them back in Python still gives datetime objects:

>>> ret_vals = db.execute('SELECT dt AS "[timestamp]" FROM test').fetchall()
>>> ret_vals
[(datetime.datetime(2015, 9, 29, 12, 48, 16, 671538),),
 (datetime.datetime(2015, 12, 31, 16, 0),)]
>>> [tpe.fromutc(t[0]) for t in ret_vals]
[datetime.datetime(2015, 9, 29, 20, 48, 16, 671538, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>),
 datetime.datetime(2016, 1, 1, 0, 0, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>)]
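
The session above relies on the sqlite3 module's default datetime adapter and converter, which Python 3.12 and later mark as deprecated. A variant that does not depend on them is to store UTC as an ISO 8601 string yourself and parse it when reading. The sketch below assumes a throwaway in-memory database and Python 3.7+ for datetime.fromisoformat; it is one possible approach, not the one used in this post.

```python
import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect(':memory:')          # throwaway in-memory database
db.execute('CREATE TABLE test(dt TEXT)')

# Express the moment in UTC first, then save it as an ISO 8601 string.
new_year_utc = datetime(2015, 12, 31, 16, 0, tzinfo=timezone.utc)
with db:
    db.execute('INSERT INTO test(dt) VALUES (?)',
               (new_year_utc.isoformat(sep=' '),))

stored_text = db.execute('SELECT dt FROM test').fetchone()[0]
restored = datetime.fromisoformat(stored_text)   # parses the offset as well
```

Because the offset is part of the stored text, the value comes back as an aware datetime equal to the original.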

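Built-in timezone support in Python 3

While writing this post I also looked into the built-in datetime.timezone. Python 2 doesn't have it, but it does have the basic timedelta, so rolling your own should be doable... probably?

A built-in datetime.timezone is constructed from a UTC offset, that is, the difference from UTC expressed as a datetime.timedelta. A UTC zone is built in as well. Here I create Taipei and Tokyo zones.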

>>> import datetime as dt
>>> tpe_now = dt.datetime.now()
>>> tpe_now
datetime.datetime(2015, 9, 29, 20, 40, 49, 347568)
>>> utc = dt.timezone.utc
>>> tpe_delta = dt.timedelta(hours=8)
>>> tpe = dt.timezone(tpe_delta)
>>> jpn_delta = dt.timedelta(hours=9)
>>> jpn = dt.timezone(jpn_delta)

I'm in Taipei, so datetime.datetime.now() gives Taipei time; the times in the other zones are then computed by hand with timedelta.

>>> utc_now = tpe_now - tpe_delta  # manually time shift
>>> jpn_now = utc_now + jpn_delta
>>> utc_now
datetime.datetime(2015, 9, 29, 12, 40, 49, 347568)
>>> tpe_now == utc_now
False

Comparing these computed times directly gives, unsurprisingly, not equal, because tzinfo is empty by default.

>>> tpe_now.replace(tzinfo=tpe)
datetime.datetime(2015, 9, 29, 20, 40, 49, 347568, tzinfo=datetime.timezone(datetime.timedelta(0, 28800)))
>>> tpe_now.replace(tzinfo=tpe) == utc_now.replace(tzinfo=utc)
True
>>> jpn_now.replace(tzinfo=jpn) == tpe_now.replace(tzinfo=tpe)
True

Once each is given its zone's tzinfo, datetime comparison takes the UTC offset into account.

Next, let's see how well pytz and the built-in datetime.timezone work together.
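
Instead of shifting by hand with timedelta, an aware datetime can convert itself with astimezone. A small sketch with the same fixed offsets (the labels 'TPE' and 'JPN' are arbitrary names chosen here, and fixed offsets know nothing about daylight saving time):

```python
import datetime as dt

utc = dt.timezone.utc
tpe = dt.timezone(dt.timedelta(hours=8), 'TPE')   # fixed UTC+8, no DST rules
jpn = dt.timezone(dt.timedelta(hours=9), 'JPN')   # fixed UTC+9

tpe_new_year = dt.datetime(2016, 1, 1, 0, 0, tzinfo=tpe)
as_utc = tpe_new_year.astimezone(utc)   # same instant, UTC wall clock
as_jpn = tpe_new_year.astimezone(jpn)   # same instant, Tokyo wall clock

print(as_utc.isoformat())
print(as_jpn.isoformat())
```

All three objects represent the same instant, so they compare equal even though their wall-clock fields differ.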

>>> import pytz
>>> pytz_utc = pytz.utc
>>> pytz_tpe = pytz.timezone('Asia/Taipei')
>>> pytz.utc.normalize(jpn_now.replace(tzinfo=jpn))
datetime.datetime(2015, 9, 29, 12, 40, 49, 347568, tzinfo=<UTC>)
>>> pytz_tpe.localize(tpe_now) == utc_now.replace(tzinfo=utc)
True

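Comparison and conversion both work, so the two can be mixed safely.

Making Python's built-in SQLite adapter time zone aware

I looked at Python issue 19065. It remains unresolved essentially for lack of a patch: the existing patch is not compatible with Python 2.x (which has no datetime.timezone), and the pysqlite maintainer has no intention of supporting time zones.

But that only concerns the built-in adapter for datetime.datetime objects; writing our own is no problem. The code follows the workaround posted in the issue (a GitHub gist).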

# tz_aware_adpater.py
# Adapt from https://gist.github.com/acdha/6655391
import datetime
import sqlite3

def tz_aware_timestamp_adapter(val):
    datepart, timepart = val.split(b" ")
    year, month, day = map(int, datepart.split(b"-"))

    if b"+" in timepart:
        timepart, tz_offset = timepart.rsplit(b"+", 1)
        if tz_offset == b'00:00':
            tzinfo = datetime.timezone.utc
        else:
            hours, minutes = map(int, tz_offset.split(b':', 1))
            tzinfo = datetime.timezone(
                datetime.timedelta(hours=hours, minutes=minutes))
    else:
        tzinfo = None

    timepart_full = timepart.split(b".")
    hours, minutes, seconds = map(int, timepart_full[0].split(b":"))
    if len(timepart_full) == 2:
        microseconds = int('{:0<6.6}'.format(timepart_full[1].decode()))
    else:
        microseconds = 0

    val = datetime.datetime(
        year, month, day, hours, minutes, seconds, microseconds,
        tzinfo
    )
    return val

sqlite3.register_converter('timestamp', tz_aware_timestamp_adapter)
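
The converter above only recognises offsets written with "+". On Python 3.7 or later, datetime.fromisoformat parses both signs, so a shorter converter can be sketched as follows. The type name tz_timestamp and the table t are made up for this example so that it does not replace the timestamp converter registered above.

```python
import datetime
import sqlite3


def convert_iso(val):
    """Turn the raw bytes stored by SQLite into a datetime.
    fromisoformat keeps the UTC offset if one is present."""
    return datetime.datetime.fromisoformat(val.decode())


sqlite3.register_converter('tz_timestamp', convert_iso)

db = sqlite3.connect(':memory:', detect_types=sqlite3.PARSE_DECLTYPES)
db.execute('CREATE TABLE t(dt tz_timestamp)')
db.executemany('INSERT INTO t(dt) VALUES (?)', [
    ('2016-06-03 08:00:00+08:00',),   # Taipei
    ('2016-06-02 19:00:00-05:00',),   # a negative offset, same instant
])
rows = [row[0] for row in db.execute('SELECT dt FROM t')]
```

Both rows come back as aware datetimes that compare equal, since they describe the same instant.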

python3 -i tz_aware_adpater.py
>>> db = sqlite3.connect(
...     'test.db',
...      detect_types=sqlite3.PARSE_DECLTYPES|sqlite3.PARSE_COLNAMES
... )
>>> import pytz
>>> tpe = pytz.timezone('Asia/Taipei')
>>> pycontw = tpe.localize(datetime.datetime(2016, 6, 3, 8, 0))
>>> str(pycontw)
'2016-06-03 08:00:00+08:00'
>>> db.executemany(
...     'INSERT INTO test(dt) VALUES (?)',
...     [(pycontw,), (pytz.utc.normalize(pycontw),)]
... )
>>> db.commit()

We stored two time zone aware times (both are the same instant). First read them from SQLite.

sqlite> SELECT dt FROM test;
2015-09-29 12:48:16.671538
2015-12-31 16:00:00
2016-06-03 08:00:00+08:00
2016-06-03 00:00:00+00:00
sqlite> SELECT datetime(dt, 'localtime') FROM test;
2015-09-29 20:48:16
2016-01-01 00:00:00
2016-06-03 08:00:00
2016-06-03 08:00:00

The time zone is written into SQLite as is; a value without one is treated as UTC.

Now read them back in Python to test the modified adapter.

>>> dts = db.execute('SELECT dt FROM test').fetchall()
>>> dts
[(datetime.datetime(2015, 9, 29, 12, 48, 16, 671538),),
 (datetime.datetime(2015, 12, 31, 16, 0),),
 (datetime.datetime(2016, 6, 3, 8, 0, tzinfo=datetime.timezone(datetime.timedelta(0, 28800))),),
 (datetime.datetime(2016, 6, 3, 0, 0, tzinfo=datetime.timezone.utc),)]
>>> [t[0].astimezone(tpe) for t in dts[2:]]
[datetime.datetime(2016, 6, 3, 8, 0, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>),
 datetime.datetime(2016, 6, 3, 8, 0, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>)]

They read back fine. To handle every case (the first two datetimes are naive, with no time zone):

>>> [(t[0].astimezone(pytz.utc) if t[0].tzinfo is not None
...   else pytz.utc.localize(t[0]))
...  for t in dts]
[datetime.datetime(2015, 9, 29, 12, 48, 16, 671538, tzinfo=<UTC>),
 datetime.datetime(2015, 12, 31, 16, 0, tzinfo=<UTC>),
 datetime.datetime(2016, 6, 3, 0, 0, tzinfo=<UTC>),
 datetime.datetime(2016, 6, 3, 0, 0, tzinfo=<UTC>)]
>>> [(t[0].astimezone(tpe) if t[0].tzinfo is not None
...   else tpe.fromutc(t[0]))
...  for t in dts]
[datetime.datetime(2015, 9, 29, 20, 48, 16, 671538, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>),
 datetime.datetime(2016, 1, 1, 0, 0, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>),
 datetime.datetime(2016, 6, 3, 8, 0, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>),
 datetime.datetime(2016, 6, 3, 8, 0, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>)]

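Summary

Time zones are a real pain, especially since many places don't fully support them. The best approach is still to communicate in UTC and convert to local time only when truly needed.

If you are interested in time zones, PEP 495 was accepted not long ago and, barring surprises, should land in Python 3.6. It deals with daylight saving time. (In Taiwan we have basically no feel for daylight saving time.)

I have to say, handling time correctly is... a lot of trouble.

Building a lucky-draw website with Flask and SQLite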

2015-09-28T12:00:00-05:00

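Written for the undergraduate project students in our lab.

The real goal is Django + Django ORM + PostgreSQL, but taking on too much at once backfires, so we start with something simpler. What follows is therefore not best practice, but it uses the least (and lowest-level) knowledge and tooling. If packages such as SQLAlchemy hide the details from the start, it is harder to grasp the important concepts and why those packages are needed.

This post is very long and can't be read in a few minutes. It is aimed at beginners learning to build a simple website.

The project's goal: nobody asks questions at our group meetings, so the professor needs a tool that draws lots to pick someone to ask one. Our lab is divided into several groups, so the draw must also work within a single group.

The examples below use the members of LoveLive! and K-ON!. Disclaimer: I haven't watched either anime, so if a name is misspelled please tell me; it is definitely not on purpose. (Update 2016-06-14: I have now watched both!)

Data design

Assume all files live in one folder; let's call it draw_member. Unless stated otherwise, everything below happens in this directory.

The raw data is stored as CSV with two fields, name (名字) and group (團體). Since the data may later be exported, the file also gets a "date last drawn" (最近被抽到的日期) field; the hope is that recently drawn members have a slightly lower chance of being drawn again.

The CSV file is named members.csv. Nobody has been drawn yet, so the last column is left empty for now¹; the first line holds the column names. When exported from the database, this column will have values.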
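
The post does not go on to implement this "lower chance for recently drawn members" idea. As a purely hypothetical sketch of what it could look like, the code below halves the weight of anyone drawn within the last 7 days and draws with random.choices (Python 3.6+). The function name draw_weights, the 7-day window, the 0.5 weight, and the sample dates are all assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta


def draw_weights(last_drawn, now, cooldown=timedelta(days=7)):
    """Return one weight per member: 0.5 if drawn within `cooldown`, else 1.0."""
    weights = []
    for drawn_at in last_drawn:
        recently = drawn_at is not None and now - drawn_at < cooldown
        weights.append(0.5 if recently else 1.0)
    return weights


names = ['高坂 穂乃果', '平沢 唯', '中野 梓']
# None means never drawn; the other two are made-up draw times.
last_drawn = [None, datetime(2015, 9, 27, 12, 0), datetime(2015, 9, 1, 12, 0)]
now = datetime(2015, 9, 28, 12, 0)

weights = draw_weights(last_drawn, now)
lucky = random.choices(names, weights=weights, k=1)[0]
```

Here only the second member was drawn within the window, so that member is half as likely to be picked as each of the others.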

"名字","團體","最近被抽到的日期"
"高坂 穂乃果","μ's",""
"絢瀬 絵里","μ's",""
"南 ことり","μ's",""
"園田 海未","μ's",""
"星空 凛","μ's",""
"西木野 真姫","μ's",""
"東條 希","μ's",""
"小泉 花陽","μ's",""
"矢澤 にこ","μ's",""
"平沢 唯","K-ON!",""
"秋山 澪","K-ON!",""
"田井中 律","K-ON!",""
"琴吹 紬","K-ON!",""
"中野 梓","K-ON!",""

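First make sure we can read the data with Python. Python has a built-in module, csv, for reading and writing CSV files. Here we use csv.DictReader, which by default treats the first line of the file as the field names and produces one dict per row keyed by those names.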

import csv

with open('./members.csv', newline='') as f:
    r = csv.DictReader(f)
    for row in r:
        print(row)

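You can type this code straight into the Python REPL, or save it to a file and run it with Python. Either way the result is (the empty third column is left out of this listing):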

{'名字': '高坂 穂乃果', '團體': "μ's"}
{'名字': '絢瀬 絵里', '團體': "μ's"}
{'名字': '南 ことり', '團體': "μ's"}
{'名字': '園田 海未', '團體': "μ's"}
{'名字': '星空 凛', '團體': "μ's"}
{'名字': '西木野 真姫', '團體': "μ's"}
{'名字': '東條 希', '團體': "μ's"}
{'名字': '小泉 花陽', '團體': "μ's"}
{'名字': '矢澤 にこ', '團體': "μ's"}
{'名字': '平沢 唯', '團體': 'K-ON!'}
{'名字': '秋山 澪', '團體': 'K-ON!'}
{'名字': '田井中 律', '團體': 'K-ON!'}
{'名字': '琴吹 紬', '團體': 'K-ON!'}
{'名字': '中野 梓', '團體': 'K-ON!'}

If, instead of calling print(row) directly, we tidy the data a little before printing:

import csv

with open('./members.csv', newline='') as f:
    r = csv.DictReader(f)
    for row in r:
        print('{} of {}'.format(row['名字'], row['團體']))

the output becomes:

高坂 穂乃果 of μ's
絢瀬 絵里 of μ's
南 ことり of μ's
園田 海未 of μ's
星空 凛 of μ's
西木野 真姫 of μ's
東條 希 of μ's
小泉 花陽 of μ's
矢澤 にこ of μ's
平沢 唯 of K-ON!
秋山 澪 of K-ON!
田井中 律 of K-ON!
琴吹 紬 of K-ON!
中野 梓 of K-ON!

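So we can read the data with Python. Getting a single field is easy too: for the name, just use row['名字'].

Site architecture

The lucky-draw site has just a few features:

The home page lets you pick one group, or everyone, to draw from
- After submitting you see the result
- The result is also added to the history
History lists past draws
Update members clears all data and reloads it

Every page should have a navigation menu for switching between features.

The database therefore has two tables, members and draw_histories, recording the members and the times they were drawn.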

    ┌─────────────────────┐
    │ members             │
    ├─────────────────────┤
    │ id          INTEGER │ <─┐
    │ name           TEXT │   │
    │ group_name     TEXT │   │
    └─────────────────────┘   │
                              │
    ┌─────────────────────┐   │
    │ draw_histories      │   │ foreign
    ├─────────────────────┤   │ key
    │ memberid    INTEGER │ ──┘
    │ time       DATETIME │
    └─────────────────────┘

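Table members is straightforward: one column for the name, name, and one for the group, group_name. The id column is for internal use and is generated automatically when the data is loaded.

Table draw_histories records when each draw happened, time, and who was drawn, memberid. Since memberid holds a member's id, we add a constraint that its value must appear among the ids in members.

Environment setup

We use Flask for the server because it is very easy to get started with. The data will be loaded into a SQLite database.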
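
One thing to know about this constraint: SQLite does not enforce foreign keys by default, and each connection has to switch enforcement on with PRAGMA foreign_keys = ON. The sketch below uses a simplified copy of the two tables in an in-memory database to show the constraint rejecting an unknown id.

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('PRAGMA foreign_keys = ON')    # enforcement is off by default
db.execute('CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT)')
db.execute('CREATE TABLE draw_histories ('
           'memberid INTEGER, '
           'FOREIGN KEY (memberid) REFERENCES members(id))')
db.execute("INSERT INTO members (id, name) VALUES (1, '平沢 唯')")

db.execute('INSERT INTO draw_histories (memberid) VALUES (1)')   # accepted
try:
    db.execute('INSERT INTO draw_histories (memberid) VALUES (99)')
    rejected = False
except sqlite3.IntegrityError:
    rejected = True                       # no member has id 99

history_count = db.execute(
    'SELECT COUNT(*) FROM draw_histories').fetchone()[0]
```

Without the PRAGMA line the second insert would go through silently.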

Flask is a microframework for Python based on Werkzeug, Jinja 2 and good intentions. (Flask official site)

SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. (SQLite official site)

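Python environment

We use Python 3.5. In principle SQLite is already installed and ready to use. Python development usually happens in a virtual environment, whose benefit is that packages installed in it are managed independently, unaffected by the system or by other virtual environments. We use the built-in venv to create one named VENV: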

$ python3.5 -m venv VENV

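The directory now contains a VENV folder holding a complete Python layout, as if Python had been installed at that path. Never mind for now how the isolation works; knowing how to use it is enough. To enter and leave it: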

$ source VENV/bin/activate
(VENV) $ which python
# /path/to/draw_member/VENV/bin/python

(VENV) $ deactivate
$  # the (VENV) prefix disappears

Installing Flask, Jinja2, and other packages

Python manages installed packages with pip:

(VENV) $ pip install flask jinja2
# Collecting flask
# Collecting jinja2
# ... (dependencies are installed as well)

Listing the installed packages now shows:

(VENV) $ pip freeze
# Flask==0.10.1
# itsdangerous==0.24
# Jinja2==2.8
# MarkupSafe==0.23
# Werkzeug==0.10.4

To make it easy to set up the environment on another machine later, remember to run

pip freeze > requirements.txt

The benefit of saving the package versions to a file is that, next time, setting up the environment only takes

pip install -r requirements.txt

and the setup is done.

SQLite Database

Let's set up the SQLite tables first, so that later we can focus on reading and writing data. The model above translates to SQL as:

-- create_db.sql
CREATE TABLE members (
    id INTEGER PRIMARY KEY ASC AUTOINCREMENT,
    name TEXT NOT NULL,
    group_name TEXT
);

CREATE TABLE draw_histories (
    memberid INTEGER,
    time DATETIME DEFAULT (datetime('now', 'localtime')),
    FOREIGN KEY (memberid) REFERENCES members(id)
);

Save this SQL to a file named create_db.sql and try it out. We insert two members into the database:

$ sqlite3 -init create_db.sql test.db

-- Loading resources from create_db.sql

SQLite version 3.8.11.1 2015-07-29 20:00:57
Enter ".help" for usage hints.
sqlite> INSERT INTO members (name, group_name)
   ...> VALUES
   ...> ('高坂 穂乃果', 'μ''s'),
   ...> ('平沢 唯', 'K-ON!');
sqlite> .header on
sqlite> SELECT * FROM members;
id|name|group_name
1|高坂 穂乃果|μ's
2|平沢 唯|K-ON!

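sqlite3 -init xxx.sql runs all the SQL statements in xxx.sql, so the tables already exist when we enter the SQLite shell.

Next, simulate a few draws. We gave draw_histories.time a default value, so a draw only needs to say who was drawn; SQLite fills in the time at which the statement ran. Let's try both ways anyway.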

sqlite> INSERT INTO draw_histories (memberid) VALUES (1), (2);
sqlite> INSERT INTO draw_histories (memberid, time)
   ...> VALUES (2, datetime('2015-09-25 16:30'));

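The first INSERT draws Honoka (高坂 穂乃果) and Yui (平沢 唯) once each. The second INSERT draws Yui again, this time with an explicit time of 4:30 pm on September 25. See the official documentation for more on SQLite's datetime; what we use here is enough for our example.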

sqlite> SELECT * FROM draw_histories;
memberid|time
1|2015-09-28 16:55:03
2|2015-09-28 16:55:03
2|2015-09-25 16:30:00

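The first two rows are the history created by the first INSERT and depend on when you ran it. The second INSERT specified a time, so that record is always the afternoon of September 25.

draw_histories stores only memberid. A slightly more complex query can list each member's name and group alongside.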

sqlite> SELECT m.id, m.name, m.group_name, d.time AS draw_time
   ...> FROM draw_histories AS d, members as m
   ...> WHERE m.id == d.memberid
   ...> ORDER BY d.time ASC;
id|name|group_name|draw_time
2|平沢 唯|K-ON!|2015-09-25 16:30:00
1|高坂 穂乃果|μ's|2015-09-28 16:55:03
2|平沢 唯|K-ON!|2015-09-28 16:55:03

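Loading the CSV into the database

We'll name the database used from here on members.db. First write the initial data into it.

The only new step is working with SQLite from Python. Python's built-in sqlite3 module handles database access. First make sure these files exist:

members.csv: all member data
create_db.sql: database schema

Import the modules we need: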

>>> import sqlite3
>>> import csv

Read the member data from the CSV as before, this time tidying it slightly and storing it in the variable members.

>>> with open('./members.csv', newline='') as f:
...     csv_reader = csv.DictReader(f)
...     members = [
...         (row['名字'], row['團體'])
...         for row in csv_reader
...     ]
...
>>> members
[('高坂 穂乃果', "μ's"),
 ('絢瀬 絵里', "μ's"),
 ('南 ことり', "μ's"),
 ('園田 海未', "μ's"),
 # ...
 ('中野 梓', 'K-ON!')]

Now the new part: create a connection to the SQLite database with sqlite3.connect(), then use that connection to run SQL. First create all the tables:

>>> with open('create_db.sql') as f:
...     create_db_sql = f.read()
...
>>> db = sqlite3.connect('members.db')
>>> with db:
...     db.executescript(create_db_sql)
...

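db.executescript('...') runs a series of SQL statements (note that they must be separated by semicolons). Wrapping the calls in with db: ... makes the sqlite3 module commit the enclosed statements for us automatically², equivalent to:

>>> c = db.cursor()
>>> c.executescript(create_db_sql)
>>> db.commit()

Then write the members variable we read earlier into the table.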

>>> with db:
...     db.executemany(
...         'INSERT INTO  members (name, group_name) VALUES (?, ?)',
...         members
...     )
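
The with db: block commits when it finishes normally; if the block raises an exception, the pending changes are rolled back instead. A minimal sketch with a one-column table in an in-memory database (the RuntimeError is raised on purpose to trigger the rollback):

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE members (name TEXT)')

with db:                                  # commits on success
    db.execute("INSERT INTO members VALUES ('kept')")

try:
    with db:                              # rolls back on exception
        db.execute("INSERT INTO members VALUES ('discarded')")
        raise RuntimeError('something went wrong mid-way')
except RuntimeError:
    pass

names = [row[0] for row in db.execute('SELECT name FROM members')]
```

Only the row from the first block survives, which is the behavior you want when several related writes must succeed or fail together.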

Read the data back to confirm it was really stored.

>>> c = db.execute('SELECT * FROM members LIMIT 3')
>>> for row in c:
...     print(row)
(1, '高坂 穂乃果', "μ's")
(2, '絢瀬 絵里', "μ's")
(3, '南 ことり', "μ's")

That settles the data. Next comes the website flow itself.

Basic Flask structure

A Flask web server can have all of its functionality in one file; here we use draw_member.py as the example.

from flask import Flask
app = Flask(__name__)


@app.route('/')
def index():
    return "<p>Hello World!</p>"


if __name__ == '__main__':
    app.run(debug=True)

That is the most basic structure of a Flask server. Let's test it; you have waited more than 1,600 characters for this. Start the server:

(VENV) $ python draw_member.py
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
 * Restarting with stat

Then open http://localhost:5000/ in a browser, or visit it from the command line:

$ curl 'http://localhost:5000/'
# <p>Hello World!</p>

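The server replies "Hello World!". How moving! Next, how the flow relates to the code.

app is the core object of the whole Flask application; at the end we call its .run() to start a working web server. debug=True means that when the server hits an error, Flask shows the full error details, including which Python function raised it and the values of the variables at that point. Because this also reveals how the site works to anyone with bad intentions, the option is turned off when the site goes live (to production).

We defined a function index and used a decorator to bind it to the path /, the home page. A user visiting / ends up in this function.

Reading a SQLite database from Flask

Let's write all the database-related functions first, largely following the SQLite pattern in the official Flask docs.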

import csv
import sqlite3
from flask import Flask, g

app = Flask(__name__)
SQLITE_DB_PATH = 'members.db'
SQLITE_DB_SCHEMA = 'create_db.sql'
MEMBER_CSV_PATH = 'members.csv'


# SQLite3-related operations
# See SQLite3 usage pattern from Flask official doc
# http://flask.pocoo.org/docs/0.10/patterns/sqlite3/
def get_db():
    db = getattr(g, '_database', None)
    if db is None:
        db = g._database = sqlite3.connect(SQLITE_DB_PATH)
        # Enable foreign key check
        db.execute("PRAGMA foreign_keys = ON")
    return db


@app.teardown_appcontext
def close_connection(exception):
    db = getattr(g, '_database', None)
    if db is not None:
        db.close()


if __name__ == '__main__':
    app.run(debug=True)

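That's a lot more code all at once. If it feels too complicated, just take it as given for now.

What you do need to understand: first, g is an object that can hold values that need to be passed around, so we keep the database connection in g._database. Whenever the connection is needed, get it with db = get_db().

Second, the data paths and the like are defined as variables at the very top of the code. Keeping constants separate from logic is a good habit that makes them easier to manage³.

First view, first template

Start with the home page; put the HTML in templates/index.html.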

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width">
    <title>成員抽籤</title>
  </head>
  <body>
    <h1>來抽出快樂的夥伴吧！</h1>
    <h3>功能列</h3>
    <ul>
      <li>首頁（抽籤）</li>
      <li>歷史記錄</li>
      <li>清除記錄、更新成員資料</li>
    </ul>
  </body>
</html>

This is just a plain home page with a heading and a menu that doesn't do anything yet. Modify index in draw_member.py so that it renders this template:

from flask import render_template

@app.route('/')
def index():
    return render_template('index.html')

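Run it the same way as before. The earlier server may still be running; remember that a port can serve only one service, so either keep using the old one (Flask is smart: with debug=True it reloads when a file changes) or stop it and start a new one.

Opening http://localhost:5000 in a browser should show the page below.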

Flask Hello World

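The draw feature

Now to implement the draw itself. As described earlier, the home page shows a list of groups and the user picks one to draw from.

Before implementing it, some background: the difference between GET and POST.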

GET vs POST

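When a user browses normally, entering a URL or clicking a hyperlink, the browser sends a GET request to the corresponding server and path. The server (usually) returns an HTML file, which the browser then displays on screen.

When the user interacts with the site more, which often means sending the user's information to the site, as with logging in, filling in a questionnaire, or here choosing a group to draw from, the request goes through POST.

For more on GET/POST and other HTTP requests, see the one-page introduction (Chinese) or the very thorough coverage on Mozilla Developer Network (MDN) (English).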

Form

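POST is most commonly used with a form. Logging in with an account and password, or questionnaire questions and answers, are often implemented as forms. A form contains inputs of many kinds, the fields the user can fill in: single choice, multiple choice, single line, multi-line, password, and so on.

Let's see what a form actually looks like. Edit templates/index.html and add a form for choosing the group to draw from.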

<h1>來抽出快樂的夥伴吧！</h1><!-- 本來有的 -->
<p>選擇要被抽的團體</p>
<form action="/draw" method="post">
  <label for="group_name">團隊名稱：</label>
  <input type="radio" name="group_name" value="μ's">μ's
  <input type="radio" name="group_name" value="K-ON!">K-ON!
  <input type="radio" name="group_name" value="ALL" checked>（全）
  <input type="submit" value="Submit">
</form>
<hr><!-- 這是分隔線 -->

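Put it anywhere inside body. The form contains a label meant for the input named group_name, followed by four input tags, which really fall into just two groups.

The first group is the three radio-button inputs for the team. Note that they all share the name group_name but have different values, each followed by its display text. The one for "（全）" (all) also has checked, making it the default choice.

The other is the input with type=submit, the button that submits the form.

Now look at the form tag itself. method="post" should be clear: it sends a POST request. action="/draw" means the POST goes to the path /draw.

As before, forms have many more details; MDN is the place to learn them.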

Request (Form / POST) handling in Flask

So let's write the view that handles POST to /draw.

import random
from flask import request


@app.route('/draw', methods=['POST'])
def draw():
    # Get the database connection
    db = get_db()

    # Draw member ids from given group
    # If ALL is given then draw from all members
    group_name = request.form.get('group_name', 'ALL')
    valid_members_sql = 'SELECT id FROM members '
    if group_name == 'ALL':
        cursor = db.execute(valid_members_sql)
    else:
        valid_members_sql += 'WHERE group_name = ?'
        cursor = db.execute(valid_members_sql, (group_name, ))
    valid_member_ids = [
        row[0] for row in cursor
    ]

    # If no valid members return 404 (unlikely)
    if not valid_member_ids:
        err_msg = "<p>No members in group '%s'</p>" % group_name
        return err_msg, 404

    # Randomly choose a member
    lucky_member_id = random.choice(valid_member_ids)

    # Obtain the lucky member's information
    member_name, member_group_name = db.execute(
        'SELECT name, group_name FROM members WHERE id = ?',
        (lucky_member_id, )
    ).fetchone()
    return '<p>%s（團體：%s）</p>' % (member_name, member_group_name)

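Flask stores the request the user sent to the server in request. Users actually send quite a lot of information, such as their language, browser, and time, all of which can be found in request. The content of the submitted form is stored in request.form, and the input names we defined in the form become its dict keys.

So the group_name chosen by the user is read with request.form.get('group_name', 'ALL'). This is like request.form['group_name'] except that it returns the default 'ALL' when the key is missing. In normal use the key is always there, but a web developer should never trust users to send back what is expected.

With the group name in hand, we use it in the SQL query. Likewise the name may match nothing, in which case we return an error message with HTTP status code 404. In general, 4XX means something is wrong with what the client sent.

Having fetched all the member ids, we pick one at random with random.choice. As the name suggests, random is a built-in Python module. The chosen id is then used to look up the name and group.

We made two database queries in total: the first returns every eligible member id, the second fetches the name and group of the one drawn. Writing to the history is not done yet, but it isn't hard and comes later. We also skip the template for now and return the result wrapped in the most basic HTML element, <p>.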
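
The view passes the group name through a ? placeholder rather than formatting it into the SQL string, which matters as soon as a value contains a quote, as "μ's" does. The sketch below contrasts the two approaches on a small in-memory copy of the members table.

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE members (id INTEGER PRIMARY KEY, '
           'name TEXT, group_name TEXT)')
db.executemany('INSERT INTO members (name, group_name) VALUES (?, ?)',
               [('高坂 穂乃果', "μ's"), ('平沢 唯', 'K-ON!')])

group_name = "μ's"     # contains a single quote, as a user could send

# Placeholder: the value is passed separately from the SQL text.
safe_ids = [row[0] for row in db.execute(
    'SELECT id FROM members WHERE group_name = ?', (group_name,))]

# String formatting: the quote ends the SQL literal early and breaks the query.
try:
    db.execute("SELECT id FROM members WHERE group_name = '%s'" % group_name)
    formatting_failed = False
except sqlite3.OperationalError:
    formatting_failed = True
```

Here the formatted query merely fails; with a maliciously crafted value it could instead run SQL the developer never intended, which is why placeholders are the safer habit.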

Demo

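Reload the home page and there is the form (obviously). Flask's web server is smart: no restart is needed, since it notices the file change and reloads. Go back and compare the HTML you wrote in index.html with what the browser shows.

The new home page, now with a form

Pressing Submit takes you to the draw result (note how the URL changes)

Draw result

The default is to draw from everyone; you can also go back and pick the group you want.

The most important feature is done! If your own program runs into trouble, have a look at my complete version.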

More on templates

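So far we have passed render_template a complete HTML file and never used any template features. Templates have a few uses:

Gathering repeated fragments and structure in one place
Letting variables control part of the HTML

Let's rewrite things. The menu should appear on every page, and we want the /draw page to be a complete HTML document too.

First pull the shared parts out into templates/base.html.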

<!-- templates/base.html -->
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width">
    <title>{% block title %}成員抽籤{% endblock title %}</title>
  </head>
  <body>
    {% block content %}{% endblock content %}
    <hr>
    <h3>功能列</h3>
    <ul>
      <li><a href="/">首頁（抽籤）</a></li>
      <li><a href="/history">歷史記錄</a></li>
      <li><a href="/reset">清除記錄、更新成員資料</a></li>
    </ul>
  </body>
</html>

Things that never change, like the menu, belong here. The home page can then reuse this structure:

<!-- templates/index.html -->
{% extends "base.html" %}

{% block content %}
<h1>來抽出快樂的夥伴吧！</h1>
<p>選擇要被抽的團體</p>
<form action="/draw" method="post">
  <label for="group_name">團隊名稱：</label>
  <input type="radio" name="group_name" value="μ's">μ's
  <input type="radio" name="group_name" value="K-ON!">K-ON!
  <input type="radio" name="group_name" value="ALL" checked>（全）
  <input type="submit" value="Submit">
</form>
{% endblock content %}

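The biggest difference is that index.html got simpler. It works like class inheritance: {% extends "base.html" %} pulls in the content of base.html, which defines two blocks, title and content. Index defines content, so it replaces the empty content block in base. Index does not define title, so the value inside the original block, "成員抽籤", is used.

Now for /draw. Besides reusing base.html, we introduce the concept of template variables.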

<!-- templates/draw.html -->
{% extends "base.html" %}

{% block title %}抽籤結果{% endblock title %}

{% block content %}
<h1>抽籤結果</h1>
<p>{{ name }}（團體：{{ group }}）</p>
{% endblock content %}

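The special parts are {{ name }} and {{ group }}. This syntax means their values come from the variables name and group, which are only determined at render_template time. How do we pass variable values into the template?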

@app.route('/draw', methods=['POST'])
def draw():
    # ...
    # return '<p>%s（團體：%s）</p>' % (member_name, member_group_name)
    return render_template(
        'draw.html',
        name=member_name,
        group=member_group_name,
    )

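The rewritten draw uses the template templates/draw.html and passes in the variable values when calling render_template.

Draw again now and you'll see the new template's output, with the menu in place.

History

Remember to add a record to the database on each draw. The schema defaults the draw time to the current time, so time handling is left entirely to SQLite. The history view uses ORDER BY and LIMIT 10 to select the ten most recent records, and looks up each member's name and group in the members table in the same query. The technical term for this is JOIN.

Put this view at the path /history.

@app.route('/draw', methods=['POST'])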
def draw():
    # ...
    # Update draw history
    with db:
        db.execute('INSERT INTO draw_histories (memberid) VALUES (?)',
                   (lucky_member_id, ))
    # Render template
    return ...


@app.route('/history')
def history():
    db = get_db()
    recent_histories = db.execute(
        'SELECT m.name, m.group_name, d.time '
        'FROM draw_histories AS d, members as m '
        'WHERE m.id == d.memberid '
        'ORDER BY d.time DESC '
        'LIMIT 10'
    ).fetchall()
    return render_template(
        'history.html',
        recent_histories=recent_histories
    )

Likewise, create the corresponding template.

<!-- templates/history.html -->
{% extends "base.html" %}

{% block title %}抽籤歷史{% endblock title %}

{% block content %}
<h1>抽籤歷史（最近 10 筆）</h1>
<table>
  <thead>
    <tr>
      <th>名字</th>
      <th>團體</th>
      <th>抽中時間</th>
    </tr>
  </thead>
  <tbody>
    {% for history in recent_histories %}
    <tr>
      <td>{{ history.0 }}</td>
      <td>{{ history.1 }}</td>
      <td>{{ history.2 }}</td>
    </tr>
    {% endfor %}
  </tbody>
</table>
{% endblock content %}

This uses a new piece of template syntax, the for loop. The value of history changes on every iteration, and its members can be accessed too; in Python it would look like:

for history in recent_histories:
    history[0]

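Flask's Jinja2 templates can do a great deal. Now that you understand better how the server works, read the Jinja2 documentation for the full picture.

Handling time with datetime

You may have noticed that the times SQLite returns are actually strings. What if we want to change the time format? That is where the datetime object from the built-in datetime module comes in. While we're at it, we also change the fetchall() result so that each history record is a dict.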

from datetime import datetime

@app.route('/history')
def history():
    db = get_db()
    c = db.execute(
        'SELECT m.name, m.group_name, d.time AS "draw_time [timestamp]" '
        'FROM draw_histories AS d, members as m '
        'WHERE m.id == d.memberid '
        'ORDER BY d.time DESC '
        'LIMIT 10'
    ).fetchall()
    recent_histories = []
    for row in c:
        recent_histories.append({
            'name': row[0],
            'group': row[1],
            'draw_time': datetime.strptime(row[2], '%Y-%m-%d %H:%M:%S'),
        })
    return render_template(...)

{% for history in recent_histories %}
<tr>
  <td>{{ history.name }}</td>
  <td>{{ history.group }}</td>
  <td>{{ history.draw_time.strftime("%Y 年 %m 月 %d 日 %H 時 %M 分") }}</td>
</tr>
{% endfor %}

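The for loop no longer uses 0, 1, 2 to get each field of a history record. It uses field names instead, which is equivalent to history['name']. This is the better approach: numeric indices are easy to forget and break as soon as the column order in the view changes, while names let you understand each field by reading the template alone.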
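As an alternative sketch (not what the code above does), the standard library's sqlite3.Row row factory also gives access by column name, without building the dicts by hand. The table and the sample row below are made up for illustration.

```python
import sqlite3
from datetime import datetime

db = sqlite3.connect(':memory:')
db.row_factory = sqlite3.Row   # rows now support access by column name
db.execute('CREATE TABLE draw_histories (memberid INTEGER, time DATETIME)')
db.execute("INSERT INTO draw_histories VALUES (1, '2016-06-14 09:30:00')")

row = db.execute(
    'SELECT memberid, time AS draw_time FROM draw_histories'
).fetchone()

# SQLite hands the time back as a string; parse it, then reformat it
draw_time = datetime.strptime(row['draw_time'], '%Y-%m-%d %H:%M:%S')
formatted = draw_time.strftime('%Y/%m/%d %H:%M')
print(row['memberid'], formatted)
```

The parsing step is the same strptime call as in the view; only the way the row is addressed changes.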

What’s Next

Static files and better theme

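We only used HTML templates. To make the site look nicer, you need to write CSS and JavaScript (JS). Frameworks such as Bootstrap, PureCSS, and Semantic UI let you build a good-looking, practical layout in a short time.

CSS, JS, and all the other files on the site, large and small, have to be sent from the server to the client. That is what static file handling is about.

More how web works

Besides HTTP GET and POST, there are other very common technologies such as HTTPS, sessions, and cookies.

Object-Relational Mapping (ORM)

We only showed examples written in raw SQL, but as a project grows, raw SQL becomes harder and harder to manage. An ORM is one solution.

Django

You can of course keep digging into Flask; it is a fine web framework. But our main code base is Django, so once you understand what a web server (app) looks like, I hope you will start learning Django. The biggest design difference between the two is that Django ships with many modules and features from the start, which makes it feel "heavy", while Flask provides only the essentials.

Summary

That gives us a complete drawing website. These are more or less the main things you need to know to build a website; the rest is details and deepening your knowledge.

The finished project is on GitHub. Its commit log records the important steps, so you can use git checkout to go back to any of them. For example, to see the version where the draw feature is done and uses templates, run git checkout f39fc1.

PS: I didn't expect this post to get so long...

Details

Below are a bunch of technical details. Read them if you are interested.

SQLite table info

Besides using .schema to see the statement that created each table, you can use PRAGMA table_info to inspect the settings of every column in a table.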

-- Run `sqlite3 -init create_db.sql`
sqlite> .header on
sqlite> .mode column
sqlite> PRAGMA table_info(members);
cid  name         type       notnul  dflt_value                    pk
---  -----------  ---------  ------  ----------------------------  --
0    id           INTEGER    0                                     1
1    name         TEXT       1                                     0
2    group_name   TEXT       0                                     0
sqlite> PRAGMA table_info(draw_histories);
cid  name         type       notnul  dflt_value                    pk
---  -----------  ---------  ------  ----------------------------  --
0    memberid     INTEGER    0                                     0
1    draw_time    DATETIME   0       datetime('now', 'localtime')  0
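The same information is available from Python. This is a minimal sketch that assumes a members table like the one shown above, created in an in-memory database; PRAGMA table_info returns one row per column.

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute(
    'CREATE TABLE members ('
    'id INTEGER PRIMARY KEY, name TEXT NOT NULL, group_name TEXT)'
)

# Each row is (cid, name, type, notnull, dflt_value, pk)
columns = db.execute('PRAGMA table_info(members)').fetchall()
for cid, name, col_type, notnull, default, pk in columns:
    print(cid, name, col_type, notnull, default, pk)
```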

SQLite foreign key check

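SQLite3 only handles foreign key constraints in relatively recent versions. Following the official documentation, you can check whether enforcement is on: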

sqlite> PRAGMA foreign_keys;
0

A value of 0 means SQLite does not check foreign keys. You can turn the check on manually.

sqlite> PRAGMA foreign_keys = ON;
sqlite> PRAGMA foreign_keys;
1
sqlite> INSERT INTO draw_histories(memberid) VALUES (1000);
Error: FOREIGN KEY constraint failed
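From Python the check has to be switched on for each connection as well. The sketch below assumes a simplified version of the two tables in an in-memory database; the member id 1000 does not exist, so the insert is expected to be rejected.

```python
import sqlite3

db = sqlite3.connect(':memory:')
# Enforcement is a per-connection setting and is usually off by default
db.execute('PRAGMA foreign_keys = ON')
db.executescript('''
    CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE draw_histories (
        memberid INTEGER,
        FOREIGN KEY (memberid) REFERENCES members (id)
    );
''')
enabled = db.execute('PRAGMA foreign_keys').fetchone()[0]

try:
    with db:
        # No member has id 1000, so this violates the foreign key
        db.execute('INSERT INTO draw_histories (memberid) VALUES (1000)')
    error = None
except sqlite3.IntegrityError as e:
    error = str(e)
print(enabled, error)
```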

Reloading the data

First we wrap everything in a function, reset_db.

# draw_members.py
def reset_db():
    with open(SQLITE_DB_SCHEMA, 'r') as f:
        create_db_sql = f.read()
    db = get_db()
    # Reset database
    # Note that CREATE/DROP table are *immediately* committed
    # even inside a transaction
    with db:
        db.execute("DROP TABLE IF EXISTS draw_histories")
        db.execute("DROP TABLE IF EXISTS members")
        db.executescript(create_db_sql)

    # Read members CSV data
    with open(MEMBER_CSV_PATH, newline='') as f:
        csv_reader = csv.DictReader(f)
        members = [
            (row['名字'], row['團體'])
            for row in csv_reader
        ]

    # Write members into database
    with db:
        db.executemany(
            'INSERT INTO members (name, group_name) VALUES (?, ?)',
            members
        )

reset_db() drops the old tables and then loads the data from the CSV again, using the method introduced earlier.
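The with db: blocks in reset_db rely on the sqlite3 connection acting as a context manager. A minimal sketch of that behavior, using a made-up one-column table: the first block commits, while the second is rolled back because an exception escapes it.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT NOT NULL)')
conn.commit()

# Leaving the block normally commits the insert
with conn:
    conn.execute("INSERT INTO members (name) VALUES ('A')")

# An unhandled exception inside the block rolls the insert back
try:
    with conn:
        conn.execute("INSERT INTO members (name) VALUES ('B')")
        raise RuntimeError('simulated failure')
except RuntimeError:
    pass

names = [row[0] for row in conn.execute('SELECT name FROM members')]
print(names)
```

Only the first insert is kept, which is why a failure halfway through loading the CSV would not leave a partially filled table.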

So how do we use this function?

One way is to bind it to a path as before, with @app.route('/reset'). The other is to call it from a Python shell.

>>> from draw_members import app, reset_db
>>> with app.app_context():
...     reset_db()

Datetime in SQLite and Python

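This post is already too long, so this part goes into the next one.

Update 2016-06-14: added a note on using datetime.datetime to avoid confusion with the module name (credit: 馬國薰)

In data processing there is usually an NA value that distinguishes a missing value from an empty value. Handling that with Python's built-in csv.reader would be too complicated, so we skip it here. ↩
According to the official documentation, leaving a with db: ... block normally commits the pending transaction automatically; if an unhandled exception is raised inside the block, the transaction is rolled back instead. ↩
Flask-related settings normally live in app.config, but it makes no difference in our example. ↩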

FASTA/Q sequence processing toolkit -- seqtk

2015-09-27T14:11:00-05:00

This is the first post of the series of my common NGS processing workflows and notes.

One of the most common operations in sequence processing is FASTQ → FASTA conversion. Tons of conversion scripts using either sed or awk can be found with a quick search. For example,

# FASTQ to FASTA
# Assume every read record takes exactly 4 lines
# Ref: http://stackoverflow.com/a/10359425
$ sed -n '1~4s/^@/>/p;2~4p'

The assumption of 4 lines per read usually holds for recent NGS sequencing data, so not a big deal.

In many cases the sequences are gzip’d. It is still a piece of cake when combined with pipes,

gzcat myseq.fq.gz | sed -n '1~4s/^@/>/p;2~4p' | gzip > myseq.fa.gz
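For comparison, here is the same conversion as a Python sketch. It makes the same assumption of exactly 4 lines per record, and the function names are made up for this illustration. It is expected to be much slower than sed or seqtk on real data; it is only meant to spell out what the one-liner does.

```python
import gzip

def fastq_to_fasta(lines):
    """Yield FASTA lines from FASTQ lines, assuming 4 lines per record."""
    for i, line in enumerate(lines):
        line = line.rstrip('\n')
        if i % 4 == 0:          # header line: '@name' -> '>name'
            yield '>' + line[1:]
        elif i % 4 == 1:        # sequence line is kept as is
            yield line
        # the '+' line and the quality line are dropped

def convert_file(fastq_gz_path, fasta_gz_path):
    """Convert a gzip'd FASTQ file into a gzip'd FASTA file."""
    with gzip.open(fastq_gz_path, 'rt') as fin, \
            gzip.open(fasta_gz_path, 'wt') as fout:
        for out_line in fastq_to_fasta(fin):
            fout.write(out_line + '\n')

# Two made-up reads
example = ['@read1\n', 'ACGT\n', '+\n', 'IIII\n',
           '@read2\n', 'TTGA\n', '+\n', 'IIII\n']
fasta = list(fastq_to_fasta(example))
print(fasta)
```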

However, things can get complex really fast when one also wants to reverse complement, randomly sample a subset of reads, or do many other kinds of sequence manipulation. Efficiency matters when those tasks are applied to tens of millions of reads: even a small difference in computing time per read adds up at this scale.

Seqtk

So seqtk comes to the rescue. It is written in C and MIT licensed. A quick comparison shows it is generally faster than other UNIX-based solutions, let alone implementations based on scripting languages.

Seqtk bundles many other operations, but I’ll just mention those I frequently use.

$ seqtk

Usage:   seqtk <command> <arguments>
Version: 1.0-r77-dirty

Command: seq       common transformation of FASTA/Q
         comp      get the nucleotide composition of FASTA/Q
         sample    subsample sequences
         subseq    extract subsequences from FASTA/Q
         fqchk     fastq QC (base/quality summary)
         mergepe   interleave two PE FASTA/Q files
         trimfq    trim FASTQ using the Phred algorithm

         hety      regional heterozygosity
         mutfa     point mutate FASTA at specified positions
         mergefa   merge two FASTA/Q files
         dropse    drop unpaired from interleaved PE FASTA/Q
         rename    rename sequence names
         randbase  choose a random base from hets
         cutN      cut sequence at long N
         listhet   extract the position of each het

FASTQ → FASTA

Read (gzip’d) FASTQ and write out as FASTA,

$ seqtk seq -A in.fq[.gz] > out.fa

To make the output gzip’d again, pipe it through gzip,

$ seqtk seq -A in.fq[.gz] | gzip > out.fa.gz

Reverse complement

The R2 reads of paired-end sequencing (the second read of each pair) contain the reverse complement of the insert DNA. To debug them directly by eye, one needs to reverse complement the R2 reads again.

$ seqtk seq -r R2.fq > R2_rc.fq

$ echo '> Example R2 seq
  GCATTGGTGGTTCAGTGGTAGAATTCT' | seqtk seq -r
# > Example R2 seq
# AGAATTCTACCACTGAACCACCAATGC
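The operation itself is simple enough to sketch in Python, which can help when checking a handful of reads by hand. This sketch only handles A, C, G, T and N; other IUPAC codes are left unchanged, which is an assumption of the example rather than how seqtk behaves.

```python
# Translation table for complementing DNA bases
COMPLEMENT = str.maketrans('ACGTNacgtn', 'TGCANtgcan')

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

rc = reverse_complement('GCATTGGTGGTTCAGTGGTAGAATTCT')
print(rc)
```

Applying it twice gives back the original sequence.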

Quality check

To be honest, FastQC is more frequently used for quality checks because it generates reports with beautiful figures.

But for a detailed report on each read position, one should consider seqtk fqchk.

$ seqtk fqchk myseq.fq[.gz]

By default it sets -q 20. This threshold decides whether a base counts as low or high quality, reported as %low and %high per read position. In the default case, bases with a quality score higher than 20 are treated as high quality.

min_len: 10; max_len: 174; avg_len: 28.92; 37 distinct quality values
POS #bases    %A   %C   %G   %T   %N  avgQ errQ %low %high
ALL 236344886 17.0 22.5 31.3 29.2 0.0 39.9 37.6 0.1  99.9
1   8172342   8.9  12.4 57.0 21.7 0.0 39.6 29.0 0.5  99.5
2   8172342   7.7  62.5 16.2 13.7 0.0 39.8 37.8 0.2  99.8
3   8172342   50.3 24.1 11.9 13.6 0.0 39.8 38.2 0.1  99.9
4   8172342   10.4 22.9 15.3 51.3 0.0 39.9 38.7 0.1  99.9
5   8172342   14.3 12.9 22.3 50.5 0.0 39.8 37.0 0.2  99.8
# ... (trimmed)

Two of the columns, avgQ and errQ, need more explanation. Average quality (avgQ) is the mean of the base quality scores weighted by their counts,

$\mathrm{avgQ} = \dfrac{\sum_{q=0}^{93} q \cdot n_q}{\sum_{q = 0}^{93} n_q},$

where $n_q$ is the number of bases with quality score $q$. The magic number 93 comes from the quality score of Sanger sequencing¹, whose score ranges from 0 to 93.

For errQ we need more background knowledge about how the quality score is computed. A base with quality score $q$ implies that the probability of being erroneously called, $P_q$, is

$P_q = 10^{\frac{-q}{10}}, \hspace{1em} q = -10\log_{10}{P_q}.$

Therefore, given $q$ being $0, 1, 2, \ldots$, seqtk has a conversion table perr from quality score to probability,

Q	0	1	2	3²
P	0.5	0.5	0.5	0.5

Q	4	5	…	38	39	40
P	0.398107	0.316228	…	0.000158	0.000126	0.000100
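The table can be reproduced with a few lines of Python. This is a sketch of the conversion as described here (the function name is made up), not seqtk's actual code.

```python
def phred_to_prob(q):
    """Error probability for quality q; q < 4 is fixed at 0.5 as in the table."""
    return 0.5 if q < 4 else 10 ** (-q / 10)

perr = [phred_to_prob(q) for q in range(94)]  # quality scores 0..93
for q in (0, 3, 4, 5, 38, 39, 40):
    print(q, f'{perr[q]:.6f}')
```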

Based on the probability, it computes the expected number of base call errors, num_err, and the empirical probability of having a base call error at this position, errP,

$\mathrm{num}_\text{err} = \sum_q P_q \cdot n_q, \hspace{1em} \text{errP} = \frac{\mathrm{num}_\text{err}}{\sum_q n_q}.$

Thus errQ is the quality score equivalent to errP, which reflects the probability of a base call error better than avgQ,

$\mathrm{errQ} = -10\log_{10}{\mathrm{errP}}.$
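To see how avgQ and errQ can differ, here is a small worked example in Python following the formulas above. The function name and the counts are made up: a position with 50 bases at Q20 and 50 bases at Q40.

```python
import math

def fqchk_stats(counts):
    """Compute (avgQ, errQ) from counts, where counts[q] = n_q."""
    total = sum(counts)
    # Same conversion as the perr table: q < 4 is fixed at 0.5
    perr = [0.5 if q < 4 else 10 ** (-q / 10) for q in range(len(counts))]
    avg_q = sum(q * n for q, n in enumerate(counts)) / total
    num_err = sum(p * n for p, n in zip(perr, counts))
    err_p = num_err / total
    err_q = -10 * math.log10(err_p)
    return avg_q, err_q

# Made-up position: 50 bases at Q20 and 50 bases at Q40
counts = [0] * 94
counts[20] = 50
counts[40] = 50
avg_q, err_q = fqchk_stats(counts)
print(round(avg_q, 1), round(err_q, 1))
```

Here avgQ is 30, but errQ is only about 23, because the expected number of errors is dominated by the Q20 bases.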

By passing -q 0 to seqtk fqchk, one can get the proportion of all distinct quality scores at each position. This information is pretty useful if the sequencing data is all a mess and one needs to figure out the cause.

Though some of seqtk fqchk’s behavior is not documented, it should be straightforward enough to understand. All in all, the details can always be found in the source code.

Summary

Seqtk is fast to use for daily routines of FASTA/Q conversion. On top of that it provides various functionalities such as random read sampling, quality check, and many others I haven’t tried or mentioned.

See multiple specifications of quality score at the scikit-bio doc. The score is the Phred quality score. Other score representations can be found at the FASTQ wiki. ↩
Note that the probability for q less than 4 is fixed at 0.5. A quick computation shows that when $q = 3$, its actual Phred probability is $10 ^ {-0.3} = 0.501$. ↩

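Removing ^H

2015-09-27T02:28:00-05:00

I type Chinese with the Boshiamy (嘸蝦米) input method, and when switching between Chinese and English it is easy to type a stray \x08 character. Vim displays it as ^H. It is the Backspace control character, and in a typical GUI environment it may eat the character in front of it.

So today I wrote a small script that removes ^H from the text files under the current directory:

ag -l '\x08' | xargs sed -i '' 's/\x08//g'

ag can be replaced with grep, which is slower but available everywhere; the two take compatible arguments.

To also print which files were changed:

echo 'Found ^H in the following files:'
ag -l '\x08' | tee /dev/fd/2 | xargs sed -i '' 's/\x08//g'

Using tee to copy stdout to stderr is a neat trick that I did not know before.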
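The same cleanup can be sketched in Python for systems without ag or a compatible sed. The function name is made up, and unlike ag this sketch does not skip binary or ignored files, so it should only be pointed at a directory of text files.

```python
import os

def strip_backspace(root='.'):
    """Remove every \\x08 byte from files under root; return changed paths."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            with open(path, 'rb') as f:
                data = f.read()
            if b'\x08' not in data:
                continue  # leave clean files untouched
            with open(path, 'wb') as f:
                f.write(data.replace(b'\x08', b''))
            changed.append(path)
    return changed
```

Calling strip_backspace('.') rewrites the affected files in place and returns their paths, which covers the "print which files were changed" variant as well.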

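Managing references with Zotero

2015-09-26T00:00:00-05:00

TL;DR Use Zotero to sync references, webpages, and everything.

I started collecting references simply because of research. Papers and regular progress reports need citations, and in academia what you cite most often is other people's journal articles. Journal citations follow fixed formats, every journal uses a different one, and typing them by hand is error-prone and hard to maintain. So the best approach is to store the complete article information in a database and insert it into the document when citing.

The whole problem then becomes how to manage that article information.

BibTeX is for LaTeX
- BibDesk
EndNote is for Word
Zotero bridges both worlds
- Zotfile
- How to sync data storage
Summary

BibTeX is for LaTeX

In LaTeX, you can use the workflow provided by BibTeX (or the newer BibLaTeX) to handle citations and manage references, that is, bibliography management. It keeps all references in a single plain-text file: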

@article{Calin:2006aa,
    Author = {Calin, George A. and Croce, Carlo M.},
    Journal = {Nat Rev Cancer},
    Month = {11},
    Number = {11},
    Pages = {857--866},
    Title = {MicroRNA signatures in human cancers},
    Volume = {6},
    Year = {2006}}

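Every entry has a cite key that you use to cite it in the text. BibTeX then formats the citation according to the currently defined style and adds the corresponding reference at the end of the document.

BibDesk

Part of what makes BibTeX pleasant to use day to day is graphical tools like BibDesk.

BibDesk is an OS X application bundled with the MacTeX distribution. Besides importing citations automatically from websites or other formats, it supports file attachments: it links files such as a paper's PDF and supplementary files to the matching entry, renames them, and files them in a structured folder. Both the renaming and the filing rules are customizable, for example sorting by journal name/year. Put that folder on Dropbox and you get automatic syncing.

BibDesk in use

This solves almost every problem you run into when writing papers.

Normally I keep one huge BibTeX file with all kinds of references. When writing a paper, I export the relevant entries to a small file, strip the irrelevant fields from some entries, and never have to think about citations again. I managed references this way for years.

EndNote is for Word

But not everyone writes papers in LaTeX. In our lab I am the only LaTeX user; everyone else uses Word, which has no management tool this simple and handy. The most widely used one is EndNote. It is paid software, but since I am a student at a public university, thank you, taxpayers, for letting me use it for free (bows).

EndNote does the same job as BibTeX: once you use it, you no longer manage bibliography formatting in Word. I usually export references from elsewhere and drop them into EndNote, so I do not know what else it can do.

Oh, one advantage is that it works equally well on OS X and Windows.

Zotero bridges both worlds

BibDesk (BibTeX) is so convenient that I sometimes want to use it for classic technical articles and other useful material too, whether PDFs, videos, or websites. But those sources do not offer a BibTeX citation to copy and paste, and BibTeX syntax was not designed for such a variety of sources, so writing the entries is clumsy and time-consuming.

On the other hand, I look everything up in a browser these days. When I find a paper, going through export citation, open BibDesk, import citation, download PDF(s), and link PDF(s) is also a hassle.

That is where Zotero [zoh-TAIR-oh], a tool integrated into the browser, comes in. It currently supports Firefox, Chrome, and Safari, and provides plugins for Word and LibreOffice. So it should be able to replace the tools above, although I have not used it with Word.

The basic interface is simple and much like any other reference manager, except that it lives inside the browser.

Zotero in use

Using it is simple. There are just two buttons: the left one opens the Zotero window, and the right one saves the current page into your library. A progress message appears in the lower-right corner, and if the page is on a journal site where you have full-text access, the PDF is saved as well.

When I need references in a paper, I still export to BibTeX or EndNote first. Zotero has another handy feature, though: it can output citations as an RTF/HTML bibliography, which you can paste straight into PowerPoint when making slides.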

Torsten Thomas, Jack Gilbert & Folker Meyer. Metagenomics - a guide from sampling to data analysis. Microbial Informatics and Experimentation 2, 3 (2012).

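Zotero offers 300MB of free storage for syncing your library, which is plenty for the citations themselves. It also supports syncing to your own WebDAV server.

PS: Zotero is licensed under AGPL v3, and its source code is on GitHub.

Zotfile

Zotero's built-in PDF attachment support is not as complete as BibDesk's, so there is Zotfile, which adds PDF file management. Zotero's storage is also limited, so I want large files such as PDFs to live somewhere like Dropbox rather than syncing everything through Zotero.

Setting a custom location for (PDF) files, with subfolder rules below it. Here the files are sorted into folders by journal name/year.

Custom file renaming rules

If you sync to Dropbox, however, the path may differ on each computer. On OS X it may be /Users/me/Dropbox while on Debian it may be /home/me/Dropbox. In that case the storage path has to be a relative path.

Changing the paths of library-related files in Zotero's Advanced settings.

The difference between the Linked Attachment Base Directory and the Data Directory deserves a note. Files managed by Zotfile, such as PDFs, and files you attach by hand with "Attach Link …" are linked attachments; their icon carries a link symbol. Everything else, such as webpage snapshots and PDFs saved the default way, goes into the Data Directory.

How to sync data storage

To go further and sync the data storage through Dropbox as well, see the official documentation on syncing. On OS X, Zotero for Firefox stores its data in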

~/Library/Application Support/Firefox/Profiles/xxxxxxxx.default/zotero

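The data storage is the storage folder inside it. The official site advises against syncing Zotero's SQLite database and related files through Dropbox, so just move this storage folder to Dropbox and soft link it back.

Summary

Zotero is a practical reference (bibliography) manager integrated into the browser. It also handles other formats common on the web, such as web pages, works with existing tools and document editors, and has syncing, which makes it a great external memory.

(I should have written this in English. When will I ever write my first English blog post? QAQ)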

PNG Optimizer

2015-09-22T00:00:00-05:00

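A freshly launched blog is always exciting, so I measured how much bandwidth it takes to browse a site like this. The front page stays around 600KB because the article summaries have no images yet. But a post like the blog setup notes, which has a few screenshots, takes almost 2MB to transfer.

So I looked into ways to compress images. For JPG, jpegoptim is simple and effective. For PNG, I have always used OptiPNG, but its effect is limited because it is lossless. For screenshots, I do not mind a few pixels being a slightly different color (the human eye cannot tell anyway).

So I needed to compare the PNG compressors that are out there. I happened to find http://css-ig.net/png-tools-overview.html, a comparison dedicated to PNG optimization, and picked a few tools to try.

The results are summarized in the table below: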

Filename	Original size	OptiPNG (lossless)	Zopfli (lossless)	pngquant (lossy)
blog_desktop.png	180K	108K	100K	56K
blog_mobile.png	72K	52K	44K	28K
justfont_setting.png	272K	196K	164K	84K
oldsite.png	604K	536K	492K	288K
oldsite_full.png	816K	684K	644K	376K

OptiPNG

OptiPNG does help somewhat, but it is not fast.

Zopfli (Zopflipng)

Zopfli is a compression algorithm developed by Google. Its output is compatible with the deflate, gzip, and zlib formats, which is why it can be used on PNG. It is also lossless.

Zopfli Compression Algorithm is a compression library programmed in C to perform very good, but slow, deflate or zlib compression. (Zopfli Readme)

Zopflipng is maintained in the same repo. Building it yourself is easy:

git clone https://github.com/google/zopfli.git
cd zopfli
make zopflipng

To compress a batch of PNGs in one go:

zopflipng --lossy_transparent --prefix *.png

It is also fairly slow, though the -q option speeds it up. It compresses better than OptiPNG.

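PS: Just this morning I saw that Google released another compression algorithm, Brotli. It is not compatible with deflate, so it probably cannot be used on PNG.

pngquant

For lossy PNG compression, use pngquant. Its website stresses that transparency is preserved and that you can set a quality level as with JPEG. In general, the lower the quality you allow, the more it compresses.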

pngquant -f --ext=.png --quality=70-85 --skip-if-larger *.png

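As the first table shows, pngquant easily brings files below 50% of their original size. In my examples I can hardly tell where the screenshots are distorted, and even if they are... it does not really matter.

pngquant + Zopflipng

According to related discussions, pngquant output still has room to shrink, so running Zopflipng on top makes most of the files even smaller, which is rather impressive.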

Filename	Orig. size	pngquant	pngquant + Zopfli	Compress ratio
blog_desktop.png	180K	56K	60K	0.33
blog_mobile.png	72K	28K	28K	0.39
justfont_setting.png	272K	84K	76K	0.28
oldsite.png	604K	288K	268K	0.44
oldsite_full.png	816K	376K	348K	0.43
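The compress ratio column is simply the final size divided by the original size. A small sketch that recomputes it from the sizes in the table (rounded KB values, so the ratios are approximate):

```python
# Sizes in KB copied from the table: (original, pngquant + Zopfli)
sizes = {
    'blog_desktop.png': (180, 60),
    'blog_mobile.png': (72, 28),
    'justfont_setting.png': (272, 76),
    'oldsite.png': (604, 268),
    'oldsite_full.png': (816, 348),
}

# Compress ratio = final size / original size, rounded to two decimals
ratios = {name: round(final / orig, 2) for name, (orig, final) in sizes.items()}
for name, ratio in ratios.items():
    print(f'{name}\t{ratio:.2f}')
```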

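TruePNG, not tested

The original site says TruePNG performs very well, but it is not open source and seems to run only on Windows, so I skipped it.

Conclusion

From now on I will run pngquant on my screenshots as a matter of course. For images that cannot tolerate any color difference, I am considering switching from OptiPNG to Zopfli.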

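Notes on setting up the blog

2015-09-21T00:00:00-05:00

For me, the most important thing about a blog is how comfortable it is to write in.

When I first set up the GitHub CNAME I named it blog.liang2.tw, but all along it was just a one-page self-introduction¹ hand-built with Semantic UI. If every blog post had to be hand-built too, I would probably have no energy left to write the content.

I settled on a few goals:

Static site only, because I do not want to maintain a server, a blog needs nothing fancy, and the front end alone can do plenty of interactive things these days
Preferably a site generator implemented in Python, so I know how to change it when I want to adjust its behavior
Ideally supports both Markdown and reStructuredText

After filtering, few options remained: Pelican and Sphinx. Sphinx probably has fewer blog-oriented features, and Pelican is likely the most widely used, so I went with it.

I ended up making quite a few adjustments, listed below:

Introduction to Pelican
All is about the theme
Fonts
- Chinese webfont
- Chinese typesetting
Figure caption
Markdown or rst?
To do
EDIT (2015-09-22)
EDIT (2015-09-23)

Introduction to Pelican

In short, Pelican is not hard to understand, and customizing the blog theme is not very complicated. As with Sphinx, running the built-in pelican-quickstart with its defaults gives you a working site. The directory looks roughly like this: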

my_blog/
├── content/
│   ├── blog_post_1.md
│   └── blog_post_2.rst
├── output/
├── develop_server.sh*
├── Makefile
├── fabfile.py
├── pelicanconf.py
├── publishconf.py
└── requirements.txt

The blog source lives under content. The settings are split into pelicanconf.py for local use and publishconf.py for deployment. Automation scripts for Fabric, Make, and the shell are provided to render the source through the theme templates into a static site:

make html

By default the output goes to output/. To deploy, upload the contents of that folder to the server.

Each post can be written in Markdown or reStructuredText (rst). Conceptually they look like this:

---
Title: Hello World
Date: 2016-01-16 18:00
Tags: world, programming
Category: test
Slug: hello-world
---

Hello [World]

[World]: https://en.wikipedia.org/wiki/World

Hello World
##############

:date: 2016-01-16 18:00
:tags: world, programming
:category: test
:slug: hello-world

Hello World_

.. _World: https://en.wikipedia.org/wiki/World

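This sets the title, category, tags, publication date, and slug (something like the post's ID), which is quite complete. The minimum requirement is a title.

Finally I adjusted the static file paths. Posts are separated by year and month, and each subfolder holds that month's images, files, and so on. URLs are organized by year and month too. Ideally there would be something like a hash so that /posts/2015/09/abcd/ is equivalent to /posts/2015/09/abcd-my-post/, which is easier to share. I could not find such a feature, but the lack of it is no big deal, so I am ignoring it for now.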
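A year/month URL layout like this is usually expressed with Pelican's ARTICLE_URL and ARTICLE_SAVE_AS settings. The excerpt below is a hypothetical sketch of such a pelicanconf.py, not this blog's actual configuration; the posts/ prefix is assumed from the example URLs above, and str.format is used only to preview what the pattern expands to.

```python
from datetime import datetime

# Hypothetical excerpt of pelicanconf.py; the 'posts/' prefix is assumed
# from the example URLs in the text
ARTICLE_URL = 'posts/{date:%Y}/{date:%m}/{slug}/'
ARTICLE_SAVE_AS = 'posts/{date:%Y}/{date:%m}/{slug}/index.html'

# Preview how the URL pattern expands for one (made-up) article
preview = ARTICLE_URL.format(date=datetime(2015, 9, 21), slug='abcd-my-post')
print(preview)
```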

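All is about the theme

At first, most of the time went into finding a good theme. The built-in theme is practical enough, but its structure looked too complicated at first sight. I also prefer a cleaner layout and wanted a responsive layout already in place.

Most Pelican themes are collected at http://pelicanthemes.com/, with thumbnails that make choosing easy. Themes are separate from content, so switching themes only means changing the THEME variable in the config; if you do not like one, swap it. After browsing for a while I picked Flex. It is not my favorite layout, since I prefer a centered single column, but surprisingly few themes met the conditions above.

Theme templates use Jinja2. To start, adjusting base, index, article, and page, the templates most relevant to a blog, is enough to change the main look. Fortunately the code of the two-column page is comfortable to read. It turned out that tweaking the responsive rules, so that the text is wide enough on phones and does not fill very large screens, was about all it needed. With that the theme was more or less settled.

For detailed CSS fixes, Flex uses LESS and gulp for its front-end setup. LESS variables and nesting rules keep the CSS from getting messy, and running gulp after each change produces a fresh style.min.css, which is convenient.

The one thing I dislike is the profile photo on the left. It is really annoying, and it adds 54KB of traffic. I am still thinking about what to put there instead; maybe Hatsune Miku.

Fonts

Because I use OS X myself, I sometimes forget how sad fonts look on Windows.

Flex uses Google webfonts for Latin fonts out of the box. For quotes and for completeness I added a serif font, Crimson Text. I like this kind of Garamond-style classical serif. I just found out it is open source (SIL 1.1), nice. (Netizens in mainland China: ...)

Chinese webfont

The troublesome part is Chinese fonts. I gave up on the system built-ins right away, though in the end I added Noto Sans CJK and Source Han Sans as fallbacks. I had long wanted to try justfont's webfont service. It embeds a JavaScript snippet that checks which Chinese characters the page uses and requests glyphs only for those, which speeds up loading. Using it is just like Google webfonts. The official tutorial covers many use cases, and it worked with hardly any setup. I expected to tweak a lot before seeing any effect, but all I changed was font-family. Its settings also keep the original fonts for Latin text.

Justfont settings

Once the free trial worked, I paid up. Honestly, the free plan only allows two fonts, so the quota is gone after setting the body text and its bold. The current plan of 100,000 page views/year costs about NTD 350/year, which is not expensive. Having paid, of course I had to try Xin Gothic (信黑體); I still cannot afford the desktop version. I set up two weights and also added a Kai typeface for quotes, picking the more delicate cwTeX Kai.

I may try a Fangsong typeface in the future, but I am a little worried about how it renders on screens (on a Retina display I cannot judge lower resolutions), and I do not like the (Fang)song typefaces justfont offers more than Xin Gothic, so this experiment is on hold.

Chinese typesetting

Inspired by Bobby Tung's COSCUP 2015 talk 《中文排版需求以及我在W3C看到的事情》 (Chinese typesetting requirements and what I saw at the W3C), I felt that if I did not get Chinese web typesetting right from the start, I would certainly be too lazy to fix it later.

In the end I still compromised, though (kneels).

First, paragraphs still have space before and after. This is mainly to accommodate English typesetting, where paragraphs are separated by large gaps, because I do not know how to apply different layouts to different languages. I am also used to leaving a blank line around paragraphs in plain text, which feels more comfortable visually (maybe the line height is too small...). ~~The margin is also set to 1em.~~ (EDIT: see the end of the post)

I did not add first-line indentation either, mainly because the sentences are short and it looked odd. Also, the Markdown parser eats my full-width spaces, which I cannot understand (rst does not), so the only way is to hard-code the full-width space character (　). In paragraphs mixing Chinese and English the Chinese characters do not align, but I will let that go for now; having matching weights for Chinese and Latin text is already touching enough.

Chinese text at weight 300 is indeed a bit thin, so I enlarged the font to 18px and even showed it to my parents to make sure they could read it XD

At this point I was fairly happy with it. It looks like this:

On a phone

On a desktop screen

Figure caption

Figures often carry a caption, a reference, or the like underneath; the figures above are examples. This is hard to do in Markdown because its syntax is not that rich, but rst supports it natively: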

.. role:: fig

.. figure:: {filename}pics.jpg
    :align: center

    :fig:`Figure 1:` The figure caption.

    The legend consists of all elements after the caption.

This becomes

<div class="figure align-center">
  <img alt="" src="{filename}pics.jpg">
  <figcaption><span class="fig">Figure 1:</span> The figure caption</figcaption>
  <div class="legend">The legend consists of all elements after the caption.</div>
</div>

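In Markdown you basically type that HTML by hand, which is not too bad, just a little ugly. If it gets really tedious I will think about writing a plugin for it.

Markdown or rst?

For everyday writing I will mostly stick with Markdown; just look at how nice the polished MacDown editor is. For very complex documents (analyses with formulas, figures, tables, and so on) I may consider rst. Its drawbacks are that the syntax is somewhat complicated, much of it relies on spaces within a sentence, which does not suit Chinese, and my vim linter keeps complaining about directives it has never seen.

I am glad Pelican integrates the two well. Since both work, I can switch depending on the situation without writing the templates twice.

To do

Beyond this I added small touches such as LaTeX via MathJax and SmartyPants, but overall the blog customization is done. I will add more when I need it.

Some open issues I can think of:

Heading weight: headings originally had the same weight as the body text, but sections seemed hard to spot in long posts, so I made them bold for now. I hope short posts do not end up looking cluttered.
Jupyter notebook include: I have not tried embedding notebooks directly yet. I expect it to be more CSS tweaking (front-end work is so tiring and hard...)

The Pelican plugins collection contains a wide variety of plugins, so I guess most problems I will run into have already been solved by others? ...Right? xdd

EDIT (2015-09-22)

After looking it over again and again, I adjusted a lot more.

First, I reduced the font size to 16px and then changed it back to 18px. I chose 16px because reading on a 13" laptop felt cramped. I changed it back because it was really too small on a large screen; even I had to zoom in. I also found that the cramped feeling on the 13" screen came not from the font size but from the number of characters per line.

Too much text on a line hurts reading efficiency. A line on PTT holds at most 39 Chinese characters, but few posts fill it; most use 50-80% of the width, that is 20-32 Chinese characters. For English it is about 12-15 words. I tried many versions myself and ended up around the same numbers.

So the ideal text width has to suit both Chinese and English. Chinese characters have a fixed width, so after deciding how many Chinese characters go on a line, I had to adjust the Latin font to get the right number of English words per line. Source Sans Pro, which I used originally, is slightly narrow and makes English-only pages look a bit cramped. It is much better at weight 400, but then the Chinese no longer suits body text. I finally switched to Lato, another very popular font, though it is not much wider. If it still feels cramped I will have to switch to Open Sans, which I find a bit loose.

In the end the content area is 738px (41em) or 828px (46em) wide, and a line of text is at most 612px (34em). A line holds at most 34 Chinese characters or about 14 English words (80 characters). A line of code fits only 74 characters, which is a little short but acceptable.

A small unexpected discovery: with the narrower text, there is room for sidenotes on the right, as in Tufte CSS. They are sometimes handier than footnotes, but they might bring back the cramped feeling.

Last, I adjusted the spacing around paragraphs: the gap between a heading and the text that follows is smaller, and the gap between paragraphs is larger. I learned some CSS syntax I did not know before, such as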

p + p {
  margin-top: 1.5em;
}

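which selects a p element that immediately follows another p. This avoids changing the margin of p directly, which would make the gaps between p and h, ul, pre, and so on too wide. The front end is truly magical.

EDIT (2015-09-23)

Also, SmartyPants is sometimes annoying, for example when writing 13 inches:

13” vs 13"

If " (QUOTATION MARK \u0022) is not written out in escaped form, it is converted to ” (RIGHT DOUBLE QUOTATION MARK \u201D), as on the left.

I also added table styling. Following Bootstrap, a table that overflows becomes a block that can be scrolled.

What the blog used to look like:

↩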

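First blog post

2015-09-20T00:00:00-05:00

So, this is my new blog.

I have always had the habit of keeping a record of myself. In junior high there was the weekly journal; in senior high, the weekly journal again; in my freshman year there was Facebook, and from sophomore year on I had a personal board (個版). Recording myself really only takes plain text, so this blog will not be used for that for now.

Plain text can be a hassle, though. Programming notes or image-heavy posts (say, otaku stuff) are painful to keep in plain text: readability is poor, and editing and layout are hard too. Together these add up to a big problem.

Hence the idea of building my own blog.

What will be here?

At least the following, for now:

Notes on software development
Research notes
Records of community events (e.g. PyCon)
Otaku stuff, mainly anime

There may also be some posts in English, since I am applying to schools. Those should be filterable by tag or category. Mostly, though, this should be a technical blog.

My blogs of the past, a.k.a. dark history

As far as I remember, my first blog was on MSN/Hotmail in junior high. It probably vanished from this world when I could no longer log in and my Hotmail account was frozen.

My second blog was probably 醉客樓, started when I entered senior high and hosted on Yam (蕃薯藤), back when Google had not yet monopolized the Chinese search market. Surprisingly it is still alive, and since hardly anyone else uses the name 醉客樓, it still shows up on the first page of search results, which is just... can I write to Yam and ask them to shut it down? It is so embarrassing. The account name is myway, which is pretty bold; you probably could not grab such a cool ID today. The nickname was 問路客阿博, which is really... well, I still quite like it.

The latest post is from February 2007, when my chuunibyou was in full swing (fine, I was actually in my first year of senior high). The post is a short story I wrote, though only the first chapter:

What beautiful flowers.

I squatted in front of the flower bed, gently stroking each blooming beauty with my hand. Such a small plot, yet so full of color. Neat rows of flowers, each blazing with its most dazzling colors on its own little patch. As I gazed at this contest of beauty, I was dazzled before I knew it, and a vast sea of flowers spread through my mind. I tried hard to picture myself running through it: my hand brushing quickly past flowers that kept vanishing from sight, my body spinning along the path. In my mind I was so much lighter! I imagined myself lying down in that sea of flowers, arms spread wide, savoring the gentle touch of the petals and the faint floral scent, looking up at the blue sky and laughing with joy...

(If you really want to read it, the full text is here)

I did finish the story and submitted it; I completed it at the Taipower study center in Shilin. But I wrote it with pen on manuscript paper, and later I was too lazy to type it up. I wrote it right after watching the first season of Hell Girl (地獄少女), and it was heavily influenced by that show's worldview and storytelling.

Wait, what happened to the technical blog I promised?

Let the journey begin.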
