Liang-Bo Wang's Bloghttps://blog.liang2.tw/2023-08-13T23:05:21-05:00Use DuckDB in ensembldb to query Ensembl's genome annotations2022-10-05T00:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2022-10-05:/posts/2022/10/use-duckdb-in-ensembldb/<p>I have been using ensembldb to query genome annotations locally, which stores the Ensembl annotations in an offline SQLite database. By replacing the database engine with DuckDB, genome-wide queries are faster, with a small impact on gene-specific queries (depending on the usage). DuckDB database’s file size is also smaller, and it can be even smaller by offloading the tables to external Parquet files.</p><!-- cSpell:words sexchrom OLAP Hsapiens pyarrow ensdb zstandard zstd -->
<p>To query genome annotations locally, <a href="https://bioconductor.org/packages/release/bioc/html/ensembldb.html">ensembldb</a> has been my go-to approach.
While I’ve already said many good things about this R package (<a href="https://blog.liang2.tw/posts/2016/05/biocondutor-ensembl-reference/">1</a>, <a href="https://blog.liang2.tw/posts/2017/11/use-ensdb-database-in-python/">2</a>, <a href="https://blog.liang2.tw/posts/2019/01/build-ensdb-from-local-mysql/">3</a>), here’s a summary of my favorite features:</p>
<ol>
<li>I can use the same Ensembl version throughout my project (as a SQLite database)</li>
<li>I can query the genome-wide annotations and their locations easily and offline</li>
<li>Nice integration with R’s ecosystem, so I can easily combine the extracted annotations with my data and other annotations using <a href="https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html">GenomicRanges</a> and <a href="https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html">SummarizedExperiment</a></li>
<li>Its database is language agnostic (say, <a href="https://blog.liang2.tw/posts/2017/11/use-ensdb-database-in-python/">I can query the same db in Python</a>)</li>
</ol>
<p>Since <a href="https://duckdb.org/">DuckDB</a> is designed for analytical query workloads (aka <a href="https://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a>), I decided to convert ensembldb’s SQLite database to DuckDB and try it in some of my common analysis scenarios.
DuckDB has a similar look-and-feel to SQLite.
Also, it uses columnar storage and supports querying external <a href="https://parquet.apache.org/">Apache Parquet</a> and <a href="https://arrow.apache.org/">Apache Arrow</a> tables.
I tried out some of these user-friendly features in this exercise.</p>
<div class="toc">
<ul>
<li><a href="#convert-ensembldbs-database-to-duckdb-through-parquet">Convert ensembldb’s database to DuckDB through Parquet</a><ul>
<li><a href="#load-parquet-tables-to-duckdb">Load Parquet tables to DuckDB</a></li>
<li><a href="#database-file-size-comparison">Database file size comparison</a></li>
</ul>
</li>
<li><a href="#use-duckdb-in-ensembldb">Use DuckDB in ensembldb</a></li>
<li><a href="#benchmark-the-databases">Benchmark the databases</a><ul>
<li><a href="#genome-wide-annotation-query">Genome-wide annotation query</a></li>
<li><a href="#another-genome-wide-annotation-query">Another genome-wide annotation query</a></li>
<li><a href="#gene-specific-lookup">Gene-specific lookup</a></li>
</ul>
</li>
<li><a href="#summary">Summary</a></li>
</ul>
</div>
<h2 id="convert-ensembldbs-database-to-duckdb-through-parquet">Convert ensembldb’s database to DuckDB through Parquet</h2>
<p>The first step is to convert ensembldb’s SQLite database to DuckDB<sup id="fnref:sqlite-to-duckdb"><a class="footnote-ref" href="#fn:sqlite-to-duckdb">1</a></sup>.
I decided to export the SQLite tables as individual Parquet files and then load them into DuckDB,
so that we can also test DuckDB’s ability to query external Parquet files directly.</p>
<p>For this exercise, I used the <a href="https://www.ensembl.org/Homo_sapiens/">latest Ensembl release</a> (v107).
We can download the corresponding SQLite database from <a href="https://annotationhub.bioconductor.org/package2/AHEnsDbs">AnnotationHub’s web interface</a> (its object ID is <a href="https://annotationhub.bioconductor.org/ahid/AH104864">AH104864</a>):</p>
<div class="highlight"><pre><span></span><code>curl<span class="w"> </span>-Lo<span class="w"> </span>EnsDb.Hsapiens.v107.sqlite<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>https://annotationhub.bioconductor.org/fetch/111610
</code></pre></div>
<p>We use <a href="https://www.sqlalchemy.org/">SQLAlchemy</a> to fetch the schema of all the tables:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">MetaData</span><span class="p">,</span> <span class="n">create_engine</span>
<span class="n">engine</span> <span class="o">=</span> <span class="n">create_engine</span><span class="p">(</span><span class="s1">'sqlite:///EnsDb.Hsapiens.v107.sqlite'</span><span class="p">)</span>
<span class="n">metadata</span> <span class="o">=</span> <span class="n">MetaData</span><span class="p">()</span>
<span class="n">metadata</span><span class="o">.</span><span class="n">reflect</span><span class="p">(</span><span class="n">bind</span><span class="o">=</span><span class="n">engine</span><span class="p">)</span>
</code></pre></div>
<p>We can then list all the tables and their column data types:</p>
<div class="highlight"><pre><span></span><code><span class="n">db_tables</span> <span class="o">=</span> <span class="n">metadata</span><span class="o">.</span><span class="n">sorted_tables</span>
<span class="k">for</span> <span class="n">table</span> <span class="ow">in</span> <span class="n">db_tables</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">table</span><span class="o">.</span><span class="n">name</span><span class="si">}</span><span class="s1">: '</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s1">''</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="s1">', '</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">c</span><span class="o">.</span><span class="n">name</span><span class="si">}</span><span class="s1"> (</span><span class="si">{</span><span class="n">c</span><span class="o">.</span><span class="n">type</span><span class="si">}</span><span class="s1">)'</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">table</span><span class="o">.</span><span class="n">columns</span><span class="p">))</span>
<span class="c1"># chromosome: seq_name (TEXT), seq_length (INTEGER), is_circular (INTEGER)</span>
<span class="c1"># gene: gene_id (TEXT), gene_name (TEXT), gene_biotype (TEXT),</span>
<span class="c1"># gene_seq_start (INTEGER), gene_seq_end (INTEGER),</span>
<span class="c1"># seq_name (TEXT), seq_strand (INTEGER),...</span>
<span class="c1"># ...</span>
</code></pre></div>
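<p>For readers without SQLAlchemy at hand, the same listing can be sketched with only Python’s built-in <code>sqlite3</code> module. This is a minimal sketch, not what I used above: the in-memory database and its two toy tables are hypothetical stand-ins for the real EnsDb file, and <code>list_column_types</code> is a helper name made up for this illustration.</p>

```python
import sqlite3

# Hypothetical stand-in for the EnsDb file: two toy tables in memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chromosome (seq_name TEXT, seq_length INTEGER, is_circular INTEGER);
    CREATE TABLE gene (gene_id TEXT, gene_name TEXT, gene_seq_start INTEGER);
""")


def list_column_types(conn):
    """Return {table_name: [(column_name, declared_type), ...]}."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk)
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(name, col_type) for _, name, col_type, *_ in info]
    return schema


schema = list_column_types(conn)
for table, columns in schema.items():
    print(f"{table}: " + ", ".join(f"{n} ({t})" for n, t in columns))
```

<p>Pointing <code>sqlite3.connect()</code> at the downloaded SQLite file instead of <code>:memory:</code> should give a listing in the same spirit as the SQLAlchemy one.</p>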
<p>With the correct data type mapping, we can export all the tables as Parquet by <a href="https://pandas.pydata.org/">pandas</a> and <a href="https://arrow.apache.org/docs/python/index.html">PyArrow</a>.
Since there are quite a few text columns, I also used <a href="https://facebook.github.io/zstd/">zstandard</a> to compress the Parquet (with a higher compression level):</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="nn">pa</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">pyarrow.parquet</span> <span class="k">as</span> <span class="nn">pq</span>
<span class="n">sqlite_to_pyarrow_type_mapping</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s1">'TEXT'</span><span class="p">:</span> <span class="n">pa</span><span class="o">.</span><span class="n">string</span><span class="p">(),</span>
    <span class="s1">'INTEGER'</span><span class="p">:</span> <span class="n">pa</span><span class="o">.</span><span class="n">int64</span><span class="p">(),</span>
    <span class="s1">'REAL'</span><span class="p">:</span> <span class="n">pa</span><span class="o">.</span><span class="n">float64</span><span class="p">(),</span>
<span class="p">}</span>
<span class="c1"># Read each SQLite table as an Arrow table</span>
<span class="n">arrow_tables</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="k">with</span> <span class="n">engine</span><span class="o">.</span><span class="n">connect</span><span class="p">()</span> <span class="k">as</span> <span class="n">conn</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">table</span> <span class="ow">in</span> <span class="n">metadata</span><span class="o">.</span><span class="n">sorted_tables</span><span class="p">:</span>
        <span class="c1"># Construct the corresponding pyarrow schema</span>
        <span class="n">schema</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">schema</span><span class="p">([</span>
            <span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">sqlite_to_pyarrow_type_mapping</span><span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">type</span><span class="p">)])</span>
            <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">table</span><span class="o">.</span><span class="n">columns</span>
        <span class="p">])</span>
        <span class="n">arrow_tables</span><span class="p">[</span><span class="n">table</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">Table</span><span class="o">.</span><span class="n">from_pandas</span><span class="p">(</span>
            <span class="n">pd</span><span class="o">.</span><span class="n">read_sql_table</span><span class="p">(</span><span class="n">table</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">conn</span><span class="p">,</span> <span class="n">coerce_float</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span>
            <span class="n">schema</span><span class="o">=</span><span class="n">schema</span><span class="p">,</span>
            <span class="n">preserve_index</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
        <span class="p">)</span>
<span class="c1"># Write each Arrow table to a zstd compressed Parquet</span>
<span class="k">for</span> <span class="n">table_name</span><span class="p">,</span> <span class="n">table</span> <span class="ow">in</span> <span class="n">arrow_tables</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
    <span class="n">pq</span><span class="o">.</span><span class="n">write_table</span><span class="p">(</span>
        <span class="n">table</span><span class="p">,</span>
        <span class="sa">f</span><span class="s1">'ensdb_v107/</span><span class="si">{</span><span class="n">table_name</span><span class="si">}</span><span class="s1">.parquet'</span><span class="p">,</span>
        <span class="n">compression</span> <span class="o">=</span> <span class="s1">'zstd'</span><span class="p">,</span>
        <span class="n">compression_level</span> <span class="o">=</span> <span class="mi">9</span><span class="p">,</span>
    <span class="p">)</span>
</code></pre></div>
<h3 id="load-parquet-tables-to-duckdb">Load Parquet tables to DuckDB</h3>
<p>Finally, we can load the exported Parquet tables to DuckDB.
Here I tested a few approaches:</p>
<ol>
<li>Create views to the external Parquet files (no content loaded to the db)</li>
<li>Load the full content</li>
<li>Load the full content and index the tables (same as the original SQLite db)</li>
</ol>
<p>Since DuckDB has native support for Parquet files, the syntax is straightforward:</p>
<div class="highlight"><pre><span></span><code><span class="c1">-- Install and activate the extension</span>
<span class="n">INSTALL</span><span class="w"> </span><span class="n">parquet</span><span class="p">;</span><span class="w"> </span><span class="k">LOAD</span><span class="w"> </span><span class="n">parquet</span><span class="p">;</span>
<span class="c1">-- To create views to external Parquet</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">VIEW</span><span class="w"> </span><span class="o"><</span><span class="k">table</span><span class="o">></span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="s1">'./ensdb_v107/<table>.parquet'</span><span class="p">;</span>
<span class="p">...</span>
<span class="c1">-- To load the full content from external Parquet</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="o"><</span><span class="k">table</span><span class="o">></span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="s1">'./ensdb_v107/<table>.parquet'</span><span class="p">;</span>
<span class="p">...</span>
<span class="c1">-- To index the table (use .schema to get the original index definitions)</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">UNIQUE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">gene_gene_id_idx</span><span class="w"> </span><span class="k">on</span><span class="w"> </span><span class="n">gene</span><span class="w"> </span><span class="p">(</span><span class="n">gene_id</span><span class="p">);</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">gene_gene_name_idx</span><span class="w"> </span><span class="k">on</span><span class="w"> </span><span class="n">gene</span><span class="w"> </span><span class="p">(</span><span class="n">gene_name</span><span class="p">);</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">gene_seq_name_idx</span><span class="w"> </span><span class="k">on</span><span class="w"> </span><span class="n">gene</span><span class="w"> </span><span class="p">(</span><span class="n">seq_name</span><span class="p">);</span>
<span class="p">...</span>
</code></pre></div>
<p>Note that I didn’t try to “optimize” the table indices for my queries.
I simply mirrored the same index definition from the original SQLite database.</p>
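<p>Besides the <code>.schema</code> command, the index definitions can also be pulled out programmatically from SQLite’s <code>sqlite_master</code> table. The sketch below uses Python’s built-in <code>sqlite3</code> module on a toy in-memory table; the table and the <code>index_definitions</code> helper are hypothetical, and I make no claim that every SQLite index statement is accepted verbatim by DuckDB.</p>

```python
import sqlite3

# Toy table with two of the indices shown above (in-memory stand-in for the EnsDb file).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (gene_id TEXT, gene_name TEXT, seq_name TEXT);
    CREATE UNIQUE INDEX gene_gene_id_idx ON gene (gene_id);
    CREATE INDEX gene_gene_name_idx ON gene (gene_name);
""")


def index_definitions(conn):
    """Return the CREATE INDEX statements stored in the SQLite schema."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        # Auto-created indices (e.g. for PRIMARY KEY) have no SQL text; skip them
        "WHERE type = 'index' AND sql IS NOT NULL ORDER BY name"
    )
    return [sql + ";" for (sql,) in rows]


statements = index_definitions(conn)
print("\n".join(statements))
```

<p>The printed statements could then be appended to the SQL script that builds the DuckDB database, after checking that each one is valid there.</p>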
<p>DuckDB’s command-line interface works like SQLite’s,
and it also keeps the database in a single file.
The full conversion including the Parquet step took about 10 seconds to complete.</p>
<div class="highlight"><pre><span></span><code>duckdb<span class="w"> </span>-echo<span class="w"> </span>ensdb_v107.duckdb<span class="w"> </span><<span class="w"> </span>create_duckdb.sql
duckdb<span class="w"> </span>-readonly<span class="w"> </span>ensdb_v107.duckdb
</code></pre></div>
<h3 id="database-file-size-comparison">Database file size comparison</h3>
<p>Here are the file sizes of the databases created with different settings:</p>
<table>
<thead>
<tr>
<th style="text-align: left;">Database</th>
<th style="text-align: right;">File size</th>
<th style="text-align: right;">(%)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">SQLite without indices</td>
<td style="text-align: right;">243MB</td>
<td style="text-align: right;">57.9</td>
</tr>
<tr>
<td style="text-align: left;"><strong>SQLite (original)</strong></td>
<td style="text-align: right;"><strong>420MB</strong></td>
<td style="text-align: right;"><strong>100.0</strong></td>
</tr>
<tr>
<td style="text-align: left;">DuckDB with external Parquets</td>
<td style="text-align: right;">37.6MB</td>
<td style="text-align: right;">9.0</td>
</tr>
<tr>
<td style="text-align: left;">DuckDB</td>
<td style="text-align: right;">169MB</td>
<td style="text-align: right;">40.2</td>
</tr>
<tr>
<td style="text-align: left;">DuckDB indexed</td>
<td style="text-align: right;">528MB</td>
<td style="text-align: right;">125.7</td>
</tr>
</tbody>
</table>
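<p>The percentage column is each file size relative to the original (indexed) SQLite database. As a quick check, the arithmetic can be reproduced from the sizes in the table; the dictionary keys below are just short labels chosen for this sketch.</p>

```python
# File sizes in MB, copied from the table above
sizes_mb = {
    "sqlite_noidx": 243,
    "sqlite": 420,
    "duckdb_parquet": 37.6,
    "duckdb": 169,
    "duckdb_idx": 528,
}

# Express every size as a percentage of the original SQLite database
baseline = sizes_mb["sqlite"]
relative = {name: round(100 * size / baseline, 1) for name, size in sizes_mb.items()}

for name, pct in relative.items():
    print(f"{name:>15}: {pct:6.1f}%")
```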
<p>DuckDB with external Parquets yields the smallest file (~9% of the original size).
This is probably because the database has a lot of text columns, and zstd compression works really well on plain text.
This approach could make the ensembldb database more portable.
Say, it’s possible to commit it directly into the analysis project’s GitHub repo.</p>
<p>By loading the actual data into DuckDB (without indices), the file grows considerably since the data is no longer zstd-compressed,
though it is still smaller than its SQLite counterpart without indices.
I wonder if this is due to the columnar storage being more space efficient than row storage.
After indexing the DuckDB database, it surprisingly grows to be much larger than SQLite.
I don’t know DuckDB’s indexing methods well enough to understand what happened here.
Since DuckDB is still actively developing its indexing algorithm, I suppose this could be optimized in the future.</p>
<p>Now we have the databases ready.
Let’s see how they perform.</p>
<!-- cSpell:words dbdir mircrobenchmark noidx EGFR -->
<h2 id="use-duckdb-in-ensembldb">Use DuckDB in ensembldb</h2>
<p>It’s painless to tell ensembldb to use DuckDB instead.
<a href="https://duckdb.org/docs/api/r">DuckDB’s R client</a> already implements R’s DBI interface, and ensembldb <a href="https://jorainer.github.io/ensembldb/reference/EnsDb.html">accepts a DBI connection</a> to create an EnsDb object.
So we already have everything we need:</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">duckdb</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">ensembldb</span><span class="p">)</span>
<span class="n">edb_sqlite</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">EnsDb</span><span class="p">(</span><span class="s">'EnsDb.Hsapiens.v107.sqlite'</span><span class="p">)</span>
<span class="n">conn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">dbConnect</span><span class="p">(</span><span class="nf">duckdb</span><span class="p">(),</span><span class="w"> </span><span class="n">dbdir</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"ensdb_v107.duckdb"</span><span class="p">,</span><span class="w"> </span><span class="n">read_only</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span>
<span class="n">edb_duckdb</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">EnsDb</span><span class="p">(</span><span class="n">conn</span><span class="p">)</span>
<span class="nf">dbDisconnect</span><span class="p">(</span><span class="n">conn</span><span class="p">,</span><span class="w"> </span><span class="n">shutdown</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="c1"># disconnect after usage</span>
</code></pre></div>
<p>All the downstream usage of ensembldb is the same from here.</p>
<h2 id="benchmark-the-databases">Benchmark the databases</h2>
<p>Now we have the original SQLite database and three DuckDB databases constructed with various settings ready to use in ensembldb.
Here I tested two scenarios: a genome-wide annotation query and a gene-specific lookup.</p>
<p>To make the query more realistic and complicated, I also applied a filter to all queries to select annotations only from the canonical chromosomes and remove all LRG genes:</p>
<div class="highlight"><pre><span></span><code><span class="n">standard_filter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">AnnotationFilter</span><span class="p">(</span>
<span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">seq_name</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">22</span><span class="p">,</span><span class="w"> </span><span class="s">'X'</span><span class="p">,</span><span class="w"> </span><span class="s">'Y'</span><span class="p">,</span><span class="w"> </span><span class="s">'MT'</span><span class="p">)</span><span class="w"> </span><span class="o">&</span>
<span class="w"> </span><span class="n">gene_biotype</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s">'LRG_gene'</span>
<span class="p">)</span>
</code></pre></div>
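<p>Conceptually, a filter like this boils down to a <code>WHERE</code> clause on the underlying tables. The Python sketch below shows a hand-written equivalent on a toy <code>gene</code> table using the built-in <code>sqlite3</code> module. The rows and the non-canonical sequence name are made up for illustration, and the SQL that ensembldb actually generates may well look different.</p>

```python
import sqlite3

# Toy gene table; the rows are invented purely for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (gene_id TEXT, gene_biotype TEXT, seq_name TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [
        ("gene_a", "protein_coding", "7"),
        ("gene_b", "LRG_gene", "7"),                   # dropped: LRG gene
        ("gene_c", "protein_coding", "SOME_SCAFFOLD"),  # dropped: non-canonical sequence
        ("gene_d", "lncRNA", "MT"),
    ],
)

# Canonical chromosomes: 1-22, X, Y, and MT (stored as text in the database)
canonical = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]
placeholders = ", ".join("?" for _ in canonical)
query = (
    "SELECT gene_id FROM gene "
    f"WHERE seq_name IN ({placeholders}) AND gene_biotype != ? "
    "ORDER BY gene_id"
)
kept = [row[0] for row in conn.execute(query, canonical + ["LRG_gene"])]
print(kept)
```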
<p>I use <a href="https://cran.r-project.org/web/packages/microbenchmark/index.html">microbenchmark</a> to benchmark the same query from different databases.
It works like this:</p>
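<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">microbenchmark</span><span class="p">)</span>
<span class="n">mbm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">microbenchmark</span><span class="p">(</span>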
<span class="w"> </span><span class="s">"sqlite_noidx"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="o"><</span><span class="n">some</span><span class="w"> </span><span class="n">query</span><span class="o">></span><span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="s">"sqlite"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="kc">...</span><span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="s">"duckdb_parquet"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="kc">...</span><span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="s">"duckdb"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="kc">...</span><span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="s">"duckdb_idx"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="kc">...</span><span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="w"> </span><span class="c1"># 50 times for faster queries</span>
<span class="p">)</span>
<span class="nf">summary</span><span class="p">(</span><span class="n">mbm</span><span class="p">)</span>
</code></pre></div>
<h3 id="genome-wide-annotation-query">Genome-wide annotation query</h3>
<p>The first genome-wide query finds the 5’UTR genomic ranges of all the transcripts.
This is one of the most computationally intensive built-in queries I know, involving some genomic range arithmetic and querying over multiple tables.</p>
<div class="highlight"><pre><span></span><code><span class="n">five_utr_per_tx</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">fiveUTRsByTranscript</span><span class="p">(</span><span class="n">edb</span><span class="p">,</span><span class="w"> </span><span class="n">filter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">standard_filter</span><span class="p">)</span>
<span class="n">five_utr_per_tx</span><span class="w"> </span><span class="o">|></span><span class="w"> </span><span class="nf">head</span><span class="p">()</span>
<span class="c1">## GRangesList object of length 6:</span>
<span class="c1">## $ENST00000000442</span>
<span class="c1">## GRanges object with 2 ranges and 4 metadata columns:</span>
<span class="c1">## seqnames ranges strand | gene_biotype seq_name</span>
<span class="c1">## <Rle> <IRanges> <Rle> | <character> <character></span>
<span class="c1">## [1] 11 64305524-64305736 + | protein_coding 11</span>
<span class="c1">## [2] 11 64307168-64307179 + | protein_coding 11</span>
<span class="c1">## exon_id exon_rank</span>
<span class="c1">## <character> <integer></span>
<span class="c1">## [1] ENSE00001884684 1</span>
<span class="c1">## [2] ENSE00001195360 2</span>
<span class="c1">## -------</span>
<span class="c1">## seqinfo: 25 sequences (1 circular) from GRCh38 genome</span>
<span class="c1">##</span>
<span class="c1">## ...</span>
</code></pre></div>
<p>Here are the microbenchmark results from running the same query against all databases:</p>
<figure class="invert-in-dark-mode">
<img src="https://blog.liang2.tw/posts/2022/10/use-duckdb-in-ensembldb/pics/benchmark_genomewide_5utr_by_tx.png">
<figcaption>Benchmark results of extracting genome-wide 5'UTR locations per transcript.</figcaption>
</figure>
<p>There is a huge performance increase for all DuckDB databases, since this query pretty much scans over the full table.
Overall, DuckDB runs 3+ times faster than SQLite.</p>
<p>In most benchmarks, a few runs in each database take significantly more time.
This pattern is quite consistent when I re-run the benchmarks.
While I haven’t investigated these outliers, I think they are due to the first run(s) being un-cached.
Surprisingly, DuckDB with indices runs much slower than DuckDB without indices (especially on the first run).
Though the indices might be useless in sequential scans, I guess the slowdown could be due to the bigger file (which takes longer to cache) or the query planner accidentally traversing the indices.</p>
<h3 id="another-genome-wide-annotation-query">Another genome-wide annotation query</h3>
<p>The other genome-wide query finds the transcripts of all the genes.</p>
<div class="highlight"><pre><span></span><code><span class="n">tx_per_gene</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">transcriptsBy</span><span class="p">(</span><span class="n">edb</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"gene"</span><span class="p">,</span><span class="w"> </span><span class="n">filter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">standard_filter</span><span class="p">)</span>
<span class="n">tx_per_gene</span><span class="w"> </span><span class="o">|></span><span class="w"> </span><span class="nf">head</span><span class="p">()</span>
<span class="c1">## GRangesList object of length 6:</span>
<span class="c1">## $ENSG00000000003</span>
<span class="c1">## GRanges object with 5 ranges and 12 metadata columns:</span>
<span class="c1">## seqnames ranges strand | tx_id</span>
<span class="c1">## <Rle> <IRanges> <Rle> | <character></span>
<span class="c1">## [1] X 100633442-100639991 - | ENST00000494424</span>
<span class="c1">## [2] X 100627109-100637104 - | ENST00000612152</span>
<span class="c1">## [3] X 100632063-100637104 - | ENST00000614008</span>
<span class="c1">## [4] X 100627108-100636806 - | ENST00000373020</span>
<span class="c1">## [5] X 100632541-100636689 - | ENST00000496771</span>
<span class="c1">## tx_biotype tx_cds_seq_start tx_cds_seq_end gene_id</span>
<span class="c1">## <character> <integer> <integer> <character></span>
<span class="c1">## [1] processed_transcript <NA> <NA> ENSG00000000003</span>
<span class="c1">## [2] protein_coding 100630798 100635569 ENSG00000000003</span>
<span class="c1">## [3] protein_coding 100632063 100635569 ENSG00000000003</span>
<span class="c1">## [4] protein_coding 100630798 100636694 ENSG00000000003</span>
<span class="c1">## [5] processed_transcript <NA> <NA> ENSG00000000003</span>
<span class="c1">## ...</span>
</code></pre></div>
<p>Similarly, here are the benchmark results:</p>
<figure class="invert-in-dark-mode">
<img src="https://blog.liang2.tw/posts/2022/10/use-duckdb-in-ensembldb/pics/benchmark_genomewide_tx_by_gene.png">
<figcaption>Benchmark of extracting genome-wide gene isoforms.</figcaption>
</figure>
<p>This query tells more or less the same story, with only one notable difference.
In this case, fully loaded DuckDB with and without indices share the same performance.
Interestingly, all the DuckDB runtimes are in a bimodal distribution.
I don’t know why.</p>
<h3 id="gene-specific-lookup">Gene-specific lookup</h3>
<p>My other main scenario is to look up the annotations of a specific gene.
Let’s simulate this kind of query by retrieving all the transcripts of the gene “EGFR”:</p>
<div class="highlight"><pre><span></span><code><span class="n">egfr_tx</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">transcripts</span><span class="p">(</span><span class="n">edb</span><span class="p">,</span><span class="w"> </span><span class="n">filter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">AnnotationFilter</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">gene_name</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s">'EGFR'</span><span class="p">))</span>
<span class="n">egfr_tx</span>
<span class="c1">## GRanges object with 14 ranges and 12 metadata columns:</span>
<span class="c1">## seqnames ranges strand | tx_id</span>
<span class="c1">## <Rle> <IRanges> <Rle> | <character></span>
<span class="c1">## ENST00000344576 7 55019017-55171037 + | ENST00000344576</span>
<span class="c1">## ENST00000275493 7 55019017-55211628 + | ENST00000275493</span>
<span class="c1">## ENST00000455089 7 55019021-55203076 + | ENST00000455089</span>
<span class="c1">## ENST00000342916 7 55019032-55168635 + | ENST00000342916</span>
<span class="c1">## LRG_304t1 7 55019032-55207338 + | LRG_304t1</span>
<span class="c1">## ... ... ... ... . ...</span>
<span class="c1">## ENST00000450046 7 55109723-55211536 + | ENST00000450046</span>
<span class="c1">## ENST00000700145 7 55163753-55205865 + | ENST00000700145</span>
<span class="c1">## ENST00000485503 7 55192811-55200802 + | ENST00000485503</span>
<span class="c1">## ENST00000700146 7 55198272-55208067 + | ENST00000700146</span>
<span class="c1">## ENST00000700147 7 55200573-55206016 + | ENST00000700147</span>
<span class="c1">## ...</span>
</code></pre></div>
<figure class="invert-in-dark-mode">
<img src="https://blog.liang2.tw/posts/2022/10/use-duckdb-in-ensembldb/pics/benchmark_extract_specific_gene.png">
<figcaption>Benchmark of extracting the annotations of a specific gene.</figcaption>
</figure>
<p>SQLite with indices undoubtedly has the best performance.
Understandably, it’s been fine-tuned for this very use case (extracting a few rows using indices).
And SQLite without an index takes the most time to complete, so it’s necessary to always index the tables.</p>
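<p>One way to see why the index matters so much for SQLite here is to ask for its query plan. The sketch below uses Python’s built-in <code>sqlite3</code> module on an empty toy <code>gene</code> table rather than the real database; <code>lookup_plan</code> is a hypothetical helper for this illustration, and the exact wording of the plan depends on the SQLite version.</p>

```python
import sqlite3


def lookup_plan(with_index):
    """Return SQLite's query plan for a gene-name lookup on a toy table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (gene_id TEXT, gene_name TEXT, seq_name TEXT)")
    if with_index:
        conn.execute("CREATE INDEX gene_gene_name_idx ON gene (gene_name)")
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM gene WHERE gene_name = ?", ("EGFR",)
    ).fetchall()
    conn.close()
    # The last column of each row is the human-readable plan step
    return " | ".join(row[-1] for row in rows)


plan_noidx = lookup_plan(with_index=False)
plan_idx = lookup_plan(with_index=True)
print("no index:", plan_noidx)  # a full table scan
print("indexed: ", plan_idx)    # a search that uses gene_gene_name_idx
```

<p>Without the index, SQLite has to scan every row of the table to find the matches, which fits the slow un-indexed timings above.</p>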
<p>The performance of all three DuckDB databases falls in between the two extremes of the SQLite dbs.
Unlike SQLite, indexed DuckDB only speeds up the query a little bit (21.0ms vs 22.4ms on average).
Given the worse performance of one of the genome-wide queries above using the indexed DuckDB db,
I think it’s optional to create indices for ensembldb’s DuckDB dbs.</p>
<h2 id="summary">Summary</h2>
<p>Here is an overview of the benchmarking results together with each db’s file size.
The table below shows the average speed-up ratio (with the worst-case ratio in parentheses) over the original SQLite db; higher is better:</p>
<table>
<thead>
<tr>
<th style="text-align: left;">Database</th>
<th style="text-align: right;">Size (%)</th>
<th style="text-align: right;">Genome I</th>
<th style="text-align: right;">Genome II</th>
<th style="text-align: right;">Gene-specific lookup</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">SQLite not indexed</td>
<td style="text-align: right;">57.9</td>
<td style="text-align: right;">0.61 (0.78)</td>
<td style="text-align: right;">0.88 (1.02)</td>
<td style="text-align: right;">0.044 (0.12)</td>
</tr>
<tr>
<td style="text-align: left;"><strong>SQLite (original)</strong></td>
<td style="text-align: right;"><strong>100.0</strong></td>
<td style="text-align: right;"><strong>1.00 (1.00)</strong></td>
<td style="text-align: right;"><strong>1.00 (1.00)</strong></td>
<td style="text-align: right;"><strong>1.00 (1.00)</strong></td>
</tr>
<tr>
<td style="text-align: left;">DuckDB w. ext. Parquets</td>
<td style="text-align: right;">9.0</td>
<td style="text-align: right;">3.36 (3.69)</td>
<td style="text-align: right;">4.14 (4.63)</td>
<td style="text-align: right;">0.15 (0.35)</td>
</tr>
<tr>
<td style="text-align: left;">DuckDB</td>
<td style="text-align: right;">40.2</td>
<td style="text-align: right;">4.70 (4.76)</td>
<td style="text-align: right;">6.30 (6.73)</td>
<td style="text-align: right;">0.61 (0.81)</td>
</tr>
<tr>
<td style="text-align: left;">DuckDB indexed</td>
<td style="text-align: right;">125.7</td>
<td style="text-align: right;">3.66 (1.87)</td>
<td style="text-align: right;">6.29 (6.33)</td>
<td style="text-align: right;">0.65 (1.80)</td>
</tr>
</tbody>
</table>
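<p>To make the ratios concrete, here is a small Python sketch of how such numbers could be derived from raw timings. It assumes the ratio is the SQLite time divided by the other database’s time, and that the worst case compares the slowest runs; both are assumptions for illustration rather than the exact benchmark code, and the timings are invented.</p>

```python
from statistics import mean


def speedup_ratio(baseline_ms: list[float], candidate_ms: list[float]) -> float:
    """Mean baseline time divided by mean candidate time (higher is better)."""
    return mean(baseline_ms) / mean(candidate_ms)


def worst_case_ratio(baseline_ms: list[float], candidate_ms: list[float]) -> float:
    """Slowest baseline run divided by slowest candidate run (an assumption)."""
    return max(baseline_ms) / max(candidate_ms)


# Made-up timings in milliseconds, for illustration only.
sqlite_ms = [10.0, 12.0, 14.0]
duckdb_ms = [20.0, 24.0, 28.0]

ratio = speedup_ratio(sqlite_ms, duckdb_ms)
worst = worst_case_ratio(sqlite_ms, duckdb_ms)
print(f"{ratio:.2f} ({worst:.2f})")
```

<p>A ratio below 1 means the candidate is slower than the original SQLite db, and a ratio above 1 means it is faster.</p>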
<p>Overall, DuckDB shows an impressive performance increase for genome-wide queries.
Without indices, it uses less storage too.
DuckDB is slower than SQLite for gene-specific lookups, but since each query takes only tens of milliseconds, the impact is minimal unless we run thousands of them.
Genome-wide queries, on the other hand, save seconds per query.</p>
<p>As the benchmark results show, we can replace the original ensembldb database with a DuckDB database by loading the tables and dropping the indices.
If the user is willing to sacrifice some performance in gene-specific lookups, DuckDB with external Parquet files uses less than 10% of the original disk space and still runs faster for genome-wide queries.</p>
<p>While the default indices copied from SQLite are not very helpful, I didn’t tune the indices to maximally speed up the gene-specific lookups.
We can probably also tune the Parquet compression ratio to find a better balance between the decompression speed and file size.
Note that DuckDB’s file format is not stabilized yet, so the database needs to be re-created in newer DuckDB versions.</p>
<p>All in all, I think DuckDB advertises itself accurately when it comes to analytical query workloads.
It shows good performance when it queries a large portion of its content.
Since it has an interface similar to SQLite’s and clients in popular languages (R, Python, etc.),
it’s easy to switch an existing SQLite use case to DuckDB.
My small exercise with ensembldb has convinced me to try out DuckDB in more scenarios too.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:sqlite-to-duckdb">
<p>There is an official extension, <a href="https://github.com/duckdblabs/sqlite_scanner">sqlite_scanner</a>, currently under development that lets DuckDB attach directly to a SQLite database.
So in the future, it could be much easier to convert SQLite to DuckDB. <a class="footnote-backref" href="#fnref:sqlite-to-duckdb" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>
<h1>Thesis in LaTeX</h1>
<p>Liang-Bo Wang, published 2022-03-12, updated 2023-08-13</p>
<p>A few months ago, I finished <a href="https://github.com/ccwang002/phd-thesis">my PhD thesis</a> in LaTeX.
The WUSTL LaTeX template I could find has a long history<sup id="fnref:note-template-history"><a class="footnote-ref" href="#fn:note-template-history">1</a></sup>, and its implementation is relatively lengthy and includes unnecessary code.
I didn’t feel comfortable building on top of it.</p>
<p>I ended up rewriting the template based on the <a href="https://www.ctan.org/pkg/memoir">memoir</a> package, which is designed for long documents such as books and theses.
Overall I enjoyed my experience writing in LaTeX, so I wanted to share my LaTeX setup and my experience of editing a long document.</p>
<div class="toc">
<ul>
<li><a href="#writing-thesis-in-latex-is-easy-and-rewarding-once-its-set-up">Writing thesis in LaTeX is easy and rewarding (once it’s set up)</a></li>
<li><a href="#my-latex-setup">My LaTeX setup</a><ul>
<li><a href="#bibliography-management-with-zotero-and-better-bibtex">Bibliography management with Zotero and Better BibTeX</a></li>
<li><a href="#overflowing-legend-of-large-figures">Overflowing legend of large figures</a></li>
<li><a href="#folder-structure">Folder structure</a></li>
<li><a href="#github-online-editing-workflow">GitHub online editing workflow</a></li>
</ul>
</li>
<li><a href="#package-recommendation-to-start-a-new-latex-template">Package recommendation to start a new LaTeX template</a></li>
<li><a href="#why-i-didnt-use-latex-a-lot-during-my-phd">Why I didn’t use LaTeX a lot during my PhD</a></li>
<li><a href="#types-of-documents-i-will-write-in-latex-in-future">Types of documents I will write in LaTeX in future</a></li>
</ul>
</div>
<h2 id="writing-thesis-in-latex-is-easy-and-rewarding-once-its-set-up">Writing thesis in LaTeX is easy and rewarding (once it’s set up)</h2>
<p>The advantages of LaTeX and its ecosystem<sup id="fnref:latex-advantages"><a class="footnote-ref" href="#fn:latex-advantages">2</a></sup> over WYSIWYG editors (e.g., Word and Google Docs) become more obvious as the document gets longer. WYSIWYG editors start to slow down when I need to, say, change the figure style across the whole document or swap some figures and sections.
On the other hand, the LaTeX workflow remains the same.
While the compilation takes longer, it usually runs in the background, and I have adapted to checking the output only periodically.
I appreciate its reliability and modularity when my document gets really long.
I can just focus on the writing.</p>
<p>The hardest part is the setup of a LaTeX document.</p>
<p>What are the “recommended” packages? What do all the parameters of these packages and commands mean? How should I set up the folder structure? How do I make a figure block whose caption overflows to the next page (or any other visual component)? And that’s not to mention the time I spent fixing errors and debugging when the code didn’t work as expected.</p>
<p>Now, after spending some time going over the documentation, I want to write down my setup for future use.</p>
<h2 id="my-latex-setup">My LaTeX setup</h2>
<p>I prefer the combination of LuaLaTeX (or XeLaTeX) and BibLaTeX/Biber.
Both LaTeX engines can utilize system fonts and support Unicode.
As for the bibliography management, BibLaTeX is more customizable.</p>
<h3 id="bibliography-management-with-zotero-and-better-bibtex">Bibliography management with Zotero and Better BibTeX</h3>
<p>I manage all my references in <a href="https://www.zotero.org/">Zotero</a>.
<a href="https://blog.liang2.tw/posts/2015/09/ref-management-zotero/">My Zotero setup</a> (Zotero + <a href="https://github.com/jlegewie/zotfile">Zotfile</a>) has stayed the same for the past 6 years, which is surprisingly stable in the software world.
Zotero has built-in support for exporting references in BibTeX and BibLaTeX formats.</p>
<p>To further integrate Zotero into the LaTeX workflow, the <a href="https://retorque.re/zotero-better-bibtex/">Better BibTeX for Zotero</a> plugin comes in handy.
The plugin can customize the citation key generation, and it can automatically re-export the <code>.bib</code> file when the references change.</p>
<p>I currently have the following rule for the citation key generation:</p>
<div class="highlight"><pre><span></span><code>[auth+initials:lower]_[authorLast+initials:lower][>1]:[shorttitle2_2][year]
|[Auth+initials:lower:(unknown)]:[Title:capitalize:substring=1,64][year]
</code></pre></div>
<p>The key combines the first and last authors of the paper, the first two words of the title, and the publication year.
If the reference is not an article (mostly websites), it falls back to the author, title, and year when available.</p>
<p>Under this rule, for example, the citation key of the famous CRISPR paper (<a href="https://pubmed.ncbi.nlm.nih.gov/22745249/">doi:10.1126/science.1225829</a>):</p>
<blockquote>
<p>Jinek, M., Chylinski, K., Fonfara, I., Hauer, M., Doudna, J.A., and Charpentier, E. (2012). A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 337, 816–821.</p>
</blockquote>
<p>becomes <code>jinekm_charpentiere:ProgrammableDualRNAguided2012</code>.</p>
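<p>As a rough illustration of the rule (not Better BibTeX’s actual implementation), the same key can be rebuilt with a few lines of Python. The helper names and the short stop-word list are assumptions made up for this sketch, and it only handles plain ASCII names.</p>

```python
import re

# Words skipped when picking title words; a tiny illustrative list only.
SKIP_WORDS = {"a", "an", "the", "of", "in", "on", "and"}


def author_part(author: str) -> str:
    """Turn 'Jinek, M.' into 'jinekm' (last name plus initials, lower-cased)."""
    last, _, initials = author.partition(",")
    return re.sub(r"[^a-z]", "", (last + initials).lower())


def title_part(title: str, n_words: int = 2) -> str:
    """Join the first `n_words` non-skipped words, without punctuation, capitalized."""
    words = [w for w in title.split() if w.lower() not in SKIP_WORDS]
    picked = [re.sub(r"[^A-Za-z0-9]", "", w) for w in words[:n_words]]
    return "".join(w[:1].upper() + w[1:] for w in picked)


def citation_key(authors: list[str], title: str, year: int) -> str:
    """Build 'first_last:TitleWordsYear'; a single author yields no '_last' part."""
    key = author_part(authors[0])
    if len(authors) > 1:
        key += "_" + author_part(authors[-1])
    return f"{key}:{title_part(title)}{year}"


key = citation_key(
    ["Jinek, M.", "Chylinski, K.", "Fonfara, I.", "Hauer, M.",
     "Doudna, J.A.", "Charpentier, E."],
    "A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity",
    2012,
)
print(key)
```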
<p>Better BibTeX also allows fixing (pinning) a citation key, which is useful for keeping a nickname for commonly cited publications or for maintaining backward compatibility.</p>
<h3 id="overflowing-legend-of-large-figures">Overflowing legend of large figures</h3>
<p>A feature I found very useful is letting the figure caption/legend overflow to the next page, which is common in many journals.
Usually the figure is large, takes up the full page, and has multiple panels combined into one file,
so the remaining figure legend needs to go on the next page.
Here’s an example of the desired behavior:</p>
<figure class="full-img invert-in-dark-mode">
<img src="https://blog.liang2.tw/posts/2022/03/thesis-latex/pics/fig_legend_first_half.png">
<img src="https://blog.liang2.tw/posts/2022/03/thesis-latex/pics/fig_legend_second_half.png">
<figcaption>Example of the figure legend overflow.</figcaption>
</figure>
<p>To create “fake” panels for <a href="https://www.ctan.org/pkg/subcaption">subcaption</a> to keep track of the reference, I followed <a href="https://tex.stackexchange.com/a/255790">the suggestion on LaTeX Stack Exchange</a> to create a hidden anchor for the label using <code>\phantomsubcaption</code>.</p>
<p>To allow the figure legend/caption to overflow, I construct two figure environments back to back.
Then I use <code>\sourceatright</code>, a quotation helper command from memoir, to place the “(legend continued on next page)” note right-aligned at the end of the legend.</p>
<p>Here are the final helper commands in the preamble:</p>
<div class="highlight"><pre><span></span><code><span class="c">% Allowing subcaptions when all figure panels are combined</span>
<span class="c">% into one source image. Require subcaption package.</span>
<span class="c">% Based on https://tex.stackexchange.com/a/255790</span>
<span class="k">\newcommand</span><span class="nb">{</span><span class="k">\phantomlabel</span><span class="nb">}</span>[1]<span class="nb">{</span><span class="c">%</span>
<span class="k">\parbox</span><span class="nb">{</span>0pt<span class="nb">}{</span><span class="k">\phantomsubcaption\label</span><span class="nb">{</span>#1<span class="nb">}}</span><span class="c">%</span>
<span class="nb">}</span>
<span class="c">% Note for figure caption spanning multiple pages</span>
<span class="k">\newcommand</span><span class="nb">{</span><span class="k">\legendcontdnote</span><span class="nb">}{</span><span class="k">\sourceatright</span><span class="na">[2em]</span><span class="nb">{</span><span class="c">%</span>
<span class="k">\footnotesize\itshape</span>(legend continued on next page)<span class="c">%</span>
<span class="nb">}}</span>
<span class="k">\newcommand</span><span class="nb">{</span><span class="k">\legendcontdref</span><span class="nb">}</span>[1]<span class="nb">{</span><span class="k">\emph</span><span class="nb">{</span>(<span class="k">\fref</span><span class="nb">{</span>#1<span class="nb">}</span> continued)<span class="nb">}}</span>
</code></pre></div>
<p>And the following shows an example usage:</p>
<div class="highlight"><pre><span></span><code><span class="k">\begin</span><span class="nb">{</span>figure<span class="nb">}</span>[p] <span class="c">% usually large and will need a full page</span>
<span class="k">\centering</span>
<span class="k">\phantomlabel</span><span class="nb">{</span>fig:panel-a<span class="nb">}</span> <span class="c">% hidden label of each figure panel</span>
<span class="k">\phantomlabel</span><span class="nb">{</span>fig:panel-b<span class="nb">}</span>
<span class="k">\phantomlabel</span><span class="nb">{</span>fig:panel-c<span class="nb">}</span>
...
<span class="k">\includegraphics</span><span class="nb">{</span>figures/myfigure.pdf<span class="nb">}</span>
<span class="k">\caption</span><span class="nb">{</span><span class="c">%</span>
Overview of the whole figure.
<span class="k">\subref</span><span class="nb">{</span>fig:panel-a<span class="nb">}</span>
Some description about panel A.
<span class="k">\legendcontdnote</span>
<span class="nb">}</span>
<span class="k">\label</span><span class="nb">{</span>fig:myfigure<span class="nb">}</span>
<span class="k">\end</span><span class="nb">{</span>figure<span class="nb">}</span>
<span class="k">\begin</span><span class="nb">{</span>figure<span class="nb">}</span>[t] <span class="c">% place at the top of the very next page</span>
<span class="k">\centering</span>
<span class="k">\legend</span><span class="nb">{</span><span class="c">%</span>
<span class="k">\legendcontdref</span><span class="nb">{</span>fig:myfigure<span class="nb">}</span>
<span class="k">\subref</span><span class="nb">{</span>fig:panel-b<span class="nb">}</span>
Some description about panel B.
<span class="k">\subref</span><span class="nb">{</span>fig:panel-c<span class="nb">}</span>
Some description about panel C.
...
<span class="nb">}</span>
<span class="k">\end</span><span class="nb">{</span>figure<span class="nb">}</span>
</code></pre></div>
<h3 id="folder-structure">Folder structure</h3>
<p>I have a very typical folder structure of a thesis:</p>
<div class="highlight"><pre><span></span><code>wustlthesis.cls # document class
thesis.tex # main file (structure and settings)
abstract.tex
acknowledgments.tex
chapters/ # text per chapter
├── XX_name.tex
└── ...
figures/ # figures per chapter
├── chapXX_name/
└── ...
fonts/ # External fonts
.github/workflows/ # Autobuild GitHub workflows
README.md # Instructions to build the file
latexmkrc # latexmk settings
references.bib # Bibliography in BibLaTeX
</code></pre></div>
<p>The main goal of the folder structure is to organize the materials by chapters.
I also keep all the related files together so the project is standalone and reproducible.</p>
<h3 id="github-online-editing-workflow">GitHub online editing workflow</h3>
<p>Another pain point of the LaTeX setup is installing the toolchain (TeX Live, MacTeX, etc.).
Sometimes I want to work on the document on someone else’s laptop for just a while, to fix an obvious typo or to write down some ideas.
A working LaTeX toolchain that is always ready to use also helps others adopt the workflow much more easily.</p>
<p>It turns out I can tap into the continuous integration and continuous delivery (CI/CD) services provided by online code repositories.
On GitHub, that service is <a href="https://docs.github.com/en/actions/using-workflows">GitHub Actions</a>, which allows users to provide a custom Docker image and run arbitrary code.</p>
<p>Luckily, someone has already laid the groundwork by packaging a TeX Live environment as a GitHub Action: <a href="https://github.com/xu-cheng/latex-action">https://github.com/xu-cheng/latex-action</a>.
Below is an example of creating an online workflow to build the document using LuaLaTeX:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># .github/workflows/build_latex.yml</span>
<span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Build LaTeX PDF</span>
<span class="nt">on</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">push</span>
<span class="nt">jobs</span><span class="p">:</span>
<span class="w">  </span><span class="nt">build</span><span class="p">:</span>
<span class="w">    </span><span class="nt">runs-on</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ubuntu-latest</span>
<span class="w">    </span><span class="nt">steps</span><span class="p">:</span>
<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Check out Git Repository</span>
<span class="w">        </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">actions/checkout@v2</span>
<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Build LaTeX files</span>
<span class="w">        </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">xu-cheng/latex-action@v2</span>
<span class="w">        </span><span class="nt">with</span><span class="p">:</span>
<span class="w">          </span><span class="nt">root_file</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">thesis.tex</span>
<span class="w">          </span><span class="nt">latexmk_use_lualatex</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span>
<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Check if PDF file is generated</span>
<span class="w">        </span><span class="nt">run</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span>
<span class="w">          </span><span class="no">file thesis.pdf | grep -q ' PDF '</span>
<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Upload PDF</span>
<span class="w">        </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">actions/upload-artifact@v2</span>
<span class="w">        </span><span class="nt">with</span><span class="p">:</span>
<span class="w">          </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">PDF</span>
<span class="w">          </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">thesis.pdf</span>
</code></pre></div>
<p>The workflow is triggered on every git push. Here is an example of my WUSTL thesis template (<a href="https://github.com/ccwang002/wustl-latex-dissertation-template/actions/workflows/build_latex.yml">link</a>):</p>
<figure>
<img src="https://blog.liang2.tw/posts/2022/03/thesis-latex/pics/github_actions_overview.png">
<figcaption>Overview of all the online LaTeX document compilation jobs using GitHub actions.</figcaption>
</figure>
<p>Not only does the workflow record the output messages, which are useful for debugging, but it also stores the final PDF output as an artifact.
I no longer need to install LaTeX locally.</p>
<figure>
<img src="https://blog.liang2.tw/posts/2022/03/thesis-latex/pics/github_actions_artifact.png">
<figcaption>Auto-generated PDF output of the document.</figcaption>
</figure>
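<p>The “Check if PDF file is generated” step in the workflow relies on the <code>file</code> command. A rough Python equivalent, assuming it is enough to check that the file starts with the <code>%PDF-</code> signature, could look like this; the helper name and the demo files are hypothetical.</p>

```python
import tempfile
from pathlib import Path


def looks_like_pdf(path: Path) -> bool:
    """Return True if the file starts with the PDF signature '%PDF-'."""
    with path.open("rb") as handle:
        return handle.read(5) == b"%PDF-"


# Demo on throwaway files rather than a real thesis.pdf.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.pdf"
    good.write_bytes(b"%PDF-1.5\n% toy content\n")
    bad = Path(tmp) / "bad.pdf"
    bad.write_bytes(b"LaTeX Error: something went wrong\n")
    good_ok = looks_like_pdf(good)
    bad_ok = looks_like_pdf(bad)

print(good_ok, bad_ok)
```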
<p>By putting everything in a GitHub repo, I can also use <a href="https://docs.github.com/en/codespaces/the-githubdev-web-based-editor">GitHub’s online editor</a> to work on my document in a web browser anywhere.
Together with GitHub Actions, it can be an alternative to online LaTeX platforms like Overleaf; hacky but more flexible.</p>
<figure class="full-img">
<img src="https://blog.liang2.tw/posts/2022/03/thesis-latex/pics/github_online_editor.png">
<figcaption>Online editor of a LaTeX project in a GitHub repository.</figcaption>
</figure>
<p>Currently, the online workflow is not fully optimized.
My full thesis takes about 6–12 minutes to complete on GitHub, while locally, a full run without cache takes about 3–5 minutes.
And incremental local builds can be much faster with caches.
By caching the environment and the intermediate outputs, the online workflow could be made faster.</p>
<h2 id="package-recommendation-to-start-a-new-latex-template">Package recommendation to start a new LaTeX template</h2>
<p>I looked into a few package alternatives while updating the thesis template.</p>
<p>For a new template/project, start with <a href="https://www.ctan.org/pkg/memoir">memoir</a>, seriously.</p>
<p>Memoir is certainly a giant package with very lengthy documentation (~600 pages).
But it covers more than 90% of the formatting I can think of.
The first few chapters of its documentation (up to “Paragraphs and Lists” chapter) are very useful to get the main components and structures in place.</p>
<p>While memoir is powerful and covers nearly everything, I do wish certain aspects of it could be better. The chapter title styling is complicated and difficult.
My other issue is the lack of examples.
I understand the current documentation is already extremely long.
But memoir has <em>so many</em> options and features that I sometimes find it difficult to work out how to use a specific one.
Maybe more examples or code snippets could be added in separate documentation.</p>
<p>With memoir, very few additional packages are required.
Here is a short list of packages:</p>
<ul>
<li><a href="https://ctan.org/pkg/microtype">microtype</a>: final touch on the typography. pure black magic in my view</li>
<li><a href="https://ctan.org/pkg/enumitem">enumitem</a>: list environment customization</li>
<li><a href="https://ctan.org/pkg/threeparttable">threeparttable</a>: pretty table styling and in-table notes</li>
<li><a href="https://www.ctan.org/pkg/subcaption">subcaption</a>: subfloat references</li>
<li><a href="https://ctan.org/pkg/hyperref">hyperref</a>: link and PDF metadata</li>
<li><a href="https://ctan.org/pkg/graphicx">graphicx</a>: external graphics</li>
<li><a href="https://ctan.org/pkg/fontspec">fontspec</a>: custom system fonts</li>
<li><a href="https://ctan.org/pkg/csquotes">csquotes</a>: quotes</li>
<li><a href="https://ctan.org/pkg/babel">babel</a>: localization</li>
</ul>
<p>Each package is powerful in its own niche.
Their documentation is worth reading to make full use of each package.</p>
<h2 id="why-i-didnt-use-latex-a-lot-during-my-phd">Why I didn’t use LaTeX a lot during my PhD</h2>
<p>I didn’t really have many chances to use LaTeX during my PhD.
It’s usually not worth the effort to use LaTeX for documents less than 50 pages.</p>
<p>The most common documents I need to produce are presentations and manuscripts, and I don’t think LaTeX is useful here.
My presentations are figure heavy.
Many figures are clipped from webpages, tool outputs, and other papers.
They are messy and require post-editing such as cropping and overlays, which is easier to do interactively; PowerPoint wins.
Moreover, I need to finish my slides in a short time.
LaTeX is not the right tool.</p>
<p>As for the manuscripts, the figures and text are managed separately, so there is no need to work on how to insert the figures at the right location in text.
The main text editing is about citations.
While LaTeX is good at them, Zotero can manage citations and bibliography easily on Google Docs, Word, and LibreOffice Writer.
Another main hassle of manuscript editing is collaboration.
While collaboration is possible with LaTeX, the existing services have a high barrier to entry and offer little benefit.</p>
<h2 id="types-of-documents-i-will-write-in-latex-in-future">Types of documents I will write in LaTeX in future</h2>
<p>If I need to write a report longer than 50 pages, I will write in LaTeX and use my thesis setup.</p>
<p>I also have been updating my CV in LaTeX.
While I was playing with LaTeX for my thesis, I also applied the new things I learned to <a href="https://github.com/ccwang002/cv">my CV</a>.
In fact, I applied a different theme and rewrote many parts of the theme to utilize the awesome packages.
I like my new CV to be clean, minimalist, and easy to extend.
I don’t know yet if I want to create my one-page resume in LaTeX.</p>
<p>So is LaTeX worth learning if I am not going to write a book or a long report in the future?</p>
<p>It’s a bad investment of time for sure, since I have put many hours into learning LaTeX.
But I have since been paying more attention to the typography and layout of my documents in general.
I appreciate certain aesthetics of print.
And by trying to fulfill the desired look and feel in my documents, I’ve also improved my editing skills in all WYSIWYG editors (even Adobe InDesign) and webpages (CSS).
It’s hard to beat that artistic satisfaction of getting the style right.</p>
<p>I am glad that I learned LaTeX.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:note-template-history">
<p>The history of the existing template goes way back to 1995.
I tried to summarize the history I could find in the file comments <a href="https://github.com/ccwang002/wustl-latex-dissertation-template/#origin-of-this-template">here</a>. <a class="footnote-backref" href="#fnref:note-template-history" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:latex-advantages">
<p>Here are the advantages of LaTeX I appreciate the most:</p>
<ul>
<li>Precise control of the layout</li>
<li>Automatic positioning of float objects (figures and tables)</li>
<li>Programmable and reusable visual components, especially those provided by the packages</li>
<li>Excellent bibliography management</li>
<li>Excellent equation display</li>
</ul>
<p><a class="footnote-backref" href="#fnref:latex-advantages" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>
<h1>Fix Fira Code font ligatures and features</h1>
<p>Liang-Bo Wang, published 2022-03-01, updated 2023-08-13</p>
<p><a href="https://github.com/tonsky/FiraCode">Fira Code</a> has been my programming font of choice for a while. It’s also the default monospace font of my blog.
I like its ligatures such as <code>>=</code> and connected lines <code>======</code> <code>------</code>.
It even renders progress bars nicely.
It makes my plain text documents look neat.</p>
<p>That said, I don’t like the default ampersands <code>&</code> and at signs <code>@</code>.
I find them harder to read than their traditional forms.
To change their looks, we can enable the font’s alternative glyphs and ligatures using <a href="https://ilovetypography.com/OpenType/opentype-features.html">different OpenType features</a> (see also <a href="https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Fonts/OpenType_fonts_guide">the guide on MDN</a>).
In this case, <code>ss03</code> and <code>ss05</code> enable the traditional looks of ampersands and at signs, respectively.
Most modern editors and word processors can configure the feature sets in use.</p>
<figure>
<img src="https://blog.liang2.tw/posts/2022/03/fix-fira-code-font-features/pics/fira_code_comparison.png">
<figcaption>Comparison of the Fira Code rendering with and without the features fixed (ss01, ss03, ss05, and ss08).</figcaption>
</figure>
<p>Unfortunately, I have encountered programs that cannot configure font features.
While tools like <a href="https://twardoch.github.io/fonttools-opentype-feature-freezer/">pyftfeatfreeze (OpenType Feature Freezer)</a> can swap specific glyphs by directly editing the font file, ligatures of those glyphs may fail.
For example, the <code>ss08</code> feature (e.g., <code>==</code>, <code>!=</code>, and <code>===</code>) won’t be permanently enabled using this approach.</p>
<h2 id="permanently-fix-the-font-features-using-the-official-build-script-update-in-2022-03">Permanently fix the font features using the official build script (update in 2022-03)</h2>
<p>Thanks to the <a href="https://github.com/tonsky/FiraCode/pull/1387">pull request by <code>@Daxtorim</code></a> a few days after this post was published, we are now able to permanently fix the font features using the official build script:</p>
<div class="highlight"><pre><span></span><code>docker<span class="w"> </span>run<span class="w"> </span>-it<span class="w"> </span>--rm<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-v<span class="w"> </span><span class="nv">$PWD</span>:/opt/FiraCode<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>tonsky/firacode:latest<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>./FiraCode/script/build.sh<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-f<span class="w"> </span><span class="s2">"ss01,ss03,ss05,ss08"</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-n<span class="w"> </span><span class="s2">"Fira Code ss01 ss03 ss05 ss08"</span>
<span class="c1"># Rename the generated TTFs</span>
parallel<span class="w"> </span><span class="s1">'mv {} {.}.ss1358_enabled.ttf'</span><span class="w"> </span>:::<span class="w"> </span>distr/ttf/<span class="s1">'Fira Code ss01 ss03 ss05 ss08'</span>/*.ttf
</code></pre></div>
<p>And that’s it!
This is the easiest solution and it works straight out of the box.
Praise the open source community :)
I have kept the original instructions below for manually creating the patches, since that’s what happens behind the scenes.</p>
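<p>For machines without GNU parallel, the renaming step of the command above could be sketched in Python. The function name and the demo file names are hypothetical; the <code>ss1358_enabled</code> tag follows the command above.</p>

```python
import tempfile
from pathlib import Path


def add_feature_tag(ttf_dir: Path, tag: str = "ss1358_enabled") -> list[str]:
    """Rename every *.ttf in `ttf_dir` from NAME.ttf to NAME.<tag>.ttf."""
    renamed = []
    # sorted() materializes the file list before any file is renamed.
    for ttf in sorted(ttf_dir.glob("*.ttf")):
        target = ttf.with_name(f"{ttf.stem}.{tag}.ttf")
        ttf.rename(target)
        renamed.append(target.name)
    return renamed


# Demo on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for name in ("FiraCode-Regular.ttf", "FiraCode-Bold.ttf"):
        (demo_dir / name).touch()
    new_names = add_feature_tag(demo_dir)

print(new_names)
```

<p>Running it twice on the same folder would append the tag again, so it is meant to be run once on a fresh build output.</p>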
<h2 id="permanently-fix-the-font-features">Permanently fix the font features</h2>
<p>By changing the source code of the font generation, it should be possible to permanently fix any font features (aka patching).
As mentioned by <a href="https://github.com/tonsky/FiraCode/issues/869#issuecomment-548006778">the original author (<code>@tonsky</code>)</a>:</p>
<blockquote>
<p>There is probably a simpler approach to patching the font, just concat whatever code there is in ssXX and add it to the end of calt feature.
That should work on the current version of the font, but you’ll need to do your own research on which scripts to use for that</p>
</blockquote>
<p>And that’s exactly my patch for <code>FiraCode.glyphs</code>.
Copy the contents of the features to fix (say, ss01, ss03, ss05, and ss08) to the end of <code>calt</code>, so that the original code:</p>
<div class="highlight"><pre><span></span><code># FiraCode.glyphs
{
code = "lookup less_bar_greater ...
... underscores;\012";
name = calt;
},
</code></pre></div>
<p>becomes:</p>
<div class="highlight"><pre><span></span><code>{
code = "lookup less_bar_greater ...
... underscores;\012
# ss01\012 sub r by r.ss01;
# ss03\012 sub ampersand by ampersand.ss03;...
# ss05\012 sub at by at.ss05;\012sub asciitilde.spacer'; ...
# ss08\012 sub equal_equal.liga by equal_equal.ss08;...
";
name = calt;
},
</code></pre></div>
<p>Then build the font from the <code>.glyphs</code> file using the Docker image:</p>
<div class="highlight"><pre><span></span><code>docker<span class="w"> </span>run<span class="w"> </span>-it<span class="w"> </span>--rm<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-v<span class="w"> </span><span class="nv">$PWD</span>:/opt/FiraCode<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>tonsky/firacode<span class="w"> </span>./FiraCode/script/build.sh
</code></pre></div>
<p>To set a different font name for the feature-fixed fonts, I use <a href="https://twardoch.github.io/fonttools-opentype-feature-freezer/">pyftfeatfreeze</a>:</p>
<div class="highlight"><pre><span></span><code>parallel<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>pyftfeatfreeze<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--suffix<span class="w"> </span>--usesuffix<span class="o">=</span><span class="s2">"'ss01 ss03 ss05 ss08'"</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-v<span class="w"> </span>-n<span class="w"> </span><span class="se">\</span>
<span class="w"> </span><span class="s1">'{}'</span><span class="w"> </span><span class="s1">'features_enabled/{/.}.ss1358_enabled.ttf'</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>:::<span class="w"> </span>ttf/*.ttf
</code></pre></div>Change the blog commenting system to utterances2022-02-20T00:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2022-02-20:/posts/2022/02/blog-comment-utterances/<p>My blog is statically generated, so it needs an external service for commenting. I chose <a href="https://disqus.com/">Disqus</a> when I started my blog because it was a popular choice, and it is free and easy to set up. However, there’s been increasing concern about its extensive user tracking, ads, and therefore a …</p><p>My blog is statically generated, so it needs an external service for commenting. I chose <a href="https://disqus.com/">Disqus</a> when I started my blog because it was a popular choice, and it is free and easy to set up. However, there’s been increasing concern about its extensive user tracking, ads, and therefore a toll on the page loading performance<sup id="fnref:disqus downsides"><a class="footnote-ref" href="#fn:disqus downsides">1</a></sup>. Heck, I don’t even load Disqus myself when I check my own blog:</p>
<figure>
<img src="https://blog.liang2.tw/posts/2022/02/blog-comment-utterances/pics/disqus_blocked_by_privacy_badger_screenshot.png">
<figcaption>What my blog looks like from my end, where Disqus is blocked by <a href="https://privacybadger.org/Privacy">Privacy Badger</a> by default.</figcaption>
</figure>
<p>Recently, I was finally able to look into the alternatives to Disqus. I landed on <a href="https://utteranc.es/">utterances</a>, a commenting widget based on GitHub Issues. I like it for a few reasons:</p>
<ul>
<li>Free and open source</li>
<li>No tracking or ads (for now at least)</li>
<li>Comments are tied to the blog’s code repository</li>
<li>Moderation using existing GitHub tools/interface</li>
</ul>
<p>Switching to a new commenting system can be hard because the old comments are lost. But since there are only 75 comments in total on my blog, I don’t have a lot to miss :) I did back up the old comments because I enjoyed the discussions, and they are one of the main motivations that keep me going. Disqus offers a way to export all the comments, and here is the frequency of all the comments over time:</p>
<figure>
<img src="https://blog.liang2.tw/posts/2022/02/blog-comment-utterances/pics/number_comments_per_post.png">
<figcaption>Number of comments on my blog over time</figcaption>
</figure>
<p>Starting with this post, my blog uses the new utterances comment widget. For comparison, I attached a screenshot of the old interface using Disqus below:</p>
<figure>
<img src="https://blog.liang2.tw/posts/2022/02/blog-comment-utterances/pics/disqus_screenshot.png">
<figcaption>Old commenting widget using Disqus on my blog.</figcaption>
</figure>
<div class="footnote">
<hr>
<ol>
<li id="fn:disqus downsides">
<p>There are already many summaries on these issues. For example: <a href="https://donw.io/post/github-comments/"><em>Replacing Disqus with Comments</em></a> (<a href="https://news.ycombinator.com/item?id=14170041">Discussion on Hacker News</a>) and <a href="https://supunkavinda.blog/disqus"><em>Disqus, the dark commenting system</em></a> (<a href="https://news.ycombinator.com/item?id=26033052">Discussion on Hacker News</a>). <a class="footnote-backref" href="#fnref:disqus downsides" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>ID cross reference with exact protein sequence identity using UniParc2020-07-24T00:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2020-07-24:/posts/2020/07/id-crossref-exact-protein-uniparc/<p><em>TL;DR: Many existing ID mappings between different biological ID systems (RefSeq/Ensembl/UniProt) don’t consider if the IDs have the same exact protein sequence. When the exact sequence is needed, UniParc can be used to cross-reference the IDs. I will demonstrate how to use UniParc to map RefSeq …</em></p><p><em>TL;DR: Many existing ID mappings between different biological ID systems (RefSeq/Ensembl/UniProt) don’t consider if the IDs have the same exact protein sequence. When the exact sequence is needed, UniParc can be used to cross-reference the IDs. I will demonstrate how to use UniParc to map RefSeq human proteins to UniProt and Ensembl at scale.</em></p>
<p>You can skip to the solution if you already know the problem I want to tackle.</p>
<div class="toc">
<ul>
<li><a href="#camps-of-biological-ids">Camps of biological IDs</a></li>
<li><a href="#challenges-to-map-ids-with-exact-sequence-identity">Challenges to map IDs with exact sequence identity</a></li>
<li><a href="#why-mapping-with-exact-sequence-identity">Why mapping with exact sequence identity?</a></li>
<li><a href="#uniparc-comes-into-rescue-for-protein-sequence-identity">UniParc comes to the rescue for protein sequence identity</a></li>
<li><a href="#programatic-uniparc-access-using-its-xml">Programmatic UniParc access using its XML</a></li>
<li><a href="#summary">Summary</a></li>
</ul>
</div>
<h3 id="camps-of-biological-ids">Camps of biological IDs</h3>
<p>There are a few “camps” of biological IDs that are used by many (human) databases and datasets: <a href="https://ensembl.org/">Ensembl</a>, <a href="https://www.ncbi.nlm.nih.gov/refseq/">RefSeq</a> (plus <a href="https://www.ncbi.nlm.nih.gov/gene/">NCBI/Entrez Gene</a>), and <a href="https://www.uniprot.org/">UniProt</a>. Each ID camp is comprehensive on its own, containing gene-level, transcript-level, and protein-level information using its own system of IDs. Unfortunately, all three ID systems/camps are useful in their own way, so databases and datasets are really divided in their choice of a “favorite” ID system.</p>
<p>To get a sense of these “ID camps” and how information is connected through them and across them, this great illustration from <a href="https://biodbnet-abcc.ncifcrf.gov/dbInfo/netGraph.php">bioDBnet</a> sums it all up (it’s huge):</p>
<figure>
<img src="https://blog.liang2.tw/posts/2020/07/id-crossref-exact-protein-uniparc/pics/bioDBnet.jpg">
<figcaption>Best illustration of the complex ID crossref: bioDBnet Network Diagram (<a href="https://biodbnet-abcc.ncifcrf.gov/dbInfo/netGraph.php">source</a>)</figcaption>
</figure>
<h3 id="challenges-to-map-ids-with-exact-sequence-identity">Challenges to map IDs with exact sequence identity</h3>
<p>It’s usually straightforward to cross reference within each ID camp, as long as one has the versioned ID and a copy of that camp’s ID system. For example, to know the gene symbol of <code>ENSP00000368632.3</code>, I can easily use <a href="https://bioconductor.org/packages/release/bioc/html/ensembldb.html">ensembldb</a>, a lite copy of Ensembl’s ID system, to find out it is translated from transcript <code>ENST00000379328.9</code>, which is one of the transcripts of gene <code>ENSG00000107485.18</code>, whose gene symbol is <code>GATA3</code>. Easy peasy. The story goes the same for RefSeq and UniProt (albeit this one is more protein centric).</p>
<p>However, things get messy when one wants to cross reference across ID camps. While both official and third-party services (e.g., <a href="https://biodbnet-abcc.ncifcrf.gov/">bioDBnet</a> and <a href="https://david.ncifcrf.gov/home.jsp">DAVID</a>) exist, they don’t guarantee sequence identity match. In this post, I will focus on the protein sequence. For example, bioDBnet says <code>ENSP00000368632</code> can be mapped to (note the lack of version):</p>
<ul>
<li>RefSeq: NP_002042, NP_001002295, …</li>
<li>UniProt: P23771, …</li>
</ul>
<p>But when requiring their sequences to be identical, ENSP00000368632.3 should only match NP_001002295.1 in RefSeq, and the non-canonical form of UniProt P23771-2 (sequence version 1). The other IDs don’t have the exact same protein sequence because of a 1 aa deletion. To complicate things further, many mappings don’t handle ID versions, and sequences often change across ID versions. Without tracking the versioned ID, it’s impossible to say which IDs have the same sequence.</p>
<p>Let me be clear that there is nothing wrong with these services. They’re built so people can map IDs at a high level and increase the number of mappable IDs, which is extremely useful in its own way. In fact, a lot of IDs simply cannot be mapped when considering exact sequence identity.</p>
<p>For transcripts the situation isn’t any better because many mapped transcripts of different ID camps have different UTR sequences. I will probably write another post to touch on transcript ID mapping. The short answer is that the <a href="https://www.ncbi.nlm.nih.gov/refseq/MANE/">MANE</a> project started by RefSeq and Ensembl is working on the problem.</p>
<h3 id="why-mapping-with-exact-sequence-identity">Why mapping with exact sequence identity?</h3>
<p>Well, when do I need ID mapping with exact protein sequence identity? My use case is to map post-translational modifications (PTMs) from RefSeq to UniProt. CPTAC detected loads of PTMs (mostly phosphosites) using RefSeq as the peptide spectral library (peptide search database). But a lot of what we know about a protein is from UniProt. So I need a reliable way to map a specific amino acid of one RefSeq protein to its UniProt counterpart. I’ve been using existing services for the job, but they are not perfect.</p>
<p>For example, we found a phosphosite NP_001317366.1 (PTPN11) p.Y546. PTPN11 corresponds to Q06124 in UniProt reviewed proteome (the current canonical isoform is Q06124-2 seq. ver3). But you won’t find anything (e.g. antibody) about this site at 546aa because sequences of NP_001317366.1 and Q06124-2 don’t match. This phosphosite actually maps to Q06124-2 p.Y542.</p>
<p>While one can argue that I should re-run the peptide search using UniProt, this solution only works around the problem. The problem comes back when I want to map the Ensembl-based mutations to UniProt. I am also aware that two proteins with different sequences can be biologically different, and we shouldn’t just blindly integrate their annotations. I totally agree, so the integration should be further validated. Due to the nature of shotgun proteomics, as long as the peptide sequence of the PTM site can be found in both proteins, it’s quite possible that the site can be mapped to both. This topic has been on my mind <a href="https://twitter.com/lbwang2/status/1238144323218288643">for a while</a>. I’ll write about it once I figure out the details. Anyway, mapping PTMs between different protein sequences is my next step, and it goes beyond the scope of this post.</p>
<h3 id="uniparc-comes-into-rescue-for-protein-sequence-identity">UniParc comes to the rescue for protein sequence identity</h3>
<blockquote>
<p><a href="https://www.uniprot.org/uniparc/">The UniProt Archive (UniParc)</a> is a comprehensive and non-redundant database that contains most of the publicly available protein sequences in the world. Proteins may exist in different source databases and in multiple copies in the same database. UniParc avoided such redundancy by storing each unique sequence only once and giving it a stable and unique identifier (UPI) making it possible to identify the same protein from different source databases. A UPI is never removed, changed or reassigned. UniParc contains only protein sequences.</p>
<p>(source: <a href="https://www.uniprot.org/help/uniparc">UniParc help page</a>)</p>
</blockquote>
<p>UniParc is a non-redundant archive of protein sequences. The UniParc ID and its sequence are permanently stable, but the cross-references associated with a UniParc entry may change over time. All of these properties make UniParc a perfect identifier to map across ID camps. UniParc IDs can be queried using the <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC64-ISO checksum</a> of the protein sequence.</p>
<p>For example, let’s find the UniParc ID of <code>NP_001317366.1</code>. First, we obtain its protein sequence in FASTA from NCBI:</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span><span class="nb">export</span><span class="w"> </span><span class="nv">refseq_id</span><span class="o">=</span><span class="s2">"NP_001317366.1"</span>
<span class="gp">$ </span>curl<span class="w"> </span>-Lo<span class="w"> </span><span class="s2">"</span><span class="nv">$refseq_id</span><span class="s2">"</span>.fasta<span class="w"> </span><span class="se">\</span>
<span class="w"> </span><span class="s2">"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=</span><span class="nv">$refseq_id</span><span class="s2">&rettype=fasta&retmode=text"</span>
<span class="gp">$ </span>head<span class="w"> </span>-n<span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="s2">"</span><span class="nv">$refseq_id</span><span class="s2">"</span>.fasta
<span class="go">>NP_001317366.1 tyrosine-protein phosphatase non-receptor type 11 isoform 3 [Homo sapiens]</span>
<span class="go">MTSRRWFHPNITGVEAENLLLTRGVDGSFLARPSKSNPGDFTLSVRRNGAVTHIKIQNTGDYYDLYGGEK</span>
<span class="go">FATLAELVQYYMEHHGQLKEKNGDVIELKYPLNCADPTSERWFHGHLSGKEAEKLLTEKGKHGSFLVRES</span>
</code></pre></div>
<p>Then calculate the CRC64 checksum (there are a few packages capable, or use <a href="https://www.ebi.ac.uk/Tools/so/seqcksum/">EBI’s checksum calculator</a>):</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pysam</span> <span class="kn">import</span> <span class="n">FastaFile</span>
<span class="gp">>>> </span><span class="kn">from</span> <span class="nn">crc64iso</span> <span class="kn">import</span> <span class="n">crc64iso</span>
<span class="gp">>>> </span><span class="n">fa</span> <span class="o">=</span> <span class="n">FastaFile</span><span class="p">(</span><span class="s1">'NP_001317366.1.fasta'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">seq</span> <span class="o">=</span> <span class="n">fa</span><span class="o">.</span><span class="n">fetch</span><span class="p">(</span><span class="n">fa</span><span class="o">.</span><span class="n">references</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="gp">>>> </span><span class="n">crc64iso</span><span class="o">.</span><span class="n">crc64</span><span class="p">(</span><span class="n">seq</span><span class="p">)</span>
<span class="go">'37E8BFC7ECA2D03F'</span>
</code></pre></div>
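<p>For readers curious about what the checksum computes, here is a minimal pure-Python sketch of a table-driven CRC64. It assumes the UniProt flavor of CRC64-ISO: the ISO 3309 polynomial in reflected form, an initial value of zero, and no final XOR. The names <code>crc64_iso</code>, <code>POLY</code>, and <code>_TABLE</code> are mine, not from any package, so compare the result against the <code>crc64iso</code> output above before relying on it.</p>

```python
# Sketch of a reflected, table-driven CRC64 (assumed UniProt variant:
# ISO 3309 polynomial, initial value 0, no final XOR).
POLY = 0xD800000000000000  # reflected form of x^64 + x^4 + x^3 + x + 1


def _make_table():
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc64_iso(sequence: str) -> str:
    """Return the checksum as 16 uppercase hex digits."""
    crc = 0
    for byte in sequence.encode("ascii"):
        crc = (crc >> 8) ^ _TABLE[(crc ^ byte) & 0xFF]
    return f"{crc:016X}"
```

<p>Calling <code>crc64_iso(seq)</code> with the sequence read from the FASTA file takes the place of the <code>crc64iso.crc64(seq)</code> call above.</p>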
<p>Searching <code>checksum:37E8BFC7ECA2D03F</code> on UniParc gives a unique entry<sup id="fnref:checksum-collision"><a class="footnote-ref" href="#fn:checksum-collision">1</a></sup> <code>UPI000041C017</code>: <a href="https://www.uniprot.org/uniparc/?query=checksum%3A37E8BFC7ECA2D03F&sort=score&direct=yes">https://www.uniprot.org/uniparc/?query=checksum%3A37E8BFC7ECA2D03F&sort=score&direct=yes</a>.</p>
<figure class="full-img">
<img src="https://blog.liang2.tw/posts/2020/07/id-crossref-exact-protein-uniparc/pics/uniparc_UPI000041C017.png">
<figcaption>UniParc entry <a href="https://www.uniprot.org/uniparc/UPI000041C017">UPI000041C017</a> and all of its human ID cross references with exact sequence identity.</figcaption>
</figure>
<p>All the external IDs listed here have the identical protein sequence to <code>NP_001317366.1</code>, which of course includes itself. UniParc also marks the IDs inactive if they are superseded by a newer version or become obsolete, which is quite useful for data forensics.</p>
<h3 id="programatic-uniparc-access-using-its-xml">Programmatic UniParc access using its XML</h3>
<p>To extract UniParc’s cross references, it’s easiest to parse its XML, which is also easy to download in bulk.
Continuing with <code>UPI000041C017</code> (<code>NP_001317366.1</code>) as the example:</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>curl<span class="w"> </span>-LO<span class="w"> </span>https://www.uniprot.org/uniparc/UPI000041C017.xml
<span class="gp">$ </span>head<span class="w"> </span>-n<span class="w"> </span><span class="m">10</span><span class="w"> </span>UPI000041C017.xml
<span class="go"><?xml version='1.0' encoding='UTF-8'?></span>
<span class="go"><uniparc xmlns="http://uniprot.org/uniparc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniparc http://www.uniprot.org/docs/uniparc.xsd" version="2020_03"></span>
<span class="go"><entry dataset="uniparc"></span>
<span class="go"><accession>UPI000041C017</accession></span>
<span class="go"><dbReference type="UniProtKB/Swiss-Prot" id="Q06124" version_i="2" active="N" version="2" created="2005-12-20" last="2019-11-13"></span>
<span class="go"><property type="NCBI_GI" value="84028248"/></span>
<span class="go"><property type="NCBI_taxonomy_id" value="9606"/></span>
<span class="go"><property type="protein_name" value="Tyrosine-protein phosphatase non-receptor type 11"/></span>
<span class="go"><property type="gene_name" value="PTPN11"/></span>
<span class="go"></dbReference></span>
</code></pre></div>
<p>While I don’t find XML easy to read, I’ve figured out <a href="https://blog.liang2.tw/posts/2018/01/read-uniprotkb-xml/">a way</a> before to parse XMLs as a JSON-like dictionary given its schema. Let’s define a function to flatten the nested structure and only select the information we want:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">parse_dbref</span><span class="p">(</span><span class="n">entry</span><span class="p">):</span>
<span class="w">    </span><span class="sd">"""</span>
<span class="sd">    Parse Ensembl/UniProt/RefSeq IDs of a UniParc entry.</span>
<span class="sd">    Keep both active and inactive IDs.</span>
<span class="sd">    """</span>
    <span class="n">ref_ids</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">db_type</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">"Ensembl"</span><span class="p">,</span> <span class="s2">"UniProt"</span><span class="p">,</span> <span class="s2">"RefSeq"</span><span class="p">]:</span>
        <span class="n">ids</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">entry</span><span class="p">[</span><span class="s1">'dbReference'</span><span class="p">]:</span>
            <span class="k">if</span> <span class="n">d</span><span class="p">[</span><span class="s2">"@type"</span><span class="p">]</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="n">db_type</span><span class="p">):</span>
                <span class="c1"># Skip non-human entries</span>
                <span class="n">ncbi_taxid</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span>
                    <span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="s1">'@value'</span><span class="p">]</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">d</span><span class="p">[</span><span class="s1">'property'</span><span class="p">]</span> <span class="k">if</span> <span class="n">p</span><span class="p">[</span><span class="s1">'@type'</span><span class="p">]</span> <span class="o">==</span> <span class="s1">'NCBI_taxonomy_id'</span><span class="p">),</span>
                    <span class="kc">None</span>
                <span class="p">)</span>
                <span class="k">if</span> <span class="n">ncbi_taxid</span> <span class="o">!=</span> <span class="s1">'9606'</span><span class="p">:</span>
                    <span class="k">continue</span>
                <span class="c1"># Make versioned ID</span>
                <span class="k">if</span> <span class="s1">'@version'</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">d</span><span class="p">:</span>
                    <span class="c1"># Use the UniParc internal version (for UniProt)</span>
                    <span class="n">id_str</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">d</span><span class="p">[</span><span class="s1">'@id'</span><span class="p">]</span><span class="si">}</span><span class="s2">.</span><span class="si">{</span><span class="n">d</span><span class="p">[</span><span class="s1">'@version_i'</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span>
                <span class="k">else</span><span class="p">:</span>
                    <span class="n">id_str</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">d</span><span class="p">[</span><span class="s1">'@id'</span><span class="p">]</span><span class="si">}</span><span class="s2">.</span><span class="si">{</span><span class="n">d</span><span class="p">[</span><span class="s1">'@version'</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span>
                <span class="n">ids</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">id_str</span><span class="p">)</span>
        <span class="n">ref_ids</span><span class="p">[</span><span class="n">db_type</span><span class="p">]</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">ids</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">ref_ids</span>
</code></pre></div>
<p><a href="https://pypi.org/project/xmlschema/">xmlschema</a> makes it really easy to parse an XML file with a schema:</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="kn">import</span> <span class="nn">xmlschema</span>
<span class="gp">>>> </span><span class="kn">from</span> <span class="nn">pprint</span> <span class="kn">import</span> <span class="n">pprint</span>
<span class="gp">>>> </span><span class="n">xs</span> <span class="o">=</span> <span class="n">xmlschema</span><span class="o">.</span><span class="n">XMLSchema</span><span class="p">(</span><span class="s1">'https://www.uniprot.org/docs/uniparc.xsd'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">data</span><span class="p">,</span> <span class="n">errors</span> <span class="o">=</span> <span class="n">xs</span><span class="o">.</span><span class="n">to_dict</span><span class="p">(</span><span class="s1">'UPI000041C017.xml'</span><span class="p">,</span> <span class="n">validation</span><span class="o">=</span><span class="s1">'lax'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">pprint</span><span class="p">(</span><span class="n">parse_dbref</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s1">'entry'</span><span class="p">][</span><span class="mi">0</span><span class="p">]))</span>
<span class="go">{'Ensembl': ['ENSP00000489597.1'],</span>
<span class="go"> 'RefSeq': ['NP_001317366.1', 'XP_006719589.1'],</span>
<span class="go"> 'UniProt': ['Q06124-1.1', 'Q06124.2']}</span>
</code></pre></div>
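<p>If installing xmlschema is not an option, roughly the same extraction can be sketched with the standard library’s <code>xml.etree.ElementTree</code>. This is a different technique from the schema-based parsing above, not a drop-in replacement. The inline <code>SAMPLE_XML</code> is a trimmed copy of the XML shown earlier plus one made-up non-human record to exercise the filter, and the function name <code>human_xrefs</code> is hypothetical.</p>

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the <uniparc> root element
NS = {"up": "http://uniprot.org/uniparc"}

# Trimmed sample. The second dbReference is a made-up, non-human record
# that only exists here to illustrate the taxonomy filter.
SAMPLE_XML = """<uniparc xmlns="http://uniprot.org/uniparc" version="2020_03">
<entry dataset="uniparc">
<accession>UPI000041C017</accession>
<dbReference type="UniProtKB/Swiss-Prot" id="Q06124" version_i="2" active="N" version="2">
<property type="NCBI_taxonomy_id" value="9606"/>
<property type="gene_name" value="PTPN11"/>
</dbReference>
<dbReference type="RefSeq" id="XP_000000000" version_i="1" active="Y" version="1">
<property type="NCBI_taxonomy_id" value="10090"/>
</dbReference>
</entry>
</uniparc>
"""


def human_xrefs(xml_text, taxid="9606"):
    """Return (database, versioned ID, is_active) for each matching xref."""
    root = ET.fromstring(xml_text)
    rows = []
    for ref in root.iterfind("up:entry/up:dbReference", NS):
        props = {
            p.get("type"): p.get("value")
            for p in ref.iterfind("up:property", NS)
        }
        if props.get("NCBI_taxonomy_id") != taxid:
            continue  # skip entries from other species
        # Fall back to UniParc's internal version when no version is given
        version = ref.get("version") or ref.get("version_i")
        rows.append(
            (ref.get("type"), f"{ref.get('id')}.{version}", ref.get("active") == "Y")
        )
    return rows


rows = human_xrefs(SAMPLE_XML)
```

<p>Unlike <code>parse_dbref</code>, this sketch keeps the active flag next to each ID, which is handy when only the current IDs are wanted.</p>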
<p><em>Voilà</em>, we can now map across ID camps with confidence!</p>
<p>This method can be applied to a large number of queries efficiently. By reading in a FASTA of protein sequences of interest, we can build the URL to the UniParc XML of each protein entry using its checksum, and pass the URLs to <a href="https://aria2.github.io/">aria2</a> as an input file <code>xml.links</code>:</p>
<div class="highlight"><pre><span></span><code>https://www.uniprot.org/uniparc/?query=checksum%3A{crc64_checksum}&format=xml
  out={protein_id}.xml
...
</code></pre></div>
<p>And download all the links in batch:</p>
<div class="highlight"><pre><span></span><code>aria2c<span class="w"> </span>-c<span class="w"> </span>-j5<span class="w"> </span>--max-overall-download-limit<span class="o">=</span>10M<span class="w"> </span>-i<span class="w"> </span>xml.links
</code></pre></div>
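<p>The input file itself can be generated with a few lines of Python. The sketch below is one way to do it under some assumptions: the helper names <code>read_fasta</code> and <code>aria2_input</code> are hypothetical, the first word of each FASTA header is taken as the protein ID, and <code>checksum</code> is any function that turns a sequence into its CRC64 hex string (for example <code>crc64iso.crc64</code> used earlier). The URL template is the one shown above.</p>

```python
URL_TEMPLATE = (
    "https://www.uniprot.org/uniparc/?query=checksum%3A{checksum}&format=xml"
)


def read_fasta(lines):
    """Yield (protein_id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the ID is the first word of the header line
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def aria2_input(records, checksum):
    """Format (protein_id, sequence) pairs as an aria2 input file."""
    lines = []
    for protein_id, seq in records:
        lines.append(URL_TEMPLATE.format(checksum=checksum(seq)))
        # aria2 option lines must start with whitespace
        lines.append(f"  out={protein_id}.xml")
    return "\n".join(lines) + "\n"
```

<p>For example, <code>aria2_input(read_fasta(open("proteins.fasta")), crc64iso.crc64)</code> returns the text to write to <code>xml.links</code>, where <code>proteins.fasta</code> is a placeholder file name.</p>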
<h3 id="summary">Summary</h3>
<p>We solve the ID mapping with exact protein sequence identity between Ensembl/RefSeq/UniProt camps through UniParc.</p>
<p>Note that the version of UniProt entries is a bit confusing. For example, <code>Q06124.2</code> means sequence version 2 of <code>Q06124</code>. But finding UniProt’s sequence version is not that straightforward, and UniProt isoforms, unlike the canonical isoform, lack version tracking. As a result, when processing UniProt-associated annotations, I always add UniParc IDs or keep the protein sequence for future reference.</p>
<p>RefSeq and Ensembl protein IDs are always versioned. Thus it’s highly recommended to keep the versioned ID for these two systems.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:checksum-collision">
<p>It’s possible that two different protein sequences have the same checksum, though very unlikely.
So always double check the results when doing this in batch. <a class="footnote-backref" href="#fnref:checksum-collision" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>Identify the Ensembl release from versioned IDs2020-06-08T00:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2020-06-08:/posts/2020/06/identify-ensembl-release-ver/<p>I often received data that was annotated by an unknown Ensembl release/version.</p>
<p>It could be the Ensembl IDs in a gene expression matrix, a VEP annotated MAF file, or even a customized GTF. The documentation of those files wasn’t always clear about the annotation in use. However, it …</p><p>I often received data that was annotated by an unknown Ensembl release/version.</p>
<p>It could be the Ensembl IDs in a gene expression matrix, a VEP annotated MAF file, or even a customized GTF. The documentation of those files wasn’t always clear about the annotation in use. However, it’s sometimes necessary to know the exact Ensembl release. Say, I want to reproduce the result, or to pass the output to an extended downstream workflow. While the tiny differences between adjacent releases are only annoying, <a href="https://betatark.ensembl.org/web/statistics/">thousands of changes per release</a> quickly add up to obvious inconsistencies when using releases from different years.</p>
<p>It’s possible to pinpoint the Ensembl release using the ID versions. For example, ENSG00000119772.15 (DNMT3A) only existed in Ensembl releases 79 and 80; ENST00000275493.7 (EGFR) has been current since release 96. By checking the ID history on <a href="https://ensembl.org">https://ensembl.org</a>, I can identify the possible Ensembl releases my data uses. A handy URL shortcut to accompany the investigation is <code>ensembl.org/id/<ensembl_id></code>, which redirects to a different page tab depending on the ID type (e.g., ENSG to genes and ENST to transcripts).</p>
<figure>
<img src="https://blog.liang2.tw/posts/2020/06/identify-ensembl-release-ver/pics/ensembl_r100_dnmt3A_id_history.png">
<figcaption>ID history of ENSG00000119772 (DNMT3A) (<a href="https://www.ensembl.org/Homo_sapiens/Gene/Idhistory?g=ENSG00000119772">source</a>)</figcaption>
</figure>
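<p>When checking many IDs by hand, it helps to split each versioned ID into its parts and build the shortcut URL from it. Here is a small sketch for human IDs only; the regular expression, the helper names, and the choice to put the unversioned stable ID in the URL are my assumptions rather than anything Ensembl documents.</p>

```python
import re

# Human Ensembl IDs: ENSG/ENST/ENSP + 11 digits + ".version"
ID_PATTERN = re.compile(r"^(ENS([GTP])\d{11})\.(\d+)$")
KINDS = {"G": "gene", "T": "transcript", "P": "protein"}


def parse_versioned_id(versioned_id):
    """Split e.g. 'ENSG00000119772.15' into (stable ID, kind, version)."""
    match = ID_PATTERN.match(versioned_id)
    if match is None:
        raise ValueError(f"Not a versioned human Ensembl ID: {versioned_id!r}")
    stable_id, kind, version = match.groups()
    return stable_id, KINDS[kind], int(version)


def id_url(versioned_id):
    """Build the ensembl.org/id/ shortcut URL (assumes the unversioned ID)."""
    stable_id, _, _ = parse_versioned_id(versioned_id)
    return f"https://ensembl.org/id/{stable_id}"
```

<p>The parser raises <code>ValueError</code> for IDs without a version, which doubles as a quick check that the data really carries versioned IDs.</p>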
<p>With <a href="https://betatark.ensembl.org/">Ensembl Tark</a>, I wrote <a href="https://gist.github.com/ccwang002/829a5420a47adfb3be597ed3ea8a0a29">a Python script</a> to automate the query. Given a list of versioned Ensembl IDs<sup id="fnref:versioned-id"><a class="footnote-ref" href="#fn:versioned-id">1</a></sup>, the script will identify the possible Ensembl releases. The IDs can be genes, transcripts, or proteins. Based on my testing, the script can identify the Ensembl release range with fewer than 30 IDs.</p>
<p>Here is an example list of IDs:</p>
<div class="highlight"><pre><span></span><code>ENST00000426449.4
ENST00000434817.4
ENSP00000222254.6
ENSP00000477864.1
ENSG00000170266.14
ENSG00000238009.5
ENSG00000173705.7
</code></pre></div>
<p>And the output of the script:</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>python<span class="w"> </span>check_possible_ensembl_releases.py<span class="w"> </span>ensembl_ids.list
<span class="go">[2020-06-09 13:00:48][INFO ] Querying for 7 Ensembl IDs</span>
<span class="go">[2020-06-09 13:00:49][INFO ] Only Ensembl releases 75–99 are supported by Ensembl Tark. IDs outside the range may not be identified.</span>
<span class="go">ENSP00000477864.1 in Ensembl releases 76–99</span>
<span class="go">ENSG00000173705.7 in Ensembl releases 79–80</span>
<span class="go">ENSG00000238009.5 in Ensembl releases 79–80</span>
<span class="go">ENST00000434817.4 in Ensembl releases 79–80</span>
<span class="go">ENSP00000222254.6 in Ensembl releases 75–99</span>
<span class="go">ENSG00000170266.14 in Ensembl releases 79–80</span>
<span class="go">ENST00000426449.4 in Ensembl releases 79–80</span>
<span class="go">Possible Ensembl releases are: 79, 80</span>
</code></pre></div>
<p>In this case, Ensembl releases 79 and 80 have the same gene model.</p>
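<p>The last line of the output is simply the overlap of the per-ID release ranges. As a sketch of that step (not the script’s actual code), the ranges printed above can be intersected like this, assuming each ID version lives in one contiguous range of releases; <code>possible_releases</code> is a hypothetical name.</p>

```python
# Inclusive (first, last) release ranges copied from the output above
ranges = {
    "ENSP00000477864.1": (76, 99),
    "ENSG00000173705.7": (79, 80),
    "ENSG00000238009.5": (79, 80),
    "ENST00000434817.4": (79, 80),
    "ENSP00000222254.6": (75, 99),
    "ENSG00000170266.14": (79, 80),
    "ENST00000426449.4": (79, 80),
}


def possible_releases(ranges):
    """Return the releases shared by every ID, or [] if there is none."""
    if not ranges:
        return []
    first = max(lo for lo, hi in ranges.values())
    last = min(hi for lo, hi in ranges.values())
    return list(range(first, last + 1))


print(possible_releases(ranges))  # [79, 80]
```

<p>An empty result means the IDs cannot all come from the same release, which usually points to mixed annotation sources.</p>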
<p>The script uses Tark’s REST APIs to get the release range for each ID, and finds the intersection of all the ranges. It runs on Python 3.8 and uses aiohttp 3.6 for concurrent API calls. Tark currently has records from release 75 (2014) to 99 (2020), so this approach will fail if the IDs are too old (or too new, but I think it will include the latest r100 soon). I limited the number of concurrent calls to at most 5 so I don’t overwhelm the Tark service.</p>
<p><a href="https://betatark.ensembl.org/">Tark</a> is a great website/service that compares the transcripts between different versions or even different sources (Ensembl vs RefSeq)! It can tell you which exon or UTR was changed, something quite tricky to set up oneself because one has to import databases for every Ensembl release and even RefSeq releases. Tark is currently in beta, but I hope it becomes stable and stays updated.</p>
<p>This is yet another reason why it’s usually a good idea to include versioned IDs in the data.</p>
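<p>Capping the number of in-flight requests is a common asyncio pattern. The sketch below shows one way to do it with <code>asyncio.Semaphore</code>; it is not the script’s actual code, and <code>fake_fetch</code> merely stands in for a real API call so that no Tark endpoint has to be assumed.</p>

```python
import asyncio


async def gather_limited(ids, fetch, limit=5):
    """Run fetch(id) for every ID with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(one_id):
        async with semaphore:
            return await fetch(one_id)

    # gather() returns the results in the same order as the input IDs
    return await asyncio.gather(*(worker(i) for i in ids))


async def _demo():
    active = peak = 0

    async def fake_fetch(ensembl_id):
        # Stand-in for an HTTP request; tracks how many calls overlap
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return ensembl_id

    ids = [f"ID{i}" for i in range(20)]
    results = await gather_limited(ids, fake_fetch)
    return ids, results, peak


ids, results, peak = asyncio.run(_demo())
```

<p>With aiohttp, <code>fetch</code> would be a coroutine that issues the request through a shared client session; the semaphore logic stays the same.</p>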
<p>Now I can go back digging the history of my data :)</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:versioned-id">
<p>It’s probably possible to apply the same approach without Ensembl ID versions (e.g., ENSG00000119772), but one might need a lot of them because unversioned IDs are much more stable. <a class="footnote-backref" href="#fnref:versioned-id" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>Generate Venn diagrams easily2019-04-20T00:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2019-04-20:/posts/2019/04/generate-venn-svg/<p>I find myself generating Venn diagrams quite often. While there are many Venn diagram plotting libraries available, they don’t always fit my needs. My inputs to the diagram are the set sizes rather than lists of observations. And after drawing a Venn diagram, I often edit it to integrate with other figures, so I prefer a vector format like SVG, which not all the libraries offer.</p>
<p>So I made <a href="https://observablehq.com/@ccwang002/simple-venn-diagram-generator">an Observable Notebook</a> that allows me to interactively modify the Venn diagram and download the output as an SVG file. It’s built on the <a href="https://github.com/benfred/venn.js/">venn.js</a> library, which does all the heavy lifting.</p>
<p>Here is the screenshot of the diagram drawing interface (see it live on <a href="https://observablehq.com/@ccwang002/simple-venn-diagram-generator">the notebook</a>):</p>
<figure>
<img src="https://blog.liang2.tw/posts/2019/04/generate-venn-svg/pics/venn_nb.png">
<figcaption>Screenshot of the Observable Notebook</figcaption>
</figure>
<p>The set sizes can be easily tweaked by editing the <code>sets</code> variable. The colors of the two sets can be configured by clicking on the color blocks. There is a button to download the generated Venn diagram. Finally, every change is reflected interactively on the diagram.</p>
<p>What’s even cooler about the Observable Notebook is that I can simply modify the code to change the output. If I want to add a new set to make a three-way Venn diagram, I just need to update the <code>sets</code>. For example, copy and paste the following into the notebook:</p>
<div class="highlight"><pre><span></span><code><span class="nx">sets</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span><span class="nx">sets</span><span class="o">:</span><span class="w"> </span><span class="p">[</span><span class="mf">0</span><span class="p">],</span><span class="w"> </span><span class="nx">label</span><span class="o">:</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="nx">size</span><span class="o">:</span><span class="w"> </span><span class="mf">1700</span><span class="p">,</span><span class="w"> </span><span class="nx">fill</span><span class="o">:</span><span class="w"> </span><span class="nx">set0_color</span><span class="p">},</span>
<span class="w"> </span><span class="p">{</span><span class="nx">sets</span><span class="o">:</span><span class="w"> </span><span class="p">[</span><span class="mf">1</span><span class="p">],</span><span class="w"> </span><span class="nx">label</span><span class="o">:</span><span class="w"> </span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="nx">size</span><span class="o">:</span><span class="w"> </span><span class="mf">1350</span><span class="p">,</span><span class="w"> </span><span class="nx">fill</span><span class="o">:</span><span class="w"> </span><span class="nx">set1_color</span><span class="p">},</span>
<span class="w"> </span><span class="p">{</span><span class="nx">sets</span><span class="o">:</span><span class="w"> </span><span class="p">[</span><span class="mf">2</span><span class="p">],</span><span class="w"> </span><span class="nx">label</span><span class="o">:</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="nx">size</span><span class="o">:</span><span class="w"> </span><span class="mf">700</span><span class="p">,</span><span class="w"> </span><span class="nx">fill</span><span class="o">:</span><span class="w"> </span><span class="s1">'green'</span><span class="p">},</span>
<span class="w"> </span><span class="p">{</span><span class="nx">sets</span><span class="o">:</span><span class="w"> </span><span class="p">[</span><span class="mf">0</span><span class="p">,</span><span class="w"> </span><span class="mf">1</span><span class="p">],</span><span class="w"> </span><span class="nx">size</span><span class="o">:</span><span class="w"> </span><span class="mf">1200</span><span class="p">},</span>
<span class="w"> </span><span class="p">{</span><span class="nx">sets</span><span class="o">:</span><span class="w"> </span><span class="p">[</span><span class="mf">0</span><span class="p">,</span><span class="w"> </span><span class="mf">2</span><span class="p">],</span><span class="w"> </span><span class="nx">size</span><span class="o">:</span><span class="w"> </span><span class="mf">500</span><span class="p">},</span>
<span class="w"> </span><span class="p">{</span><span class="nx">sets</span><span class="o">:</span><span class="w"> </span><span class="p">[</span><span class="mf">1</span><span class="p">,</span><span class="w"> </span><span class="mf">2</span><span class="p">],</span><span class="w"> </span><span class="nx">size</span><span class="o">:</span><span class="w"> </span><span class="mf">450</span><span class="p">},</span>
<span class="w"> </span><span class="p">{</span><span class="nx">sets</span><span class="o">:</span><span class="w"> </span><span class="p">[</span><span class="mf">0</span><span class="p">,</span><span class="w"> </span><span class="mf">1</span><span class="p">,</span><span class="w"> </span><span class="mf">2</span><span class="p">],</span><span class="w"> </span><span class="nx">size</span><span class="o">:</span><span class="w"> </span><span class="mf">350</span><span class="p">}</span>
<span class="p">]</span>
</code></pre></div>
<p>And I will get the following new Venn diagram in SVG:</p>
<figure>
<img src="https://blog.liang2.tw/posts/2019/04/generate-venn-svg/pics/threeway_venn.svg">
</figure>
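<p>Since the inputs are set sizes rather than observations, it helps to check that the numbers are consistent before drawing. Here is a minimal Python sketch of that check, assuming each <code>size</code> is the full size of that set or intersection (so the A∩B entry also counts the elements shared with C). It applies inclusion-exclusion to get the size of each exclusive region; the function name is mine and is not part of venn.js.</p>

```python
# Sizes from the three-way example above, keyed by the set indices involved.
sizes = {
    frozenset({0}): 1700,
    frozenset({1}): 1350,
    frozenset({2}): 700,
    frozenset({0, 1}): 1200,
    frozenset({0, 2}): 500,
    frozenset({1, 2}): 450,
    frozenset({0, 1, 2}): 350,
}


def exclusive_region_sizes(sizes):
    """Size of each region that belongs to exactly the given sets.

    Inclusion-exclusion: add or subtract every listed superset, with the
    sign decided by how many extra sets the superset contains.
    Assumes every non-empty intersection is listed in `sizes`.
    """
    regions = {}
    for region in sizes:
        regions[region] = sum(
            (-1) ** (len(other) - len(region)) * size
            for other, size in sizes.items()
            if region <= other
        )
    return regions


regions = exclusive_region_sizes(sizes)
union_size = sum(regions.values())
for region, count in sorted(regions.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(region), count)
print("union:", union_size)
```

<p>A negative region size would mean the inputs cannot describe real sets. For the example above, all regions are non-negative and add up to 1950 elements.</p>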
<p>I hope I will now spend less time figuring out how to draw a Venn diagram.</p>Store GDC genome as a Seqinfo object2019-02-26T00:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2019-02-26:/posts/2019/02/gdc-seqinfo/<p>Genomic Data Commons (GDC) hosted by NCI is the place to harmonize past and future genomic data, such as TCGA, TARGET, and CPTAC projects. GDC has its own genome reference, <a href="https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files"><code>GRCh38.d1.vd1</code></a>, which has 2,779 “chromosomes” including decoys and virus sequences. That said, the canonical chromosomes of GRCh38.d1.vd1 (e.g., chr1 to chr22, chrM, chrX, and chrY) are identical to those of hg38 and GRCh38, so the three genome references can be used interchangeably for these chromosomes.</p>
<p>Anyway, I was trying to correctly store the full GRCh38.d1.vd1 genome information in the <code>GRanges</code> and <code>GRangesList</code> R objects, which can be done by creating a <code>Seqinfo</code> object representing all its chromosomes. It was also fun to get familiar with the genomic data structures in R. </p>
<h3 id="build-gdcs-seqinfo">Build GDC’s Seqinfo</h3>
<p>First, we need the length and the name of all chromosomes in GRCh38.d1.vd1. I used samtools to extract the information as a <code>.dict</code> file from the genome reference FASTA file.</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span><span class="w"> </span><span class="nv">GDC_REF_FA_URL</span><span class="o">=</span><span class="s1">'https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834'</span>
curl<span class="w"> </span>-Lo<span class="w"> </span>GRCh38.d1.vd1.fa.tar.gz<span class="w"> </span><span class="nv">$GDC_REF_FA_URL</span>
samtools<span class="w"> </span>dict<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-a<span class="w"> </span><span class="s1">'GRCh38.d1.vd1'</span><span class="w"> </span>-s<span class="w"> </span><span class="s1">'Homo sapiens'</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-u<span class="w"> </span><span class="nv">$GDC_REF_FA_URL</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>GRCh38.d1.vd1.fa.tar.gz<span class="w"> </span>><span class="w"> </span>GRCh38.d1.vd1.dict
head<span class="w"> </span>-n<span class="w"> </span><span class="m">3</span><span class="w"> </span>GRCh38.d1.vd1.dict
<span class="c1"># @HD VN:1.0 SO:unsorted</span>
<span class="c1"># @SQ SN:chr1 LN:248956422 M5:6aef897c3d6ff0c78aff06ac189178dd UR:https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834 AS:GRCh38.d1.vd1 SP:Homo sapiens</span>
<span class="c1"># @SQ SN:chr2 LN:242193529 M5:f98db672eb0993dcfdabafe2a882905c UR:https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834 AS:GRCh38.d1.vd1 SP:Homo sapiens</span>
</code></pre></div>
<p>Seqinfo also requires the information of whether a chromosome is circular. In GDC’s case, the mitochondrial chromosome and all virus sequences are circular. Combining all the information together, we can construct the Seqinfo object describing the genome GRCh38.d1.vd1.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">GenomeInfoDb</span><span class="p">)</span>
<span class="n">gdc_simple_tbl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">read_tsv</span><span class="p">(</span>
<span class="w"> </span><span class="s">'./GRCh38.d1.vd1.dict'</span><span class="p">,</span><span class="w"> </span>
<span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="c1"># Skip the first line (@HD ...)</span>
<span class="w"> </span><span class="n">col_names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s">'SQ'</span><span class="p">,</span><span class="w"> </span><span class="s">'chrom'</span><span class="p">,</span><span class="w"> </span><span class="s">'length'</span><span class="p">,</span><span class="w"> </span><span class="s">'md5sum'</span><span class="p">,</span><span class="w"> </span><span class="s">'URI'</span><span class="p">,</span><span class="w"> </span><span class="s">'assembly'</span><span class="p">,</span><span class="w"> </span><span class="s">'species'</span><span class="p">)</span>
<span class="p">)</span><span class="w"> </span><span class="o">%>%</span>
<span class="w"> </span><span class="nf">select</span><span class="p">(</span><span class="n">chrom</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span>
<span class="w"> </span><span class="nf">mutate</span><span class="p">(</span><span class="n">chrom</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">str_sub</span><span class="p">(</span><span class="n">chrom</span><span class="p">,</span><span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">),</span><span class="w"> </span>
<span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.integer</span><span class="p">(</span><span class="nf">str_sub</span><span class="p">(</span><span class="n">length</span><span class="p">,</span><span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">)),</span>
<span class="w"> </span><span class="n">circular</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">case_when</span><span class="p">(</span>
<span class="w"> </span><span class="n">chrom</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s">'chrM'</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span>
<span class="w"> </span><span class="n">chrom</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s">'chrEBV'</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span>
<span class="w"> </span><span class="nf">startsWith</span><span class="p">(</span><span class="n">chrom</span><span class="p">,</span><span class="w"> </span><span class="s">'HPV'</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span>
<span class="w"> </span><span class="kc">TRUE</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="kc">FALSE</span>
<span class="w"> </span><span class="p">))</span>
<span class="n">gdc_seqinfo</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">Seqinfo</span><span class="p">(</span>
<span class="w"> </span><span class="n">seqnames</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gdc_simple_tbl</span><span class="o">$</span><span class="n">chrom</span><span class="p">,</span>
<span class="w"> </span><span class="n">seqlengths</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gdc_simple_tbl</span><span class="o">$</span><span class="n">length</span><span class="p">,</span>
<span class="w"> </span><span class="n">isCircular</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gdc_simple_tbl</span><span class="o">$</span><span class="n">circular</span><span class="p">,</span>
<span class="w"> </span><span class="n">genome</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">'GRCh38.d1.vd1'</span>
<span class="p">)</span>
</code></pre></div>
<p>Now we can supply it to any <code>GRanges</code> object derived from GDC’s sequencing data.</p>
<div class="highlight"><pre><span></span><code><span class="gp">> </span><span class="n">gdc_seqinfo</span>
<span class="go">Seqinfo object with 2779 sequences (191 circular) from GRCh38.d1.vd1 genome:</span>
<span class="go"> seqnames seqlengths isCircular genome</span>
<span class="go"> chr1 248956422 FALSE GRCh38.d1.vd1</span>
<span class="go"> chr2 242193529 FALSE GRCh38.d1.vd1</span>
<span class="go"> chr3 198295559 FALSE GRCh38.d1.vd1</span>
<span class="go"> chr4 190214555 FALSE GRCh38.d1.vd1</span>
<span class="go"> chr5 181538259 FALSE GRCh38.d1.vd1</span>
<span class="go"> ... ... ... ...</span>
<span class="go"> HPV-mKN2 7299 TRUE GRCh38.d1.vd1</span>
<span class="go"> HPV-mKN3 7251 TRUE GRCh38.d1.vd1</span>
<span class="go"> HPV-mL55 7177 TRUE GRCh38.d1.vd1</span>
<span class="go"> HPV-mRTRX7 7731 TRUE GRCh38.d1.vd1</span>
<span class="go"> HPV-mSD2 7300 TRUE GRCh38.d1.vd1</span>
</code></pre></div>
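<p>As a cross-check outside R, the same <code>.dict</code> parsing can be sketched in Python. This is only an illustration: the sample lines are trimmed to the <code>SN</code> and <code>LN</code> tags, the names and lengths are taken from the listings above, and the circularity rule mirrors the <code>case_when</code> call in the R code. <code>SeqRecord</code> and <code>parse_dict_lines</code> are names I made up for this sketch.</p>

```python
from typing import Iterable, List, NamedTuple


class SeqRecord(NamedTuple):
    name: str
    length: int
    circular: bool


def is_circular(name: str) -> bool:
    # Same rule as the R code: chrM, chrEBV, and all HPV sequences.
    return name in ("chrM", "chrEBV") or name.startswith("HPV")


def parse_dict_lines(lines: Iterable[str]) -> List[SeqRecord]:
    """Parse the @SQ lines of a sequence dictionary (tab-separated TAG:value)."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@SQ":
            continue  # skip the @HD header line
        # Split on the first colon only, since values such as URLs contain colons.
        tags = dict(field.split(":", 1) for field in fields[1:])
        name = tags["SN"]
        records.append(SeqRecord(name, int(tags["LN"]), is_circular(name)))
    return records


# Trimmed sample lines; a real .dict file also carries M5, UR, AS, and SP tags.
sample = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:chr1\tLN:248956422",
    "@SQ\tSN:chr2\tLN:242193529",
    "@SQ\tSN:HPV-mKN2\tLN:7299",
]
records = parse_dict_lines(sample)
for record in records:
    print(record)
```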
<p>I stored <code>gdc_seqinfo</code> as an RDS file (<a href="https://blog.liang2.tw/posts/2019/02/gdc-seqinfo/results/seqinfo_GRCh38.d1.vd1.rds">link</a> here) so I can re-use it easily.</p>
<div class="highlight"><pre><span></span><code><span class="n">gdc_seqinfo</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">readRDS</span><span class="p">(</span><span class="s">'seqinfo_GRCh38.d1.vd1.rds'</span><span class="p">)</span>
</code></pre></div>Build EnsDb from a local Ensembl MySQL database2019-01-08T00:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2019-01-08:/posts/2019/01/build-ensdb-from-local-mysql/<p>On some occasions, I need to access an older version of the Ensembl human transcripts. For example, the mutation calls generated by the <a href="https://gdc.cancer.gov/">NCI’s Genomic Data Commons</a> pipeline are annotated with Ensembl v84. To programmatically query the Ensembl annotations, I use the EnsDb SQLite database created by <a href="https://bioconductor.org/packages/release/bioc/html/ensembldb.html">ensembldb</a>, which is an R package I enjoy using (see <a href="https://blog.liang2.tw/posts/2017/11/use-ensdb-database-in-python/">my previous post</a> for its usage).</p>
<p>The EnsDbs of the recent versions of Ensembl (v87+) are available on AnnotationHub. However, the older versions are not available, and the published EnsDbs don’t get updated when ensembldb introduces a new feature. For example, newer EnsDbs now include the transcript and gene ID versions (<a href="https://github.com/jotsetung/ensembldb/issues/89">github issue</a>).</p>
<p>In my case, I need to build an EnsDb of Ensembl v84. <a href="https://bioconductor.org/packages/release/bioc/vignettes/ensembldb/inst/doc/ensembldb.html#10_getting_or_building_ensdb_databasespackages">ensembldb’s documentation</a> describes how to build one from the public Ensembl MySQL server. However, this method takes more than a day to complete, so I started to look for other methods. After some trial and error, I managed to create my EnsDb quickly by connecting to a local Ensembl database that I built. Surprisingly, the setup wasn’t difficult at all, and it only took about an hour to build the EnsDb.</p>
<p>Here are my notes on how to create the EnsDb from a local Ensembl MySQL database. I use macOS, but the steps can be easily modified to work on other OSes.</p>
<div class="toc">
<ul>
<li><a href="#ensembl-vm">Ensembl VM</a></li>
<li><a href="#build-a-local-ensembl-v84-mysql-database">Build a local Ensembl v84 MySQL database</a></li>
<li><a href="#build-ensdb-from-the-local-mysql-database">Build EnsDb from the local MySQL database</a></li>
<li><a href="#remove-mysql">Remove MySQL</a></li>
</ul>
</div>
<h3 id="ensembl-vm">Ensembl VM</h3>
<p>To create an EnsDb from an Ensembl MySQL database, we need the Ensembl Perl APIs. The easiest way to set them up is to use an <a href="http://www.ensembl.org/info/data/virtual_machine.html">Ensembl virtual machine</a>. We just need to import the VM image using VirtualBox and install the ensembldb R package inside the VM, and then it is ready to build the EnsDb. I recommend giving the VM more memory than the default 1GB since more memory helps when building the R packages and the EnsDb.</p>
<h3 id="build-a-local-ensembl-v84-mysql-database">Build a local Ensembl v84 MySQL database</h3>
<p>Ensembl provides <a href="https://www.ensembl.org/info/docs/webcode/mirror/install/ensembl-data.html">the MySQL database dump</a> to allow easy import of their data of any version. Assuming the working directory is <code>~/Documents/Ensembl_MySQL_mirror/</code>, we first copy the database dump by:</p>
<div class="highlight"><pre><span></span><code><span class="nb">cd</span><span class="w"> </span>~/Documents/Ensembl_MySQL_mirror
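<span class="c1"># Download the db dump (rsync creates the homo_sapiens_core_84_38/ subdirectory)</span>
rsync<span class="w"> </span>-a<span class="w"> </span>rsync://ftp.ensembl.org/ensembl/pub/release-84/mysql/homo_sapiens_core_84_38<span class="w"> </span>.
<span class="nb">cd</span><span class="w"> </span>homo_sapiens_core_84_38
<span class="c1"># MySQL doesn't accept compressed db dump files so we decompress them</span>
<span class="c1"># (both the table data and the schema file)</span>
gunzip<span class="w"> </span>*.txt.gz<span class="w"> </span>homo_sapiens_core_84_38.sql.gz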
</code></pre></div>
<p>While downloading the data, we also need to install the MySQL server. I installed <a href="http://www.ensembl.org/info/data/mysql.html">the same or a similar version of MySQL</a> to the one Ensembl is currently using, which is 5.6 at the time of writing. On macOS, Homebrew lets us specify the version of MySQL to install:</p>
<div class="highlight"><pre><span></span><code>brew<span class="w"> </span>install<span class="w"> </span>mysql@5.6
<span class="c1"># And launch the MySQL server</span>
/usr/local/opt/mysql@5.6/bin/mysql.server<span class="w"> </span>start
</code></pre></div>
<p>First we create a database whose name matches the Ensembl version (v84):</p>
<div class="highlight"><pre><span></span><code><span class="k">CREATE</span><span class="w"> </span><span class="k">DATABASE</span><span class="w"> </span><span class="n">homo_sapiens_core_84_38</span><span class="p">;</span>
</code></pre></div>
<p>Then we load the table schema and Ensembl data:</p>
<div class="highlight"><pre><span></span><code>/usr/local/opt/mysql@5.6/bin/mysql<span class="w"> </span>-u<span class="w"> </span>root<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>homo_sapiens_core_84_38<span class="w"> </span><<span class="w"> </span>homo_sapiens_core_84_38.sql
/usr/local/opt/mysql@5.6/bin/mysqlimport<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-u<span class="w"> </span>root<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--fields-terminated-by<span class="o">=</span><span class="s1">'\t'</span><span class="w"> </span>--fields_escaped_by<span class="o">=</span><span class="se">\\</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>homo_sapiens_core_84_38<span class="w"> </span>-L<span class="w"> </span>*.txt
</code></pre></div>
<p>Finally, we modify the MySQL config at <code>/usr/local/etc/my.cnf</code> to accept remote database connections, so our VM can access the database on the host machine. I don’t use MySQL for anything else, so I simply let MySQL bind to all the possible IP addresses my machine has:</p>
<div class="highlight"><pre><span></span><code>[mysqld]
bind-address = *
</code></pre></div>
<p>Note that this is not a secure configuration. To be secure, there should be a designated MySQL user with limited permissions and a stricter connection setting. Restart MySQL to load the new config:</p>
<div class="highlight"><pre><span></span><code>/usr/local/opt/mysql@5.6/bin/mysql.server restart
</code></pre></div>
<p>Write down a (local) IP address of our host machine. The VM will use it to connect to the database.</p>
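<p>Before building the EnsDb, it can save time to confirm from inside the VM that the MySQL port on the host is reachable. Below is a minimal Python sketch of such a check, assuming the default MySQL port 3306. The helper name <code>can_connect</code> is mine, and the demo uses a throwaway local listener as a stand-in for the MySQL server so that it runs anywhere. It only tests the TCP connection, not the MySQL login.</p>

```python
import socket


def can_connect(host: str, port: int = 3306, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Self-contained demo: a temporary local listener stands in for MySQL.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

reachable = can_connect("127.0.0.1", demo_port)
listener.close()
print("reachable:", reachable)
```

<p>In the VM, the call would be <code>can_connect('&lt;our host IP&gt;')</code> with the address written down above.</p>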
<h3 id="build-ensdb-from-the-local-mysql-database">Build EnsDb from the local MySQL database</h3>
<p>Now we can come back to the VM and build the EnsDb v84. Run the following R script to build the EnsDb:</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">ensembldb</span><span class="p">)</span>
<span class="nf">fetchTablesFromEnsembl</span><span class="p">(</span>
<span class="w"> </span><span class="m">84</span><span class="p">,</span><span class="w"> </span><span class="n">species</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"human"</span><span class="p">,</span>
<span class="w"> </span><span class="n">user</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">'root'</span><span class="p">,</span><span class="w"> </span><span class="n">host</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">'<our host IP>'</span><span class="p">,</span><span class="w"> </span><span class="n">port</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3306</span>
<span class="p">)</span>
<span class="n">DBFile</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">makeEnsemblSQLiteFromTables</span><span class="p">()</span>
</code></pre></div>
<p>The EnsDb SQLite database will be available under the working directory. We can test the new EnsDb by:</p>
<div class="highlight"><pre><span></span><code><span class="n">edb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">EnsDb</span><span class="p">(</span><span class="n">DBFile</span><span class="p">)</span>
</code></pre></div>
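<p>Because the EnsDb is a plain SQLite file, it can also be inspected without R. Here is a minimal Python sketch that lists the tables of a SQLite database through <code>sqlite_master</code>. To keep the sketch runnable anywhere, it uses an in-memory stand-in database whose two tables are hypothetical placeholders, not the real EnsDb schema. For a real check, connect to the path of the file built above instead.</p>

```python
import sqlite3


def list_tables(conn: sqlite3.Connection) -> list:
    """Return the names of all tables in a SQLite database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in database; replace ":memory:" with the EnsDb file path for a real check.
conn = sqlite3.connect(":memory:")
# Hypothetical placeholder tables, only here so there is something to list.
conn.execute("CREATE TABLE metadata (name TEXT, value TEXT)")
conn.execute("CREATE TABLE gene (gene_id TEXT)")

tables = list_tables(conn)
conn.close()
print(tables)
```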
<h3 id="remove-mysql">Remove MySQL</h3>
<p>If there is no other need of MySQL, we can uninstall it and remove all its databases by:</p>
<div class="highlight"><pre><span></span><code>brew remove mysql@5.6
rm -rf /usr/local/var/mysql
</code></pre></div>Make Firefox fullscreen borderless on macOS2018-12-13T00:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2018-12-13:/posts/2018/12/firefox-borderless-fullscreen-macos/<p>EDIT 2021-06-01: In Firefox 89+, there’s a default option “Hide Toolbar” in the fullscreen mode that automatically hides the toolbar. So the customization is no longer needed.</p>
<p>Firefox fullscreen on macOS by default contains the address bar and the tab bar. I usually don’t need the full vertical space for web pages, so those bars aren’t a problem. But when I access an RStudio Server in Firefox, I always want to have more vertical space. As shown in the screenshot below, the address bar and the tab bar of Firefox are unnecessary, and they can be quite distracting. If those bars are hidden in fullscreen and only show up on request, the vertical space can be saved and the interface will remain clean.</p>
<figure>
<img src="https://blog.liang2.tw/posts/2018/12/firefox-borderless-fullscreen-macos/pics/rstudio_fullscreen.png">
<p class="caption"></p>
</figure>
<p>It turns out that Firefox controls its user interface styling using CSS. So we can set the shape of the window tabs, the height of the address bar, and more by adding a CSS file at <code>~/Library/Application Support/Firefox/Profiles/<profile>/chrome/userChrome.css</code>. In Firefox 69+, we need to set <code>toolkit.legacyUserProfileCustomizations.stylesheets=true</code> in <code>about:config</code> to enable the CSS styling (<a href="https://www.userchrome.org/how-create-userchrome-css.html">more details here</a>).</p>
<p>My modification was based on <a href="https://apple.stackexchange.com/a/313241">this answer on Stack Exchange</a>:</p>
<div class="highlight"><pre><span></span><code><span class="p">#</span><span class="nn">navigator-toolbox</span><span class="o">[</span><span class="nt">inFullscreen</span><span class="o">]</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">height</span><span class="p">:</span><span class="w"> </span><span class="mf">0.5</span><span class="kt">rem</span><span class="p">;</span>
<span class="w"> </span><span class="k">margin-bottom</span><span class="p">:</span><span class="w"> </span><span class="mf">-0.5</span><span class="kt">rem</span><span class="p">;</span>
<span class="w"> </span><span class="k">opacity</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span>
<span class="w"> </span><span class="k">overflow</span><span class="p">:</span><span class="w"> </span><span class="kc">hidden</span><span class="p">;</span>
<span class="w"> </span><span class="k">z-index</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">#</span><span class="nn">navigator-toolbox</span><span class="o">[</span><span class="nt">inFullscreen</span><span class="o">]</span><span class="p">:</span><span class="nd">hover</span><span class="o">,</span>
<span class="p">#</span><span class="nn">navigator-toolbox</span><span class="o">[</span><span class="nt">inFullscreen</span><span class="o">]</span><span class="p">:</span><span class="nd">focus-within</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="c">/*</span>
<span class="c"> * Add some padding between the navbar and the top screen edge</span>
<span class="c"> * to be more visible while the macOS hidden menu bar shows up.</span>
<span class="c"> * The macOS menubar will hide after a few seconds.</span>
<span class="c"> */</span>
<span class="w"> </span><span class="k">padding-top</span><span class="p">:</span><span class="w"> </span><span class="mf">1.5</span><span class="kt">rem</span><span class="p">;</span>
<span class="w"> </span><span class="k">height</span><span class="p">:</span><span class="w"> </span><span class="kc">auto</span><span class="p">;</span>
<span class="w"> </span><span class="k">margin-bottom</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="kt">rem</span><span class="p">;</span>
<span class="w"> </span><span class="k">opacity</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span>
<span class="w"> </span><span class="k">overflow</span><span class="p">:</span><span class="w"> </span><span class="kc">visible</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div>
<p>Note that Firefox needs to be restarted for the styling to take effect. Here is what Firefox fullscreen looks like after applying the <code>userChrome.css</code> above.</p>
<figure>
<img src="https://blog.liang2.tw/posts/2018/12/firefox-borderless-fullscreen-macos/pics/rstudio_fullscreen.modified.png">
</figure>
<p>Now the RStudio Server web page feels like a native app, similar to what RStudio Desktop offers. Both the address and tab bars are hidden by default, and when the mouse hovers at the top, they become visible again.</p>
<figure>
<video auto autoplay loop>
<source src="https://blog.liang2.tw/posts/2018/12/firefox-borderless-fullscreen-macos/pics/fullscreen_switch_tabs.webm" type="video/webm">
<source src="https://blog.liang2.tw/posts/2018/12/firefox-borderless-fullscreen-macos/pics/fullscreen_switch_tabs.mp4" type="video/mp4">
Your browser doesn't support HTML5 video. You can still download the <a href="https://blog.liang2.tw/posts/2018/12/firefox-borderless-fullscreen-macos/pics/fullscreen_switch_tabs.mp4">screencast</a> and view it locally.
</video>
<figcaption>Switch tabs in the borderless fullscreen of Firefox.</figcaption>
</figure>
<p>Those top bars will show up as well when they are focused by keyboard shortcuts. For example, ⌘ + L focuses the address bar. It is useful when I want to launch a quick search in a new tab.</p>
<figure>
<video controls>
<source src="https://blog.liang2.tw/posts/2018/12/firefox-borderless-fullscreen-macos/pics/fullscreen_focus.webm" type="video/webm">
<source src="https://blog.liang2.tw/posts/2018/12/firefox-borderless-fullscreen-macos/pics/fullscreen_focus.mp4" type="video/mp4">
Your browser doesn't support HTML5 video. You can still download the <a href="https://blog.liang2.tw/posts/2018/12/firefox-borderless-fullscreen-macos/pics/fullscreen_focus.mp4">screencast</a> and view it locally.
</video>
<figcaption>The address bar is shown automatically when it is focused using a keyboard shortcut.</figcaption>
</figure>
<p>In <code>userChrome.css</code>, I added a small padding between the bars and the top border to make them easier to reach with the mouse. When the mouse moves to the top, macOS’s menu bar pops up as well, so the Firefox bars and the macOS menu bar overlap. The macOS one goes away first, but the mouse has to stay on the Firefox bars so they don’t disappear too.</p>
<figure>
<video controls>
<source src="https://blog.liang2.tw/posts/2018/12/firefox-borderless-fullscreen-macos/pics/fullscreen_hover_for_menubar.webm" type="video/webm">
<source src="https://blog.liang2.tw/posts/2018/12/firefox-borderless-fullscreen-macos/pics/fullscreen_hover_for_menubar.mp4" type="video/mp4">
Your browser doesn't support HTML5 video. You can still download the <a href="https://blog.liang2.tw/posts/2018/12/firefox-borderless-fullscreen-macos/pics/fullscreen_hover_for_menubar.mp4">screencast</a> and view it locally.
</video>
</figure>
<p>The overlap between the Firefox bars and the macOS menu bar is still a bit annoying, and navigating between them with the mouse takes some practice. I will probably rely more on keyboard shortcuts instead. Anyway, I now have more vertical space, and the modification of <code>userChrome.css</code> works fine for now.</p>
<p>For more information about modifying the Firefox user interface, there is <a href="https://www.userchrome.org/">a website</a> that introduces <code>userChrome.css</code> in depth.</p>
<h3 id="notes-for-screencast-encoding">Notes for screencast encoding</h3>
<p>I modified the command from my <a href="https://blog.liang2.tw/posts/2016/03/notebook-progress-bar/">previous post</a> to shrink the file size of the original QuickTime screencasts using FFmpeg. More encoding parameters can be found at FFmpeg’s wiki (<a href="https://trac.ffmpeg.org/wiki/Encode/VP9">VP9</a> and <a href="https://trac.ffmpeg.org/wiki/Encode/H.264">H.264</a>).</p>
<div class="highlight"><pre><span></span><code><span class="c1"># VP9 (WEBM)</span>
ffmpeg<span class="w"> </span>-i<span class="w"> </span>fullscreen_switch_tabs.mov<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-vcodec<span class="w"> </span>libvpx-vp9<span class="w"> </span>-b:v<span class="w"> </span>200K<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-pass<span class="w"> </span><span class="m">1</span><span class="w"> </span>-an<span class="w"> </span>-r<span class="w"> </span><span class="m">24</span><span class="w"> </span>-f<span class="w"> </span>webm<span class="w"> </span>/dev/null
ffmpeg<span class="w"> </span>-i<span class="w"> </span>fullscreen_switch_tabs.mov<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-vcodec<span class="w"> </span>libvpx-vp9<span class="w"> </span>-b:v<span class="w"> </span>200K<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-pass<span class="w"> </span><span class="m">2</span><span class="w"> </span>-an<span class="w"> </span>-r<span class="w"> </span><span class="m">24</span><span class="w"> </span>fullscreen_switch_tabs.webm
<span class="c1"># H.264 (MP4)</span>
ffmpeg<span class="w"> </span>-i<span class="w"> </span>fullscreen_switch_tabs.mov<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-vcodec<span class="w"> </span>h264<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-strict<span class="w"> </span>-2<span class="w"> </span>-crf<span class="w"> </span><span class="m">40</span><span class="w"> </span>-preset<span class="w"> </span>slow<span class="w"> </span>-r<span class="w"> </span><span class="m">24</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>fullscreen_switch_tabs.mp4
</code></pre></div>
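<p>Since the same three commands are repeated for every screencast, they can be generated from the input file name. The snippet below is only a sketch of that idea: <code>build_encode_commands</code> is a hypothetical helper of mine, not part of FFmpeg, and it only builds the argument lists using the same flags as above without running anything.</p>

```python
from pathlib import Path


def build_encode_commands(src, bitrate="200K", crf=40, fps=24):
    """Return the ffmpeg argument lists for one screencast (nothing is executed).

    Hypothetical helper: it mirrors the flags of the shell commands above.
    """
    src = Path(src)
    vp9 = ["ffmpeg", "-i", str(src), "-vcodec", "libvpx-vp9", "-b:v", bitrate]
    common = ["-an", "-r", str(fps)]
    return [
        # VP9 pass 1: only collects statistics, so the video output is discarded
        vp9 + ["-pass", "1"] + common + ["-f", "webm", "/dev/null"],
        # VP9 pass 2: writes the actual WebM file
        vp9 + ["-pass", "2"] + common + [str(src.with_suffix(".webm"))],
        # H.264: single pass controlled by the constant rate factor
        ["ffmpeg", "-i", str(src), "-vcodec", "h264", "-strict", "-2",
         "-crf", str(crf), "-preset", "slow", "-r", str(fps),
         str(src.with_suffix(".mp4"))],
    ]


commands = build_encode_commands("fullscreen_switch_tabs.mov")
for cmd in commands:
    print(" ".join(cmd))
```

<p>To actually encode, each list could be passed to <code>subprocess.run(cmd, check=True)</code> in order, since the second VP9 pass depends on the log written by the first one.</p>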
<p>EDIT 2020-06-13: Added extra <code>about:config</code> settings in Firefox 69+ and fixed the styling. I also increased the top margin since the new address bar is taller.</p>GPG Key Transition2018-10-20T00:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2018-10-20:/posts/2018/10/gpg-key-transition-2018/<p>I am transitioning my GPG key again. This time, however, I expect to use the new GPG master key longer and will start building this identity unless there is a concern about the key strength or I accidentally lose the key. </p>
<p>Back in <a href="https://blog.liang2.tw/posts/2016/12/gpg-key-transition-2016/">my GPG key transition in 2016 …</a></p><p>I am transitioning my GPG key again. This time, however, I expect to use the new GPG master key longer and will start building this identity unless there is a concern about the key strength or I accidentally lose the key. </p>
<p>Back in <a href="https://blog.liang2.tw/posts/2016/12/gpg-key-transition-2016/">my GPG key transition in 2016</a>, I created subkeys for daily use and isolated the master key in a secret offline place. I have learned more about PGP over the years. Sadly, though, I still seldom have a chance to use it extensively in my daily life. </p>
<p>This time, I am moving the subkeys to a YubiKey. I found <a href="https://github.com/drduh/YubiKey-Guide">drduh’s guide</a> on GitHub very informative for setting up both the GPG key and the YubiKey, as well as for getting hands-on experience with the various possible applications. Another notable change is that I no longer set an expiration date on my master key.</p>
<p>I will revoke my old keys once the transition is done.</p>
<div class="highlight"><pre><span></span><code>-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
I am transitioning GPG keys from an old 4096-bit RSA key to a new
4096-bit RSA key. The old key will continue to be valid for some
time, but I prefer all new correspondance to be encrypted in the new
key, and will be making all signatures going forward with the new key.
This transition document is signed with both keys to validate the
transition.
The old key, which I am transitioning away from, is:
pub rsa4096/0x44D10E44730992C4 2016-12-04 [SC] [expires: 2018-12-04]
Key fingerprint = 85DF A3EB 72CD DE7D 3F2A 127C 44D1 0E44 7309 92C4
The new key, to which I am transitioning to, is:
pub rsa4096/0x69BAE333BC4DC4BA 2018-10-17 [SC]
Key fingerprint = 978B 49B8 EFB7 02F3 3B3F F2E5 69BA E333 BC4D C4BA
To fetch the full new key from a public key server using GnuPG, run:
gpg --recv-key 0x69BAE333BC4DC4BA
If you have already validated my old key, you can then validate that
the new key is signed by my old key:
gpg --check-sigs 0x69BAE333BC4DC4BA
Please contact me via e-mail at <me@liang2.tw> if you have any
questions about this document or this transition.
Liang-Bo Wang
(liang2, ccwang002)
me@liang2.tw
Oct 20, 2018
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCgAdFiEExQ3ldlbp0O+Zc1vXBS+kcT+kX08FAlvLdGkACgkQBS+kcT+k
X092fw/8DFGIIMHtf3JAt8nGth8Y94oyTgrorqzu7TXCxvUiGk6Qd/WBLiMvh//9
7mVounqtPSGuYHiWm6gtDlT+YRoFoJH3IZ9aboMzd8e/p4TspKhXAsF5bhp6U9ED
HCds+VRoVBGPtO3Ogizl5wpynYp7OzwqgwTteFlJ167mmqL05n/xLHnsvii3UFO4
MWwWMxVmwvEpJINsYOJ+mFxOXeD23ckKt3GQh2NF2Dpa+apEvq5l9vjmk4Vnqau2
WKXygbz1Rm0b629dblV8vU9iwgIsSlXz8oopkETadpkaGk9s/p8AYPupmqfCiwHx
/gYlAgIV0DmePfOjYGZ0RJTVHFnG5ong3kUyoeuAwCMWLC/QQ6pCIL+pRnXOT1/q
oXCTK7HUUZjWpRyHf/ptHpQOeYH+AuYAYJNwQA3xjzv57kkGT/gR3Q4APmA6l5lN
swHVL63QYjcKCNm+fR7pr2ka50WSrerFyLWinAT9G9w+h/31Zcl0cBqEM45nlaye
B5fYaU/G6ehFxWgVEjS7z7FSLQVNJu7aAXVZy9QlKcduNXBUZsDkcDI1iyKVhvl4
41SNhLXMO2Jk69kHtuE803/aVgw8N3Y/E5zjlBNXWSpPMPFeKZl9W6eqnl50kQcd
fKbZGV8bxFl40iBi4mfIrdOXrBQ9Oohp7UVWTHG1qYMruVRmW3aJAjMEAQEKAB0W
IQS66Rw2AWA1VmCbiTP4R473V2p7VwUCW8t0bgAKCRD4R473V2p7V8DqD/9YpnFP
usbOW9p3GwUgbvqPdefLszFZZb5LNsgL9eTSKwMUTn5AGldFquDM3hbjiZ+e/nbD
TwFKbKRt//48R5UTYsYJVxcNVW3CXxMtl+8B5PJORfUbz0/HSEsnKTMlHP1M4ybw
gZCI45sP4wT1prU3ngkZGHJJY9ojNOCrzHA+DVEp+vROn/zyg6AbLcr0+/yjCHO0
pxDbGUgEOV93GFKXl85u7qCXUTrIt2fkeFeEQoh248oBJQHPjD9WyOV/O3QNdT7d
6g7lJcSSpwevtTsWFaWCxRM2IwHlXJiWU/9bA2Jrb8E07mJGlmil7xCe4rPFD69F
/y9MBfMG0KVjAFgs6vAR5zHnN865d18JCunQ4OhY0tDnhi4q3O0OdehGqyX92pwL
hMKtHRHLuhYoDud2kdxizmtov1bHghO0kKSTlDGhZvs9Fpod4MKQHaZx4VbpA8np
OpmPBkX7+34AmnLnJP2GOA4UhsRpX1iyqTaGePjhtA0gqz5285bL02JGx/m7HDtX
MYI4yoc3rdZz37axnRinWmW7Lu5JmQGeVLJZ7Z2b83BEHVnapXPW2kFp8PcqbNnA
Lbb9XLkbHNOaiC4EFm07uFmMVkV6aKW3xV1YIlDeovRfNC0cyZDnUNFqaDGtZAeO
UifqHyDqNjBJX0a8miUMZDOXsZFD2jzm3pjS/Q==
=JRPp
-----END PGP SIGNATURE-----
</code></pre></div>Access gene annotation using gffutils2018-06-22T00:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2018-06-22:/posts/2018/06/gene-annotation-using-gffutils/<p>Recently, I had to access gene annotations in multiple versions from multiple sources such as Ensembl, GENCODE, and UCSC. I used to rely on the R/Bioconductor ecosystem to query the coordinates of a gene annotation. There are existing Bioconductor packages ready for Ensembl and UCSC annotations (more info in …</p><p>Recently, I had to access gene annotations in multiple versions from multiple sources such as Ensembl, GENCODE, and UCSC. I used to rely on the R/Bioconductor ecosystem to query the coordinates of a gene annotation. There are existing Bioconductor packages ready for Ensembl and UCSC annotations (more info in my previous posts: <a href="https://blog.liang2.tw/posts/2016/05/biocondutor-ensembl-reference/">Ensembl</a> and <a href="https://blog.liang2.tw/2016Talk-Genomics-in-R/">UCSC</a>), and one can create a new customized TxDb given a GTF/GFF file. However, the project I was working on was written in Python, so I went on searching for similar alternatives in Python.</p>
<p>That’s how I found <a href="https://daler.github.io/gffutils/">gffutils</a>, a Python package to access gene annotations from GTF/GFF files. <code>gffutils</code> first imports the annotations from the GTF/GFF file into a SQLite database. The package also provides some abstraction on top of the database schema, so users can retrieve an annotation without talking to the database directly using repetitive SQL commands. The database enables fast random access to any gene annotation. </p>
<p>I will use GENCODE v19, an annotation used by many TCGA GRCh37/hg19 projects, as an example to demo the usage of gffutils. My project requires the coordinates of UTRs and exons of all the transcripts in use.</p>
<div class="toc">
<ul>
<li><a href="#usage-example">Usage example</a><ul>
<li><a href="#single-feature-access">Single feature access</a></li>
<li><a href="#gene-model-coordinates-of-a-transcript">Gene model coordinates of a transcript</a></li>
<li><a href="#feature-selection">Feature selection</a></li>
</ul>
</li>
<li><a href="#direct-operation-on-the-database">Direct operation on the database</a></li>
<li><a href="#discussions">Discussions</a></li>
</ul>
</div>
<h3 id="usage-example">Usage example</h3>
<p>To use gffutils to query GENCODE annotation, we need to create the database first. The comprehensive gene annotation GTF can be downloaded from <a href="https://www.gencodegenes.org/releases/19.html">the GENCODE website</a> (<a href="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz">URL to the GTF</a>). The database creation is handled by gffutils’s <code>create_db</code> function. It will take a few minutes to run and the database will be at <code>gencode_v19.db</code>.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">gffutils</span>
<span class="n">db</span> <span class="o">=</span> <span class="n">gffutils</span><span class="o">.</span><span class="n">create_db</span><span class="p">(</span>
<span class="s1">'./gencode.v19.annotation.gtf.gz'</span><span class="p">,</span>
<span class="n">dbfn</span><span class="o">=</span><span class="s1">'gencode_v19.db'</span><span class="p">,</span>
<span class="n">verbose</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">merge_strategy</span><span class="o">=</span><span class="s1">'error'</span><span class="p">,</span>
<span class="n">disable_infer_transcripts</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">disable_infer_genes</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="p">)</span>
<span class="c1"># INFO - Committing changes: 2619000 features</span>
<span class="c1"># INFO - Populating features table and first-order relations: 2619443 features</span>
<span class="c1"># INFO - Creating relations(parent) index</span>
<span class="c1"># INFO - Creating relations(child) index</span>
<span class="c1"># INFO - Creating features(featuretype) index</span>
<span class="c1"># INFO - Creating features (seqid, start, end) index</span>
<span class="c1"># INFO - Creating features (seqid, start, end, strand) index</span>
<span class="c1"># INFO - Running ANALYSE features</span>
</code></pre></div>
<p>Once the database is created, we don’t have to repeat the same process. Instead, we can load the database directly as a FeatureDB object:</p>
<div class="highlight"><pre><span></span><code><span class="n">db</span> <span class="o">=</span> <span class="n">gffutils</span><span class="o">.</span><span class="n">FeatureDB</span><span class="p">(</span><span class="s1">'./gencode_v19.db'</span><span class="p">)</span>
</code></pre></div>
<h4 id="single-feature-access">Single feature access</h4>
<p>One can then access the annotations of a gene or transcript by its ID. Using a transcript of TP53 as an example,</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">gene</span> <span class="o">=</span> <span class="n">db</span><span class="p">[</span><span class="s1">'ENSG00000141510.11'</span><span class="p">];</span> <span class="n">gene</span>
<span class="go"><Feature gene (chr17:7565097-7590856[-]) at 0x7fac828deeb8></span>
<span class="gp">>>> </span><span class="n">tx</span> <span class="o">=</span> <span class="n">db</span><span class="p">[</span><span class="s1">'ENST00000269305.4'</span><span class="p">];</span> <span class="n">tx</span>
<span class="go"><Feature transcript (chr17:7571720-7590856[-]) at 0x7fac828f8080></span>
</code></pre></div>
<p>We can then access the details of the transcript:</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">tx</span><span class="o">.</span><span class="n">featuretype</span><span class="p">,</span> <span class="n">tx</span><span class="o">.</span><span class="n">source</span>
<span class="go">('transcript', 'HAVANA')</span>
<span class="gp">>>> </span><span class="n">tx</span><span class="o">.</span><span class="n">chrom</span><span class="p">,</span> <span class="n">tx</span><span class="o">.</span><span class="n">start</span><span class="p">,</span> <span class="n">tx</span><span class="o">.</span><span class="n">end</span><span class="p">,</span> <span class="n">tx</span><span class="o">.</span><span class="n">strand</span>
<span class="go">('chr17', 7571720, 7590856, '-')</span>
<span class="gp">>>> </span><span class="n">tx</span><span class="o">.</span><span class="n">attributes</span><span class="o">.</span><span class="n">items</span><span class="p">()</span>
<span class="go">[('gene_id', ['ENSG00000141510.11']),</span>
<span class="go"> ('transcript_id', ['ENST00000269305.4']),</span>
<span class="go"> ('gene_type', ['protein_coding']),</span>
<span class="go"> ('gene_status', ['KNOWN']),</span>
<span class="go"> ('gene_name', ['TP53']),</span>
<span class="go"> ('transcript_type', ['protein_coding']),</span>
<span class="go"> ('transcript_status', ['KNOWN']),</span>
<span class="go"> ('transcript_name', ['TP53-001']),</span>
<span class="go"> ('level', ['2']),</span>
<span class="go"> ('protein_id', ['ENSP00000269305.4']),</span>
<span class="go"> ('tag', ['basic', 'appris_principal', 'CCDS']),</span>
<span class="go"> ('ccdsid', ['CCDS11118.1']),</span>
<span class="go"> ('havana_gene', ['OTTHUMG00000162125.4']),</span>
<span class="go"> ('havana_transcript', ['OTTHUMT00000367397.1'])]</span>
</code></pre></div>
<h4 id="gene-model-coordinates-of-a-transcript">Gene model coordinates of a transcript</h4>
<p>To find the coordinates of its exons and UTRs, we use <a href="https://daler.github.io/gffutils/autodocs/gffutils.interface.FeatureDB.children.html#gffutils.interface.FeatureDB.children"><code>FeatureDB.children()</code></a>, which takes a Feature object or its ID and retrieves all the features belonging to this feature. TP53 is on the reverse strand of the chromosome, so we can further sort the features by their end position in descending order:</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="nb">list</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">children</span><span class="p">(</span><span class="n">tx</span><span class="p">,</span> <span class="n">order_by</span><span class="o">=</span><span class="s1">'-end'</span><span class="p">))</span>
<span class="go">[<Feature transcript (chr17:7571720-7590856[-]) at 0x7fac828922e8>,</span>
<span class="go"> <Feature UTR (chr17:7590695-7590856[-]) at 0x7fac82892208>, </span>
<span class="go"> <Feature exon (chr17:7590695-7590856[-]) at 0x7fac828922b0>,</span>
<span class="go"> <Feature UTR (chr17:7579913-7579940[-]) at 0x7fac828925c0>, </span>
<span class="go"> <Feature exon (chr17:7579839-7579940[-]) at 0x7fac828928d0>,</span>
<span class="go"> <Feature CDS (chr17:7579839-7579912[-]) at 0x7fac82892c18>, </span>
<span class="go"> <Feature start_codon (chr17:7579910-7579912[-]) at 0x7fac82892f28>,</span>
<span class="go"> ...</span>
<span class="go"> <Feature CDS (chr17:7572930-7573008[-]) at 0x7fac828277b8>, </span>
<span class="go"> <Feature exon (chr17:7571720-7573008[-]) at 0x7fac82827b38>,</span>
<span class="go"> <Feature UTR (chr17:7571720-7572929[-]) at 0x7fac82827eb8>, </span>
<span class="go"> <Feature stop_codon (chr17:7572927-7572929[-]) at 0x7fac828fca90>]</span>
</code></pre></div>
<p>We have retrieved the UTRs, CDSs and exons of the transcript. Note that a UTR is considered part of an exon in gene annotation terminology, so to get only the regions that will be translated to amino acids, we should use the CDSs instead of the exons. <code>FeatureDB.children()</code> provides a way to subset the feature types it returns:</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="nb">list</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">children</span><span class="p">(</span><span class="n">tx</span><span class="p">,</span> <span class="n">order_by</span><span class="o">=</span><span class="s1">'-end'</span><span class="p">,</span> <span class="n">featuretype</span><span class="o">=</span><span class="p">[</span><span class="s1">'CDS'</span><span class="p">,</span> <span class="s1">'UTR'</span><span class="p">]))</span>
<span class="go">[<Feature UTR (chr17:7590695-7590856[-]) at 0x7fac8283d7f0>,</span>
<span class="go"> <Feature UTR (chr17:7579913-7579940[-]) at 0x7fac8283d710>,</span>
<span class="go"> <Feature CDS (chr17:7579839-7579912[-]) at 0x7fac8283d7b8>,</span>
<span class="go"> ...</span>
<span class="go"> <Feature CDS (chr17:7572930-7573008[-]) at 0x7fac82846470>,</span>
<span class="go"> <Feature UTR (chr17:7571720-7572929[-]) at 0x7fac828467b8>]</span>
</code></pre></div>
<p>Now the gene model of TP53 becomes clearly visible.</p>
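<p>The relation between exon, UTR and CDS can be checked with a bit of arithmetic. This is a small sketch using three coordinates copied from the output above; it assumes the usual GTF convention that coordinates are 1-based and inclusive at both ends, and <code>feature_length</code> is a helper name I made up.</p>

```python
# (feature type, start, end) copied from the TP53 transcript output above.
# Assumption: GTF coordinates are 1-based and both ends are inclusive.
features = [
    ("UTR", 7579913, 7579940),
    ("CDS", 7579839, 7579912),
    ("exon", 7579839, 7579940),
]


def feature_length(start, end):
    """Length in bases of an inclusive [start, end] interval."""
    return end - start + 1


lengths = {ftype: feature_length(start, end) for ftype, start, end in features}
print(lengths)
# {'UTR': 28, 'CDS': 74, 'exon': 102}
```

<p>The UTR and the CDS add up exactly to the exon that contains them (28 + 74 = 102), which is why counting exons alone would overestimate the translated region.</p>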
<h4 id="feature-selection">Feature selection</h4>
<p>To select all the transcripts in the database, there is a <code>FeatureDB.all_features()</code> function. Here we want to select only the basic GENCODE transcripts and count the number of transcripts per gene type:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>
<span class="c1"># All the transcripts of basic GENCODE v19</span>
<span class="n">all_basic_txs</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">tx</span> <span class="k">for</span> <span class="n">tx</span> <span class="ow">in</span> <span class="n">db</span><span class="o">.</span><span class="n">all_features</span><span class="p">(</span><span class="n">featuretype</span><span class="o">=</span><span class="s1">'transcript'</span><span class="p">)</span>
<span class="k">if</span> <span class="s1">'tag'</span> <span class="ow">in</span> <span class="n">tx</span><span class="o">.</span><span class="n">attributes</span> <span class="ow">and</span> <span class="s1">'basic'</span> <span class="ow">in</span> <span class="n">tx</span><span class="o">.</span><span class="n">attributes</span><span class="p">[</span><span class="s1">'tag'</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">Counter</span><span class="p">(</span><span class="n">tx</span><span class="o">.</span><span class="n">attributes</span><span class="p">[</span><span class="s1">'gene_type'</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">tx</span> <span class="ow">in</span> <span class="n">all_basic_txs</span><span class="p">)</span><span class="o">.</span><span class="n">most_common</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
<span class="c1"># [('protein_coding', 67186),</span>
<span class="c1"># ('antisense', 9160),</span>
<span class="c1"># ('lincRNA', 7121),</span>
<span class="c1"># ('miRNA', 3055),</span>
<span class="c1"># ('misc_RNA', 2034)]</span>
</code></pre></div>
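<p>One thing to keep in mind with the snippet above is that <code>all_basic_txs</code> is a generator expression, so it can only be consumed once. Here is a minimal sketch of that behavior. It does not touch gffutils at all: the dictionaries are made-up stand-ins for <code>tx.attributes</code>, and <code>basic_only</code> is a hypothetical helper.</p>

```python
from collections import Counter

# Made-up stand-ins for tx.attributes (values are lists, as in gffutils)
mock_attributes = [
    {"gene_type": ["protein_coding"], "tag": ["basic", "CCDS"]},
    {"gene_type": ["protein_coding"], "tag": ["basic"]},
    {"gene_type": ["lincRNA"], "tag": ["basic"]},
    {"gene_type": ["antisense"]},  # no tag attribute at all
]


def basic_only(attrs_iter):
    """Lazily keep the records tagged as 'basic'."""
    return (a for a in attrs_iter if "basic" in a.get("tag", []))


basic = basic_only(mock_attributes)
first_count = Counter(a["gene_type"][0] for a in basic)
# The generator is exhausted now, so a second pass sees nothing
second_count = Counter(a["gene_type"][0] for a in basic)

print(first_count.most_common())
print(second_count.most_common())
```

<p>If the selected transcripts are needed more than once, wrapping the generator in <code>list()</code> keeps them in memory at the cost of holding all the Feature objects at the same time.</p>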
<h3 id="direct-operation-on-the-database">Direct operation on the database</h3>
<p>Since gffutils is just an abstraction layer on top of the database, we can always talk to the underlying SQLite database directly by writing SQL commands. The database schema is available in <a href="https://daler.github.io/gffutils/database-schema.html">the gffutils documentation</a>. Under the hood, the FeatureDB object maintains a SQLite connection at <code>FeatureDB.conn</code> and a helper function to run a single SQL command via <code>FeatureDB.execute()</code>. </p>
<p>For example, GENCODE stores the full version of a transcript ID, but on many occasions the version is not available. Say we only know the TP53 transcript ID is <code>ENST00000269305</code>, then we can write a SQL query to find the matching ID: </p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">db</span><span class="o">.</span><span class="n">conn</span>
<span class="go"><sqlite3.Connection at 0x7fac89423490></span>
<span class="gp">>>> </span><span class="n">cur</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span>
<span class="gp">... </span> <span class="s2">"SELECT id FROM features "</span>
<span class="gp">... </span> <span class="s2">"WHERE featuretype='transcript' AND id LIKE 'ENST00000269305.%';"</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">cur</span><span class="o">.</span><span class="n">fetchone</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>
<span class="go">'ENST00000269305.4'</span>
</code></pre></div>
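<p>The same lookup can be wrapped into a function with a parameterized query, so the ID does not have to be pasted into the SQL string. The sketch below runs against a toy in-memory table instead of the real database: the table only keeps the <code>id</code> and <code>featuretype</code> columns, one of the rows is made up, and <code>find_versioned_id</code> is a hypothetical helper.</p>

```python
import sqlite3

# Toy stand-in for the gffutils features table (only two of its columns)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT)")
conn.executemany(
    "INSERT INTO features VALUES (?, ?)",
    [
        ("ENSG00000141510.11", "gene"),
        ("ENST00000269305.4", "transcript"),
        ("ENST99999999999.1", "transcript"),  # made-up ID for illustration
    ],
)


def find_versioned_id(conn, unversioned_id, featuretype="transcript"):
    """Return the first versioned ID matching the unversioned one, or None."""
    cur = conn.execute(
        "SELECT id FROM features WHERE featuretype = ? AND id LIKE ?",
        # The trailing '.' stops the pattern from matching longer IDs
        (featuretype, unversioned_id + ".%"),
    )
    row = cur.fetchone()
    return row[0] if row else None


versioned = find_versioned_id(conn, "ENST00000269305")
print(versioned)
# ENST00000269305.4
```

<p>With the real database, the same SQL string and parameters could be run on the <code>db.conn</code> connection shown above.</p>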
<p>We can even tweak the SQLite behavior by setting <a href="https://www.sqlite.org/pragma.html">the <code>PRAGMA</code> statements</a>. gffutils already sets some default pragmas to speed up database queries, trading some database integrity guarantees for speed and using a larger memory cache:</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">db</span><span class="o">.</span><span class="n">pragmas</span>
<span class="go">{'synchronous': 'NORMAL',</span>
<span class="go"> 'journal_mode': 'MEMORY',</span>
<span class="go"> 'main.page_size': 4096,</span>
<span class="go"> 'main.cache_size': 10000}</span>
<span class="gp">>>> </span><span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s1">'PRAGMA temp_store=MEMORY'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s1">'PRAGMA cache_size=-1000000'</span><span class="p">)</span> <span class="c1"># Use 1GB memory</span>
</code></pre></div>
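<p>The negative value of <code>cache_size</code> may look odd. According to the SQLite documentation, a negative N sets the cache to roughly <code>abs(N) * 1024</code> bytes, while a positive N counts pages instead. The following sketch applies the same two pragmas to a throwaway in-memory database (not the gffutils one) and reads them back:</p>

```python
import sqlite3

# Throwaway in-memory database, only used to read the pragma values back
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA temp_store=MEMORY")
conn.execute("PRAGMA cache_size=-1000000")

cache_size = conn.execute("PRAGMA cache_size").fetchone()[0]
# temp_store is reported as a number (0: default, 1: file, 2: memory)
temp_store = conn.execute("PRAGMA temp_store").fetchone()[0]

# A negative cache_size is a memory budget in units of 1024 bytes
cache_bytes = abs(cache_size) * 1024 if cache_size < 0 else None
print(cache_size, temp_store, cache_bytes)
```

<p>So <code>-1000000</code> corresponds to 1,024,000,000 bytes, which is the 1GB mentioned in the comment above. Both settings only apply to the current connection and need to be set again after reconnecting.</p>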
<h3 id="discussions">Discussions</h3>
<p>gffutils provides a SQLite-based gene annotation storage in Python. Though it may not be as feature complete as what users may get in R, it is highly customizable and can be easily integrated with other Python functions. Like the Bioconductor packages GenomicFeatures and EnsDb, gffutils uses a SQLite database under the hood. As shown in <a href="https://blog.liang2.tw/posts/2017/11/use-ensdb-database-in-python/">another post</a>, we can actually connect to those databases built by R packages directly, so users can access information from other sources such as UniProt isoforms and gene names.</p>
<p>In my opinion, all the approaches mentioned above are better than trying to bake one’s own from scratch. Those packages are backed by numerous tests and are built from reliable, often original, data sources. Besides the multiple existing solutions in R and Python, one can always access the databases built by those packages from different languages, so one rarely needs to build something from scratch anyway.</p>Read UniProtKB in XML format2018-01-28T00:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2018-01-28:/posts/2018/01/read-uniprotkb-xml/<p>UniProt Knowledge Base (<a href="http://www.uniprot.org/help/uniprotkb">UniProtKB</a>) provides <a href="https://www.uniprot.org/help/programmatic_access">various methods</a> to access their data. I settled on their XML format since no additional parsing code is required and the format is well defined, which comes with a schema. Plus, it turns out that databases such as <a href="http://pdbml.pdb.org/">PDB</a> also provide their data export in …</p><p>UniProt Knowledge Base (<a href="http://www.uniprot.org/help/uniprotkb">UniProtKB</a>) provides <a href="https://www.uniprot.org/help/programmatic_access">various methods</a> to access their data. I settled on their XML format since no additional parsing code is required and the format is well defined, which comes with a schema. Plus, it turns out that databases such as <a href="http://pdbml.pdb.org/">PDB</a> also provide their data export in XML format and the corresponding schema so the method can be applied elsewhere. </p>
<p>Here I will show how to read XML with its schema in Python using <a href="https://pypi.org/project/xmlschema/">xmlschema</a>.</p>
<div class="toc">
<ul>
<li><a href="#other-ways-to-read-uniprotkb-data-in-bulk">Other ways to read UniProtKB data in bulk</a></li>
<li><a href="#xml-and-xml-schema">XML and XML schema</a></li>
<li><a href="#read-uniprot-xml-by-xmlschema">Read UniProt XML by xmlschema</a></li>
<li><a href="#summary">Summary</a></li>
</ul>
</div>
<h2 id="other-ways-to-read-uniprotkb-data-in-bulk">Other ways to read UniProtKB data in bulk</h2>
<p>UniProtKB provides at least REST, SPARQL, XML, and flat text files for data access. </p>
<p>RESTful APIs work very well for accessing a small proportion of the data and are usually my way to go, but they would put too much load on the server if I wanted a lot of information from tens of thousands of entries. Ideally, UniProtKB’s data won’t change very often, so I’d like to hit the database once per entry and cache the results locally. </p>
<p><a href="https://en.wikipedia.org/wiki/SPARQL">SPARQL</a> is kind of similar to REST but can directly query on UniProtKB’s <a href="https://en.wikipedia.org/wiki/Resource_Description_Framework">RDF</a> file, thus one can retrieve whatever information available in a complex way. I started my research on this method but I got overwhelmed by the technical details and eventually gave up. I feel like more tutorials or examples on how to access the SPARQL interface will be very helpful.</p>
<p>UniProtKB’s flat text file has been a popular way to parse its data. I mean, it has <a href="https://www.uniprot.org/docs/userman.htm">its own manual</a>, and one can download a full entry’s data easily. But this requires writing a custom parser in Python. More code means more bugs, and I will worry about whether my parser works every time UniProt updates.</p>
<h2 id="xml-and-xml-schema">XML and XML schema</h2>
<p>XML data are structured. For example, this is what entry <a href="https://www.uniprot.org/uniprot/P51587">P51587</a> looks like in XML format:</p>
<div class="highlight"><pre><span></span><code><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="nt"><uniprot</span><span class="w"> </span><span class="na">xmlns=</span><span class="s">"http://uniprot.org/uniprot"</span><span class="w"> </span><span class="na">xmlns:xsi=</span><span class="s">"http://www.w3.org/2001/XMLSchema-instance"</span><span class="w"> </span><span class="na">xsi:schemaLocation=</span><span class="s">"http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd"</span><span class="nt">></span>
<span class="nt"><entry</span><span class="w"> </span><span class="na">dataset=</span><span class="s">"Swiss-Prot"</span><span class="w"> </span><span class="na">created=</span><span class="s">"1996-10-01"</span><span class="w"> </span><span class="na">modified=</span><span class="s">"2017-12-20"</span><span class="w"> </span><span class="na">version=</span><span class="s">"201"</span><span class="nt">></span>
<span class="nt"><accession></span>P51587<span class="nt"></accession></span>
<span class="nt"><accession></span>O00183<span class="nt"></accession></span>
<span class="nt"><accession></span>O15008<span class="nt"></accession></span>
<span class="nt"><accession></span>Q13879<span class="nt"></accession></span>
<span class="nt"><accession></span>Q5TBJ7<span class="nt"></accession></span>
<span class="nt"><name></span>BRCA2_HUMAN<span class="nt"></name></span>
<span class="nt"><protein></span>
<span class="nt"><recommendedName></span>
<span class="nt"><fullName></span>Breast<span class="w"> </span>cancer<span class="w"> </span>type<span class="w"> </span>2<span class="w"> </span>susceptibility<span class="w"> </span>protein<span class="nt"></fullName></span>
<span class="nt"></recommendedName></span>
<span class="nt"><alternativeName></span>
<span class="nt"><fullName></span>Fanconi<span class="w"> </span>anemia<span class="w"> </span>group<span class="w"> </span>D1<span class="w"> </span>protein<span class="nt"></fullName></span>
<span class="nt"></alternativeName></span>
<span class="nt"></protein></span>
<span class="cm"><!-- ... --></span>
<span class="nt"></entry></span>
<span class="nt"></uniprot></span>
</code></pre></div>
<p>The file is available at <a href="https://www.uniprot.org/uniprot/P51587.xml">https://www.uniprot.org/uniprot/P51587.xml</a>. Basically, all the information about this entry should be available in this file, as long as one knows how to query the XML via <a href="https://en.wikipedia.org/wiki/XPath">XPath</a>. However, I find an XML file hard to read on its own, especially without any guide to how the file was constructed.</p>
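<p>For reference, this is roughly what querying by XPath looks like with Python’s built-in <code>xml.etree.ElementTree</code>, which supports a limited subset of XPath. It is a minimal sketch on a trimmed copy of the XML shown above rather than the full entry. The <code>up</code> prefix is an arbitrary name I picked for the UniProt namespace.</p>

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the entry shown above (not the full UniProt record)
xml_text = """
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot" created="1996-10-01" modified="2017-12-20" version="201">
    <accession>P51587</accession>
    <accession>O00183</accession>
    <name>BRCA2_HUMAN</name>
    <protein>
      <recommendedName>
        <fullName>Breast cancer type 2 susceptibility protein</fullName>
      </recommendedName>
    </protein>
  </entry>
</uniprot>
"""

# Every element lives in the UniProt namespace, so each step needs the prefix
ns = {"up": "http://uniprot.org/uniprot"}
root = ET.fromstring(xml_text)

accessions = [el.text for el in root.findall("up:entry/up:accession", ns)]
full_name = root.findtext(
    "up:entry/up:protein/up:recommendedName/up:fullName", namespaces=ns
)
print(accessions)
print(full_name)
```

<p>Every path has to be written by hand against a structure one needs to know in advance, and nothing tells whether an element may repeat or be missing. That is the part where a schema helps.</p>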
<p>UniProt XML is constructed based on its XML schema, available as an XSD file at <a href="http://www.uniprot.org/support/docs/uniprot.xsd">http://www.uniprot.org/support/docs/uniprot.xsd</a>. The schema not only helps understand the XML content, it can also be used to check whether an XML file is valid. In other words, since all UniProt XMLs are validated by its schema, one can expect all their data to follow the structure the schema defines. XML schema is also part of the <a href="https://www.w3.org/XML/Schema">W3C standard</a> and widely used.</p>
<h2 id="read-uniprot-xml-by-xmlschema">Read UniProt XML by xmlschema</h2>
<p>I use <a href="https://pypi.org/project/xmlschema/">xmlschema</a> to read XML with its schema in Python. Instead of using XPath, one can actually convert the XML content into a dictionary-like format, which can be easily passed to other Python functions.</p>
<p>Using the same entry <a href="https://www.uniprot.org/uniprot/P51587">P51587</a> as an example,</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="kn">import</span> <span class="nn">xmlschema</span>
<span class="gp">>>> </span><span class="n">schema</span> <span class="o">=</span> <span class="n">xmlschema</span><span class="o">.</span><span class="n">XMLSchema</span><span class="p">(</span><span class="s1">'https://www.uniprot.org/docs/uniprot.xsd'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">entry_dict</span> <span class="o">=</span> <span class="n">schema</span><span class="o">.</span><span class="n">to_dict</span><span class="p">(</span><span class="s1">'./P51587.xml'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">entry_dict</span><span class="o">.</span><span class="n">keys</span><span class="p">()</span>
<span class="go">dict_keys(['@xsi:schemaLocation', 'entry', 'copyright'])</span>
<span class="gp">>>> </span><span class="n">content</span> <span class="o">=</span> <span class="n">entry_dict</span><span class="p">[</span><span class="s1">'entry'</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="gp">>>> </span><span class="nb">list</span><span class="p">(</span><span class="n">content</span><span class="p">)[:</span><span class="mi">6</span><span class="p">]</span>
<span class="go">['@dataset', '@created', '@modified', '@version', 'accession', 'name']</span>
</code></pre></div>
<p>I don’t need any custom code to read the XML content structurally. For example, to get all the accession IDs of this entry,</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">content</span><span class="p">[</span><span class="s1">'accession'</span><span class="p">]</span>
<span class="go">['P51587', 'O00183', 'O15008', 'Q13879', 'Q5TBJ7']</span>
</code></pre></div>
<p>To get the protein names,</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">content</span><span class="p">[</span><span class="s1">'protein'</span><span class="p">]</span>
<span class="go">{'alternativeName': [{'fullName': 'Fanconi anemia group D1 protein'}],</span>
<span class="go"> 'recommendedName': {'fullName': 'Breast cancer type 2 susceptibility protein'}}</span>
</code></pre></div>
<p>One can compare the dictionary converted result with the original XML. I’d like to end the demo with a more complicated example that finds all the sequence variants:</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">seq_variants</span> <span class="o">=</span> <span class="nb">filter</span><span class="p">(</span>
<span class="gp">... </span> <span class="k">lambda</span> <span class="n">d</span><span class="p">:</span> <span class="n">d</span><span class="p">[</span><span class="s1">'@type'</span><span class="p">]</span> <span class="o">==</span> <span class="s1">'sequence variant'</span> <span class="ow">and</span> <span class="s1">'variation'</span> <span class="ow">in</span> <span class="n">d</span><span class="p">,</span>
<span class="gp">... </span> <span class="n">content</span><span class="p">[</span><span class="s1">'feature'</span><span class="p">])</span>
<span class="gp">>>> </span><span class="p">[(</span><span class="n">d</span><span class="p">[</span><span class="s1">'location'</span><span class="p">][</span><span class="s1">'position'</span><span class="p">][</span><span class="s1">'@position'</span><span class="p">],</span>
<span class="gp">... </span> <span class="n">d</span><span class="p">[</span><span class="s1">'original'</span><span class="p">],</span> <span class="n">d</span><span class="p">[</span><span class="s1">'variation'</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span>
<span class="gp">... </span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">seq_variants</span><span class="p">][:</span><span class="mi">10</span><span class="p">]</span>
<span class="go">[(25, 'G', 'R'),</span>
<span class="go"> (31, 'W', 'C'),</span>
<span class="go"> (31, 'W', 'R'),</span>
<span class="go"> (32, 'F', 'L'),</span>
<span class="go"> (42, 'Y', 'C'),</span>
<span class="go"> (53, 'K', 'R'),</span>
<span class="go"> (60, 'N', 'S'),</span>
<span class="go"> (64, 'T', 'I'),</span>
<span class="go"> (75, 'A', 'P'),</span>
<span class="go"> (81, 'F', 'L')]</span>
</code></pre></div>
<h2 id="summary">Summary</h2>
<p>With UniProt’s XML and its schema, one can read all the data in a structured fashion without a custom parser. Once the XML files of interest are downloaded, one can query everything locally, which is very helpful for retrieving substantial information from UniProt, say, extracting all the citations for a certain protein feature.</p>
<p>XML schema really helps users understand the data structure, and it also helps database developers validate their data exports. I hope someday all databases will enforce this kind of validation.</p>
<p>However, one may find the XML format tedious and not human-friendly to read. JSON has been popular and is used heavily by RESTful APIs. A specification for <a href="http://json-schema.org/">JSON schema</a> exists, but it is not a W3C standard yet.</p>
<p>SPARQL and RDF, part of the Semantic Web effort, can be a universal query interface that solves the same problem more elegantly, though the barrier to entry is a bit high and learning resources are limited.</p>
<p>For now, reading bulk data in XML with its schema seems to be the mature way to go with abundant support.</p>Ad hoc bioinformatic analysis in database2018-01-20T00:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2018-01-20:/posts/2018/01/ad-hoc-bioinfo-analysis-in-database/<p>Recently I’ve found that bioinformatic analysis in a database is not hard at all and that the database setup isn’t as daunting as it sounds, especially when the data are tabular. I used to start my analysis by loading everything into R or Python, and then figuring out all the filtering and grouping commands with my favorite R or Python packages. However, the data size would be bound by memory and the analysis might be slow unless additional optimization was applied. On the other hand, databases have already solved these problems by mapping the data to disk and by indexing. Therefore I’d like to share my recent experience of using databases for bioinfo analysis.</p>
<p>Note that if one is interested in the actual tips of using databases for analysis, feel free to skip the whole background section.</p>
<div class="toc">
<ul>
<li><a href="#background">Background</a><ul>
<li><a href="#reading-tabular-data-in-bioinformatics">Reading tabular data in bioinformatics</a></li>
<li><a href="#database">Database</a></li>
</ul>
</li>
<li><a href="#tabular-data-io-in-database">Tabular data IO in database</a><ul>
<li><a href="#sqlite">SQLite</a></li>
<li><a href="#postgresql">PostgreSQL</a></li>
<li><a href="#loading-compressed-data-with-named-pipe">Loading compressed data with named pipe</a></li>
</ul>
</li>
<li><a href="#benchmark">Benchmark</a><ul>
<li><a href="#pandas-python">pandas (Python)</a></li>
<li><a href="#sqlite_1">SQLite</a></li>
<li><a href="#postgresql_1">PostgreSQL</a></li>
<li><a href="#result">Result</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
</div>
<h2 id="background">Background</h2>
<h3 id="reading-tabular-data-in-bioinformatics">Reading tabular data in bioinformatics</h3>
<p>Tabular data are everywhere in bioinformatics. To record gene expressions, variants or cross reference IDs between different annotation systems or databases, data are stored in various tabular-like formats, such as BED, GTF, MAF, and VCF, which can usually be normalized to the standard CSV and TSV files. Starting with the raw data, we apply different kinds of filtering and grouping to pick up the records of interest. For example, we might subset the data within a genomic region, select transcripts above an expression threshold, or group the data by the same transcript across multiple samples.</p>
<p>Researchers have developed numerous tools to select the data of interest. In Python, numpy and pandas dominate the analysis; in R, data.frame, tibble, and data.table are all widely used. However, all the tools above only work if the data fit into memory. Unfortunately, bioinformatics data easily go beyond 10GB these days, which makes it difficult to analyze everything in memory. Even on a powerful server with a few hundred GB of memory, loading all the data into memory can be time-consuming. To make things worse, these issues multiply when joining multiple datasets together.</p>
<p>One might argue that in Python there are packages like <a href="http://xarray.pydata.org/en/stable/">xarray</a> and <a href="https://dask.pydata.org/en/latest/">dask</a> capable of handling out-of-memory multi-dimensional arrays. But they are mainly useful for handling numerical data. In bioinformatics, metadata are frequently used and consist of many text columns, where numpy doesn’t have the same computing advantage as it has for numerical columns. For example, gene expression only makes sense if it comes with the gene symbol, the transcript id, and the sample id.</p>
<h3 id="database">Database</h3>
<p>Databases have been solving out-of-memory data analysis for decades, and they come with several advantages. First, the language databases use is standardized, known as Structured Query Language (SQL). SQL is declarative: instead of writing how to load or query the data, one writes what the data or the query result should look like. Databases also support concurrent reads, so queries can run in parallel. Second, one can speed up queries by setting up indexes. Different types of indexes and different combinations of columns can be added to boost a query. Lastly, databases are persistent, so one only needs to load the data once.</p>
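<p>As a minimal sketch of the first two points, the snippet below uses Python's built-in <code>sqlite3</code> module on a toy in-memory table. The table <code>expr</code>, its columns, and the index name are made up for illustration. The query only states what rows are wanted, and SQLite's <code>EXPLAIN QUERY PLAN</code> reports whether it would use the index to find them.</p>

```python
import sqlite3

# Toy in-memory table; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expr (gene TEXT, sample TEXT, tpm REAL)")
conn.executemany(
    "INSERT INTO expr VALUES (?, ?, ?)",
    [("TP53", "s1", 12.5), ("TP53", "s2", 8.0), ("BRCA2", "s1", 3.2)],
)

# The query says what we want, not how to find it.
query = "SELECT sample, tpm FROM expr WHERE gene = ? ORDER BY sample"


def plan(sql):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("TP53",)).fetchall()
    return " ".join(row[-1] for row in rows)


plan_before = plan(query)  # no index yet, so the whole table is scanned
conn.execute("CREATE INDEX expr_gene_idx ON expr (gene)")
plan_after = plan(query)   # the plan now mentions expr_gene_idx

rows = conn.execute(query, ("TP53",)).fetchall()
print(plan_before)
print(plan_after)
print(rows)
```

<p>The query text is identical before and after the index is created. Only the plan chosen by the database changes, which is the practical benefit of a declarative language.</p>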
<p>I mainly use two databases: <a href="https://sqlite.org/">SQLite</a> and <a href="https://www.postgresql.org/">PostgreSQL</a>. SQLite’s database is just a single file on disk and it doesn’t need any configuration to run. In fact SQLite ships with Python, available as the <a href="https://docs.python.org/3/library/sqlite3.html"><code>sqlite3</code> module</a>. SQLite works very well in my case.</p>
<p>PostgreSQL is a more feature-rich database and has better concurrency support, such as multiple writers at the same time. <a href="https://www.postgresql.org/docs/current/static/indexes-types.html">Its advanced indexing</a> and <a href="https://www.postgresql.org/docs/current/static/datatype.html">data types</a> might be helpful for genomic range queries. The downside is that it requires some configuration and its installation is not as easy as SQLite’s. Though the basic PostgreSQL setup is actually just a few commands on Debian Linux, one probably needs to go through some documentation to understand what they are about and how to tweak the config.</p>
<p>The most annoying thing I found using a database in the past was to load my data, where I had to create the table by <code>CREATE TABLE ...</code> and insert all my data by multiple <code>INSERT INTO ... VALUES ...</code> statements. But recently I found that many databases have some built-in utilities to make the process easy and fast. Also, it is not hard to programmatically generate the statements through packages like <a href="https://www.sqlalchemy.org/">SQLAlchemy</a>. Therefore, I will share some experience of using databases here.</p>
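<p>Even without SQLAlchemy, generating the statements programmatically takes only a few lines. The following is a rough sketch using the standard library only. The table name <code>regions</code>, the inline TSV text, and the column types are assumptions for illustration, not part of any real dataset.</p>

```python
import csv
import io
import sqlite3

# A tiny TSV with a header row, standing in for a real file on disk.
tsv_text = "chrom\tstart\tend\nchr1\t100\t200\nchr2\t5\t50\n"

# Hypothetical column types for this sketch; adjust them to the real data.
column_types = {"chrom": "TEXT", "start": "INTEGER", "end": "INTEGER"}

reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
header = next(reader)

# Quote the column names because "end" is an SQL keyword.
columns_sql = ", ".join(f'"{name}" {column_types[name]}' for name in header)
placeholders = ", ".join("?" for _ in header)

conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE regions ({columns_sql})")
# executemany pulls rows from the reader one at a time,
# so the whole file never has to sit in a Python list.
conn.executemany(f"INSERT INTO regions VALUES ({placeholders})", reader)
conn.commit()

loaded = conn.execute(
    'SELECT chrom, "start", "end" FROM regions ORDER BY rowid'
).fetchall()
print(loaded)
```

<p>Building the column list with string formatting is only reasonable here because the header comes from a trusted file. The row values themselves always go through <code>?</code> placeholders.</p>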
<h2 id="tabular-data-io-in-database">Tabular data IO in database</h2>
<h3 id="sqlite">SQLite</h3>
<p>For SQLite, use <code>.mode csv</code> with the <a href="https://www.sqlite.org/cli.html#csv"><code>.import</code> command</a> to load data. If the table doesn’t exist, SQLite creates it automatically and uses the first row as the column names. One can create the table before loading to define each column’s data type; otherwise, all columns are of <code>TEXT</code> type. <code>.separator</code> controls the delimiter character SQLite uses between columns.</p>
<div class="highlight"><pre><span></span><code><span class="p">.</span><span class="k">mode</span><span class="w"> </span><span class="n">csv</span>
<span class="c1">-- For TSV files</span>
<span class="p">.</span><span class="n">separator</span><span class="w"> </span><span class="err">\</span><span class="n">t</span>
<span class="p">.</span><span class="n">import</span><span class="w"> </span><span class="s1">'/path/to/tsv'</span><span class="w"> </span><span class="k">table_name</span>
</code></pre></div>
<p>To export data, use the <code>.once</code> command followed by the query:</p>
<div class="highlight"><pre><span></span><code><span class="c1">-- Export column names</span>
<span class="p">.</span><span class="n">header</span><span class="w"> </span><span class="k">on</span>
<span class="p">.</span><span class="n">once</span><span class="w"> </span><span class="s1">'/path/to/output.tsv'</span>
<span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="k">table_name</span><span class="p">;</span><span class="w"> </span><span class="c1">-- Export all data in the table</span>
</code></pre></div>
<p>Commands above can be scripted into SQLite like:</p>
<div class="highlight"><pre><span></span><code>sqlite3 mydb.sqlite < load_data.sql
</code></pre></div>
<h3 id="postgresql">PostgreSQL</h3>
<p>For PostgreSQL, the built-in solution is to use the <a href="https://www.postgresql.org/docs/current/static/sql-copy.html"><code>COPY</code> statement</a> or the <a href="https://www.postgresql.org/docs/current/static/app-psql.html#APP-PSQL-META-COMMANDS-COPY"><code>\copy</code> metacommand</a> to import or export data. <code>COPY</code> runs faster than the equivalent <code>INSERT</code> statements. Besides built-in commands, an external tool <a href="https://pgloader.io/">pgloader</a> has been very helpful for the data loading, whose loading process is more flexible.</p>
<p>In this post, I won’t dive into details of their usage. There will be an example in the benchmark section.</p>
<h3 id="loading-compressed-data-with-named-pipe">Loading compressed data with named pipe</h3>
<p>Many tabular files are compressed by gzip or bgzip to save disk space. To decompress a file and load it into the database without storing the uncompressed file somewhere first, one can consider using a <a href="https://www.linuxjournal.com/article/2156">named pipe</a>.</p>
<p>The idea is to decompress the file to a named pipe and read the data in a database from the named pipe. A named pipe can be created by <code>mkfifo</code>. For example,</p>
<div class="highlight"><pre><span></span><code>mkfifo<span class="w"> </span>mypipe
gunzip<span class="w"> </span>-c<span class="w"> </span>mydata.tsv.gz<span class="w"> </span>><span class="w"> </span>mypipe<span class="w"> </span><span class="p">&</span>
</code></pre></div>
<p>The trailing <code>&</code> makes the decompress command run in the background to keep everything in one shell session. Then read the data in SQLite as if the pipe were a file:</p>
<div class="highlight"><pre><span></span><code>.import mypipe mytable
</code></pre></div>
<p>The trick here can be further expanded to any preprocessing in any language. One can simply preprocess the file and write the output to a named pipe. The database can read from the named pipe without storing the full intermediate output on disk. Plus, by piping between commands more CPU cores are utilized.</p>
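<p>When the loading is done from Python rather than the SQLite shell, the same streaming idea works without a named pipe. This is a different technique from <code>mkfifo</code>: the sketch below lets <code>gzip.open</code> decompress on the fly and feeds the rows straight into <code>sqlite3</code>. The file name, table name, and the three example rows are made up so that the snippet is self-contained.</p>

```python
import csv
import gzip
import os
import sqlite3
import tempfile

# Write a small gzipped TSV so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "mydata.tsv.gz")
with gzip.open(gz_path, "wt", newline="") as f:
    f.write("ENST00000000233\tIntron\n")
    f.write("ENST00000000233\tIntron\n")
    f.write("ENST00000000412\tMissense_Mutation\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (transcript_id TEXT, mutation_type TEXT)")

# gzip.open decompresses while reading, so rows stream into the database
# and no uncompressed copy of the file is written to disk.
with gzip.open(gz_path, "rt", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    conn.executemany("INSERT INTO mytable VALUES (?, ?)", reader)
conn.commit()

counts = conn.execute(
    "SELECT transcript_id, mutation_type, COUNT(*) FROM mytable "
    "GROUP BY transcript_id, mutation_type "
    "ORDER BY transcript_id, mutation_type"
).fetchall()
print(counts)
```

<p>Unlike the named pipe, this runs the decompression and the inserts in one process, so it does not spread the work over multiple CPU cores.</p>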
<h2 id="benchmark">Benchmark</h2>
<p>To give an idea of the data processing time in databases, I used all the <a href="https://www.synapse.org/#!Synapse:syn7214402/wiki/405297">somatic variants from TCGA MC3</a> as a demonstration. The goal here is to count the number of variants per transcript and mutation type. So the output will be something like the following:</p>
<table>
<thead>
<tr>
<th style="text-align: left;">Transcript ID</th>
<th>Mutation type</th>
<th>Count</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">ENST00000000233</td>
<td>3’UTR</td>
<td>20</td>
</tr>
<tr>
<td style="text-align: left;">ENST00000000233</td>
<td>Frame_Shift_Del</td>
<td>1</td>
</tr>
<tr>
<td style="text-align: left;">ENST00000000233</td>
<td>Intron</td>
<td>6</td>
</tr>
<tr>
<td style="text-align: left;">…</td>
<td>…</td>
<td>…</td>
</tr>
</tbody>
</table>
<p>After filtering out all the silent mutations, there are about 2.8 million variants in total, taking up 614MB of disk space.</p>
<p>I used three methods to load and group the variants: pandas, SQLite, and PostgreSQL. Their code is shown below.</p>
<h3 id="pandas-python">pandas (Python)</h3>
<p>Standard pandas IO code.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span>
<span class="s1">'mc3_filtered.tsv'</span><span class="p">,</span>
<span class="n">header</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span>
<span class="n">names</span><span class="o">=</span><span class="p">[</span>
<span class="s1">'chrom'</span><span class="p">,</span> <span class="s1">'start'</span><span class="p">,</span> <span class="s1">'end'</span><span class="p">,</span> <span class="s1">'strand'</span><span class="p">,</span> <span class="s1">'mutation_type'</span><span class="p">,</span>
<span class="s1">'ref_allele'</span><span class="p">,</span> <span class="s1">'alt_allele'</span><span class="p">,</span> <span class="s1">'transcript_id'</span><span class="p">,</span>
<span class="s1">'hgvs_c'</span><span class="p">,</span> <span class="s1">'hgvs_p'</span><span class="p">,</span> <span class="s1">'cdna_start'</span><span class="p">,</span> <span class="s1">'cdna_end'</span><span class="p">,</span>
<span class="s1">'p_start'</span><span class="p">,</span> <span class="s1">'p_end'</span><span class="p">,</span> <span class="s1">'normal_id'</span><span class="p">,</span> <span class="s1">'tumor_id'</span>
<span class="p">],</span>
<span class="n">dtype</span><span class="o">=</span><span class="p">{</span>
<span class="s1">'chrom'</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="s1">'start'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">int64</span><span class="p">,</span> <span class="s1">'end'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">int64</span><span class="p">,</span>
<span class="s1">'strand'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">int32</span><span class="p">,</span> <span class="s1">'cdna_start'</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="s1">'cdna_end'</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
<span class="s1">'p_start'</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="s1">'p_end'</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
<span class="p">},</span>
<span class="n">engine</span><span class="o">=</span><span class="s1">'c'</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">grp_df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'transcript_id'</span><span class="p">,</span> <span class="s1">'mutation_type'</span><span class="p">])[</span><span class="s1">'alt_allele'</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">grp_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s1">'out.pandas.tsv'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s1">'</span><span class="se">\t</span><span class="s1">'</span><span class="p">)</span>
</code></pre></div>
<h3 id="sqlite_1">SQLite</h3>
<p>I set some <code>PRAGMA ...</code> statements at the beginning to control some of the SQLite settings. They tell SQLite to use more cache, create temporary tables in memory, and disable all the transaction recovery settings. By default, SQLite writes everything to the disk first before changing the actual database content, so if the program fails or any exception occurs, it can recover all the transactions properly. In our case, we don’t care about the integrity of the database.</p>
<div class="highlight"><pre><span></span><code><span class="n">PRAGMA</span><span class="w"> </span><span class="n">cache_size</span><span class="o">=-</span><span class="mi">4192000</span><span class="p">;</span><span class="w"> </span><span class="c1">-- Use about 4GB RAM as cache (negative values are in KiB)</span>
<span class="n">PRAGMA</span><span class="w"> </span><span class="n">temp_store</span><span class="o">=</span><span class="n">MEMORY</span><span class="p">;</span>
<span class="n">PRAGMA</span><span class="w"> </span><span class="n">synchronous</span><span class="o">=</span><span class="k">OFF</span><span class="p">;</span>
<span class="n">PRAGMA</span><span class="w"> </span><span class="n">journal_mode</span><span class="o">=</span><span class="k">OFF</span><span class="p">;</span>
<span class="n">PRAGMA</span><span class="w"> </span><span class="n">locking_mode</span><span class="o">=</span><span class="k">EXCLUSIVE</span><span class="p">;</span>
<span class="p">.</span><span class="k">mode</span><span class="w"> </span><span class="n">csv</span>
<span class="p">.</span><span class="n">separator</span><span class="w"> </span><span class="err">\</span><span class="n">t</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">mc3</span><span class="w"> </span><span class="p">(</span>
<span class="w"> </span><span class="n">chrom</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span>
<span class="w"> </span><span class="ss">"start"</span><span class="w"> </span><span class="nb">INT</span><span class="p">,</span>
<span class="w"> </span><span class="ss">"end"</span><span class="w"> </span><span class="nb">INT</span><span class="p">,</span>
<span class="w"> </span><span class="n">strand</span><span class="w"> </span><span class="nb">INT</span><span class="p">,</span>
<span class="w"> </span><span class="n">mutation_type</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span>
<span class="w"> </span><span class="n">ref_allele</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span>
<span class="w"> </span><span class="n">alt_allele</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span>
<span class="w"> </span><span class="n">transcript_id</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span>
<span class="w"> </span><span class="n">hgvs_c</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span>
<span class="w"> </span><span class="n">hgvs_p</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span>
<span class="w"> </span><span class="n">cdna_start</span><span class="w"> </span><span class="nb">INT</span><span class="p">,</span>
<span class="w"> </span><span class="n">cdna_end</span><span class="w"> </span><span class="nb">INT</span><span class="p">,</span>
<span class="w"> </span><span class="n">p_start</span><span class="w"> </span><span class="nb">INT</span><span class="p">,</span>
<span class="w"> </span><span class="n">p_end</span><span class="w"> </span><span class="nb">INT</span><span class="p">,</span>
<span class="w"> </span><span class="n">normal_id</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span>
<span class="w"> </span><span class="n">tumor_id</span><span class="w"> </span><span class="nb">TEXT</span>
<span class="p">);</span>
<span class="p">.</span><span class="n">import</span><span class="w"> </span><span class="n">mc3_filtered</span><span class="p">.</span><span class="n">tsv</span><span class="w"> </span><span class="n">mc3</span>
<span class="c1">-- Create an index to speed up grouping on the same columns</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="n">mc3_idx</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">mc3</span><span class="w"> </span><span class="p">(</span><span class="n">transcript_id</span><span class="p">,</span><span class="w"> </span><span class="n">mutation_type</span><span class="p">);</span>
<span class="c1">-- Output</span>
<span class="p">.</span><span class="n">once</span><span class="w"> </span><span class="k">out</span><span class="p">.</span><span class="n">sqlite</span><span class="p">.</span><span class="n">tsv</span>
<span class="k">SELECT</span><span class="w"> </span><span class="n">transcript_id</span><span class="p">,</span><span class="w"> </span><span class="n">mutation_type</span><span class="p">,</span><span class="w"> </span><span class="k">COUNT</span><span class="p">(</span><span class="n">alt_allele</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">c</span>
<span class="k">FROM</span><span class="w"> </span><span class="n">mc3</span>
<span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">transcript_id</span><span class="p">,</span><span class="w"> </span><span class="n">mutation_type</span><span class="p">;</span>
</code></pre></div>
<h3 id="postgresql_1">PostgreSQL</h3>
<p>I used <a href="https://pgloader.io/">pgloader</a> to load the data into a local PostgreSQL database <code>test_mc3</code>. pgloader can take a script of its own mini-language.</p>
<div class="highlight"><pre><span></span><code>LOAD CSV
FROM 'mc3_filtered.tsv'
INTO postgresql:///test_mc3?mc3
WITH fields terminated by '\t',
fields not enclosed,
drop indexes
BEFORE LOAD DO
$$ DROP TABLE IF EXISTS mc3; $$,
$$ CREATE TABLE mc3 (
chrom TEXT,
"start" BIGINT,
"end" BIGINT,
strand SMALLINT,
mutation_type TEXT,
ref_allele TEXT,
alt_allele TEXT,
transcript_id TEXT,
hgvs_c TEXT,
hgvs_p TEXT,
cdna_start INT,
cdna_end INT,
p_start INT,
p_end INT,
normal_id TEXT,
tumor_id TEXT
);
$$,
$$ CREATE INDEX mc3_idx ON mc3 (transcript_id, mutation_type); $$
;
</code></pre></div>
<p>To do the grouping analysis, I used the built-in <code>COPY</code> command:</p>
<div class="highlight"><pre><span></span><code><span class="k">COPY</span><span class="w"> </span><span class="p">(</span>
<span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="n">transcript_id</span><span class="p">,</span><span class="w"> </span><span class="n">mutation_type</span><span class="p">,</span><span class="w"> </span><span class="n">COUNT</span><span class="p">(</span><span class="n">alt_allele</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">c</span>
<span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">mc3</span>
<span class="w"> </span><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">transcript_id</span><span class="p">,</span><span class="w"> </span><span class="n">mutation_type</span>
<span class="p">)</span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="s1">'/private/tmp/mc3/MC3/out.psql.tsv'</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">FORMAT</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">);</span>
</code></pre></div>
<h3 id="result">Result</h3>
<p>I didn’t run the benchmark systematically, but a few repeats showed similar numbers.</p>
<table>
<thead>
<tr>
<th style="text-align: left;">Method</th>
<th style="text-align: right;">Read data (sec)</th>
<th style="text-align: right;">Group-by analysis (sec)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">Pandas</td>
<td style="text-align: right;">10.7</td>
<td style="text-align: right;">0.9</td>
</tr>
<tr>
<td style="text-align: left;">SQLite</td>
<td style="text-align: right;">27.7</td>
<td style="text-align: right;">4.0</td>
</tr>
<tr>
<td style="text-align: left;">PostgreSQL</td>
<td style="text-align: right;">82.6</td>
<td style="text-align: right;">13.5</td>
</tr>
</tbody>
</table>
<p>In this case, all data can be loaded into memory easily, so pandas gave the best performance. The grouping itself took almost no time.</p>
<p>All databases ran much slower on loading data than pandas. PostgreSQL was a lot slower than SQLite, which I suspect has something to do with my server configuration, say, not enough cache or not enough working memory for the group-by operation. I feel PostgreSQL can be faster, but this is the result I have so far. Note that all the databases were stored on a PCIe SSD. If they were on a normal hard drive, the database creation would take much longer.</p>
<p>However, after the data are loaded into the database, the query alone takes seconds, which is in the same range as pandas. Since pandas has to read the data on every run (10.7 seconds here, versus 4.0 seconds for the SQLite query alone), a database like SQLite can yield better performance if the analysis is on a frequently used dataset. Once the data get larger than the memory capacity, special care will be needed to make the pandas approach work, whereas a database can scale up with little fuss.</p>
<h2 id="conclusion">Conclusion</h2>
<p>My post provides a different solution for working with tabular data: doing the work in a database. In-memory approaches like pandas work very efficiently on a small dataset, but one has to code the “how-tos” to scale to a larger dataset that cannot fit into memory (or for which the overhead is too high). On the other hand, databases can easily scale to a few hundred GBs in size and the queries are fast. For analysis on a frequently used dataset, loading data into the database first might be a good idea.</p>
<p>Another good thing about databases is that SQL makes joining across tables easy. One can join multiple tables, say, to expand the gene annotation, without worrying about how to implement the join. With indexing, the joining can be fast. In pandas, one generates many objects representing the joining results, but those objects cannot be easily shared between scripts. If one relies on storing the intermediate objects on disk, the accumulated overhead might be significant. Projects like <a href="https://arrow.apache.org/">Apache Arrow</a> might solve the in-memory object passing ultimately, but its development is still in the early phase. As for databases, one can define reusable views for the joining logic and filtering results. The post didn’t really touch this part so I probably need another benchmark or post to back my thoughts.</p>
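<p>To make the idea of a reusable view a bit more concrete, here is a small sketch with Python's <code>sqlite3</code> module. The tables <code>variants</code> and <code>transcripts</code>, the view <code>annotated_variants</code>, and all the rows are invented for illustration; it is not a benchmark.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variants (transcript_id TEXT, mutation_type TEXT);
CREATE TABLE transcripts (transcript_id TEXT PRIMARY KEY, gene_symbol TEXT);

INSERT INTO variants VALUES
    ('T1', 'Missense_Mutation'), ('T1', 'Intron'), ('T2', 'Missense_Mutation');
INSERT INTO transcripts VALUES ('T1', 'GENE_A'), ('T2', 'GENE_B');

-- The view stores the joining logic, not the joined rows.
CREATE VIEW annotated_variants AS
SELECT t.gene_symbol, v.transcript_id, v.mutation_type
FROM variants AS v
JOIN transcripts AS t ON t.transcript_id = v.transcript_id;
""")

# Any later query can treat the view like a regular table.
per_gene = conn.execute(
    "SELECT gene_symbol, COUNT(*) FROM annotated_variants "
    "GROUP BY gene_symbol ORDER BY gene_symbol"
).fetchall()
print(per_gene)
```

<p>In a database stored on disk, the view is saved with it, so another script or another language can query <code>annotated_variants</code> without repeating the join.</p>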
<p>If one is analyzing variants, using databases or SQL in general has been backed up by many practical projects. People at <a href="http://quinlanlab.org/">Quinlan Lab</a> have been building <a href="https://github.com/quinlan-lab/vcf2db">vcf2db</a> to load variants into databases for downstream annotation and analysis. To scale way up to terabytes or petabytes of variant data, <a href="https://cloud.google.com/genomics/v1/analyze-variants">Google Cloud Genomics</a> provides an interface to store and query variants in BigQuery, where users use standard SQL to select the variants of interest.</p>
<p>However, working in pandas gives users great room for flexibility. For example, one can iterate over rows and do some complex transformation of the value. Maybe it would be the optimal solution to use <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html#pandas.read_sql"><code>pandas.read_sql</code></a> to run a query in a database.</p>
<p>It seems to me that many people rely too much on the features of some special file formats such as bgzip and tabix and have forgotten the generic yet flexible approach of using databases. Those formats often optimize random access for a given genomic query by indexing. In databases, such an index is analogous to an index on <code>(chrom, start, -end)</code> or even a GiST index on a range type in PostgreSQL. The database might be slower, but aside from the performance, one can continue to query the records in the same way. With a special format, the functionality is much more limited.</p>
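<p>To make the index analogy concrete, here is a small sketch using Python’s built-in <code>sqlite3</code> module. The <code>feature</code> table, its column names, and the intervals are hypothetical, and the descending order on <code>seq_end</code> is used here as a stand-in for the <code>-end</code> in the tuple above.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical interval table; coordinates are 1-based and inclusive.
    CREATE TABLE feature (
        name TEXT, chrom TEXT, seq_start INTEGER, seq_end INTEGER
    );
    INSERT INTO feature VALUES
        ('a', 'chr1', 100, 200), ('b', 'chr1', 150, 400),
        ('c', 'chr1', 500, 600), ('d', 'chr2', 100, 200);
    -- Composite index in the spirit of (chrom, start, -end).
    CREATE INDEX feature_range ON feature (chrom, seq_start, seq_end DESC);
""")

def overlapping(conn, chrom, start, end):
    """Return names of features overlapping chrom:start-end (inclusive)."""
    rows = conn.execute(
        "SELECT name FROM feature "
        "WHERE chrom = ? AND seq_start <= ? AND seq_end >= ? "
        "ORDER BY seq_start",
        (chrom, end, start),
    )
    return [name for (name,) in rows]

hits = overlapping(conn, "chr1", 180, 450)
print(hits)
```

<p>The overlap test itself is plain SQL, so the same query keeps working whether or not the query planner decides to use the index.</p>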
<p>Now I will give the database approach a try before writing my own data wrangling script.</p>
<p>EDIT 2018-01-28: Add real world examples of using databases to store variant data.</p>Using EnsDb's annotation database in Python2017-11-17T00:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2017-11-17:/posts/2017/11/use-ensdb-database-in-python/<p>How to find and download the EnsDb, the Ensembl genomic annotation in SQLite database made by R package ensembldb, and use it in Python application.</p><p>I found that there isn’t a systematic way to query and convert genomic annotation IDs in Python. At least there isn’t one as good as <a href="https://www.bioconductor.org/help/workflows/annotation/annotation/">what R/Bioconductor currently has</a>. If you’ve never heard of R/Bioconductor annotation tool stack before, check out <a href="https://www.bioconductor.org/help/workflows/annotation/annotation/">the official workflow</a> or <a href="https://blog.liang2.tw/posts/2016/05/biocondutor-ensembl-reference/">my post in 2016</a> specific for querying Ensembl annotations.</p>
<p>Although I enjoy using R for genomic annotation conversion, a few days ago I wanted to do the same thing inside my text processing script in Python. I might be able to rewrite the script in R, but I feel like R is not really the right tool for this task and, on top of that, I don’t know how to write efficient text processing code in R<sup id="fnref:r-text-processing"><a class="footnote-ref" href="#fn:r-text-processing">1</a></sup>.</p>
<p>Knowing that all annotations in R are stored in single-file SQLite databases, I should be able to connect to the database directly from Python or any other language and write SQL queries to retrieve the same information. So my question becomes how to find the path to the databases. It turns out that many new Bioconductor annotation packages are hosted via <a href="https://bioconductor.org/packages/release/bioc/html/AnnotationHub.html">AnnotationHub</a>, and users can search for an annotation package and retrieve it locally by its ID. For example, all the recent Ensembl releases, e.g., <code>EnsDb.Hsapiens.vXX</code>, are available on AnnotationHub.</p>
<p>After digging around a bit, I was able to query AnnotationHub, download the correct EnsDb SQLite database file, and make SQL queries for the annotation ID conversion without any R package. I will share the details in the rest of the post.</p>
<div class="toc">
<ul>
<li><a href="#annotationhub-web-interface">AnnotationHub web interface</a></li>
<li><a href="#manual-query-in-annotationhub">Manual query in AnnotationHub</a></li>
<li><a href="#manual-query-in-ensdb">Manual query in EnsDB</a></li>
<li><a href="#summary">Summary</a></li>
</ul>
</div>
<p>But before we start with the details, I want to clarify that it isn’t my intention to persuade people away from the current R ecosystem. The current R ecosystem is great and I recommend sticking with it as much as you can. I am pretty sure I would hit a lot of issues if I wanted to do more complex analysis or queries without the help of what R packages provide.</p>
<h2 id="annotationhub-web-interface">AnnotationHub web interface</h2>
<p><strong>EDIT 2019-01-29</strong><br>
Now AnnotationHub has a nice <a href="https://annotationhub.bioconductor.org/">web interface</a>. With the new API, we can search and download all the EnsDb annotation objects on AnnotationHub by visiting <a href="https://annotationhub.bioconductor.org/package2/AHEnsDbs">https://annotationhub.bioconductor.org/package2/AHEnsDbs</a>:</p>
<figure>
<img src="https://blog.liang2.tw/posts/2017/11/use-ensdb-database-in-python/pics/annotataionhub_web_interface.png"/>
<figcaption>The web query interface of AnnotationHub</figcaption>
</figure>
<p>The following section is the old way to navigate through AnnotationHub’s database.</p>
<h2 id="manual-query-in-annotationhub">Manual query in AnnotationHub</h2>
<p>When one wants to use the R package AnnotationHub, the common usage is</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">AnnotationHub</span><span class="p">)</span>
<span class="n">ah</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">AnnotationHub</span><span class="p">()</span>
<span class="c1">## snapshotDate(): 2017-10-27</span>
<span class="nf">query</span><span class="p">(</span><span class="n">ah</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s">"EnsDb"</span><span class="p">,</span><span class="w"> </span><span class="s">"Homo sapiens"</span><span class="p">))</span>
</code></pre></div>
<p>The function call <code>AnnotationHub()</code> will download the latest version of the metadata of all available annotation objects. The subsequent <code>query(...)</code> function will talk to the local metadata database.</p>
<p>Now let’s do it manually without any R function calls.</p>
<p>The default <a href="https://bioconductor.org/packages/release/bioc/html/AnnotationHub.html">AnnotationHub</a> is at <a href="https://annotationhub.bioconductor.org/">https://annotationhub.bioconductor.org/</a>. By visiting the page we can find several relevant endpoints:</p>
<ul>
<li><code>/metadata/annotationhub.sqlite3</code></li>
<li><code>/fetch/:id # id => rdatapaths.id</code></li>
</ul>
<p>So as long as we get the <code>rdatapaths.id</code> of the EnsDb using the metadata, we can download it via the <code>/fetch/:id</code> endpoint.</p>
<p>After downloading the metadata database <code>https://annotationhub.bioconductor.org/metadata/annotationhub.sqlite3</code>, we can inspect it in SQLite3 by connecting to it directly:</p>
<div class="highlight"><pre><span></span><code>sqlite3 annotationhub.sqlite3
</code></pre></div>
<p>Some useful commands to inspect a foreign database (or the ultimate help command <code>.help</code>):</p>
<div class="highlight"><pre><span></span><code><span class="gp">sqlite></span><span class="w"> </span><span class="p">.</span><span class="n">header</span><span class="w"> </span><span class="k">on</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="p">.</span><span class="k">mode</span><span class="w"> </span><span class="k">column</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="p">.</span><span class="n">tables</span>
<span class="go">biocversions rdatapaths schema_info test</span>
<span class="go">input_sources recipes statuses timestamp</span>
<span class="go">location_prefixes resources tags</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="p">.</span><span class="k">schema</span><span class="w"> </span><span class="n">rdatapaths</span>
<span class="go">CREATE TABLE `rdatapaths`(`id` integer DEFAULT (NULL) NOT NULL PRIMARY KEY , `rdatapath` varchar(255) DEFAULT (NULL) NULL, `rdataclass` varchar(255) DEFAULT (NULL) NULL, `resource_id` integer DEFAULT (NULL) NULL, `dispatchclass` varchar(255) DEFAULT (NULL) NULL, CONSTRAINT `rdatapaths_ibfk_1` FOREIGN KEY (`resource_id`) REFERENCES `resources`(`id`));</span>
<span class="go">CREATE INDEX `rdatapaths_resource_id` ON `rdatapaths` (`resource_id`);</span>
</code></pre></div>
<p>So let’s make a SQL query to find all the human EnsDb databases:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">ah_id</span><span class="p">,</span><span class="w"> </span><span class="n">rdp</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">rdatapaths_id</span><span class="p">,</span><span class="w"> </span><span class="n">rdp</span><span class="p">.</span><span class="n">rdatapath</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">title</span>
<span class="k">FROM</span><span class="w"> </span><span class="n">resources</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">r</span>
<span class="k">JOIN</span><span class="w"> </span><span class="n">rdatapaths</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">rdp</span>
<span class="k">ON</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rdp</span><span class="p">.</span><span class="n">resource_id</span>
<span class="k">WHERE</span><span class="w"> </span><span class="n">r</span><span class="p">.</span><span class="n">title</span><span class="w"> </span><span class="k">LIKE</span><span class="w"> </span><span class="s1">'%EnsDb for Homo Sapiens%'</span><span class="p">;</span>
<span class="c1">-- ah_id rdatapaths_id rdatapath title</span>
<span class="c1">-- ---------- ------------- -------------------------------------- -- ---------------------------------</span>
<span class="c1">-- AH53211 59949 AHEnsDbs/v87/EnsDb.Hsapiens.v87.sqlite Ensembl 87 EnsDb for Homo Sapiens</span>
<span class="c1">-- AH53715 60453 AHEnsDbs/v88/EnsDb.Hsapiens.v88.sqlite Ensembl 88 EnsDb for Homo Sapiens</span>
<span class="c1">-- AH56681 63419 AHEnsDbs/v89/EnsDb.Hsapiens.v89.sqlite Ensembl 89 EnsDb for Homo Sapiens</span>
<span class="c1">-- AH57757 64495 AHEnsDbs/v90/EnsDb.Hsapiens.v90.sqlite Ensembl 90 EnsDb for Homo Sapiens</span>
</code></pre></div>
<p>All the Ensembl releases 87+ are available! I will use release 90 as an example. We can download it by its rdatapaths id:</p>
<div class="highlight"><pre><span></span><code>wget -O EnsDb.Hsapiens.v90.sqlite https://annotationhub.bioconductor.org/fetch/64495
</code></pre></div>
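<p>The lookup and download steps above could be scripted in Python along these lines. This is only a sketch: to stay self-contained, it builds an in-memory stand-in for <code>annotationhub.sqlite3</code> that holds just the release 90 row shown above. The <code>resources.id</code> value of 1 is a placeholder, and only the columns used by the query are created. Against the real metadata file one would connect to <code>annotationhub.sqlite3</code> instead.</p>

```python
import sqlite3

FETCH_URL = "https://annotationhub.bioconductor.org/fetch/{id}"

# Stand-in for sqlite3.connect("annotationhub.sqlite3").
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resources (id INTEGER PRIMARY KEY, ah_id TEXT, title TEXT);
    CREATE TABLE rdatapaths (
        id INTEGER PRIMARY KEY, rdatapath TEXT, resource_id INTEGER
    );
    -- Row copied from the query output above; resources.id is a placeholder.
    INSERT INTO resources VALUES
        (1, 'AH57757', 'Ensembl 90 EnsDb for Homo Sapiens');
    INSERT INTO rdatapaths VALUES
        (64495, 'AHEnsDbs/v90/EnsDb.Hsapiens.v90.sqlite', 1);
""")

rows = conn.execute(
    """
    SELECT r.ah_id, rdp.id
    FROM resources AS r
    JOIN rdatapaths AS rdp ON r.id = rdp.resource_id
    WHERE r.title LIKE ?
    """,
    ("%EnsDb for Homo Sapiens%",),
).fetchall()

# Map each AnnotationHub ID to the URL of its /fetch/:id endpoint.
fetch_urls = {ah_id: FETCH_URL.format(id=rdp_id) for ah_id, rdp_id in rows}
print(fetch_urls)
```

<p>Each URL can then be passed to <code>wget</code> as above, or to any HTTP client.</p>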
<p>For older Ensembl releases, one may need to <a href="https://bioconductor.org/packages/release/bioc/vignettes/ensembldb/inst/doc/ensembldb.html#102_building_annotation_packages">build the SQLite database following the instructions from ensembldb</a>. For the last GRCh37 release, Ensembl release 75, one can download the source of the Bioconductor annotation package <a href="https://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v75.html"><code>EnsDb.Hsapiens.v75</code></a> and extract it. The database will be under <code>inst/extdata</code>.</p>
<h2 id="manual-query-in-ensdb">Manual query in EnsDB</h2>
<p>EnsDb SQLite databases are Ensembl annotation databases created by the R package <a href="https://bioconductor.org/packages/release/bioc/html/ensembldb.html">ensembldb</a>.</p>
<p>Here I will show how to find a transcript’s gene name, its genomic location, and all its exon locations given its Ensembl transcript ID.</p>
<p>First connect to the database by <code>sqlite3 EnsDb.Hsapiens.v90.sqlite</code>. Its table design is very straightforward:</p>
<div class="highlight"><pre><span></span><code><span class="gp">sqlite></span><span class="w"> </span><span class="p">.</span><span class="n">tables</span>
<span class="go">chromosome exon metadata protein_domain tx2exon</span>
<span class="go">entrezgene gene protein tx uniprot</span>
</code></pre></div>
<p>So it didn’t take me long to figure out how to join the transcript and gene information:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">tx</span><span class="p">.</span><span class="n">tx_id</span><span class="p">,</span><span class="w"> </span><span class="n">tx</span><span class="p">.</span><span class="n">gene_id</span><span class="p">,</span><span class="w"> </span><span class="n">gene</span><span class="p">.</span><span class="n">gene_name</span><span class="p">,</span><span class="w"> </span><span class="n">seq_name</span><span class="p">,</span><span class="w"> </span><span class="n">seq_strand</span>
<span class="k">FROM</span><span class="w"> </span><span class="n">tx</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">gene</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">tx</span><span class="p">.</span><span class="n">gene_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gene</span><span class="p">.</span><span class="n">gene_id</span>
<span class="k">WHERE</span><span class="w"> </span><span class="n">tx_id</span><span class="o">=</span><span class="s1">'ENST00000358731'</span><span class="p">;</span>
<span class="c1">-- tx_id gene_id gene_name seq_name seq_strand</span>
<span class="c1">-- --------------- --------------- ---------- ---------- ----------</span>
<span class="c1">-- ENST00000358731 ENSG00000145734 BDP1 5 1</span>
</code></pre></div>
<p>And for the genomic ranges of a transcript’s exons:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">tx_id</span><span class="p">,</span><span class="w"> </span><span class="n">exon_idx</span><span class="p">,</span><span class="w"> </span><span class="n">exon_seq_start</span><span class="p">,</span><span class="w"> </span><span class="n">exon_seq_end</span>
<span class="k">FROM</span><span class="w"> </span><span class="n">tx2exon</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">exon</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">tx2exon</span><span class="p">.</span><span class="n">exon_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">exon</span><span class="p">.</span><span class="n">exon_id</span>
<span class="k">WHERE</span><span class="w"> </span><span class="n">tx_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'ENST00000380139'</span>
<span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">exon_idx</span><span class="p">;</span>
<span class="c1">-- tx_id exon_idx exon_seq_start exon_seq_end</span>
<span class="c1">-- --------------- ---------- -------------- ------------</span>
<span class="c1">-- ENST00000380139 1 32427904 32428133</span>
<span class="c1">-- ENST00000380139 2 32407645 32407772</span>
<span class="c1">-- ENST00000380139 3 32407250 32407338</span>
<span class="c1">-- ENST00000380139 4 32404203 32404271</span>
<span class="c1">-- ENST00000380139 5 32400723 32403200</span>
</code></pre></div>
<p>All the coordinates are 1-based and the ranges are inclusive.</p>
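<p>Since the goal is to run these queries from Python, here is a sketch with the built-in <code>sqlite3</code> module. To keep it runnable without the download, it fills an in-memory database with the first two exons listed above. The exon IDs are placeholders and only the columns used by the query are created. With the real file, one would connect to <code>EnsDb.Hsapiens.v90.sqlite</code> instead. The last step converts the 1-based inclusive ranges to 0-based half-open ones (the BED convention).</p>

```python
import sqlite3

# Stand-in for sqlite3.connect("EnsDb.Hsapiens.v90.sqlite").
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exon (
        exon_id TEXT, exon_seq_start INTEGER, exon_seq_end INTEGER
    );
    CREATE TABLE tx2exon (tx_id TEXT, exon_id TEXT, exon_idx INTEGER);
    -- Coordinates copied from the output above; exon IDs are placeholders.
    INSERT INTO exon VALUES
        ('exon_a', 32427904, 32428133), ('exon_b', 32407645, 32407772);
    INSERT INTO tx2exon VALUES
        ('ENST00000380139', 'exon_a', 1), ('ENST00000380139', 'exon_b', 2);
""")

rows = conn.execute(
    """
    SELECT exon_idx, exon_seq_start, exon_seq_end
    FROM tx2exon JOIN exon ON tx2exon.exon_id = exon.exon_id
    WHERE tx_id = ?
    ORDER BY exon_idx
    """,
    ("ENST00000380139",),
).fetchall()

# 1-based inclusive to 0-based half-open: shift the start, keep the end.
bed_ranges = [(start - 1, end) for _idx, start, end in rows]
print(bed_ranges)
```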
<h2 id="summary">Summary</h2>
<p>By downloading the underlying annotation database, one can run the same annotation queries outside of R, which is sometimes helpful. I feel like instead of trying to come up with my own layout of annotation mapping across multiple sources, it is more reliable to use a more official build. It is very hard to get the annotation mapping correct, and there are tons of corner cases that require careful and systematic decisions, so I don’t recommend building your own mapping in the first place anyway. The method here should make annotation queries outside of R a bit easier.</p>
<p>Potentially one could try to copy the full R infrastructure on top of the same underlying database and replicate the same experience in other languages, but it might require substantial work to get the infrastructure done and correct.</p>
<p>EDIT 2017-12-13: Add instructions of using older Ensembl release.<br>
EDIT 2019-01-29: Add the web interface of AnnotationHub.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:r-text-processing">
<p>Based on my impression, my R expert friends would probably recommend that I write it with Rcpp, which I think would be overkill for such a small task. But my impression can be wrong. Feel free to share your thoughts! <a class="footnote-backref" href="#fnref:r-text-processing" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>Use Snakemake on Google cloud2017-08-10T00:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2017-08-10:/posts/2017/08/snakemake-google-cloud/<p><em><strong>TL;DR</strong> Run a RNA-seq pipeline using Snakemake locally and later port it to Google Cloud. Snakemake can parallelize jobs of a pipeline and even across machines.</em></p>
<p><a href="https://snakemake.readthedocs.io/">Snakemake</a> has been my favorite workflow management system for a while. I came across it while writing <a href="https://www.dropbox.com/s/u7aa2mbsto77wwy/thesis_upload.pdf?dl=0">my master’s thesis</a> and from the first look, it already appeared to be extremely flexible and powerful. I got some time to play with it during my lab rotation and now after joining the lab, I am using it in many of my research projects. With more and more projects in the lab relying on virtualization like <a href="https://www.docker.com/">Docker</a>, package management like <a href="https://bioconda.github.io/">bioconda</a>, and cloud computing like <a href="https://cloud.google.com/">Google Cloud</a>, I would like to continue using Snakemake in those scenarios as well. Hence this post to write down all the details.</p>
<p>The post introduces Snakemake by writing the pipeline locally, then gradually moves towards Docker and more Google Cloud products, e.g., Google Cloud Storage, Google Compute Engine (GCE), and Google Container Engine (GKE). The <a href="https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html">Snakemake tutorial</a> is a good place to start to understand how Snakemake works.</p>
<div class="toc">
<ul>
<li><a href="#rna-seq-dataset-and-pipeline-for-demonstration">RNA-seq dataset and pipeline for demonstration</a></li>
<li><a href="#installation-of-snakemake-and-all-related-tools">Installation of snakemake and all related tools</a></li>
<li><a href="#snakemake-local-pipeline-execution">Snakemake local pipeline execution</a><ul>
<li><a href="#genome-reference-index-build-how-to-write-snakemake-rules">Genome reference index build (How to write snakemake rules)</a></li>
<li><a href="#run-snakemake">Run Snakemake</a></li>
<li><a href="#sample-alignment-how-to-write-a-general-rule">Sample alignment (How to write a general rule)</a></li>
<li><a href="#transcript-assement">Transcript assessment</a></li>
<li><a href="#job-dependencies-and-dag">Job dependencies and DAG</a></li>
</ul>
</li>
<li><a href="#snakemake-on-google-cloud">Snakemake on Google Cloud</a><ul>
<li><a href="#move-input-files-to-the-cloud-from-google-cloud-storage">Move input files to the cloud (from Google Cloud Storage)</a></li>
<li><a href="#store-output-files-on-the-cloud">Store output files on the cloud</a></li>
</ul>
</li>
<li><a href="#dockerize-the-environment">Dockerize the environment</a><ul>
<li><a href="#use-google-cloud-storage-in-docker-image">Use Google Cloud Storage in Docker image</a></li>
</ul>
</li>
<li><a href="#google-container-engine-gke">Google Container Engine (GKE)</a><ul>
<li><a href="#potential-issues-of-using-gke-with-snakemake">Potential issues of using GKE with Snakemake</a></li>
</ul>
</li>
<li><a href="#summary">Summary</a></li>
</ul>
</div>
<h2 id="rna-seq-dataset-and-pipeline-for-demonstration">RNA-seq dataset and pipeline for demonstration</h2>
<p>In this example, I will use <code>~/snakemake_example</code> to store all the files and output. Make sure you change all the paths to be relative to the actual folder on your machine.</p>
<p>The demo pipeline will be an RNA-seq pipeline for transcript-level expression analysis, often called the <a href="https://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html"><em>new Tuxedo</em></a> pipeline involving <a href="https://ccb.jhu.edu/software/hisat2/">HISAT2</a> and <a href="https://ccb.jhu.edu/software/stringtie/">StringTie</a>. The RNA-seq dataset is from <a href="https://github.com/griffithlab/rnaseq_tutorial/">Griffith Lab’s RNA-seq tutorial</a> which,</p>
<blockquote>
<p>… consists of two commercially available RNA samples: Universal Human Reference (UHR) and Human Brain Reference (HBR). The UHR is total RNA isolated from a diverse set of 10 cancer cell lines. The HBR is total RNA isolated from the brains of 23 Caucasians, male and female, of varying age but mostly 60-80 years old.</p>
<p>(From the wiki page <a href="[griffith-lab-data]">“RNA-seq Data”</a> of the tutorial)</p>
</blockquote>
<p>Our RNA-seq raw data are the 10% downsampled FASTQ files for these samples. For the human genome reference, only chromosome 22 from GRCh38 is used. The gene annotation is from <a href="http://dec2016.archive.ensembl.org/Homo_sapiens/Info/Index">Ensembl Version 87</a>. Let’s download all the samples and annotations.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span><span class="nb">cd</span><span class="w"> </span>~/snakemake_example
<span class="gp">$ </span>wget<span class="w"> </span>https://storage.googleapis.com/lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr.tar.gz
<span class="gp">$ </span>tar<span class="w"> </span>xf<span class="w"> </span>griffithlab_brain_vs_uhr.tar.gz
</code></pre></div>
<p>Now you should have the following file structure:</p>
<div class="highlight"><pre><span></span><code>~/snakemake_example
├── griffithlab_brain_vs_uhr/
│   ├── GRCh38_Ens87_chr22_ERCC/
│   │   ├── chr22_ERCC92.fa
│   │   └── genes_chr22_ERCC92.gtf
│   └── HBR_UHR_ERCC_ds_10pc/
│       ├── HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz
│       ├── HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz
│       ├── ...
│       ├── UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz
│       └── UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz
└── griffithlab_brain_vs_uhr.tar.gz
</code></pre></div>
<h2 id="installation-of-snakemake-and-all-related-tools">Installation of snakemake and all related tools</h2>
<p>After installing <a href="https://conda.io/miniconda.html">conda</a> and setting up <a href="https://bioconda.github.io/">bioconda</a>, the installation is simple. All the dependencies are kept in a conda environment called <code>new_tuxedo</code>.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>conda<span class="w"> </span>create<span class="w"> </span>-n<span class="w"> </span>new_tuxedo<span class="w"> </span><span class="se">\</span>
<span class="w"> </span><span class="nv">python</span><span class="o">=</span><span class="m">3</span>.6<span class="w"> </span>snakemake<span class="w"> </span>hisat2<span class="w"> </span>stringtie<span class="w"> </span>samtools
<span class="gp">$ </span><span class="nb">source</span><span class="w"> </span>activate<span class="w"> </span>new_tuxedo<span class="w"> </span><span class="c1"># Use the conda env</span>
<span class="gp gp-VirtualEnv">(new_tuxedo)</span> <span class="gp">$ </span>hisat2<span class="w"> </span>--version<span class="w"> </span><span class="c1"># Tools are available in the env</span>
<span class="go">/Users/liang/miniconda3/envs/new_tuxedo/bin/hisat2-align-s version 2.1.0</span>
<span class="go">...</span>
<span class="gp gp-VirtualEnv">(new_tuxedo)</span> <span class="gp">$ </span><span class="nb">source</span><span class="w"> </span>deactivate<span class="w"> </span><span class="c1"># Exit the env</span>
<span class="gp">$ </span>hisat2<span class="w"> </span>--version<span class="w"> </span><span class="c1"># Tools are isolated in the env</span>
<span class="go">bash: hisat2: command not found</span>
</code></pre></div>
<p>All the following steps should be run inside this conda environment unless it’s specified otherwise.</p>
<h2 id="snakemake-local-pipeline-execution">Snakemake local pipeline execution</h2>
<p>The RNA-seq pipeline largely consists of the following steps:</p>
<ol>
<li>Build HISAT2 genome reference index for alignment</li>
<li>Align sample reads to the genome by HISAT2</li>
<li>Assemble per-sample transcripts by StringTie</li>
<li>Merge per-sample transcripts by StringTie</li>
<li>Quantify transcript abundance by StringTie</li>
</ol>
<p>To get a taste of how to write a Snakemake pipeline, I will implement it gradually by breaking it into three major parts: genome reference index build, alignment, and transcript assessment.</p>
<h3 id="genome-reference-index-build-how-to-write-snakemake-rules">Genome reference index build (How to write snakemake rules)</h3>
<p>To build the genome reference, we need to extract the splice sites and exons by two of the HISAT2 scripts, <code>hisat2_extract_splice_sites.py</code> and <code>hisat2_extract_exons.py</code>. Then we call <code>hisat2-build</code> to build the index. Create a new file at <code>~/snakemake_example/Snakefile</code> with the following content:</p>
<div class="highlight"><pre><span></span><code><span class="n">GENOME_FA</span> <span class="o">=</span> <span class="s2">"griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/chr22_ERCC92.fa"</span>
<span class="n">GENOME_GTF</span> <span class="o">=</span> <span class="s2">"griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf"</span>
<span class="n">HISAT2_INDEX_PREFIX</span> <span class="o">=</span> <span class="s2">"hisat2_index/chr22_ERCC92"</span>

<span class="n">rule</span> <span class="n">extract_genome_splice_sites</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span> <span class="n">GENOME_GTF</span>
    <span class="n">output</span><span class="p">:</span> <span class="s2">"hisat2_index/chr22_ERCC92.ss"</span>
    <span class="n">shell</span><span class="p">:</span> <span class="s2">"hisat2_extract_splice_sites.py </span><span class="si">{input}</span><span class="s2"> > </span><span class="si">{output}</span><span class="s2">"</span>

<span class="n">rule</span> <span class="n">extract_genome_exons</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span> <span class="n">GENOME_GTF</span>
    <span class="n">output</span><span class="p">:</span> <span class="s2">"hisat2_index/chr22_ERCC92.exon"</span>
    <span class="n">shell</span><span class="p">:</span> <span class="s2">"hisat2_extract_exons.py </span><span class="si">{input}</span><span class="s2"> > </span><span class="si">{output}</span><span class="s2">"</span>

<span class="n">rule</span> <span class="n">build_hisat_index</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">genome_fa</span><span class="o">=</span><span class="n">GENOME_FA</span><span class="p">,</span>
        <span class="n">splice_sites</span><span class="o">=</span><span class="s2">"hisat2_index/chr22_ERCC92.ss"</span><span class="p">,</span>
        <span class="n">exons</span><span class="o">=</span><span class="s2">"hisat2_index/chr22_ERCC92.exon"</span><span class="p">,</span>
    <span class="n">output</span><span class="p">:</span> <span class="n">expand</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">HISAT2_INDEX_PREFIX</span><span class="si">}</span><span class="s2">.</span><span class="se">{{</span><span class="s2">ix</span><span class="se">}}</span><span class="s2">.ht2"</span><span class="p">,</span> <span class="n">ix</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">9</span><span class="p">))</span>
    <span class="n">log</span><span class="p">:</span> <span class="s2">"hisat2_index/build.log"</span>
    <span class="n">threads</span><span class="p">:</span> <span class="mi">8</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s2">"hisat2-build -p </span><span class="si">{threads}</span><span class="s2"> </span><span class="si">{input.genome_fa}</span><span class="s2"> "</span>
        <span class="s2">"--ss </span><span class="si">{input.splice_sites}</span><span class="s2"> --exon </span><span class="si">{input.exons}</span><span class="s2"> </span><span class="si">{HISAT2_INDEX_PREFIX}</span><span class="s2"> "</span>
        <span class="s2">"2></span><span class="si">{log}</span><span class="s2">"</span>
</code></pre></div>
<p>Overall, a <code>Snakefile</code> is Python-based, so one can define variables and functions, import Python libraries, and use all the string operations as one does in Python source code. Here I defined some constants for the genome reference files (<code>GENOME_FA</code> and <code>GENOME_GTF</code>) and the output index prefix (<code>HISAT2_INDEX_PREFIX</code>) because they get quite repetitive and specifying them up front makes future modifications easier.</p>
<p>In case one hasn’t read the <a href="https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html">Snakemake Tutorial</a>, here is an overview of the Snakemake pipeline execution. A Snakemake rule is similar to a Makefile rule. In a rule, one specifies the input pattern and the output pattern, as well as the command to run. When Snakemake runs, all the outputs the user wants to generate are translated into a set of rules to run. Based on the desired output, Snakemake finds the rule that can generate it (by matching the rule’s output pattern) along with the required input. This search proceeds rule by rule, that is, some input of a rule depends on the output of another rule, until all the inputs are available. Then Snakemake starts to generate the output by running the command each rule gives.</p>
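<p>The backward search described above can be illustrated with a toy resolver in plain Python. This is not how Snakemake is implemented: real rules match wildcard patterns and take existing output files into account, and the shortened file names below are made up. It only shows how asking for one output pulls in the rules that produce its inputs.</p>

```python
# Toy model: rule name -> (input files, output files). Commands are omitted.
RULES = {
    "extract_genome_splice_sites": (["genes.gtf"], ["index.ss"]),
    "extract_genome_exons": (["genes.gtf"], ["index.exon"]),
    "build_hisat_index": (
        ["genome.fa", "index.ss", "index.exon"],
        ["index.1.ht2"],
    ),
}
# Files already on disk, so no rule is needed to produce them.
EXISTING_FILES = {"genes.gtf", "genome.fa"}


def plan(target, order=None):
    """Return the rule names to run, in order, to produce `target`."""
    order = [] if order is None else order
    if target in EXISTING_FILES:
        return order
    for name, (inputs, outputs) in RULES.items():
        if target in outputs:
            # Resolve every input first, then schedule this rule once.
            for path in inputs:
                plan(path, order)
            if name not in order:
                order.append(name)
            return order
    raise ValueError(f"No rule produces {target!r}")


run_order = plan("index.1.ht2")
print(run_order)
```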
<p>Now we can look at the three rules in our current <code>Snakefile</code>.</p>
<p>The first rule <code>extract_genome_splice_sites</code> extracts the genome splice sites. The input file is <code>GENOME_GTF</code> which is the Ensembl gene annotation. The output is a file at <code>hisat2_index/chr22_ERCC92.ss</code>. The command to generate the output from the given input is a shell command. The command contains some variables, <code>{input}</code> and <code>{output}</code>, which Snakemake fills in with the specified input and output. So when the first rule is activated, Snakemake will have the Bash shell run:</p>
<div class="highlight"><pre><span></span><code>hisat2_extract_splice_sites.py<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>><span class="w"> </span>hisat2_index/chr22_ERCC92.ss
</code></pre></div>
<p>The second rule <code>extract_genome_exons</code> is quite similar to the first one, but extracts the genome exons and stores them in <code>hisat2_index/chr22_ERCC92.exon</code>.</p>
<p>The third rule <code>build_hisat_index</code> builds the actual index. The input can consist of multiple files. In this case there are three entries: the chromosome sequence, the splice sites, and the exons. One can refer to a single input entry by its name. For example, <code>{input.genome_fa}</code> means the chromosome sequence FASTA file.</p>
<p>The output of the third rule is <code>expand(f"{HISAT2_INDEX_PREFIX}.{{ix}}.ht2", ix=range(1, 9))</code>, where <code>expand(...)</code> is a Snakemake function that expands a string pattern into a list of strings. In this case the generated index files are <code><index_prefix>.1.ht2</code>, …, <code><index_prefix>.8.ht2</code>. Instead of specifying the output eight times, we use <code>expand</code> and pass a variable <code>ix</code> that iterates from 1 to 8. The double curly brackets escape the <code>f"..."</code> f-string interpolation (see <a href="https://docs.python.org/3/whatsnew/3.6.html#whatsnew36-pep498">the Python documentation</a>). So the whole process to interpret the output is:</p>
<div class="highlight"><pre><span></span><code><span class="n">expand</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">HISAT2_INDEX_PREFIX</span><span class="si">}</span><span class="s2">.</span><span class="se">{{</span><span class="s2">ix</span><span class="se">}}</span><span class="s2">.ht2"</span><span class="p">,</span> <span class="n">ix</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">9</span><span class="p">))</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"hisat2_index/chr22_ERCC92.</span><span class="si">{ix}</span><span class="s2">.ht2"</span><span class="p">,</span> <span class="n">ix</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">9</span><span class="p">))</span>
<span class="s2">"hisat2_index/chr22_ERCC92.1.ht2"</span><span class="p">,</span> <span class="s2">"hisat2_index/chr22_ERCC92.2.ht2"</span><span class="p">,</span> <span class="o">...</span><span class="p">,</span> <span class="s2">"hisat2_index/chr22_ERCC92.8.ht2"</span>
</code></pre></div>
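<p>The same two-step interpretation can be checked in plain Python without Snakemake. In this sketch a list comprehension with <code>str.format</code> stands in for <code>expand(...)</code>; it only mimics the single-variable case used here.</p>
<div class="highlight"><pre><span></span><code>HISAT2_INDEX_PREFIX = "hisat2_index/chr22_ERCC92"

# Step 1: the f-string fills in the prefix and turns {{ix}} into a literal {ix}.
pattern = f"{HISAT2_INDEX_PREFIX}.{{ix}}.ht2"

# Step 2: a stand-in for expand(), filling in {ix} once per value.
index_files = [pattern.format(ix=ix) for ix in range(1, 9)]

print(pattern)
print(index_files[0])
print(index_files[-1])
print(len(index_files))
</code></pre></div>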
<p>For the rest of the entries, such as <code>threads</code> and <code>log</code>, one can find more information at <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html">the Snakemake documentation about Rules</a>.</p>
<h3 id="run-snakemake">Run Snakemake</h3>
<p>Let’s build the genome reference index.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">8</span><span class="w"> </span>-p<span class="w"> </span>build_hisat_index
<span class="go">Provided cores: 8</span>
<span class="go">Rules claiming more threads will be scaled down.</span>
<span class="go">Job counts:</span>
<span class="go"> count jobs</span>
<span class="go"> 1 build_hisat_index</span>
<span class="go"> 1 extract_genome_exons</span>
<span class="go"> 1 extract_genome_splice_sites</span>
<span class="go"> 3</span>
<span class="go">rule extract_genome_exons:</span>
<span class="go"> input: griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf</span>
<span class="go"> output: hisat2_index/chr22_ERCC92.exon</span>
<span class="go"> jobid: 1</span>
<span class="go">hisat2_extract_exons.py griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf > hisat2_index/chr22_ERCC92.exon</span>
<span class="go">...</span>
<span class="go">3 of 3 steps (100%) done</span>
</code></pre></div>
<p>The command <code>snakemake -j 8 -p build_hisat_index</code> means:</p>
<ul>
<li><code>-j 8</code>: Use 8 cores</li>
<li><code>-p</code>: Print the actual command of each job</li>
<li><code>build_hisat_index</code>: The target rule (or output file) to generate</li>
</ul>
<p>If one runs it again, Snakemake won’t do anything since all the outputs are present and up to date.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">8</span><span class="w"> </span>-p<span class="w"> </span>build_hisat_index
<span class="go">Nothing to be done.</span>
</code></pre></div>
<h3 id="sample-alignment-how-to-write-a-general-rule">Sample alignment (How to write a general rule)</h3>
<p>Let’s write the rule to do the sample alignment. Append the following content to the <code>Snakefile</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">SAMPLES</span><span class="p">,</span> <span class="o">*</span><span class="n">_</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s1">'griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/</span><span class="si">{sample}</span><span class="s1">.read1.fastq.gz'</span><span class="p">)</span>
<span class="n">rule</span> <span class="n">align_hisat</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">hisat2_index</span><span class="o">=</span><span class="n">expand</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">HISAT2_INDEX_PREFIX</span><span class="si">}</span><span class="s2">.</span><span class="se">{{</span><span class="s2">ix</span><span class="se">}}</span><span class="s2">.ht2"</span><span class="p">,</span> <span class="n">ix</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">9</span><span class="p">)),</span>
        <span class="n">fastq1</span><span class="o">=</span><span class="s2">"griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/</span><span class="si">{sample}</span><span class="s2">.read1.fastq.gz"</span><span class="p">,</span>
        <span class="n">fastq2</span><span class="o">=</span><span class="s2">"griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/</span><span class="si">{sample}</span><span class="s2">.read2.fastq.gz"</span><span class="p">,</span>
    <span class="n">output</span><span class="p">:</span> <span class="s2">"align_hisat2/</span><span class="si">{sample}</span><span class="s2">.bam"</span>
    <span class="n">log</span><span class="p">:</span> <span class="s2">"align_hisat2/</span><span class="si">{sample}</span><span class="s2">.log"</span>
    <span class="n">threads</span><span class="p">:</span> <span class="mi">4</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s2">"hisat2 -p </span><span class="si">{threads}</span><span class="s2"> --dta -x </span><span class="si">{HISAT2_INDEX_PREFIX}</span><span class="s2"> "</span>
        <span class="s2">"-1 </span><span class="si">{input.fastq1}</span><span class="s2"> -2 </span><span class="si">{input.fastq2}</span><span class="s2"> 2></span><span class="si">{log}</span><span class="s2"> | "</span>
        <span class="s2">"samtools sort -@ </span><span class="si">{threads}</span><span class="s2"> -o </span><span class="si">{output}</span><span class="s2">"</span>

<span class="n">rule</span> <span class="n">align_all_samples</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span> <span class="n">expand</span><span class="p">(</span><span class="s2">"align_hisat2/</span><span class="si">{sample}</span><span class="s2">.bam"</span><span class="p">,</span> <span class="n">sample</span><span class="o">=</span><span class="n">SAMPLES</span><span class="p">)</span>
</code></pre></div>
<p>There are two rules here, but only <code>align_hisat</code> does the real work. The rule looks familiar, but there is something new: an unresolved variable (a wildcard) <code>{sample}</code> in the input, output, and log entries, such as <code>fastq1=".../{sample}.read1.fastq.gz"</code>. This rule therefore applies to every output that matches the pattern <code>align_hisat2/{sample}.bam</code>. For example, given an output <code>align_hisat2/mysample.bam</code>, Snakemake sets <code>sample = "mysample"</code> and looks for the inputs <code>griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/mysample.read1.fastq.gz</code> and <code>.../mysample.read2.fastq.gz</code>.</p>
<p>To get the names of all the samples, we use <code>glob_wildcards(...)</code>, which finds all the files that match the given string pattern and collects the values of each variable in the pattern as a list. Hence all the sample names are stored in <code>SAMPLES</code>. The other rule, <code>align_all_samples</code>, has no command of its own; it takes the BAM files of all samples as input, so requesting it triggers the alignment of every sample.</p>
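<p>To make the wildcard matching concrete, here is a minimal sketch of what <code>glob_wildcards(...)</code> and <code>expand(...)</code> do together, written with the standard <code>re</code> module. The file listing, the sample names, and the <code>pattern_to_regex</code> helper are hypothetical; the real function scans the file system and handles any number of wildcards.</p>
<div class="highlight"><pre><span></span><code>import re

# Hypothetical file listing; glob_wildcards() would scan the file system instead.
files = [
    "HBR_UHR_ERCC_ds_10pc/HBR_Rep1.read1.fastq.gz",
    "HBR_UHR_ERCC_ds_10pc/HBR_Rep1.read2.fastq.gz",
    "HBR_UHR_ERCC_ds_10pc/UHR_Rep1.read1.fastq.gz",
    "HBR_UHR_ERCC_ds_10pc/UHR_Rep1.read2.fastq.gz",
]


def pattern_to_regex(pattern):
    """Turn 'dir/{sample}.read1.fastq.gz' into a regex with one capture group."""
    prefix, suffix = pattern.split("{sample}")
    return re.compile(re.escape(prefix) + "(.+)" + re.escape(suffix))


regex = pattern_to_regex("HBR_UHR_ERCC_ds_10pc/{sample}.read1.fastq.gz")

# Keep the captured value of {sample} for every file that matches the pattern.
samples = sorted(m.group(1) for m in map(regex.fullmatch, files) if m)

# Equivalent of expand("align_hisat2/{sample}.bam", sample=SAMPLES).
bams = [f"align_hisat2/{sample}.bam" for sample in samples]

print(samples)
print(bams)
</code></pre></div>
<p>Only the <code>read1</code> files match the pattern, so each sample name is collected once.</p>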
<p>Now run Snakemake again with a different rule target:</p>
<div class="highlight"><pre><span></span><code>snakemake -j 8 -p align_all_samples
</code></pre></div>
<p>This time, pay attention to the CPU usage (say, using <a href="http://hisham.hm/htop/"><code>htop</code></a>). One should find that Snakemake runs jobs in parallel and tries to use as many cores as possible.</p>
<h3 id="transcript-assement">Transcript assessment</h3>
<p>Let’s complete the whole pipeline by adding all StringTie steps to <code>Snakefile</code>:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="n">rule</span> <span class="n">stringtie_assemble</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">genome_gtf</span><span class="o">=</span><span class="n">GENOME_GTF</span><span class="p">,</span>
        <span class="n">bam</span><span class="o">=</span><span class="s2">"align_hisat2/</span><span class="si">{sample}</span><span class="s2">.bam"</span>
    <span class="n">output</span><span class="p">:</span> <span class="s2">"stringtie/assembled/</span><span class="si">{sample}</span><span class="s2">.gtf"</span>
    <span class="n">threads</span><span class="p">:</span> <span class="mi">4</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s2">"stringtie -p </span><span class="si">{threads}</span><span class="s2"> -G </span><span class="si">{input.genome_gtf}</span><span class="s2"> "</span>
        <span class="s2">"-o </span><span class="si">{output}</span><span class="s2"> -l </span><span class="si">{wildcards.sample}</span><span class="s2"> </span><span class="si">{input.bam}</span><span class="s2">"</span>

<span class="n">rule</span> <span class="n">stringtie_merge_list</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span> <span class="n">expand</span><span class="p">(</span><span class="s2">"stringtie/assembled/</span><span class="si">{sample}</span><span class="s2">.gtf"</span><span class="p">,</span> <span class="n">sample</span><span class="o">=</span><span class="n">SAMPLES</span><span class="p">)</span>
    <span class="n">output</span><span class="p">:</span> <span class="s2">"stringtie/merged_list.txt"</span>
    <span class="n">run</span><span class="p">:</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">output</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="k">for</span> <span class="n">gtf</span> <span class="ow">in</span> <span class="nb">input</span><span class="p">:</span>
                <span class="nb">print</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="n">gtf</span><span class="p">)</span><span class="o">.</span><span class="n">resolve</span><span class="p">(),</span> <span class="n">file</span><span class="o">=</span><span class="n">f</span><span class="p">)</span>

<span class="n">rule</span> <span class="n">stringtie_merge</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">genome_gtf</span><span class="o">=</span><span class="n">GENOME_GTF</span><span class="p">,</span>
        <span class="n">merged_list</span><span class="o">=</span><span class="s2">"stringtie/merged_list.txt"</span><span class="p">,</span>
        <span class="n">sample_gtfs</span><span class="o">=</span><span class="n">expand</span><span class="p">(</span><span class="s2">"stringtie/assembled/</span><span class="si">{sample}</span><span class="s2">.gtf"</span><span class="p">,</span> <span class="n">sample</span><span class="o">=</span><span class="n">SAMPLES</span><span class="p">)</span>
    <span class="n">output</span><span class="p">:</span> <span class="s2">"stringtie/merged.gtf"</span>
    <span class="n">threads</span><span class="p">:</span> <span class="mi">4</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s2">"stringtie --merge -p </span><span class="si">{threads}</span><span class="s2"> -G </span><span class="si">{input.genome_gtf}</span><span class="s2"> "</span>
        <span class="s2">"-o </span><span class="si">{output}</span><span class="s2"> </span><span class="si">{input.merged_list}</span><span class="s2">"</span>

<span class="n">rule</span> <span class="n">stringtie_quant</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">merged_gtf</span><span class="o">=</span><span class="s2">"stringtie/merged.gtf"</span><span class="p">,</span>
        <span class="n">sample_bam</span><span class="o">=</span><span class="s2">"align_hisat2/</span><span class="si">{sample}</span><span class="s2">.bam"</span>
    <span class="n">output</span><span class="p">:</span>
        <span class="n">gtf</span><span class="o">=</span><span class="s2">"stringtie/quant/</span><span class="si">{sample}</span><span class="s2">/</span><span class="si">{sample}</span><span class="s2">.gtf"</span><span class="p">,</span>
        <span class="n">ctabs</span><span class="o">=</span><span class="n">expand</span><span class="p">(</span>
            <span class="s2">"stringtie/quant/{{sample}}/</span><span class="si">{name}</span><span class="s2">.ctab"</span><span class="p">,</span>
            <span class="n">name</span><span class="o">=</span><span class="p">[</span><span class="s1">'i2t'</span><span class="p">,</span> <span class="s1">'e2t'</span><span class="p">,</span> <span class="s1">'i_data'</span><span class="p">,</span> <span class="s1">'e_data'</span><span class="p">,</span> <span class="s1">'t_data'</span><span class="p">]</span>
        <span class="p">)</span>
    <span class="n">threads</span><span class="p">:</span> <span class="mi">4</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s2">"stringtie -e -B -p </span><span class="si">{threads}</span><span class="s2"> -G </span><span class="si">{input.merged_gtf}</span><span class="s2"> "</span>
        <span class="s2">"-o </span><span class="si">{output.gtf}</span><span class="s2"> </span><span class="si">{input.sample_bam}</span><span class="s2">"</span>

<span class="n">rule</span> <span class="n">quant_all_samples</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span> <span class="n">expand</span><span class="p">(</span><span class="s2">"stringtie/quant/</span><span class="si">{sample}</span><span class="s2">/</span><span class="si">{sample}</span><span class="s2">.gtf"</span><span class="p">,</span> <span class="n">sample</span><span class="o">=</span><span class="n">SAMPLES</span><span class="p">)</span>
</code></pre></div>
<p>Most rules are similar to the previous ones except for <code>stringtie_merge_list</code>. This step generates a file that lists the paths to all the samples’ GTF files. Instead of running a command (there is no <code>shell</code> entry), it uses a <code>run</code> entry containing a Python code snippet that generates the file.</p>
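<p>Because the body of a <code>run</code> entry is ordinary Python, the same logic can be tried outside Snakemake. The sketch below wraps it in a hypothetical <code>write_merge_list</code> function and uses made-up sample names and a temporary directory; inside the rule, <code>input</code> and <code>output</code> are provided by Snakemake instead.</p>
<div class="highlight"><pre><span></span><code>import tempfile
from pathlib import Path


def write_merge_list(gtf_paths, list_path):
    """Write one absolute GTF path per line, as the run entry does."""
    with open(list_path, "w") as f:
        for gtf in gtf_paths:
            print(Path(gtf).resolve(), file=f)


# Hypothetical sample names; the list is written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    gtfs = [f"stringtie/assembled/{s}.gtf" for s in ("sampleA", "sampleB")]
    list_path = Path(tmp) / "merged_list.txt"
    write_merge_list(gtfs, list_path)
    lines = list_path.read_text().splitlines()

for line in lines:
    print(line)
</code></pre></div>
<p>Resolving to absolute paths means the list does not depend on the working directory of whatever reads it later.</p>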
<p>Another thing to be noted is the output entry <code>ctabs=...</code> of <code>stringtie_quant</code>. The following lines are equivalent:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Before expansion</span>
<span class="n">ctabs</span><span class="o">=</span><span class="n">expand</span><span class="p">(</span>
    <span class="s2">"stringtie/quant/{{sample}}/</span><span class="si">{name}</span><span class="s2">.ctab"</span><span class="p">,</span>
    <span class="n">name</span><span class="o">=</span><span class="p">[</span><span class="s1">'i2t'</span><span class="p">,</span> <span class="s1">'e2t'</span><span class="p">,</span> <span class="s1">'i_data'</span><span class="p">,</span> <span class="s1">'e_data'</span><span class="p">,</span> <span class="s1">'t_data'</span><span class="p">]</span>
<span class="p">)</span>
<span class="c1"># After expansion</span>
<span class="n">ctabs</span><span class="o">=</span><span class="s2">"stringtie/quant/</span><span class="si">{sample}</span><span class="s2">/i2t.ctab"</span><span class="p">,</span>
      <span class="s2">"stringtie/quant/</span><span class="si">{sample}</span><span class="s2">/e2t.ctab"</span><span class="p">,</span>
      <span class="o">...</span><span class="p">,</span>
      <span class="s2">"stringtie/quant/</span><span class="si">{sample}</span><span class="s2">/t_data.ctab"</span>
</code></pre></div>
<p>The full <code>Snakefile</code> can be found <a href="https://gist.github.com/ccwang002/2659b19439b6205284c0ae68ca06345d">here</a>.</p>
<h3 id="job-dependencies-and-dag">Job dependencies and DAG</h3>
<p>Now that the pipeline is complete, we can look at how all the rules are chained together. Snakemake has a command to generate the job dependency graph (a DAG):</p>
<div class="highlight"><pre><span></span><code>snakemake --dag quant_all_samples | dot -Tsvg > dag.svg
</code></pre></div>
<figure class="full-img">
<img src="https://blog.liang2.tw/posts/2017/08/snakemake-google-cloud/pics/snakemake_rnaseq_dag.svg"/>
<figcaption>Snakemake job dependency graph.</figcaption>
</figure>
<p>Snakemake generates this DAG before execution, where each node represents a job. Jobs that do not depend on each other, directly or indirectly, can be executed in parallel as soon as their inputs exist. This is a powerful feature for pipeline management, since it allows resources to be used at a fine granularity.</p>
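<p>As a rough illustration of how a DAG exposes parallelism, the sketch below groups a small, hypothetical job graph (two made-up samples, A and B) into batches of jobs that are ready at the same time, using <code>graphlib</code> from the Python 3.9+ standard library. It is not Snakemake’s scheduler, which also accounts for threads and other resources.</p>
<div class="highlight"><pre><span></span><code>from graphlib import TopologicalSorter

# Hypothetical job graph: each job maps to the set of jobs it depends on.
deps = {
    "build_hisat_index": set(),
    "align_hisat:A": {"build_hisat_index"},
    "align_hisat:B": {"build_hisat_index"},
    "stringtie_assemble:A": {"align_hisat:A"},
    "stringtie_assemble:B": {"align_hisat:B"},
    "stringtie_merge": {"stringtie_assemble:A", "stringtie_assemble:B"},
}

sorter = TopologicalSorter(deps)
sorter.prepare()

batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
    batches.append(ready)
    sorter.done(*ready)  # mark the whole batch as finished

for number, batch in enumerate(batches, start=1):
    print(number, batch)
</code></pre></div>
<p>The two alignment jobs land in the same batch, and so do the two assembly jobs, which is why they can run side by side.</p>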
<p>A simpler graph that shows rules instead of jobs can be generated by:</p>
<div class="highlight"><pre><span></span><code>snakemake --rulegraph quant_all_samples | dot -Tsvg > ruledag.svg
</code></pre></div>
<figure>
<img src="https://blog.liang2.tw/posts/2017/08/snakemake-google-cloud/pics/snakemake_rnaseq_ruledag.svg"/>
<figcaption>Snakemake rule dependency graph.</figcaption>
</figure>
<h2 id="snakemake-on-google-cloud">Snakemake on Google Cloud</h2>
<p>Now we start to move our Snakemake pipeline to Google Cloud. To complete all the following steps, one needs a Google account and a Google Cloud Storage bucket with write access, so that the output can be uploaded back to Google Cloud Storage. For Snakemake to download and upload files from the cloud, one needs to <a href="https://cloud.google.com/sdk/downloads">set up the Google Cloud SDK on the local machine</a> and create the default application credentials:</p>
<div class="highlight"><pre><span></span><code>gcloud auth application-default login
</code></pre></div>
<p>Also, install the necessary Python package to give Snakemake access to the storage API:</p>
<div class="highlight"><pre><span></span><code>conda install google-cloud-storage
</code></pre></div>
<p>Snakemake actually supports remote files from many more providers. More details can be found in <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html">the Snakemake documentation</a>.</p>
<p>Note that although one can run this section on a local machine, it will be significantly faster on a Google Compute Engine (GCE) instance. Running there also avoids extra bandwidth usage and fees.</p>
<h3 id="move-input-files-to-the-cloud-from-google-cloud-storage">Move input files to the cloud (from Google Cloud Storage)</h3>
<p>Let’s modify the <code>Snakefile</code> to use the reference and FASTQ files from Google Cloud Storage. Replace those file paths with the following:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">snakemake.remote.GS</span> <span class="kn">import</span> <span class="n">RemoteProvider</span> <span class="k">as</span> <span class="n">GSRemoteProvider</span>
<span class="n">GS</span> <span class="o">=</span> <span class="n">GSRemoteProvider</span><span class="p">()</span>
<span class="n">GS_PREFIX</span> <span class="o">=</span> <span class="s2">"lbwang-playground/snakemake_rnaseq"</span>
<span class="n">GENOME_FA</span> <span class="o">=</span> <span class="n">GS</span><span class="o">.</span><span class="n">remote</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">GS_PREFIX</span><span class="si">}</span><span class="s2">/griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/chr22_ERCC92.fa"</span><span class="p">)</span>
<span class="n">GENOME_GTF</span> <span class="o">=</span> <span class="n">GS</span><span class="o">.</span><span class="n">remote</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">GS_PREFIX</span><span class="si">}</span><span class="s2">/griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/genes_chr22_ERCC92.gtf"</span><span class="p">)</span>
<span class="n">HISAT2_INDEX_PREFIX</span> <span class="o">=</span> <span class="s2">"hisat2_index/chr22_ERCC92"</span>
<span class="n">SAMPLES</span><span class="p">,</span> <span class="o">*</span><span class="n">_</span> <span class="o">=</span> <span class="n">GS</span><span class="o">.</span><span class="n">glob_wildcards</span><span class="p">(</span><span class="n">GS_PREFIX</span> <span class="o">+</span> <span class="s1">'/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/</span><span class="si">{sample}</span><span class="s1">.read1.fastq.gz'</span><span class="p">)</span>
<span class="c1"># rule extract_genome_splice_sites:</span>
<span class="c1"># ...</span>
<span class="n">rule</span> <span class="n">align_hisat</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">hisat2_index</span><span class="o">=</span><span class="n">expand</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">HISAT2_INDEX_PREFIX</span><span class="si">}</span><span class="s2">.</span><span class="se">{{</span><span class="s2">ix</span><span class="se">}}</span><span class="s2">.ht2"</span><span class="p">,</span> <span class="n">ix</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">9</span><span class="p">)),</span>
        <span class="n">fastq1</span><span class="o">=</span><span class="n">GS</span><span class="o">.</span><span class="n">remote</span><span class="p">(</span><span class="n">GS_PREFIX</span> <span class="o">+</span> <span class="s2">"/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/</span><span class="si">{sample}</span><span class="s2">.read1.fastq.gz"</span><span class="p">),</span>
        <span class="n">fastq2</span><span class="o">=</span><span class="n">GS</span><span class="o">.</span><span class="n">remote</span><span class="p">(</span><span class="n">GS_PREFIX</span> <span class="o">+</span> <span class="s2">"/griffithlab_brain_vs_uhr/HBR_UHR_ERCC_ds_10pc/</span><span class="si">{sample}</span><span class="s2">.read2.fastq.gz"</span><span class="p">),</span>
    <span class="c1"># ...</span>
</code></pre></div>
<p>Now all the file paths are on Google Cloud Storage under the bucket <code>lbwang-playground</code>. For example, <code>GENOME_FA</code> points to <code>gs://lbwang-playground/snakemake_rnaseq/griffithlab_brain_vs_uhr/GRCh38_Ens87_chr22_ERCC/chr22_ERCC92.fa</code>.</p>
<p>One could launch Snakemake again:</p>
<div class="highlight"><pre><span></span><code>snakemake --timestamp -p --verbose --keep-remote -j 8 quant_all_samples
</code></pre></div>
<h3 id="store-output-files-on-the-cloud">Store output files on the cloud</h3>
<p>Although we could replace all the file paths with <code>GS.remote(...)</code>, there is a simpler way to treat every path as remote through command line options. On top of that, we need to add a <code>FULL_HISAT2_INDEX_PREFIX</code> variable that reflects the path change by prepending the path under the writable bucket. Replace every <code>{WRITABLE_BUCKET_PATH}</code> with the path to a writable Google Cloud Storage bucket.</p>
<div class="highlight"><pre><span></span><code><span class="n">HISAT2_INDEX_PREFIX</span> <span class="o">=</span> <span class="s2">"hisat2_index/chr22_ERCC92"</span>
<span class="n">FULL_HISAT2_INDEX_PREFIX</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{WRITABLE_BUCKET_PATH}</span><span class="s2">/hisat2_index/chr22_ERCC92"</span>
<span class="n">rule</span> <span class="n">build_hisat_index</span><span class="p">:</span>
    <span class="c1"># ...</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s2">"hisat2-build -p </span><span class="si">{threads}</span><span class="s2"> </span><span class="si">{input.genome_fa}</span><span class="s2"> "</span>
        <span class="s2">"--ss </span><span class="si">{input.splice_sites}</span><span class="s2"> --exon </span><span class="si">{input.exons}</span><span class="s2"> </span><span class="si">{FULL_HISAT2_INDEX_PREFIX}</span><span class="s2"> "</span>
        <span class="s2">"2></span><span class="si">{log}</span><span class="s2">"</span>

<span class="n">rule</span> <span class="n">align_hisat</span><span class="p">:</span>
    <span class="c1"># ...</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s2">"hisat2 -p </span><span class="si">{threads}</span><span class="s2"> --dta -x </span><span class="si">{FULL_HISAT2_INDEX_PREFIX}</span><span class="s2"> "</span>
        <span class="s2">"-1 </span><span class="si">{input.fastq1}</span><span class="s2"> -2 </span><span class="si">{input.fastq2}</span><span class="s2"> 2></span><span class="si">{log}</span><span class="s2"> | "</span>
        <span class="s2">"samtools sort -@ </span><span class="si">{threads}</span><span class="s2"> -o </span><span class="si">{output}</span><span class="s2">"</span>
</code></pre></div>
<p>The full <code>Snakefile</code> can be found <a href="https://gist.github.com/ccwang002/2686840e90574a67a673ec4b48e9f036">here</a>. Now run Snakemake with the following options:</p>
<div class="highlight"><pre><span></span><code>snakemake<span class="w"> </span>--timestamp<span class="w"> </span>-p<span class="w"> </span>--verbose<span class="w"> </span>--keep-remote<span class="w"> </span>-j<span class="w"> </span><span class="m">8</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--default-remote-provider<span class="w"> </span>GS<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--default-remote-prefix<span class="w"> </span><span class="o">{</span>WRITABLE_BUCKET_PATH<span class="o">}</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>quant_all_samples
</code></pre></div>
<p>To understand how remote files work, here is the folder structure after the execution:</p>
<div class="highlight"><pre><span></span><code>~/snakemake_example
├── lbwang-playground/
│   └── snakemake_rnaseq/
│       └── griffithlab_brain_vs_uhr/
│           ├── GRCh38_Ens87_chr22_ERCC/
│           └── HBR_UHR_ERCC_ds_10pc/
├── {WRITABLE_BUCKET_PATH}/
│   ├── align_hisat2/
│   ├── hisat2_index/
│   └── stringtie/
└── Snakefile
</code></pre></div>
<p>So Snakemake simply downloads or generates each file locally under a path that mirrors its full path on the remote storage.</p>
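<p>The path mapping can be mimicked with a few lines of Python. This is a sketch of the naming convention only, not of Snakemake’s internals: the two helper functions and the bucket path <code>my-bucket/rnaseq_out</code> are hypothetical.</p>
<div class="highlight"><pre><span></span><code>from pathlib import PurePosixPath


def with_default_prefix(prefix, path):
    """Prepend the default remote prefix to a plain rule path."""
    return str(PurePosixPath(prefix) / path)


def split_remote(path):
    """Split 'bucket/some/key' into the bucket name and the object key."""
    parts = PurePosixPath(path).parts
    return parts[0], "/".join(parts[1:])


# Hypothetical value standing in for {WRITABLE_BUCKET_PATH}.
WRITABLE_BUCKET_PATH = "my-bucket/rnaseq_out"

# The local working copy lives under the same relative path as the remote object.
local_path = with_default_prefix(WRITABLE_BUCKET_PATH, "align_hisat2/sample.bam")
bucket, key = split_remote(local_path)
remote_url = f"gs://{bucket}/{key}"

print(local_path)
print(remote_url)
</code></pre></div>
<p>The first path component is the bucket name, which is why the bucket names show up as top-level folders in the tree above.</p>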
<h2 id="dockerize-the-environment">Dockerize the environment</h2>
<p>Although bioconda has made the package installation very easy, it would be easier to just isolate the whole environment at the operating system level. One common approach is to use <a href="https://www.docker.com/">Docker</a>.</p>
<p>A minimal working Dockerfile would be:</p>
<div class="highlight"><pre><span></span><code><span class="k">FROM</span><span class="w"> </span><span class="s">continuumio/miniconda3</span>
<span class="k">RUN</span><span class="w"> </span>conda<span class="w"> </span>install<span class="w"> </span>-y<span class="w"> </span><span class="nv">python</span><span class="o">=</span><span class="m">3</span>.6<span class="w"> </span>nomkl<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>stringtie<span class="w"> </span>samtools<span class="w"> </span>hisat2<span class="w"> </span>snakemake<span class="w"> </span>google-cloud-storage<span class="w"> </span><span class="se">\</span>
<span class="w"> </span><span class="o">&&</span><span class="w"> </span>conda<span class="w"> </span>clean<span class="w"> </span>-y<span class="w"> </span>--all
</code></pre></div>
<p>However, some details required extra care at the time of writing, so I’ve created a Docker image for this pipeline on Docker Hub, <a href="https://hub.docker.com/r/lbwang/snakemake-conda-rnaseq/"><code>lbwang/snakemake-conda-rnaseq</code></a>. One can run Snakemake with it by:</p>
<div class="highlight"><pre><span></span><code><span class="nb">cd</span><span class="w"> </span>~/snakemake_example
docker<span class="w"> </span>run<span class="w"> </span>-t<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-v<span class="w"> </span><span class="k">$(</span><span class="nb">pwd</span><span class="k">)</span>:/analysis<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>lbwang/snakemake-conda-rnaseq<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">2</span><span class="w"> </span>--timestamp<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-s<span class="w"> </span>/analysis/Snakefile<span class="w"> </span>--directory<span class="w"> </span>/analysis<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>quant_all_samples
</code></pre></div>
<h3 id="use-google-cloud-storage-in-docker-image">Use Google Cloud Storage in Docker image</h3>
<p>To use Google’s Cloud products in a Docker image, one needs to install the <a href="https://cloud.google.com/sdk/downloads">Google Cloud SDK</a> inside the Docker image. Refer to <a href="https://github.com/GoogleCloudPlatform/cloud-sdk-docker/blob/master/debian_slim/Dockerfile">Google’s Dockerfile with Cloud SDK</a> for details. <a href="https://hub.docker.com/r/lbwang/snakemake-conda-rnaseq/"><code>lbwang/snakemake-conda-rnaseq</code></a> already has the Cloud SDK installed, so one can run it with the local gcloud credentials mounted into the container:</p>
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>docker<span class="w"> </span>run<span class="w"> </span>-t<span class="w"> </span>-i<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-v<span class="w"> </span><span class="k">$(</span><span class="nb">pwd</span><span class="k">)</span>:/analysis<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-v<span class="w"> </span>~/.config/gcloud:/root/.config/gcloud<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>lbwang/snakemake-conda-rnaseq<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">4</span><span class="w"> </span>--timestamp<span class="w"> </span>--verbose<span class="w"> </span>-p<span class="w"> </span>--keep-remote<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-s<span class="w"> </span>/analysis/Snakefile<span class="w"> </span>--directory<span class="w"> </span>/analysis<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--default-remote-provider<span class="w"> </span>GS<span class="w"> </span>--default-remote-prefix<span class="w"> </span><span class="s2">"{WRITABLE_BUCKET_PATH}"</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>quant_all_samples
</code></pre></div>
<p>Running Docker on a GCE VM instance requires the host machine (the VM instance) to have Docker installed. One may refer to Docker’s <a href="https://docs.docker.com/engine/installation/linux/docker-ce/debian/#install-using-the-repository">official installation guide</a> to install it. A VM instance by default inherits the user’s permissions (via the automatically created service account), so the command above should work on the GCE instance as well.</p>
<h2 id="google-container-engine-gke">Google Container Engine (GKE)</h2>
<p>To scale pipeline execution across multiple machines, Snakemake can use <a href="https://cloud.google.com/container-engine/">Google Container Engine</a> (GKE, implemented on top of Kubernetes). This method builds on Docker: each node pulls the given Docker image to load the environment. After <a href="https://bitbucket.org/snakemake/snakemake/issues/602">some discussions</a> about how to specify a user-provided image<sup id="fnref:kubernetes-docker"><a class="footnote-ref" href="#fn:kubernetes-docker">1</a></sup>, Snakemake 4.1+ lets one specify the Docker image that the Kubernetes nodes use with <code>--container-image &lt;image&gt;</code>.</p>
<p>To install the master branch of Snakemake, run:</p>
<div class="highlight"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>git+https://bitbucket.org/snakemake/snakemake.git@master
</code></pre></div>
<p>Following Snakemake’s <a href="https://snakemake.readthedocs.io/en/stable/executable.html#executing-a-snakemake-workflow-via-kubernetes">GKE guide</a>, extra packages need to be installed to talk to the GKE (Kubernetes) cluster:</p>
<div class="highlight"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>kubernetes
gcloud<span class="w"> </span>components<span class="w"> </span>install<span class="w"> </span>kubectl
<span class="c1"># or Debian on GCE:</span>
<span class="c1"># sudo apt-get install kubectl</span>
</code></pre></div>
<p>First we create the GKE cluster by:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span><span class="w"> </span><span class="nv">CLUSTER_NAME</span><span class="o">=</span><span class="s2">"snakemake-cluster"</span>
<span class="nb">export</span><span class="w"> </span><span class="nv">ZONE</span><span class="o">=</span><span class="s2">"us-central1-a"</span>
gcloud<span class="w"> </span>container<span class="w"> </span>clusters<span class="w"> </span>create<span class="w"> </span><span class="nv">$CLUSTER_NAME</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--zone<span class="o">=</span><span class="nv">$ZONE</span><span class="w"> </span>--num-nodes<span class="o">=</span><span class="m">3</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--machine-type<span class="o">=</span><span class="s2">"n1-standard-4"</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--scopes<span class="w"> </span>storage-rw
gcloud<span class="w"> </span>container<span class="w"> </span>clusters<span class="w"> </span>get-credentials<span class="w"> </span>--zone<span class="o">=</span><span class="nv">$ZONE</span><span class="w"> </span><span class="nv">$CLUSTER_NAME</span>
</code></pre></div>
<p>This launches 3 GCE VM instances of the <code>n1-standard-4</code> machine type (4 CPUs each), so the cluster has 12 CPUs in total available for computation. Modify the variables to fit one’s setting.</p>
<p>Note that a rule may request more CPUs than any node in the cluster has; for example, the rule <code>build_hisat_index</code> specifies 8 threads. In this case, the cluster cannot find a node with enough free CPUs to forward the job to a <a href="https://kubernetes.io/docs/concepts/workloads/pods/pod/">pod</a>, and the cluster will halt. Therefore, make sure to lower <code>threads</code> to a reasonable number (or use a <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html">configfile</a> to apply it to multiple samples). We will continue to use the same Docker image <a href="https://hub.docker.com/r/lbwang/snakemake-conda-rnaseq/"><code>lbwang/snakemake-conda-rnaseq</code></a> as the Kubernetes container image.</p>
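<p>As a rough sanity check for the numbers above, the sketch below computes the cluster’s total CPU count and tests whether a rule’s <code>threads</code> value fits on a single node. The helper name <code>fits_on_node</code> and the variables are hypothetical and not part of Snakemake or gcloud. It also assumes that all of a node’s CPUs are allocatable, which is optimistic because Kubernetes reserves some CPU for system components.</p>

```shell
# Values mirror the cluster created above; adjust them to your own setting.
NUM_NODES=3
CPUS_PER_NODE=4   # n1-standard-4

# Total CPUs in the cluster, matching the "-j 12" used with --kubernetes.
TOTAL_CPUS=$((NUM_NODES * CPUS_PER_NODE))
echo "total CPUs: ${TOTAL_CPUS}"

# Hypothetical helper (not a Snakemake or gcloud command):
# succeed only if a rule asking for $1 threads fits on one node with $2 CPUs.
fits_on_node() {
    threads="$1"
    cpus="$2"
    if [ "$threads" -le "$cpus" ]; then
        echo "ok: ${threads} threads fit on a ${cpus}-CPU node"
    else
        echo "too large: ${threads} threads exceed a ${cpus}-CPU node"
        return 1
    fi
}

fits_on_node 4 "$CPUS_PER_NODE"
# build_hisat_index asks for 8 threads, which no 4-CPU node can provide.
fits_on_node 8 "$CPUS_PER_NODE" || echo "lower the rule's threads value"
```

<p>The check is deliberately per node rather than per cluster: a job runs in one pod on one node, so 12 CPUs in total do not help a rule that wants 8 on a single machine.</p>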
<p>By default, Snakemake checks whether the output files are outdated, that is, older than the rule that generated them, and skips the jobs whose outputs are up to date. To make sure it re-runs the whole pipeline, one might need to remove the generated output before calling Snakemake again:</p>
<div class="highlight"><pre><span></span><code>gsutil<span class="w"> </span>-m<span class="w"> </span>rm<span class="w"> </span>-r<span class="w"> </span>gs://<span class="o">{</span>WRITABLE_BUCKET_PATH<span class="o">}</span>/<span class="o">{</span>align_hisat2,hisat2_index,stringtie<span class="o">}</span>
</code></pre></div>
<p>Then we are able to run the pipeline again.</p>
<div class="highlight"><pre><span></span><code>snakemake<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--timestamp<span class="w"> </span>-p<span class="w"> </span>--verbose<span class="w"> </span>--keep-remote<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-j<span class="w"> </span><span class="m">12</span><span class="w"> </span>--kubernetes<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--container-image<span class="w"> </span>lbwang/snakemake-conda-rnaseq<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--default-remote-provider<span class="w"> </span>GS<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--default-remote-prefix<span class="w"> </span><span class="o">{</span>WRITABLE_BUCKET_PATH<span class="o">}</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>quant_all_samples
</code></pre></div>
<p>Note that since we changed the container image, we have to make sure that the versions of Snakemake in the Docker image and on the machine starting the pipeline match. An easy way to ensure this is to start the workflow inside the same Docker image.</p>
<p>To connect to the Kubernetes cluster from inside Docker, we need to pass kubectl’s config file as well, which is at <code>~/.kube/config</code>. So the full command becomes:</p>
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>docker<span class="w"> </span>run<span class="w"> </span>-t<span class="w"> </span>-i<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-v<span class="w"> </span><span class="k">$(</span><span class="nb">pwd</span><span class="k">)</span>:/analysis<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-v<span class="w"> </span>~/.config/gcloud:/root/.config/gcloud<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-v<span class="w"> </span>~/.kube/config:/root/.kube/config<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>lbwang/snakemake-conda-rnaseq<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>snakemake<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-s<span class="w"> </span>/analysis/Snakefile<span class="w"> </span>--directory<span class="w"> </span>/analysis<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--timestamp<span class="w"> </span>-p<span class="w"> </span>--verbose<span class="w"> </span>--keep-remote<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-j<span class="w"> </span><span class="m">12</span><span class="w"> </span>--kubernetes<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--container-image<span class="w"> </span>lbwang/snakemake-conda-rnaseq<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--default-remote-provider<span class="w"> </span>GS<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--default-remote-prefix<span class="w"> </span><span class="o">{</span>WRITABLE_BUCKET_PATH<span class="o">}</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>quant_all_samples
</code></pre></div>
<p>After running our pipeline, make sure to delete the GKE cluster by:</p>
<div class="highlight"><pre><span></span><code>gcloud<span class="w"> </span>container<span class="w"> </span>clusters<span class="w"> </span>delete<span class="w"> </span>--zone<span class="o">=</span><span class="nv">$ZONE</span><span class="w"> </span><span class="nv">$CLUSTER_NAME</span>
</code></pre></div>
<h3 id="potential-issues-of-using-gke-with-snakemake">Potential issues of using GKE with Snakemake</h3>
<p>I still encountered the following issues while running the whole pipeline on Kubernetes. They are likely not Snakemake’s fault, but I couldn’t find enough time to dig into the details at the time of writing:</p>
<ul>
<li>HISAT2 cannot build its index on Kubernetes, so the step <code>build_hisat_index</code> failed for an unknown reason. The error message from HISAT2 looks like this:</li>
</ul>
<div class="highlight"><pre><span></span><code>...
Wrote 8912688 bytes to secondary GFM file: {WRITABLE_BUCKET_PATH}/snakemake_demo/hisat2_index/chr22_ERCC92.6.ht2
Index is corrupt: File size for {WRITABLE_BUCKET_PATH}/snakemake_demo/hisat2_index/chr22_ERCC92.6.ht2 should have been 8912688 but is actually 0.
Please check if there is a problem with the disk or if disk is full.
Total time for call to driver() for forward index: 00:01:18
Error: Encountered internal HISAT2 exception (#1)
</code></pre></div>
<h2 id="summary">Summary</h2>
<p>Snakemake is a flexible pipeline management tool that can be run locally and on the cloud. Although it is able to run on Kubernetes such as Google Container Engine, this is a relatively new feature and will take some time to stabilize. Currently, if one wants to run everything (both the computing and the data) on the cloud, using Google Compute Engine and Google Cloud Storage will be the way to go.</p>
<p>Using a 4-core (n1-standard-4) GCE instance, the total time to finish the pipeline locally and via Google Cloud Storage was 3.2 mins and 5.8 mins, respectively. So there is some overhead in transferring files from/to the storage.</p>
<p>Docker and bioconda have made the deployment a lot easier. Bioconda truly saves a lot of duplicated effort in figuring out tool compilation. Docker provides OS-level isolation and a deployment ecosystem. With more tools such as <a href="http://singularity.lbl.gov/">Singularity</a> continuing to come out, virtualization seems to be an inevitable trend.</p>
<p>Other than Google cloud products, Snakemake also supports AWS, S3, LSF, SLURM and many other cluster settings. It seems to me that the day when one <code>Snakefile</code> works for all platforms might be around the corner.</p>
<p>EDIT 2017-08-15: Add a section about using Google Cloud in Docker. Update summary with some time measurements. Add links to the full Snakefiles.<br>
EDIT 2017-09-07: Snakemake has added the support of custom Kubernetes container image. Thus update the GKE section to use the official parameter to pass image.<br>
EDIT 2017-11-17: Add instructions to run Snakemake on Kubernetes inside Docker. Also list the issues of using GKE.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:kubernetes-docker">
<p>In the discussion, Snakemake’s author, Johannes, mentioned the possibility of using <a href="http://singularity.lbl.gov/">Singularity</a> so each rule can run in a different virtual environment. Singularity support comes in Snakemake 4.2+. <a class="footnote-backref" href="#fnref:kubernetes-docker" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
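</div>Variants, eQTL, and MPRA2017-06-20T00:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2017-06-20:/posts/2017/06/variants-eqtl-mpra/<p>This post is mainly based on my notes from several lectures given by Prof. Barak Cohen, looking at coding/noncoding variant modeling and the related experimental technique, MPRA, from a systems biology perspective.</p><p>Computational biology and bioinformatics may be hard to tell apart nowadays, and this post does not intend to dig into the two definitions, but they roughly represent two broad categories of research that apply computer science and programming to biology.</p>
<p>During my master’s program, my lab always encouraged us to come up with new algorithms that make some prediction better or faster, or that use data from more sources; to build new tools; and to integrate new databases. All of these applications have research value and require substantial technical investment, even though those details do not make it into publications. This kind of research leans toward bioinformatics.</p>
<p>Before coming to WashU, I expected myself to go deeper into bioinformatics. However, over the past few months, although I have still been working on data analysis and tool development, I have spent another large share of my time in discussions about models, or rather about “how to answer important biological questions”, which inspired me differently from my master’s training. This other kind of research leans toward computational biology.</p>
<p>This post looks at so-called “modeling” from another angle. The content mainly comes from my notes on several lectures given by Prof. <a href="http://genetics.wustl.edu/bclab/">Barak Cohen</a> on the topic of <em>Coding and Noncoding Variant</em>. My biology background is limited, so please let me know if there are any mistakes in these notes.</p>
<p><strong>Conflict of Interest</strong>: The Cohen Lab developed <a href="http://www.pnas.org/content/109/47/19498.short">CRE-seq</a> (<em>cis</em>-regulatory element by sequencing), one of the MPRA (Massively Parallel Reporter Assay) techniques.</p>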
<div class="toc">
<ul>
<li><a href="#coding-vs-noncoding-variants">Coding vs noncoding variants</a><ul>
<li><a href="#noncoding-elements">Noncoding elements</a></li>
<li><a href="#noncoding-variants">Noncoding variants</a></li>
<li><a href="#endophenotypes">Endophenotypes</a></li>
</ul>
</li>
<li><a href="#eqtl">eQTL</a></li>
<li><a href="#mpra">MPRA</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
</div>
<h2 id="coding-vs-noncoding-variants">Coding vs noncoding variants</h2>
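<p>Let’s start with coding and noncoding variants. In class, the professor let us freely debate the “pros and cons” of studying each, or put differently, which one you would rather study if you were a PI, and the advantages and difficulties you would face.</p>
<p>A coding variant is easy to understand: it is a sequence change in the coding region of a gene. One usually looks first at nonsynonymous variants, where the variant changes an amino acid, which affects the protein’s structure and in turn its function. A synonymous variant does not change the amino acid, but in model organisms people discuss the preference of different amino acids for different tRNAs, which may affect gene expression. On the other hand, it may also affect transcription factor (TF) binding: some TFs prefer particular DNA sequences (motifs) when binding, so even if the protein sequence is unchanged, a change in TF binding can affect the regulation of other genes.</p>
<p>In general, though, studies of coding variants mostly consider nonsynonymous changes. The changes these cause are drastic, so they cannot explain subtle variation such as complex traits or the level of gene expression.</p>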
<h3 id="noncoding-elements">Noncoding elements</h3>
<p>Noncoding variants are much more complicated by comparison. Before discussing them, let’s list the noncoding elements we know of:</p>
<ul>
<li>Introns</li>
<li>Promoters</li>
<li>Regulatory elements (REs)<ul>
<li><em>cis</em>-regulatory elements (CREs): promoters, enhancers</li>
<li><em>trans</em>-regulatory elements</li>
</ul>
</li>
<li>miRNAs</li>
<li>Retroviruses, satellites, centromeres, telomeres</li>
<li>Structural elements<ul>
<li>Matrix Attachment Regions (MARs)</li>
<li>Lamina Associated Domains (LADs)</li>
<li>CTCF/Cohesin</li>
<li>Topologically associating domains (TADs)</li>
<li>3D genome<sup id="fnref:3D genome"><a class="footnote-ref" href="#fn:3D genome">1</a></sup></li>
</ul>
</li>
<li>Methylation</li>
</ul>
<p>Wait, did we forget histone modifications? Barak is deeply skeptical about these epigenetic markers; he considers them merely markers rather than the actual regulatory elements:</p>
<blockquote>
<p>“Something you can measure does not mean it is interesting.”</p>
</blockquote>
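<p>He then suggested that we read a paper criticizing ENCODE<sup id="fnref:ENCODE paper"><a class="footnote-ref" href="#fn:ENCODE paper">2</a></sup>, which he called the most scathing of the past decade; its title is also very amusing.</p>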
<h3 id="noncoding-variants">Noncoding variants</h3>
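<p>From the noncoding elements we can see that gene expression can be controlled from many angles, so when discussing noncoding variants there are many different mechanisms that affect gene expression. Below is a simple diagram of enhancers (REs) and promoters:</p>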
<div class="highlight"><pre><span></span><code> TF1 TF2 TF3 TF4 [RNA PolII]
----[ enhancer ]-------[promoter]--[gene body]-------------------------
<-- Topologically Associating Domain, TAD -----> <-- Another TAD -->
</code></pre></div>
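<p>RNA polymerase II (RNA PolII) is responsible for gene transcription. A promoter is a stretch of nonspecific sequence upstream of the gene body that attracts RNA PolII to raise the gene’s transcription rate, which very likely raises gene expression. A TF may recognize a specific sequence on the promoter, and it attracts PolII more strongly. Besides promoters, there are enhancers, whose distance from the gene body is much less certain: an enhancer may be 10 kb or 100 kb away, yet it can be very close in three-dimensional space, and it can itself recruit TFs and thereby increase Pol affinity. All of this can also be viewed in terms of repression and competition, which produce negative regulation.</p>
<p>An enhancer’s influence has no directionality, meaning genes both upstream and downstream are regulated by the same enhancer. Hence the concept of the TAD: it makes the chromosome form a loop that confines such upstream and downstream interactions in 3D space, so that only REs and genes within the same TAD can interact.
This idea can be extended further to the 3D genome, which considers interactions between different chromosomes. TAD boundaries are determined by certain motifs (for example, CTCF), but the mechanism by which TADs are established and regulated is not yet clear.</p>
<p>The flip side of these complex noncoding interactions is that each interaction likely only changes the level of gene expression rather than acting as a drastic on/off switch. This also means they do not necessarily have a strong effect on the organism, so a change does not necessarily imply a function. Still, different cell types can regulate a whole set of genes through the presence or absence of TFs instead of simply adding more genes. So in many cases it is interesting to understand the effects of noncoding variants.</p>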
<h3 id="endophenotypes">Endophenotypes</h3>
<p>How should we study noncoding variants? First, we need to understand that there are many layers between genotype and phenotype:</p>
<div class="highlight"><pre><span></span><code>DNA (genotype) → RNA → Proteins → Metabolites → Phenotype
</code></pre></div>
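<p>Each intermediate step may or may not pass its effect on to the next stage. At the DNA and RNA levels, however, we have a very good tool, sequencing, which lets us look at a great many genes or regions genome-wide at once. As a result, in most cases we only observe endophenotypes, and we must keep in mind that they differ from the true phenotype.</p>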
<h2 id="eqtl">eQTL</h2>
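<p>An eQTL is one kind of endophenotype. A QTL (quantitative trait loci) is a chromosomal region that can be associated with variation in some quantitative measure (that is, locations that map to some quantitative measures), and an eQTL is an expression QTL, which concerns how variants in a region affect gene expression.</p>
<p>Common eQTL study designs in the past include:</p>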
<ul>
<li>Linkage study</li>
<li>Family tree</li>
<li>GWAS on two groups</li>
</ul>
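<p>All of these look at eQTLs using multiple samples from different people. But for noncoding variants these designs involve too many confounding factors, so why not start from an individual, a single sample, that is, allelic imbalance? However, a single sample runs into a problem inherent to eQTLs: it is hard to narrow a region down further to the variant or variants that are the decisive factors (causal variants).</p>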
<h2 id="mpra">MPRA</h2>
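<p>So we can try to design experiments to explain eQTLs further. Experiments can be designed from two directions: necessity and sufficiency. Necessity can be tested with CRISPR by designing a series of tiled gRNAs that delete an eQTL piece by piece. To parallelize the experiment, growth selection and single-cell sequencing can read out which gRNAs have the most impact.</p>
<p>For sufficiency, we can design a reporter assay to answer the question:</p>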
<div class="highlight"><pre><span></span><code>-----------------------------[weak promoter]--[GFP]--
---[cis RE, CRE]-------------[weak promoter]--[GFP]--
</code></pre></div>
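<p>A reporter assay can be delivered into target cells on a plasmid, but how do we parallelize it to look at many genes at once? This is where MPRA (Massively Parallel Reporter Assay) shines. We can use DNA synthesis to make each <em>cis</em>-regulatory element (CRE) together with a barcode and build a CRE library; RNA-seq then shows the gene expression change caused by each CRE at the same time. There are of course details such as normalizing for DNA amount and barcode efficiency, but MPRA lets us analyze CREs.</p>
<p>What are the limitations of CRE-seq as described here? It is plasmid-based, so there is no histone modification and there are copy-number issues; moreover, its genomic context is only local (TADs, for example, are not considered). How to improve it is the current focus of research in Barak’s lab.</p>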
<h2 id="conclusion">Conclusion</h2>
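<p>I find it fascinating to integrate many concepts from a systems perspective and to propose new experiments and models along the way. How to model the interactions between enhancers and TFs, for instance, is an interesting problem. His lectures were very inspiring and a lot of fun.</p>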
<div class="footnote">
<hr>
<ol>
<li id="fn:3D genome">
<p>On the 3D genome, this paper deserves a mention:<br>
Adrian and Suhas <em>et al.</em>, <a href="http://www.pnas.org/content/112/47/E6456.abstract"><em>Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes</em></a>, PNAS, 2015.<br>
It uses a fractal globule to explain what kind of chromosome folding results when CTCF forms TADs, and how such folding produces long-distance interactions, since the loci may be close in 3D space. The model is used to explain Hi-C data. The mathematics in this paper is so abstract and complex that Shing-Tung Yau was even invited to be a reviewer. <a class="footnote-backref" href="#fnref:3D genome" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:ENCODE paper">
<p>A very famous polemic against ENCODE’s claim that 80% of the genome is functional:<br>Dan <em>et al.</em>, <a href="https://academic.oup.com/gbe/article-lookup/doi/10.1093/gbe/evt028"><em>On the Immortality of Television Sets: “Function” in the Human Genome According to the Evolution-Free Gospel of ENCODE</em></a>, Genome Biol Evol, 2013. <a class="footnote-backref" href="#fnref:ENCODE paper" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>Changing login shell without chsh2017-01-23T00:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2017-01-23:/posts/2017/01/without-chsh/<p>For my daily terminal life, I use <a href="https://fishshell.com/">fish shell</a>. Fish shell can be largely described by the headline on its official website:</p>
<blockquote>
<p><strong>Finally, a command line shell for the 90s</strong><br>
fish is a smart and user-friendly command line
shell for macOS, Linux, and the rest of the family.</p>
</blockquote>
<p>Among all of its features, I particularly enjoy how autocompletion is widely available and easy to use for nearly all common commands and operations. I think I’ve been spoiled by the autocompletion so much that sometimes I get lazy about typing full commands. Undoubtedly, fish is my login shell, replacing the ubiquitous Bash shell.</p>
<p>Most of the time, one can change the login shell with the command <code>chsh</code>. For <code>chsh</code> to accept the new fish shell, it must be added to the list of accepted shells, which requires root permission. However, on many occasions, including working on a large shared server, one may not have permission to add a new shell, so the options for the login shell are often limited.</p>
<h4 id="replacing-login-shell-by-exec">Replacing login shell by <code>exec</code></h4>
<p>An alternative is to call the new shell when the current shell starts. A POSIX-compliant shell<sup id="fnref:posix"><a class="footnote-ref" href="#fn:posix">1</a></sup> should always read the <code>.profile</code> configuration file upon login, so putting the following command there executes fish and replaces the process of the currently running shell (usually Bash) with it.</p>
<div class="highlight"><pre><span></span><code><span class="nb">exec</span><span class="w"> </span>-l<span class="w"> </span><span class="nv">$SHELL</span><span class="w"> </span>-l
</code></pre></div>
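<p>The first <code>-l</code> is an option of Bash’s <code>exec</code> that makes the new process look like a login shell by prefixing its name with a dash; the second <code>-l</code> is passed to the new shell and tells it to act as a login shell. More explanation about login and non-login (as well as (non-)interactive) shells can be found in <a href="http://unix.stackexchange.com/a/46856">this Stack Exchange answer</a>. By pointing <code>$SHELL</code> to the desired shell binary, one can achieve behavior similar to <code>chsh</code>.</p>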
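<p>To see what replacing the process means, here is a small sketch that does not involve fish at all: a shell prints its process ID, then calls <code>exec</code> to start another shell, which prints its own. The function name <code>show_exec_keeps_pid</code> is made up for this illustration. Because <code>exec</code> replaces the running process instead of forking a child, both printed lines should show the same PID.</p>

```shell
# Hypothetical demo function, not part of any shell's tooling.
show_exec_keeps_pid() {
    # The outer sh prints its PID, then exec replaces it with a second sh.
    # \$\$ is escaped so that the second sh expands it, not the first one.
    sh -c 'echo "$$"; exec sh -c "echo \$\$"'
}

show_exec_keeps_pid
```

<p>Once <code>exec</code> succeeds, the original shell is gone, so if the new program crashes afterwards there is nothing left to fall back to.</p>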
<p>However, putting <code>exec</code> in the login profile comes with a risk: if the new shell executable crashes (e.g. a broken symlink, erroneous compilation, or failed dynamic library linking), one cannot establish a proper shell connection. I’ve experienced these catastrophic failures a couple of times. Since the new shell crashes while the original shell is replacing its process, one cannot set up a proper terminal session, or simply put, one fails to log in. It was not fun at all to recover.</p>
<h4 id="fail-safe-shell-changing">Fail-safe shell changing</h4>
<p>To provide a fail-safe mechanism, I use the following code in my <code>~/.profile</code> to change the shell. Only login shells will read this file so it won’t be executed when one runs a bash script.</p>
<div class="highlight"><pre><span></span><code><span class="nv">FISH_BIN</span><span class="o">=</span><span class="s2">"</span><span class="nv">$HOME</span><span class="s2">/.linuxbrew/bin/fish"</span>
<span class="c1"># The replacement is only done in non-fish login interactive shell in</span>
<span class="c1"># SSH connection and fish executable exists.</span>
<span class="k">if</span><span class="w"> </span><span class="o">[</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span><span class="s2">"</span><span class="nv">$SHELL</span><span class="s2">"</span><span class="w"> </span>!<span class="o">=</span><span class="w"> </span><span class="s2">"</span><span class="nv">$FISH_BIN</span><span class="s2">"</span><span class="w"> </span>-a<span class="w"> </span>-n<span class="w"> </span><span class="s2">"</span><span class="nv">$SSH_TTY</span><span class="s2">"</span><span class="w"> </span>-a<span class="w"> </span>-x<span class="w"> </span><span class="s2">"</span><span class="nv">$FISH_BIN</span><span class="s2">"</span><span class="w"> </span><span class="se">\</span>
<span class="o">]</span><span class="w"> </span><span class="p">;</span><span class="w"> </span><span class="k">then</span>
<span class="w"> </span><span class="c1"># we first check whether fish can be executed, otherwise the</span>
<span class="w"> </span><span class="c1"># replacement will cause immediate crash at login (not fun)</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="s2">"</span><span class="nv">$FISH_BIN</span><span class="s2">"</span><span class="w"> </span>-c<span class="w"> </span><span class="s1">'echo "Test fish running" >/dev/null'</span><span class="w"> </span><span class="p">;</span><span class="w"> </span><span class="k">then</span>
<span class="w"> </span><span class="nb">export</span><span class="w"> </span><span class="nv">SHELL</span><span class="o">=</span><span class="s2">"</span><span class="nv">$FISH_BIN</span><span class="s2">"</span>
<span class="w"> </span><span class="nb">echo</span><span class="w"> </span><span class="s2">"One can launch the fish shell by 'exec -l \$SHELL -l'"</span>
<span class="w"> </span><span class="c1"># exec -l $SHELL -l # launch the fish login shell</span>
<span class="w"> </span><span class="k">else</span>
<span class="w"> </span><span class="nb">echo</span><span class="w"> </span><span class="s2">"Failed to launch fish shell. Go check its installation!"</span>
<span class="w"> </span><span class="nb">echo</span><span class="w"> </span><span class="s2">"Fall back to default shell </span><span class="nv">$SHELL</span><span class="s2"> ..."</span>
<span class="w"> </span><span class="k">fi</span>
<span class="k">fi</span>
</code></pre></div>
<p>Basically, we verify that the fish executable works by running <code>echo '...'</code> in it before we change the shell. By uncommenting the <code>exec ..</code> line, one gets directed to the fish shell automatically, but the safest option is to run the shell change oneself. This kind of “shell swapping” only happens when we log in to the server via SSH.</p>
<p>The actual setting is quite straightforward. The backstory is that I messed up a few times and was lucky enough to have kept another session alive. At first I only checked <code>fish --version</code>, but that was not sufficient, since it didn’t actually execute fish’s main code. I got an illegal instruction error after changing to fish even though printing its version was fine.</p>
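<p>The same check can be wrapped in a small reusable function. This is a minimal sketch with names made up for illustration: <code>shell_is_usable</code> is a hypothetical helper, and <code>/bin/sh</code> and <code>/nonexistent/fish</code> are stand-in paths chosen so the example runs anywhere. It asks the candidate shell to run a real command and compares the output, rather than only printing its version.</p>

```shell
# Hypothetical helper: succeed only if the candidate shell can run a command.
shell_is_usable() {
    candidate="$1"
    # -x: the file exists and is executable by the current user.
    [ -x "$candidate" ] || return 1
    # Run a real command through the shell instead of just asking for --version,
    # so a binary that starts but cannot execute its main code is rejected.
    [ "$("$candidate" -c 'echo ok' 2>/dev/null)" = "ok" ]
}

if shell_is_usable /bin/sh; then
    echo "/bin/sh is usable"
fi
if ! shell_is_usable /nonexistent/fish; then
    echo "/nonexistent/fish is not usable, keep the default shell"
fi
```

<p>In <code>~/.profile</code>, one would call it with the real fish path before exporting <code>SHELL</code>, keeping the fallback branch for when the check fails.</p>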
<div class="footnote">
<hr>
<ol>
<li id="fn:posix">
<p>Interestingly, fish is not a POSIX-compliant shell, so it won’t read the <code>~/.profile</code> configuration file. <a class="footnote-backref" href="#fnref:posix" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>GPG Key Transition2016-12-06T00:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-12-06:/posts/2016/12/gpg-key-transition-2016/<p>I started using a GPG key as one of my small experiments in March 2015. Throughout the setup, I made some mistakes, which I revoked later, and explored several usage scenarios. Although, as the post <a href="https://blog.filippo.io/giving-up-on-long-term-pgp/"><em>I’m giving up on PGP</em></a> says, I don’t really use encryption in daily email communication, it is still good to have an online identity.</p>
<p>A year later, which is 3 months before my <em>experimental</em> key expires, I think now is a good time to roll out a new one. I followed Alex’s post <a href="https://alexcabal.com/creating-the-perfect-gpg-keypair/"><em>Creating the Perfect GPG Keypair</em></a> to create a signing subkey for daily usage and keep my master key separately in a safe place. The following is my transition statement.</p>
<div class="highlight"><pre><span></span><code>-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
I am transitioning GPG keys from an old 4192-bit RSA key to a new
4096-bit RSA key. The old key will continue to be valid for some
time, but I prefer all new correspondance to be encrypted in the new
key, and will be making all signatures going forward with the new key.
This transition document is signed with both keys to validate the
transition.
If you have signed my old key, I would appreciate signatures on my new
key as well, provided that your signing policy permits that without
reauthenticating me.
The old key, which I am transitioning away from, is:
pub 4096R/30A45011B233544E 2015-03-21 [expires: 2017-03-22]
Key fingerprint = 6ECD C5B8 235C D44D 2471 866E 30A4 5011 B233 544E
The new key, to which I am transitioning to, is:
pub 4096R/44D10E44730992C4 2016-12-04 [expires: 2018-12-04]
Key fingerprint = 85DF A3EB 72CD DE7D 3F2A 127C 44D1 0E44 7309 92C4
To fetch the full new key from a public key server using GnuPG, run:
gpg --keyserver hkps://hkps.pool.sks-keyservers.net --recv-key 44D10E44730992C4
If you have already validated my old key, you can then validate that
the new key is signed by my old key:
gpg --check-sigs 44D10E44730992C4
If you then want to sign my new key, a simple and safe way to do that
is by using caff (shipped in Debian as part of the "signing-party"
package) as follows:
caff 44D10E44730992C4
Please contact me via e-mail at <me@liang2.tw> if you have any
questions about this document or this transition.
Liang-Bo Wang
(liang2, ccwang002)
me@liang2.tw
Dec 06, 2016
-----BEGIN PGP SIGNATURE-----
iQIcBAEBCgAGBQJYR5q3AAoJEPhHjvdXantXlzEP/iEgSd2NcfcBThmrY84U+MXR
UOLED3Ax6YvDUv/nInkMAH74SyqujeF7E7+ZuZmDEWRCVS6pQtpuLTvKBviDPyWx
W/hS03AU5nV9llSYZ4I/FzQdVtdY5PBBNCHxK34LoqJQVr3LPdAQOO2m9g8M11z0
+7FjmyNOjvZIxqhU+PK7VNEcZQ9X30ndjgkwCZQFE/8Wz9FnPt5QdwZoxNRBfx7Q
tMtHpMxKNTHV1t3lCcOubf5zQLFQ07SZv2f2rmfDPlsrvzp4bzq8QWEEo/XvClRc
hyFpbM55FqlJE+5lg/Dj/XC5AN0LS8HNd9x7UBWZnNQhg00elc+CXxgNc+wMVj+N
uPXYP2n1oZ4T4Fr45eFg9nJagBpIUsu+M5hoNGXtCdLcYPzJeERPebt/VZJQybse
T60hO8K15A5WCkZYnw1mXu/JangNGY8Bxq3xm7VzXHLJktf33gIIMUQC2ZCn/I0N
MQIjkMrARFpsGfb1DEglyWk/QdK4A8Cy3eKsNaKvz4A+PMr4Eskn8zhO05yJQlux
3IkXt1lnn2YTF5fjInYQPo22bNCuub4qoJhthOoySy2Zv04NHnOtmYPBZMv14aZe
MPcv4Kvn8szNeBRDT/zKqCWoHmxIbxIs7ZvSIfvj/NpugtkSkJVtvr7gpXmqObcc
L7kJSEXjkLiqvFrq6MkxiQIcBAEBCgAGBQJYR5q3AAoJEDCkUBGyM1ROlzEQAJ3i
gpH6z/rHrAVCNxru7ATLmVZYGF0uxLvth0hUnOvmhWb7a60v4KwRgTBFJ9vdUB24
MW0T0BdxN8zJPrN6hGj9RxML5UpzH///oeL1gINM8IEhZWaG1/th7bx5f/ip8xN+
dbkA8Hp3LW/LAB09uJOITLbLaPa+N2Umcsu6stPXL+Z/06JSUYIliDRDkzzpb/qw
/OZD6sj1oI25A7KYEUPiNn+FxtBmNiFetDqwhCJSglEF3SBl8ZlrbgMDxIudZX/5
+ihTn2Za5q59c7u2ESMmInP1n8/lFxYxi/DWE2n8vrw84PwQ5lG5zdiiYQf78QeB
j77giQzYibzvRHZlslJEM0lSeNLQ72svT5SIFB+45wqtfVIAfZCxTppv35MkpyDw
gtYW/zL6U+Qx+chPgVpBLkpC7LbBvrJozIU0oHw8V837IByaeqBPu9rm+F3M++Mo
taLmkzNvhX6wozw9Tj0gnW6e8ytH7Xi8K8IYO7xSSOGih/oKF2PrWPd8gufMiIML
lOtcuwZOCQqAB2yAQ2BHliwrm78XELARZXM1sbWJTpXBJPAZ+ZbvnNFK6fUwnclK
H35TsvRJK7hH+4d10EdURleyRj7d0EcXlHqki4urKlwSzRebLzq365vADzXEjFYp
DmfC2ISS64uLqHgJ3HHxhSmTLdc8KSJqzFi90ZUu
=JS7v
-----END PGP SIGNATURE-----
</code></pre></div>PhD Life in St. Louis2016-10-02T00:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-10-02:/posts/2016/10/phd-life-in-st-louis/<p>Before I knew it, four months had passed since my last post. In those four months I worked as a <a href="https://tw.pycon.org/2016/">PyCon TW 2016</a> staff member, finished my <a href="https://github.com/ccwang002/master-thesis">master's thesis</a>, <a href="https://www.facebook.com/photo.php?fbid=1388036924547016&set=a.181592455191475.49590.100000221666201&type=3&theater">graduated</a> from the school where I had studied for 7 years, and then …</p><p>Before I knew it, four months had passed since my last post. In those four months I worked as a <a href="https://tw.pycon.org/2016/">PyCon TW 2016</a> staff member, finished my <a href="https://github.com/ccwang002/master-thesis">master's thesis</a>, <a href="https://www.facebook.com/photo.php?fbid=1388036924547016&set=a.181592455191475.49590.100000221666201&type=3&theater">graduated</a> from the school where I had studied for 7 years, and then moved to the US to start a <a href="http://dbbs.wustl.edu/divprograms/compbio/Pages/default.aspx">Bioinformatics</a> PhD at <a href="https://wustl.edu/">Washington University in St. Louis</a>.</p>
<p>Apart from still getting to know the new environment and new friends, life has mostly settled down. With all the school and student-run events at the start of the semester, plus adapting to new courses and labs, my projects have made little progress. Still, here is a brief update.</p>
<h2 id="_1">About my school and city</h2>
<figure>
<img src="https://blog.liang2.tw/posts/2016/10/phd-life-in-st-louis/pics/STL_map.png"/>
<figcaption>Location of St. Louis.</figcaption>
</figure>
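<p>I suspect most people are not familiar with the city of <a href="https://en.wikipedia.org/wiki/St._Louis">St. Louis</a> (STL) or with Washington University in St. Louis (WashU or WUSTL). Starting with St. Louis: it is a mid-sized city in the American Midwest. It is more rural than cities on the East and West Coasts, but the area around campus is not endless farmland either. During the westward expansion of the 19th century it was an important land and river port for those heading west, and the city's landmark is a very tall arch. STL hosted the 1904 Summer Olympics and World's Fair, so much of the city's infrastructure was built or renewed around then, including quite a few of our university's buildings and facilities.</p>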
<figure class="full-img">
<img src="https://blog.liang2.tw/posts/2016/10/phd-life-in-st-louis/pics/stl_cardinals.jpg"/>
<figcaption>Busch Stadium, home ballpark of the MLB's Cardinals. The arch in the background is the landmark of St. Louis.</figcaption>
</figure>
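<p>The climate has four distinct seasons. Summer can reach 37°C (100°F). Autumn has just begun, with large swings between morning and evening temperatures (15 to 25°C). The coming winter will be colder (-5 to 10°C) with a few days of snow, though I can probably only describe it properly after living through one. Day-to-day life is not inconvenient, but there may be fewer entertainment and leisure options than in a big city, and the local Chinese community is smaller. On the other hand, the cost of living is lower and housing is affordable. Before coming I worried about internet speed, since most places do not offer residential fiber. My current broadband is 100MBps, though, and actual transfers sustain that speed and are occasionally faster, so it is acceptable for now. For a geek, a fast enough connection means being connected to his whole world.</p>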
<figure>
<img src="https://blog.liang2.tw/posts/2016/10/phd-life-in-st-louis/pics/stl_symphony_outdoor_concert.jpg"/>
<figcaption>An outdoor concert by the STL Symphony in Forest Park, which is a family outing here: picnicking on the grass while listening to the performance.</figcaption>
</figure>
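<p>Next, about my school, <a href="https://en.wikipedia.org/wiki/Washington_University_in_St._Louis">Washington University in St. Louis</a>, WashU or WUSTL for short. Many schools are named after Washington, for example the public University of Washington (UW) in Seattle, but they are unrelated to one another. Saying WashU often causes confusion, so people frequently write WUSTL to disambiguate, which is also the school's domain name. Here is where the school is located:</p>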
<figure>
<img src="https://blog.liang2.tw/posts/2016/10/phd-life-in-st-louis/pics/WUSTL_map.png"/>
<p class="caption">Location of the WUSTL campuses.
Ref: <a href="http://www.openstreetmap.org/#map=15/38.6373/-90.2829&layers=T">OpenStreetMap</a>
</p>
</figure>
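<p>A large park in the middle splits the school into two main campuses. The west one is the main campus, home to most of the schools. The east one is the medical campus, with several hospitals and the departments related to biomedical research. My program is on the medical campus. To me, the school feels about the same size as National Taiwan University's Gongguan campus. The park in the middle, Forest Park, measures about 2.5 x 1.6 km and was the main site of the World's Fair. Today it contains museums, a botanical garden, a zoo, and more. Many outdoor events are held there, and it is also a place to exercise.</p>
<p>The buildings on the west main campus are in the Collegiate Gothic style. They are beautiful and look just like one's idea of an old university, and the school is in fact more than 160 years old. Although it is not well known in Taiwan, it is a research-oriented private university ranked in the world's "top 100", with 25 Nobel laureates. The buildings look old, but the school is constantly renovating or constructing new ones while keeping the style of the existing architecture.</p>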
<figure>
<img src="https://blog.liang2.tw/posts/2016/10/phd-life-in-st-louis/pics/wustl_campus.jpg"/>
<figcaption>Brookings Hall, the entrance building of the west main campus (Danforth Campus).</figcaption>
</figure>
<p>The medical campus is comparatively modern. For example, the building below is where my lab is located.</p>
<figure>
<img src="https://blog.liang2.tw/posts/2016/10/phd-life-in-st-louis/pics/mckinley_building.jpg"/>
<figcaption>McKinley Research Building, completed in 2015.</figcaption>
</figure>
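<p>I will not go into my research here; I will share more once I am more familiar with the environment. If you are interested in the PhD application process, I wrote a post on <a href="https://www.ptt.cc/bbs/studyabroad/M.1462818195.A.882.html">preparing PhD applications on ptt</a>.</p>
<h2 id="whats-next">What’s Next?</h2>
<p>After taking some time to settle in, I find that life here leaves a fair amount of personal time: apart from first-year classes and homework, I still have time of my own. I have accumulated 6 drafts so far, mostly about Django. After finishing my master's I had little chance to build websites, until I recently made a proposal review system for the lab. I will tidy the drafts up when I get the chance.</p>
<p>Back in Taiwan there is also the Python documentation translation project. Unfortunately I could not finish it before leaving, but it remains my goal for this year. For now it is left to develop on its own, though I would like to wrap up the tutorial part before the <a href="https://www.python.org/dev/peps/pep-0494/">release of Python 3.6</a> at the end of the year. Besides this project, I am relearning C and reading a bit about Rust to practice the fundamentals. I have not joined any local in-person open source community yet, partly because going out at night is a bit more dangerous here than in Taiwan. Being in a US time zone, however, makes it easy to follow discussions on IRC and mailing lists, so I do not feel cut off from new developments. My goal is to attend next year's PyCon US.</p>
<p>By the way, the blog has a new theme with a more Medium-like style, which should be easier to read. I have not had time to polish the details, so for now my approach is to leave things as they are unless something really bothers me. I hope to resume posting regularly soon.</p>Deploy Django with a conda env2016-05-24T00:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-05-24:/posts/2016/05/django-deploy-conda/<p>Just a few days ago I deployed a Django site and wrote it up in <a href="https://blog.liang2.tw/posts/2016/05/django-deploy-uwsgi-nginx-systemd/">"Deploying Django with uWSGI, nginx, and systemd"</a>. Today I deployed another project. The deployment setup is the same as last time:</p>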
<div class="highlight"><pre><span></span><code>nginx -- unix socket -- uWSGI -- Django
</code></pre></div>
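<p>As before, I wrote …</p><p>Just a few days ago I deployed a Django site and wrote it up in <a href="https://blog.liang2.tw/posts/2016/05/django-deploy-uwsgi-nginx-systemd/">"Deploying Django with uWSGI, nginx, and systemd"</a>. Today I deployed another project. The deployment setup is the same as last time:</p>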
<div class="highlight"><pre><span></span><code>nginx -- unix socket -- uWSGI -- Django
</code></pre></div>
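<p>As before, I wrote a systemd unit, <code>PROJ.service</code>, to manage starting the site (uWSGI). Whenever <code>PROJ</code> appears below, replace it with your own project name; replace <code>USER</code> with the account that runs the site.</p>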
<div class="toc">
<ul>
<li><a href="#conda">conda</a></li>
<li><a href="#uwsgi-path">uWSGI and $PATH</a></li>
<li><a href="#sysmted-unit">Using environment variables in a systemd unit</a></li>
<li><a href="#_1">Conclusion</a></li>
</ul>
</div>
<h3 id="conda">conda</h3>
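<p><a href="http://conda.pydata.org/">conda</a> is a package management system for Python. Its advantage is that when a package depends on external libraries, conda installs and manages those libraries as well, and it can also manage different Python versions. Think of it as an enhanced pip + venv. conda is compatible with pip.</p>
<p>This Django project uses many packages such as numpy and pandas. For easier maintenance, I decided to install them with conda. I use <a href="http://conda.pydata.org/miniconda.html">miniconda3</a>, which installs under <code>~/miniconda3</code> by default; virtual environments live in <code>~/miniconda3/envs/</code>.</p>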
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>conda<span class="w"> </span>create<span class="w"> </span>-n<span class="w"> </span>VENV<span class="w"> </span><span class="nv">python</span><span class="o">=</span><span class="m">3</span>.5<span class="w"> </span>numpy<span class="w"> </span>pandas<span class="w"> </span>django
$<span class="w"> </span><span class="nb">source</span><span class="w"> </span>activate<span class="w"> </span>VENV
<span class="o">(</span>VENV<span class="o">)</span><span class="w"> </span>$<span class="w"> </span>pip<span class="w"> </span>install<span class="w"> </span>uwsgi
</code></pre></div>
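<p>uWSGI is not available in conda, so I installed it with pip. As we know from <a href="https://blog.liang2.tw/posts/2016/05/django-deploy-uwsgi-nginx-systemd/">the previous post</a>, it does not need to be installed system-wide.</p>
<h3 id="uwsgi-path">uWSGI and $PATH</h3>
<p>In theory, the rest should simply follow the previous steps, but I ran into a problem with uWSGI:</p>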
<div class="highlight"><pre><span></span><code><span class="go">$ sudo /home/USER/miniconda3/envs/VENV/bin/uwsgi --ini PROJ.ini</span>
<span class="go">[uWSGI] getting INI configuration from PROJ.ini</span>
<span class="go">*** Starting uWSGI 2.0.13.1 (64bit) on [Wed May 25 08:04:23 2016] ***</span>
<span class="go">compiled with version: 5.3.1 20160413 on 25 May 2016 01:35:28</span>
<span class="go">os: Linux-4.4.0-22-generic #40-Ubuntu SMP Thu May 12 22:03:46 UTC 2016</span>
<span class="go">nodename: s66</span>
<span class="go">machine: x86_64</span>
<span class="go">clock source: unix</span>
<span class="go">detected number of CPU cores: 24</span>
<span class="go">current working directory: /etc/uwsgi/vassals</span>
<span class="go">detected binary path: /home/USER/miniconda3/envs/VENV/bin/uwsgi</span>
<span class="go">……</span>
<span class="go">chdir() to /path/to/PROJ/</span>
<span class="go">your processes number limit is 514650</span>
<span class="go">your memory page size is 4096 bytes</span>
<span class="go">detected max file descriptor number: 1024</span>
<span class="go">lock engine: pthread robust mutexes</span>
<span class="go">thunder lock: disabled (you can enable it with --thunder-lock)</span>
<span class="go">uwsgi socket 0 bound to UNIX address /run/PROJ/django.sock fd 3</span>
<span class="go">Python version: 3.5.1 |Continuum Analytics, Inc.| (default, Dec 7 2015, 11:16:01) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]</span>
<span class="go">Set PythonHome to /home/USER/miniconda3/envs/VENV</span>
<span class="go">Failed to import the site module</span>
<span class="gt">Traceback (most recent call last):</span>
File <span class="nb">"/usr/lib/python3.5/site.py"</span>, line <span class="m">580</span>, in <span class="n"><module></span>
<span class="w"> </span><span class="n">main</span><span class="p">()</span>
<span class="gr"> …… </span>
<span class="gr"> File "/usr/lib/python3.5/_sysconfigdata.py", line 6, in <module></span>
<span class="gr"> from _sysconfigdata_m import *</span>
<span class="gr">ImportError</span>: <span class="n">No module named '_sysconfigdata_m'</span>
</code></pre></div>
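<p>The steps were so simple that I could not figure out what went wrong, and searching online turned up nothing relevant. I was stuck here for a long time.</p>
<p>Eventually I noticed that, in the traceback, uWSGI was reading <code>/usr/lib/python3.5/site.py</code>. Some environment setting must therefore be wrong, leading it to a Python environment we did not want; it should have found <code>/home/USER/miniconda3/envs/VENV/lib/python3.5/site.py</code> instead.</p>
<p>After some trial and error, I found that modifying the <code>$PATH</code> environment variable was enough to make it work.</p>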
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>sudo<span class="w"> </span>-i
<span class="gp"># </span><span class="nb">echo</span><span class="w"> </span><span class="nv">$PATH</span>
<span class="go">/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin</span>
<span class="gp"># </span><span class="nb">export</span><span class="w"> </span><span class="nv">PATH</span><span class="o">=</span>/home/USER/miniconda3/envs/VENV/bin:<span class="nv">$PATH</span>
<span class="gp"># </span>/home/USER/miniconda3/envs/VENV/bin/uwsgi<span class="w"> </span>--ini<span class="w"> </span>PROJ.ini
</code></pre></div>
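<p>Whatever the exact mechanism inside uWSGI, the fix relies on how <code>$PATH</code> lookups behave: directories are searched from left to right and the first match wins. The Python sketch below only illustrates that ordering rule. The temporary directories and the empty <code>python3</code> files are hypothetical stand-ins for <code>/usr/bin</code> and the conda env's <code>bin/</code>, it assumes a POSIX system, and it does not reproduce uWSGI's own start-up logic.</p>

```python
import os
import shutil
import stat
import tempfile


def make_fake_executable(directory, name):
    """Create a tiny executable file called `name` inside `directory`."""
    path = os.path.join(directory, name)
    with open(path, "w") as fh:
        fh.write("#!/bin/sh\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def resolve(cmd, search_path):
    """Return the first match for `cmd` along a PATH-style search path."""
    return shutil.which(cmd, path=search_path)


# Hypothetical stand-ins for /usr/bin and /home/USER/miniconda3/envs/VENV/bin.
system_bin = tempfile.mkdtemp(prefix="system_bin_")
conda_bin = tempfile.mkdtemp(prefix="conda_env_bin_")
system_python = make_fake_executable(system_bin, "python3")
conda_python = make_fake_executable(conda_bin, "python3")

# Only the system directory is searched: the system interpreter is found.
before = resolve("python3", system_bin)

# The conda env is prepended, like `export PATH=.../envs/VENV/bin:$PATH`.
after = resolve("python3", os.pathsep.join([conda_bin, system_bin]))

print("before:", before)
print("after: ", after)
```

<p>Prepending rather than appending is the important part: if the env's directory came last, the system interpreter would still be found first.</p>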
<h3 id="sysmted-unit">Using environment variables in a systemd unit</h3>
<p>According to <a href="https://www.freedesktop.org/software/systemd/man/systemd.exec.html#%24PATH">systemd.exec(5)</a> on the <code>$PATH</code> environment variable:</p>
<blockquote>
<p>Colon-separated list of directories to use when launching executables. Systemd uses a fixed value of /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin. </p>
</blockquote>
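<p>By default only the paths listed above are included. To modify an environment variable, use <a href="https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Environment="><code>Environment=</code></a>, so I added one line to the systemd unit. The rest of the settings are unchanged.</p>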
<div class="highlight"><pre><span></span><code><span class="k">[Unit]</span>
<span class="na">Description</span><span class="o">=</span><span class="s">PROJ Django server by uWSGI</span>
<span class="na">After</span><span class="o">=</span><span class="s">syslog.target</span>
<span class="k">[Service]</span>
<span class="na">Environment</span><span class="o">=</span><span class="s">"PATH=/home/USER/miniconda3/envs/VENV/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"</span>
<span class="na">ExecStart</span><span class="o">=</span><span class="s">/home/USER/miniconda3/envs/VENV/bin/uwsgi --ini /etc/uwsgi/vassals/PROJ.ini</span>
<span class="na">Restart</span><span class="o">=</span><span class="s">always</span>
<span class="na">KillSignal</span><span class="o">=</span><span class="s">SIGQUIT</span>
<span class="na">Type</span><span class="o">=</span><span class="s">notify</span>
<span class="na">StandardError</span><span class="o">=</span><span class="s">syslog</span>
<span class="na">NotifyAccess</span><span class="o">=</span><span class="s">all</span>
<span class="k">[Install]</span>
<span class="na">WantedBy</span><span class="o">=</span><span class="s">multi-user.target</span>
</code></pre></div>
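<p>One caveat worth keeping in mind: the systemd.exec documentation states that values in <code>Environment=</code> are taken literally and that <code>$</code> has no special meaning there, so a <code>$PATH</code> reference is not expanded the way it is in a shell. A minimal sketch of composing the line with the full value spelled out, assuming the default search path quoted above; <code>environment_line</code> is a hypothetical helper written for this post, not something systemd provides.</p>

```python
# Default search path quoted from systemd.exec(5) earlier in this post.
SYSTEMD_DEFAULT_PATH = "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"


def environment_line(env_bin, base_path=SYSTEMD_DEFAULT_PATH):
    """Build an Environment= line with the env's bin/ directory searched first.

    The whole PATH value is written out because systemd does not expand
    shell-style variables such as $PATH inside Environment=.
    """
    return 'Environment="PATH={}:{}"'.format(env_bin, base_path)


# USER and VENV are the same placeholders used throughout this post.
line = environment_line("/home/USER/miniconda3/envs/VENV/bin")
print(line)
```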
<h3 id="_1">Conclusion</h3>
<p>To switch to conda for package management, just add one line to the systemd unit that modifies $PATH to include the virtual environment's executable directory; the rest of the setup is the same as with a regular Python virtual environment. That is all it takes. But this problem cost me more than an hour...</p>Ensembl Genomic Reference in Bioconductor2016-05-21T18:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-05-21:/posts/2016/05/biocondutor-ensembl-reference/<p>Using fundamental R/Bioconductor packages (e.g. AnnotationHub, ensembldb and biomaRt) to query Ensembl genomic references or annotations.</p><p><em><strong>TL;DR</strong> I gave a talk <a href="https://blog.liang2.tw/2016Talk-Genomics-in-R/">Genomics in R</a> about querying genomic annotations and references in R/Bioconductor. In this post, we revisit all the operations in my talk using Ensembl references instead of UCSC/NCBI ones.</em></p>
<p>This post is part of the “<a href="https://blog.liang2.tw/posts/2015/12/biocondutor-genomic-data/">Genomic Data Processing in Bioconductor</a>” series. In that post, I mentioned several topics critical for genomic data analysis in Bioconductor:</p>
<ul>
<li>Annotation and genome reference (OrgDb, TxDb, OrganismDb, BSgenome)</li>
<li>Experiment data storage (ExpressionSets)</li>
<li>Operations on genome (GenomicRanges)</li>
<li>Genomic data visualization (Gviz, ggbio)</li>
</ul>
<p>A few days ago at a local R community meetup, I gave a talk <a href="https://blog.liang2.tw/2016Talk-Genomics-in-R/"><em>Genomics in R</em></a> covering the “Annotation and genome reference” part and a quick glance through “Operations on genome”, which should be sufficient for daily usage such as searching annotations within a set of genomic ranges. You can find the <a href="https://blog.liang2.tw/2016Talk-Genomics-in-R/">slides</a>, the <a href="https://www.youtube.com/watch?v=ZR4GYQ487j8">meetup screencast</a> (in Chinese) and the <a href="https://github.com/ccwang002/2016Talk-Genomics-in-R">accompanying source code</a> online. I don’t think a write-up is needed for the talk. But if anyone is interested, feel free to drop your reply below. :)</p>
<h3 id="fundamental-bioconductor-packages">Fundamental Bioconductor packages</h3>
<p>Some Bioconductor packages are the building blocks for genomic data analysis. The table below lists all the classes covered in the rest of the post. If you are not familiar with these classes and their methods, go through the <a href="https://blog.liang2.tw/2016Talk-Genomics-in-R/">talk slides</a> first, or at least follow the <a href="https://bioconductor.org/help/workflows/annotation/annotation/">annotation workflow</a> on Bioconductor.</p>
<table>
<thead>
<tr>
<th style="text-align: left;">R Class</th>
<th style="text-align: left;">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;"><code>OrgDb</code></td>
<td style="text-align: left;">Gene-based information for Homo sapiens; useful for mapping between gene IDs, Names, Symbols, GO and KEGG identifiers, etc.</td>
</tr>
<tr>
<td style="text-align: left;"><code>TxDb</code></td>
<td style="text-align: left;">Transcriptome ranges for the known gene track of Homo sapiens, e.g., introns, exons, UTR regions.</td>
</tr>
<tr>
<td style="text-align: left;"><code>OrganismDb</code></td>
<td style="text-align: left;">Collection of multiple annotations for a common organism and genome build.</td>
</tr>
<tr>
<td style="text-align: left;"><code>BSgenome</code></td>
<td style="text-align: left;">Full genome sequence for Homo sapiens.</td>
</tr>
<tr>
<td style="text-align: left;"><code>AnnotationHub</code></td>
<td style="text-align: left;">Provides a convenient interface to annotations from many different sources; objects are returned as fully parsed Bioconductor data objects or as the name of a file on disk.</td>
</tr>
</tbody>
</table>
<h3 id="ensembl-genome-browser-and-its-ecosystem">Ensembl genome browser and its ecosystem</h3>
<p>In Bioconductor, most annotations are built against the NCBI and UCSC naming systems, which are also used in my talk. However, there is another naming system maintained by <a href="http://www.ensembl.org/index.html">Ensembl</a>, whose IDs are easily recognized by the prefixes “ENSG” and “ENST” for genes and transcripts respectively.</p>
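<p>To make the naming convention concrete, here is a small Python sketch that classifies human Ensembl stable IDs by their prefix. It assumes the usual human layout of “ENS”, one feature letter, and 11 digits, optionally followed by a version suffix; IDs of other species carry an extra species infix and are deliberately not handled. <code>classify_ensembl_id</code> is a hypothetical helper, not part of any Ensembl or Bioconductor API.</p>

```python
import re

# Human Ensembl stable IDs: "ENS" + one feature letter + 11 digits,
# optionally followed by ".<version>". Other species add a species infix
# (e.g. ENSMUSG for mouse genes), which this sketch ignores on purpose.
HUMAN_STABLE_ID = re.compile(r"^ENS([GTPE])(\d{11})(?:\.(\d+))?$")
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein", "E": "exon"}


def classify_ensembl_id(stable_id):
    """Return the feature type of a human Ensembl stable ID, or None."""
    match = HUMAN_STABLE_ID.match(stable_id)
    if match is None:
        return None
    return FEATURE_TYPES[match.group(1)]


# The gene and transcript IDs are MAPK1's; the symbol is not a stable ID.
for stable_id in ["ENSG00000100030", "ENST00000215832", "MAPK1"]:
    print(stable_id, "->", classify_ensembl_id(stable_id))
```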
<p>I particularly enjoy the Ensembl genome browser. The information is well organized and structured. For example, take a look at the description page of <a href="http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000100030">gene MAPK1</a>,</p>
<figure>
<img src="https://blog.liang2.tw/posts/2016/05/biocondutor-ensembl-reference/pics/gene_MAPK1_ensembl_browser.png">
<p class="caption center">Gene information page of MAPK1 on Ensembl Genome Browser release 84 (<a href="http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000100030">link</a>)</p>
</figure>
<p>The <a href="http://www.ensembl.org/Homo_sapiens/Gene/Compara_Tree?db=core;g=ENSG00000100030">gene tree</a> tab shows its homologs and paralogs. The <a href="http://www.ensembl.org/Homo_sapiens/Gene/Variation_Gene/Table?db=core;g=ENSG00000100030">variant table</a> tab shows various kinds of SNPs within MAPK1’s transcript region. SNPs are annotated with their sources, different levels of supporting evidence, and SIFT/PolyPhen predictions of protein function change. Finally, there is an <a href="http://www.ensembl.org/Homo_sapiens/Gene/Matches?db=core;g=ENSG00000100030">external references</a> tab which links the Ensembl IDs with <a href="https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi">NCBI CCDS</a> and <a href="http://www.ncbi.nlm.nih.gov/refseq/">NCBI RefSeq</a> IDs. There are many ways to explore different aspects of this gene, and it seems everything at multiple biological levels is simply connected.</p>
<p>I always think of the Ensembl ecosystem as a decent learning portal, so it would be a pity if one could not easily use its information in R/Bioconductor. After some quick research, I found that using Ensembl annotations is quite straightforward, even though the required files do not ship with Bioconductor. Also, there were some topics I failed to mention in the talk, such as AnnotationHub and genomic coordinate system conversion (e.g., from hg19 to hg38). I am going to cover these topics in this post.</p>
<div class="toc">
<ul>
<li><a href="#fundamental-bioconductor-packages">Fundamental Bioconductor packages</a></li>
<li><a href="#ensembl-genome-browser-and-its-ecosystem">Ensembl genome browser and its ecosystem</a></li>
<li><a href="#orgdb">OrgDb</a></li>
<li><a href="#txdb">TxDb</a></li>
<li><a href="#bsgenome-and-annotationhub">BSgenome and AnnotationHub</a></li>
<li><a href="#biomart">biomaRt</a><ul>
<li><a href="#compatibility-with-annotationdbs-interface">Compatibility with AnnotationDb’s interface</a></li>
<li><a href="#biomarts-original-interface">biomaRt’s original interface</a></li>
</ul>
</li>
<li><a href="#conversion-between-genomic-coordinate-systems">Conversion between genomic coordinate systems</a></li>
<li><a href="#summary">Summary</a></li>
</ul>
</div>
<h2 id="orgdb">OrgDb</h2>
<p>The same OrgDb object for human (<code>org.Hs.eg.db</code>) can be used. It relates different gene IDs, including Entrez and Ensembl gene IDs. Judging from its metadata, the human OrgDb gets updated frequently: most of its data sources were fetched this March. So one should be able to use it for both the hg19 and hg38 human references.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">org.Hs.eg.db</span><span class="p">)</span>
<span class="n">human</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">org.Hs.eg.db</span>
<span class="n">mapk_gene_family_info</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">select</span><span class="p">(</span>
<span class="w"> </span><span class="n">human</span><span class="p">,</span>
<span class="w"> </span><span class="n">keys</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s">"MAPK1"</span><span class="p">,</span><span class="w"> </span><span class="s">"MAPK3"</span><span class="p">,</span><span class="w"> </span><span class="s">"MAPK6"</span><span class="p">),</span>
<span class="w"> </span><span class="n">keytype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"SYMBOL"</span><span class="p">,</span>
<span class="w"> </span><span class="n">columns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s">"ENTREZID"</span><span class="p">,</span><span class="w"> </span><span class="s">"ENSEMBL"</span><span class="p">,</span><span class="w"> </span><span class="s">"GENENAME"</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">mapk_gene_family_info</span>
</code></pre></div>
<table>
<thead>
<tr>
<th style="text-align: left;">SYMBOL</th>
<th style="text-align: left;">ENTREZID</th>
<th style="text-align: left;">ENSEMBL</th>
<th style="text-align: left;">GENENAME</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">MAPK1</td>
<td style="text-align: left;">5594</td>
<td style="text-align: left;">ENSG00000100030</td>
<td style="text-align: left;">mitogen-activated protein kinase 1</td>
</tr>
<tr>
<td style="text-align: left;">MAPK3</td>
<td style="text-align: left;">5595</td>
<td style="text-align: left;">ENSG00000102882</td>
<td style="text-align: left;">mitogen-activated protein kinase 3</td>
</tr>
<tr>
<td style="text-align: left;">MAPK6</td>
<td style="text-align: left;">5597</td>
<td style="text-align: left;">ENSG00000069956</td>
<td style="text-align: left;">mitogen-activated protein kinase 6</td>
</tr>
</tbody>
</table>
<p>Here comes a small pitfall for Ensembl annotation: we cannot reliably map an Ensembl gene ID to its transcript IDs,</p>
<div class="highlight"><pre><span></span><code><span class="nf">select</span><span class="p">(</span>
<span class="w"> </span><span class="n">human</span><span class="p">,</span>
<span class="w"> </span><span class="n">keys</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mapk_gene_family_info</span><span class="o">$</span><span class="n">ENSEMBL</span><span class="p">[[</span><span class="m">1</span><span class="p">]],</span>
<span class="w"> </span><span class="n">keytype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"ENSEMBL"</span><span class="p">,</span>
<span class="w"> </span><span class="n">columns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s">"ENSEMBLTRANS"</span><span class="p">)</span>
<span class="p">)</span>
<span class="c1"># 'select()' returned 1:1 mapping between keys and columns</span>
<span class="c1"># ENSEMBL ENSEMBLTRANS</span>
<span class="c1"># 1 ENSG00000100030 <NA></span>
</code></pre></div>
<p>We got <em>no</em> Ensembl transcript ID for MAPK1, which is impossible. Therefore, to find the real Ensembl transcript IDs, we need to find other references.</p>
<h2 id="txdb">TxDb</h2>
<p>There is no pre-built Ensembl TxDb object available on Bioconductor. But with the help of <a href="http://bioconductor.org/packages/release/bioc/html/ensembldb.html">ensembldb</a>, we can easily build the TxDb ourselves.</p>
<p>Following the instructions in ensembldb’s <a href="http://bioconductor.org/packages/release/bioc/vignettes/ensembldb/inst/doc/ensembldb.html">vignette file</a>, we can build the TxDb object from the latest Ensembl release, which is release 84 (Mar, 2016) at the time of writing. Ensembl releases all human transcript records as a GTF file, which can be found here <a href="ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz">ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz</a>.</p>
<p>After processing the GTF file via <code>ensDbFromGtf()</code>, the generated data for creating the TxDb object is stored in a SQLite3 database file <code>Homo_sapiens.GRCh38.84.sqlite</code> in the R working directory. Building the TxDb is then just one command away: <code>EnsDb()</code>. Putting the two commands together, the script for building the Ensembl TxDb is listed below. To avoid rebuilding the TxDb every time the script is executed, we first check whether the SQLite file exists,</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">ensembldb</span><span class="p">)</span>
<span class="c1"># xxx_DB in the vignette is just a string to the SQLite db file path</span>
<span class="n">ens84_txdb_pth</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s">'./Homo_sapiens.GRCh38.84.sqlite'</span>
<span class="nf">if </span><span class="p">(</span><span class="o">!</span><span class="nf">file.exists</span><span class="p">(</span><span class="n">ens84_txdb_pth</span><span class="p">))</span><span class="w"> </span><span class="p">{</span>
<span class="w">    </span><span class="n">ens84_txdb_pth</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">ensDbFromGtf</span><span class="p">(</span><span class="n">gtf</span><span class="o">=</span><span class="s">"Homo_sapiens.GRCh38.84.gtf.gz"</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">txdb_ens84</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">EnsDb</span><span class="p">(</span><span class="n">ens84_txdb_pth</span><span class="p">)</span>
<span class="n">txdb_ens84</span><span class="w"> </span><span class="c1"># Preview the metadata</span>
</code></pre></div>
<p>The filtering syntax for finding desired genes or transcripts is different from that of the built-in TxDb object,</p>
<div class="highlight"><pre><span></span><code><span class="nf">transcripts</span><span class="p">(</span>
<span class="w"> </span><span class="n">txdb_ens84</span><span class="p">,</span>
<span class="w"> </span><span class="n">filter</span><span class="o">=</span><span class="nf">GeneidFilter</span><span class="p">(</span><span class="n">mapk_gene_family_info</span><span class="o">$</span><span class="n">ENSEMBL</span><span class="p">[[</span><span class="m">1</span><span class="p">]])</span>
<span class="p">)</span>
<span class="c1"># GRanges object with 4 ranges and 5 metadata columns:</span>
<span class="c1"># seqnames ranges strand | tx_id tx_biotype</span>
<span class="c1"># <Rle> <IRanges> <Rle> | <character> <character></span>
<span class="c1"># ENST00000215832 22 [21754500, 21867629] - | ENST00000215832 protein_coding</span>
<span class="c1"># ENST00000491588 22 [21763984, 21769428] - | ENST00000491588 processed_transcript</span>
<span class="c1"># ENST00000398822 22 [21769040, 21867680] - | ENST00000398822 protein_coding</span>
<span class="c1"># ENST00000544786 22 [21769204, 21867440] - | ENST00000544786 protein_coding</span>
<span class="c1"># tx_cds_seq_start tx_cds_seq_end gene_id</span>
<span class="c1"># <numeric> <numeric> <character></span>
<span class="c1"># ENST00000215832 21769204 21867440 ENSG00000100030</span>
<span class="c1"># ENST00000491588 <NA> <NA> ENSG00000100030</span>
<span class="c1"># ENST00000398822 21769204 21867440 ENSG00000100030</span>
<span class="c1"># ENST00000544786 21769204 21867440 ENSG00000100030</span>
<span class="c1"># -------</span>
<span class="c1"># seqinfo: 1 sequence from GRCh38 genome</span>
</code></pre></div>
<p>So filtering is done by passing special filter objects to <code>filter=</code>. Likewise, there are <code>TxidFilter</code>, <code>TxbiotypeFilter</code>, and <code>GRangesFilter</code> for filtering on the respective columns.</p>
<div class="highlight"><pre><span></span><code><span class="n">tx_gr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">transcripts</span><span class="p">(</span>
<span class="w"> </span><span class="n">txdb_ens84</span><span class="p">,</span>
<span class="w"> </span><span class="n">filter</span><span class="o">=</span><span class="nf">TxidFilter</span><span class="p">(</span><span class="s">"ENST00000215832"</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">tx_gr</span>
<span class="c1"># GRanges object with 1 range and 5 metadata columns:</span>
<span class="c1"># seqnames ranges strand | tx_id tx_biotype tx_cds_seq_start tx_cds_seq_end gene_id</span>
<span class="c1"># <Rle> <IRanges> <Rle> | <character> <character> <numeric> <numeric> <character></span>
<span class="c1"># ENST00000215832 22 [21754500, 21867629] - | ENST00000215832 protein_coding 21769204 21867440 ENSG00000100030</span>
<span class="c1"># -------</span>
<span class="c1"># seqinfo: 1 sequence from GRCh38 genome</span>
</code></pre></div>
<p>Check the result with the online Ensembl genome browser. Note that Ensembl release 84 uses the GRCh38 (hg38) assembly.</p>
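<p>Since the EnsDb is backed by a SQLite file, one way to picture what a filter such as <code>GeneidFilter</code> does is as a <code>WHERE</code> clause on a transcript table. The Python sketch below is a mental model only: the table layout is a simplified assumption rather than the real EnsDb schema, the query is not the SQL that ensembldb actually generates, and the last row is a made-up placeholder so that the filter has something to exclude. The four MAPK1 rows are copied from the output above.</p>

```python
import sqlite3

# Simplified stand-in for a transcript table; the table and column names
# are assumptions for illustration, not the actual EnsDb schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tx (
        tx_id TEXT PRIMARY KEY,
        tx_biotype TEXT,
        tx_seq_start INTEGER,
        tx_seq_end INTEGER,
        gene_id TEXT
    )
    """
)
rows = [
    ("ENST00000215832", "protein_coding", 21754500, 21867629, "ENSG00000100030"),
    ("ENST00000491588", "processed_transcript", 21763984, 21769428, "ENSG00000100030"),
    ("ENST00000398822", "protein_coding", 21769040, 21867680, "ENSG00000100030"),
    ("ENST00000544786", "protein_coding", 21769204, 21867440, "ENSG00000100030"),
    # Made-up placeholder row belonging to a different, fictional gene.
    ("TX_PLACEHOLDER", "protein_coding", 1, 100, "GENE_PLACEHOLDER"),
]
conn.executemany("INSERT INTO tx VALUES (?, ?, ?, ?, ?)", rows)


def transcripts_by_gene(connection, gene_id):
    """Rough analogue of transcripts(db, filter=GeneidFilter(gene_id))."""
    cursor = connection.execute(
        "SELECT tx_id FROM tx WHERE gene_id = ? ORDER BY tx_seq_start",
        (gene_id,),
    )
    return [row[0] for row in cursor]


def coding_transcripts_by_gene(connection, gene_id):
    """Combining two filters corresponds to AND-ing two conditions."""
    cursor = connection.execute(
        "SELECT tx_id FROM tx WHERE gene_id = ? AND tx_biotype = ? "
        "ORDER BY tx_seq_start",
        (gene_id, "protein_coding"),
    )
    return [row[0] for row in cursor]


mapk1_tx = transcripts_by_gene(conn, "ENSG00000100030")
mapk1_coding_tx = coding_transcripts_by_gene(conn, "ENSG00000100030")
print(mapk1_tx)
print(mapk1_coding_tx)
```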
<h2 id="bsgenome-and-annotationhub">BSgenome and AnnotationHub</h2>
<p>We could load the sequence from <code>BSgenome.Hsapiens.UCSC.hg38</code>, but we can also obtain Ensembl’s genome (chromosome) sequence using <a href="https://bioconductor.org/packages/release/bioc/html/AnnotationHub.html">AnnotationHub</a>. References for non-model organisms can be found on AnnotationHub as well, many of which are extracted from Ensembl. Since they can be downloaded directly as Bioconductor objects, they should be easier to use.</p>
<p>First we create an AnnotationHub instance, which caches the metadata of all available annotations locally for us to query.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">AnnotationHub</span><span class="p">)</span>
<span class="n">ah</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">AnnotationHub</span><span class="p">()</span>
<span class="nf">query</span><span class="p">(</span><span class="n">ah</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s">"Homo sapiens"</span><span class="p">,</span><span class="w"> </span><span class="s">"release-84"</span><span class="p">))</span>
<span class="c1"># AnnotationHub with 5 records</span>
<span class="c1"># # snapshotDate(): 2016-05-12</span>
<span class="c1"># # $dataprovider: Ensembl</span>
<span class="c1"># # $species: Homo sapiens</span>
<span class="c1"># # $rdataclass: TwoBitFile</span>
<span class="c1"># # additional mcols(): taxonomyid, genome, description, tags, sourceurl, sourcetype</span>
<span class="c1"># # retrieve records with, e.g., 'object[["AH50558"]]'</span>
<span class="c1">#</span>
<span class="c1"># title</span>
<span class="c1"># AH50558 | Homo_sapiens.GRCh38.cdna.all.2bit</span>
<span class="c1"># AH50559 | Homo_sapiens.GRCh38.dna.primary_assembly.2bit</span>
<span class="c1"># AH50560 | Homo_sapiens.GRCh38.dna_rm.primary_assembly.2bit</span>
<span class="c1"># AH50561 | Homo_sapiens.GRCh38.dna_sm.primary_assembly.2bit</span>
<span class="c1"># AH50562 | Homo_sapiens.GRCh38.ncrna.2bit</span>
</code></pre></div>
<p>From the search results, human hg38 genome sequences are available in <a href="https://genome.ucsc.edu/goldenpath/help/twoBit.html">TwoBit</a> format. But having multiple results is confusing at first. After checking Ensembl’s <a href="ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/README">genome DNA assembly readme</a>, what we should use here is the full DNA assembly without any masking (or you can decide based on your application).</p>
<div class="highlight"><pre><span></span><code><span class="c1"># There are plenty of query hits. Description of the different file suffixes:</span>
<span class="c1"># GRCh38.dna.*.2bit genome sequence</span>
<span class="c1"># GRCh38.dna_rm.*.2bit hard-masked genome sequence (masked regions are replaced with N's)</span>
<span class="c1"># GRCh38.dna_sm.*.2bit soft-masked genome sequence (.............. are lower cased)</span>
<span class="n">ens84_human_dna</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ah</span><span class="p">[[</span><span class="s">"AH50559"</span><span class="p">]]</span>
</code></pre></div>
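<p>To make the three flavors concrete, here is a tiny Python sketch of how the soft-masked and hard-masked variants relate to each other, assuming the convention described in the comments above (masked regions in lower case versus replaced by <code>N</code>). The toy sequence is made up, and <code>hard_mask</code> and <code>unmask</code> are hypothetical helpers rather than functions from any package.</p>

```python
def hard_mask(soft_masked):
    """Turn a soft-masked sequence into a hard-masked one.

    Lower-case (soft-masked) bases are replaced with N; upper-case bases
    are kept as they are.
    """
    return "".join("N" if base.islower() else base for base in soft_masked)


def unmask(soft_masked):
    """Recover the plain sequence by upper-casing the soft-masked bases."""
    return soft_masked.upper()


# Made-up toy sequence: the lower-case stretch stands for a masked region.
toy = "ACGTacgtTTGA"
print(unmask(toy))      # like the *.dna.* file
print(toy)              # like the *.dna_sm.* file
print(hard_mask(toy))   # like the *.dna_rm.* file
```

<p>Note that hard masking is lossy: the original bases cannot be recovered from the <code>N</code>s, which is one reason to start from the unmasked or soft-masked file when in doubt.</p>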
<p>Then we can use <code>ens84_human_dna</code> to obtain the DNA sequence of the desired genomic ranges (as a <code>GRanges</code> object).</p>
<div class="highlight"><pre><span></span><code><span class="nf">getSeq</span><span class="p">(</span><span class="n">ens84_human_dna</span><span class="p">,</span><span class="w"> </span><span class="n">tx_gr</span><span class="p">)</span>
<span class="c1"># A DNAStringSet instance of length 1</span>
<span class="c1"># width seq names</span>
<span class="c1"># [1] 113130 TTTATAGAGAAAA...CTCGGACCGATTGCCT ENST00000215832</span>
</code></pre></div>
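<p>To make the difference between the masked files concrete, here is a minimal Python sketch of the two masking conventions described in the comments above. The functions <code>hard_mask</code> and <code>soft_mask</code>, the toy sequence, and the repeat interval are all made up for illustration; they are not part of Ensembl or any Bioconductor package.</p>

```python
def hard_mask(seq, regions):
    """Replace every base inside the masked regions with 'N' (dna_rm style)."""
    chars = list(seq)
    for start, end in regions:  # hypothetical 0-based, half-open intervals
        for i in range(start, end):
            chars[i] = "N"
    return "".join(chars)


def soft_mask(seq, regions):
    """Lower-case every base inside the masked regions (dna_sm style)."""
    chars = list(seq)
    for start, end in regions:
        for i in range(start, end):
            chars[i] = chars[i].lower()
    return "".join(chars)


seq = "ACGTACGTACGT"      # toy sequence, not real genome data
repeats = [(4, 8)]        # pretend bases 4..7 are a repeat region

hard = hard_mask(seq, repeats)
soft = soft_mask(seq, repeats)
print(hard)  # ACGTNNNNACGT
print(soft)  # ACGTacgtACGT
```

<p>Soft masking keeps the original bases recoverable (upper-casing the result gives back the unmasked sequence), while hard masking discards them.</p>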
<h2 id="biomart">biomaRt</h2>
<p><a href="http://www.ensembl.org/biomart/martview">BioMart</a> is a collection of databases that can be accessed through the same API, including Ensembl, UniProt and HapMap. <a href="https://bioconductor.org/packages/release/bioc/html/biomaRt.html">biomaRt</a> provides an R interface to these database resources. We will use BioMart for ID conversion between Ensembl and RefSeq. Its <a href="https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.pdf">vignette</a> contains solutions to common scenarios, so it should be a good starting point for getting familiar with it.</p>
<p>You could first explore which Marts are currently available by <code>listMarts()</code>,</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">biomaRt</span><span class="p">)</span>
<span class="nf">listMarts</span><span class="p">()</span>
<span class="c1"># biomart version</span>
<span class="c1"># 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 84</span>
<span class="c1"># 2 ENSEMBL_MART_SNP Ensembl Variation 84</span>
<span class="c1"># 3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 84</span>
<span class="c1"># 4 ENSEMBL_MART_VEGA Vega 64</span>
</code></pre></div>
<p>Here we will use Ensembl’s biomart. Each mart contains multiple datasets, usually separated by different organisms. In our case, human’s dataset is <code>hsapiens_gene_ensembl</code>. For other organisms, you can find their dataset by <code>listDatasets(ensembl)</code>.</p>
<div class="highlight"><pre><span></span><code><span class="n">ensembl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">useMart</span><span class="p">(</span><span class="s">"ensembl"</span><span class="p">)</span>
<span class="n">ensembl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">useDataset</span><span class="p">(</span><span class="s">"hsapiens_gene_ensembl"</span><span class="p">,</span><span class="w"> </span><span class="n">mart</span><span class="o">=</span><span class="n">ensembl</span><span class="p">)</span>
<span class="c1"># or equivalently</span>
<span class="n">ensembl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">useMart</span><span class="p">(</span><span class="s">"ensembl"</span><span class="p">,</span><span class="w"> </span><span class="n">dataset</span><span class="o">=</span><span class="s">"hsapiens_gene_ensembl"</span><span class="p">)</span>
</code></pre></div>
<h3 id="compatibility-with-annotationdbs-interface">Compatibility with AnnotationDb’s interface</h3>
<p>The way to query the <code>ensembl</code> Mart object is slightly different from how we query an AnnotationDb object. The major difference is the terminology. Luckily, the Mart object provides a compatibility layer so we can still call functions such as <code>select(db, ...)</code>, <code>keytypes(db)</code>, <code>keys(db)</code> and <code>columns(db)</code>, which we frequently do<sup id="fnref:select-compat"><a class="footnote-ref" href="#fn:select-compat">1</a></sup> when using NCBI/UCSC references.</p>
<p>A Mart can have hundreds of keys and columns, so we pick out a subset of them with <code>grep()</code>,</p>
<div class="highlight"><pre><span></span><code><span class="nf">grep</span><span class="p">(</span><span class="s">"^refseq"</span><span class="p">,</span><span class="w"> </span><span class="nf">keytypes</span><span class="p">(</span><span class="n">ensembl</span><span class="p">),</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span>
<span class="c1"># [1] "refseq_mrna" "refseq_mrna_predicted" ...</span>
<span class="nf">grep</span><span class="p">(</span><span class="s">"^ensembl"</span><span class="p">,</span><span class="w"> </span><span class="nf">keytypes</span><span class="p">(</span><span class="n">ensembl</span><span class="p">),</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span>
<span class="c1"># [1] "ensembl_exon_id" "ensembl_gene_id" ...</span>
<span class="nf">grep</span><span class="p">(</span><span class="s">"hsapiens_paralog_"</span><span class="p">,</span><span class="w"> </span><span class="nf">columns</span><span class="p">(</span><span class="n">ensembl</span><span class="p">),</span><span class="w"> </span><span class="n">value</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span>
<span class="c1"># [1] "hsapiens_paralog_associated_gene_name"</span>
<span class="c1"># [2] "hsapiens_paralog_canonical_transcript_protein"</span>
<span class="c1"># [3] "hsapiens_paralog_chrom_end"</span>
<span class="c1"># ...</span>
</code></pre></div>
<p>We start by finding MAPK1’s RefSeq transcript IDs and their corresponding Ensembl transcript IDs, which is something we can do with neither our locally built Ensembl TxDb nor the human OrgDb.</p>
<div class="highlight"><pre><span></span><code><span class="nf">select</span><span class="p">(</span>
<span class="w"> </span><span class="n">ensembl</span><span class="p">,</span>
<span class="w"> </span><span class="n">keys</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s">"MAPK1"</span><span class="p">),</span>
<span class="w"> </span><span class="n">keytype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"hgnc_symbol"</span><span class="p">,</span>
<span class="w"> </span><span class="n">columns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span>
<span class="w"> </span><span class="s">"refseq_mrna"</span><span class="p">,</span><span class="w"> </span><span class="s">"ensembl_transcript_id"</span><span class="p">,</span>
<span class="w"> </span><span class="s">"hgnc_symbol"</span><span class="p">,</span><span class="w"> </span><span class="s">"entrezgene"</span><span class="p">,</span>
<span class="w"> </span><span class="s">"chromosome_name"</span><span class="p">,</span>
<span class="w"> </span><span class="s">"transcript_start"</span><span class="p">,</span><span class="w"> </span><span class="s">"transcript_end"</span><span class="p">,</span><span class="w"> </span><span class="s">"strand"</span>
<span class="w"> </span><span class="p">)</span>
<span class="p">)</span>
<span class="c1"># refseq_mrna ensembl_transcript_id hgnc_symbol entrezgene</span>
<span class="c1"># 1 NM_002745 ENST00000215832 MAPK1 5594</span>
<span class="c1"># 2 ENST00000491588 MAPK1 5594</span>
<span class="c1"># 3 NM_138957 ENST00000398822 MAPK1 5594</span>
<span class="c1"># 4 ENST00000544786 MAPK1 5594</span>
<span class="c1"># chromosome_name transcript_start transcript_end strand</span>
<span class="c1"># 1 22 21754500 21867629 -1</span>
<span class="c1"># 2 22 21763984 21769428 -1</span>
<span class="c1"># 3 22 21769040 21867680 -1</span>
<span class="c1"># 4 22 21769204 21867440 -1</span>
</code></pre></div>
<p>So some of the MAPK1 Ensembl transcripts do not have RefSeq identifiers. This is common, since RefSeq is more conservative about including new transcripts. Anyway, we can now translate our analysis results between a wider range of naming systems.</p>
<p>Moreover, what’s awesome about BioMart is that almost all the information on the Ensembl genome browser can be retrieved through BioMart. For example, getting the paralogs and the mouse homolog of MAPK1,</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Get paralog of MAPK1</span>
<span class="nf">select</span><span class="p">(</span>
<span class="w"> </span><span class="n">ensembl</span><span class="p">,</span>
<span class="w"> </span><span class="n">keys</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s">"ENSG00000100030"</span><span class="p">),</span>
<span class="w"> </span><span class="n">keytype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"ensembl_gene_id"</span><span class="p">,</span>
<span class="w"> </span><span class="n">columns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span>
<span class="w"> </span><span class="s">"hsapiens_paralog_associated_gene_name"</span><span class="p">,</span>
<span class="w"> </span><span class="s">"hsapiens_paralog_orthology_type"</span><span class="p">,</span>
<span class="w"> </span><span class="s">"hsapiens_paralog_ensembl_peptide"</span>
<span class="w"> </span><span class="p">)</span>
<span class="p">)</span>
<span class="c1"># hsapiens_paralog_associated_gene_name hsapiens_paralog_orthology_type hsapiens_paralog_ensembl_peptide</span>
<span class="c1"># 1 MAPK3 within_species_paralog ENSP00000263025</span>
<span class="c1"># 2 MAPK6 within_species_paralog ENSP00000261845</span>
<span class="c1"># 3 MAPK4 within_species_paralog ENSP00000383234</span>
<span class="c1"># 4 NLK within_species_paralog ENSP00000384625</span>
<span class="c1"># 5 MAPK7 within_species_paralog ENSP00000311005</span>
<span class="c1"># Get homolog of MAPK1 in mouse</span>
<span class="nf">select</span><span class="p">(</span>
<span class="w"> </span><span class="n">ensembl</span><span class="p">,</span>
<span class="w"> </span><span class="n">keys</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s">"ENSG00000100030"</span><span class="p">),</span>
<span class="w"> </span><span class="n">keytype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"ensembl_gene_id"</span><span class="p">,</span>
<span class="w"> </span><span class="n">columns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span>
<span class="w"> </span><span class="s">"mmusculus_homolog_associated_gene_name"</span><span class="p">,</span>
<span class="w"> </span><span class="s">"mmusculus_homolog_orthology_type"</span><span class="p">,</span>
<span class="w"> </span><span class="s">"mmusculus_homolog_ensembl_peptide"</span>
<span class="w"> </span><span class="p">)</span>
<span class="p">)</span>
<span class="c1"># mmusculus_homolog_associated_gene_name mmusculus_homolog_orthology_type mmusculus_homolog_ensembl_peptide</span>
<span class="c1"># 1 Mapk1 ortholog_one2one ENSMUSP00000065983</span>
</code></pre></div>
<h3 id="biomarts-original-interface">biomaRt’s original interface</h3>
<p>The <code>select()</code> function we have been using is not biomaRt’s original interface. In fact, keys and columns are interpreted as BioMart’s <strong>filters</strong> and <strong>attributes</strong> respectively. To find all available filters and attributes,</p>
<div class="highlight"><pre><span></span><code><span class="n">filters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">listFilters</span><span class="p">(</span><span class="n">ensembl</span><span class="p">)</span>
<span class="n">attributes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">listAttributes</span><span class="p">(</span><span class="n">ensembl</span><span class="p">)</span>
</code></pre></div>
<p>Each command returns a data.frame that contains the name and description of every filter or attribute.</p>
<p>Behind the scenes, the arguments of <code>select(db, ...)</code> are converted into a <code>getBM(mart, ...)</code> call. The same example of finding RefSeq and Ensembl transcript IDs can be rewritten as</p>
<div class="highlight"><pre><span></span><code><span class="nf">getBM</span><span class="p">(</span>
<span class="w"> </span><span class="n">attributes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span>
<span class="w"> </span><span class="s">"refseq_mrna"</span><span class="p">,</span><span class="w"> </span><span class="s">"ensembl_transcript_id"</span><span class="p">,</span>
<span class="w"> </span><span class="s">"chromosome_name"</span><span class="p">,</span>
<span class="w"> </span><span class="s">"transcript_start"</span><span class="p">,</span><span class="w"> </span><span class="s">"transcript_end"</span><span class="p">,</span><span class="w"> </span><span class="s">"strand"</span><span class="p">,</span>
<span class="w"> </span><span class="s">"hgnc_symbol"</span><span class="p">,</span><span class="w"> </span><span class="s">"entrezgene"</span><span class="p">,</span><span class="w"> </span><span class="s">"ensembl_gene_id"</span>
<span class="w"> </span><span class="p">),</span>
<span class="w"> </span><span class="n">filters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"hgnc_symbol"</span><span class="p">,</span>
<span class="w"> </span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s">"MAPK1"</span><span class="p">),</span>
<span class="w"> </span><span class="n">mart</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ensembl</span>
<span class="p">)</span>
</code></pre></div>
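<p>The correspondence between the two interfaces can be written down as a small lookup. The Python sketch below is only an illustration of the terminology mapping described above (keys become values, keytype becomes filters, columns become attributes); the helper <code>select_to_getbm</code> is hypothetical and is not how biomaRt implements the conversion internally.</p>

```python
def select_to_getbm(keys, keytype, columns):
    """Translate select()-style arguments into getBM()-style arguments.

    Hypothetical helper for illustration only; it just renames the
    arguments following the terminology mapping in the text.
    """
    return {
        "attributes": list(columns),  # columns -> attributes
        "filters": keytype,           # keytype -> filters
        "values": list(keys),         # keys    -> values
    }


getbm_args = select_to_getbm(
    keys=["MAPK1"],
    keytype="hgnc_symbol",
    columns=["refseq_mrna", "ensembl_transcript_id"],
)
print(getbm_args)
```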
<h2 id="conversion-between-genomic-coordinate-systems">Conversion between genomic coordinate systems</h2>
<p>Sometimes we need to convert between different versions of the reference. For example, say we’d like to convert a batch of genomic locations from reference hg38 to hg19, so that we can compare our new research with previous studies. It is a non-trivial task that can currently be handled by the following tools:</p>
<ul>
<li><a href="http://crossmap.sourceforge.net/">CrossMap</a> (used by Ensembl)</li>
<li><a href="https://genome.ucsc.edu/cgi-bin/hgLiftOver">liftOver</a> (used by UCSC)</li>
</ul>
<p>Frankly, I have no experience with this kind of conversion in a real study (the converted result still leaves me a little uneasy), but here I follow <a href="http://genomicsclass.github.io/book/pages/bioc1_liftOver.html">the guide in the PH525x series</a>. In Bioconductor, we can use a UCSC chain file with the <code>liftOver()</code> method provided by the <code>rtracklayer</code> package. To convert regions from hg38 to hg19, we need the <code>hg38ToHg19.over.chain</code> file, which can be found at <a href="ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/">ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/</a>.</p>
<p>We still use MAPK1 as the example for the conversion. First extract MAPK1’s genomic ranges in hg38 and hg19 respectively,</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">TxDb.Hsapiens.UCSC.hg38.knownGene</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">TxDb.Hsapiens.UCSC.hg19.knownGene</span><span class="p">)</span>
<span class="n">tx38</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">TxDb.Hsapiens.UCSC.hg38.knownGene</span>
<span class="n">tx19</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">TxDb.Hsapiens.UCSC.hg19.knownGene</span>
<span class="n">MAPK1_hg38</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">genes</span><span class="p">(</span><span class="n">tx38</span><span class="p">,</span><span class="w"> </span><span class="n">filter</span><span class="o">=</span><span class="nf">list</span><span class="p">(</span><span class="n">gene_id</span><span class="o">=</span><span class="s">"5594"</span><span class="p">))</span>
<span class="n">MAPK1_hg19</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">genes</span><span class="p">(</span><span class="n">tx19</span><span class="p">,</span><span class="w"> </span><span class="n">filter</span><span class="o">=</span><span class="nf">list</span><span class="p">(</span><span class="n">gene_id</span><span class="o">=</span><span class="s">"5594"</span><span class="p">))</span>
</code></pre></div>
<p>Then we convert <code>MAPK1_hg38</code> to use the hg19 coordinate system.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">rtracklayer</span><span class="p">)</span>
<span class="n">ch</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">import.chain</span><span class="p">(</span><span class="s">"./hg38ToHg19.over.chain"</span><span class="p">)</span>
<span class="n">MAPK1_hg19_lifted</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">liftOver</span><span class="p">(</span><span class="n">MAPK1_hg38</span><span class="p">,</span><span class="w"> </span><span class="n">ch</span><span class="p">)</span>
<span class="n">MAPK1_hg19</span>
<span class="c1"># GRanges object with 1 range and 1 metadata column:</span>
<span class="c1"># seqnames ranges strand | gene_id</span>
<span class="c1"># <Rle> <IRanges> <Rle> | <character></span>
<span class="c1"># 5594 chr22 [22113947, 22221970] - | 5594</span>
<span class="c1"># -------</span>
<span class="c1"># seqinfo: 93 sequences (1 circular) from hg19 genome</span>
<span class="n">MAPK1_hg19_lifted</span>
<span class="c1"># GRangesList object of length 1:</span>
<span class="c1"># $5594</span>
<span class="c1"># GRanges object with 2 ranges and 1 metadata column:</span>
<span class="c1"># seqnames ranges strand | gene_id</span>
<span class="c1"># <Rle> <IRanges> <Rle> | <character></span>
<span class="c1"># [1] chr22 [22113947, 22216652] - | 5594</span>
<span class="c1"># [2] chr22 [22216654, 22221970] - | 5594</span>
<span class="c1">#</span>
<span class="c1"># -------</span>
<span class="c1"># seqinfo: 1 sequence from an unspecified genome; no seqlengths</span>
</code></pre></div>
<p>So the conversion worked as expected, though it created a gap in the range (missing a base at 22216653). I haven’t looked into the result any further. To verify the correctness of the conversion, a comparison with CrossMap may be needed.</p>
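<p>The gap can be checked with a little interval arithmetic. This is a minimal Python sketch using the two lifted ranges and the native hg19 range printed above; the helper <code>find_gaps</code> is made up for this illustration and is not part of rtracklayer.</p>

```python
def find_gaps(ranges):
    """Return the uncovered stretches between sorted, 1-based inclusive ranges."""
    ordered = sorted(ranges)
    gaps = []
    for (_, end1), (start2, _) in zip(ordered, ordered[1:]):
        if start2 > end1 + 1:
            gaps.append((end1 + 1, start2 - 1))
    return gaps


# Coordinates copied from the liftOver output above.
lifted = [(22113947, 22216652), (22216654, 22221970)]
native_hg19 = (22113947, 22221970)

gaps = find_gaps(lifted)
covered = sum(end - start + 1 for start, end in lifted)
expected = native_hg19[1] - native_hg19[0] + 1

print(gaps)                # [(22216653, 22216653)]
print(expected - covered)  # 1
```

<p>The lifted ranges span the same start and end as the native hg19 annotation and differ by exactly one base.</p>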
<h2 id="summary">Summary</h2>
<p>We skimmed through OrgDb and TxDb again using the Ensembl references, including how to build the TxDb for Ensembl locally and obtain external annotations from AnnotationHub.</p>
<p>BioMart is a rich resource for querying across various types of databases and references, and it can be used to convert between different naming systems.</p>
<p>Finally, we know how to convert between different versions of the reference. Though the correctness of the conversion requires further examination (which does not mean it is wrong), at least the conversion by liftOver works as expected.</p>
<p>Starting here, you should have no trouble dealing with annotations in R anymore. For the next post, I plan to further explore the way to read sequencing analysis results in R.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:select-compat">
<p>Note that the <code>select(mart, ...)</code> compatibility does not apply to all existing filters (keys) and attributes (columns) of the given Mart. <a class="footnote-backref" href="#fnref:select-compat" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
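</div>Deploying Django with uWSGI, nginx, and systemd2016-05-19T00:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-05-19:/posts/2016/05/django-deploy-uwsgi-nginx-systemd/<p>My last serious Django deployment is documented in <a href="https://blog.liang2.tw/posts/2016/02/pydoctw-server/">“Setting up an auto-updating server for the Chinese translation of the official Python documentation”</a>. As it happens, my thesis project also needs a Django site, so I …</p><p>My last serious Django deployment is documented in <a href="https://blog.liang2.tw/posts/2016/02/pydoctw-server/">“Setting up an auto-updating server for the Chinese translation of the official Python documentation”</a>. As it happens, my thesis project also needs a Django site, so I went through the deployment setup once more. The old post also covered a number of unrelated things; this one focuses only on deploying Django.</p>
<p>The deployment here avoids root privileges wherever possible. The overall connection flow is:</p>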
<div class="highlight"><pre><span></span><code>nginx -- unix socket -- uWSGI -- Django
</code></pre></div>
<p>A systemd unit named <code>PROJ.service</code> manages whether the site is running. From here on, replace <code>PROJ</code> with your own project name and <code>USER</code> with the account that runs the site.</p>
<div class="toc">
<ul>
<li><a href="#_1">Operating system</a></li>
<li><a href="#postgresql">PostgreSQL</a></li>
<li><a href="#django-proj">Django PROJ</a></li>
<li><a href="#tmpfilesd">tmpfiles.d</a></li>
<li><a href="#nginx">nginx</a></li>
<li><a href="#uwsgi">uWSGI</a></li>
<li><a href="#systemd">systemd</a></li>
<li><a href="#_2">Verification and summary</a></li>
</ul>
</div>
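<h3 id="_1">Operating system</h3>
<p>I use Ubuntu 16.04 LTS. I have no particular love for Ubuntu, but since many people use it, someone should still be able to maintain the server after I graduate. It is close enough to Debian that little differs from the old post. Ubuntu 16.04 ships with Python 3.5, so there is nothing extra to install, and PostgreSQL is now at version 9.5.</p>
<p>Use <a href="https://wiki.debian.org/UnattendedUpgrades">unattended-upgrades</a> to apply security-related package updates on a schedule. By default it checks once a day, and the update logs are written to the <code>/var/log/unattended-upgrades</code> directory.</p>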
<h3 id="postgresql">PostgreSQL</h3>
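<p>Follow the setup in <a href="https://blog.liang2.tw/posts/2016/01/postgresql-install/">“Installing PostgreSQL 9 on Debian Jessie / OSX”</a>. Create a PostgreSQL account with the same name as the OS user and grant it permission to create databases, which makes development easier. No password is needed.</p>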
<h3 id="django-proj">Django PROJ</h3>
<p>Use the built-in <a href="https://docs.python.org/3/library/venv.html">venv</a> to create a virtual environment named <code>VENV</code> somewhere under your home directory:</p>
<div class="highlight"><pre><span></span><code>python3.5<span class="w"> </span>-m<span class="w"> </span>venv<span class="w"> </span>VENV
</code></pre></div>
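<p>For the deployment settings (that is, <code>settings.py</code>), use <a href="https://github.com/joke2k/django-environ">django-environ</a> to keep the secret key, database connection details, outgoing SMTP server, and similar settings in a separate file, so that the local and production environments each read their own values. For a concrete example, see the <a href="https://github.com/pycontw/pycontw2016/blob/master/src/pycontw2016/settings/production.py">production settings of the PyCon Taiwan 2016 website</a>.</p>
<p>Connect to PostgreSQL over a local connection (Unix-domain socket), that is, as the account with the same name as the OS user.</p>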
<div class="highlight"><pre><span></span><code><span class="na">DATABASE_URL</span><span class="o">=</span><span class="s">postgres:///TABLE_NAME</span>
</code></pre></div>
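<p>The three slashes in the URL are what leave the host empty. As a rough illustration, the Python sketch below takes the URL apart with the standard library’s <code>urllib.parse</code>; it only shows the structure of the URL and is not a description of how django-environ parses it internally.</p>

```python
from urllib.parse import urlparse

# Same placeholder URL as in the settings above.
url = urlparse("postgres:///TABLE_NAME")

db_name = url.path.lstrip("/")     # the part after the third slash
has_host = url.hostname is not None
has_credentials = url.username is not None or url.password is not None

print(url.scheme)       # postgres
print(db_name)          # TABLE_NAME
print(has_host)         # False
print(has_credentials)  # False
```

<p>With no host, user, or password in the URL, the connection relies on the local socket and on the OS user matching the PostgreSQL account, as described above.</p>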
<h3 id="tmpfilesd">tmpfiles.d</h3>
<p>The socket that nginx and uWSGI use to talk to each other lives under <code>/run/PROJ</code>. That also means the <code>/run/PROJ</code> directory disappears after a reboot, so <a href="https://www.freedesktop.org/software/systemd/man/tmpfiles.d.html">tmpfiles.d</a><sup id="fnref:systemd-runtimedir"><a class="footnote-ref" href="#fn:systemd-runtimedir">1</a></sup> is used to recreate it. Apart from naming the directory after the project, the configuration is the same as in the <a href="https://blog.liang2.tw/posts/2016/02/pydoctw-server/">old post</a>.</p>
<h3 id="nginx">nginx</h3>
<p>The nginx configuration is the same as in the <a href="https://blog.liang2.tw/posts/2016/02/pydoctw-server/">old post</a>. Put it in <code>/etc/nginx/sites-available/PROJ.conf</code>:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Upstream Django setting; the socket nginx connects to</span>
<span class="k">upstream</span><span class="w"> </span><span class="s">django</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">server</span><span class="w"> </span><span class="s">unix:///run/PROJ/django.sock</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">server</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">listen</span><span class="w"> </span><span class="mi">80</span><span class="p">;</span>
<span class="w"> </span><span class="kn">listen</span><span class="w"> </span><span class="mi">443</span><span class="w"> </span><span class="s">default</span><span class="w"> </span><span class="s">ssl</span><span class="p">;</span>
<span class="w"> </span><span class="kn">server_name</span><span class="w"> </span><span class="mi">123</span><span class="s">.123.123.123</span>
<span class="w"> </span><span class="p">;</span>
<span class="w"> </span><span class="kn">charset</span><span class="w"> </span><span class="s">utf-8</span><span class="p">;</span>
<span class="w"> </span><span class="kn">client_max_body_size</span><span class="w"> </span><span class="s">10M</span><span class="p">;</span><span class="w"> </span><span class="c1"># max upload size</span>
<span class="w"> </span><span class="kn">keepalive_timeout</span><span class="w"> </span><span class="mi">15</span><span class="p">;</span>
<span class="w"> </span><span class="kn">location</span><span class="w"> </span><span class="s">/static</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">alias</span><span class="w"> </span><span class="s">/path/to/PROJ/assets</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="c1"># Finally, send all non-media requests to the Django server.</span>
<span class="w"> </span><span class="kn">location</span><span class="w"> </span><span class="s">/</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">uwsgi_pass</span><span class="w"> </span><span class="s">django</span><span class="p">;</span>
<span class="w"> </span><span class="kn">include</span><span class="w"> </span><span class="s">/etc/nginx/uwsgi_params</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<p><code>/path/to/PROJ/assets</code> is the path of Django’s <a href="https://docs.djangoproject.com/en/1.9/ref/settings/#std:setting-STATIC_ROOT">STATIC_ROOT</a>. After running <code>python manage.py collectstatic</code>, you can test whether nginx serves /static/…/ even before uWSGI is configured.</p>
<p>To enable the site, first symlink the file into <code>/etc/nginx/sites-enabled/</code>, then reload the nginx configuration:</p>
<div class="highlight"><pre><span></span><code><span class="nb">cd</span><span class="w"> </span>/etc/nginx/sites-enabled/
sudo<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>../sites-available/PROJ.conf<span class="w"> </span>.
sudo<span class="w"> </span>systemctl<span class="w"> </span>reload<span class="w"> </span>nginx
</code></pre></div>
<h3 id="uwsgi">uWSGI</h3>
<p>The biggest difference from the old post is that uWSGI only needs to be installed inside VENV, and emperor mode is not used. Put the configuration in <code>/etc/uwsgi/vassals/PROJ.ini</code>:</p>
<div class="highlight"><pre><span></span><code><span class="k">[uwsgi]</span>
<span class="na">chdir</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">/path/to/PROJ</span>
<span class="c1"># Django's wsgi file</span>
<span class="na">module</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">PROJ.wsgi:application</span>
<span class="na">env</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">DJANGO_SETTINGS_MODULE=PROJ.settings.production</span>
<span class="c1"># the virtualenv (full path)</span>
<span class="na">home</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">/path/to/VENV</span>
<span class="c1"># process-related settings</span>
<span class="c1"># master</span>
<span class="na">master</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">true</span>
<span class="c1"># maximum number of worker processes</span>
<span class="na">processes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">4</span>
<span class="c1"># the socket (use the full path to be safe)</span>
<span class="na">socket</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">/run/PROJ/django.sock</span>
<span class="c1"># ... with appropriate permissions - may be needed</span>
<span class="na">chmod-socket</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">664</span>
<span class="na">uid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">USER</span>
<span class="na">gid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">www-data</span>
<span class="c1"># clear environment on exit</span>
<span class="na">vacuum</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">true</span>
</code></pre></div>
<p>Once that is set, run the following command and the site should be up.</p>
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>/path/to/VENV/bin/uwsgi<span class="w"> </span>--ini<span class="w"> </span>/etc/uwsgi/vassals/PROJ.ini
</code></pre></div>
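<p>Permission problems on the socket are a common stumbling block, so it helps to spell out what <code>chmod-socket = 664</code> grants. The Python sketch below just decodes the mode bits with the standard <code>stat</code> module. It assumes nginx runs in the <code>www-data</code> group (my understanding of the Ubuntu default, not something stated in the configuration above), in which case the group bits are the ones that matter.</p>

```python
import stat

mode = 0o664  # value of chmod-socket in the uWSGI configuration above

owner_can_write = bool(mode & stat.S_IWUSR)   # USER, the uid running uWSGI
group_can_write = bool(mode & stat.S_IWGRP)   # www-data, the gid of the socket
others_can_write = bool(mode & stat.S_IWOTH)  # everyone else

print(owner_can_write)   # True
print(group_can_write)   # True
print(others_can_write)  # False
```

<p>Connecting to a Unix socket requires write permission on it, so under this assumption nginx gets access through the group while other accounts do not.</p>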
<h3 id="systemd">systemd</h3>
<p>Apart from the command that launches uWSGI, everything here is the same as in the <a href="https://blog.liang2.tw/posts/2016/02/pydoctw-server/">old post</a>. On Debian-family systems, the systemd system unit file goes in <code>/etc/systemd/system/PROJ.service</code>:</p>
<div class="highlight"><pre><span></span><code><span class="k">[Unit]</span>
<span class="na">Description</span><span class="o">=</span><span class="s">PROJ's Django server by uWSGI</span>
<span class="na">After</span><span class="o">=</span><span class="s">syslog.target</span>
<span class="k">[Service]</span>
<span class="na">ExecStart</span><span class="o">=</span><span class="s">/path/to/VENV/bin/uwsgi --ini /etc/uwsgi/vassals/PROJ.ini</span>
<span class="na">Restart</span><span class="o">=</span><span class="s">always</span>
<span class="na">KillSignal</span><span class="o">=</span><span class="s">SIGQUIT</span>
<span class="na">Type</span><span class="o">=</span><span class="s">notify</span>
<span class="na">StandardError</span><span class="o">=</span><span class="s">syslog</span>
<span class="na">NotifyAccess</span><span class="o">=</span><span class="s">all</span>
<span class="k">[Install]</span>
<span class="na">WantedBy</span><span class="o">=</span><span class="s">multi-user.target</span>
</code></pre></div>
<p>This configuration restarts the service automatically (when something goes wrong) and sends stderr to syslog. Next, enable the <code>PROJ.service</code> service:</p>
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>systemctl<span class="w"> </span><span class="nb">enable</span><span class="w"> </span>PROJ
sudo<span class="w"> </span>systemctl<span class="w"> </span>status<span class="w"> </span>PROJ
</code></pre></div>
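<p>Use <code>sudo journalctl -xe -u PROJ</code> to view uWSGI’s runtime and connection logs.</p>
<h3 id="_2">Verification and summary</h3>
<p>Reboot the system once. If the site is still alive afterwards, everything is configured correctly. Overall the setup is not very complicated, but permission errors can leave you going in circles, so be patient.</p>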
<div class="footnote">
<hr>
<ol>
<li id="fn:systemd-runtimedir">
<p>You can also create the runtime directory with <code>RuntimeDirectory=PROJ</code>, described in <a href="https://www.freedesktop.org/software/systemd/man/systemd.exec.html">systemd.exec(5)</a>. However, because the USER of <code>PROJ.service</code> must be root, the man page recommends tmpfiles.d for this case. I think the need for root privileges could be worked around, but I was too lazy, so it stays this way for now…… <a class="footnote-backref" href="#fnref:systemd-runtimedir" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
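</div>Jupyter Notebook Progress Bar2016-03-23T02:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-03-23:/posts/2016/03/notebook-progress-bar/<p>I believe many people already run their analyses in <a href="https://jupyter.org/">Jupyter (IPython) Notebook</a>. As the data being analyzed keeps growing, a single run can take tens of minutes or even hours. Without a …</p><p>I believe many people already run their analyses in <a href="https://jupyter.org/">Jupyter (IPython) Notebook</a>. As the data being analyzed keeps growing, a single run can take tens of minutes or even hours. Without a progress bar the wait gets rather boring, and having one makes you feel like you are <strong>making great progress</strong>, so why not?</p>
<p>For example, when I <a href="https://blog.liang2.tw/play_aiohttp/?full#asyncio-progressbar-cover">introduced aiohttp last year</a>, I used progress bars in both the notebook and the console. However, with the architectural changes in Jupyter Notebook 4+ over the past few months, that code may no longer work. Someone happened to bring this up at yesterday’s Taipei.py, so here is a write-up.</p>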
<div class="toc">
<ul>
<li><a href="#ipywidgets">Introduction to IPywidgets</a><ul>
<li><a href="#_1">Installation</a></li>
</ul>
</li>
<li><a href="#_2">Using the progress bar</a></li>
<li><a href="#progress-bar-example-in-action">Progress bar example in action</a></li>
<li><a href="#misc">Misc. 螢幕截圖</a><ul>
<li><a href="#ffmpeg">FFmpeg 轉檔</a></li>
</ul>
</li>
</ul>
</div>
<h3 id="ipywidgets">Introduction to ipywidgets</h3>
<p>The notebook progress bar is implemented with a widget from <a href="https://github.com/ipython/ipywidgets">ipywidgets</a>. Widgets define two-way communication between the notebook client and server, and bundle the related CSS / JS together. The <a href="http://nbviewer.jupyter.org/github/ipython/ipywidgets/blob/master/examples/Widget%20Basics.ipynb"><em>Widget Basics</em></a> notebook in the ipywidgets examples describes the possible uses:</p>
<blockquote>
<p>You can use widgets to <strong>build interactive GUIs</strong> for your notebooks. <br>
You can also use widgets to <strong>synchronize stateful and stateless information</strong> between Python and JavaScript.</p>
</blockquote>
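<p>So besides one-way messaging from Python code to the notebook (HTML) front-end, as a progress bar does, you can also build interfaces that send values from the front-end back to the Python code.</p>
<p>...That is a lot of background for something as simple as a progress bar. For more, see the <a href="http://nbviewer.jupyter.org/github/ipython/ipywidgets/blob/master/examples/Index.ipynb">official ipywidgets examples</a>.</p>
<h4 id="_1">Installation</h4>
<p>This post uses Python 3.5. Besides notebook itself, you also need to install the ipywidgets package.</p>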
<div class="highlight"><pre><span></span><code>pip install notebook ipywidgets
</code></pre></div>
<p>Then run <code>jupyter notebook</code> to start the notebook.</p>
<h3 id="_2">Using the progress bar</h3>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">ipywidgets</span> <span class="kn">import</span> <span class="n">IntProgress</span>
<span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">display</span>
</code></pre></div>
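<p><code>IntProgress</code> is the progress bar, and <code>display</code> is the IPython function for displaying all kinds of Python objects. It is needed here to render the widget as HTML and keep it linked to the Python code.</p>
<p>Creating a progress bar is simple: create an <code>IntProgress</code> widget object, then display it:</p>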
<div class="highlight"><pre><span></span><code><span class="n">p</span> <span class="o">=</span> <span class="n">IntProgress</span><span class="p">()</span>
<span class="n">display</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
</code></pre></div>
<figure>
<img src="https://blog.liang2.tw/posts/2016/03/notebook-progress-bar/pics/progressbar_default.png">
</figure>
<p>By default the progress bar has 100 units and starts at 0. The current value and the maximum are stored in the <code>.value</code> and <code>.max</code> attributes:</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">p</span><span class="o">.</span><span class="n">value</span><span class="p">,</span> <span class="n">p</span><span class="o">.</span><span class="n">max</span>
<span class="go">(0, 100)</span>
</code></pre></div>
<p>Simply assigning to <code>p.value</code> updates the progress bar displayed above automatically (no need to rerun <code>display(p)</code>):</p>
<div class="highlight"><pre><span></span><code><span class="n">p</span><span class="o">.</span><span class="n">value</span> <span class="o">=</span> <span class="mi">50</span>
</code></pre></div>
<figure>
<img src="https://blog.liang2.tw/posts/2016/03/notebook-progress-bar/pics/progressbar_50.png">
</figure>
<div class="highlight"><pre><span></span><code><span class="n">p</span><span class="o">.</span><span class="n">value</span> <span class="o">=</span> <span class="mi">100</span>
</code></pre></div>
<figure>
<img src="https://blog.liang2.tw/posts/2016/03/notebook-progress-bar/pics/progressbar_100.png">
</figure>
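<p>Of course, changes to the maximum are also reflected immediately. You can also give the progress bar a label through <code>.description</code>. Here is a complete example from scratch:</p>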
<div class="highlight"><pre><span></span><code><span class="n">p2</span> <span class="o">=</span> <span class="n">IntProgress</span><span class="p">(</span><span class="nb">max</span><span class="o">=</span><span class="mi">56</span><span class="p">)</span>
<span class="n">p2</span><span class="o">.</span><span class="n">value</span> <span class="o">+=</span> <span class="mi">10</span>
<span class="n">p2</span><span class="o">.</span><span class="n">description</span> <span class="o">=</span> <span class="s1">'Running'</span>
<span class="n">display</span><span class="p">(</span><span class="n">p2</span><span class="p">)</span>
</code></pre></div>
<figure>
<img src="https://blog.liang2.tw/posts/2016/03/notebook-progress-bar/pics/progressbar_full.png">
</figure>
<p>That is all the code there is; it is very easy to use.</p>
<h3 id="progress-bar-example-in-action">Progress bar example in action</h3>
<p>To simulate a realistic situation: we usually have a pile of pending tasks, called <code>todo_tasks</code> here.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">time</span>
<span class="n">todo_tasks</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'task </span><span class="si">%02d</span><span class="s1">'</span> <span class="o">%</span> <span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">50</span><span class="p">)]</span>
</code></pre></div>
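<p>Each task is just a string, and <code>time.sleep(sec)</code> simulates doing actual work.</p>
<p>Here is the loop combined with a progress bar; the animation below shows how it behaves in practice.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Initialize a progress bar</span>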
<span class="n">progress</span> <span class="o">=</span> <span class="n">IntProgress</span><span class="p">()</span>
<span class="n">progress</span><span class="o">.</span><span class="n">max</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">todo_tasks</span><span class="p">)</span>
<span class="n">progress</span><span class="o">.</span><span class="n">description</span> <span class="o">=</span> <span class="s1">'(Init)'</span>
<span class="n">display</span><span class="p">(</span><span class="n">progress</span><span class="p">)</span>
<span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.25</span><span class="p">)</span>
<span class="c1"># Simulating task execution</span>
<span class="k">for</span> <span class="n">task</span> <span class="ow">in</span> <span class="n">todo_tasks</span><span class="p">:</span>
    <span class="n">progress</span><span class="o">.</span><span class="n">value</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.05</span><span class="p">)</span>
    <span class="n">progress</span><span class="o">.</span><span class="n">description</span> <span class="o">=</span> <span class="n">task</span>
<span class="n">progress</span><span class="o">.</span><span class="n">description</span> <span class="o">=</span> <span class="s1">'(Done)'</span>
</code></pre></div>
<figure>
<video loop auto autoplay>
<source src="https://blog.liang2.tw/posts/2016/03/notebook-progress-bar/pics/progressbar_demo.webm" type="video/webm">
<source src="https://blog.liang2.tw/posts/2016/03/notebook-progress-bar/pics/progressbar_demo.mp4" type="video/mp4">
Your browser doesn't support HTML5 video in WebM with VP8 or MP4 with H.264. You can still download the <a href="https://blog.liang2.tw/posts/2016/03/notebook-progress-bar/pics/progressbar_demo.mp4">screencast</a> and view it locally.
</video>
<p class="caption center">Progressbar in action</p>
</figure>
<p>Pretty convenient, right?</p>
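<p>The update loop above can be factored into a small reusable helper. This is only a sketch: <code>track</code> and <code>on_progress</code> are hypothetical names, not part of ipywidgets, and the callback is kept generic so the snippet runs without a notebook.</p>

```python
import time


def track(tasks, on_progress):
    """Yield each task, then report progress once the caller has handled it.

    `on_progress(done, total, task)` is a hypothetical callback; in a
    notebook it could set `progress.value` and `progress.description`.
    """
    tasks = list(tasks)
    total = len(tasks)
    for done, task in enumerate(tasks, start=1):
        yield task
        # Runs when the caller asks for the next item, i.e. after its loop body.
        on_progress(done, total, task)


# Record the updates in a list instead of drawing a widget.
updates = []
todo_tasks = ['task %02d' % i for i in range(5)]
for task in track(todo_tasks, lambda done, total, t: updates.append((done, total, t))):
    time.sleep(0.01)  # simulate work
```

<p>In a notebook, the callback would presumably assign <code>done</code> to the <code>.value</code> of an <code>IntProgress</code> whose <code>.max</code> was set to <code>total</code>, exactly as in the loop above.</p>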
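<h3 id="misc">Misc. Screenshots</h3>
<p>Most of the time spent on this post went into taking screenshots and making the animation XD</p>
<p>For browser screenshots, Firefox (45) can now capture just a selected DOM element, which is very handy.</p>
<p>Recording the final animation took quite a while. At first I considered GIF, but <a href="http://blog.imgur.com/2014/10/09/introducing-gifv/">it is 2016, why still use GIF?</a> For a screen recording I could tune the GIF palette to keep it decent-looking and small, but going with H.264 / VP9 felt much simpler.</p>
<p>I recorded with QuickTime screen capture, which lets you select just a region of the screen when recording starts. On my 13” retina display this produced a 1636x736 H.264 .mov file. I did not need that much resolution, so the final output is 480p (1148x480), with some white margin cropped off along the way.</p>
<p>With the HTML5 <a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Element/video"><code><video></code></a> element, MP4 / WebM files can be used like animations:</p>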
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">video</span> <span class="na">loop</span> <span class="na">autoplay</span><span class="p">></span>
<span class="p"><</span><span class="nt">source</span> <span class="na">src</span><span class="o">=</span><span class="s">"vid.webm"</span> <span class="na">type</span><span class="o">=</span><span class="s">"video/webm"</span><span class="p">></span>
<span class="p"><</span><span class="nt">source</span> <span class="na">src</span><span class="o">=</span><span class="s">"vid.mp4"</span> <span class="na">type</span><span class="o">=</span><span class="s">"video/mp4"</span><span class="p">></span>
Your browser doesn't support HTML5 video in WebM with VP8
or MP4 with H.264. You can still download the
<span class="p"><</span><span class="nt">a</span> <span class="na">href</span><span class="o">=</span><span class="s">"vid.mp4"</span><span class="p">></span>screencast<span class="p"></</span><span class="nt">a</span><span class="p">></span> and view it locally.
<span class="p"></</span><span class="nt">video</span><span class="p">></span>
</code></pre></div>
<p>Browser support can be checked on caniuse.com: <a href="http://caniuse.com/#feat=webm">WebM</a>, <a href="http://caniuse.com/#feat=mpeg4">MP4</a></p>
<h4 id="ffmpeg">Converting with FFmpeg</h4>
<p>These are just notes; I did not seriously tune the parameters for the smallest output. The VP9 part follows the <a href="https://trac.ffmpeg.org/wiki/Encode/VP9">FFmpeg</a> wiki<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># H.264 MP4</span>
ffmpeg<span class="w"> </span>-i<span class="w"> </span>Untitled.mov<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-vcodec<span class="w"> </span>h264<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-strict<span class="w"> </span>-2<span class="w"> </span>-crf<span class="w"> </span><span class="m">22</span><span class="w"> </span>-preset<span class="w"> </span>slow<span class="w"> </span>-r<span class="w"> </span><span class="m">24</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-vf<span class="w"> </span><span class="s2">"crop=iw:ih-52:0:10, scale=-1:480"</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>out.mp4
<span class="c1"># VP9 WebM</span>
ffmpeg<span class="w"> </span>-i<span class="w"> </span>Untitled.mov<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-vcodec<span class="w"> </span>libvpx-vp9<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-b:v<span class="w"> </span>150K<span class="w"> </span>-r<span class="w"> </span><span class="m">24</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-vf<span class="w"> </span><span class="s2">"crop=iw:ih-52:0:10, scale=-1:480"</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>out.webm
</code></pre></div>
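<p>The output size follows from the <code>crop</code> and <code>scale</code> filters above. Here is a rough arithmetic check; <code>cropped_scaled_size</code> is a hypothetical helper, and FFmpeg's exact rounding of the <code>-1</code> dimension may differ from Python's <code>round</code>.</p>

```python
def cropped_scaled_size(width, height, crop_rows, target_height):
    """Size after `crop=iw:ih-<crop_rows>` followed by `scale=-1:<target_height>`."""
    cropped_height = height - crop_rows
    # scale=-1:H keeps the aspect ratio of the cropped frame
    scaled_width = round(width * target_height / cropped_height)
    return scaled_width, target_height


# 1636x736 recording, 52 rows cropped, scaled to a height of 480
size = cropped_scaled_size(1636, 736, 52, 480)
```

<p>For the 1636x736 recording this gives the 1148x480 mentioned earlier.</p>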
<p>The 4-second clip ends up at about 60KB, which is quite good. Many of my PNG screenshots are much larger.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>du<span class="w"> </span>-sh<span class="w"> </span>./*<span class="w"> </span><span class="p">|</span><span class="w"> </span>gsort<span class="w"> </span>-rh
<span class="go">744K ./Untitled.mov</span>
<span class="go"> 60K ./out.mp4</span>
<span class="go"> 52K ./out.webm</span>
</code></pre></div>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>libvpx has to be installed separately, for example: <code>brew install ffmpeg --with-libvpx</code> <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
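</div>Using Let's Encrypt for HTTPS on the Python documentation translation server2016-02-21T14:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-02-21:/posts/2016/02/pydoctw-https/<p>A summary of the configuration changes needed to move the server from HTTP to HTTPS.</p><p>The Chinese translation of the official Python documentation is now available at <a href="https://docs.python.org.tw">https://docs.python.org.tw</a>.</p>
<p>The server setup itself is described in my <a href="https://blog.liang2.tw/posts/2016/02/pydoctw-server/">previous post</a>. Adding HTTPS requires setting up a certificate, and I chose <a href="https://letsencrypt.org/">Let’s Encrypt</a> as the certificate authority.</p>
<p>Let’s Encrypt (LE) uses the ACME (Automated Certificate Management Environment) protocol to verify that you own the domain you want a certificate for. The official site has an <a href="https://letsencrypt.org/howitworks/technology/">illustrated explanation</a>. In short, LE asks your server to serve a specific file at a specific path; if you can do that, you are considered the owner of the domain. You register with the LE server the first time, and afterwards the certificate has to be renewed at least once every 90 days.</p>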
<div class="toc">
<ul>
<li><a href="#_1">References</a></li>
<li><a href="#lets-encrypt-certificate">Let’s Encrypt Certificate</a></li>
<li><a href="#systemd-timer-certificate">Renewing the certificate periodically with a systemd timer</a><ul>
<li><a href="#systemd-service">Systemd service</a></li>
<li><a href="#systemd-timer">Systemd Timer</a></li>
</ul>
</li>
<li><a href="#nginx-http-redirect-to-https">nginx HTTP redirect to HTTPS</a></li>
<li><a href="#https">Testing the HTTPS configuration</a></li>
<li><a href="#_2">Thoughts</a></li>
<li><a href="#misc">Misc.</a></li>
</ul>
</div>
<h3 id="_1">References</h3>
<p>I am not a network security expert; the settings here are compiled from explanations found online. The LE certificate setup follows <a href="https://robmclarty.com/blog/how-to-secure-your-web-app-using-https-with-letsencrypt"><em>How to Secure Your Web App Using HTTPS With Letsencrypt</em></a> by Rob McLarty.</p>
<h3 id="lets-encrypt-certificate">Let’s Encrypt Certificate</h3>
<p>I did not use the official LE client but <a href="https://github.com/diafygi/acme-tiny/">acme-tiny</a>, written by <a href="https://daylightpirates.org/">Daniel Roesler</a>. It is a Python script of fewer than 200 lines, so you can check for yourself that it does not do anything strange. The <a href="https://github.com/diafygi/acme-tiny/">acme-tiny</a> README also has a setup tutorial, which is largely the same.</p>
<p>I basically followed the <a href="https://robmclarty.com/blog/how-to-secure-your-web-app-using-https-with-letsencrypt"><em>How to Secure Your Web App Using HTTPS With Letsencrypt</em></a> article, with the following adjustments:</p>
<ol>
<li>When creating the letsencrypt account, disable password login.<br>
That is: <code>adduser --disabled-password ...</code></li>
<li>Manage the <a href="https://github.com/diafygi/acme-tiny/">acme-tiny</a> script with git.</li>
<li>Control nginx through systemd instead of the service command.<br>
That is: <code>systemctl reload nginx</code></li>
<li>Use <a href="https://wiki.archlinux.org/index.php/Systemd/Timers">systemd Timers</a> instead of crontab.</li>
<li>Redirect HTTP connections to HTTPS.</li>
</ol>
<p>Items 4 and 5 need more configuration, so they are covered below.</p>
<h3 id="systemd-timer-certificate">Renewing the certificate periodically with a systemd timer</h3>
<p>This part follows <a href="http://www.certdepot.net/rhel7-use-systemd-timers/"><em>RHEL7: How to use Systemd timers.</em></a></p>
<p>A systemd timer is conceptually similar to crontab, but it is integrated with systemd. <code><unit>.timer</code> periodically runs <code><unit>.service</code>, so the result and status of the unit are visible in systemd, and you also get journald's logging.</p>
<p>The renewal commands live in <code>/etc/letsencrypt/renew_cert.sh</code>. The shell script is the same as in the referenced article.</p>
<h4 id="systemd-service">Systemd service</h4>
<p>First create a renew_cert service; on Debian, for example, it goes in <code>/etc/systemd/system/renew_cert.service</code>:</p>
<div class="highlight"><pre><span></span><code><span class="k">[Unit]</span>
<span class="na">Description</span><span class="o">=</span><span class="s">Renew Let's Encrypt cert</span>
<span class="na">After</span><span class="o">=</span><span class="s">syslog.target</span>
<span class="k">[Service]</span>
<span class="na">User</span><span class="o">=</span><span class="s">letsencrypt</span>
<span class="na">Group</span><span class="o">=</span><span class="s">letsencrypt</span>
<span class="na">ExecStart</span><span class="o">=</span><span class="s">/etc/letsencrypt/renew_cert.sh</span>
<span class="na">Type</span><span class="o">=</span><span class="s">simple</span>
<span class="na">StandardError</span><span class="o">=</span><span class="s">syslog</span>
<span class="k">[Install]</span>
<span class="na">WantedBy</span><span class="o">=</span><span class="s">multi-user.target</span>
</code></pre></div>
<p>To renew the certificate manually, just start this service:</p>
<div class="highlight"><pre><span></span><code>systemctl start renew_cert
</code></pre></div>
<p>We do not need to enable it; otherwise it would run on every boot. To check the result or the logs:</p>
<div class="highlight"><pre><span></span><code>systemctl status renew_cert
journalctl -e -u renew_cert
</code></pre></div>
<h4 id="systemd-timer">Systemd Timer</h4>
<p>Create <code>/etc/systemd/system/renew_cert.timer</code>:</p>
<div class="highlight"><pre><span></span><code><span class="k">[Unit]</span>
<span class="na">Description</span><span class="o">=</span><span class="s">Update Let's Encrypt certificate every two months</span>
<span class="k">[Timer]</span>
<span class="na">OnCalendar</span><span class="o">=</span><span class="s">*-1/2-1 16:00:00</span>
<span class="na">Unit</span><span class="o">=</span><span class="s">renew_cert.service</span>
<span class="k">[Install]</span>
<span class="na">WantedBy</span><span class="o">=</span><span class="s">multi-user.target</span>
</code></pre></div>
<p>The only part that matters is the <code>[Timer]</code> section. <code>Unit=</code> names the service to start. <code>OnCalendar=</code><sup id="fnref:calendar"><a class="footnote-ref" href="#fn:calendar">1</a></sup> makes the timer fire at the specified points in time (in UTC<sup id="fnref:utc"><a class="footnote-ref" href="#fn:utc">2</a></sup>).</p>
<p>The value <code>*-1/2-1 16:00:00</code> used here means the certificate is renewed at 16:00 UTC on the 1st day of every month 1+2n of each year, i.e. at midnight on the 2nd of January, March, May, ... in Taiwan time.</p>
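<p>To double-check that reading of the calendar expression, here is a small sketch that enumerates the matching times by hand. It does not parse systemd's syntax; <code>next_runs</code> is a hypothetical helper that hard-codes the rule "1st of every odd month, 16:00 UTC", and Taiwan time is assumed to be a fixed UTC+8.</p>

```python
from datetime import datetime, timedelta, timezone

TAIPEI = timezone(timedelta(hours=8))  # assumption: fixed UTC+8, no DST


def next_runs(after, count):
    """Return the next `count` times matching `*-1/2-1 16:00:00` (UTC)."""
    runs = []
    year = after.year
    while len(runs) < count:
        for month in range(1, 13, 2):  # 1/2 means months 1, 3, 5, ...
            run = datetime(year, month, 1, 16, 0, tzinfo=timezone.utc)
            if run > after and len(runs) < count:
                runs.append(run)
        year += 1
    return runs


runs = next_runs(datetime(2016, 2, 21, tzinfo=timezone.utc), 7)
local_runs = [run.astimezone(TAIPEI) for run in runs]
# Gaps between consecutive renewals, to compare against the 90-day limit
gaps = [later - earlier for earlier, later in zip(runs, runs[1:])]
```

<p>Starting from the date of this post, the first run lands on 2016-03-01 16:00 UTC, which is midnight on March 2 in Taiwan, and every gap between runs stays well under 90 days.</p>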
<p>Enable the timer; it has to be enabled so that it is activated again after a reboot.</p>
<div class="highlight"><pre><span></span><code>systemctl enable renew_cert.timer
systemctl start renew_cert.timer
</code></pre></div>
<p>Use <code>systemctl list-timers</code> to check when it will run next:</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>sudo<span class="w"> </span>systemctl<span class="w"> </span>list-timers<span class="w"> </span>renew_cert.timer
<span class="go">NEXT LEFT LAST PASSED UNIT ACTIVATES</span>
<span class="go">Tue 2016-03-01 16:00:00 UTC 1 weeks 2 days left n/a n/a renew_cert.timer renew_cert.service</span>
</code></pre></div>
<h3 id="nginx-http-redirect-to-https">nginx HTTP redirect to HTTPS</h3>
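<p>(I am not confident about this configuration; if you know a better way, please let me know > <)</p>
<p>The problem to solve: the ACME challenge goes over HTTP, while all other connections are redirected to HTTPS.</p>
<p>In nginx, remove <code>listen 80;</code> and the ACME challenge part from the main server configuration, and move them into a new server block:</p>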
<div class="highlight"><pre><span></span><code><span class="k">server</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">listen</span><span class="w"> </span><span class="mi">80</span><span class="p">;</span>
<span class="w"> </span><span class="kn">server_name</span><span class="w"> </span><span class="s">docs.python.org.tw</span><span class="p">;</span>
<span class="w"> </span><span class="c1"># For Let's Encrypt ACME challenge files</span>
<span class="w"> </span><span class="kn">location</span><span class="w"> </span><span class="s">/.well-known/acme-challenge/</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">alias</span><span class="w"> </span><span class="s">/var/www/challenges/</span><span class="p">;</span>
<span class="w"> </span><span class="kn">try_files</span><span class="w"> </span><span class="nv">$uri</span><span class="w"> </span><span class="p">=</span><span class="mi">404</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="kn">location</span><span class="w"> </span><span class="s">/</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">return</span><span class="w"> </span><span class="mi">301</span><span class="w"> </span><span class="s">https://</span><span class="nv">$host$request_uri</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
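<p>The decision this server block makes can be sketched in a few lines of Python. This only mimics the two <code>location</code> rules for illustration; <code>route</code> is a hypothetical function, it does no path sanitising, and it is not how nginx is implemented.</p>

```python
CHALLENGE_PREFIX = '/.well-known/acme-challenge/'
CHALLENGE_DIR = '/var/www/challenges/'


def route(host, request_uri):
    """Mimic the port-80 server block: serve challenge files, redirect the rest."""
    if request_uri.startswith(CHALLENGE_PREFIX):
        token = request_uri[len(CHALLENGE_PREFIX):]
        # Like `alias`: the rest of the URI is looked up in the challenge directory
        return ('file', CHALLENGE_DIR + token)
    # Like `return 301 https://$host$request_uri;`
    return ('redirect', 'https://' + host + request_uri)


challenge = route('docs.python.org.tw', '/.well-known/acme-challenge/abc123')
other = route('docs.python.org.tw', '/3/tutorial/')
```

<p>Only requests under the challenge prefix are answered over plain HTTP; everything else gets a redirect to the same host and URI over HTTPS.</p>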
<h3 id="https">Testing the HTTPS configuration</h3>
<p>I tested it with <a href="https://securityheaders.io/">securityheaders.io</a> and <a href="https://www.ssllabs.com/index.html">SSL Labs</a>, and the results look decent:</p>
<figure>
<img src="https://blog.liang2.tw/posts/2016/02/pydoctw-https/pics/pydoctw_securityheaders_report.png"/>
<p class="caption center">Report from securityheaders.io (<a href="https://securityheaders.io/?q=https%3A%2F%2Fdocs.python.org.tw%2F3%2F">Live Report</a>)</p>
</figure>
<figure>
<img src="https://blog.liang2.tw/posts/2016/02/pydoctw-https/pics/pydoctw_ssllabs_report.png"/>
<p class="caption center">Report from SSL Labs (<a href="https://www.ssllabs.com/ssltest/analyze.html?d=docs.python.org.tw">Live Report</a>)</p>
</figure>
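<h3 id="_2">Thoughts</h3>
<p>Overall, using <a href="https://letsencrypt.org/">Let’s Encrypt</a> is not hard, but it is not trivially easy either.</p>
<p>If you are willing to give it root access and your private key, the official client's <code>letsencrypt-auto</code> takes fewer steps, and it claims to configure Apache fully automatically. If the <a href="https://github.com/diafygi/acme-tiny/">acme-tiny</a> commands feel too complicated, its author also wrote <a href="https://gethttpsforfree.com/">Get HTTPS for free!</a>, a web-based service that walks you through the whole registration process.</p>
<p>Note that during the current public beta, certificates for the same domain can only be issued 5 times within 7 days, so do not get carried away while testing, or you will have to wait a week.</p>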
<h3 id="misc">Misc.</h3>
<p>When creating the CSR (Certificate Signing Request) file, you can include your own email address; otherwise the default is <code>webmaster@<domain></code>:</p>
<div class="highlight"><pre><span></span><code>openssl<span class="w"> </span>req<span class="w"> </span>-new<span class="w"> </span>-sha256<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-key<span class="w"> </span>/etc/letsencrypt/private/domain.key<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-subj<span class="w"> </span><span class="s2">"/CN=docs.python.org.tw/emailAddress=me+pydoctw@liang2.tw"</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>><span class="w"> </span>/etc/letsencrypt/private/domain.csr
</code></pre></div>
<div class="footnote">
<hr>
<ol>
<li id="fn:calendar">
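<p>Besides <code>OnCalendar</code> there are many other ways to configure a timer, such as <code>OnUnitActiveSec</code>. However, the other timing methods all depend on whether the machine has been running, which affects how the time is counted. <a class="footnote-backref" href="#fnref:calendar" title="Jump back to footnote 1 in the text">↩</a></p>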
</li>
<li id="fn:utc">
<p>The Calendar Events section of <a href="https://www.freedesktop.org/software/systemd/man/systemd.time.html#Calendar%20Events">systemd.time (7)</a> on Debian Jessie has no way to specify a time zone, so adding one causes a parse error. Newer versions of systemd seem to support time zones. In any case, confirm the actual run time with <code>systemctl list-timers</code>. <a class="footnote-backref" href="#fnref:utc" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>Add code block language name into CSS classes in Pelican Markdown2016-02-19T15:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-02-19:/posts/2016/02/markdown-codehilite-lang/<p>I used <a href="http://docs.getpelican.com/">Pelican</a> and its <a href="https://pythonhosted.org/Markdown/">Markdown</a> plugin to render my blog posts.</p>
<p>Recently I was playing with the <a href="https://docs.python.org/">Python Official Documentation</a>, which has a decent code syntax highlighter powered by <a href="http://pygments.org/">Pygments</a>.</p>
<p>What’s more, the output of code examples can be toggled. That is, a code example:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="nb">print</span><span class="p">(</span><span class="s1">'Hello World'</span><span class="p">)</span>
<span class="n">Hello</span> <span class="n">World</span>
<span class="o">>>></span> <span class="mi">6</span> <span class="o">*</span> <span class="mi">7</span>
<span class="mi">42</span>
</code></pre></div>
<p>can be toggled to:</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="s1">'Hello World'</span><span class="p">)</span>
<span class="mi">6</span> <span class="o">*</span> <span class="mi">7</span>
</code></pre></div>
<p>which is very convenient for code copy-pasting.</p>
<p><del>However, this feature is currently broken on the official Python docs (it is provided by <a href="https://docs.python.org/3/_static/copybutton.js"><code>copybutton.js</code></a>) because a jQuery update changed the API behavior it relied on. I’ve filed <a href="http://bugs.python.org/issue26246">issue 26246</a> on the Python issue tracker for this problem.</del> (EDIT 2016-02-27: the patch has been merged.)</p>
<h3 id="code-output-toggle-in-pelican">Code output toggle in Pelican</h3>
<p>After I fixed the copybutton.js, I wished to add this functionality to my blog.</p>
<p>Code highlighting in Pelican markdown files is handled by its <a href="https://pythonhosted.org/Markdown/extensions/code_hilite.html">CodeHilite</a> extension. To my surprise, I found that CodeHilite does not expose the language name specified for each code block.</p>
<p>What I expected was</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">div</span> <span class="na">class</span><span class="o">=</span><span class="s">"highlight-python3"</span><span class="p">></span>
<span class="p"><</span><span class="nt">div</span> <span class="na">class</span><span class="o">=</span><span class="s">"highlight"</span><span class="p">></span>
<span class="p"><</span><span class="nt">pre</span><span class="p">></span>
<span class="cm"><!-- ... --></span>
<span class="p"></</span><span class="nt">pre</span><span class="p">></span>
<span class="p"></</span><span class="nt">div</span><span class="p">></span>
<span class="p"></</span><span class="nt">div</span><span class="p">></span>
</code></pre></div>
<p>but the actual output was</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">div</span> <span class="na">class</span><span class="o">=</span><span class="s">"highlight"</span><span class="p">></span>
<span class="p"><</span><span class="nt">pre</span><span class="p">></span>
<span class="cm"><!-- ... --></span>
<span class="p"></</span><span class="nt">pre</span><span class="p">></span>
<span class="p"></</span><span class="nt">div</span><span class="p">></span>
</code></pre></div>
<p>So there is no way to find the language name a code block used, nor the lexer alias Pygments guessed when no language name was specified.</p>
<p>A quick dig into the <a href="https://github.com/waylan/Python-Markdown/blob/master/markdown/extensions/codehilite.py#L106-L123">source code</a> showed that it is relatively easy to fix. Here is the diff:</p>
<div class="highlight"><pre><span></span><code><span class="gh">diff --git a/extensions/codehilite.py b/extensions/codehilite_updated.py</span>
<span class="gh">index 0657c37..4fad7c5 100644</span>
<span class="gd">--- a/extensions/codehilite.py</span>
<span class="gi">+++ b/extensions/codehilite_updated.py</span>
<span class="gu">@@ -75,7 +75,8 @@ class CodeHilite(object):</span>
<span class="w"> </span> def __init__(self, src=None, linenums=None, guess_lang=True,
<span class="w"> </span> css_class="codehilite", lang=None, style='default',
<span class="gd">- noclasses=False, tab_length=4, hl_lines=None, use_pygments=True):</span>
<span class="gi">+ noclasses=False, tab_length=4, hl_lines=None, use_pygments=True,</span>
<span class="gi">+ wrap_by_lang=True):</span>
<span class="w"> </span> self.src = src
<span class="w"> </span> self.lang = lang
<span class="w"> </span> self.linenums = linenums
<span class="gu">@@ -86,6 +87,7 @@ class CodeHilite(object):</span>
<span class="w"> </span> self.tab_length = tab_length
<span class="w"> </span> self.hl_lines = hl_lines or []
<span class="w"> </span> self.use_pygments = use_pygments
<span class="gi">+ self.wrap_by_lang = wrap_by_lang</span>
<span class="w"> </span> def hilite(self):
<span class="w"> </span> """
<span class="gu">@@ -114,13 +116,22 @@ class CodeHilite(object):</span>
<span class="w"> </span> lexer = get_lexer_by_name('text')
<span class="w"> </span> except ValueError:
<span class="w"> </span> lexer = get_lexer_by_name('text')
<span class="gi">+ lang = lexer.aliases[0]</span>
<span class="w"> </span> formatter = get_formatter_by_name('html',
<span class="w"> </span> linenos=self.linenums,
<span class="w"> </span> cssclass=self.css_class,
<span class="w"> </span> style=self.style,
<span class="w"> </span> noclasses=self.noclasses,
<span class="w"> </span> hl_lines=self.hl_lines)
<span class="gd">- return highlight(self.src, lexer, formatter)</span>
<span class="gi">+ hilited_html = highlight(self.src, lexer, formatter)</span>
<span class="gi">+ if self.wrap_by_lang and self.lang:</span>
<span class="gi">+ return '<div class="%(class)s-%(lang)s">%(html)s</div>\n' % {</span>
<span class="gi">+ 'class': self.css_class,</span>
<span class="gi">+ 'lang': lang.replace('+', '-'),</span>
<span class="gi">+ 'html': hilited_html,</span>
<span class="gi">+ }</span>
<span class="gi">+ else:</span>
<span class="gi">+ return hilited_html</span>
<span class="w"> </span> else:
<span class="w"> </span> # just escape and build markup usable by JS highlighting libs
<span class="w"> </span> txt = self.src.replace('&', '&amp;')
</code></pre></div>
<p>I’m happy with the patched codehilite output. I can now enable the code toggle for specific languages.</p>
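<p>The wrapping step of the patch can also be tried on its own, without Python-Markdown or Pygments installed. In this minimal sketch, <code>wrap_by_lang</code> as a free function is only an illustration; in the diff the same logic lives inside <code>CodeHilite.hilite</code>.</p>

```python
def wrap_by_lang(hilited_html, css_class, lang):
    """Wrap highlighted HTML in a div whose class carries the language name."""
    if not lang:
        return hilited_html
    return '<div class="%(class)s-%(lang)s">%(html)s</div>\n' % {
        'class': css_class,
        # Lexer aliases such as 'html+django' contain '+'; use '-' in the class
        'lang': lang.replace('+', '-'),
        'html': hilited_html,
    }


wrapped = wrap_by_lang('<div class="highlight"><pre>...</pre></div>',
                       'highlight', 'html+django')
unwrapped = wrap_by_lang('<div class="highlight"><pre>...</pre></div>',
                         'highlight', None)
```

<p>A toggle script can then select blocks by language, for example everything under <code>div.highlight-python3</code>.</p>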
<p>However, I’m quite busy these days, so it may take a while to submit a proper pull request (e.g. fix any broken unit tests, write new tests, and tune the API as well as the new behavior). Moreover, <strong>my site currently does not use jQuery</strong>, so I am missing a huge dependency. Rewriting it in vanilla JS seems to require considerable work, and the very thing I don’t have at hand is time :(</p>
<p>I’ve decided to leave this improvement for future development. But if your site uses Pelican Markdown and imports jQuery, the diff will add the code language back.</p>Setting up an auto-updating server for the Chinese translation of the official Python documentation2016-02-14T21:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-02-14:/posts/2016/02/pydoctw-server/<p>Set up a server that automatically updates the Chinese translation of the Python documentation and hosts the translated HTML docs. It uses Django as the web server and Django-Q as the task queue; the deploy stack is nginx and uWSGI, hosted on Amazon EC2 (Debian Jessie), with PostgreSQL as the database and systemd managing the related processes.</p><p><em>TL;DR</em> Visit <a href="http://docs.python.org.tw">http://docs.python.org.tw</a> to see the automatically updated <a href="http://docs.python.org.tw/3/">Chinese documentation</a> and the <a href="http://docs.python.org.tw/_build/">build server</a>.</p>
<p>EDIT 2016-02-16: added the language code, git sshconfig, and swap settings; polished the wording.<br>
EDIT 2016-02-20: added the tmpfiles.d settings.</p>
<div class="toc">
<ul>
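<li><a href="#python">Python documentation Chinese translation project</a><ul>
<li><a href="#sphinx">How Sphinx handles multilingual documentation</a></li>
<li><a href="#transifex-po">Transifex: an online service for collaborative po file translation</a></li>
<li><a href="#_1">Improving the translation experience</a></li>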
</ul>
</li>
<li><a href="#pydoc-autobuild-server">PyDoc Autobuild Server</a><ul>
<li><a href="#_2">Implementation</a><ul>
<li><a href="#sphinx_1">Sphinx documentation</a></li>
<li><a href="#autobuild-django-server">Autobuild Django server</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#autobuild-server">Autobuild server deployment</a><ul>
<li><a href="#_3">Operating system</a><ul>
<li><a href="#python-35-and-apt-pinning">Python 3.5 and APT-pinning</a></li>
<li><a href="#postgresql">Database: PostgreSQL</a></li>
<li><a href="#swap">Swap</a></li>
<li><a href="#git-repo-ssh-config">Git repo ssh config</a></li>
<li><a href="#tmpfilesd">tmpfiles.d</a></li>
</ul>
</li>
<li><a href="#django-stack-nginx-uwsgi">Django Stack – nginx + uWSGI</a><ul>
<li><a href="#nginx">nginx configuration</a></li>
<li><a href="#uwsgi">uWSGI configuration</a></li>
</ul>
</li>
<li><a href="#systemd-services">Systemd services</a></li>
</ul>
</li>
<li><a href="#_4">Summary</a></li>
</ul>
</div>
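<h2 id="python">Python documentation Chinese translation project</h2>
<p>Lately I have been preparing the <a href="https://github.com/python-doc-tw/python-doc-tw">Python documentation Chinese translation project</a>. The translation itself is not moving very actively yet, but after the <a href="http://www.meetup.com/Taipei-py/events/226558484/">past</a> <a href="http://www.meetup.com/Taipei-py/events/227001232/">few</a> Taipei.py Projects On sprints, quite a few people have joined the effort. Everyone has their own topic to translate; I started with the <a href="http://docs.python.org.tw/3/tutorial/index.html">Tutorial</a>.</p>
<h3 id="sphinx">How Sphinx handles multilingual documentation</h3>
<p>First, a brief introduction to the structure of the <a href="https://docs.python.org/3/">CPython Documentation</a> (pydoc below) and how it is translated. pydoc is a standard <a href="http://www.sphinx-doc.org/en/stable/">Sphinx</a> document, so translation uses Sphinx's built-in <a href="http://www.sphinx-doc.org/en/stable/intl.html">internationalization</a> (i18n or intl) feature to render the content in another language.</p>
<p>As in Django and other projects, i18n goes through gettext. Sphinx outputs a po file with the same name for each rst file. Each text paragraph in an rst file maps to one entry in the po file, and unrelated blocks such as code examples are skipped. The output po files are placed under a matching path such as <code>locale/<lang>/LC_MESSAGES/xxx.po</code>.</p>
<p>The <a href="https://en.wikipedia.org/wiki/Gettext">po file format</a> is simple. Skipping the assorted headers, the actual content looks like this:</p>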
<div class="highlight"><pre><span></span><code><span class="kd">#: ../../tutorial/appetite.rst:50</span>
<span class="nv">msgid</span> <span class="s">""</span>
<span class="s">"Python enables programs to be written compactly and readably. Programs "</span>
<span class="s">"written in Python are typically much shorter than equivalent C, C++, or "</span>
<span class="s">"Java programs, for several reasons:"</span>
<span class="nv">msgstr</span> <span class="s">""</span>
<span class="s">"Python 讓程式寫得精簡並易讀。用 Python 實作的程式長度往往遠比用 "</span>
<span class="s">"C、C++、Java 實作的短。這有以下幾個原因:"</span>
</code></pre></div>
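<p>In practice, Sphinx first writes a clean po template (called a pot file) to <code>locale/pot/</code>, which is essentially a po file containing only the source text. For each new language, a po file is created from the pot file under its own <code>locale/<lang>/</code> directory, and translators edit that po file.</p>
<p>Once the translation is done, Sphinx first calls gettext to compile the po files into mo files, which speeds up translation lookups. To build the translated documentation you only need to set a different language: Sphinx finds that language's mo files, replaces the source strings with their content, and you get the Chinese documentation.</p>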
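<p>To make the msgid/msgstr pairing concrete, here is a minimal Python sketch of reading po entries and falling back to the source text when no translation exists. It is only an illustration under simplifying assumptions: the helper names <code>parse_po</code> and <code>translate</code> are hypothetical, plural forms and <code>msgctxt</code> are ignored, and this is not how Sphinx or gettext actually implement the lookup.</p>

```python
import ast


def parse_po(text):
    """Return {msgid: msgstr} for the simple entries found in po-formatted text."""
    catalog = {}
    fields = {"msgid": [], "msgstr": []}
    current = None

    def flush():
        # Join the continued string pieces of one entry and store it.
        msgid = "".join(fields["msgid"])
        msgstr = "".join(fields["msgstr"])
        if msgid:  # the empty msgid is the header entry; skip it
            catalog[msgid] = msgstr
        fields["msgid"], fields["msgstr"] = [], []

    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments such as "#: file.rst:50"
        if line.startswith("msgid "):
            if current == "msgstr":
                flush()  # a new entry starts, store the previous one
            current = "msgid"
            line = line[len("msgid "):]
        elif line.startswith("msgstr "):
            current = "msgstr"
            line = line[len("msgstr "):]
        if current and line.startswith('"'):
            # po strings use C-style quoting, close enough to Python literals
            fields[current].append(ast.literal_eval(line))
    flush()
    return catalog


def translate(catalog, msgid):
    """Return the translation, or the source text if it is missing or empty."""
    return catalog.get(msgid) or msgid
```

<p>An untranslated entry has an empty <code>msgstr</code>, so <code>translate</code> returns the original English text, which matches what you see on pages that are only partly translated.</p>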
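<h3 id="transifex-po">Transifex: translating po files collaboratively online</h3>
<p>That is the whole Sphinx translation workflow, so translating only means editing the Chinese (lang code: zh-Hant<sup id="fnref:zh-Hant"><a class="footnote-ref" href="#fn:zh-Hant">1</a></sup>) po files. Writing the po format by hand is still too high a barrier, though, which is where sites like <a href="https://www.transifex.com/">Transifex</a> come in. You upload po/pot files, edit the translations online, and download the result as po files. I think this is currently the lowest-barrier way to take part in gettext-based PO translation; at least the Japanese translation does it this way. So anyone who wants to join the pydoc translation only needs to log in to Transifex to start editing.</p>
<p>Transifex has other benefits too. For example, its POS tagging can mark technical terms so that a single agreed translation can be defined; these terms are collected in the glossary and are suggested automatically when they appear during translation. Similar source sentences also show up as suggestions, which keeps wording and grammar consistent. It also offers edit history, sanity checks (such as a required format missing from the translation), comments, and issues.</p>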
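<h3 id="_1">Improving the translation experience</h3>
<p>By now the wording and workflow conventions have taken shape, and everything is in the <a href="https://github.com/python-doc-tw/python-doc-tw/wiki">project wiki</a>. So I started thinking about how to make it easier for people to take part and to see the results.</p>
<p>I found that taking part in the translation is no longer hard, and people have few questions. Transifex issues and comments work well for maintaining terminology and discussing translations. Overall the work can stay highly distributed.</p>
<p>The most common problems are rst format errors, missing required spaces, text that reads awkwardly once the surrounding code examples are added, and translations that distort or misread the source. I think translators can spot these themselves by reading the built pydoc page and checking the build log, without me having to explain.</p>
<p>Also, not being able to see the result of your own translation is <strong>really unrewarding</strong>, and I worry people will lose motivation after a while.</p>
<p>So what we need is a continuously updated build of the translation. Instructions for building the docs yourself are in the <a href="https://github.com/python-doc-tw/python-doc-tw/wiki/How-to-build-the-doc-locally">wiki</a>, but there are many steps, it is not that simple, and any error or question would end up coming to me, which defeats the distributed workflow.</p>
<p><strong>Why not build an autobuild server?</strong></p>
<p>That was the idea. But it was a big undertaking, and for a long time it stayed an idea. Over the Lunar New Year holiday I finally found time to build a prototype, which felt quite rewarding.</p>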
<h2 id="pydoc-autobuild-server">PyDoc Autobuild Server</h2>
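<p>A quick list of requirements:</p>
<ul>
<li>The URLs of the built docs mirror upstream <a href="https://docs.python.org/">https://docs.python.org/</a>. For example, /3/ is the latest Python 3.x, and /3.5/ currently redirects to /3/<sup id="fnref:pydoc-url"><a class="footnote-ref" href="#fn:pydoc-url">2</a></sup>.</li>
<li>Every page has an update-translation link; clicking it pulls the latest translation from Transifex and rebuilds the page.</li>
<li>The command output of every page update is kept, so rst syntax errors and similar problems are easy to check.</li>
<li>Translation updates go through a queue, so several people working at once do not overwhelm the autobuild server.</li>
<li>All documents are rebuilt daily and the updates are committed to the CPython-tw git repo. This process is logged as well.</li>
<li>All of the above can be set up easily on a local machine.</li>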
</ul>
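<h3 id="_2">Implementation</h3>
<p>The goal is to meet the requirements above. pydoc is essentially a static site, so nginx can host the static files once the paths are set. pydoc's Sphinx setup uses <a href="http://jinja.pocoo.org/docs/dev/">Jinja2</a> for its HTML templates, so a few extra variables are enough to control the page output and add extra links when building on the autobuild server. The autobuild server itself is a task queue. Its job is simple, but for ease of maintenance, and because it has to run in both local and production environments, I chose <a href="https://www.djangoproject.com/">Django</a> as the base. What Django actually handles is a few views: showing the task queue, showing task results, and accepting rebuild requests.</p>
<h4 id="sphinx_1">Sphinx documentation</h4>
<p>I did not want to make the Sphinx side complicated, so each page gets its own dedicated link; requesting that URL adds a task to the autobuild server to update that page<sup id="fnref:build-link"><a class="footnote-ref" href="#fn:build-link">3</a></sup>.</p>
<p>Adding the link during autobuild only requires changing the Sphinx doc template. When building, Sphinx accepts <a href="http://www.sphinx-doc.org/en/stable/man/sphinx-build.html#options"><code>-A <name=value></code></a> to add Jinja2 template variables, which controls how the template renders:</p>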
<div class="highlight"><pre><span></span><code><span class="c">{# <cpython-src>/Doc/tools/templates/layout.html #}</span>
<span class="cp">{%</span>- <span class="k">if</span> <span class="nv">autobuildi18n</span> <span class="cp">%}</span>
<span class="p"><</span><span class="nt">a</span> <span class="na">href</span><span class="o">=</span><span class="s">"/_build/update/?source_path=</span><span class="cp">{{</span> <span class="nv">pagename</span> <span class="cp">}}</span><span class="s">"</span><span class="p">></span>Update Translation<span class="p"></</span><span class="nt">a</span><span class="p">></span>
<span class="cp">{%</span>- <span class="k">endif</span> <span class="cp">%}</span>
</code></pre></div>
<ul>
<li>With <code>sphinx-build -A autobuildi18n=1</code>, this Jinja2 block is included and the Update Translation link appears.</li>
<li><a href="http://www.sphinx-doc.org/en/stable/templating.html#pagename"><code>{{ pagename }}</code></a> is the path of each page's rst source, without the file suffix.</li>
</ul>
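<p>As a rough sketch of how a build task might assemble this command line, the helper below returns the argument list for <code>sphinx-build</code>. The function name, the directory arguments, and the choice of <code>-D language=...</code> to select the language are assumptions for illustration; only the <code>-A autobuildi18n=1</code> flag comes from the setup described above.</p>

```python
def sphinx_build_command(source_dir, output_dir, language, autobuild=False):
    """Build the argument list for a sphinx-build HTML run.

    source_dir and output_dir are placeholder paths supplied by the caller.
    """
    cmd = ["sphinx-build", "-b", "html"]
    # -D overrides a conf.py setting; here it selects the output language.
    cmd += ["-D", "language={}".format(language)]
    if autobuild:
        # -A passes a value into the Jinja2 template context, which
        # enables the "Update Translation" link in layout.html.
        cmd += ["-A", "autobuildi18n=1"]
    cmd += [source_dir, output_dir]
    return cmd


if __name__ == "__main__":
    print(" ".join(sphinx_build_command("Doc", "Doc/build/html", "zh_Hant", autobuild=True)))
```

<p>Returning a list rather than a single string means the command can be handed to <code>subprocess.run</code> without shell quoting concerns.</p>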
<h4 id="autobuild-django-server">Autobuild Django server</h4>
<p>The Django server's job is to accept task requests and show task results, which is what any standard task queue provides.</p>
<p>There are many task queue options for Django. Under <a href="https://www.djangopackages.com/grids/g/workers-queues-tasks/">Workers, Queues, and Tasks</a> on <a href="https://www.djangopackages.com/">Django Packages</a>, several packages are actively maintained and popular:</p>
<ul>
<li><a href="http://celery.github.io/django-celery/">django-celery</a></li>
<li><a href="http://huey.readthedocs.org/en/latest/">huey</a></li>
<li><a href="https://github.com/ui/django-rq">django-RQ</a></li>
<li><a href="http://policystat.github.io/jobtastic/">jobtastic</a></li>
<li><a href="https://django-q.readthedocs.org/">django-Q</a></li>
</ul>
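<p>After excluding the packages without Python 3 support<sup id="fnref:python3"><a class="footnote-ref" href="#fn:python3">4</a></sup>, django-celery, django-RQ, and django-Q remain. The most popular and most established is django-celery. It integrates with <a href="http://www.celeryproject.org/">Celery</a> and is full-featured and stable; I have used it and liked it. Its drawbacks are that it has so many features that it gets complicated, and different message queues need many settings adjusted, so it takes a while to learn. Celery is commonly paired with <a href="https://www.rabbitmq.com/">RabbitMQ</a> or <a href="http://redis.io/">Redis</a>, which is necessary when there are many tasks, but we build docs maybe a dozen times a day, and without isolating the build environment only one worker can run at a time, so performance is not a concern. I therefore prefer to use the same database as Django, with no extra non-Python dependency, to keep local development simple.</p>
<p>In the end I chose <a href="https://django-q.readthedocs.org/">django-Q</a>. It is new but actively maintained by its author, and its workers can run with nothing more than Python's built-in multiprocessing. It is simple yet complete: it includes a monitor, integrates with django-admin, and supports scheduling. Starting a django-Q cluster takes only one extra command:</p>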
<div class="highlight"><pre><span></span><code>python manage.py qcluster
</code></pre></div>
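<p>That is all it takes, which is very convenient.</p>
<p>How to use django-Q is outside the scope of this post. I will probably submit a talk to PyCon TW or Taipei.py and write it up as a separate post then. Django-Q's documentation is clear, and reading it should be enough to get started.</p>
<h2 id="autobuild-server">Deploying the autobuild server</h2>
<p>(Deployment is actually the main point of this post; who knew the background would run this long.)</p>
<p>There are countless ways to deploy, each with pros and cons. You need to know at least one, so here is one of them:</p>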
<blockquote>
<p>nginx <-> uwsgi <-> Django</p>
</blockquote>
<p>This is a fairly popular combination. More completely, a request passes through:</p>
<blockquote>
<p>web client <-> nginx web server <-> socket <-> uwsgi <-> Django server</p>
</blockquote>
<p>The basic setup and tutorial come from the official <a href="http://uwsgi-docs.readthedocs.org/en/latest/index.html">uWSGI</a> docs, <a href="http://uwsgi-docs.readthedocs.org/en/latest/tutorials/Django_and_nginx.html"><em>Setting up Django and your web server with uWSGI and nginx</em></a>, combined with <a href="http://uwsgi-docs.readthedocs.org/en/latest/Systemd.html"><em>uWSGI and Systemd</em></a> for the integration with <a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a>.</p>
<p>This is also the current Pydoc production setup, recorded here for future maintenance.</p>
<h3 id="_3">Operating system</h3>
<p>The OS is Debian Jessie, running on Amazon EC2 on a t2.nano<sup id="fnref:ec2-nano"><a class="footnote-ref" href="#fn:ec2-nano">5</a></sup>.</p>
<p>Python web deployments install packages into a virtual environment to avoid conflicts between projects or with the system. On Debian, <code>apt build-dep python3-<pkg></code> installs the headers and libraries needed to build the Python package <pkg>, which is very simple.</p>
<h4 id="python-35-and-apt-pinning">Python 3.5 and APT-pinning</h4>
<p>My code uses <a href="https://docs.python.org/3/library/subprocess.html#subprocess.run"><code>subprocess.run</code></a>, an API available only in Python 3.5+. Jessie ships only Python 3.4, but I find the API so handy that I did not want to rewrite the code to be compatible with older versions.</p>
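<p>For readers who have not used it, here is a minimal sketch of the kind of call that makes <code>subprocess.run</code> attractive for a build task: run a command, wait for it, and keep its combined output for the build log. The helper name <code>run_and_log</code> is hypothetical and this is not the autobuild server's actual code.</p>

```python
import subprocess
import sys


def run_and_log(cmd):
    """Run cmd (a list of arguments) and return (exit code, combined output)."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # merge stderr into stdout, like a terminal log
        universal_newlines=True,    # decode bytes to str (available on Python 3.5)
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Use the current interpreter as a stand-in for a real build command.
    code, output = run_and_log([sys.executable, "-c", "print('build finished')"])
    print(code, output)
```

<p>The returned text can be stored alongside the task result, which is how rst errors from a build become visible to translators.</p>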
<p>So I need to install the latest Python 3.5 from Debian's testing channel. This raises a security concern, because only the stable channel has security support, but compiling it myself would be a bigger problem, so multi-version tools like <a href="https://github.com/yyuu/pyenv">pyenv</a> are not an option.</p>
<p>I use <a href="https://wiki.debian.org/AptPreferences">Apt-Pinning</a> so that only Python 3.5 related packages are installed from testing. First add the testing channel to <code>/etc/apt/sources.list</code>:</p>
<div class="highlight"><pre><span></span><code>deb http://cloudfront.debian.net/debian testing main
deb-src http://cloudfront.debian.net/debian testing main
deb http://security.debian.org/ testing/updates main
deb-src http://security.debian.org/ testing/updates main
</code></pre></div>
<p>Then edit <code>/etc/apt/preferences</code> to make sure we do not accidentally install packages from testing, and give the Python 3.5 related packages a priority >= 990 so they can be installed automatically.</p>
<div class="highlight"><pre><span></span><code># Specify * rules first so later package-specific rules can override them
Package: *
Pin: release a=testing
Pin-Priority: -10

Package: python3.5* libpython3.5*
Pin: release a=testing
Pin-Priority: 990
</code></pre></div>
<p><code>sudo apt-cache policy <pkg-name></code> shows which version the current rules would install.</p>
<div class="highlight"><pre><span></span><code>$ sudo apt-get update
$ sudo apt-get install python3.5 python3.5-venv python3.5-dev
</code></pre></div>
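<p>This way only the Python 3.5 related packages come from testing.</p>
<h4 id="postgresql">PostgreSQL database</h4>
<p>The database is PostgreSQL 9.4, set up following my earlier post <a href="https://blog.liang2.tw/posts/2016/01/postgresql-install/"><em>Installing PostgreSQL 9 on Debian Jessie / OSX</em></a>.</p>
<h4 id="swap">Swap</h4>
<p>Only shortly after going live did I notice that EC2 has no swap space by default. I am on a tight budget, so the production server has only 512 MB of RAM, and I saw doc builds occasionally fill it completely, so adding swap is safer.</p>
<p>Since Amazon EBS SSD does not charge separately for I/O (I think?), I put a swap file on the main disk.</p>
<p>There are many swap tutorials; I followed the <a href="https://wiki.archlinux.org/index.php/swap">Arch Wiki</a> and placed the file at <code>/var/swap.1</code>, sized at twice the RAM, i.e. 1 GB.</p>
<p>First create the file and set its permissions to 600.</p>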
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>/bin/dd<span class="w"> </span><span class="k">if</span><span class="o">=</span>/dev/zero<span class="w"> </span><span class="nv">of</span><span class="o">=</span>/var/swap.1<span class="w"> </span><span class="nv">bs</span><span class="o">=</span>1M<span class="w"> </span><span class="nv">count</span><span class="o">=</span><span class="m">1024</span>
<span class="c1"># or faster with fallocate</span>
sudo<span class="w"> </span>fallocate<span class="w"> </span>-l<span class="w"> </span>1G<span class="w"> </span>/var/swap.1
</code></pre></div>
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>chmod<span class="w"> </span><span class="m">600</span><span class="w"> </span>/var/swap.1
</code></pre></div>
<p>Then format the file as swap and enable it:</p>
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>/sbin/mkswap<span class="w"> </span>/var/swap.1
sudo<span class="w"> </span>/sbin/swapon<span class="w"> </span>/var/swap.1
</code></pre></div>
<p>Edit fstab so the swap is set up on every boot:</p>
<div class="highlight"><pre><span></span><code># /etc/fstab
/var/swap.1 none swap defaults 0 0
</code></pre></div>
<p><code>free -h</code> or <code>cat /proc/meminfo</code> should now show about 1 GB of swap.</p>
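<p>If you would rather check this from a script, for example before starting a build, the sketch below reads the <code>SwapTotal</code> field from text in the <code>/proc/meminfo</code> format. The function name is hypothetical and the sample text is made up for illustration, not copied from the production server.</p>

```python
def swap_total_kib(meminfo_text):
    """Return the SwapTotal value in kB from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # Lines look like "SwapTotal:   1048572 kB"; the second field is the number.
            return int(line.split()[1])
    return None


if __name__ == "__main__":
    # On Linux, read the live values; elsewhere this file does not exist.
    with open("/proc/meminfo") as f:
        print(swap_total_kib(f.read()))
```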
<h4 id="git-repo-ssh-config">Git repo ssh config</h4>
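<p>Next is syncing and updating the code. The autobuild server's own source only needs to be pulled, but the cpython-tw source needs new translations committed regularly, so the deploy server needs write access to that git repo.</p>
<p>You should not use your personal SSH key. The deploy server should have dedicated deploy keys, and the cpython-tw deploy key has write access (so it can commit and push).</p>
<p>It turns out that using a different SSH key for each git repo is not complicated. In this case, first edit <code>~/.ssh/config</code> to add two new hosts that use different SSH keys:</p>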
<div class="highlight"><pre><span></span><code>Host github-pydoc_autobuild
HostName github.com
User git
IdentityFile /home/pydoc/.ssh/id_rsa.pydoc_autobuild
Host github-cpython_tw
HostName github.com
User git
IdentityFile /home/pydoc/.ssh/id_rsa.cpython_tw
</code></pre></div>
<p>Create the matching SSH key pairs:</p>
<div class="highlight"><pre><span></span><code>ssh-keygen<span class="w"> </span>-t<span class="w"> </span>rsa<span class="w"> </span>-f<span class="w"> </span>~/.ssh/id_rsa.pydoc_autobuild
ssh-keygen<span class="w"> </span>-t<span class="w"> </span>rsa<span class="w"> </span>-f<span class="w"> </span>~/.ssh/id_rsa.cpython_tw
</code></pre></div>
<p>Replace the host in each repo's remote URL (the pydoc_autobuild repo is shown here):</p>
<div class="highlight"><pre><span></span><code>git remote set-url origin git@github-pydoc_autobuild:python-doc-tw/pydoc_autobuild.git
</code></pre></div>
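<p>Now the two repos connect with their assigned SSH keys. GitHub shows when each key was last used, so checking that time confirms the setup (and a wrongly configured host would simply fail to connect).</p>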
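<p>The URL rewrite itself is mechanical: only the host part changes from <code>github.com</code> to the alias defined in <code>~/.ssh/config</code>. A small sketch, assuming SSH-style remotes of the form <code>git@github.com:owner/repo.git</code> (the helper name is hypothetical):</p>

```python
def use_host_alias(remote_url, alias):
    """Swap github.com for an SSH config host alias in a git@ remote URL."""
    prefix = "git@github.com:"
    if not remote_url.startswith(prefix):
        raise ValueError("expected an SSH remote starting with " + prefix)
    # Keep the owner/repo path, replace only the host.
    return "git@{}:{}".format(alias, remote_url[len(prefix):])


if __name__ == "__main__":
    print(use_host_alias(
        "git@github.com:python-doc-tw/pydoc_autobuild.git",
        "github-pydoc_autobuild",
    ))
```

<p>The result is the value to pass to <code>git remote set-url origin</code> in each repo.</p>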
<h4 id="tmpfilesd">tmpfiles.d</h4>
<p>The socket that nginx and uwsgi will use to communicate will live at <code>/run/django/xxxx.sock</code> <sup id="fnref:/run"><a class="footnote-ref" href="#fn:/run">6</a></sup>. Since it only needs non-root permissions, I add a <a href="https://www.freedesktop.org/software/systemd/man/tmpfiles.d.html">tmpfiles.d</a> configuration so the directory is created automatically at boot. Add the config file <code>/etc/tmpfiles.d/pydoc_autobuild.conf</code>:</p>
<div class="highlight"><pre><span></span><code>d /run/django 0755 pydoc www-data
</code></pre></div>
<h3 id="django-stack-nginx-uwsgi">Django Stack – nginx + uWSGI</h3>
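<p>For local development, Django is started with <code>python manage.py runserver</code>. In production, the built-in runserver cannot serve many users at once, so tools like nginx and uWSGI are needed.</p>
<p>I followed uWSGI's <a href="http://uwsgi-docs.readthedocs.org/en/latest/tutorials/Django_and_nginx.html"><em>Setting up Django and your web server with uWSGI and nginx</em></a> and the deployment chapter, <a href="https://github.com/uranusjr/django-tutorial-for-programmers/blob/master/25-deploy-to-ubuntu-server.md"><em>Day 27 - Deploy to Ubuntu server</em></a>, of TP's series <em>Django Tutorial for Programmers</em>.</p>
<p>The autobuild server has a separate settings file for production; switching only requires selecting <code>settings.production</code>. In the Django settings, I recommend making all paths absolute (including executables). Otherwise systemd needs many environment variables adjusted later, and since systemd does not inherit the user's PATH, not using absolute paths is tedious and error-prone.</p>
<h4 id="nginx">nginx configuration</h4>
<p>nginx accepts the incoming HTTP requests, and when it needs to talk to the Django server it connects to the UNIX socket opened by uWSGI.</p>
<p>Assume the uWSGI part works for now and set up nginx first. nginx serves static files directly without going through uWSGI, so once nginx is configured the pydoc documentation itself is online. Use this to test that the configuration is correct.</p>
<p>For this site, /static maps to the Django staticfiles; /3/ and /3.5/ map to the pydoc HTML build; all other paths go to Django. Links under /3.5/* are redirected to /3/*.</p>
<p>Putting these requirements together, write an nginx config at <code>/etc/nginx/sites-available/pydoc_autobuild.conf</code>:</p>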
<div class="highlight"><pre><span></span><code><span class="c1"># Upstream Django setting; the socket nginx connects to</span>
<span class="k">upstream</span><span class="w"> </span><span class="s">django</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">server</span><span class="w"> </span><span class="s">unix:///run/django/pydoc_autobuild.sock</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">server</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">listen</span><span class="w"> </span><span class="mi">80</span><span class="p">;</span>
<span class="w"> </span><span class="kn">listen</span><span class="w"> </span><span class="mi">443</span><span class="w"> </span><span class="s">default</span><span class="w"> </span><span class="s">ssl</span><span class="p">;</span>
<span class="w"> </span><span class="kn">server_name</span><span class="w"> </span><span class="s">docs.python.org.tw</span>
<span class="w"> </span><span class="mi">52</span><span class="s">.69.170.26</span>
<span class="w"> </span><span class="p">;</span>
<span class="w"> </span><span class="kn">charset</span><span class="w"> </span><span class="s">utf-8</span><span class="p">;</span>
<span class="w"> </span><span class="kn">client_max_body_size</span><span class="w"> </span><span class="s">10M</span><span class="p">;</span><span class="w"> </span><span class="c1"># max upload size</span>
<span class="w"> </span><span class="kn">keepalive_timeout</span><span class="w"> </span><span class="mi">15</span><span class="p">;</span>
<span class="w"> </span><span class="kn">location</span><span class="w"> </span><span class="s">/static</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">alias</span><span class="w"> </span><span class="s">/path/to/code/pydoc_autobuild/assets</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="kn">location</span><span class="w"> </span><span class="s">/3</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">alias</span><span class="w"> </span><span class="s">/path/to/code/cpython-tw/Doc/build/html</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="kn">location</span><span class="w"> </span><span class="p">~</span><span class="w"> </span><span class="sr">/3\.5/(.*)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">return</span><span class="w"> </span><span class="mi">302</span><span class="w"> </span><span class="s">/3/</span><span class="nv">$1</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="c1"># Finally, send all non-media requests to the Django server.</span>
<span class="w"> </span><span class="kn">location</span><span class="w"> </span><span class="s">/</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">uwsgi_pass</span><span class="w"> </span><span class="s">django</span><span class="p">;</span>
<span class="w"> </span><span class="kn">include</span><span class="w"> </span><span class="s">/etc/nginx/uwsgi_params</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
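<p>To see which <code>location</code> block handles a given path, the following Python sketch imitates nginx's selection for this particular config: regex locations are tried first in this simplified model (nginx checks them after remembering the longest prefix match, and a matching regex wins), then the longest matching prefix is used. It is a simplified model for reasoning about the rules above, not a reimplementation of nginx, and the labels are made up.</p>

```python
import re

# The single regex location from the config: /3.5/<rest> redirects to /3/<rest>.
REDIRECT_35 = re.compile(r"/3\.5/(.*)")

# Prefix locations and a label for what serves them.
PREFIX_LOCATIONS = {
    "/static": "static files",
    "/3": "pydoc html",
    "/": "django",
}


def route(path):
    """Return (handler, target) for a request path under the config above."""
    # The pattern is unanchored in the config, so search anywhere in the path.
    match = REDIRECT_35.search(path)
    if match:
        return ("redirect 302", "/3/" + match.group(1))
    # Otherwise the longest matching prefix location handles the request.
    best = max((p for p in PREFIX_LOCATIONS if path.startswith(p)), key=len)
    return (PREFIX_LOCATIONS[best], path)


if __name__ == "__main__":
    for p in ["/3.5/tutorial/index.html", "/3/tutorial/index.html",
              "/static/css/site.css", "/_build/update/"]:
        print(p, "->", route(p))
```

<p>One thing the model makes visible: <code>location /3</code> is a plain prefix, so a path such as <code>/30/x</code> would also land in the pydoc block. That is harmless here because no such URLs exist.</p>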
<p>Then symlink the file into <code>/etc/nginx/sites-enabled/</code> and reload the nginx configuration:</p>
<div class="highlight"><pre><span></span><code><span class="nb">cd</span><span class="w"> </span>/etc/nginx/sites-available/
sudo<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>pydoc_autobuild.conf<span class="w"> </span>../sites-enabled/
sudo<span class="w"> </span>systemctl<span class="w"> </span>reload<span class="w"> </span>nginx
</code></pre></div>
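<p>Once pydoc is confirmed online, we can focus on uWSGI.</p>
<h4 id="uwsgi">uWSGI configuration</h4>
<p>uWSGI also needs to be installed outside the VENV. I find pip simpler, although that means keeping track of uWSGI updates myself:</p>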
<div class="highlight"><pre><span></span><code>sudo python3.5 -m pip install uwsgi
</code></pre></div>
<p>Save the uWSGI settings as <code>pydoc_autobuild_uwsgi.ini</code> and, when testing, always run:</p>
<div class="highlight"><pre><span></span><code>sudo uwsgi --ini pydoc_autobuild_uwsgi.ini
</code></pre></div>
<p>This mimics how it actually runs, so that switching to systemd later does not surface a pile of permission problems. The config file:</p>
<div class="highlight"><pre><span></span><code>[uwsgi]
chdir = /path/to/code/pydoc_autobuild
# Django's wsgi file
module = pydoc_autobuild.wsgi:application
env = DJANGO_SETTINGS_MODULE=pydoc_autobuild.settings.production
# the virtualenv (full path)
home = /path/to/VENV
# process-related settings
# master
master = true
# maximum number of worker processes
processes = 4
# the socket (use the full path to be safe)
socket = /run/django/pydoc_autobuild.sock
# ... with appropriate permissions - may be needed
chmod-socket = 664
uid = pydoc
gid = www-data
# clear environment on exit
vacuum = true
</code></pre></div>
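<p>Permissions may take some time to sort out. nginx runs as www-data/www-data and the socket must be readable and writable by nginx, but my code lives under the pydoc user's home directory, so running as www-data may cause permission problems. I recommend setting both uid and gid.</p>
<p>While working on this, nginx's error log makes debugging easier:</p>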
<div class="highlight"><pre><span></span><code>sudo less +F /var/log/nginx/error.log
</code></pre></div>
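<p>Once it works, switch to uWSGI's Emperor mode: put the config file in a directory (called vassals). In Emperor mode, uWSGI automatically loads and runs every config file in the vassals directory.</p>
<p>Here the vassals path is <code>/etc/uwsgi/vassals/</code>. Since uid and gid are set in the config, they do not need to be given again at run time:</p>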
<div class="highlight"><pre><span></span><code>sudo uwsgi --emperor /etc/uwsgi/vassals
</code></pre></div>
<p>The Django views should now all work. Next, hand the job of starting uWSGI over to the system.</p>
<h3 id="systemd-services">Systemd services</h3>
<p>The autobuild server has two parts: the Django server and the Django-Q cluster. So there are two systemd services.</p>
<p>On Debian, system services live under <code>/etc/systemd/system/</code>, so create <code>uwsgi.service</code> and <code>qcluster.service</code> to manage uWSGI Emperor mode and the Django-Q cluster respectively.</p>
<p><code>uwsgi.service</code> follows the settings in the official uWSGI doc <a href="http://uwsgi-docs.readthedocs.org/en/latest/Systemd.html"><em>uWSGI and Systemd</em></a>:</p>
<div class="highlight"><pre><span></span><code><span class="k">[Unit]</span>
<span class="na">Description</span><span class="o">=</span><span class="s">uWSGI Emperor</span>
<span class="na">After</span><span class="o">=</span><span class="s">syslog.target</span>
<span class="k">[Service]</span>
<span class="na">ExecStart</span><span class="o">=</span><span class="s">/usr/local/bin/uwsgi --emperor /etc/uwsgi/vassals</span>
<span class="na">RuntimeDirectory</span><span class="o">=</span><span class="s">uwsgi</span>
<span class="na">Restart</span><span class="o">=</span><span class="s">always</span>
<span class="na">KillSignal</span><span class="o">=</span><span class="s">SIGQUIT</span>
<span class="na">Type</span><span class="o">=</span><span class="s">notify</span>
<span class="na">StandardError</span><span class="o">=</span><span class="s">syslog</span>
<span class="na">NotifyAccess</span><span class="o">=</span><span class="s">all</span>
<span class="k">[Install]</span>
<span class="na">WantedBy</span><span class="o">=</span><span class="s">multi-user.target</span>
</code></pre></div>
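<p><code>qcluster.service</code> is my own attempt to mimic what <code>python manage.py qcluster</code> does, so the environment variables must all be set (absolute paths would work too; I just think the long executable paths make the build log ugly xd)</p>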
<div class="highlight"><pre><span></span><code><span class="k">[Unit]</span>
<span class="na">Description</span><span class="o">=</span><span class="s">Django-Q Cluster for site pydoc_autobuild</span>
<span class="na">After</span><span class="o">=</span><span class="s">syslog.target</span>
<span class="na">Wants</span><span class="o">=</span><span class="s">uwsgi.service</span>
<span class="k">[Service]</span>
<span class="na">User</span><span class="o">=</span><span class="s">pydoc</span>
<span class="na">Group</span><span class="o">=</span><span class="s">www-data</span>
<span class="na">Environment</span><span class="o">=</span><span class="s">VIRTUAL_ENV=/path/to/VENV</span>
<span class="na">Environment</span><span class="o">=</span><span class="s">PATH=/path/to/VENV/bin:/usr/local/bin:/usr/bin:/bin</span>
<span class="na">Environment</span><span class="o">=</span><span class="s">DJANGO_SETTINGS_MODULE=pydoc_autobuild.settings.production</span>
<span class="na">WorkingDirectory</span><span class="o">=</span><span class="s">/path/to/code/pydoc_autobuild</span>
<span class="na">ExecStart</span><span class="o">=</span><span class="s">/path/to/VENV/bin/python manage.py qcluster</span>
<span class="na">Restart</span><span class="o">=</span><span class="s">always</span>
<span class="na">KillSignal</span><span class="o">=</span><span class="s">SIGQUIT</span>
<span class="na">Type</span><span class="o">=</span><span class="s">simple</span>
<span class="na">NotifyAccess</span><span class="o">=</span><span class="s">none</span>
<span class="na">StandardError</span><span class="o">=</span><span class="s">syslog</span>
<span class="k">[Install]</span>
<span class="na">WantedBy</span><span class="o">=</span><span class="s">multi-user.target</span>
</code></pre></div>
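<p>This unit file is probably not idiomatic systemd; I am still wondering whether it should be rewritten as a user service (but I do not know how).</p>
<p>Once added to systemd, management is simple. Enable the two services so they start at boot (use <code>systemctl start</code> to start them right away):</p>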
<div class="highlight"><pre><span></span><code>sudo systemctl enable uwsgi
sudo systemctl enable qcluster
</code></pre></div>
<p>Check their status:</p>
<div class="highlight"><pre><span></span><code>sudo systemctl status uwsgi
sudo systemctl status qcluster
</code></pre></div>
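<p>Viewing their logs is easy too, because their stderr is captured. A benefit of systemd is that it takes care of rotation and so on, and it has many log-viewing features.</p>
<p>For example, to view uWSGI's request log from the last hour and keep following it as new requests arrive:</p>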
<div class="highlight"><pre><span></span><code>sudo journalctl -xef -u uwsgi --since '1 hour ago'
</code></pre></div>
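<h2 id="_4">Summary</h2>
<p>This post introduced the <a href="https://github.com/python-doc-tw/python-doc-tw">Python documentation translation project</a>, the architecture of the online documentation autobuild server based on Django and Django-Q, and the deployment setup on Debian combining nginx, uWSGI, and systemd.</p>
<p>While researching I found few articles on this, only a handful like <a href="https://luxagraf.net/src/how-set-django-uwsgi-systemd-debian-8"><em>How to Set Up Django with Nginx, uWSGI & systemd on Debian/Ubuntu</em></a>, and assembling the rest yourself still takes time. I have also recorded all the pydoc server deployment settings here so that rebuilding it later will be easier.</p>
<p>As for the documentation translation itself, I will probably write another post about the project as a whole.</p>
<p>(If anyone actually read this from start to finish, please leave some feedback > <)</p>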
<div class="footnote">
<hr>
<ol>
<li id="fn:zh-Hant">
<p>A bit of trivia: is the language code (or locale identifier) for Traditional Chinese in Taiwan zh_TW, zh-Hant, zh-Hant-TW, zh-Hant_TW, zh_Hant, or zh_Hant_TW? This question could fill a post of its own.<br><br>According to the international standard <a href="http://www.ietf.org/rfc/bcp/bcp47.txt">BCP 47</a>, only <a href="http://www.iana.org/assignments/lang-tags/zh-Hant">zh-Hant</a> and <a href="http://www.iana.org/assignments/lang-tags/zh-Hant-TW">zh-Hant-TW</a> exist; for more on the standard and its definitions see <a href="https://www.w3.org/International/articles/bcp47/"><em>Understanding the New Language Tags</em>, W3C</a>.<br><br>The reality is odd, though. Going by OSX's <a href="https://developer.apple.com/library/ios/documentation/MacOSX/Conceptual/BPInternational/LanguageandLocaleIDs/LanguageandLocaleIDs.html"><em>Language and Locale IDs</em></a>, it should be zh_TW, zh-Hant, or zh-Hant_TW. On Debian, all supported locales are listed in <code>/usr/share/i18n/SUPPORTED</code>, which contains only zh_TW; but Debian uses only <code>language[_country][.charset]</code>, so Hant, which the standard defines as a script, never appears, although using an underscore in the locale differs from <a href="http://www.ietf.org/rfc/bcp/bcp47.txt">BCP 47</a>. Sphinx handles locales through <a href="http://babel.pocoo.org/">Babel</a>, which does not allow <code>-</code> in a locale, so only zh_Hant or zh_Hant_TW can be considered. More amusing still, locales are supposed to be case-insensitive, so capitalization does not matter XD <a class="footnote-backref" href="#fnref:zh-Hant" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:pydoc-url">
<p>In fact, on <a href="https://docs.python.org/">https://docs.python.org/</a>, <a href="https://docs.python.org/3/">/3/</a> and <a href="https://docs.python.org/3.5/">/3.5/</a> are separate builds; even at the same version number they are updated at different times. I was surprised by this. We do not need to be that elaborate, though; a redirect is enough. <a class="footnote-backref" href="#fnref:pydoc-url" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:build-link">
<p>During development I used GET throughout, i.e. a dedicated link as described in the text. But robots / crawlers turned out to hit these URLs, so I eventually switched to POST, storing <code>{{ pagename }}</code> in a data-* attribute, i.e. <code><a href="#" data-pagename="{{ pagename }}">...</a></code>, and then binding a click listener with jQuery. <a class="footnote-backref" href="#fnref:build-link" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:python3">
<p>The master branches of <a href="https://github.com/coleifer/huey">huey</a> and <a href="https://github.com/PolicyStat/jobtastic">jobtastic</a> have py3k commits, but they seem recent, so I will wait and see. <a class="footnote-backref" href="#fnref:python3" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:ec2-nano">
<p>A small rant: the t2.nano vCPU is really inconsistent. Sometimes a doc build finishes in a few minutes, sometimes it takes tens of minutes. One day it was extremely slow and a web crawler found the site at the same time, which trapped the task queue in an endless loop of timeout, restart, timeout... <a class="footnote-backref" href="#fnref:ec2-nano" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:/run">
<p>/var/run = /run. This path is a tmpfs, so it is emptied on every reboot and the directory must be recreated. <a class="footnote-backref" href="#fnref:/run" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
</ol>
</div>Installing PostgreSQL 9 on Debian Jessie / OSX2016-01-25T17:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-01-25:/posts/2016/01/postgresql-install/<p>I mostly use <a href="https://www.sqlite.org/">SQLite</a>, but <a href="http://www.postgresql.org/">PostgreSQL</a> has many useful features, and every time I need it I cannot remember how to install it. So I am keeping my setup notes here.</p>
<p>Using <a href="https://www.debian.org/releases/stable/">Debian Jessie</a> (Debian 8.3) and OSX …</p><p>I mostly use <a href="https://www.sqlite.org/">SQLite</a>, but <a href="http://www.postgresql.org/">PostgreSQL</a> has many useful features, and every time I need it I cannot remember how to install it. So I am keeping my setup notes here.</p>
<p>The examples use <a href="https://www.debian.org/releases/stable/">Debian Jessie</a> (Debian 8.3) and OSX <a href="http://brew.sh/">Homebrew</a>. On OSX you will probably not leave PostgreSQL running all the time, so the focus is on the Debian environment. PostgreSQL is currently at 9.5 while Debian stable ships 9.4; the basic setup should be the same.</p>
<div class="toc">
<ul>
<li><a href="#postgresql">Installing PostgreSQL</a><ul>
<li><a href="#osx">Installing on OSX</a></li>
<li><a href="#debian-jessie">Installing on Debian Jessie</a></li>
</ul>
</li>
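<li><a href="#database">Initializing a personal database</a><ul>
<li><a href="#postgresql_1">Creating a PostgreSQL account with the same name as your user</a></li>
<li><a href="#database_1">Creating a database with the same name as the account</a></li>
<li><a href="#psql">Connecting with psql as your user</a></li>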
</ul>
</li>
<li><a href="#psql_1">psql commands</a></li>
<li><a href="#database_2">Deleting users and databases</a></li>
<li><a href="#_1">Advanced topics</a><ul>
<li><a href="#psql_2">Creating user accounts and databases from psql</a></li>
<li><a href="#postgresql-logging">PostgreSQL Logging</a><ul>
<li><a href="#logging-postgresql">Letting PostgreSQL manage logging</a></li>
<li><a href="#logging-systemd">Logging through Systemd</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#reference">Reference</a></li>
</ul>
</div>
<h2 id="postgresql">Installing PostgreSQL</h2>
<h3 id="osx">Installing on OSX</h3>
<div class="highlight"><pre><span></span><code><span class="go">brew install postgresql</span>
</code></pre></div>
<p>Start the PostgreSQL server manually when you need it:</p>
<div class="highlight"><pre><span></span><code><span class="go">postgres -D /usr/local/var/postgres</span>
</code></pre></div>
<p>For the PostgreSQL configuration, refer to the Debian section.</p>
<h3 id="debian-jessie">Installing on Debian Jessie</h3>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>sudo<span class="w"> </span>apt-get<span class="w"> </span>install<span class="w"> </span>postgresql-9.4
</code></pre></div>
<p>System services are now managed by <a href="http://freedesktop.org/wiki/Software/systemd/">Systemd</a>, so the <code>systemctl</code> command can check whether PostgreSQL is running.</p>
<div class="highlight"><pre><span></span><code><span class="gp"># </span>systemctl<span class="w"> </span>status<span class="w"> </span>postgresql.service
<span class="go">● postgresql.service - PostgreSQL RDBMS</span>
<span class="go"> Loaded: loaded (/lib/systemd/system/postgresql.service; enabled)</span>
<span class="go"> Active: inactive (dead) since Mon 2016-01-25 17:26:08 CST; 4s ago</span>
<span class="go"> Process: 913 ExecStart=/bin/true (code=exited, status=0/SUCCESS)</span>
<span class="go"> Main PID: 913 (code=exited, status=0/SUCCESS)</span>
</code></pre></div>
<p>This service shows little runtime information, though. It is actually a dummy service that triggers what may be several PostgreSQL database clusters. By default there is only one cluster, <code>main</code>.</p>
<div class="highlight"><pre><span></span><code><span class="gp"># </span>systemctl<span class="w"> </span>status<span class="w"> </span>postgresql@9.4-main.service
<span class="go">● postgresql@9.4-main.service - PostgreSQL Cluster 9.4-main</span>
<span class="go"> Loaded: loaded (/lib/systemd/system/postgresql@.service; disabled)</span>
<span class="go"> Active: active (running) since Mon 2016-01-25 17:26:30 CST; 4min 7s ago</span>
<span class="go"> Process: 9641 ExecStop=/usr/bin/pg_ctlcluster -m fast %i stop (code=exited, status=0/SUCCESS)</span>
<span class="go"> Process: 9717 ExecStart=postgresql@%i %i start (code=exited, status=0/SUCCESS)</span>
<span class="go"> Main PID: 9723 (postgres)</span>
<span class="go"> CGroup: /system.slice/system-postgresql.slice/postgresql@9.4-main.service</span>
<span class="go"> ├─9723 /usr/lib/postgresql/9.4/bin/postgres -D /var/lib/postgresql/9.4/main -c config_file=/etc/postgr...</span>
<span class="go"> ├─9725 postgres: checkpointer process</span>
<span class="go"> ├─9726 postgres: writer process</span>
<span class="go"> ├─9727 postgres: wal writer process</span>
<span class="go"> ├─9728 postgres: autovacuum launcher process</span>
<span class="go"> └─9729 postgres: stats collector process</span>
</code></pre></div>
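<h2 id="database">Initializing a Personal Database</h2>
<p>On OSX, users who install PostgreSQL through homebrew get superuser privileges. That is fine for local development, and it makes setup such as creating databases simpler.</p>
<p>On Debian, the user with superuser privileges is <code>postgres</code>, so the default user (<code>vm</code> in this example) cannot connect.</p>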
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>psql
<span class="go">psql: FATAL: role "vm" does not exist</span>
</code></pre></div>
<p>Going through root to switch to the postgres identity (here via <code>sudo</code>) lets you connect to the database with <code>psql</code>, PostgreSQL's REPL shell. Use <code>\q</code> to quit psql.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>sudo<span class="w"> </span>-u<span class="w"> </span>postgres<span class="w"> </span>psql
<span class="go">[sudo] password for vm:</span>
<span class="go">psql (9.4.5)</span>
<span class="go">Type "help" for help.</span>
<span class="go">postgres=# \q</span>
<span class="gp">$</span>
</code></pre></div>
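<p>Connecting to the database as the postgres superuser is not very safe, though. It is a good habit to use a personal account from the start, so the next steps are:</p>
<ol>
<li>Create a PostgreSQL account with the same name as the system user</li>
<li>Create a database with the same name as the account</li>
</ol>
<h3 id="postgresql_1">Creating a PostgreSQL Account with the Same Name as the System User</h3>
<p>On Debian, <code>$USER</code> gives the account of the currently logged-in user. Its value stays the same on a <code>sudo</code> command line because the shell expands <code>$USER</code> before <code>sudo</code> switches identity. (A trick I picked up from the <a href="https://help.ubuntu.com/community/PostgreSQL">Ubuntu wiki</a>.)</p>
<p>If that worries you, just type the account name wherever <code>$USER</code> appears. First, a quick check:</p>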
<div class="highlight"><pre><span></span><code><span class="gp">vm@vm-debian:~$ echo $</span>USER
<span class="go">vm</span>
<span class="gp">vm@vm-debian:~$ </span>sudo<span class="w"> </span>-u<span class="w"> </span>postgres<span class="w"> </span><span class="nb">echo</span><span class="w"> </span><span class="nv">$USER</span>
<span class="go">vm</span>
</code></pre></div>
<p>Users are created with the <code>createuser</code> command. Since this is a regular user account, it does not get many privileges.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>sudo<span class="w"> </span>-u<span class="w"> </span>postgres<span class="w"> </span>createuser<span class="w"> </span>--interactive<span class="w"> </span><span class="nv">$USER</span>
<span class="go">Shall the new role be a superuser? (y/n) n</span>
<span class="go">Shall the new role be allowed to create databases? (y/n) n</span>
<span class="go">Shall the new role be allowed to create more new roles? (y/n) n</span>
</code></pre></div>
<p>Checking through <code>psql</code> now shows one more user.</p>
<div class="highlight"><pre><span></span><code><span class="c1">-- Run with command `sudo -u postgres psql`</span>
<span class="gp">postgres=#</span><span class="w"> </span><span class="kp">\du</span>
<span class="w"> </span><span class="n">List</span><span class="w"> </span><span class="k">of</span><span class="w"> </span><span class="n">roles</span>
<span class="w"> </span><span class="k">Role</span><span class="w"> </span><span class="k">name</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Attributes</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Member</span><span class="w"> </span><span class="k">of</span>
<span class="c1">-----------+------------------------------------------------+-----------</span>
<span class="w"> </span><span class="n">postgres</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Superuser</span><span class="p">,</span><span class="w"> </span><span class="k">Create</span><span class="w"> </span><span class="k">role</span><span class="p">,</span><span class="w"> </span><span class="k">Create</span><span class="w"> </span><span class="n">DB</span><span class="p">,</span><span class="w"> </span><span class="n">Replication</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">{}</span>
<span class="w"> </span><span class="n">vm</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="p">{}</span>
</code></pre></div>
<h3 id="database_1">Creating a Database with the Same Name as the Account</h3>
<p>Use the <code>createdb</code> command, and set the owner of the database named after the account to that account.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>sudo<span class="w"> </span>-u<span class="w"> </span>postgres<span class="w"> </span>createdb<span class="w"> </span>--owner<span class="o">=</span><span class="nv">$USER</span><span class="w"> </span><span class="nv">$USER</span>
</code></pre></div>
<p>Creating additional databases for this account works too, for example one named <code>vm_database</code>:</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>sudo<span class="w"> </span>-u<span class="w"> </span>postgres<span class="w"> </span>createdb<span class="w"> </span>--owner<span class="o">=</span><span class="nv">$USER</span><span class="w"> </span>vm_database
</code></pre></div>
<h3 id="psql">Connecting to psql with the User Account</h3>
<p>Running <code>psql</code> now works.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>psql
<span class="go">psql (9.4.5)</span>
<span class="go">Type "help" for help.</span>
<span class="go">vm=> \conninfo</span>
<span class="go">You are connected to database "vm" as user "vm" via socket in "/var/run/postgresql" at port "5432".</span>
</code></pre></div>
<p>The prompt changing from <code>=#</code> to <code>=></code> indicates that the connected user is not a superuser. The psql commands <code>\l</code> or <code>\l+</code> list all current databases:</p>
<div class="highlight"><pre><span></span><code><span class="gp">vm=></span><span class="w"> </span><span class="kp">\l</span>
<span class="go"> List of databases</span>
<span class="go"> Name | Owner | Encoding | Collate | Ctype | Access privileges</span>
<span class="go">-------------+----------+----------+-------------+-------------+-----------------------</span>
<span class="go"> postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |</span>
<span class="go"> template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres +</span>
<span class="go"> | | | | | postgres=CTc/postgres</span>
<span class="go"> template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres +</span>
<span class="go"> | | | | | postgres=CTc/postgres</span>
<span class="go"> vm | vm | UTF8 | en_US.UTF-8 | en_US.UTF-8 |</span>
<span class="go"> vm_database | vm | UTF8 | en_US.UTF-8 | en_US.UTF-8 |</span>
<span class="go">(5 rows)</span>
</code></pre></div>
<p>Connecting to the other database, <code>vm_database</code>, is just as easy:</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>psql<span class="w"> </span>-d<span class="w"> </span>vm_database
<span class="go">psql (9.4.5)</span>
<span class="go">Type "help" for help.</span>
<span class="go">vm_database=></span>
</code></pre></div>
<h2 id="psql_1">psql Commands</h2>
<p>psql has many commands, and <code>\?</code> shows the list. The full version is described in the <a href="http://www.postgresql.org/docs/9.4/static/app-psql.html#APP-PSQL-META-COMMANDS">official psql meta-commands documentation</a>. A few common ones are listed below:</p>
<div class="highlight"><pre><span></span><code>\l # list all databases
\d # list tables in current database
\du # list roles
\conninfo # show current SQL connection
\q # quit
help # print a hub message for all kinds of help
</code></pre></div>
<h2 id="database_2">Dropping Users and Databases</h2>
<p>Each has a corresponding command.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>dropuser<span class="w"> </span><usr>
<span class="gp">$ </span>dropdb<span class="w"> </span><db>
</code></pre></div>
<h2 id="_1">Advanced Topics</h2>
<h3 id="psql_2">Creating User Accounts and Databases through psql</h3>
<p>Ref: <a href="http://www.cyberciti.biz/faq/howto-add-postgresql-user-account/">http://www.cyberciti.biz/faq/howto-add-postgresql-user-account/</a></p>
<div class="highlight"><pre><span></span><code><span class="c1">-- Run with command `sudo -u postgres psql -d template1`</span>
<span class="gp">template1=#</span><span class="w"> </span><span class="k">CREATE</span><span class="w"> </span><span class="k">USER</span><span class="w"> </span><span class="o"><</span><span class="n">usr</span><span class="o">></span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="k">PASSWORD</span><span class="w"> </span><span class="s1">'<pwd>'</span><span class="p">;</span>
<span class="gp">template1=#</span><span class="w"> </span><span class="k">CREATE</span><span class="w"> </span><span class="k">DATABASE</span><span class="w"> </span><span class="o"><</span><span class="n">db</span><span class="o">></span><span class="p">;</span>
<span class="gp">template1=#</span><span class="w"> </span><span class="k">GRANT</span><span class="w"> </span><span class="k">ALL</span><span class="w"> </span><span class="k">PRIVILEGES</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="k">DATABASE</span><span class="w"> </span><span class="o"><</span><span class="n">db</span><span class="o">></span><span class="w"> </span><span class="k">TO</span><span class="w"> </span><span class="o"><</span><span class="n">usr</span><span class="o">></span><span class="p">;</span>
</code></pre></div>
<h3 id="postgresql-logging">PostgreSQL Logging</h3>
<p>The settings live in <code>/etc/postgresql/9.4/main/postgresql.conf</code>. Different ways of managing logs call for different <code>log_destination</code> values.</p>
<ol>
<li>Managed by PostgreSQL itself: <code>stderr</code>, …</li>
<li>Through <a href="http://freedesktop.org/wiki/Software/systemd/">Systemd</a>: <code>syslog</code></li>
</ol>
<p>I have not looked into this in depth, but that conf file has many adjustable settings. After changing the settings, restart the PostgreSQL cluster:</p>
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>systemctl<span class="w"> </span>restart<span class="w"> </span>postgresql.service
</code></pre></div>
<h4 id="logging-postgresql">Logging Managed by PostgreSQL Itself</h4>
<div class="highlight"><pre><span></span><code><span class="c1"># At /etc/postgresql/9.4/main/postgresql.conf</span>
<span class="na">log_destination</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">'stderr'</span><span class="w"> </span>
<span class="na">logging_collector</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">on</span><span class="w"> </span>
</code></pre></div>
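<p>Log files are written to <code>/var/lib/postgresql/9.4/main/pg_log/</code> by default.</p>
<h4 id="logging-systemd">Logging through Systemd</h4>
<p>I think one of systemd's strengths is that it centralizes log management. As long as a service follows its rules, logging can be managed the same way everywhere, which is quite convenient.</p>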
<div class="highlight"><pre><span></span><code><span class="c1"># At /etc/postgresql/9.4/main/postgresql.conf</span>
<span class="na">log_destination</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">'syslog'</span><span class="w"> </span>
<span class="na">logging_collector</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">off</span><span class="w"> </span><span class="c1"># setting this to on only reports that output is redirected to syslog</span>
</code></pre></div>
<p>After restarting the service, <code>systemctl status</code> shows the most recent log entries.</p>
<div class="highlight"><pre><span></span><code><span class="gp"># </span>systemctl<span class="w"> </span>status<span class="w"> </span>postgresql@9.4-main
<span class="go">● postgresql@9.4-main.service - PostgreSQL Cluster 9.4-main</span>
<span class="go"> Loaded: loaded (/lib/systemd/system/postgresql@.service; disabled)</span>
<span class="go"> Active: active (running) since Mon 2016-01-25 17:52:02 CST; 1min 13s ago</span>
<span class="go"> Process: 14632 ExecStop=/usr/bin/pg_ctlcluster -m fast %i stop (code=exited, status=0/SUCCESS)</span>
<span class="go"> Process: 14641 ExecStart=postgresql@%i %i start (code=exited, status=0/SUCCESS)</span>
<span class="go"> Main PID: 14648 (postgres)</span>
<span class="go"> CGroup: /system.slice/system-postgresql.slice/postgresql@9.4-main.service</span>
<span class="go"> ├─14648 /usr/lib/postgresql/9.4/bin/postgres -D /var/lib/postgresql/9.4/main -c config_file=/etc/postg...</span>
<span class="go"> ├─14650 postgres: checkpointer process</span>
<span class="go"> ├─14651 postgres: writer process</span>
<span class="go"> ├─14652 postgres: wal writer process</span>
<span class="go"> ├─14653 postgres: autovacuum launcher process</span>
<span class="go"> └─14654 postgres: stats collector process</span>
<span class="go">Jan 25 17:52:00 vm-debian postgres[14648]: [1-1] 2016-01-25 17:52:00 CST [14648-1] LOG: ending log output to stderr</span>
<span class="go">Jan 25 17:52:00 vm-debian postgres[14648]: [1-2] 2016-01-25 17:52:00 CST [14648-2] HINT: Future log output ...log".</span>
<span class="go">Jan 25 17:52:00 vm-debian postgres[14649]: [2-1] 2016-01-25 17:52:00 CST [14649-1] LOG: database system was...9 CST</span>
<span class="go">Jan 25 17:52:00 vm-debian postgres[14649]: [3-1] 2016-01-25 17:52:00 CST [14649-2] LOG: MultiXact member wr...abled</span>
<span class="go">Jan 25 17:52:00 vm-debian postgres[14648]: [2-1] 2016-01-25 17:52:00 CST [14648-3] LOG: database system is ...tions</span>
<span class="go">Jan 25 17:52:00 vm-debian postgres[14653]: [2-1] 2016-01-25 17:52:00 CST [14653-1] LOG: autovacuum launcher started</span>
<span class="go">Jan 25 17:52:00 vm-debian postgres[14658]: [3-1] 2016-01-25 17:52:00 CST [14658-1] [unknown]@[unknown] LOG: ...acket</span>
<span class="go">Jan 25 17:52:38 vm-debian postgres[14793]: [3-1] 2016-01-25 17:52:38 CST [14793-1] root@root FATAL: role "r...exist</span>
<span class="go">Hint: Some lines were ellipsized, use -l to show in full.</span>
</code></pre></div>
<p>Or use <code>journalctl</code>, systemd's standard way to view logs:</p>
<div class="highlight"><pre><span></span><code><span class="gp"># </span>journalctl<span class="w"> </span>-u<span class="w"> </span>postgresql@9.4-main<span class="w"> </span>
<span class="go">-- Logs begin at Mon 2016-01-25 16:46:25 CST, end at Mon 2016-01-25 19:22:07 CST. --</span>
<span class="go">Jan 25 17:47:06 vm-debian postgres[13699]: [1-1] 2016-01-25 17:47:06 CST [13699-1] LOG: redirecting log output to logging collector process</span>
<span class="go">Jan 25 17:47:06 vm-debian postgres[13699]: [1-2] 2016-01-25 17:47:06 CST [13699-2] HINT: Future log output will appear in directory "pg_log".</span>
<span class="go">Jan 25 17:47:06 vm-debian postgres[13699]: [2-1] 2016-01-25 17:47:06 CST [13699-3] LOG: ending log output to stderr</span>
<span class="go">Jan 25 17:47:06 vm-debian postgres[13699]: [2-2] 2016-01-25 17:47:06 CST [13699-4] HINT: Future log output will go to log destination "syslog".</span>
<span class="go">Jan 25 17:47:06 vm-debian postgres[13701]: [3-1] 2016-01-25 17:47:06 CST [13701-1] LOG: database system was shut down at 2016-01-25 17:47:05 CST</span>
<span class="go">Jan 25 17:47:06 vm-debian postgres[13701]: [4-1] 2016-01-25 17:47:06 CST [13701-2] LOG: MultiXact member wraparound protections are now enabled</span>
<span class="go">Jan 25 17:47:06 vm-debian postgres[13699]: [3-1] 2016-01-25 17:47:06 CST [13699-5] LOG: database system is ready to accept connections</span>
<span class="go">Jan 25 17:47:06 vm-debian postgres[13705]: [3-1] 2016-01-25 17:47:06 CST [13705-1] LOG: autovacuum launcher started</span>
<span class="go">Jan 25 17:47:07 vm-debian postgres[13710]: [4-1] 2016-01-25 17:47:07 CST [13710-1] [unknown]@[unknown] LOG: incomplete startup packet</span>
<span class="go">Jan 25 17:49:30 vm-debian postgres[14170]: [4-1] 2016-01-25 17:49:30 CST [14170-1] root@root FATAL: role "root" does not exist</span>
<span class="go">...</span>
</code></pre></div>
<h2 id="reference">Reference</h2>
<ul>
<li><a href="https://www.digitalocean.com/community/tutorials/how-to-install-and-use-postgresql-9-4-on-debian-8">How to Install and Use PostgreSQL 9.4 on Debian 8</a> by Digital Ocean</li>
<li><a href="https://wiki.archlinux.org/index.php/PostgreSQL">PostgreSQL</a> on Arch Wiki</li>
<li><a href="https://wiki.debian.org/PostgreSql">PostgreSQL</a> on Debian wiki</li>
<li><a href="https://help.ubuntu.com/community/PostgreSQL">PostgreSQL</a> on Ubuntu wiki</li>
</ul>Beginner's Coding Guide Appendix - Bioinfo Practices using Python2016-01-21T23:30:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-01-21:/posts/2016/01/lab-coding-appendix-bioinfo-python/<p>A walkthrough of practice problems created by the Rosalind Team.</p><p>Last Edited: Jan, 2016 (If anything here is wrong, leave a comment or let me know through any channel.)</p>
<p>We are going to walk through a series of practice problems created by the <a href="http://rosalind.info/problems/">Rosalind Team</a>.</p>
<p>Once you register an account at Rosalind, you can use their judging system to work through all problems. However, in that case you cannot arbitrarily skip the easy levels, and it sucks. So I’m not going to force you to use the system. Luckily, each problem comes with one set of example data and expected output, which we can use to check our answers.</p>
<p>Note: Their code assumes Python 2 but everything I mention here is Python 3.</p>
<div class="toc">
<ul>
<li><a href="#python-basics">Python Basics</a></li>
<li><a href="#bininfo-first-try">Bioinfo First Try</a><ul>
<li><a href="#q-dna-counting-dna-nucleotides">Q DNA: Counting DNA Nucleotides</a></li>
<li><a href="#q-revc-the-secondary-and-tertiary-structures-of-dna">Q REVC: The Secondary and Tertiary Structures of DNA</a></li>
<li><a href="#q-gc-computing-gc-content">Q: GC: Computing GC Content</a><ul>
<li><a href="#workthrough">Walkthrough</a></li>
</ul>
</li>
<li><a href="#q-next">Q: (next?)</a></li>
</ul>
</li>
</ul>
</div>
<blockquote>
<p><strong>Other posts in the Beginner's Coding Guide series:</strong></p>
<ul>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-intro/">Introduction</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-linux/">Chapter 1 – Linux</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-text-editing/">Chapter 2 – Text Editing (Markdown, Text Editor)</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-version-control/">Chapter 3 – Version Control (Git)</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-python/">Chapter 4 – Python</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-appendix-osx-env/">Appendix 1 – OSX Development Environment</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-appendix-bioinfo-python/">Appendix 2 – Python in Bioinformatics</a></li>
</ul>
<p>Alternatively, the <a href="/tag/labcoding.html">labcoding</a> tag also lists all of the posts.</p>
</blockquote>
<h2 id="python-basics">Python Basics</h2>
<p>Do their <a href="http://rosalind.info/problems/list-view/?location=python-village">Python Village</a> problem sets. If you run into a topic you don’t know, go read your Python reference.</p>
<p>They should be trivial.</p>
<h2 id="bininfo-first-try">Bioinfo First Try</h2>
<h3 id="q-dna-counting-dna-nucleotides">Q DNA: Counting DNA Nucleotides</h3>
<p>Link: <a href="http://rosalind.info/problems/dna/">http://rosalind.info/problems/dna/</a></p>
<ul>
<li>Hint: use <a href="https://docs.python.org/3/library/collections.html#collections.Counter">collections.Counter</a> provided by Python’s stdlib</li>
<li>More Hint: use <code>' '.join</code> and list comprehension to output the answer</li>
</ul>
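<p>The two hints can be combined roughly as follows. This is only a minimal sketch, not Rosalind's official solution: the function name <code>count_nucleotides</code> is made up for illustration, and a generator expression stands in for the list comprehension.</p>

```python
from collections import Counter


def count_nucleotides(dna):
    """Return the counts of A, C, G and T as a space-separated string."""
    counts = Counter(dna)
    # A Counter returns 0 for missing keys, so absent bases need no special case.
    return ' '.join(str(counts[base]) for base in 'ACGT')


print(count_nucleotides('AGCTTTTCATTCTGAC'))  # prints "3 4 2 7"
```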
<h3 id="q-revc-the-secondary-and-tertiary-structures-of-dna">Q REVC: The Secondary and Tertiary Structures of DNA</h3>
<p>Link: <a href="http://rosalind.info/problems/revc/">http://rosalind.info/problems/revc/</a></p>
<ul>
<li>Hint: <a href="https://docs.python.org/3/library/functions.html#reversed">reversed</a> for any sequence object and a dict for nucleotide code mapping</li>
<li>More Hint: done in a list comprehension</li>
</ul>
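<p>A possible shape for this one, again only a sketch: the names <code>COMPLEMENT</code> and <code>reverse_complement</code> are hypothetical, and the dict assumes the input contains only the four upper-case bases.</p>

```python
# Mapping from each nucleotide to its complement (upper-case DNA only).
COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}


def reverse_complement(dna):
    """Walk the sequence backwards and map every base to its complement."""
    return ''.join([COMPLEMENT[base] for base in reversed(dna)])


print(reverse_complement('AAAACCCGGT'))  # prints "ACCGGGTTTT"
```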
<h3 id="q-gc-computing-gc-content">Q: GC: Computing GC Content</h3>
<p>Link: <a href="http://rosalind.info/problems/gc/">http://rosalind.info/problems/gc/</a></p>
<p>This is the first complicated problem, where some abstraction should help you come up with the solution. Try writing some reusable code blocks, for example functions and class definitions.</p>
<p>Don’t worry about the computational complexity.</p>
<h4 id="workthrough">Walkthrough</h4>
<p><em>You should implement it yourself before looking at my solution. Also, I haven’t seen their official solution, so mine may differ a lot from theirs.</em></p>
<p>Intuitively, we need to implement a FASTA file parser. A FASTA file contains a series of sequence reads, each with a unique ID. From an object-oriented viewpoint, we create the classes <code>Read</code> for reads and <code>Fasta</code> for FASTA files.</p>
<p><code>Read</code> is easy to design and understand,</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Read</span><span class="p">:</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">id</span><span class="p">,</span> <span class="n">seq</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">id</span> <span class="o">=</span> <span class="nb">id</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">seq</span> <span class="o">=</span> <span class="n">seq</span>
</code></pre></div>
<p>Since we need to compute their GC content, add a method for <code>Read</code>.</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Read</span><span class="p">:</span>
    <span class="c1"># ... skipped</span>
    <span class="k">def</span> <span class="nf">gc</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="w">        </span><span class="sd">"""Compute the GC content (in %) of the read."""</span>
        <span class="c1"># put the logic here (think of problem Q DNA)</span>
        <span class="n">gc_percent</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">seq</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">gc_percent</span>
</code></pre></div>
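<p>One way the <code>f</code> placeholder could be filled in, as a minimal sketch. The standalone function name <code>gc_percent</code> and the choice to return <code>0.0</code> for an empty sequence are assumptions made for illustration, not part of the problem statement.</p>

```python
def gc_percent(seq):
    """Return the GC content of seq in percent (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    # Count every base that is either G or C, ignoring letter case.
    gc_count = sum(1 for base in seq.upper() if base in 'GC')
    return 100 * gc_count / len(seq)


print(gc_percent('AGCTATAG'))  # prints 37.5
```

<p>Inside <code>Read.gc</code> this would be called as <code>gc_percent(self.seq)</code>.</p>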
<p>Then we have to implement the FASTA parser, which reads all read entries and converts them into <code>Read</code> objects. In the real world we deal with <code>myfasta.fa</code>-like files, but here the input is a string.</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Fasta</span><span class="p">:</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">raw_str</span><span class="p">):</span>
<span class="w">        </span><span class="sd">"""Parse a FASTA formatted string."""</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">raw_str</span> <span class="o">=</span> <span class="n">raw_str</span>
        <span class="c1"># convert string into structured reads.</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">reads</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">parse</span><span class="p">())</span>
    <span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="w">        </span><span class="sd">"""Parse the string and yield each read as a Read object."""</span>
        <span class="c1"># though we have no idea right now, the code structure</span>
        <span class="c1"># should be something like the following.</span>
        <span class="n">raw_lines</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">raw_str</span><span class="o">.</span><span class="n">splitlines</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">raw_lines</span><span class="p">:</span>
            <span class="k">yield</span> <span class="n">Read</span><span class="p">(</span><span class="o">...</span><span class="p">)</span>
</code></pre></div>
<p>Here I use <code>yield Read(...)</code>, which may be unfamiliar to Python beginners. It turns the <code>parse(self)</code> function into a generator. A generator lets you focus on the incoming data: once a piece of data is parsed and converted, the result is immediately handed out by <code>yield</code>, and we don’t have to care about how all the results get collected. In our case, we collect all the results into a list with <code>list(...)</code>.</p>
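<p>A toy generator may make this clearer. The function <code>squares</code> is a made-up example unrelated to FASTA parsing; it only shows that values are produced one at a time and that <code>list(...)</code> collects whatever is left.</p>

```python
def squares(numbers):
    """Yield the square of each number, one value at a time."""
    for n in numbers:
        yield n * n


gen = squares([1, 2, 3])
first = next(gen)  # runs the function body up to the first yield
rest = list(gen)   # collects the values that have not been consumed yet

print(first, rest)  # prints "1 [4, 9]"
```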
<p>So how should we read a FASTA file? A simple rule in this case is that every read consists of two consecutive rows. Also, the first row is always the ID of the first read.</p>
<p>All we need is to read two lines at a time. Here <a href="https://docs.python.org/3/library/functions.html#zip">a Pythonic idiom</a> comes in. The following code reads the lines in non-overlapping pairs:</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">first_line</span><span class="p">,</span> <span class="n">second_line</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="p">[</span><span class="nb">iter</span><span class="p">(</span><span class="n">raw_lines</span><span class="p">)]</span><span class="o">*</span><span class="mi">2</span><span class="p">):</span>
    <span class="k">yield</span> <span class="n">Read</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="n">first_line</span><span class="p">,</span> <span class="n">seq</span><span class="o">=</span><span class="n">second_line</span><span class="p">)</span>
</code></pre></div>
<p>With the <code>zip(*[iter(s)]*n)</code> magic, we are very close to implementing a full parser. You can find a lot of <a href="http://stackoverflow.com/a/2233247">explanations</a> for this magic.</p>
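<p>The trick works because the list holds the <em>same</em> iterator object <code>n</code> times, so <code>zip</code> pulls consecutive items from one shared stream. A small sketch with made-up sample lines:</p>

```python
raw_lines = ['>read_1', 'ACGT', '>read_2', 'GGCC']

# Spelled out: both arguments of zip are the very same iterator, so each
# tuple takes two consecutive items from it.
it = iter(raw_lines)
pairs = list(zip(it, it))

# The compact idiom builds the same thing.
pairs_idiom = list(zip(*[iter(raw_lines)] * 2))

# zip stops at the shortest input, so a dangling last item is silently dropped.
odd = list(zip(*[iter([1, 2, 3])] * 2))

print(pairs)  # [('>read_1', 'ACGT'), ('>read_2', 'GGCC')]
print(odd)    # [(1, 2)]
```

<p>The silent drop of an unpaired trailing line is worth remembering when the input may be malformed.</p>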
<p>The read ID line starts with a <code>></code> sign, so we can strip it with something like <code>first_line[1:]</code>, or <code>first_line[len('>'):]</code> to be more explicit.</p>
<p>Then sorting the GC% of reads in a FASTA file is easy.</p>
<div class="highlight"><pre><span></span><code><span class="n">fasta</span> <span class="o">=</span> <span class="n">Fasta</span><span class="p">(</span><span class="s1">'...'</span><span class="p">)</span>
<span class="n">sorted_reads</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">fasta</span><span class="o">.</span><span class="n">reads</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">r</span><span class="p">:</span> <span class="n">r</span><span class="o">.</span><span class="n">gc</span><span class="p">())</span> <span class="c1"># note 1</span>
<span class="n">top_gc_read</span> <span class="o">=</span> <span class="n">sorted_reads</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># note 2</span>
<span class="nb">print</span><span class="p">(</span>
<span class="s1">'></span><span class="si">{0:s}</span><span class="se">\n</span><span class="s1">'</span>
<span class="s1">'</span><span class="si">{1:.6f}</span><span class="s1">'</span> <span class="c1"># note 3, 4</span>
<span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">top_gc_read</span><span class="o">.</span><span class="n">id</span><span class="p">,</span> <span class="n">top_gc_read</span><span class="o">.</span><span class="n">gc</span><span class="p">())</span>
<span class="p">)</span>
</code></pre></div>
<p>Notes on the code above:</p>
<ol>
<li><code>sorted(list, key=key_func)</code> sorts the list based on the return value of key_func applied to each element.</li>
<li>Alternatively, <code>top_gc_read = sorted(..., reverse=True)[0]</code>.</li>
<li>Two string literals with no operator in between are joined automatically. In this case the result is exactly <code>>{0:s}\n{1:.6f}</code>. This is useful for tidying up a very long string.</li>
<li><code>'...'.format()</code> fills the string with the given values. See the <a href="https://docs.python.org/3/library/string.html#formatspec">doc</a>.</li>
</ol>
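<p>These notes can be tried out on plain tuples, without the <code>Read</code> class. The sample data below is made up, and <code>max(..., key=...)</code> is an extra alternative not mentioned above.</p>

```python
# Hypothetical (id, GC%) pairs standing in for Read objects.
reads = [('read_1', 50.0), ('read_2', 62.5), ('read_3', 37.5)]

by_gc = sorted(reads, key=lambda r: r[1])                   # ascending GC%
top_a = by_gc[-1]                                           # last one is the highest
top_b = sorted(reads, key=lambda r: r[1], reverse=True)[0]  # note: reverse, not reversed
top_c = max(reads, key=lambda r: r[1])                      # no sorting needed

# Adjacent string literals are joined before .format() is applied.
message = (
    '{0:s}\n'
    '{1:.6f}'
    .format(top_c[0], top_c[1])
)
print(message)
```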
<p>In real cases a sequence in a FASTA file can span multiple lines, and the file we parse may well be broken. How could we modify this parser to handle these situations?</p>
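<p>One possible direction, as a rough sketch rather than a complete answer: collect sequence lines until the next header shows up. The function name <code>parse_fasta</code>, the use of plain tuples instead of <code>Read</code>, and the decision to raise <code>ValueError</code> on data before the first header are all assumptions made for illustration.</p>

```python
def parse_fasta(raw_str):
    """Yield (id, seq) tuples; a sequence may span several lines."""
    read_id = None
    chunks = []
    for line in raw_str.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith('>'):
            # A new header closes the previous record, if there is one.
            if read_id is not None:
                yield read_id, ''.join(chunks)
            read_id = line[1:]
            chunks = []
        elif read_id is None:
            raise ValueError('sequence data found before the first ">" header')
        else:
            chunks.append(line)
    # Emit the last record once the input runs out.
    if read_id is not None:
        yield read_id, ''.join(chunks)


print(list(parse_fasta('>r1\nACGT\nGG\n>r2\nTT\n')))
# [('r1', 'ACGTGG'), ('r2', 'TT')]
```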
<h3 id="q-next">Q: (next?)</h3>
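<p>I’m super tired now so I’ll leave the rest for you. Try the problems whose correct ratio falls in the yellow range.</p>Beginner's Coding Guide Appendix - OSX Development Environment2016-01-21T23:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-01-21:/posts/2016/01/lab-coding-appendix-osx-env/<p>My own subjective development environment setup on OSX</p><p>Last Edited: Jan, 2016 (If anything here is wrong, leave a comment or let me know through any channel.)</p>
<p>The settings below are quite subjective, and opinions will differ. In any case, here is my environment.</p>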
<div class="toc">
<ul>
<li><a href="#terminal">Terminal</a></li>
<li><a href="#homebrew-git-and-python">Homebrew, Git and Python</a></li>
<li><a href="#text-editors">Text Editors</a></li>
<li><a href="#terminal-multiplexers">Terminal Multiplexers</a></li>
<li><a href="#git-gui">Git GUI</a></li>
<li><a href="#documentation-searcher">Documentation Searcher</a></li>
<li><a href="#misc">Misc.</a></li>
</ul>
</div>
<blockquote>
<p><strong>Other posts in the Beginner's Coding Guide series:</strong></p>
<ul>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-intro/">Introduction</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-linux/">Chapter 1 – Linux</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-text-editing/">Chapter 2 – Text Editing (Markdown, Text Editor)</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-version-control/">Chapter 3 – Version Control (Git)</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-python/">Chapter 4 – Python</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-appendix-osx-env/">Appendix 1 – OSX Development Environment</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-appendix-bioinfo-python/">Appendix 2 – Python in Bioinformatics</a></li>
</ul>
<p>Alternatively, the <a href="/tag/labcoding.html">labcoding</a> tag also lists all of the posts.</p>
</blockquote>
<h2 id="terminal">Terminal</h2>
<p>OSX ships with a built-in <code>Terminal.app</code> that can be used just like a terminal on Linux. It works fine in practice, but many people install <a href="http://iterm2.com/">iTerm2</a> for color tweaks and more customization.</p>
<h2 id="homebrew-git-and-python">Homebrew, Git and Python</h2>
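<p>OSX has no official package manager, so the community developed one called Homebrew. You can follow <a href="http://djangogirlstaipei.herokuapp.com/tutorials/installation/">this tutorial</a> to install Homebrew.</p>
<p>Once it is installed, the following commands show how to use it:</p>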
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>brew<span class="w"> </span>--help
$<span class="w"> </span>man<span class="w"> </span>brew<span class="w"> </span><span class="c1"># for full documentation</span>
</code></pre></div>
<p>Although OSX ships with git and python, homebrew can install more standard (newer) versions:</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>brew<span class="w"> </span>install<span class="w"> </span>git<span class="w"> </span>python3
</code></pre></div>
<p>If homebrew runs into problems, <code>brew doctor</code> can diagnose them. Searching Google for the error message usually turns up a solution.</p>
<h2 id="text-editors">Text Editors</h2>
<p>The editor I use most is Vim. OSX ships with it, but it can also be installed through homebrew.</p>
<p>Besides the console-based Vim, OSX also has MacVim, which is similar to gVim. It can be installed with homebrew as well.</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>brew<span class="w"> </span>info<span class="w"> </span>macvim<span class="w"> </span><span class="c1"># see which options can be adjusted when installing MacVim</span>
$<span class="w"> </span>brew<span class="w"> </span>install<span class="w"> </span>macvim<span class="w"> </span>--override-system-vim<span class="w"> </span>--custom-icons
$<span class="w"> </span>brew<span class="w"> </span>linkapps<span class="w"> </span>macvim
</code></pre></div>
<p>When using MacVim, besides <code>vim</code> you can also call <code>mvim</code> to open MacVim.</p>
<h2 id="terminal-multiplexers">Terminal Multiplexers</h2>
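<p>You may have heard of screen or tmux. The former is built into OSX, but the version is very old and has trouble displaying colors, so a newer one can be installed through homebrew. Because screen duplicates what the system provides, however, it is not in homebrew's default repo, and the extra repo has to be added first:</p>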
<div class="highlight"><pre><span></span><code>brew<span class="w"> </span>tap<span class="w"> </span>homebrew/dupes
brew<span class="w"> </span>install<span class="w"> </span>screen<span class="w"> </span>tmux
</code></pre></div>
<h2 id="git-gui">Git GUI</h2>
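<p>When first learning Git, you may be unfamiliar with the commands and often lose track of where you are in the git log. A graphical tool makes this easier to understand. Git comes with a built-in one, gitk, but it is rather bare-bones.</p>
<p>On OSX, consider using <a href="http://www.sourcetreeapp.com/">SourceTree</a>.</p>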
<h2 id="documentation-searcher">Documentation Searcher</h2>
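<p>Constantly looking things up on the official Python site gets tedious, and once you also pick up HTML, CSS, other languages, or assorted Python packages, every lookup takes time. That is why someone built an offline documentation browser called <a href="http://kapeli.com/dash">Dash</a>. It is paid software with a free version, which apparently keeps showing reminder prompts.</p>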
<h2 id="misc">Misc.</h2>
<ul>
<li><a href="http://www.alfredapp.com/">Alfred App</a>: an extended Spotlight. It finds applications quickly and integrates with Dash to make documentation lookups easier.</li>
<li><a href="http://macdown.uranusjr.com/">Macdown</a>: a markdown editor for OS X.</li>
</ul>Coding 初學指南-Python2016-01-21T22:50:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-01-21:/posts/2016/01/lab-coding-python/<p>選擇 Python 作為第一個深入學習的語言有很多好處。他的語法跟英文相似、用互動式的方式來操作,方便以邊試邊學、內建的標準函式庫功能豐富、第三方套件,幾乎能用 Python 完成各種事情。</p><p>Last Edited: Jun, 2017 (如果內容有誤,你可以留言,或用任何管道告訴我)</p>
<blockquote>
<p>Python 是一種物件導向、直譯式的電腦程式語言,具有近二十年的發展歷史。它包含了一組功能完備的標準庫,能夠輕鬆完成很多常見的任務。</p>
<p>(From <a href="https://zh.wikipedia.org/wiki/Python">Wikipedia</a>)</p>
</blockquote>
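<p>Choosing Python as the first language to learn in depth has many advantages. Its syntax resembles English, and where other languages lean on <code>;{}()</code> to delimit sections of code, Python relies mainly on whitespace and indentation.</p>
<p>Python can be used interactively (read–eval–print loop, REPL), and developing by trying things as you go suits beginners well.</p>
<p>The built-in standard library is rich: networking, text processing, file handling, and even GUIs can all be done with it. On top of that, there are many third-party packages, and they are easy to install on Linux, so Python can handle almost any task.</p>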
<div class="toc">
<ul>
<li><a href="#_1">聽說系列</a><ul>
<li><a href="#python">聽說 Python 跑很慢,是不是不能用來計算/分析/大檔案?</a></li>
<li><a href="#python-2-python-3">Python 2 還是 Python 3,聽我朋友說…比較好?</a></li>
</ul>
</li>
<li><a href="#_2">相關資源</a><ul>
<li><a href="#introducing-python-python">Introducing Python(精通 Python)</a></li>
<li><a href="#python_1">Python 官網</a></li>
<li><a href="#python_2">Python 程式設計「超入門」</a></li>
<li><a href="#learning-python">Learning Python</a></li>
<li><a href="#python-cookbookpython">Python Cookbook(Python 的錦囊妙計)</a></li>
<li><a href="#fluent-python-python">Fluent Python(流暢的 Python)</a></li>
<li><a href="#moocs">MOOCs</a></li>
</ul>
</li>
<li><a href="#_3">學習目標</a></li>
</ul>
</div>
<blockquote>
<p><strong>其他 Coding 初學指南系列文章:</strong></p>
<ul>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-intro/">Introduction</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-linux/">Chapter 1 – Linux</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-text-editing/">Chapter 2 – Text Editing (Markdown, Text Editor)</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-version-control/">Chapter 3 – Version Control (Git)</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-python/">Chapter 4 – Python</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-appendix-osx-env/">Appendix 1 – OSX Development Environment</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-appendix-bioinfo-python/">Appendix 2 – Python in Bioinformatics</a></li>
</ul>
<p>或者,用 <a href="/tag/labcoding.html">labcoding</a> 這個 tag 也可以找到所有的文章。</p>
</blockquote>
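<h2 id="_1">The "I heard that..." series</h2>
<p>(Easier to follow once you have some Python experience.)</p>
<h3 id="python">I heard Python is slow. Does that rule it out for computation, analysis, or large files?</h3>
<p>Python does run less efficiently than compiled languages (e.g. C/C++, Java), but that is probably not the main reason your program is slow, and it does not mean Python cannot handle compute-heavy work.</p>
<p>When a program runs slower than expected, work through these questions:</p>
<ol>
<li>Which lines are actually slow?</li>
<li>Is this the best algorithm?</li>
<li>Is this the most efficient way to write it in Python?</li>
</ol>
<p>If things still have not improved after the last step, start rewriting those parts in another language such as C. Python integrates easily with C. Moreover, the common cases for C speed-ups are already covered by Python packages such as NumPy, so most of the time heavy computation can be done without leaving Python.</p>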
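<p>For the first question, finding which lines are slow, one option is the profiler in the standard library. The sketch below is only an illustration: <code>slow_demo.py</code> and <code>slow_sum</code> are made-up names, and it assumes <code>python3</code> is on the PATH.</p>

```shell
# Write a tiny hypothetical script with one deliberately slow function.
workdir="$(mktemp -d)"
cat > "$workdir/slow_demo.py" <<'EOF'
def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

print(slow_sum(200000))
EOF

# Run it under cProfile, sorted by cumulative time, to see where time goes.
profile_report="$(python3 -m cProfile -s cumtime "$workdir/slow_demo.py")"
printf '%s\n' "$profile_report" | head -n 15
```

<p>The report lists each function with its call count and time, which is usually enough to decide where the second and third questions are worth asking.</p>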
<blockquote>
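<p>During my internship I often ran into problems that needed optimizing. With Python I could easily (within a day) spread the work over 64 cores on 4 machines. The approach may not have been the most efficient, but the alternative was spending days rewriting the Python in C/C++, implementing a more refined and efficient (and multithreaded) algorithm, and carefully handling every possible corner case. With parallelization, a computation that originally took three or four days was done in 2 hours.</p>
<p>More importantly, the experiment only had to run twice.</p>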
</blockquote>
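<p>For an engineer, development time is more precious than compute time. In a lab especially, what matters most is whether the method works at all, and a slow program can be dealt with in many ways, parallelization for one. The point is to solve the problem; using somewhat more resources is not a big deal.</p>
<p>If you ask me whether Python or Matlab is faster, here is a serious <a href="http://www.pyzo.org/python_vs_matlab.html">Python vs Matlab</a> comparison. If you start with Python and it turns out slow, there are <a href="http://wiki.scipy.org/PerformancePython">many paths you can take</a>, but with Matlab? Meh.</p>
<p>So is Python fast? On its own it has limits, but it has plenty of helpful companions. O’Reilly's <a href="http://shop.oreilly.com/product/0636920028963.do">High Performance Python</a> is worth a read.</p>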
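<h3 id="python-2-python-3">Python 2 or Python 3? My friend says ... is better</h3>
<p>With every day that passes I can say with more confidence: <strong>learn Python 3</strong>. Those still on Python 2 mostly use 2.7, and porting 3.3+ code back to 2.7 is not hard either.</p>
<p>EDIT 2017-06: Official support for Python 2.7 is confirmed to <a href="https://pythonclock.org/">end in 2020</a>. That does not mean Python 2.7 will vanish that year, since many companies worldwide will keep maintaining their internal Python 2.x code, but new projects default to 3.5+. Books on the market now all target Python 3.x as well, so the old barrier to learning it in Chinese is gone.</p>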
<h2 id="_2">相關資源</h2>
<p>Together with the previous chapters: if you want to set up a Python development environment on your own computer, see the <a href="http://djangogirlstaipei.herokuapp.com/tutorials/">Django Girls Taipei Tutorial</a>. In addition, the <a href="http://wiki.python.org.tw/Python/%E7%AC%AC%E4%B8%80%E6%AC%A1%E7%94%A8%E5%B0%B1%E4%B8%8A%E6%89%8B">Python Taiwan Wiki</a> has a more complete list of Python learning resources.</p>
<p>O’Reilly(歐萊禮)的書,<a href="https://shop.oreilly.com/">官方線上商店</a>常有 50% 折扣,PDF/ePub/Mobi 格式都有, 買一次就能輕鬆在電腦、Kindle、eReader 上閱讀,能接受英文的話,十分推薦跟官方購買。中文版就以<a href="https://www.tenlong.com.tw/">天瓏書局</a>為主。它也有賣英文紙本,逛實體店很舒服。</p>
<p>EDIT 2017-06: 自本文撰寫兩年以來,市面上 Python 3.x 中文書已經非常充足,在天瓏實體店甚至有一整個專櫃提供不同難易度、各種應用的專書。底下列出的書只是我個人的推薦與偏好,建議有空到書店親自翻一翻更能找到自己喜歡的學習方式。</p>
<p>除了書籍之外,現在越來越多以影片、互動形式的教學,像 Jessica McKellar 錄製的 <a href="http://shop.oreilly.com/product/110000448.do">Introduction to Python</a> 教學影片即非常受歡迎。因為我並沒有親自玩過這些新課程,它們並沒有列於此,但都歡迎讀者嘗試。</p>
<h3 id="introducing-python-python">Introducing Python(精通 Python)</h3>
<p>O’Reilly Python 系列的書都寫得很好。這本是比較新出的,好處是它針對初學者,比較薄,能在短時間看完,文字很流暢。想要快速掌握基礎的語法的話,建議閱讀 Chp1 到 Chp7,以及 Chp8 File I/O 部份。</p>
<p>“Introducing Python”, Bill Lubanovic. <em>O’Reilly</em>, 2014.11</p>
<ul>
<li><a href="http://shop.oreilly.com/product/0636920028659.do">英文書</a></li>
<li><a href="http://www.tenlong.com.tw/items/9863477311?item_id=1007464">中文實體書</a></li>
</ul>
<h3 id="python_1">Python 官網</h3>
<p>Python 的官網除了查語言特性之外,還能用來學習怎麼使用 stdlib。Python 標準函式庫功能包山包海,在你想要做什麼之前,都應該到官網查看看是不是內建 module 就已經提供功能了。除外,還有一個簡潔的 tutorial,供初學者參考,適合有學過其他語言的人。我認為這份寫得非常好,苦於沒有中文,據以前經驗不太容易推廣,但值得一讀。</p>
<p>“Python Tutorial”, Official Python Documentation, Python Devs.</p>
<ul>
<li><a href="https://docs.python.org/3/">連結</a></li>
<li><a href="http://www.pythondoc.com/pythontutorial3/index.html">簡中翻譯</a></li>
<li><a href="https://docs.python.org.tw/3/tutorial/index.html">繁中翻譯</a>(進行中)</li>
</ul>
<h3 id="python_2">Python 程式設計「超入門」</h3>
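<p>If you have no programming background at all, for example you do not know what a variable is, how to write a program that controls what the computer does, or what the command line is, this introductory book should suit you well. It uses diagrams to explain loops, if-else conditionals, and object-oriented concepts. The other "introductory" books listed here assume you roughly know these concepts already. The last two chapters may suddenly get more complex; at that point, turn to the other introductory books and you should be able to follow them.</p>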
<p>《Python 程式設計「超入門」》,鎌田正浩 著、陳禹豪、林子政 譯。旗標 2016.11</p>
<ul>
<li><a href="http://amzn.asia/dQgghO8">日文書</a> (source: Amazon)</li>
<li><a href="https://www.tenlong.com.tw/products/9789863123798">中文實體書</a></li>
</ul>
<h3 id="learning-python">Learning Python</h3>
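<p>The title suggests an introductory Python book, but at 1,600 pages it is hard to recommend to beginners. It covers nearly every feature of the language in one volume and addresses both Python 2 and 3. When you want to understand, say, the method resolution order (MRO) or the difference between unbound and bound methods, its level of detail will not disappoint; the only worry is finding time to read it.</p>
<p>The one I originally read was the 3rd edition in Chinese (now out of print), which did not yet cover Python 3.</p>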
<p>“Learning Python” 5ed, Mark Lutz. <em>O’Reilly</em>, 2013.06</p>
<ul>
<li><a href="http://shop.oreilly.com/product/0636920028154.do">英文書</a></li>
</ul>
<h3 id="python-cookbookpython">Python Cookbook(Python 的錦囊妙計)</h3>
<p>這本不是入門書但很適合深入了解 Python,並讓自己的程式碼寫得更 Pythonic。裡面介紹了很多寫法慣例 idioms,同時也有中文版。非常值得在未來比較懂 Python 時買來看。</p>
<p>作者之一 David Beazley 是 PyCon TW 2013 的 Keynote。他平常就是專門教 Python 的講師,他在 PyCon 講過的「所有 talk 與 tutorial」,如 <a href="http://www.dabeaz.com/coroutines/">concurrency</a>, <a href="http://www.dabeaz.com/modulepackage/index.html">packaging</a>, <a href="https://www.youtube.com/watch?v=MCs5OvhV9S4">async io</a> 等等都值得一看。</p>
<p>“Python Cookbook” 3ed, David Beazley and Brian K. Jones. <em>O’Reilly</em>, 2013.05</p>
<ul>
<li><a href="http://shop.oreilly.com/product/0636920027072.do">英文書</a></li>
<li><a href="http://www.tenlong.com.tw/items/9863470686">中文實體書</a></li>
</ul>
<h3 id="fluent-python-python">Fluent Python(流暢的 Python)</h3>
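<p>Think of it as a more detailed, expanded “Python Cookbook”; the book in fact quotes David often. It covers language features that introductory material does not go deep on, such as MRO, mixins, decorators, closures, and metaprogramming.</p>
<p>The Further Reading and Soapbox sections that close each chapter draw on a wide range of sources. Besides more detailed references, they cover the history and discussions behind Python's design decisions, its evolution, and comparisons with other languages. It makes a very good reference on the way to becoming a Python core developer.</p>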
<p>“Fluent Python”, Luciano Ramalho. <em>O’Reilly</em>, 2015.07</p>
<ul>
<li><a href="http://shop.oreilly.com/product/0636920032519.do">英文書</a></li>
<li><a href="http://www.tenlong.com.tw/items/986347911X">中文實體書</a></li>
</ul>
<h3 id="moocs">MOOCs</h3>
<p>關於 MOOCs 我有看過 Codecademy Python Track 以及 Coursera “An Introduction to Interactive Programming in Python” 這兩門課。我覺得最大的缺點就是講 Python 2.7,Python 3.x 的好用功能與差異都沒提;再來講課的 code 範例並不是使用 idiomatic Python syntax,在初學就沒養成好習慣與慣用語法有點可惜。</p>
<ul>
<li>Codecademy Python Track <a href="http://www.codecademy.com/en/tracks/python">http://www.codecademy.com/en/tracks/python</a></li>
<li>Coursera: An Introduction to Interactive Programming in Python <a href="https://www.coursera.org/course/interactivepython">https://www.coursera.org/course/interactivepython</a></li>
</ul>
<h2 id="_3">學習目標</h2>
<ol>
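<li>Open Python 3 on your own Linux machine and follow along with your learning material hands-on. Run Python programs both ways: in the REPL and as scripts.</li>
<li>
<p>Learn to use pip and venv (virtualenv) to manage Python packages and environments.</p>
<ul>
<li>Hint: the official Python site is your friend. Tutorials for both are <a href="https://docs.python.org/3/installing/">here (pip)</a> and <a href="https://docs.python.org/3/library/venv.html?highlight=venv">here (venv)</a>.</li>
</ul>
</li>
<li>
<p><a href="http://rg3.github.io/youtube-dl/">youtube-dl</a> is a tool for downloading videos from YouTube, Crunchyroll, and other major streaming sites. Besides being installable through the Linux package manager, it is itself a package written in Python. To avoid clashing with the system environment, create a Python virtual environment and install it there with pip.</p>
<ul>
<li>Note: beyond downloading streams, youtube-dl also supports transcoding, muxing, and other post-processing, which requires either libav or ffmpeg. On Debian-family Linux, libav is a bit easier to install.</li>
</ul>
</li>
<li>
<p>Use Python to solve some bioinformatics problems the lab runs into. The Rosalind site offers a series of exercises, and I picked some for practice; see <a href="https://blog.liang2.tw/posts/2016/01/lab-coding-appendix-bioinfo-python/">Appendix 2</a>.</p>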
</li>
</ol>
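<p>As a starting point for the venv and pip goals, here is a minimal sketch. The directory name <code>ytdl-env</code> is arbitrary, and it assumes <code>python3</code> with the venv module is available (on Debian-family systems that may need a separate package).</p>

```shell
# Create an isolated environment in a throwaway location.
venv_dir="$(mktemp -d)/ytdl-env"
python3 -m venv "$venv_dir"

# The environment has its own interpreter and pip, separate from the system ones.
"$venv_dir/bin/python" -m pip --version

# For interactive work, activate it instead:  . "$venv_dir/bin/activate"
# Installing needs network access, so it is shown here but not run:
# "$venv_dir/bin/python" -m pip install youtube-dl
```

<p>Anything installed this way lives under the environment directory, so deleting that directory removes it without touching the system Python.</p>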
<p>EDIT 2016-05-22: 把 <a href="https://www.ptt.cc/bbs/Python/M.1463750830.A.DA8.html">ptt 發文</a> 的內容更新上來,增加一些新書和中文翻譯;調整推薦的順序。<br>
EDIT 2017-06-20: 更新書籍資訊與 2/3 比較。</p>Coding 初學指南-版本控制2016-01-21T22:40:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-01-21:/posts/2016/01/lab-coding-version-control/<p>所謂的版本控制就跟玩遊戲一樣可以存取「進度點」,破關前都會保存進度,這樣破關失敗的時候可以還成到保存進度的狀態,再重新打怪。版本控制用在管理程式碼時,就方便讓自己在把 code 搞炸掉的時候,還能回到先前有保存的狀態。</p><p>Last Edited: Jan, 2016 (如果內容有誤,你可以留言,或用任何管道告訴我)</p>
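<p>Version control works like save points in a game: you save before a boss fight, so that if you fail you can restore the saved state and try again. Applied to code, it lets you return to a previously saved state after you have broken something.</p>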
<blockquote>
<h3 id="_1">為什麼使用版本控制?</h3>
<p>在軟體開發的過程中,程式碼每天不斷地產出,過程中會發生以下情況:</p>
<ul>
<li>檔案被別人或自己覆蓋,甚至遺失</li>
<li>想復原前幾天寫的版本</li>
<li>想知道跟昨天寫的差在哪裡?</li>
<li>是誰改了這段程式碼,為什麼 ?</li>
<li>軟體發行,需要分成維護版跟開發版</li>
</ul>
<p>因此,我們希望有一種機制,能夠幫助我們:</p>
<ul>
<li>可以隨時復原修改,回到之前的版本</li>
<li>多人協作時,不會把別人的東西蓋掉</li>
<li>保留修改歷史記錄,以供查詢</li>
<li>軟體發行時,可以方便管理不同版本</li>
</ul>
<p>(From: <a href="http://dylandy.github.io/Easy-Git-Tutorial/">Git 教學研究站</a>)</p>
</blockquote>
<p>Many tools can do version control, but the mainstream choice today is Git.</p>
<div class="toc">
<ul>
<li><a href="#_1">為什麼使用版本控制?</a></li>
<li><a href="#git-version-control">Git (Version Control)</a><ul>
<li><a href="#_2">操作建議</a><ul>
<li><a href="#commit-style">Commit Style</a></li>
</ul>
</li>
<li><a href="#_3">常見問題</a><ul>
<li><a href="#conflict">Conflict</a></li>
<li><a href="#push-fail">Push fail</a></li>
</ul>
</li>
<li><a href="#_4">相關資源</a><ul>
<li><a href="#code-school-try-git">Code School - Try Git</a></li>
<li><a href="#git">Git 教學研究站</a></li>
<li><a href="#code-school-git-real">Code School - Git Real</a></li>
<li><a href="#learn-git-branching">Learn Git Branching</a></li>
<li><a href="#git-tutorial-by-atlassian">Git Tutorial by Atlassian</a></li>
</ul>
</li>
<li><a href="#_5">學習目標</a></li>
</ul>
</li>
</ul>
</div>
<blockquote>
<p><strong>其他 Coding 初學指南系列文章:</strong></p>
<ul>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-intro/">Introduction</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-linux/">Chapter 1 – Linux</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-text-editing/">Chapter 2 – Text Editing (Markdown, Text Editor)</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-version-control/">Chapter 3 – Version Control (Git)</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-python/">Chapter 4 – Python</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-appendix-osx-env/">Appendix 1 – OSX Development Environment</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-appendix-bioinfo-python/">Appendix 2 – Python in Bioinformatics</a></li>
</ul>
<p>或者,用 <a href="/tag/labcoding.html">labcoding</a> 這個 tag 也可以找到所有的文章。</p>
</blockquote>
<h1 id="git-version-control">Git (Version Control)</h1>
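<p>Git is a version control tool.</p>
<p>Git creates a <code>.git</code> folder under your project (repo) directory<sup id="fnref:註1"><a class="footnote-ref" href="#fn:註1">1</a></sup> to manage these "save points", and leaves everything else in the project alone.</p>
<p>These save points can be sent to a server. Anyone who downloads them gets not only the current code but also the record of past development; and once others upload their updates, you can fetch them to get their changes. This is the idea of "syncing": several people share and update one another's work.</p>
<p>A server that handles Git sync operations is called a git server. <a href="https://github.com/">GitHub</a> is a company that offers a free git server for syncing public Git projects. Many Linux tools are developed collaboratively with git, and quite a few have moved their git server to GitHub. Because it is so widely used, I suggest registering a GitHub account.</p>
<p>Git is most often used to manage code, but it manages any plain-text file well, and binary files can be added to a repo too.</p>
<p>(The terms below may require some hands-on git experience.)</p>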
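<h2 id="_2">Workflow advice</h2>
<blockquote>
<p>Make many small save points</p>
</blockquote>
<p>Whenever you finish a set of related changes, add and commit right away. It feels tedious at first, but it is a good habit.</p>
<p>As you get to know git better, you will learn advanced commands (such as <code>git rebase -i</code>) that can combine several commits into one. Splitting one big commit apart is harder.</p>
<h3 id="commit-style">Commit Style</h3>
<p>A typical commit message is a one-liner. If the change needs explanation, follow this format:</p>
<ul>
<li>First line under 50 characters</li>
<li>Second line left blank</li>
<li>From the third line on, any format, but keep each line within 75 characters</li>
<li>Use bullet points where they help</li>
</ul>
<p>Here is an example (from the <a href="http://git-scm.com/book/ch5-2.html">Git Book</a>):</p>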
<div class="highlight"><pre><span></span><code>Short (50 chars or less) summary of changes

More detailed explanatory text, if necessary. Wrap it to
about 72 characters or so. In some contexts, the first
line is treated as the subject of an email and the rest of
the text as the body. The blank line separating the
summary from the body is critical (unless you omit the body
entirely); tools like rebase can get confused if you run
the two together.

Further paragraphs come after blank lines.

  - Bullet points are okay, too

  - Typically a hyphen or asterisk is used for the bullet,
    preceded by a single space, with blank lines in
    between, but conventions vary here
</code></pre></div>
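<p>One way to get the blank line between summary and body without opening an editor is to pass <code>-m</code> more than once. This is a small sketch in a throwaway repo; the identity, file name, and message text are placeholders.</p>

```shell
# Throwaway identity so the commit works on a machine with no git config.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo_dir="$(mktemp -d)"
git init -q "$repo_dir"
printf 'notes\n' > "$repo_dir/notes.md"
git -C "$repo_dir" add notes.md

# Each -m becomes its own paragraph, so a blank line separates summary and body.
git -C "$repo_dir" commit -q \
    -m "Add initial notes" \
    -m "Start a notes file for the exercises in this series."

# Print the full message of the latest commit.
commit_message="$(git -C "$repo_dir" log -1 --format=%B)"
printf '%s\n' "$commit_message"
```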
<p>P.S. You can find plenty of amusing commit messages, for example <a href="http://www.commitlogsfromlastnight.com/">rants</a>.</p>
<h2 id="_3">常見問題</h2>
<h4 id="conflict">Conflict</h4>
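<p>When you work on a single machine on a single branch, conflicts are rare. They show up once several people develop together, or when multiple branches are merged.</p>
<p>The most common cause is two people each editing nearby content in the same file. When git tries to combine the two sets of changes it cannot tell whose change to keep, so it cannot resolve the situation automatically.</p>
<p>Search for "resolve git conflict" to find ways to handle it.</p>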
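<p>To see what a conflict looks like before meeting one in real work, it can be reproduced in a throwaway repo. The sketch below uses placeholder names (<code>feature</code>, <code>notes.txt</code>) and edits the same line on two branches.</p>

```shell
# Throwaway identity so commits work on a machine with no git config.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo_dir="$(mktemp -d)"
git init -q "$repo_dir"
printf 'original line\n' > "$repo_dir/notes.txt"
git -C "$repo_dir" add notes.txt
git -C "$repo_dir" commit -q -m "Add notes"
base_branch="$(git -C "$repo_dir" rev-parse --abbrev-ref HEAD)"

# Change the line on a feature branch...
git -C "$repo_dir" checkout -q -b feature
printf 'edit from feature\n' > "$repo_dir/notes.txt"
git -C "$repo_dir" commit -q -am "Edit notes on feature"

# ...and change the same line differently on the base branch.
git -C "$repo_dir" checkout -q "$base_branch"
printf 'edit from base\n' > "$repo_dir/notes.txt"
git -C "$repo_dir" commit -q -am "Edit notes on base"

# The merge stops; list the files git could not merge automatically.
conflicted_files=""
if ! git -C "$repo_dir" merge feature >/dev/null 2>&1; then
    conflicted_files="$(git -C "$repo_dir" diff --name-only --diff-filter=U)"
    echo "Conflict in: $conflicted_files"
fi
```

<p>The conflicted file now contains <code>&lt;&lt;&lt;&lt;&lt;&lt;&lt;</code>, <code>=======</code>, and <code>&gt;&gt;&gt;&gt;&gt;&gt;&gt;</code> markers around the two versions. Resolving means editing it to the intended content, then running <code>git add</code> and <code>git commit</code>.</p>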
<h4 id="push-fail">Push fail</h4>
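<p>This usually happens when the save points on the server are newer than the ones on your machine, so you have to sync the server's updates down first. If everything is on the same branch, you can try <code>git pull --rebase</code> to avoid an extra merge.</p>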
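<p>The rejected push and the <code>git pull --rebase</code> fix can be rehearsed locally. In this sketch a bare repo stands in for the server, and <code>alice</code> and <code>bob</code> are two hypothetical clones; it assumes both clones end up on the same default branch name, which holds when they are created on one machine.</p>

```shell
# Throwaway identity so commits work on a machine with no git config.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work="$(mktemp -d)"
git init -q --bare "$work/server.git"   # stands in for the git server

# The first collaborator publishes an initial commit.
git clone -q "$work/server.git" "$work/alice" 2>/dev/null
echo "a" > "$work/alice/a.txt"
git -C "$work/alice" add a.txt
git -C "$work/alice" commit -q -m "Add a.txt"
git -C "$work/alice" push -q origin HEAD

# The second collaborator clones, then the first one pushes again.
git clone -q "$work/server.git" "$work/bob"
echo "b" > "$work/alice/b.txt"
git -C "$work/alice" add b.txt
git -C "$work/alice" commit -q -m "Add b.txt"
git -C "$work/alice" push -q origin HEAD

# Bob commits locally; his push is rejected because the server is ahead.
echo "c" > "$work/bob/c.txt"
git -C "$work/bob" add c.txt
git -C "$work/bob" commit -q -m "Add c.txt"
push_rejected="no"
if ! git -C "$work/bob" push -q origin HEAD 2>/dev/null; then
    push_rejected="yes"
fi

# Replay Bob's commit on top of the server's history, then push again.
git -C "$work/bob" pull -q --rebase
git -C "$work/bob" push -q origin HEAD
git -C "$work/bob" log --oneline --graph
```

<p>The final log is a straight line of three commits. A plain <code>git pull</code> would instead have added a merge commit on top.</p>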
<h2 id="_4">相關資源</h2>
<h3 id="code-school-try-git">Code School - Try Git</h3>
<p>互動式練習,能懂最基本的 Git 指令操作,日常操作也主要是這些指令。並且會帶你建立一個 GitHub 帳號。</p>
<ul>
<li><a href="https://try.github.io">https://try.github.io</a></li>
</ul>
<h3 id="git">Git 教學研究站</h3>
<p>中文的介紹,他的互動式練習就是上面 Try Git 的中文化版本。</p>
<ul>
<li><a href="http://dylandy.github.io/Easy-Git-Tutorial/index.html">http://dylandy.github.io/Easy-Git-Tutorial/index.html</a></li>
</ul>
<h3 id="code-school-git-real">Code School - Git Real</h3>
<p>更完整的互動式練習,如果全部的關卡都做完的話,大部份需要用 git 的狀況都練習過了。</p>
<ul>
<li><a href="http://gitreal.codeschool.com">http://gitreal.codeschool.com</a></li>
</ul>
<h3 id="learn-git-branching">Learn Git Branching</h3>
<p>顧名思義,是個練習操作 git branch 的線上學習網站。不過前幾個關卡在介紹 commit 相關的操作,可以試一試。真要練習可以先完成 Main 以下 levels:</p>
<ul>
<li>Introduction Sequence</li>
<li>Ramping Up</li>
<li>Moving Work Around</li>
</ul>
<p>其他稍難一點,視情況跳過。但如果想學 git 比較複雜的指令可以回來看它。</p>
<ul>
<li><a href="http://pcottle.github.io/learnGitBranching/">http://pcottle.github.io/learnGitBranching/</a></li>
</ul>
<h3 id="git-tutorial-by-atlassian">Git Tutorial by Atlassian</h3>
<p>蠻完整的教學,但可能稍難一點。</p>
<ul>
<li><a href="https://www.atlassian.com/git/tutorials">https://www.atlassian.com/git/tutorials</a></li>
</ul>
<h2 id="_5">學習目標</h2>
<ol>
<li>
<p>Manage your notes from these exercises with Git (continuing from the Text Editors exercises).</p>
<ul>
<li>Try some git commands on the repo:<ul>
<li><code>git status</code></li>
<li><code>git log --oneline --graph</code></li>
</ul>
</li>
</ul>
</li>
<li>
<p>Create dotfiles and dotvim repos to manage your environment config files.<br>
dotfiles stores the <code>.xxx</code> files, such as <code>.bashrc</code>, <code>.screenrc</code>, <code>.tmux.conf</code>, and <code>.gitconfig</code>, which usually live at <code>~/.xxx</code> or <code>~/.config/xxx</code>. The benefit of version control is that settings stay in sync across servers.<br>
dotvim stores the Vim configuration in <code>~/.vim</code>. These config files can be symlinked back to where they are supposed to live.</p>
<p><strong>Warning: never put a private key under version control!</strong></p>
<ul>
<li>Hint: searching for dotfiles turns up many examples (e.g. <a href="https://github.com/ccwang002/dotfiles">mine</a>)</li>
</ul>
</li>
<li>
<p>Create your own GitHub account and sync (<strong>push</strong>) the dotfiles / dotvim repos to GitHub.</p>
<ul>
<li>Hint: set up an ssh key pair and upload over ssh. GitHub has a <a href="https://help.github.com/articles/generating-ssh-keys/">complete guide</a>.</li>
</ul>
</li>
</ol>
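<p>The symlink idea behind a dotfiles repo can be tried safely in a scratch directory first. In this sketch <code>demo_home</code> stands in for your real home directory, and the file names are only examples.</p>

```shell
# Use a scratch directory in place of $HOME so nothing real is touched.
demo_home="$(mktemp -d)"
mkdir -p "$demo_home/dotfiles"

# The real file lives inside the (version-controlled) dotfiles directory.
printf 'set number\n' > "$demo_home/dotfiles/vimrc"

# A soft link puts it where the program expects to find it.
ln -s "$demo_home/dotfiles/vimrc" "$demo_home/.vimrc"
ls -l "$demo_home/.vimrc"
```

<p>Editing either path changes the same file, so the repo always holds the current settings.</p>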
<div class="footnote">
<hr>
<ol>
<li id="fn:註1">
<p>The project directory is simply the directory where you ran <code>git init</code>. <a class="footnote-backref" href="#fnref:註1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>Coding 初學指南-文字編輯2016-01-21T22:30:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-01-21:/posts/2016/01/lab-coding-text-editing/<p>這個章節會帶大家認識一個很簡單的純文字格式 Markdown,方便大家整理筆記。同時希望大家學會一個 terminal based 的文字編輯器,方便往後在 server 環境底下的操作。</p><p>Last Edited: Jan, 2016 (如果內容有誤,你可以留言,或用任何管道告訴我)</p>
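<p>This chapter introduces Markdown, a very simple plain-text format that is handy for organizing notes. It also asks you to learn a terminal-based text editor, which will make working on servers easier later on.</p>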
<div class="toc">
<ul>
<li><a href="#markdown">Markdown</a><ul>
<li><a href="#_1">相關資源</a><ul>
<li><a href="#markdown_1">Markdown 語法</a></li>
</ul>
</li>
<li><a href="#_2">學習目標</a></li>
</ul>
</li>
<li><a href="#text-editor">Text Editor</a><ul>
<li><a href="#nano">Nano</a></li>
<li><a href="#emacs">Emacs</a></li>
<li><a href="#vim">Vim</a></li>
<li><a href="#vim_1">Vim 相關資源</a><ul>
<li><a href="#open-vim">Open Vim</a></li>
<li><a href="#vim-ptt">學習 Vim 的心法與攻略 (ptt)</a></li>
<li><a href="#vim-adventure">Vim adventure</a></li>
<li><a href="#vim_2">Vim 本身的使用手冊</a></li>
</ul>
</li>
<li><a href="#_3">學習目標</a></li>
</ul>
</li>
<li><a href="#regex">正規表示式 Regex</a><ul>
<li><a href="#regex_1">Regex 語法派別</a></li>
<li><a href="#_4">相關資源</a><ul>
<li><a href="#regex-one">Regex One</a></li>
<li><a href="#regex-101">Regex 101</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</div>
<blockquote>
<p><strong>其他 Coding 初學指南系列文章:</strong></p>
<ul>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-intro/">Introduction</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-linux/">Chapter 1 – Linux</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-text-editing/">Chapter 2 – Text Editing (Markdown, Text Editor)</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-version-control/">Chapter 3 – Version Control (Git)</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-python/">Chapter 4 – Python</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-appendix-osx-env/">Appendix 1 – OSX Development Environment</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-appendix-bioinfo-python/">Appendix 2 – Python in Bioinformatics</a></li>
</ul>
<p>或者,用 <a href="/tag/labcoding.html">labcoding</a> 這個 tag 也可以找到所有的文章。</p>
</blockquote>
<h1 id="markdown">Markdown</h1>
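<p>Markdown is a lightweight syntax. The idea is that a few simple marks in a plain-text file are enough to produce headings, bold and italic text, hyperlinks, tables, syntax-highlighted code, and so on.</p>
<p>If you know <a href="https://developer.mozilla.org/en-US/docs/Web/HTML">HTML</a>, the format of web pages, markdown syntax maps directly onto HTML, which is why the format is so popular on the web. Its file extension is <code>.md</code>, and many modern projects write their README in markdown (e.g. <code>README.md</code>).</p>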
<h2 id="_1">相關資源</h2>
<h3 id="markdown_1">Markdown 語法</h3>
<ul>
<li><a href="http://markdown.tw/">http://markdown.tw/</a></li>
</ul>
<h2 id="_2">學習目標</h2>
<ol>
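<li>This series of notes is itself written in markdown; you can find the source files <a href="https://github.com/ccwang002/ccwang002.github.io/tree/src/content/blogs/2016-01">here</a>.</li>
<li>Try recording the Linux commands you have learned, or command combinations you use often, in markdown.</li>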
</ol>
<h1 id="text-editor">Text Editor</h1>
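<p>Much of the Linux world is plain-text files, with some prescribed syntax layered on top to form a new format. Markdown above is one example. Even many executable programs are just script files that you can open in an ordinary editor and read. You can try:</p>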
<div class="highlight"><pre><span></span><code><span class="c1"># cd is a shell script</span>
nano<span class="w"> </span><span class="sb">`</span>which<span class="w"> </span><span class="nb">cd</span><span class="sb">`</span><span class="w"> </span><span class="c1"># thanks TP's idea</span>
</code></pre></div>
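<p>Whether <code>which cd</code> finds a file depends on the system, since in many shells <code>cd</code> is a builtin rather than a file on disk. A hedged way to check, plus a quick look at what makes a file a script, is sketched below; the demo script is made up for illustration.</p>

```shell
# Ask the shell itself how it resolves a command name.
cd_kind="$(type cd)"
echo "$cd_kind"

# A script is plain text whose first line starts with "#!".
# Build a tiny hypothetical one and inspect its first two bytes.
demo_script="$(mktemp)"
printf '#!/bin/sh\necho "hello from a script"\n' > "$demo_script"
first_two="$(head -c 2 "$demo_script")"
echo "first two bytes: $first_two"
```

<p>If <code>type</code> reports a path instead of a builtin for some command, that path is what you can open in an editor.</p>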
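<p>Common editors include nano, vi, vim, and emacs. Which one is the best text editor is a never-ending war, and with Notepad++ (GUI), Sublime Text (GUI), and Neovim joining in recent years, the debate will never be settled. For a beginner, learning at least one editor is a must.</p>
<blockquote>
<p>These all start out being introduced as text editors, but once you begin programming, what everyone ends up discussing is the "code editor".</p>
</blockquote>
<p>When editing config files or code, a mistyped keyword is hard to spot, so most people color the keywords. Once code is colored by the role and function of each part, most people also find the program's structure easier to grasp, which is why nearly every editor comes with syntax highlighting.</p>
<p>Beyond syntax highlighting, each editor has its own config file format that lets users change how it behaves. Tuning your usual editor to fit your habits is the first step toward living in the terminal long term, and you can start by consulting (copying) someone else's config.</p>
<blockquote>
<p>Make your editor feel like home.</p>
</blockquote>
<p>Besides config files, feature-rich editors also support plugins, so users can add their own packages. Play with those once you are comfortable with the environment.</p>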
<h2 id="nano">Nano</h2>
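<p>Nano is an editor that is simple and easy to understand, <del>with no syntax highlighting</del><sup id="fnref:註1"><a class="footnote-ref" href="#fn:註1">1</a></sup>. Most systems ship with it, so it is almost always available when you land in a new environment.</p>
<p>VBird (鳥哥) covers it. In fact, if you just run <code>nano</code>, its commands are displayed right on the editing screen.</p>
<h2 id="emacs">Emacs</h2>
<p>Sorry, I do not know it. But it is a very good editor. (Contributions from experts welcome.)</p>
<h2 id="vim">Vim</h2>
<p>A long-established editor that is still under steady development. Its hallmark is modes: some modes let you edit text, while others do not but let you select, search, and so on. It also has a distinctive way of composing commands (a bit like combos and buffs in a game).</p>
<p>Beginners usually find it hard to get used to. Early on, if you are unsure of the mode and cannot remember the commands, it is very hard to operate. So start by memorizing only the most basic commands, always know which mode you are in, and deepen your understanding of vim gradually.</p>
<p>If you really have no idea where to start, VBird has an introduction too.</p>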
<h2 id="vim_1">Vim 相關資源</h2>
<h3 id="open-vim">Open Vim</h3>
<p>互動式的線上學習網站,很短,跟著操作完能會 Vim 基本動作、存檔。</p>
<ul>
<li><a href="http://www.openvim.com/">http://www.openvim.com/</a></li>
</ul>
<h3 id="vim-ptt">學習 Vim 的心法與攻略 (ptt)</h3>
<p>了解最常用的 normal 與 insert 模式及最基本的指令。這篇的內容理解之後,就能用 vim 處理文字編輯了。</p>
<ul>
<li><a href="https://www.ptt.cc/bbs/Editor/M.1264056747.A.885.html">https://www.ptt.cc/bbs/Editor/M.1264056747.A.885.html</a></li>
</ul>
<h3 id="vim-adventure">Vim adventure</h3>
<p>如果很難學習 <code>hjkl</code>、<code>wb</code> 移動的話,這是個要用 vim 指令控制的小遊戲。</p>
<ul>
<li><a href="http://vim-adventures.com/">http://vim-adventures.com/</a></li>
</ul>
<h3 id="vim_2">Vim 本身的使用手冊</h3>
<p>可以使用 <code>vimtutor</code> 指令,或者在 vim normal 模式時鍵入 <code>:help</code></p>
<h2 id="_3">學習目標</h2>
<ol>
<li>
<p>Be able to edit a text file named <code>foo.txt</code> in the terminal.</p>
<ul>
<li>Hint: try nano</li>
<li><code>nano foo.txt</code></li>
</ul>
</li>
<li>
<p>Edit a system config file with root privileges (you may already have done this while following VBird).</p>
<ul>
<li>Hint: try sudo</li>
</ul>
</li>
<li>
<p>Be able to write code in the console. The editor from goal 1 is fine, but I suggest trying another one as well.</p>
<ul>
<li>Hint: try vi, vim or emacs</li>
</ul>
</li>
<li>
<p>Adjust your editor's settings so it fits your habits better.</p>
<ul>
<li>Hint: for vim, try editing <code>~/.vimrc</code>; for emacs, try editing <code>~/.emacs</code></li>
</ul>
</li>
<li>
<p>Use a terminal editor to record your notes and answers for these exercises in markdown.</p>
</li>
</ol>
<h1 id="regex">正規表示式 Regex</h1>
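<p>In normal mode, Vim can search the text with <code>/{pattern}</code>. Besides writing the literal string you are looking for as the pattern, you can design rules that match different strings sharing the same shape. Such rules are called regular expressions (regex).</p>
<blockquote>
<p>Whenever you need complex string matching, consider whether regex can do it.</p>
</blockquote>
<p>Tools that do string matching usually support regex, for example <code>grep</code> and <code>sed</code>. Vim and Python both provide regex as well.</p>
<h3 id="regex_1">Regex syntax flavors</h3>
<p>Since regex is a set of rules for matching strings, its syntax is specified. There are two main families:</p>
<ul>
<li>BRE (basic regex)<ul>
<li>Ex. <code>[:alnum:]</code></li>
</ul>
</li>
<li>ERE (extended regex)<ul>
<li>Ex. <code>\w</code></li>
</ul>
</li>
</ul>
<p>Linux commands are often split into several variants according to the regex syntax they use<sup id="fnref:註2"><a class="footnote-ref" href="#fn:註2">2</a></sup>. For example, grep uses BRE, while egrep uses ERE.</p>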
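<p>A small sketch of the difference, using <code>grep</code> for BRE and <code>grep -E</code> (the modern spelling of egrep) for ERE. The sample strings are made up for illustration.</p>

```shell
# Three made-up identifiers, one per line.
sample="$(printf 'gene1\ngene22\nchr_X\n')"

# BRE: count lines containing "gene" followed by a digit (POSIX bracket class).
bre_count="$(printf '%s\n' "$sample" | grep -c 'gene[[:digit:]]')"

# ERE: the interval {2,} works without backslashes; match whole lines
# that are "gene" followed by two or more digits.
ere_hits="$(printf '%s\n' "$sample" | grep -E '^gene[[:digit:]]{2,}$')"

echo "BRE matched $bre_count lines"
echo "ERE matched: $ere_hits"
```

<p>In BRE the same interval would have to be written as <code>\{2,\}</code>, which is the kind of detail worth checking in each tool's documentation.</p>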
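<p>Text-related tools such as Vim, Python, and Perl<sup id="fnref:註3"><a class="footnote-ref" href="#fn:註3">3</a></sup> each have their own way of writing regex, but all are more or less similar to the two families above. Check the syntax before using any of them. In Vim, see <code>:help regex</code>.</p>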
<h2 id="_4">相關資源</h2>
<h3 id="regex-one">Regex One</h3>
<p>主要是介紹 pcre 的語法,每一個 example 多介紹一個新的語法。接著還有個 practical examples 練習整理不同的語法。</p>
<ul>
<li><a href="http://regexone.com/">http://regexone.com/</a></li>
</ul>
<h3 id="regex-101">Regex 101</h3>
<p>regex 很容易寫到自己都看不懂,這是一個幫助了解自己或別人寫好的 regex pattern 的網站。</p>
<ul>
<li><a href="https://regex101.com/">https://regex101.com/</a></li>
</ul>
<div class="footnote">
<hr>
<ol>
<li id="fn:註1">
<p>nano can in fact do syntax highlighting; see the <a href="https://wiki.archlinux.org/index.php/Nano">Arch wiki</a> and <a href="https://github.com/scopatz/nanorc">nanorc</a>. Thanks @concise <a class="footnote-backref" href="#fnref:註1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:註2">
<p>For which commands support regex and which syntax they accept, see the <a href="https://www.debian.org/doc/manuals/debian-reference/ch01.en.html#_unix_text_tools">Debian Reference</a> <a class="footnote-backref" href="#fnref:註2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:註3">
<p>Perl's regex syntax is also known as <strong>pcre</strong> style and is widely adopted by other tools, for example php <a class="footnote-backref" href="#fnref:註3" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
</ol>
</div>Coding 初學指南-Linux2016-01-21T21:30:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-01-21:/posts/2016/01/lab-coding-linux/<p>Last Edited: Jan, 2016 (如果內容有誤,你可以留言,或用任何管道告訴我)</p>
<p>學習使用 Linux 是第一個比較大的障礙,因為會在短時間接觸到非 …</p><p>Last Edited: Jan, 2016 (如果內容有誤,你可以留言,或用任何管道告訴我)</p>
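<p>Learning Linux is the first big hurdle, because you meet a great many new things in a short time. Everything that follows relates to Linux in some way, and the hard part of Linux is starting to operate "the whole computer" from a terminal, which feels unintuitive to people used to windowed interfaces. Fortunately, the mainstream Linux distributions of recent years all have good graphical interfaces (properly called Desktop Environments), so you can ease into terminal work gradually.</p>
<blockquote>
<p>For development on the lab servers, being able to get things done in a terminal is a must.</p>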
</blockquote>
<div class="toc">
<ul>
<li><a href="#linuxunixbsdnix">Linux、Unix、BSD、*nix</a></li>
<li><a href="#distro">Distro 簡介</a><ul>
<li><a href="#redhat-centos-fedora">Redhat / CentOS / Fedora</a></li>
<li><a href="#debian-ubuntu">Debian / Ubuntu</a></li>
<li><a href="#archlinux">ArchLinux</a></li>
</ul>
</li>
<li><a href="#gnomekdexfcelxde">桌面環境 GNOME、KDE、XFCE、LXDE</a></li>
<li><a href="#_1">相關資源</a><ul>
<li><a href="#_2">鳥哥的私房菜</a><ul>
<li><a href="#_3">各章節重點整理</a></li>
</ul>
</li>
<li><a href="#introduction-to-linux-on-edx-course">Introduction to Linux on edX course</a></li>
<li><a href="#debian-user-manual">Debian User Manual</a><ul>
<li><a href="#chapter-highlights">Chapter Highlights</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#_4">學習目標</a></li>
</ul>
</div>
<blockquote>
<p><strong>其他 Coding 初學指南系列文章:</strong></p>
<ul>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-intro/">Introduction</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-linux/">Chapter 1 – Linux</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-text-editing/">Chapter 2 – Text Editing (Markdown, Text Editor)</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-version-control/">Chapter 3 – Version Control (Git)</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-python/">Chapter 4 – Python</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-appendix-osx-env/">Appendix 1 – OSX Development Environment</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-appendix-bioinfo-python/">Appendix 2 – Python in Bioinformatics</a></li>
</ul>
<p>或者,用 <a href="/tag/labcoding.html">labcoding</a> 這個 tag 也可以找到所有的文章。</p>
</blockquote>
<h2 id="linuxunixbsdnix">Linux、Unix、BSD、*nix</h2>
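<p>Linux and Unix are different, but beginners will hardly notice the distinction. Their terminal commands are very similar, hence the umbrella term *nix. Books introducing Linux usually tell the history in full<sup id="fnref:註2"><a class="footnote-ref" href="#fn:註2">1</a></sup>, so if you enjoy stories of how software evolved, pay attention to that part.</p>
<p>Roughly, and not exhaustively, the main distributions (distros) are:</p>
<ul>
<li>Redhat, CentOS, Fedora</li>
<li>Debian, Ubuntu, Linux Mint</li>
<li>ArchLinux</li>
<li>openSUSE</li>
<li>FreeBSD, OpenBSD</li>
</ul>
<p>There are actually a great many Linux and BSD (or Unix) systems<sup id="fnref:註3"><a class="footnote-ref" href="#fn:註3">2</a></sup>, but newcomers to Linux should pick one of the popular distros so that help is easy to find.</p>
<p>The grouping above is deliberate: similar distros share a row, and once you learn one, the others in the same row are easy to pick up. The first two rows are the two big families, reflecting two ecosystems. Our lab's main server runs CentOS, but in recent years my own machines have gradually moved to Debian.</p>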
<h2 id="distro">Distro 簡介</h2>
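<p>What follows is my subjective, no-warranty overview. The space given to each makes it obvious that I lean toward Debian.</p>
<h4 id="redhat-centos-fedora">Redhat / CentOS / Fedora</h4>
<p>Redhat is the commercial edition; the community-maintained counterpart is CentOS. It is known for being conservative and stable, but the flip side is that newer software has to be installed by hand, which is a drawback for a lab that relies on recent tools. Packages are managed with <code>yum xxx</code>. Fedora carries newer software, but nobody in our lab uses it, so I do not recommend it.</p>
<h4 id="debian-ubuntu">Debian / Ubuntu</h4>
<p>Debian heads the other big family, and despite its age it remains under steady development, with good-quality documentation and <a href="https://wiki.debian.org/">wiki</a>. It has stable, testing, and unstable channels, each mapped to a version number and codename. As of 2016.01, stable is jessie (8) and testing is stretch (9); unstable always maps to sid. As the names suggest, they reflect how new the packages (software) are. Tools on stable are therefore older and less suited to lab use, but testing fits well and I personally recommend it. Packages are managed with <code>apt-get xxx</code>.</p>
<p>Ubuntu, a member of the Debian family, is hugely popular, with plenty of tutorials online and a company behind it. Debian's strengths mostly carry over to Ubuntu. Although Ubuntu ports its packages from Debian, it has its own version numbers and releases a new version every six months.</p>
<h4 id="archlinux">ArchLinux</h4>
<p>The rest you can explore on your own if interested, but I want to single out ArchLinux. It is a very do-it-yourself system and not at all suited to newcomers or the lazy. It does, however, have a thorough and well-written <a href="https://wiki.archlinux.org/">wiki</a>. When you want to learn a new package or cannot figure out a setting, give their wiki priority among the Google results.</p>
<blockquote>
<p>When looking things up, besides StackOverflow and the Ubuntu forums, read the high-quality <a href="https://wiki.archlinux.org/">Arch Linux</a> and <a href="https://wiki.debian.org/">Debian</a> wikis more often.</p>
</blockquote>
<p>After all that, without a clear recommendation most people still find it hard to decide. So if you are a beginner, I suggest installing Ubuntu; at the time of writing, pick the <a href="http://releases.ubuntu.com/15.10/">15.10</a> or <a href="http://releases.ubuntu.com/14.04/">14.04 LTS</a> Desktop edition, because it has the most resources online.</p>
<p>I am not very fond of Ubuntu myself, though, so once you can look up Linux operations on your own, consider trying other distros (for example the <a href="https://www.debian.org/releases/testing/">Debian testing channel</a> I recommend).</p>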
<h2 id="gnomekdexfcelxde">桌面環境 GNOME、KDE、XFCE、LXDE</h2>
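<p>A graphical interface (GUI) needs more than the user's applications: it also needs core packages for system support and management. Such a collection of GUI packages is called a desktop environment.</p>
<p>Windows and OS X install theirs automatically with the system, so there is only one desktop environment to choose from. On Linux, installing a GUI is optional. The system is fully usable with only a terminal interface (for example when you install Ubuntu Server), and many servers skip the desktop environment for performance and security reasons.</p>
<p>On Linux the desktop environment can be added later, it comes in different "flavors", and users are free to remove it (though that may well break things). Common choices are GNOME, KDE, XFCE, and LXDE<sup id="fnref:註4"><a class="footnote-ref" href="#fn:註4">3</a></sup>. For example, to add the GNOME graphical interface after installing Ubuntu Server:</p>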
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>apt-get<span class="w"> </span>install<span class="w"> </span>ubuntu-gnome-desktop
</code></pre></div>
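<p>I will not go into how these implementations differ. In short, GNOME is the most popular. XFCE uses fewer system resources and is often what lab servers install. Ubuntu defaults to Unity, which is derived from GNOME.</p>
<p>For a first install, just go with the default. Switching to a different desktop environment afterwards requires a solid understanding of the package manager (e.g. apt, yum) and of system configuration.</p>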
<h2 id="_1">相關資源</h2>
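<p>Honestly, I am not sure where to find Linux resources that go from absolute beginner to complete coverage. If you come across something more suitable while learning, please let me know (for example, leave a comment).</p>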
<h3 id="_2">鳥哥的私房菜</h3>
<ul>
<li><a href="http://linux.vbird.org/linux_basic/">鳥哥官網(基礎學習篇)</a></li>
<li><a href="http://www.tenlong.com.tw/items/9861818510?item_id=53725">實體書連結</a></li>
</ul>
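<p>Anyone learning Linux in Taiwan will probably be pointed to VBird (鳥哥) first; it is likely the most widely used complete resource in Chinese. If you are willing to work through it, you will certainly come away knowing Linux well enough. I started with this book in college too.</p>
<p>Where it falls short is that VBird covers Redhat-family operations, and many settings are unnecessary on Ubuntu or are managed differently. For example, on CentOS you might edit text files to change a setting, while on Ubuntu you can use the <code>dpkg-reconfigure</code> command. His demo OS is CentOS 5.x, CentOS is now at 7.x, many settings are outdated, and newer tools are not covered.</p>
<p>For example, current Linux installers offer good defaults for disk partitioning, so beginners no longer need to learn how to tune swap and similar settings. Systems also offer LVM (Logical Volume Manager) <sup id="fnref:註5"><a class="footnote-ref" href="#fn:註5">4</a></sup> for managing partitions, and these volumes (LVs) can be resized later. In other words, the material through Chapter 5 (inclusive) differs from how Linux is used today. If you simply read VBird chapter by chapter in order, you will not be able to match it against your own system, because a recent Linux install is just a matter of clicking Next all the way through.</p>
<h4 id="_3">Chapter highlights</h4>
<p>Even with every caveat applied, VBird's content is very helpful for beginners. To save you from wasting time reconciling the old ways with current practice (and with what you find on Google), here is a table showing which parts of each chapter are worth reading.</p>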
<table>
<thead>
<tr>
<th style="text-align: center;">Chapter</th>
<th style="text-align: left;">Title</th>
<th>What to read</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center;">0</td>
<td style="text-align: left;">Introduction to Computers</td>
<td>Read it all if CPU, RAM, and the MB/GB units are new to you; otherwise read Data Representation (3) and How Software Runs (4)</td>
</tr>
<tr>
<td style="text-align: center;">5</td>
<td style="text-align: left;">First Login and Online Help (man page)</td>
<td>Issuing Commands in Text Mode (2), man page and info page (3), nano (4)</td>
</tr>
<tr>
<td style="text-align: center;">6</td>
<td style="text-align: left;">Linux File Permissions and Directory Layout</td>
<td>All</td>
</tr>
<tr>
<td style="text-align: center;">7</td>
<td style="text-align: left;">Linux File and Directory Management</td>
<td>Everything except Hidden and Special File Attributes (4)</td>
</tr>
<tr>
<td style="text-align: center;">8</td>
<td style="text-align: left;">Linux Disk and Filesystem Management</td>
<td>Simple Filesystem Operations (2)</td>
</tr>
<tr>
<td style="text-align: center;">9</td>
<td style="text-align: left;">Compressing and Archiving Files and Filesystems</td>
<td>Purpose and Techniques of Compression (1), Archiving Commands (3)</td>
</tr>
<tr>
<td style="text-align: center;">10<sup id="fnref:*"><a class="footnote-ref" href="#fn:*">5</a></sup></td>
<td style="text-align: left;">The vim Editor</td>
<td>Locale and Encoding Conversion (4.3)</td>
</tr>
<tr>
<td style="text-align: center;">11</td>
<td style="text-align: left;">Getting to Know BASH</td>
<td>All, though 2.4-2.8 and 6.4 can be skipped as needed</td>
</tr>
<tr>
<td style="text-align: center;">12<sup id="fnref:†"><a class="footnote-ref" href="#fn:†">6</a></sup></td>
<td style="text-align: left;">Regular Expressions and Text Formatting</td>
<td>Preface (1)</td>
</tr>
<tr>
<td style="text-align: center;">13</td>
<td style="text-align: left;">Learning Shell Scripts</td>
<td>All (read it when you need it)</td>
</tr>
<tr>
<td style="text-align: center;">22</td>
<td style="text-align: left;">Installing Software: Source Code and Tarballs</td>
<td>All (understand the workflow and recognize the keywords)</td>
</tr>
<tr>
<td style="text-align: center;">23</td>
<td style="text-align: left;">Installing Software: RPM, SRPM, and YUM</td>
<td>Ubuntu uses APT<sup id="fnref:‡"><a class="footnote-ref" href="#fn:‡">7</a></sup></td>
</tr>
</tbody>
</table>
<h3 id="introduction-to-linux-on-edx-course">Introduction to Linux on edX course</h3>
<ul>
<li><a href="https://www.edx.org/course/introduction-linux-linuxfoundationx-lfs101x-2">Course link</a></li>
</ul>
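<p>An online course run by the Linux Foundation, with English videos and lecture notes. They even got Linus Torvalds, the creator of the Linux kernel, to record an introduction. It really does start from the very basics. I have skimmed it, and my concern is that it may not go deep enough, so you will need to pair it with other resources. The upside is that early progress is much faster than reading VBird's opening chapters (which are an introduction to computers).</p>
<h3 id="debian-user-manual">Debian User Manual</h3>
<p>The Debian user manuals, in English, which include troubleshooting for common problems, installation guides for various hardware, and a reference manual. If you want to properly learn how a modern Debian (Linux) system is used, these are worth consulting, and they are still maintained.</p>
<p>The drawback is that the manuals are very long. They are best used when there is something specific you want to understand in depth.</p>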
<ul>
<li><a href="https://www.debian.org/doc/user-manuals">https://www.debian.org/doc/user-manuals</a></li>
<li><a href="https://www.debian.org/doc/manuals/debian-reference/index.en.html">Debian Reference</a> (online HTML)</li>
</ul>
<h4 id="chapter-highlights">Chapter Highlights</h4>
<table>
<thead>
<tr>
<th style="text-align: left;">Chp. No</th>
<th style="text-align: left;">Chp. Name</th>
<th style="text-align: left;">Highlights</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">1</td>
<td style="text-align: left;">GNU/Linux tutorials</td>
<td style="text-align: left;">Everything except for 1.3 Midnight Commander</td>
</tr>
<tr>
<td style="text-align: left;">2</td>
<td style="text-align: left;">Debian package management</td>
<td style="text-align: left;">Read 2.2 Basic package management operations</td>
</tr>
<tr>
<td style="text-align: left;">10</td>
<td style="text-align: left;">Data management</td>
<td style="text-align: left;">Read 10.1 Sharing, copying, and archiving</td>
</tr>
</tbody>
</table>
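<h2 id="_4">Learning goals</h2>
<p>"Linux" here covers a lot of ground, and it is easy to lose direction when you start out. So I have listed a few important concepts that you should run into early on:</p>
<ul>
<li>Understand how <code>$PATH</code> relates to where programs are found<ul>
<li>Why typing <code>ls</code> finds the program named <code>ls</code></li>
</ul>
</li>
<li>Know stdin, stdout, and stderr, and how to use pipelines</li>
<li>Know what environment variables are and how to change them</li>
<li>Understand files, directories, relative paths, and permission settings</li>
<li>Use <code>&lt;cmd&gt; -h</code>, <code>&lt;cmd&gt; --help</code>, and <code>man &lt;cmd&gt;</code> to look up what a command does and which options it takes<ul>
<li>&lt;cmd&gt; = any command on Linux</li>
</ul>
</li>
</ul>
<p>If you have spent a week and have never even heard of the items above (or hardly used them), then your way of learning Linux is probably quite different from what I had in mind, so please email me first. These concepts are also learned gradually: having heard of them after a week without fully understanding them is perfectly normal.</p>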
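<p>The <code>$PATH</code> item in the list above can be made concrete with a small sketch. It mimics, in simplified form, how a shell resolves a command name by scanning the directories in <code>$PATH</code> from left to right. The function name <code>find_in_path</code> is made up for this illustration, and real shells also handle builtins, aliases, and hashing, which are ignored here.</p>

```python
import os


def find_in_path(command, path_value):
    """Return the first executable called `command` in a PATH-style string, else None."""
    for directory in path_value.split(os.pathsep):
        if not directory:
            # Simplification: POSIX treats an empty entry as the current
            # directory; this sketch just skips it.
            continue
        candidate = os.path.join(directory, command)
        # The first match wins, which is why the order of $PATH entries matters.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


# Where would `ls` be found with the current environment?
print(find_in_path("ls", os.environ.get("PATH", "")))
```

<p>Putting a directory earlier in <code>$PATH</code> makes its programs take precedence, which is how a self-installed tool can shadow the system one.</p>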
<ol>
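<li>Install a Linux system from scratch by yourself (a VM is fine).</li>
<li>Use it regularly for at least a week (that is, get familiar with basic commands such as <code>cd</code> and <code>ls</code>).</li>
<li>
<p>Connect to a remote Linux machine over ssh. (The ssh port must be open.)</p>
<ul>
<li>Bonus: log in over ssh without typing a password.</li>
<li>Bonus hint: look up <code>authorized_keys</code>. You will need to create an ssh user identity keypair, which you will also use when pushing to GitHub.</li>
</ul>
</li>
<li>
<p>Install a system monitor called <a href="http://hisham.hm/htop/">htop</a> and use it to inspect system resource usage.</p>
<ul>
<li>Bonus:<ul>
<li>Rearrange the column layout</li>
<li>Turn on Tree View</li>
<li>Show only the processes of a single user (very old versions of htop may lack this feature)</li>
</ul>
</li>
</ul>
</li>
<li>
<p>Install a download tool with resume support called <a href="http://aria2.sourceforge.net/">aria2</a>. It can download over HTTP(S), FTP, and even BitTorrent using multiple connections. Suppose you want to download the Debian Jessie netinst image: do it with 2 connections at once.</p>
<ul>
<li>Hint: read the <code>aria2c</code> man page.</li>
</ul>
</li>
<li>
<p>Learn how to check disk usage for the whole system, and how to see the sizes of everything in the current directory (definitely not with <code>ls -l</code>).</p>
<ul>
<li>Hint: <code>df</code> and <code>du</code></li>
</ul>
</li>
<li>
<p>scp is a command that transfers one or more files over ssh. Try using it to copy file(s) from your own computer to a server.</p>
<ul>
<li>Bonus:<ul>
<li>Use the wildcard characters <code>*?</code> in the path to transfer multiple files</li>
<li>There is a more sophisticated transfer tool called rsync. Try using it instead.</li>
</ul>
</li>
</ul>
</li>
<li>
<p>Use a remote GUI session. There are many technologies for this, with VNC and RDP being the most common. RDP works more smoothly when connecting to Windows; VNC transmits the screen less efficiently, which puts a noticeable load on the server and lags easily. A newer protocol called NX compresses the display so that the graphical interface stays usable even on a slow connection.<br>
Try a remote desktop connection to the server with X2go, which implements the NX protocol.</p>
<ul>
<li>Hint: you need to install X2go on both the server and the client (usually your own computer), and you will use your SSH connection settings.</li>
</ul>
</li>
<li>
<p>Live on Linux alone for at least a week (including Chinese input, web browsing, and so on).</p>
</li>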
</ol>
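<p>For the disk-usage task in the list above, the difference between <code>df</code> (how full a filesystem is) and <code>du</code> (how much space a directory tree takes) can be sketched in Python. This is only an illustration of the two ideas, not a replacement for the commands: <code>directory_size</code> is a made-up helper that sums file lengths in bytes, whereas <code>du</code> reports allocated disk blocks, so the numbers will not match exactly.</p>

```python
import os
import shutil


def directory_size(root):
    """Sum the byte lengths of all regular files below `root` (a rough `du -s`)."""
    total = 0
    for current, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(current, name)
            # Skip symlinks so linked targets are not counted.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


# `df`-like view: totals for the filesystem that holds the current directory.
usage = shutil.disk_usage(".")
print(f"filesystem: {usage.used / usage.total:.0%} used")

# `du`-like view: everything below the current directory.
print(f"current directory: {directory_size('.')} bytes")
```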
<div class="footnote">
<hr>
<ol>
<li id="fn:註2">
<p>The lineage of Linux distros: <a href="http://en.wikipedia.org/wiki/Linux_distribution">http://en.wikipedia.org/wiki/Linux_distribution</a> <a class="footnote-backref" href="#fnref:註2" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:註3">
<p><a href="http://distrowatch.com/">Distro Watch</a> covers all kinds of Linux and BSD systems; you can read about each distro there. <a class="footnote-backref" href="#fnref:註3" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:註4">
<p>LXDE was originally written by PCMan, and quite a few people from Taiwan help maintain it. <a class="footnote-backref" href="#fnref:註4" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
<li id="fn:註5">
<p>It is fine if you don't understand LVM. If you are interested, see <a href="http://linux.vbird.org/linux_basic/0420quota.php#lvm">VBird chapter 15</a> and the <a href="https://wiki.archlinux.org/index.php/LVM">Arch Wiki</a>. <a class="footnote-backref" href="#fnref:註5" title="Jump back to footnote 4 in the text">↩</a></p>
</li>
<li id="fn:*">
<p>There are other resources for learning vim; see <a href="https://blog.liang2.tw/posts/2016/01/lab-coding-text-editing/">2 Text Editing</a>.<br> <a class="footnote-backref" href="#fnref:*" title="Jump back to footnote 5 in the text">↩</a></p>
</li>
<li id="fn:†">
<p>Regular expressions (regex) are important, but they can feel complicated when you are first learning Linux, so feel free to skip them. You will meet vim's regex again in <a href="https://blog.liang2.tw/posts/2016/01/lab-coding-text-editing/">2 Text Editing</a> and Python's regex in <a href="https://blog.liang2.tw/posts/2016/01/lab-coding-python/">4 Python</a>, and you can come back to commands such as <code>sed</code> and <code>egrep</code> then.<br> <a class="footnote-backref" href="#fnref:†" title="Jump back to footnote 6 in the text">↩</a></p>
</li>
<li id="fn:‡">
<p>For an APT tutorial, see the <a href="https://help.ubuntu.com/community/AptGet/Howto">official Ubuntu site</a> or <a href="http://blog.longwin.com.tw/2005/05/use_apt/">these community notes</a>. <a class="footnote-backref" href="#fnref:‡" title="Jump back to footnote 7 in the text">↩</a></p>
</li>
</ol>
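</div>Coding Beginner's Guide: Overview2016-01-21T21:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-01-21:/posts/2016/01/lab-coding-intro/<p>For new lab members: the background knowledge and a starter skill tree for understanding today's software development workflow and basic skills.</p><p>Last Edited: Jan, 2016</p>
<p>(If anything here is wrong, leave a comment or tell me through any channel.)</p>
<p>A lab is not really a place for formal programming, and most people don't take programming seriously. But as the data and sample counts to analyze keep growing, to the point where your own computer can't cope and even the server takes a long time, the importance of programming becomes obvious. Research today also emphasizes <em>reproducibility</em>: if you want your analysis to be reproducible a year from now, or by other researchers around the world, you need basic programming skills.</p>
<p>This series aims to help new lab members, whether or not they have a CS background, understand today's software development workflow and basic skills. Software development requires some background knowledge before you can communicate normally with developers. That background includes:</p>
<ul>
<li>Being comfortable working on a server (or using Linux)</li>
<li>What a widely used software tool looks like</li>
<li>How to share your code with others</li>
<li>Collaborative development</li>
<li>How other people write code</li>
</ul>
<p>These topics get little coverage in school coursework, and people from an electrical engineering background tend to code even more casually, but they are essential for a long-running software project. I hope everyone builds these habits.</p>
<p>How proficient you need to be in these topics is a matter of opinion. Each is deep enough to study for months, but that is impractical for a lab project or a software project that has to ship, and at the very least your boss won't care. So in my view the minimum bar is this: when you hit something you don't understand, you know which keywords to search for, and you can understand what you find.</p>
<p>The series originally lived on a <a href="https://gist.github.com/ccwang002/368025d3c541ed983892">GitHub Gist</a>, but now that I have my own blog I have moved it here and updated it along the way.</p>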
<div class="toc">
<ul>
<li><a href="#_1">How to learn</a><ul>
<li><a href="#_2">Q: Why not teach everyone hands-on?</a></li>
</ul>
</li>
<li><a href="#_3">Proficiency comes from everyday practice</a></li>
<li><a href="#windows">For Windows users</a></li>
<li><a href="#osx">For OSX users</a></li>
<li><a href="#_4">Table of contents</a></li>
</ul>
</div>
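<h3 id="_1">How to learn</h3>
<p>Each post covers one topic and lists some resources under it. Each topic ends with learning goals so you can gauge how far you have got. A learning goal is a concrete task, and I try to tie it to (geeky) daily life. Completing the first one or two goals is usually enough. This is not homework, so you don't need to show me. If you don't mind showing me, though, I will share my subjective advice. Most tasks have no single correct answer: anything that solves the problem is a good approach.</p>
<p>In short, you don't have to read every resource or do every task. Decide for yourself how much time to spend on each topic.</p>
<blockquote>
<p>Dig as deep as you like into what you enjoy; for what doesn't click, skim it until you can get by.</p>
</blockquote>
<p>I will try to order things by difficulty and include resources in both Chinese and English.</p>
<h4 id="_2">Q: Why not teach everyone hands-on?</h4>
<p>The short answer: no time. The serious answer: everyone learns at a different pace from a different starting point, so learning in lockstep would just waste your time.</p>
<p>I probably can't walk everyone through each item. Many of the links below are only an entry point, and really learning the material takes a real investment of time. So don't assume that reading these posts is all it takes to master a topic.</p>
<blockquote>
<p>Put plainly, this background knowledge is like a skill tree in a game: you need points in the basic skills before you can unlock the advanced ones. You could max out the basic skills before leveling further, but there is no need to grind forever; not everyone enjoys it, and real life won't let you grind levels without taking on quests.</p>
</blockquote>
<p>At first you may have to look things up for every small problem, or get redirected through four or five pages before your question is even partly answered. That is completely normal. If you can get through this frustrating early phase, teaching yourself software later will basically not be a problem.</p>
<h3 id="_3">Proficiency comes from everyday practice</h3>
<p>To learn programming quickly, I recommend practicing by applying it in daily life. For example, download files from the command line, or turn your laptop into a Linux desktop and practice compiling software and sorting out installation problems yourself. Make your computer a development environment you are comfortable with and use it often; that reduces the unfamiliarity and helplessness you feel toward programming.</p>
<p>If that is a bit too much, a lighter option is to start with thought exercises: think about how you would write a program to control everyday things. How would I design an elevator system? How does Facebook present everyone's updates? A rough idea is enough; nothing bad happens if you can't work it out, and there is no need to look anything up. After a while, your sense for programming will improve.</p>
<p>If you want a more hardcore approach, try reading the source code of the software installed on your computer, and join the projects behind a few tools you use regularly to fix bugs and add features (usually called contributing). You can also read the source code of tools used in the lab, such as sratoolkit and cutadapt, and see whether you can follow other people's code.</p>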
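<h3 id="windows">For Windows users</h3>
<blockquote>
<p>I recommend finding a way to run Linux (or use a Mac). If you don't want to replace your Windows environment, you can install VirtualBox and run a virtual Linux, or set up a virtual server on a provider such as Amazon.</p>
</blockquote>
<p>Because Windows has a well-designed graphical interface (GUI), it also does not make it easy for users to work in command-line mode (terminal, console). Windows does have environments such as Command Prompt and PowerShell, but it is hard to control the whole system with them. It is also fundamentally different from Linux, so the relevant commands are hard to find in online tutorials, and most Windows developers don't like using a terminal either.</p>
<p>On the other hand, most people think of Visual Studio as paid software<sup id="fnref:註1"><a class="footnote-ref" href="#fn:註1">1</a></sup>. It is the most complete and most usable development environment on Windows, so anyone unwilling to pay may feel that programming is full of obstacles. VS is also a graphical editor, which makes it harder to understand what happens behind the scenes. Most open-source software is developed on *nix, and its poor Windows support deepens the barrier ("I want to install it myself, but there are restrictions everywhere and it fails easily").</p>
<p>The downside of not using a terminal is that it is hard to picture why the software on your system works. Writing a GUI program seems to demand great skill and feels unlike the programming you have learned. In practice there is not much difference: producing a windowed application that can be installed on a system just takes many more steps, and simple projects rarely reach that stage.</p>
<h3 id="osx">For OSX users</h3>
<p>Mac OSX users see some of the same thing, but the lower layers of OSX are very similar to FreeBSD, and FreeBSD is similar to Linux, so its terminal environment is complete. Many software developers now use OSX, so there are plenty of online resources for both Linux and OSX, and experience with one usually carries over naturally to the other.</p>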
<p>For how to develop on OSX, see my notes in <a href="https://blog.liang2.tw/posts/2016/01/lab-coding-appendix-osx-env/">Appendix 1</a>, though they are quite subjective and not everyone will work the way I do.</p>
<h2 id="_4">Table of contents</h2>
<p>The material grew quite long as I wrote, so I split it into several posts. Please read them in numerical order:</p>
<ul>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-intro/">Introduction</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-linux/">Chapter 1 – Linux</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-text-editing/">Chapter 2 – Text Editing (Markdown, Text Editor)</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-version-control/">Chapter 3 – Version Control (Git)</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-python/">Chapter 4 – Python</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-appendix-osx-env/">Appendix 1 – OSX Development Environment</a></li>
<li><a href="https://blog.liang2.tw/posts/2016/01/lab-coding-appendix-bioinfo-python/">Appendix 2 – Python in Bioinformatics</a></li>
</ul>
<p>Alternatively, you can find all the posts under the <a href="/tag/labcoding.html">labcoding</a> tag.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:註1">
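<p>Since 2013, Visual Studio has had a <a href="https://www.visualstudio.com/en-us/news/vs2013-community-vs.aspx">Community</a> edition, which is free and largely identical to the Professional edition, so getting a C/C++ 32/64-bit compiler should become easier. However, most open-source software has not caught up, and much of it still targets older VS versions (which must be paid for); this will continue for a while. Universities generally have licenses. <a class="footnote-backref" href="#fnref:註1" title="Jump back to footnote 1 in the text">↩</a></p>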
</li>
</ol>
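</div>Numpy Indexing2016-01-18T02:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-01-18:/posts/2016/01/numpy-index/<p>Multi-dimensional indexing in Numpy behaves differently from pandas and needs an extra step.</p><p>A few days ago, while writing some <a href="http://docs.scipy.org/doc/numpy/index.html">numpy</a> code, I suddenly noticed that its indexing behaves quite differently from <a href="http://pandas.pydata.org/">pandas</a>. I am sure I will forget this, so I am writing it down.</p>
<p>Let's use a current event as the example: grab the table of <a href="https://zh.wikipedia.org/wiki/2016%E5%B9%B4%E4%B8%AD%E8%8F%AF%E6%B0%91%E5%9C%8B%E7%AB%8B%E6%B3%95%E5%A7%94%E5%93%A1%E9%81%B8%E8%88%89">legislative candidate nominations by party for Taiwan's 2016 election</a> from Wikipedia. The code that processes the raw data is at the end of the post; the result looks roughly like this:</p>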
<table>
<thead>
<tr>
<th style="text-align: left;"></th>
<th style="text-align: right;">區域</th>
<th style="text-align: right;">原住民</th>
<th style="text-align: right;">不分區</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">中國國民黨</td>
<td style="text-align: right;">72</td>
<td style="text-align: right;">5</td>
<td style="text-align: right;">33</td>
</tr>
<tr>
<td style="text-align: left;">民主進步黨</td>
<td style="text-align: right;">60</td>
<td style="text-align: right;">2</td>
<td style="text-align: right;">34</td>
</tr>
<tr>
<td style="text-align: left;">台灣團結聯盟</td>
<td style="text-align: right;">2</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">15</td>
</tr>
<tr>
<td style="text-align: left;">親民黨</td>
<td style="text-align: right;">6</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">16</td>
</tr>
<tr>
<td style="text-align: left;">無黨團結聯盟</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">7</td>
</tr>
<tr>
<td style="text-align: left;">民國黨</td>
<td style="text-align: right;">13</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">10</td>
</tr>
<tr>
<td style="text-align: left;">綠黨社會民主黨聯盟</td>
<td style="text-align: right;">11</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">6</td>
</tr>
<tr>
<td style="text-align: left;">中華統一促進黨</td>
<td style="text-align: right;">14</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">10</td>
</tr>
<tr>
<td style="text-align: left;">時代力量</td>
<td style="text-align: right;">12</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">6</td>
</tr>
<tr>
<td style="text-align: left;">大愛憲改聯盟</td>
<td style="text-align: right;">12</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">6</td>
</tr>
</tbody>
</table>
<h3 id="pandas-indexing">Pandas indexing</h3>
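<p>Pandas indexing is so elaborate that even <a href="http://pandas.pydata.org/pandas-docs/stable/indexing.html">one</a> or <a href="http://pandas.pydata.org/pandas-docs/stable/advanced.html">two</a> pages of documentation cannot cover it all.</p>
<p>Today, though, I only want to talk about indexing along two or more dimensions. Say we want the district (區域) and at-large (不分區) nominations of the Kuomintang (中國國民黨), the Democratic Progressive Party (民主進步黨), and the New Power Party (時代力量):</p>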
<div class="highlight"><pre><span></span><code><span class="n">df</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">8</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="p">]</span>
<span class="n">df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span>
<span class="p">[</span><span class="s1">'中國國民黨'</span><span class="p">,</span> <span class="s1">'民主進步黨'</span><span class="p">,</span> <span class="s1">'時代力量'</span><span class="p">],</span>
<span class="p">[</span><span class="s1">'區域'</span><span class="p">,</span> <span class="s1">'不分區'</span><span class="p">]</span>
<span class="p">]</span>
</code></pre></div>
<p>Both approaches return the same subset of the table.</p>
<table>
<thead>
<tr>
<th style="text-align: left;"></th>
<th style="text-align: right;">區域</th>
<th style="text-align: right;">不分區</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">中國國民黨</td>
<td style="text-align: right;">72</td>
<td style="text-align: right;">33</td>
</tr>
<tr>
<td style="text-align: left;">民主進步黨</td>
<td style="text-align: right;">60</td>
<td style="text-align: right;">34</td>
</tr>
<tr>
<td style="text-align: left;">時代力量</td>
<td style="text-align: right;">12</td>
<td style="text-align: right;">6</td>
</tr>
</tbody>
</table>
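<p>To make the two selections concrete, here is a minimal sketch on a three-party subset of the table. The subset and the variable names are my own illustration, so the positional indices differ from the ones used on the full table. With list indexers on both axes, <code>iloc</code> and <code>loc</code> each return the full rows-by-columns cross product.</p>

```python
import pandas as pd

# A hand-typed three-row subset of the nomination table (illustration only).
df = pd.DataFrame(
    {"區域": [72, 60, 12], "原住民": [5, 2, 0], "不分區": [33, 34, 6]},
    index=["中國國民黨", "民主進步黨", "時代力量"],
)

# Positional selection: rows 0 and 2, first and last columns.
by_position = df.iloc[[0, 2], [0, -1]]

# Label-based selection of the same cells.
by_label = df.loc[["中國國民黨", "時代力量"], ["區域", "不分區"]]

print(by_position)
print(by_position.equals(by_label))
```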
<h3 id="numpy-indexing">Numpy indexing</h3>
<p>I instinctively assumed <a href="http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#arrays-indexing">numpy indexing</a> would work the same way, since pandas is built on top of numpy arrays.</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">arr</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">values</span>
<span class="gp">>>> </span><span class="n">arr</span><span class="p">[:</span><span class="mi">5</span><span class="p">]</span>
<span class="go">array([[72, 5, 33],</span>
<span class="go"> [60, 2, 34],</span>
<span class="go"> [ 2, 0, 15],</span>
<span class="go"> [ 6, 1, 16],</span>
<span class="go"> [ 0, 1, 7]])</span>
<span class="gp">>>> </span><span class="n">arr</span><span class="p">[[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">8</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">]]</span>
<span class="gp">...</span>
<span class="go">IndexError: shape mismatch: indexing arrays could not be broadcast </span>
<span class="go">together with shapes (3,) (2,) </span>
</code></pre></div>
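<p>Going back to the <a href="http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#arrays-indexing">official docs</a> reminded me that in this case numpy treats the two lists as paired (x, y) coordinates and picks the elements one by one, which is why both lists must have the same length.</p>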
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">arr</span><span class="p">[[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">8</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]]</span>
<span class="go">array([72,  2,  6])</span>
</code></pre></div>
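<p>The pairing can be spelled out as a sketch. The array below is typed in by hand from the table at the top of the post, and the names <code>rows</code>, <code>cols</code>, and <code>by_hand</code> are only for illustration: indexing with two equal-length lists is equivalent to zipping them into coordinates.</p>

```python
import numpy as np

# Nomination counts copied from the table (columns: 區域, 原住民, 不分區).
arr = np.array([
    [72, 5, 33], [60, 2, 34], [2, 0, 15], [6, 1, 16], [0, 1, 7],
    [13, 1, 10], [11, 0, 6], [14, 0, 10], [12, 0, 6], [12, 0, 6],
])

rows = [0, 1, 8]
cols = [0, 1, 2]

# Two index lists are paired element-wise: (0, 0), (1, 1), (8, 2).
paired = arr[rows, cols]

# The same selection written as an explicit loop over coordinates.
by_hand = np.array([arr[r, c] for r, c in zip(rows, cols)])

print(paired)
print(np.array_equal(paired, by_hand))
```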
<p>The simple way is to select in two steps,</p>
<div class="highlight"><pre><span></span><code><span class="n">arr</span><span class="p">[[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">8</span><span class="p">],</span> <span class="p">:][:,</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">]]</span>
</code></pre></div>
<p>But this way numpy returns a copy twice<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>, which is inefficient when the data is large. So what should we do instead?</p>
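<p>Before looking at the alternatives, the copy-versus-view distinction can be checked directly. This is a small sketch on a stand-in array (<code>np.arange(30).reshape(10, 3)</code>, not the nomination data), using <code>np.shares_memory</code> to see whether a result still points at the original buffer.</p>

```python
import numpy as np

# Stand-in data with the same shape as the nomination table.
arr = np.arange(30).reshape(10, 3)

# Indexing with a list (fancy indexing) allocates a new array.
fancy = arr[[0, 1, 8], :]

# A start:end:step slice only creates a view onto the same memory.
sliced = arr[0:9:4, :]

print(np.shares_memory(arr, fancy))
print(np.shares_memory(arr, sliced))

# Writing through the view changes the original; the copy is unaffected.
sliced[0, 0] = -1
print(arr[0, 0], fancy[0, 0])
```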
<p>According to an answer on <a href="http://stackoverflow.com/a/30918530">Stack Overflow</a>, any of the approaches below works. The simplest is <a href="http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.ix_.html#numpy.ix_">numpy.ix_()</a>,</p>
<div class="highlight"><pre><span></span><code><span class="n">arr</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">ix_</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">8</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">])]</span>
</code></pre></div>
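<p>It may help to look at what <code>np.ix_</code> actually returns. In this sketch (again on a stand-in <code>np.arange</code> array, with illustrative variable names) it produces one index array per axis, shaped so that broadcasting them yields the full rows-by-columns grid, and the result matches the two-step selection.</p>

```python
import numpy as np

# Stand-in data with the same shape as the nomination table.
arr = np.arange(30).reshape(10, 3)

# np.ix_ returns a column-shaped row index and a row-shaped column index.
row_idx, col_idx = np.ix_([0, 1, 8], [0, 2])
print(row_idx.shape, col_idx.shape)

# Indexing with both at once selects the 3 x 2 block in one step.
block = arr[row_idx, col_idx]

# Same cells as selecting rows first and columns second.
two_step = arr[[0, 1, 8], :][:, [0, 2]]

print(block)
print(np.array_equal(block, two_step))
```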
<p>If you understand numpy's broadcasting mechanism,</p>
<div class="highlight"><pre><span></span><code><span class="c1"># indices must be numpy arrays so that a new axis can be added</span>
<span class="n">rows</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">8</span><span class="p">])</span>
<span class="n">cols</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">])</span>
<span class="n">arr</span><span class="p">[</span><span class="n">rows</span><span class="p">[:,</span> <span class="n">np</span><span class="o">.</span><span class="n">newaxis</span><span class="p">],</span> <span class="n">cols</span><span class="p">]</span>
<span class="c1"># np.newaxis is None</span>
<span class="n">arr</span><span class="p">[</span><span class="n">rows</span><span class="p">[:,</span> <span class="kc">None</span><span class="p">],</span> <span class="n">cols</span><span class="p">]</span>
</code></pre></div>
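<p>Why this works can be read off the shapes. In this sketch (stand-in array, illustrative names), a row index of shape (3, 1) and a column index of shape (2,) broadcast to (3, 2), so every row index gets paired with every column index.</p>

```python
import numpy as np

# Stand-in data with the same shape as the nomination table.
arr = np.arange(30).reshape(10, 3)

row_idx = np.array([0, 1, 8])
col_idx = np.array([0, 2])

# Adding an axis turns the row indices into a (3, 1) column vector.
column_vector = row_idx[:, np.newaxis]

# Broadcasting (3, 1) against (2,) gives a (3, 2) grid of index pairs.
grid_shape = np.broadcast(column_vector, col_idx).shape
print(column_vector.shape, col_idx.shape, grid_shape)

block = arr[column_vector, col_idx]
print(block)
```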
<p>Or build every index value explicitly,</p>
<div class="highlight"><pre><span></span><code><span class="n">indices</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">meshgrid</span><span class="p">(</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">8</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span>
<span class="n">indexing</span><span class="o">=</span><span class="s1">'ij'</span>
<span class="p">)</span>
<span class="n">arr</span><span class="p">[</span><span class="nb">tuple</span><span class="p">(</span><span class="n">indices</span><span class="p">)]</span>
</code></pre></div>
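<p>As a sketch of what <code>np.meshgrid</code> produces here (stand-in array, illustrative names): with <code>indexing='ij'</code> it returns two fully expanded (3, 2) index arrays, one holding the row index of every cell and one holding the column index. Passing them as two separate indices, as assumed below, avoids any ambiguity about how a sequence of arrays is interpreted.</p>

```python
import numpy as np

# Stand-in data with the same shape as the nomination table.
arr = np.arange(30).reshape(10, 3)

# Expand the two index lists into full (3, 2) grids.
grid_rows, grid_cols = np.meshgrid([0, 1, 8], [0, 2], indexing="ij")
print(grid_rows)
print(grid_cols)

# One index array per axis selects the whole block at once.
block = arr[grid_rows, grid_cols]
print(block)
```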
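<p>That's the summary; this is the kind of thing I forget whenever I haven't used it for a while.</p>
<h3 id="_1">Processing the raw Wikipedia data</h3>
<p>The raw data comes from <a href="https://zh.wikipedia.org/wiki/2016%E5%B9%B4%E4%B8%AD%E8%8F%AF%E6%B0%91%E5%9C%8B%E7%AB%8B%E6%B3%95%E5%A7%94%E5%93%A1%E9%81%B8%E8%88%89">this Wikipedia page</a>, and this is where pandas gets to show off its data-wrangling abilities. Newer versions offer more string-handling features, to the point that I forgot that the underlying numpy's unicode support is not that great XD</p>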
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">urllib.parse</span> <span class="kn">import</span> <span class="n">quote_plus</span>
<span class="n">dfs</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_html</span><span class="p">(</span>
<span class="s1">'https://zh.wikipedia.org/wiki/</span><span class="si">%s</span><span class="s1">'</span>
<span class="o">%</span> <span class="n">quote_plus</span><span class="p">(</span><span class="s1">'2016年中華民國立法委員選舉'</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span>
<span class="n">df</span> <span class="k">for</span> <span class="n">df</span> <span class="ow">in</span> <span class="n">dfs</span>
<span class="k">if</span> <span class="s1">'立法委員政黨提名名額'</span> <span class="ow">in</span> <span class="nb">str</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">])</span>
<span class="p">)</span>
<span class="c1"># Data cleaning</span>
<span class="n">df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="p">:]</span><span class="o">.</span><span class="n">values</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'政黨'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">'政黨'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span>
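<span class="sa">r</span><span class="s1">'\[(註 |)\d+\]'</span><span class="p">,</span> <span class="s1">''</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">True</span>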
<span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">'政黨'</span><span class="p">]</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">'-'</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">2</span><span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
</code></pre></div>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>In this case the data is copied on return, but simple indexing of the form <code>start:end:step</code> returns only a view. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>Plot Sequencing Depth with Gviz2016-01-15T23:50:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-01-15:/posts/2016/01/plot-seq-depth-gviz/<p><em><strong>TL;DR</strong> Plot exome sequencing depth and coverage with genome annotation using Gviz in R, then fine-tune how the Gviz annotation track is displayed.</em></p>
<p>This post extends <a href="https://blog.liang2.tw/posts/2015/12/biocondutor-genomic-data/">Genomic Data Processing in Bioconductor</a>, though I haven’t finished reading all the references in that post. It assumes a basic understanding of how to work with annotations and genome references in Bioconductor/R. If you haven’t dealt with genome annotations in R before, it is worth finding some time to learn; it is a real lifesaver.</p>
<div class="toc">
<ul>
<li><a href="#convert-sequencing-depth-to-bedgraph-format">Convert sequencing depth to BedGraph format</a></li>
<li><a href="#plot-depth-in-gviz">Plot depth in Gviz</a><ul>
<li><a href="#first-gviz-track">First Gviz track</a></li>
<li><a href="#add-genome-axis">Add genome axis</a></li>
<li><a href="#add-annotation">Add annotation</a></li>
</ul>
</li>
<li><a href="#plot-fine-tune">Plot fine tune</a><ul>
<li><a href="#genome-annotation-query-in-bioconductorr">Genome annotation query in Bioconductor/R</a><ul>
<li><a href="#via-transcripts">via transcripts()</a></li>
<li><a href="#via-exonsby">via exonsBy()</a></li>
</ul>
</li>
<li><a href="#show-only-the-annotations-of-certain-genes">Show only the annotations of certain genes</a></li>
<li><a href="#display-gene-symbols-at-annotation-track">Display gene symbols at annotation track</a></li>
</ul>
</li>
<li><a href="#summary">Summary</a></li>
<li><a href="#supplementary-plot-bam-files-directly">Supplementary - Plot BAM files directly</a><ul>
<li><a href="#fancier-alignment-display">Fancier alignment display</a></li>
</ul>
</li>
</ul>
</div>
<p>I got the chance to try some new tricks today while other lab members and I were analyzing our human cancer exome sequencing data. The results were a bunch of BAM files aligned by <a href="https://github.com/lh3/bwa">BWA-MEM</a> against the hg19 reference.</p>
<p>We wanted to see the sequencing depth and the coverage of all the exons targeted for sequencing. Roughly, this can be done in a genome viewer such as <a href="https://www.broadinstitute.org/igv/">IGV</a>.</p>
<figure>
<img src="https://blog.liang2.tw/posts/2016/01/plot-seq-depth-gviz/pics/seqdepth_IGV.png"/>
<p class="caption center">Visualize sequencing depth in IGV</p>
</figure>
<p>IGV is good for daily research, but it offers few options for customization. And if the visualization is meant for publication, one might want the figure to be in a vector format and, more importantly, <em>reproducible</em>.</p>
<p>Therefore, combining what I learned in <a href="https://blog.liang2.tw/posts/2015/12/biocondutor-genomic-data/">Genomic Data Processing in Bioconductor</a>, I tried to plot the sequencing depth in R with <a href="https://bioconductor.org/packages/release/bioc/html/Gviz.html">Gviz</a>. I thought learning Gviz would be demanding, since its vignette has 80 pages and the function documentation reads like <a href="http://rpackages.ianhowson.com/bioc/Gviz/man/GeneRegionTrack-class.html">scarily long spells</a>. But both turned out to be <em>really</em> helpful and informative, especially when tuning its behavior. Figures produced by Gviz are aesthetically pleasing, and Gviz has many features as well (I’m still exploring them). I’m glad that I gave it a shot.</p>
<p>If you want to follow the code yourself, any human BAM alignment files will do. For example, the GEO dataset <a href="http://dev.3dvcell.org/geo/query/acc.cgi?acc=GSE48215">GSE48215</a> contains exome sequencing of breast cancer cell lines.</p>
<h2 id="convert-sequencing-depth-to-bedgraph-format">Convert sequencing depth to BedGraph format</h2>
<p>A quick search showed that Gviz’s <a href="http://rpackages.ianhowson.com/bioc/Gviz/man/DataTrack-class.html">DataTrack</a> accepts the BedGraph format. This format can represent any numerical value over chromosome ranges, as shown below:</p>
<table>
<thead>
<tr>
<th style="text-align: left;">chromosome</th>
<th>start</th>
<th>end</th>
<th style="text-align: right;">value</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">chr1</td>
<td>10,051</td>
<td>10,093</td>
<td style="text-align: right;">2</td>
</tr>
<tr>
<td style="text-align: left;">chr1</td>
<td>10,093</td>
<td>10,104</td>
<td style="text-align: right;">5</td>
</tr>
<tr>
<td style="text-align: left;">…</td>
<td>…</td>
<td>…</td>
<td style="text-align: right;">…</td>
</tr>
</tbody>
</table>
<p>So we need to convert the alignment results to BedGraph format, which can be done with <a href="http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html">BEDTools’ genomecov</a> command. The BEDTools documentation notes that the BAM file should be sorted.</p>
<div class="highlight"><pre><span></span><code>bedtools<span class="w"> </span>genomecov<span class="w"> </span>-bg<span class="w"> </span>-ibam<span class="w"> </span>myseq.bam<span class="w"> </span>><span class="w"> </span>myseq.bedGraph
</code></pre></div>
<p>The plain-text BedGraph can be huge; piping it through gzip reduces the file size to around 30% of the original.</p>
<div class="highlight"><pre><span></span><code>bedtools<span class="w"> </span>genomecov<span class="w"> </span>-bg<span class="w"> </span>-ibam<span class="w"> </span>myseq.bam<span class="w"> </span><span class="p">|</span><span class="w"> </span>gzip<span class="w"> </span>><span class="w"> </span>myseq.bedGraph.gz
</code></pre></div>
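<p>As a quick sanity check outside R, the gzip'd BedGraph can also be inspected in Python. This is only a sketch under my own assumptions, not part of the original workflow: <code>read_bedgraph</code> and <code>mean_depth</code> are hypothetical helper names, the file is assumed to have no track header line, and the demo file is generated on the fly from the two example rows in the table above. BedGraph intervals are 0-based and half-open, so each interval covers <code>end - start</code> bases.</p>

```python
import gzip
import os
import tempfile

import pandas as pd


def read_bedgraph(path):
    """Read a 4-column, tab-separated BedGraph; pandas infers gzip from the .gz suffix."""
    return pd.read_csv(
        path, sep="\t", header=None,
        names=["chromosome", "start", "end", "value"],
    )


def mean_depth(bedgraph):
    """Length-weighted mean depth over the intervals present in the table."""
    widths = bedgraph["end"] - bedgraph["start"]
    return (widths * bedgraph["value"]).sum() / widths.sum()


# Demo on a tiny generated file (the two example rows from the table above).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bedGraph.gz")
    with gzip.open(demo_path, "wt") as fh:
        fh.write("chr1\t10051\t10093\t2\nchr1\t10093\t10104\t5\n")
    demo = read_bedgraph(demo_path)

print(demo)
print(mean_depth(demo))
```

<p>Note that <code>genomecov -bg</code> only reports intervals with non-zero coverage, so this mean is over covered bases rather than over the whole genome.</p>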
<h2 id="plot-depth-in-gviz">Plot depth in Gviz</h2>
<p>R packages for human genome annotations (<a href="http://bioconductor.org/packages/release/data/annotation/html/Homo.sapiens.html">Homo.sapiens</a>) and <a href="https://bioconductor.org/packages/release/bioc/html/Gviz.html">Gviz</a> itself are required. <a href="https://cran.r-project.org/web/packages/data.table/index.html">data.table</a> is also recommended, since it reads text tables impressively fast. During the analysis I happened to learn that data.table supports <a href="https://github.com/Rdatatable/data.table/issues/717">reading gzip’d files through a pipe</a>, which makes it even more awesome.</p>
<h3 id="first-gviz-track">First Gviz track</h3>
<p>We first read the sequencing depth in BedGraph format and plot it.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">data.table</span><span class="p">)</span>
<span class="nf">library</span><span class="p">(</span><span class="n">Gviz</span><span class="p">)</span>
<span class="n">bedgraph_dt</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">fread</span><span class="p">(</span>
<span class="w"> </span><span class="s">'./coverage.bedGraph'</span><span class="p">,</span>
<span class="w"> </span><span class="n">col.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s">'chromosome'</span><span class="p">,</span><span class="w"> </span><span class="s">'start'</span><span class="p">,</span><span class="w"> </span><span class="s">'end'</span><span class="p">,</span><span class="w"> </span><span class="s">'value'</span><span class="p">)</span>
<span class="p">)</span>
<span class="c1"># Specify the range to plot</span>
<span class="n">thechr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s">"chr17"</span>
<span class="n">st</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">41176e3</span>
<span class="n">en</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">41324e3</span>
<span class="n">bedgraph_dt_one_chr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bedgraph_dt</span><span class="p">[</span><span class="n">chromosome</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">thechr</span><span class="p">]</span>
<span class="n">dtrack</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">DataTrack</span><span class="p">(</span>
<span class="w"> </span><span class="n">range</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">bedgraph_dt_one_chr</span><span class="p">,</span>
<span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"a"</span><span class="p">,</span>
<span class="w"> </span><span class="n">genome</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">'hg19'</span><span class="p">,</span>
<span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"Seq. Depth"</span>
<span class="p">)</span>
<span class="nf">plotTracks</span><span class="p">(</span>
<span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">dtrack</span><span class="p">),</span>
<span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">st</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">en</span>
<span class="p">)</span>
</code></pre></div>
<p>So we read the sequencing depth data, create a Gviz <code>DataTrack</code> holding the subset of our data on chr17, then plot Gviz tracks by <code>plotTracks</code> (though we only made one here) within a given chromosome region. Here is what we got.</p>
<figure>
<img src="https://blog.liang2.tw/posts/2016/01/plot-seq-depth-gviz/pics/seqdepth_one_track.png"/>
</figure>
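<p>The code above hands the whole chromosome to <code>DataTrack</code> and lets <code>plotTracks</code> crop it. If the table is large, one could also subset it to the plotting window beforehand. The following is only a sketch on made-up intervals; the names <code>toy_dt</code>, <code>win_chr</code>, <code>win_st</code> and <code>win_en</code> are hypothetical, and it assumes the usual BedGraph convention of 0-based, end-exclusive intervals.</p>

```r
library(data.table)

# Made-up BedGraph intervals (0-based start, end-exclusive)
toy_dt <- data.table(
    chromosome = c("chr17", "chr17", "chr17", "chr1"),
    start = c(41000000L, 41200000L, 41400000L, 41200000L),
    end   = c(41000100L, 41200100L, 41400100L, 41200100L),
    value = c(3L, 8L, 1L, 4L)
)
win_chr <- "chr17"
win_st <- 41176e3
win_en <- 41324e3

# Keep only the intervals on the chromosome that overlap the window
toy_window <- toy_dt[chromosome == win_chr & end > win_st & start < win_en]
```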
<h3 id="add-genome-axis">Add genome axis</h3>
<p>The figure looks a bit odd and lacks information without the genomic location.</p>
<p>Gviz can add the genomic location automatically through a new track, <code>GenomeAxisTrack</code>. We’d also like to show which region of the chromosome we are looking at. This can be done by adding another track, <code>IdeogramTrack</code>, to show the chromosome ideogram. Note that the latter track downloads cytoband data from UCSC, so the given genome must have a valid name.</p>
<div class="highlight"><pre><span></span><code><span class="n">itrack</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">IdeogramTrack</span><span class="p">(</span>
<span class="w"> </span><span class="n">genome</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"hg19"</span><span class="p">,</span><span class="w"> </span><span class="n">chromosome</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">thechr</span>
<span class="p">)</span>
<span class="n">gtrack</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">GenomeAxisTrack</span><span class="p">()</span>
<span class="nf">plotTracks</span><span class="p">(</span>
<span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">itrack</span><span class="p">,</span><span class="w"> </span><span class="n">gtrack</span><span class="p">,</span><span class="w"> </span><span class="n">dtrack</span><span class="p">),</span>
<span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">st</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">en</span>
<span class="p">)</span>
</code></pre></div>
<figure>
<img src="https://blog.liang2.tw/posts/2016/01/plot-seq-depth-gviz/pics/seqdepth_with_loc.png"/>
</figure>
<p>Better now :)</p>
<h3 id="add-annotation">Add annotation</h3>
<p>Since we are using exome sequencing, the curve of sequencing depth only makes sense when combined with the transcript annotations.</p>
<p>Gviz has <code>GeneRegionTrack</code> to extract annotations from R annotation packages. The Homo.sapiens package includes a gene annotation package based on the UCSC knownGene database. Adding this new track puts the annotation on our plot.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">TxDb.Hsapiens.UCSC.hg19.knownGene</span><span class="p">)</span>
<span class="n">txdb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">TxDb.Hsapiens.UCSC.hg19.knownGene</span>
<span class="n">grtrack</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">GeneRegionTrack</span><span class="p">(</span>
<span class="w"> </span><span class="n">txdb</span><span class="p">,</span>
<span class="w"> </span><span class="n">chromosome</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">thechr</span><span class="p">,</span><span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">st</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">en</span><span class="p">,</span>
<span class="w"> </span><span class="n">showId</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span>
<span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"Gene Annotation"</span>
<span class="p">)</span>
<span class="nf">plotTracks</span><span class="p">(</span>
<span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">itrack</span><span class="p">,</span><span class="w"> </span><span class="n">gtrack</span><span class="p">,</span><span class="w"> </span><span class="n">dtrack</span><span class="p">,</span><span class="w"> </span><span class="n">grtrack</span><span class="p">),</span>
<span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">st</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">en</span>
<span class="p">)</span>
</code></pre></div>
<figure>
<img src="https://blog.liang2.tw/posts/2016/01/plot-seq-depth-gviz/pics/seqdepth_with_annotation.png"/>
</figure>
<p>The plot should now be as informative as what we can get from IGV. In fact, Gviz can plot the alignment result too. It can read the BAM file directly and show a more detailed coverage that matches what IGV can do. I’ll leave that part to the end of this post.</p>
<p>So far we’ve shown the sequencing depth of a chromosome region with annotation. However, it still leaves something to be desired, mostly about the annotation:</p>
<ul>
<li>Can we show only the annotation of certain genes?</li>
<li>knownGene’s identifiers are nearly meaningless; can we show the gene symbol instead?</li>
</ul>
<p>So here comes the second part, annotation fine tuning.</p>
<h2 id="plot-fine-tune">Plot fine tune</h2>
<p>Say we only care about the gene <em>BRCA1</em>. We need to get its location or, more specifically, the genomic range that covers all <em>BRCA1</em> isoforms. In the following example, I will demonstrate how to fine-tune Gviz’s annotation.</p>
<h3 id="genome-annotation-query-in-bioconductorr">Genome annotation query in Bioconductor/R</h3>
<p>If you are not familiar with how to query annotations in Bioconductor, it’s easier to think by breaking our goal of finding <em>BRCA1</em>’s ranges into two steps:</p>
<ol>
<li>Get the transcript IDs</li>
<li>Query the transcript locations by their IDs</li>
</ol>
<p>Getting transcript IDs given their gene symbol is a <code>select()</code> on OrganismDb object,</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Get all transcript IDs of gene BRCA1</span>
<span class="n">BRCA1_txnames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">select</span><span class="p">(</span>
<span class="w"> </span><span class="n">Homo.sapiens</span><span class="p">,</span>
<span class="w"> </span><span class="n">keys</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"BRCA1"</span><span class="p">,</span><span class="w"> </span><span class="n">keytype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"SYMBOL"</span><span class="p">,</span>
<span class="w"> </span><span class="n">columns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s">"ENTREZID"</span><span class="p">,</span><span class="w"> </span><span class="s">"TXNAME"</span><span class="p">)</span>
<span class="p">)</span><span class="o">$</span><span class="n">TXNAME</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="gp">> </span><span class="n">BRCA1_txnames</span>
<span class="go"> [1] "uc010whl.2" "uc002icp.4" "uc010whm.2" "uc002icu.3"</span>
<span class="go"> [5] "uc010cyx.3" "uc002icq.3" "uc002ict.3" "uc010whn.2"</span>
<span class="go"> [9] "uc010who.3" "uc010whp.2" "uc010whq.1" "uc002idc.1"</span>
<span class="go">[13] "uc010whr.1" "uc002idd.3" "uc002ide.1" "uc010cyy.1"</span>
<span class="go">[17] "uc010whs.1" "uc010cyz.2" "uc010cza.2" "uc010wht.1"</span>
</code></pre></div>
<p>Looks like it has plenty of isoforms!</p>
<h4 id="via-transcripts">via <code>transcripts()</code></h4>
<p>For the transcript locations, the easiest way is to query the TxDb via <code>transcripts()</code>,</p>
<div class="highlight"><pre><span></span><code><span class="n">BRCA1_txs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">transcripts</span><span class="p">(</span>
<span class="w"> </span><span class="n">Homo.sapiens</span><span class="p">,</span>
<span class="w"> </span><span class="n">vals</span><span class="o">=</span><span class="nf">list</span><span class="p">(</span><span class="n">tx_name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">BRCA1_txnames</span><span class="p">),</span>
<span class="w"> </span><span class="n">columns</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="s">"TXNAME"</span><span class="p">,</span><span class="s">"SYMBOL"</span><span class="p">,</span><span class="w"> </span><span class="s">"EXONID"</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="gp">> </span><span class="n">BRCA1_txs</span>
<span class="go">GRanges object with 20 ranges and 3 metadata columns:</span>
<span class="go"> seqnames ranges strand | EXONID TXNAME SYMBOL</span>
<span class="go"> <Rle> <IRanges> <Rle> | <IntegerList> <CharacterList> <CharacterList></span>
<span class="go"> [1] chr17 [41196312, 41276132] - | 227486,227485,227482,... uc010whl.2 BRCA1</span>
<span class="go"> [2] chr17 [41196312, 41277340] - | 227487,227486,227485,... uc002icp.4 BRCA1</span>
<span class="go"> [3] chr17 [41196312, 41277340] - | 227487,227464,227463,... uc010whm.2 BRCA1</span>
<span class="go"> [4] chr17 [41196312, 41277468] - | 227489,227486,227485,... uc002icu.3 BRCA1</span>
<span class="go"> [5] chr17 [41196312, 41277468] - | 227489,227486,227482,... uc010cyx.3 BRCA1</span>
<span class="go"> ... ... ... ... ... ... ... ...</span>
<span class="go"> [16] chr17 [41243452, 41277340] - | 227487,227486,227485,... uc010cyy.1 BRCA1</span>
<span class="go"> [17] chr17 [41243452, 41277468] - | 227489,227486,227485,... uc010whs.1 BRCA1</span>
<span class="go"> [18] chr17 [41243452, 41277500] - | 227488,227486,227485,... uc010cyz.2 BRCA1</span>
<span class="go"> [19] chr17 [41243452, 41277500] - | 227488,227486,227485,... uc010cza.2 BRCA1</span>
<span class="go"> [20] chr17 [41243452, 41277500] - | 227488,227474 uc010wht.1 BRCA1</span>
<span class="go"> -------</span>
<span class="go"> seqinfo: 93 sequences (1 circular) from hg19 genome</span>
</code></pre></div>
<p>Then get the genomic range of these transcripts with the <code>seqnames()</code>, <code>start()</code> and <code>end()</code> functions on the <code>GRanges</code> object,</p>
<div class="highlight"><pre><span></span><code>thechr <- as.character(unique(
seqnames(BRCA1_txs)
))
st <- min(start(BRCA1_txs)) - 2e4
en <- max(end(BRCA1_txs)) + 1e3
</code></pre></div>
<p>Some space is added at both ends so the plot doesn’t fit the transcripts too tightly and leaves some room for the transcript names.</p>
<div class="highlight"><pre><span></span><code><span class="gp">> </span><span class="nf">c</span><span class="p">(</span><span class="n">thechr</span><span class="p">,</span><span class="w"> </span><span class="n">st</span><span class="p">,</span><span class="w"> </span><span class="n">en</span><span class="p">)</span>
<span class="go">[1] "chr17" "41176312" "41323420"</span>
</code></pre></div>
<h4 id="via-exonsby">via <code>exonsBy()</code></h4>
<p>Another way to obtain the genomic range is to get the exact ranges of the exons (which cover both the CDS and the UTRs) for each transcript via <code>exonsBy()</code>.</p>
<div class="highlight"><pre><span></span><code><span class="n">BRCA1_cds_by_tx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">exonsBy</span><span class="p">(</span>
<span class="w"> </span><span class="n">Homo.sapiens</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="o">=</span><span class="s">"tx"</span><span class="p">,</span><span class="w"> </span><span class="n">use.names</span><span class="o">=</span><span class="kc">TRUE</span>
<span class="p">)[</span><span class="n">BRCA1_txnames</span><span class="p">]</span>
</code></pre></div>
<p>The function returns a <code>GRangesList</code> object, a list of <code>GRanges</code> objects in which each <code>GRanges</code> corresponds to one transcript.</p>
<div class="highlight"><pre><span></span><code><span class="gp">> </span><span class="n">BRCA1_cds_by_tx</span>
<span class="go">GRangesList object of length 20:</span>
<span class="go">$uc010whl.2</span>
<span class="go">GRanges object with 22 ranges and 3 metadata columns:</span>
<span class="go"> seqnames ranges strand | exon_id exon_name exon_rank</span>
<span class="go"> <Rle> <IRanges> <Rle> | <integer> <character> <integer></span>
<span class="go"> [1] chr17 [41276034, 41276132] - | 227486 <NA> 1</span>
<span class="go"> [2] chr17 [41267743, 41267796] - | 227485 <NA> 2</span>
<span class="go"> [3] chr17 [41258473, 41258550] - | 227482 <NA> 3</span>
<span class="go"> [4] chr17 [41256885, 41256973] - | 227481 <NA> 4</span>
<span class="go"> [5] chr17 [41256139, 41256278] - | 227480 <NA> 5</span>
<span class="go"> ... ... ... ... ... ... ... ...</span>
<span class="go"> [18] chr17 [41209069, 41209152] - | 227462 <NA> 18</span>
<span class="go"> [19] chr17 [41203080, 41203134] - | 227461 <NA> 19</span>
<span class="go"> [20] chr17 [41201138, 41201211] - | 227459 <NA> 20</span>
<span class="go"> [21] chr17 [41199660, 41199720] - | 227458 <NA> 21</span>
<span class="go"> [22] chr17 [41196312, 41197819] - | 227457 <NA> 22</span>
<span class="go">...</span>
<span class="go"><19 more elements></span>
<span class="go">-------</span>
<span class="go">seqinfo: 93 sequences (1 circular) from hg19 genome</span>
</code></pre></div>
<p><code>GRangesList</code> is more than a plain R list: it correctly propagates the GRanges-related functions to all the GRanges it contains.</p>
<div class="highlight"><pre><span></span><code><span class="gp">> </span><span class="nf">start</span><span class="p">(</span><span class="n">BRCA1_cds_by_tx</span><span class="p">)</span>
<span class="go">IntegerList of length 20</span>
<span class="go">[["uc010whl.2"]] 41276034 41267743 41258473 ... 41201138 41199660 41196312</span>
<span class="go">[["uc002icp.4"]] 41277199 41276034 41267743 ... 41201138 41199660 41196312</span>
<span class="go">...</span>
</code></pre></div>
<p>Here we only care about the widest range, so the hierarchical structure is not useful. It is better to flatten the <code>GRangesList</code> first,</p>
<div class="highlight"><pre><span></span><code><span class="gp">> </span><span class="n">BRCA1_cds_flatten</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">unlist</span><span class="p">(</span><span class="n">BRCA1_cds_by_tx</span><span class="p">)</span>
<span class="gp">> </span><span class="n">BRCA1_cds_flatten</span>
<span class="go">GRanges object with 284 ranges and 3 metadata columns:</span>
<span class="go"> seqnames ranges strand | exon_id exon_name exon_rank</span>
<span class="go"> <Rle> <IRanges> <Rle> | <integer> <character> <integer></span>
<span class="go"> uc010whl.2 chr17 [41276034, 41276132] - | 227486 <NA> 1</span>
<span class="go"> uc010whl.2 chr17 [41267743, 41267796] - | 227485 <NA> 2</span>
<span class="go"> uc010whl.2 chr17 [41258473, 41258550] - | 227482 <NA> 3</span>
<span class="go"> uc010whl.2 chr17 [41256885, 41256973] - | 227481 <NA> 4</span>
<span class="go"> uc010whl.2 chr17 [41256139, 41256278] - | 227480 <NA> 5</span>
<span class="go"> ... ... ... ... ... ... ... ...</span>
<span class="go"> uc010cza.2 chr17 [41249261, 41249306] - | 227477 <NA> 7</span>
<span class="go"> uc010cza.2 chr17 [41247863, 41247939] - | 227476 <NA> 8</span>
<span class="go"> uc010cza.2 chr17 [41243452, 41246877] - | 227474 <NA> 9</span>
<span class="go"> uc010wht.1 chr17 [41277288, 41277500] - | 227488 <NA> 1</span>
<span class="go"> uc010wht.1 chr17 [41243452, 41246877] - | 227474 <NA> 2</span>
<span class="go"> -------</span>
<span class="go"> seqinfo: 93 sequences (1 circular) from hg19 genome</span>
</code></pre></div>
<p>Now that we have the <em>BRCA1</em> genomic region, the rest of the plotting is the same.</p>
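<p>For completeness, here is a sketch of how the plotting window could be derived from a flattened <code>GRanges</code>, using the same padding as in the <code>transcripts()</code> approach. To keep it self-contained, it builds a hand-made stand-in (<code>toy_exons</code>, a hypothetical name) from only the first and last exons of uc010whl.2 shown above, so the resulting window is narrower than the one for all 20 transcripts.</p>

```r
library(GenomicRanges)

# Hand-built stand-in for the flattened exons: first and last exon of uc010whl.2
toy_exons <- GRanges(
    seqnames = "chr17",
    ranges = IRanges(
        start = c(41276034, 41196312),
        end = c(41276132, 41197819)
    ),
    strand = "-"
)

# Same padding as before: 20 kb before the start and 1 kb after the end
toy_chr <- as.character(unique(seqnames(toy_exons)))
toy_st <- min(start(toy_exons)) - 2e4
toy_en <- max(end(toy_exons)) + 1e3
```

<p>Applied to the real flattened object, the same three lines would give the chromosome and window to pass on to <code>plotTracks</code>.</p>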
<h3 id="show-only-the-annotations-of-certain-genes">Show only the annotations of certain genes</h3>
<p>Before we start to create our own annotation subset, let’s first take a look at what Gviz generated. The <code>GeneRegionTrack</code> track stores its annotation data in the slot <code>range</code>.</p>
<div class="highlight"><pre><span></span><code><span class="gp">> </span><span class="n">grtrack</span><span class="o">@</span><span class="n">range</span>
<span class="go">GRanges object with 459 ranges and 7 metadata columns:</span>
<span class="go"> seqnames ranges strand | feature id exon transcript gene symbol density</span>
<span class="go"> <Rle> <IRanges> <Rle> | <character> <character> <character> <character> <character> <character> <numeric></span>
<span class="go"> [1] chr17 [41177258, 41177364] + | utr5 unknown uc002icn.3_1 uc002icn.3 8153 uc002icn.3 1</span>
<span class="go"> [2] chr17 [41177365, 41177466] + | CDS unknown uc002icn.3_1 uc002icn.3 8153 uc002icn.3 1</span>
<span class="go"> [3] chr17 [41177977, 41178064] + | CDS unknown uc002icn.3_2 uc002icn.3 8153 uc002icn.3 1</span>
<span class="go"> [4] chr17 [41179200, 41179309] + | CDS unknown uc002icn.3_3 uc002icn.3 8153 uc002icn.3 1</span>
<span class="go"> [5] chr17 [41180078, 41180212] + | CDS unknown uc002icn.3_4 uc002icn.3 8153 uc002icn.3 1</span>
<span class="go"> ... ... ... ... ... ... ... ... ... ... ... ...</span>
<span class="go"> [455] chr17 [41277294, 41277468] - | utr5 unknown uc010cyx.3_1 uc010cyx.3 672 uc010cyx.3 1</span>
<span class="go"> [456] chr17 [41277294, 41277468] - | utr5 unknown uc002idc.1_1 uc002idc.1 672 uc002idc.1 1</span>
<span class="go"> [457] chr17 [41277294, 41277468] - | utr5 unknown uc010whr.1_1 uc010whr.1 672 uc010whr.1 1</span>
<span class="go"> [458] chr17 [41277294, 41277468] - | utr5 unknown uc010whs.1_1 uc010whs.1 672 uc010whs.1 1</span>
<span class="go"> [459] chr17 [41322143, 41322420] - | utr5 unknown uc010whp.2_1 uc010whp.2 672 uc010whp.2 1</span>
<span class="go"> -------</span>
<span class="go"> seqinfo: 1 sequence from hg19 genome; no seqlengths</span>
</code></pre></div>
<p>So we filter out unrelated ranges by checking if the value of metadata column <code>transcript</code> is one of <em>BRCA1</em>’s transcript IDs,</p>
<div class="highlight"><pre><span></span><code><span class="n">BRCA_only_range</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">grtrack</span><span class="o">@</span><span class="n">range</span><span class="p">[</span>
<span class="w"> </span><span class="nf">mcols</span><span class="p">(</span><span class="n">grtrack</span><span class="o">@</span><span class="n">range</span><span class="p">)</span><span class="o">$</span><span class="n">transcript</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">BRCA1_txnames</span>
<span class="p">]</span>
<span class="n">grtrack</span><span class="o">@</span><span class="n">range</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">BRCA_only_range</span>
</code></pre></div>
<p>or, in a less hacky way, use the new range to construct another <code>GeneRegionTrack</code>,</p>
<div class="highlight"><pre><span></span><code><span class="n">grtrack_BRCA_only</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">GeneRegionTrack</span><span class="p">(</span>
<span class="w"> </span><span class="n">BRCA_only_range</span><span class="p">,</span>
<span class="w"> </span><span class="n">chromosome</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">thechr</span><span class="p">,</span><span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">st</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">en</span><span class="p">,</span>
<span class="w"> </span><span class="n">showId</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span>
<span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"Gene Annotation (BRCA1 only)"</span>
<span class="p">)</span>
<span class="nf">plotTracks</span><span class="p">(</span>
<span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">itrack</span><span class="p">,</span><span class="w"> </span><span class="n">gtrack</span><span class="p">,</span><span class="w"> </span><span class="n">dtrack</span><span class="p">,</span><span class="w"> </span><span class="n">grtrack_BRCA_only</span><span class="p">),</span>
<span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">st</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">en</span>
<span class="p">)</span>
</code></pre></div>
<figure>
<img src="https://blog.liang2.tw/posts/2016/01/plot-seq-depth-gviz/pics/seqdepth_BRCA1_only.png"/>
</figure>
<h3 id="display-gene-symbols-at-annotation-track">Display gene symbols at annotation track</h3>
<p>It is now clearer how Gviz stores the annotation. All we need to do is replace the symbol with whatever we want.</p>
<p>First, we extract the metadata of the <code>GeneRegionTrack</code>, and query for their gene symbols. Using either the transcript ID or Entrez ID will do.</p>
<div class="highlight"><pre><span></span><code><span class="n">grtrack_range</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">grtrack</span><span class="o">@</span><span class="n">range</span>
<span class="n">range_mapping</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">select</span><span class="p">(</span>
<span class="w"> </span><span class="n">Homo.sapiens</span><span class="p">,</span>
<span class="w"> </span><span class="n">keys</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">mcols</span><span class="p">(</span><span class="n">grtrack_range</span><span class="p">)</span><span class="o">$</span><span class="n">symbol</span><span class="p">,</span>
<span class="w"> </span><span class="n">keytype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"TXNAME"</span><span class="p">,</span>
<span class="w"> </span><span class="n">columns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s">"ENTREZID"</span><span class="p">,</span><span class="w"> </span><span class="s">"SYMBOL"</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="gp">> </span><span class="nf">head</span><span class="p">(</span><span class="n">range_mapping</span><span class="p">)</span>
<span class="go"> TXNAME SYMBOL ENTREZID</span>
<span class="go">1 uc002icn.3 RND2 8153</span>
<span class="go">2 uc002icn.3 RND2 8153</span>
<span class="go">3 uc002icn.3 RND2 8153</span>
<span class="go">4 uc002icn.3 RND2 8153</span>
<span class="go">5 uc002icn.3 RND2 8153</span>
<span class="go">6 uc002icn.3 RND2 8153</span>
</code></pre></div>
<p>Then we concatenate the information of transcript ID and gene symbol using <a href="https://cran.r-project.org/web/packages/stringr/index.html">stringr</a>.</p>
<div class="highlight"><pre><span></span><code><span class="nf">library</span><span class="p">(</span><span class="n">stringr</span><span class="p">)</span>
<span class="n">new_symbols</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">with</span><span class="p">(</span>
<span class="w"> </span><span class="n">range_mapping</span><span class="p">,</span>
<span class="w"> </span><span class="nf">str_c</span><span class="p">(</span><span class="n">SYMBOL</span><span class="p">,</span><span class="w"> </span><span class="s">" ("</span><span class="p">,</span><span class="w"> </span><span class="n">TXNAME</span><span class="p">,</span><span class="w"> </span><span class="s">")"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">""</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="gp">> </span><span class="nf">head</span><span class="p">(</span><span class="nf">unique</span><span class="p">(</span><span class="n">new_symbols</span><span class="p">))</span>
<span class="go">[1] "RND2 (uc002icn.3)" "NBR2 (uc002idf.3)" "NBR2 (uc010czb.2)"</span>
<span class="go">[4] "NBR2 (uc002idg.3)" "NBR2 (uc002idh.3)" "NBR1 (uc010czd.3)"</span>
</code></pre></div>
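<p>One thing worth checking at this step is that the labels line up one-to-one with the ranges in the track, since a lookup table is not guaranteed to come back with one row per range in the same order. The sketch below uses base R only and made-up values (<code>tx_ids</code>, <code>lookup</code> and <code>toy_symbols</code> are hypothetical names) to show one way of aligning labels to ranges explicitly with <code>match()</code>.</p>

```r
# Made-up values standing in for the transcript column of the track's ranges
tx_ids <- c("uc002icn.3", "uc002icn.3", "uc002idf.3")

# Made-up lookup table with one row per transcript, in a different order
lookup <- data.frame(
    TXNAME = c("uc002idf.3", "uc002icn.3"),
    SYMBOL = c("NBR2", "RND2"),
    stringsAsFactors = FALSE
)

# For each range, find the row of its transcript in the lookup table
idx <- match(tx_ids, lookup$TXNAME)
toy_symbols <- paste0(lookup$SYMBOL[idx], " (", tx_ids, ")")

# Fail early if any transcript is missing or the lengths differ
stopifnot(!anyNA(idx), length(toy_symbols) == length(tx_ids))
```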
<p>As when we extracted the <em>BRCA1</em>-only annotations, we construct a new <code>GeneRegionTrack</code>.</p>
<div class="highlight"><pre><span></span><code><span class="n">grtrack_symbol</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">GeneRegionTrack</span><span class="p">(</span>
<span class="w"> </span><span class="n">grtrack</span><span class="o">@</span><span class="n">range</span><span class="p">,</span>
<span class="w"> </span><span class="n">chromosome</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">thechr</span><span class="p">,</span><span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">st</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">en</span><span class="p">,</span>
<span class="w"> </span><span class="n">showId</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span>
<span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"Gene Annotation w. Symbol"</span>
<span class="p">)</span>
<span class="nf">symbol</span><span class="p">(</span><span class="n">grtrack_symbol</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">new_symbols</span>
<span class="nf">plotTracks</span><span class="p">(</span>
<span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">itrack</span><span class="p">,</span><span class="w"> </span><span class="n">gtrack</span><span class="p">,</span><span class="w"> </span><span class="n">dtrack</span><span class="p">,</span><span class="w"> </span><span class="n">grtrack_symbol</span><span class="p">),</span>
<span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">st</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">en</span>
<span class="p">)</span>
</code></pre></div>
<figure>
<img src="https://blog.liang2.tw/posts/2016/01/plot-seq-depth-gviz/pics/seqdepth_gene_symbol.png"/>
</figure>
<h2 id="summary">Summary</h2>
<p>So we’ve learned how to plot sequencing depth using Gviz. Next, go explore other data tracks or try combining the sequencing depth of multiple samples. I find the design of Gviz clean and easy to modify, and I think I’ll use it whenever I need genome-related plots.</p>
<p>Really glad I’ve tried it :)</p>
<h2 id="supplementary-plot-bam-files-directly">Supplementary - Plot BAM files directly</h2>
<p>We will start by replacing <code>DataTrack</code> with <code>AlignmentsTrack</code>. Also we select a smaller region this time so the read mapping can be clearly seen.</p>
<div class="highlight"><pre><span></span><code><span class="n">st</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">41.196e6L</span>
<span class="n">en</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">41.202e6L</span>
<span class="n">gtrack</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">GenomeAxisTrack</span><span class="p">(</span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="c1"># set the font size larger</span>
<span class="n">altrack</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">AlignmentsTrack</span><span class="p">(</span>
<span class="w"> </span><span class="s">"myseq.bam"</span><span class="p">,</span><span class="w"> </span><span class="n">isPaired</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">col.mates</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"deeppink"</span>
<span class="p">)</span>
<span class="nf">plotTracks</span><span class="p">(</span>
<span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">gtrack</span><span class="p">,</span><span class="w"> </span><span class="n">altrack</span><span class="p">,</span><span class="w"> </span><span class="n">grtrack</span><span class="p">),</span>
<span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">st</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">en</span>
<span class="p">)</span>
</code></pre></div>
<figure>
<img src="https://blog.liang2.tw/posts/2016/01/plot-seq-depth-gviz/pics/seqdepth_BAM_default.png"/>
</figure>
<p>To plot only the coverage, set the type as <code>coverage</code>.</p>
<div class="highlight"><pre><span></span><code><span class="n">altrack</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">AlignmentsTrack</span><span class="p">(</span>
<span class="w"> </span><span class="s">"myseq.bam"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"coverage"</span>
<span class="p">)</span>
</code></pre></div>
<figure>
<img src="https://blog.liang2.tw/posts/2016/01/plot-seq-depth-gviz/pics/seqdepth_BAM_coverage_only.png"/>
</figure>
<h3 id="fancier-alignment-display">Fancier alignment display</h3>
<p>After spending some time with the documentation, you will find the alignment display can be made much fancier.</p>
<p>For example, when looking at a much smaller genome region, we may want to see the sequence and the read mismatches. This can be done by adding a <code>SequenceTrack</code> that provides the genome sequence:</p>
<div class="highlight"><pre><span></span><code><span class="n">small_st</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">41267.735e3L</span>
<span class="n">small_en</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">41267.805e3L</span>
<span class="nf">library</span><span class="p">(</span><span class="n">BSgenome.Hsapiens.UCSC.hg19</span><span class="p">)</span>
<span class="n">strack</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">SequenceTrack</span><span class="p">(</span>
<span class="w"> </span><span class="n">Hsapiens</span><span class="p">,</span>
<span class="w"> </span><span class="n">chromosome</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">thechr</span><span class="p">,</span><span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">small_st</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">small_en</span><span class="p">,</span>
<span class="w"> </span><span class="n">cex</span><span class="o">=</span><span class="m">0.8</span>
<span class="p">)</span>
</code></pre></div>
<p>We tweak the other tracks as well so the figure isn’t overloaded with information. Gene annotations are collapsed into a single line. Also, the height of the aligned reads is increased so that individual letters (A, T, C, G) fit.</p>
<div class="highlight"><pre><span></span><code><span class="n">grtrack_small</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">GeneRegionTrack</span><span class="p">(</span>
<span class="w"> </span><span class="n">grtrack</span><span class="o">@</span><span class="n">range</span><span class="p">,</span>
<span class="w"> </span><span class="n">chromosome</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">thechr</span><span class="p">,</span>
<span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">small_st</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">small_en</span><span class="p">,</span>
<span class="w"> </span><span class="n">stacking</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"dense"</span><span class="p">,</span>
<span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"Gene Annotation"</span>
<span class="p">)</span>
<span class="n">altrack</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">AlignmentsTrack</span><span class="p">(</span>
<span class="w"> </span><span class="s">"myseq.bam"</span><span class="p">,</span>
<span class="w"> </span><span class="n">isPaired</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span>
<span class="w"> </span><span class="n">min.height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">max.height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="p">,</span><span class="w"> </span><span class="n">coverageHeight</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.15</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span>
<span class="p">)</span>
<span class="nf">plotTracks</span><span class="p">(</span>
<span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">gtrack</span><span class="p">,</span><span class="w"> </span><span class="n">altrack</span><span class="p">,</span><span class="w"> </span><span class="n">grtrack_small</span><span class="p">,</span><span class="w"> </span><span class="n">strack</span><span class="p">),</span>
<span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">small_st</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">small_en</span>
<span class="p">)</span>
</code></pre></div>
<figure>
<img src="https://blog.liang2.tw/posts/2016/01/plot-seq-depth-gviz/pics/seqdepth_BAM_small_region.png"/>
</figure>
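<p>We found a C>T SNP here!</p>Jupyter Notebook Theme2016-01-07T00:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-01-07:/posts/2016/01/jupyter-notebook-theme/<p><a href="http://jupyter.org/">Jupyter Notebook</a>, formerly known as IPython Notebook, is probably the tool many people use to record experiment steps and results when doing data analysis in Python.</p>
<p>Now <a href="http://ipython.org/">IPython</a> (v4.0+) has returned to being an Interactive Python Shell …</p><p><a href="http://jupyter.org/">Jupyter Notebook</a>, formerly known as IPython Notebook, is probably the tool many people use to record experiment steps and results when doing data analysis in Python.</p>
<p>Now <a href="http://ipython.org/">IPython</a> (v4.0+) has returned to its essence as an Interactive Python Shell: it is just a package that extends the built-in Python REPL, and the extra dependencies have been removed. The original IPython Notebook mainly provided an environment similar to <a href="https://reference.wolfram.com/language/tutorial/UsingANotebookInterface.html">Mathematica Notebook</a>; it has too many features to list here. It can run with either a web or a Qt interface.</p>
<p>Later it began to integrate many other languages, so that languages such as Julia, R, and Lua can all use the same Notebook architecture. That is how <a href="http://jupyter.org/">Jupyter</a> was born, and the original IPython became just one of the possible language kernels. A notebook itself can be written in R or Julia.</p>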
<div class="toc">
<ul>
<li><a href="#jupyter-notebook">Jupyter Notebook</a></li>
<li><a href="#custom-theme-on-v41">Custom Theme on v4.1+</a></li>
<li><a href="#custom-theme-before-v41">Custom Theme Before v4.1</a></li>
</ul>
</div>
<h3 id="jupyter-notebook">Jupyter Notebook</h3>
<p>Installing it with Python is very simple:</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>pip<span class="w"> </span>install<span class="w"> </span>jupyter
</code></pre></div>
<p>Jupyter uses the web interface by default and runs a tornado server, by default at <a href="http://localhost:8888">http://localhost:8888</a>.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>jupyter<span class="w"> </span>notebook
<span class="go">[I 23:50:58.449 NotebookApp] Serving notebooks from local directory: /Users/liang</span>
<span class="go">[I 23:50:58.449 NotebookApp] 0 active kernels</span>
<span class="go">[I 23:50:58.450 NotebookApp] The IPython Notebook is running at: http://localhost:8888/</span>
<span class="go">[I 23:50:58.450 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).</span>
</code></pre></div>
<p>If a browser is installed, a window opens automatically.</p>
<figure>
<img src="https://blog.liang2.tw/posts/2016/01/jupyter-notebook-theme/pics/jupyter_default_frontpage.png"/>
<p class="caption center">Jupyter Notebook Hub</p>
</figure>
<p>Notebooks are saved as <code>.ipynb</code> files by default. A typical one looks like this:</p>
<figure>
<img src="https://blog.liang2.tw/posts/2016/01/jupyter-notebook-theme/pics/jupyter_default_notebook.png"/>
<p class="caption center">Jupyter Notebook Example</p>
</figure>
<h3 id="custom-theme-on-v41">Custom Theme on v4.1+</h3>
<p>EDIT: 2016-01-11 <a href="https://blog.jupyter.org/2016/01/08/notebook-4-1-release/">Jupyter Notebook 4.1</a> has been released; the config paths are now all under <code>~/.jupyter</code> by default.</p>
<p>Today’s focus is changing the theme, so let’s get to it.</p>
<p>Use <code>jupyter --config-dir</code> or <code>jupyter --paths</code> to find where the config files should go.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>jupyter<span class="w"> </span>--config-dir
<span class="go">~/.jupyter</span>
<span class="gp">$ </span>jupyter<span class="w"> </span>notebook<span class="w"> </span>--generate-config<span class="w"> </span><span class="c1"># create the Notebook config file</span>
<span class="go">Writing default config to: ~/.jupyter/jupyter_notebook_config.py</span>
</code></pre></div>
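<p>If you only want to change the theme, there is no need to touch the Notebook config file.</p>
<p>The theme I currently use comes from <a href="https://github.com/dunovank/jupyter-themes">dunovank</a>, who has collected at least a dark and a light variant, which should be enough. Adjusting CSS on top of someone else’s work is also relatively easy; I have <a href="https://github.com/ccwang002/dotfiles/tree/master/ipy_profile/ipython3">rewritten a bit</a> myself (I forget what). dunovank wrote a package for installing themes, but you don’t need it; having the CSS ready is enough. I’ll use the Grade3 theme as the demo.</p>
<p>Put <code>custom.css</code> at <code>~/.jupyter/custom/custom.css</code>. The config directory looks like this:</p>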
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>tree<span class="w"> </span>~/.jupyter<span class="w"> </span>-F<span class="w"> </span>-L<span class="w"> </span><span class="m">3</span>
<span class="go">~/.jupyter</span>
<span class="go">├── custom/</span>
<span class="go">│ └── custom.css -> /path/to/custom_light.css</span>
<span class="go">└── jupyter_notebook_config.py</span>
</code></pre></div>
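<p>The same layout can also be set up from Python. The following is only a minimal sketch, not part of Jupyter itself: <code>install_theme</code> is a hypothetical helper name, and it assumes the config directory is either given explicitly, taken from the <code>JUPYTER_CONFIG_DIR</code> environment variable, or defaults to <code>~/.jupyter</code>. It symlinks a theme CSS file to <code>custom/custom.css</code>, matching the tree shown above.</p>

```python
import os
import tempfile
from pathlib import Path


def install_theme(css_file, config_dir=None):
    """Link css_file as <config_dir>/custom/custom.css and return the link path."""
    # Assumption: honor JUPYTER_CONFIG_DIR if set, otherwise fall back to ~/.jupyter.
    if config_dir is None:
        config_dir = os.environ.get("JUPYTER_CONFIG_DIR", Path.home() / ".jupyter")
    custom_dir = Path(config_dir) / "custom"
    custom_dir.mkdir(parents=True, exist_ok=True)
    target = custom_dir / "custom.css"
    # Replace an existing file or a (possibly dangling) symlink.
    if target.exists() or target.is_symlink():
        target.unlink()
    target.symlink_to(Path(css_file).resolve())
    return target


if __name__ == "__main__":
    # Demo in a throwaway directory so the real ~/.jupyter is left untouched.
    with tempfile.TemporaryDirectory() as tmp:
        css = Path(tmp) / "custom_light.css"
        css.write_text("body { background: #fdf6e3; }\n")
        link = install_theme(css, config_dir=Path(tmp) / "jupyter")
        print(link, "->", os.readlink(link))
```

<p>A symlink keeps the theme file in a dotfiles repository while Jupyter reads it from its own config directory; copying the file would work just as well.</p>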
<h3 id="custom-theme-before-v41">Custom Theme Before v4.1</h3>
<p>Because of the transition from IPython to Jupyter, the settings are currently rather scattered. Old settings live in <code>~/.ipython</code>, while Jupyter’s live in <code>~/.jupyter</code>. If the settings behave strangely, check both paths.</p>
<p>Just put the CSS at <code>~/.ipython/profile_default/static/custom/custom.css</code> and restart Jupyter Notebook<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>. The result looks like this:</p>
<figure>
<img src="https://blog.liang2.tw/posts/2016/01/jupyter-notebook-theme/pics/jupyter_grade3_frontpage.png"/>
<img src="https://blog.liang2.tw/posts/2016/01/jupyter-notebook-theme/pics/jupyter_grade3_notebook1.png"/>
<p class="caption center">Jupyter Notebook Theme Grade3 Demo</p>
</figure>
<figure>
<img src="https://blog.liang2.tw/posts/2016/01/jupyter-notebook-theme/pics/jupyter_grade3_notebook2.png"/>
<p class="caption center">With all toolbars toggled on, and what tables look like.</p>
</figure>
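<p>Personally, I feel that lower contrast is easier on the eyes over long sessions. A dark background is nice too, but plots often come with their own white background, so the overall look isn’t very pretty; you would probably have to change the matplotlib theme as well XD</p>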
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
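<p>This path does not fit Jupyter’s cross-kernel design philosophy. <del>I expect the path will change in the future.</del> This was consolidated in version 4.1+, and the settings are now separate from IPython’s. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>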
</li>
</ol>
</div>Blog defaults to HTTPS2016-01-06T00:00:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2016-01-06:/posts/2016/01/blog-https/<p>In short, the blog now uses https. Plain http connections are redirected to https.</p>
<p>The blog has always been hosted on <a href="https://github.com/ccwang002/ccwang002.github.io">GitHub Pages</a>, which has https by default, but after switching the URL to a custom domain …</p><p>In short, the blog now uses https. Plain http connections are redirected to https.</p>
<p>The blog has always been hosted on <a href="https://github.com/ccwang002/ccwang002.github.io">GitHub Pages</a>, which has https by default, but https naturally stopped working after the URL was changed to a custom domain. There is an issue on GitHub asking them to add <a href="https://github.com/isaacs/github/issues/156">HTTPS support for custom domain</a>, but for now you still have to work it out yourself. GitHub will probably look harder for a suitable solution as services like <a href="https://letsencrypt.org/">Let’s Encrypt</a> become popular.</p>
<div class="toc">
<ul>
<li><a href="#cloudflare-ssl-and-cdn">CloudFlare SSL and CDN</a></li>
<li><a href="#disqus">Disqus</a></li>
<li><a href="#justfont">justfont</a></li>
<li><a href="#hinet">Hinet URL redirection service</a></li>
</ul>
</div>
<h2 id="cloudflare-ssl-and-cdn">CloudFlare SSL and CDN</h2>
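<p>Reading that <a href="https://github.com/isaacs/github/issues/156">issue</a>, you can find the CloudFlare approach other people use. Conceptually you add a layer of CloudFlare CDN, and the CDN provides the https certificate. It is quicker to look at CloudFlare’s own description on its Crypto page:</p>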
<figure>
<img src="https://blog.liang2.tw/posts/2016/01/blog-https/pics/cloudflare_ssl.png"/>
<p class="caption center">source: <a href="https://www.cloudflare.com/ssl/"> CloudFlare one-click SSL</a></p>
</figure>
<p>So CloudFlare uses https when it caches the GitHub pages, and https again when serving users. What remains is whether you trust CloudFlare.</p>
<p>For the CloudFlare setup, see <a href="https://blog.keanulee.com/2014/10/11/setting-up-ssl-on-github-pages.html">Keanu’s Blog</a>. Some key notes:</p>
<ul>
<li>Switch to CloudFlare’s DNS servers</li>
<li>For the Crypto SSL option, choose Full (not Strict, which GitHub does not support yet)</li>
<li>In Page Rules, force all http links to use https (for example: <code>http://blog.liang2.tw/*</code>)</li>
</ul>
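<p>The HTTPS and DNS settings both take a while to take effect; wait a few hours, or watch for a day, before turning http off.</p>
<p>In Pelican’s publish settings <code>publishconf.py</code>, <code>SITEURL</code>, which controls the site URL, can be set to <code>//blog.liang2.tw</code> without a protocol (such important information is not in the documentation), so the site can be served over both http and https.</p>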
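<p>To see why a protocol-relative URL works for both schemes, here is a small sketch using only Python’s standard library. The <code>SITEURL</code> line is what would go into <code>publishconf.py</code>; the <code>resolve</code> helper is hypothetical and merely imitates how a browser resolves such a link against the page it appears on.</p>

```python
from urllib.parse import urljoin

# Value assumed for Pelican's publishconf.py: no scheme, so the browser
# reuses whichever scheme the current page was loaded with.
SITEURL = "//blog.liang2.tw"


def resolve(page_url, link):
    """Resolve link the way a browser would when it appears on page_url."""
    return urljoin(page_url, link)


if __name__ == "__main__":
    link = SITEURL + "/posts/2016/01/blog-https/"
    for page in ("http://blog.liang2.tw/", "https://blog.liang2.tw/"):
        print(page, "->", resolve(page, link))
```

<p>The same generated HTML therefore keeps visitors on https once they arrive over https, without hard-coding the scheme into every link.</p>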
<p>That basically finishes it. Unexpectedly, though, there were still a few small problems:</p>
<ul>
<li>The web font service <a href="http://en.justfont.com/membership">justfont</a> requires the Business Plan to support HTTPS.</li>
<li>The comment system <a href="https://disqus.com/">Disqus</a> treats HTTP and HTTPS as <a href="https://github.com/aspnet/Docs/issues/623">different comment threads</a>, and they have to be merged manually.</li>
</ul>
<h2 id="disqus">Disqus</h2>
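<p>It seems the only fix is to point everything to https. This cannot be changed directly in the Disqus settings; you have to use its <a href="https://help.disqus.com/customer/portal/articles/912757-url-mapper">URL Mapper</a> to download a CSV of all the links where comment threads appear and edit it by hand.</p>
<p>It feels rather crude. But the site has few comments, so not much needed changing, and they were soon synced to the new location.</p>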
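<p>The manual edit can be scripted. The sketch below is an assumption-laden illustration rather than an official Disqus tool: it assumes the exported CSV has one thread URL per row and that the upload format is <code>old_url,new_url</code> per row; <code>to_https</code> and <code>build_mapping</code> are hypothetical helper names.</p>

```python
import csv
import io


def to_https(url):
    """Return url with an http:// scheme upgraded to https://."""
    if url.startswith("http://"):
        return "https://" + url[len("http://"):]
    return url


def build_mapping(exported_csv):
    """Turn the exported one-URL-per-row CSV text into old,new rows."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(exported_csv)):
        if not row or not row[0].strip():
            continue  # skip blank rows
        old = row[0].strip()
        new = to_https(old)
        if new != old:  # only rows that actually change need a mapping
            writer.writerow([old, new])
    return out.getvalue()


if __name__ == "__main__":
    exported = "http://blog.liang2.tw/a/\nhttps://blog.liang2.tw/b/\n"
    print(build_mapping(exported), end="")
```

<p>Check the resulting file by eye before uploading it, since the mapping moves existing comment threads.</p>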
<h2 id="justfont">justfont</h2>
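<p>I previously backed the 金萱 (jf-jinxuan) font project, so I actually got two years of the Business Plan. I emailed customer support and the setting was changed within a day. After that I will have to pay more, though.</p>
<h2 id="hinet">Hinet URL redirection service</h2>
<p>I don’t run any server myself; I am too lazy to maintain one. But I am also too lazy to type. Since no other subdomain is in use, I set up a redirect through Hinet so that <a href="http://liang2.tw">http://liang2.tw</a> goes to <a href="http://blog.liang2.tw">http://blog.liang2.tw</a>, which is then redirected to https.</p>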
<figure>
<img src="https://blog.liang2.tw/posts/2016/01/blog-https/pics/cloudflare_dns_setting.png"/>
<p class="caption center">CloudFlare DNS setting</p>
</figure>
<p>That’s about it. I hope I can keep running this blog without having to host a server myself.</p>Overview of Genomic Data Processing in Bioconductor2015-12-29T20:28:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2015-12-29:/posts/2015/12/biocondutor-genomic-data/<p>Notes on fundamental tools and learning resources for handling genomic data in R with Bioconductor.</p><p>Sorry for the late update. In the past two months, I finished my Ph.D. applications (hope to hear good news in the next two months) and was busy preparing for PyCon Taiwan 2016. Also, a year-long website development project finally came to an end.</p>
<p>Now most things are settled, so I can get back to writing my blog.</p>
<p>Since September, at least five drafts have piled up and I don’t know when I can finish them, so I think I have to change my writing strategy. I will publish a post as soon as I have collected the information, and give deeper reviews in follow-up posts. Right now I will focus on Bioconductor (and general bioinformatics topics) and Django.</p>
<div class="toc">
<ul>
<li><a href="#bioconductor">Bioconductor</a><ul>
<li><a href="#annotation-and-genome-reference">Annotation and Genome Reference</a></li>
<li><a href="#experiment-data-storage">Experiment Data Storage</a></li>
<li><a href="#operations-on-genome">Operations on Genome</a></li>
<li><a href="#genomic-data-visualization">Genomic data visualization</a></li>
</ul>
</li>
<li><a href="#summary">Summary</a></li>
</ul>
</div>
<h2 id="bioconductor">Bioconductor</h2>
<p><a href="https://www.bioconductor.org/">Bioconductor</a> is indeed a rich resource for R in terms of both data and tools. I realized I have yet to spend time seriously understanding the whole ecosystem, which I believe can drastically lighten the load of daily analysis.</p>
<p>Bioconductor’s website is informative. If you are familiar with R, you already know that one of the best ways to understand a package is to read its vignettes. Packages on Bioconductor generally have vignettes, which are really helpful, and the website makes them accessible. On top of that, the site has <a href="https://www.bioconductor.org/help/course-materials/">Courses & Conferences</a> and <a href="https://www.bioconductor.org/help/workflows/">Workflows</a>. The former collects the conference materials of the past few years, including package hands-on sessions, analysis tutorials, and advanced R topics. It’s a hidden gem to me: a single glance turned up plenty of material worth reading. The latter is better known and gives examples of typical analysis workflows.</p>
<p>I’m interested in the following topics in Bioconductor:</p>
<ul>
<li>Annotation and genome reference (OrgDb, TxDb, OrganismDb, BSgenome)</li>
<li>Experiment data storage (ExpressionSets)</li>
<li>Operations on genome (GenomicRanges)</li>
<li>Genomic data visualization (Gviz, ggbio)</li>
</ul>
<p>Keywords in Bioconductor for each topic are given in parentheses; most of them are package names. For each topic, I’ll put the related resources I collected in the following sections.</p>
<p>Before the listing: I found the <a href="http://genomicsclass.github.io/book/">PH525x series</a>, maintained by Rafael Irizarry and Michael Love from Harvard, to be a comprehensive entry point for almost every related topic. The site is the companion resource for their edX classes. Both are worth a look.</p>
<h3 id="annotation-and-genome-reference">Annotation and Genome Reference</h3>
<ul>
<li>
<p><a href="http://genomicsclass.github.io/book/pages/annoPhen.html">Annotating phenotypes and molecular function</a> from <a href="http://genomicsclass.github.io/book/">PH525x series</a> gives a good overview and a taste of the powerful ecosystem Bioconductor provides.</p>
</li>
<li>
<p><a href="https://www.bioconductor.org/help/course-materials/2015/BioC2015/Annotation_Resources.html">Annotation Resources</a> from <a href="https://www.bioconductor.org/help/course-materials/2015/BioC2015/">BioC 2015</a> gives a more extensive introduction to all available types of references, from genome sequences to transcriptome and gene information.</p>
</li>
</ul>
<p>For example, the annotation packages for human include:</p>
<ul>
<li><a href="http://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html">org.Hs.eg.db</a></li>
<li><a href="http://bioconductor.org/packages/release/data/annotation/html/TxDb.Hsapiens.UCSC.hg38.knownGene.html">TxDb.Hsapiens.UCSC.hg38.knownGene</a></li>
<li><a href="http://bioconductor.org/packages/release/data/annotation/html/Homo.sapiens.html">Homo.sapiens</a></li>
<li><a href="http://bioconductor.org/packages/release/data/annotation/html/BSgenome.Hsapiens.UCSC.hg38.html">BSgenome.Hsapiens.UCSC.hg38</a></li>
</ul>
<h3 id="experiment-data-storage">Experiment Data Storage</h3>
<p>ExpressionSet stores the data of an expression experiment, combining the expression values and phenotypes of the same samples. Experiment metadata (such as the description of a GEO dataset) can be attached as well.</p>
<ul>
<li>
<p><a href="http://genomicsclass.github.io/book/pages/eset.html">The ExpressionSet container</a> from <a href="http://genomicsclass.github.io/book/">PH525x series</a> gives an intro. It should be sufficient for using ExpressionSet in daily work.</p>
</li>
<li>
<p><a href="https://www.bioconductor.org/packages/release/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf">The ExpressionSet Introduction</a>, a vignette of the <a href="https://www.bioconductor.org/packages/release/bioc/html/Biobase.html">Biobase</a> package, gives a detailed explanation.</p>
</li>
</ul>
<h3 id="operations-on-genome">Operations on Genome</h3>
<p>I haven’t gone into the details, but operations on genomic ranges are often tricky and, more importantly, badly optimized.</p>
<ul>
<li>
<p><a href="http://genomicsclass.github.io/book/pages/iranges_granges.html">IRanges and GRanges</a> and <a href="http://genomicsclass.github.io/book/pages/operateGRanges.html">GRanges operations</a> from <a href="http://genomicsclass.github.io/book/">PH525x series</a> give the overview of using the package <a href="https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html">GenomicRanges</a>.</p>
</li>
<li>
<p><a href="https://www.bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.pdf">An Introduction to Genomic Ranges Classes</a>, a <a href="https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html">GenomicRanges</a> vignette, gives a detailed view.</p>
</li>
<li>
<p>Also, their paper, <a href="http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003118">“Software for Computing and Annotating Genomic Ranges”, <em>PLOS Computational Biology</em></a>, should be another good overview of the package.</p>
</li>
<li>
<p><a href="https://cran.r-project.org/web/packages/data.table/index.html">data.table</a>’s <code>foverlaps</code> function is worth comparing, since I already use it and I know it is <a href="https://github.com/Rdatatable/data.table/wiki/talks/EARL2014_OverlapRangeJoin_Arun.pdf">blazingly fast</a>. <code>foverlaps</code> handles overlaps of integer ranges, so it can be applied to genomic operations. Its code is quite complex, so its mechanism is still a mystery to me. I’d like to see how it compares with using a database like SQLite.</p>
</li>
</ul>
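<p>For the SQLite side of such a comparison, a range-overlap join can be sketched with Python’s built-in <code>sqlite3</code> module. This is only a minimal illustration: the table and column names (<code>feature</code>, <code>query</code>, <code>start_pos</code>, <code>end_pos</code>), the helper name <code>find_overlaps</code>, and the toy coordinates are all made up; ranges are assumed to be closed; and nothing here has been benchmarked against <code>foverlaps</code> or GenomicRanges.</p>

```python
import sqlite3


def find_overlaps(features, queries):
    """Return (query_name, feature_name) pairs whose closed ranges overlap."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE feature (name TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER)"
    )
    con.execute(
        "CREATE TABLE query (name TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER)"
    )
    con.executemany("INSERT INTO feature VALUES (?, ?, ?, ?)", features)
    con.executemany("INSERT INTO query VALUES (?, ?, ?, ?)", queries)
    # A plain B-tree index narrows by chromosome and start position only.
    con.execute("CREATE INDEX feature_idx ON feature (chrom, start_pos, end_pos)")
    rows = con.execute(
        """
        SELECT q.name, f.name
        FROM query AS q
        JOIN feature AS f
          ON f.chrom = q.chrom
         AND f.start_pos <= q.end_pos   -- feature starts before the query ends
         AND f.end_pos >= q.start_pos   -- feature ends after the query starts
        ORDER BY q.name, f.start_pos
        """
    ).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    features = [
        ("geneA", "chr17", 100, 200),
        ("geneB", "chr17", 300, 400),
        ("geneC", "chr1", 150, 250),
    ]
    queries = [("q1", "chr17", 180, 320), ("q2", "chr1", 10, 20)]
    for pair in find_overlaps(features, queries):
        print(pair)
```

<p>Two closed ranges overlap exactly when each one starts no later than the other ends, which is what the two inequality conditions in the join express.</p>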
<h3 id="genomic-data-visualization">Genomic data visualization</h3>
<p>Basically I can find two packages:</p>
<ul>
<li><a href="https://bioconductor.org/packages/release/bioc/html/Gviz.html">Gviz</a></li>
<li><a href="https://bioconductor.org/packages/release/bioc/html/ggbio.html">ggbio</a></li>
</ul>
<p>I don’t know their differences yet. Both can produce well-done figures. But I have some experience with ggbio, which was a bit tricky to use, so for now I will go with Gviz.</p>
<ul>
<li><a href="https://www.bioconductor.org/help/course-materials/2012/BiocEurope2012/GvizEuropeanBioc2012.pdf">Visualizing genomic features with the Gviz package</a> given at Bioc Europe 2012 has a decent introduction about Gviz.</li>
<li><a href="https://bioconductor.org/packages/release/bioc/vignettes/Gviz/inst/doc/Gviz.pdf">The Gviz User Guide</a> looks very comprehensive and also covers usage with expression and alignment results.</li>
</ul>
<h2 id="summary">Summary</h2>
<p>These resources should be enough for weeks of trying. It’s exciting to find so many useful tools.</p>
<p>So, good luck to me with my Ph.D. applications, PyCon Taiwan 2016, and more frequent blog posts.</p>Customize Django User Model2015-11-04T18:23:00-06:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2015-11-04:/posts/2015/11/django-custom-user/<p>The fields of a Django account are defined in <a href="https://docs.djangoproject.com/en/1.8/ref/contrib/auth/#fields">User</a> of <code>django.contrib.auth</code>. For users, these include username*, first_name, last_name, email, and password*. For developers, there are also:</p>
<ul>
<li>Assigned Groups and Permissions</li>
<li>Whether the account is staff or superuser</li>
<li>Account …</li></ul><p>The fields of a Django account are defined in <a href="https://docs.djangoproject.com/en/1.8/ref/contrib/auth/#fields">User</a> of <code>django.contrib.auth</code>. For users, these include username*, first_name, last_name, email, and password*. For developers, there are also:</p>
<ul>
<li>Assigned Groups and Permissions</li>
<li>Whether the account is staff or superuser</li>
<li>Account activation and last login time</li>
</ul>
<p>The built-in account features are practical and secure, so in general nobody changes them.</p>
<p>If you only want to add a profile to User, with fields such as birthday or home planet, there is no need to rewrite User either. Following <a href="https://docs.djangoproject.com/en/1.8/topics/auth/customizing/#extending-the-existing-user-model">Extending the existing User model</a> in the official docs, just create a one-to-one relationship pointing to User:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">django.conf</span> <span class="kn">import</span> <span class="n">settings</span>
<span class="kn">from</span> <span class="nn">django.db</span> <span class="kn">import</span> <span class="n">models</span>


<span class="k">class</span> <span class="nc">UserProfile</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">):</span>
    <span class="c1"># One profile per user; AUTH_USER_MODEL also works with a custom User</span>
    <span class="n">user</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">OneToOneField</span><span class="p">(</span><span class="n">settings</span><span class="o">.</span><span class="n">AUTH_USER_MODEL</span><span class="p">)</span>
    <span class="n">birth</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">DateField</span><span class="p">()</span>
    <span class="n">orig_planet</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">CharField</span><span class="p">(</span><span class="n">max_length</span><span class="o">=</span><span class="mi">255</span><span class="p">)</span>
</code></pre></div>
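<p>But Django logs in with the username by default. What if we want to log in with an email address instead?</p>
<h3 id="email">Log in with email as the account</h3>
<p>Because User is such an important model, pay attention to compatibility when rewriting it. The official docs also have a guide, <a href="https://docs.djangoproject.com/en/1.8/topics/auth/customizing/#specifying-a-custom-user-model">Specifying a custom User model</a>, though it is much longer than the previous one.</p>
<p>@jcugat has already made a package, <a href="https://github.com/jcugat/django-custom-user">django-custom-user</a>, which implements <code>EmailUser</code>, that is, login with the email address as the account. All the hard work is already done, so if you want to add your own fields, you can inherit from its <code>AbstractEmailUser</code>.</p>
<p>Once you have read the custom User docs, writing the User model itself is not hard; the more complicated parts are creating and changing users and the admin configuration. Besides reading this package’s source code, <a href="http://stackoverflow.com/questions/15012235">this Stack Overflow thread</a> also mentions different implementations. Django’s source code for this part is quite readable and worth a look too.</p>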
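<p>Since I will need email verification later, I will probably use <a href="https://github.com/pennersr/django-allauth">django-allauth</a> for it. It feels like I haven’t posted in a long time; I should split my posts into shorter ones XD</p>Anger in the Digital Age2015-10-10T00:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2015-10-10:/posts/2015/10/digital-anger/<p>In the digital age, being really, forcefully angry is a hard thing to do.</p>
<p>First of all, the communication media themselves do not really encourage it. Say I want to post a sweeping rant about something on FB, but FB will …</p><p>In the digital age, being really, forcefully angry is a hard thing to do.</p>
<p>First of all, the communication media themselves do not really encourage it. Say I want to post a sweeping rant about something on FB: because the post is too heated, FB shows it to few people, or pushes it only to people with the same preferences, so the goal of “telling someone off” is never really met. In fact, most social networks have this “huddling for warmth” function to some degree.</p>
<p>Next, setting aside who the anger is aimed at, whether a whole group or a few people in private, anger in the digital age is still very different. Let’s start with ordinary emotional expression. The tools of the digital age have greatly boosted our ability to express everyday feelings. From the kaomoji that plain text could more or less already use, to emoji, to stickers, to animated stickers, people are increasingly able to express even the smallest daily mood swing in a wildly exaggerated way.</p>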
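<p>For example, on finding that the garbage truck came ten minutes early today:</p>
<ol>
<li>The garbage truck actually came ten minutes early today (angry)</li>
<li>The garbage truck actually came ten minutes early today Σ(°Д°;</li>
<li>The garbage truck actually came ten minutes early today !!!!!!!! (with an animated image of someone sobbing, head in hands)</li>
</ol>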
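<p>But what about extremely negative emotions?</p>
<blockquote>
<p>Woke up today to find that the president and the largest party in the legislature are both still the KMT</p>
</blockquote>
<p>This only states a fact; it does not express the anger inside.</p>
<blockquote>
<p>Lying in bed wide-eyed, unable to get up, gritting my teeth and punching the air at the ceiling in resentment, one fist after the other, until my arms gave out too.</p>
</blockquote>
<p>...Nobody cares, and the action is so long that it reads like a novel; it may also exceed the character limit of some media. Kaomoji don’t seem to cut it either. Use <code>(╯‵□′)╯︵┴─┴</code> (table flip) to express this kind of emotion? It just feels “informal” and rather mocking, and a truly angry person does not joke.</p>
<p>Thinking it over, it seems the only option is to go back to the methods of the traditional letter-writing era, plus physical actions: cutting off contact, unfriending, unfollowing, or switching to more radical means and vocabulary. Swearing works very well.</p>
<p>But this conclusion makes me curious about how people used to argue with each other.</p>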
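<p>Actually, what made me notice how anger is expressed across different media was not the situations above. It started when I watched an elderly person argue over FB messages. Say a young person fails to show up; the elder angrily pulls out... a phone to ping that person. I guess what he wanted to say was</p>
<blockquote>
<p>Everyone is waiting for you alone! … (fifty more angry words omitted)</p>
</blockquote>
<p>But the message finally sent was</p>
<blockquote>
<p>Waiting for you mm</p>
</blockquote>
<p>Far too gentle. The reason is simple: it is hard for the elderly to type more than ten correct characters in a row. In a rage, calmly picking five characters (by handwriting input) is already an achievement, and that “mm” was just an accidental tap. So you see someone spouting a torrent of furious words out loud, while what gets typed and sent on the phone is nothing much.</p>
<p>Unexpectedly, for the elderly this medium acts as a decent emotional buffer. However angry they were, after sending the message and getting a reply from the other end, they somehow calm down a lot.</p>
<p>And young people?</p>
<p>I don’t have enough observations yet (please don’t flame me like this > <), but I think that if everyone votes for 柱柱姐 (Hung Hsiu-chu) next year, we will collect enough samples (lol)</p>
<p>EDIT 2015-10-10: We may not get the chance to vote for 柱柱姐 after all. The world changes too fast.</p>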
<p>/</p>
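<p>The title came to me during a trip.</p>
<p>At first I thought of the Chinese title 「數位時代的生氣」 (anger in the digital age), but it did not feel cool enough, so I switched to “dichotomized affection”. “dicho-” is a prefix I picked up while writing papers; it is much the same idea as “binarize”. But on reflection, this post is not about finding a threshold on an emotion scale to separate 0 from 1, so I changed it to “polarized affection”. Then again, the situation is more like a signal that cannot be transmitted intact, so it should be a low-pass filter... What am I even doing? The Chinese is crisp in its ambiguity, so I will leave it at that :P</p>
<p>/</p>
<p>This is an old post, written on 2015-08-26.</p>
<p>Today I saw that Facebook is going to roll out a richer set of emoji reactions rather than just Like, which reminded me of it. It is also a chance to see what kind of response a post like this gets on the blog.</p>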
<blockquote>
<p>Today, Facebook is taking the wraps off what form the new Like may take. It is rolling out “Reactions,” a new set of six emoji that will sit alongside the original thumbs-up to let users quickly respond with love, laughter, happiness, shock, sadness and anger.
(Source: <a href="http://techcrunch.com/2015/10/08/with-reactions-facebook-supercharges-the-like-button-with-6-empathetic-emoji">TechCrunch</a>)</p>
</blockquote>
<p>/</p>
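<p>Back then, someone replied that I should consider the case of swearing.</p>
<p>Swearing is probably the most effective vocabulary for conveying emotion, but my mom taught me not to swear, so it is out of scope xd</p>
<p>For example (hypothetically):</p>
<ol>
<li>I was told I could graduate this year, and then my advisor refuses to sign the leave form, damn it</li>
<li>Screw this, I was told I could leave this d*mn hellhole this year, and that old geezer refuses to sign the leave form, what the hell is this crap</li>
</ol>
<p>I still feel this is less effective than emoji; or rather, the emotional density is not high enough, and to reach a deeper level of fury (?), the usual approach is just to add a few more swear words.</p>
<p>But I am not sure how people usually read. Beyond the many anonymous forums, people online don’t have to say things to each other’s face, so they speak without much restraint. After reading this kind of thing for a while, you either filter such wording out automatically or find it hard to tell how angry the other person really is.</p>
<p>For me, a change in how I perceive someone is far more shocking than the swear words they use. “Wow, so this person is this scary when angry (noted)”; it is like crossing a red line that must never be crossed. But that becomes a discussion of “the bottom line in the digital age” or “image in the digital age”, which is far beyond what I have thought through so far.</p>Build a Member-Drawing Website with Django and SQLite2015-10-04T14:55:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2015-10-04:/posts/2015/10/django-draw-member/<p>Reimplement in Django the drawing website previously built with Flask, and take the chance to compare the design concepts of the two frameworks.</p><h4 id="_1">Previously</h4>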
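<p>I finished both seasons of LoveLive! The growth of μ’s in the first season is so moving. <strong>\真姫 is the best/</strong></p>
<p>…Er, right. Previously I covered <a href="../../09/flask-draw-member">building a draw site with Flask</a>. Our end goal is Django, though, so now it's time for a rewrite. It's also a chance to compare the design concepts of the two frameworks (<del>such as how verbose Django is at the start</del>, <del>and how verbose Flask gets by the end</del>).</p>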
<h3 id="from-flask-to-django">From Flask to Django</h3>
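<p>To make the switch without pulling in every Django feature at once, the intermediate steps use a number of "unusual" idioms. If you want to go straight to Django best practices, see TP's <a href="https://github.com/uranusjr/django-tutorial-for-programmers">《為程式人寫的 Django Tutorial 》</a> (a Django tutorial written for programmers), which builds a food-ordering system over 30 units.</p>
<p>We'll use many Django APIs along the way; for any that go unexplained, look them up in the <a href="https://docs.djangoproject.com/en/1.8/">official documentation</a>. I've also found that tracing Django's execution flow with a debugger helps understanding; if you want a polished debugger, install an IDE such as PyCharm.</p>
<p>The plan is to add Django features incrementally, roughly in this order:</p>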
<ul>
<li>Django View, Template</li>
<li>Django Model, ORM</li>
<li>Django Form</li>
<li>(Django Admin is not used)</li>
</ul>
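<p>The <a href="https://docs.djangoproject.com/en/1.8/">Django doc</a> home page is organized into the same parts, although this post won't cover every concept.</p>
<p>Also, the rewrite skips raw SQL, because going entirely without the ORM makes it hard to connect with the rest of Django. If you're interested, see Details after the Model section.</p>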
<div class="toc">
<ul>
<li><a href="#_1">Previously</a></li>
<li><a href="#from-flask-to-django">From Flask to Django</a></li>
<li><a href="#django">Django initial setup</a><ul>
<li><a href="#django-server">Django server</a></li>
<li><a href="#django-app">The first Django app</a></li>
<li><a href="#django-settings">Django settings</a></li>
<li><a href="#database-migration">Database Migration</a></li>
</ul>
</li>
<li><a href="#url-dispatcher">URL dispatcher</a></li>
<li><a href="#django-model-and-orm">Django Model and ORM</a><ul>
<li><a href="#migration-the-tracker-of-model-changes">Migration: the tracker of model changes</a></li>
<li><a href="#orm-queries-in-shell">ORM queries in shell</a></li>
<li><a href="#data-in-orm-and-fixtures">Data in ORM and fixtures</a></li>
</ul>
</li>
<li><a href="#django-template">Django Template</a></li>
<li><a href="#more-on-djangos-model-template-and-view-mtv">More on Django’s model, template and view (MTV)</a></li>
<li><a href="#django-form">Django Form</a><ul>
<li><a href="#more-django-form-in-view">More Django form in view</a></li>
</ul>
</li>
<li><a href="#_2">Summary</a></li>
<li><a href="#details">Details</a><ul>
<li><a href="#raw-sql">Raw SQL</a></li>
<li><a href="#better-queryset">Better QuerySet</a></li>
<li><a href="#timezone">Timezone</a></li>
<li><a href="#post-form-and-csrf">POST form and CSRF</a></li>
</ul>
</li>
</ul>
</div>
<h3 id="django">Django initial setup</h3>
<p>As before, create a Python virtual environment (this is where it pays off: it keeps the packages of different projects isolated).</p>
<div class="highlight"><pre><span></span><code>pip install django pytz ipython pyyaml
</code></pre></div>
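<p><a href="http://pythonhosted.org/pytz/">pytz</a> was introduced in the <a href="../../09/datetime-sqlite">previous post</a>; it's a package for handling time zones. <a href="http://ipython.org/">IPython</a> stands for Interactive Python; it's also a Python shell but offers many extra features, the most used probably being auto-completion. <a href="http://pyyaml.org/">PyYAML</a> handles YAML objects; it's optional, and without it the later examples can simply use JSON.</p>
<p>Our project root directory is <code>demo_django_draw_member</code>. Because Django has many settings, first use <a href="https://docs.djangoproject.com/en/1.8/ref/django-admin/">django-admin</a> in this directory to generate the basic structure. We create a project named <code>draw_site</code>.</p>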
<div class="highlight"><pre><span></span><code><span class="o">(</span>VENV<span class="o">)</span><span class="w"> </span>$<span class="w"> </span>django-admin<span class="w"> </span>startproject<span class="w"> </span>draw_site
</code></pre></div>
<p>After running it, a bunch of new files should appear, structured as follows. Note that there are two levels of <code>draw_site</code>.</p>
<div class="highlight"><pre><span></span><code>demo_django_draw_member/
└── draw_site/
├── draw_site/
│ ├── __init__.py
│ ├── settings.py
│ ├── urls.py
│ └── wsgi.py
└── manage.py*
</code></pre></div>
<p>From here on the working directory is actually <code>demo_django_draw_member/draw_site/</code>, the level that contains <code>manage.py</code>, and all later paths are relative to it. Here is what each file does.</p>
<ul>
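<li><code>manage.py</code> replaces django-admin from now on. The main difference between the two is that manage.py knows the project's settings.</li>
<li><code>draw_site/settings.py</code> holds Django's settings, such as the secret key, database, template engine, and apps.</li>
<li><code>draw_site/urls.py</code> holds the URL dispatching configuration, i.e. which function handles which path.</li>
<li><code>draw_site/wsgi.py</code> <a href="http://wsgi.org/">WSGI</a> is the standard interface between web servers and Python web applications. This file is rarely touched, so I won't go into detail. Flask and Django are both WSGI-compatible implementations.</li>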
</ul>
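<p>A Django site consists of one project and many apps. Each app focuses on one feature of the site and bundles the database schema, templates, and view logic it needs. The benefit is that the same feature doesn't have to be rewritten, and on a large site this structure helps keep the logic manageable.</p>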
<h4 id="django-server">Django server</h4>
<p>Let's get Django running first.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>python<span class="w"> </span>manage.py<span class="w"> </span>runserver
<span class="go">...</span>
<span class="go">Django version 1.8.5, using settings 'draw_site.settings'</span>
<span class="go">Starting development server at http://127.0.0.1:8000/</span>
</code></pre></div>
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/10/django-draw-member/pics/django_initial.png"/>
<figcaption>Django Hello World</figcaption>
</figure>
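<p>This is Django's built-in welcome page, shown when no URL is configured. Seeing it means at least the basic settings work. Like Flask, Django's built-in server reloads when the source code changes, so you can leave it running.</p>
<h4 id="django-app">The first Django app</h4>
<p>Our site uses only one app; create it and name it <code>draw_member</code>.</p>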
<div class="highlight"><pre><span></span><code>python<span class="w"> </span>manage.py<span class="w"> </span>startapp<span class="w"> </span>draw_member
</code></pre></div>
<div class="highlight"><pre><span></span><code>demo_django_draw_member/
└── draw_site/
├── draw_member/
│ ├── __init__.py
│ ├── admin.py
│ ├── migrations/
│ ├── models.py
│ ├── tests.py
│ └── views.py
├── draw_site/
│ └── ...
└── manage.py*
</code></pre></div>
<p>As you can see, an app is structured differently from a project.</p>
<p>To add the new app to the project, edit <code>draw_site/settings.py</code>.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># draw_site/settings.py</span>
<span class="n">INSTALLED_APPS</span> <span class="o">=</span> <span class="p">(</span>
<span class="s1">'draw_member'</span><span class="p">,</span> <span class="c1"># add this line</span>
<span class="s1">'django.contrib.admin'</span><span class="p">,</span>
<span class="s1">'django.contrib.auth'</span><span class="p">,</span>
<span class="s1">'django.contrib.contenttypes'</span><span class="p">,</span>
<span class="s1">'django.contrib.sessions'</span><span class="p">,</span>
<span class="s1">'django.contrib.messages'</span><span class="p">,</span>
<span class="s1">'django.contrib.staticfiles'</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div>
<p>Quite a few apps are installed by default. Ignore what they are for now.</p>
<h4 id="django-settings">Django settings</h4>
<p>A quick tour of <code>draw_site/settings.py</code>. Besides the INSTALLED_APPS setting used just now, here are a few settings relevant to this post.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Database</span>
<span class="c1"># https://docs.djangoproject.com/en/1.8/ref/settings/#databases</span>
<span class="n">DATABASES</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">'default'</span><span class="p">:</span> <span class="p">{</span>
<span class="s1">'ENGINE'</span><span class="p">:</span> <span class="s1">'django.db.backends.sqlite3'</span><span class="p">,</span>
<span class="s1">'NAME'</span><span class="p">:</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">BASE_DIR</span><span class="p">,</span> <span class="s1">'db.sqlite3'</span><span class="p">),</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1"># Internationalization</span>
<span class="c1"># https://docs.djangoproject.com/en/1.8/topics/i18n/</span>
<span class="n">LANGUAGE_CODE</span> <span class="o">=</span> <span class="s1">'en-us'</span>
<span class="n">TIME_ZONE</span> <span class="o">=</span> <span class="s1">'UTC'</span>
<span class="n">USE_TZ</span> <span class="o">=</span> <span class="kc">True</span>
</code></pre></div>
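<p>DATABASES defines the databases in use. By default it uses the SQLite database <code>db.sqlite3</code>.</p>
<p>Next are the language and time zone settings. The default is UTC with timezone support enabled, meaning the server records all times in UTC.</p>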
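<p>As a rough illustration of what "record in UTC" means, here is a small standard-library sketch (not Django code). The fixed UTC+8 offset and the variable names are assumptions made for the example: the timestamp is kept as a timezone-aware UTC value and converted only when it is shown to a user.</p>

```python
from datetime import datetime, timedelta, timezone

# Hypothetical display timezone for this example: a fixed UTC+8 offset.
LOCAL_TZ = timezone(timedelta(hours=8))

# A timezone-aware timestamp in UTC, the form kept on the server side.
stored = datetime(2015, 10, 5, 15, 17, 32, tzinfo=timezone.utc)

# Convert only for display; the stored value is left untouched.
shown = stored.astimezone(LOCAL_TZ)

print(stored.isoformat())  # 2015-10-05T15:17:32+00:00
print(shown.isoformat())   # 2015-10-05T23:17:32+08:00
```

<p>Both values describe the same instant, so comparing them with <code>==</code> gives True; only the presentation differs.</p>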
<h4 id="database-migration">Database Migration</h4>
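<p>Before writing any code, let me introduce a database concept: <a href="https://docs.djangoproject.com/en/1.8/topics/migrations/">migration</a>.</p>
<p>As the earlier examples showed, we first design what the database stores and how the site's flow uses that data, which together form the table schema. Over time, though, the site may gain new features, and it is hard to promise the schema will never change.</p>
<p>Changing a schema is not simple. On a production site the database holds data accumulated since launch, and you can't just throw it away because the schema changed. During development, different versions of the code (or code by different developers) may also have different schemas. Migration is what keeps the code and the database state consistent.</p>
<p>…This complicated right from the start? Well, our example doesn't use most of what migration offers; we only use it to initialize the database. The built-in apps each have their own database schema, and migration can create their tables.</p>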
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>python<span class="w"> </span>manage.py<span class="w"> </span>migrate
<span class="go">Operations to perform:</span>
<span class="go"> Synchronize unmigrated apps: messages, staticfiles</span>
<span class="go"> Apply all migrations: sessions, auth, contenttypes, admin</span>
<span class="go">Synchronizing apps without migrations:</span>
<span class="go"> Creating tables...</span>
<span class="go"> Running deferred SQL...</span>
<span class="go"> Installing custom SQL...</span>
<span class="go">Running migrations:</span>
<span class="go"> Rendering model states... DONE</span>
<span class="go"> Applying contenttypes.0001_initial... OK</span>
<span class="go"> Applying auth.0001_initial... OK</span>
<span class="go"> Applying admin.0001_initial... OK</span>
<span class="go"> Applying contenttypes.0002_remove_content_type_name... OK</span>
<span class="go"> Applying auth.0002_alter_permission_name_max_length... OK</span>
<span class="go"> Applying auth.0003_alter_user_email_max_length... OK</span>
<span class="go"> Applying auth.0004_alter_user_username_opts... OK</span>
<span class="go"> Applying auth.0005_alter_user_last_login_null... OK</span>
<span class="go"> Applying auth.0006_require_contenttypes_0002... OK</span>
<span class="go"> Applying sessions.0001_initial... OK</span>
</code></pre></div>
<p>Migration adjusts the database step by step until it matches the current code. These adjustments are recorded under <code>&lt;app&gt;/migrations/</code>, as we'll see shortly.</p>
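<p>To make the idea concrete, here is a minimal sketch of the bookkeeping behind a migration tool, written with the standard-library <code>sqlite3</code> module. It is not how Django implements migrations; the table name <code>applied_migrations</code>, the step names, and the <code>migrate()</code> helper are all hypothetical. The point is only that schema changes are ordered, applied once, and recorded.</p>

```python
import sqlite3

# Hypothetical, ordered list of schema changes: (name, SQL statement).
MIGRATIONS = [
    ("0001_initial", "CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT)"),
    ("0002_add_group_name", "ALTER TABLE members ADD COLUMN group_name TEXT"),
]


def migrate(conn):
    """Apply every step not yet recorded; return the names applied this run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_migrations (name TEXT PRIMARY KEY)"
    )
    done = {row[0] for row in conn.execute("SELECT name FROM applied_migrations")}
    applied = []
    for name, sql in MIGRATIONS:
        if name in done:
            continue  # this step already ran against this database
        conn.execute(sql)
        conn.execute("INSERT INTO applied_migrations (name) VALUES (?)", (name,))
        applied.append(name)
    conn.commit()
    return applied


conn = sqlite3.connect(":memory:")
first_run = migrate(conn)   # applies both steps
second_run = migrate(conn)  # nothing left to apply
columns = [row[1] for row in conn.execute("PRAGMA table_info(members)")]
print(first_run, second_run, columns)
```

<p>Running <code>migrate()</code> a second time is a no-op, which is the property that lets the same command be run safely on databases in different states.</p>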
<h3 id="url-dispatcher">URL dispatcher</h3>
<p>Next we change the home page, replacing Django's default <code>/</code> page with Hello World.</p>
<p>In Flask, URL routing is written as a decorator directly on the view function. A quick recap:</p>
<div class="highlight"><pre><span></span><code><span class="nd">@app</span><span class="o">.</span><span class="n">route</span><span class="p">(</span><span class="s1">'/'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">index</span><span class="p">():</span>
<span class="k">return</span> <span class="s2">"<p>Hello World!</p>"</span>
</code></pre></div>
<p>In Django the view and the URL are separate. First the view:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># draw_member/views.py</span>
<span class="kn">from</span> <span class="nn">django.shortcuts</span> <span class="kn">import</span> <span class="n">render</span> <span class="c1"># keep this for now</span>
<span class="kn">from</span> <span class="nn">django.http</span> <span class="kn">import</span> <span class="n">HttpResponse</span>
<span class="k">def</span> <span class="nf">home</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
<span class="k">return</span> <span class="n">HttpResponse</span><span class="p">(</span><span class="s2">"<p>Hello World!</p>"</span><span class="p">)</span>
</code></pre></div>
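<p>Structurally the two are much alike (thanks in part to the <a href="http://wsgi.org/">WSGI</a> spec).</p>
<p>Next is the URL configuration. We first add the URL to the project configuration. Having the settings scattered like this may feel odd; we'll move it into the app shortly.</p>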
<div class="highlight"><pre><span></span><code><span class="c1"># draw_site/urls.py</span>
<span class="sd">"""draw_site URL Configuration</span>
<span class="sd">The `urlpatterns` list routes URLs to views. For more information please see:</span>
<span class="sd"> https://docs.djangoproject.com/en/1.8/topics/http/urls/</span>
<span class="sd">...</span>
<span class="sd">"""</span>
<span class="kn">from</span> <span class="nn">django.conf.urls</span> <span class="kn">import</span> <span class="n">include</span><span class="p">,</span> <span class="n">url</span>
<span class="kn">from</span> <span class="nn">django.contrib</span> <span class="kn">import</span> <span class="n">admin</span>
<span class="kn">from</span> <span class="nn">draw_member.views</span> <span class="kn">import</span> <span class="n">home</span>
<span class="n">urlpatterns</span> <span class="o">=</span> <span class="p">[</span>
<span class="n">url</span><span class="p">(</span><span class="sa">r</span><span class="s1">'^$'</span><span class="p">,</span> <span class="n">home</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s2">"home"</span><span class="p">),</span>
<span class="n">url</span><span class="p">(</span><span class="sa">r</span><span class="s1">'^admin/'</span><span class="p">,</span> <span class="n">include</span><span class="p">(</span><span class="n">admin</span><span class="o">.</span><span class="n">site</span><span class="o">.</span><span class="n">urls</span><span class="p">)),</span>
<span class="p">]</span>
</code></pre></div>
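<p>The idea is simple: import the view function you need from the app (so the app directory is a Python package, which is why it contains <code>__init__.py</code>), give a path expressed as a regex, then the handler function and an optional name. The name stands for this URL path and can be used for reverse lookup later.</p>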
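<p>The dispatching idea described here can be sketched in a few lines of plain Python. This is a simplified illustration, not Django's resolver; <code>resolve()</code>, <code>pattern_for()</code>, and the <code>about</code> view are hypothetical names introduced for the example.</p>

```python
import re


def home(request):
    return "<p>Hello World!</p>"


def about(request):
    return "<p>About</p>"


# (compiled regex, view function, name), checked from top to bottom.
URLPATTERNS = [
    (re.compile(r"^$"), home, "home"),
    (re.compile(r"^about/$"), about, "about"),
]


def resolve(path):
    """Return the first view whose regex matches the path, or None."""
    path = path.lstrip("/")  # patterns are written without the leading slash
    for regex, view, _name in URLPATTERNS:
        if regex.search(path):
            return view
    return None


def pattern_for(name):
    """Find the regex registered under a name, the basis of reverse lookup."""
    for regex, _view, pattern_name in URLPATTERNS:
        if pattern_name == name:
            return regex.pattern
    raise KeyError(name)


print(resolve("/")(None))    # <p>Hello World!</p>
print(pattern_for("about"))  # ^about/$
```

<p>Keeping the name next to the pattern is what allows code elsewhere to refer to a URL without hard-coding its path.</p>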
<p>Test to confirm the configuration is correct.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>curl<span class="w"> </span>-XGET<span class="w"> </span><span class="s2">"localhost:8000"</span>
<span class="go"><p>Hello World!</p></span>
</code></pre></div>
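<p>Looking at <code>draw_site/urls.py</code> again, Django puts an <code>/admin</code> entry there by default that uses <code>include(admin.site.urls)</code>, meaning every URL starting with admin/ is handed to admin.site.urls to route. This makes an app easy to reuse across sites: the mount path may differ, but URL handling within the app stays consistent.</p>
<p>Let's rewrite right away. First add a <code>urls.py</code> under the app <strong>draw_member</strong>.</p>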
<div class="highlight"><pre><span></span><code><span class="c1"># draw_member/urls.py</span>
<span class="kn">from</span> <span class="nn">django.conf.urls</span> <span class="kn">import</span> <span class="n">include</span><span class="p">,</span> <span class="n">url</span>
<span class="kn">from</span> <span class="nn">.views</span> <span class="kn">import</span> <span class="n">home</span> <span class="c1"># explicit relative import</span>
<span class="n">urlpatterns</span> <span class="o">=</span> <span class="p">[</span>
<span class="n">url</span><span class="p">(</span><span class="sa">r</span><span class="s1">'^$'</span><span class="p">,</span> <span class="n">home</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s2">"home"</span><span class="p">),</span>
<span class="p">]</span>
</code></pre></div>
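<p>The format is basically copied from the existing file. Since it lives in the same app, the view can be imported with an explicit relative import (note that this is not an implicit relative import).</p>
<p>The original urls.py now "dispatches" URL handling to this app, like so.</p>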
<div class="highlight"><pre><span></span><code># draw_site/urls.py
from django.conf.urls import include, url
from django.contrib import admin
urlpatterns = [
url(r'^admin/', include(admin.site.urls)),
url(r'^', include('draw_member.urls')),
]
</code></pre></div>
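<p><code>r'^'</code> hands everything from the root to this app, which is also why more specific paths such as /admin go first. Passing a string means the module is imported only at run time; if you'd rather not, drop the string and import the app's URL module instead.</p>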
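<p>A minimal sketch of the nesting idea, in plain Python rather than Django's own code: a project-level pattern matches a prefix, strips it, and lets a nested list handle the remainder. The pattern lists and the <code>resolve()</code> helper are hypothetical stand-ins for <code>draw_member.urls</code> and the admin URLs.</p>

```python
import re


def home(request):
    return "home"


def admin_index(request):
    return "admin"


# Per-app pattern lists (hypothetical stand-ins for the real URL modules).
APP_PATTERNS = [(r"^$", home)]
ADMIN_PATTERNS = [(r"^$", admin_index)]

# Project-level patterns: the target is either a view or a nested list.
ROOT_PATTERNS = [
    (r"^admin/", ADMIN_PATTERNS),  # specific prefix listed first, as in the text
    (r"^", APP_PATTERNS),          # everything else goes to the app
]


def resolve(path, patterns=ROOT_PATTERNS):
    """Match a prefix, strip it, and let the nested list handle the rest."""
    for regex, target in patterns:
        match = re.match(regex, path)
        if match is None:
            continue
        if isinstance(target, list):
            found = resolve(path[match.end():], target)
            if found is not None:
                return found
        else:
            return target
    return None


print(resolve("admin/")(None))  # admin
print(resolve("")(None))        # home
```

<p>Because the app's patterns only ever see the part of the path after the prefix, the same app can be mounted under a different prefix without changing its own URL list.</p>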
<p>That's the most basic <a href="https://docs.djangoproject.com/en/1.8/topics/http/urls/">URL dispatching</a>.</p>
<h3 id="django-model-and-orm">Django Model and ORM</h3>
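<p>Now for the database. You can of course write raw SQL in Django, but here is another approach: Object-relational Mapping (ORM). An ORM organizes data in an object-oriented way and leaves the details of SQL, tables, and the database to the ORM engine to translate. There are plenty of introductions to this elsewhere, so let's jump straight to the implementation.</p>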
<pre style="font-family: Consolas, 'Courier New', monospace">
┌─────────────────────┐
│ members │
├─────────────────────┤
│ id INTEGER │ <─┐
│ name TEXT │ │
│ group_name TEXT │ │
└─────────────────────┘ │
│
┌─────────────────────┐ │
│ draw_histories │ │ foreign
├─────────────────────┤ │ key
│ memberid INTEGER │ ──┘
│ time DATETIME │
└─────────────────────┘
</pre>
<p>Recall our schema design. Thinking in ORM terms, we have two main models: Member and History. <strong>Member</strong> records the name and the group; <strong>History</strong> records the time and which member the draw belongs to.</p>
<p>In Django, models are defined in <code>models.py</code>. Let's write them.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># draw_member/models.py</span>
<span class="kn">from</span> <span class="nn">django.db</span> <span class="kn">import</span> <span class="n">models</span>
<span class="kn">from</span> <span class="nn">django.utils.timezone</span> <span class="kn">import</span> <span class="n">now</span>
<span class="k">class</span> <span class="nc">Member</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">):</span>
<span class="n">name</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">CharField</span><span class="p">(</span><span class="n">max_length</span><span class="o">=</span><span class="mi">256</span><span class="p">)</span>
<span class="n">group_name</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">CharField</span><span class="p">(</span><span class="n">max_length</span><span class="o">=</span><span class="mi">256</span><span class="p">)</span>
<span class="k">def</span> <span class="fm">__str__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="s1">'</span><span class="si">%s</span><span class="s1"> of </span><span class="si">%s</span><span class="s1">'</span> <span class="o">%</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">group_name</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">History</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">):</span>
<span class="n">member</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">ForeignKey</span><span class="p">(</span><span class="n">Member</span><span class="p">,</span> <span class="n">related_name</span><span class="o">=</span><span class="s2">"draw_histories"</span><span class="p">)</span>
<span class="c1"># with USE_TZ = True, now() returns a timezone-aware datetime in UTC</span>
<span class="n">time</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">DateTimeField</span><span class="p">(</span><span class="n">default</span><span class="o">=</span><span class="n">now</span><span class="p">)</span>
<span class="k">def</span> <span class="fm">__str__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="s1">'</span><span class="si">%s</span><span class="s1"> at </span><span class="si">%s</span><span class="s1">'</span> <span class="o">%</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">member</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">time</span><span class="p">)</span>
</code></pre></div>
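<p>Each attribute on the class corresponds to a field. A field has a type and database-level constraints (for example, <code>CharField</code> takes a <code>max_length</code>; <code>TextField</code> is the choice for text without a length limit). For the field types, see the <a href="https://docs.djangoproject.com/en/1.8/ref/models/fields/#field-types">official documentation</a>.</p>
<p>Everything under <strong>Member</strong> is a string, hence <code>CharField</code>. <strong>History</strong> is slightly more involved: the time field uses <code>DateTimeField</code>, so the column comes back converted to a Python datetime object; the member field uses <code>ForeignKey</code>, a relationship field, to indicate which member the draw belongs to. The <code>related_name</code> enables reverse lookup, i.e. getting all of a member's histories from the member.</p>
<p>Also define <code>__str__</code> on both classes now, so the objects are easy to identify in the Python shell later.</p>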
<h4 id="migration-the-tracker-of-model-changes">Migration: the tracker of model changes</h4>
<p>Enough talk, let's try it.</p>
<p>…Wait, remember migration? Every change to a database model requires running a migration to keep the code and the database state consistent.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>python<span class="w"> </span>manage.py<span class="w"> </span>makemigrations<span class="w"> </span>draw_member
<span class="go">Migrations for 'draw_member':</span>
<span class="go"> 0001_initial.py:</span>
<span class="go"> - Create model History</span>
<span class="go"> - Create model Member</span>
<span class="go"> - Add field member to history</span>
</code></pre></div>
<p>Django is smart enough to notice that we defined two new models, with fields that map to database column types. This information is written to the migration file:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># draw_member/migrations/0001_initial.py</span>
<span class="k">class</span> <span class="nc">Migration</span><span class="p">(</span><span class="n">migrations</span><span class="o">.</span><span class="n">Migration</span><span class="p">):</span>
<span class="n">dependencies</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">]</span>
<span class="n">operations</span> <span class="o">=</span> <span class="p">[</span>
<span class="n">migrations</span><span class="o">.</span><span class="n">CreateModel</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'History'</span><span class="p">,</span>
<span class="n">fields</span><span class="o">=</span><span class="p">[</span>
<span class="p">(</span><span class="s1">'id'</span><span class="p">,</span> <span class="n">models</span><span class="o">.</span><span class="n">AutoField</span><span class="p">(</span><span class="n">serialize</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">primary_key</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">verbose_name</span><span class="o">=</span><span class="s1">'ID'</span><span class="p">,</span> <span class="n">auto_created</span><span class="o">=</span><span class="kc">True</span><span class="p">)),</span>
<span class="p">(</span><span class="s1">'time'</span><span class="p">,</span> <span class="n">models</span><span class="o">.</span><span class="n">DateTimeField</span><span class="p">(</span><span class="n">default</span><span class="o">=</span><span class="n">django</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">timezone</span><span class="o">.</span><span class="n">now</span><span class="p">)),</span>
<span class="p">],</span>
<span class="p">),</span>
<span class="n">migrations</span><span class="o">.</span><span class="n">CreateModel</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'Member'</span><span class="p">,</span>
<span class="n">fields</span><span class="o">=</span><span class="p">[</span>
<span class="p">(</span><span class="s1">'id'</span><span class="p">,</span> <span class="n">models</span><span class="o">.</span><span class="n">AutoField</span><span class="p">(</span><span class="n">serialize</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">primary_key</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">verbose_name</span><span class="o">=</span><span class="s1">'ID'</span><span class="p">,</span> <span class="n">auto_created</span><span class="o">=</span><span class="kc">True</span><span class="p">)),</span>
<span class="p">(</span><span class="s1">'name'</span><span class="p">,</span> <span class="n">models</span><span class="o">.</span><span class="n">CharField</span><span class="p">(</span><span class="n">max_length</span><span class="o">=</span><span class="mi">256</span><span class="p">)),</span>
<span class="p">(</span><span class="s1">'group_name'</span><span class="p">,</span> <span class="n">models</span><span class="o">.</span><span class="n">CharField</span><span class="p">(</span><span class="n">max_length</span><span class="o">=</span><span class="mi">256</span><span class="p">)),</span>
<span class="p">],</span>
<span class="p">),</span>
<span class="n">migrations</span><span class="o">.</span><span class="n">AddField</span><span class="p">(</span>
<span class="n">model_name</span><span class="o">=</span><span class="s1">'history'</span><span class="p">,</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'member'</span><span class="p">,</span>
<span class="n">field</span><span class="o">=</span><span class="n">models</span><span class="o">.</span><span class="n">ForeignKey</span><span class="p">(</span><span class="n">to</span><span class="o">=</span><span class="s1">'draw_member.Member'</span><span class="p">,</span> <span class="n">related_name</span><span class="o">=</span><span class="s1">'draw_histories'</span><span class="p">),</span>
<span class="p">),</span>
<span class="p">]</span>
</code></pre></div>
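<p>Note that the Django ORM automatically added <code>id</code> as the primary key; we'll use it shortly. The details inside the migration will become clear as you get more familiar with Django.</p>
<p>With a new migration in place, sync the database state:</p>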
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>python<span class="w"> </span>manage.py<span class="w"> </span>migrate
<span class="go">...</span>
<span class="go">Running migrations:</span>
<span class="go"> Rendering model states... DONE</span>
<span class="go"> Applying draw_member.0001_initial... OK</span>
</code></pre></div>
<h4 id="orm-queries-in-shell">ORM queries in shell</h4>
<p>Now let's work with the ORM.</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>python<span class="w"> </span>manage.py<span class="w"> </span>shell
</code></pre></div>
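<p>This opens a Python shell, or an IPython shell if IPython is installed.
How does it differ from a plain one? It comes with the Django project's settings loaded. From a plain shell, you can run the following first to get the same effect.</p>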
<div class="highlight"><pre><span></span><code><span class="go">$ DJANGO_SETTINGS_MODULE="draw_site.settings" python</span>
<span class="gp">>>> </span><span class="kn">import</span> <span class="nn">django</span>
<span class="gp">>>> </span><span class="n">django</span><span class="o">.</span><span class="n">setup</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>In [1]: from draw_member.models import Member, History
In [2]: m1 = Member(name="高坂 穂乃果", group_name="μ's")
In [3]: m2 = Member(name="平沢 唯", group_name="K-ON!")
In [4]: m1, m2
Out[4]: (<Member: 高坂 穂乃果 of μ's>, <Member: 平沢 唯 of K-ON!>)
In [5]: m1.save()
In [6]: m2.save()
In [7]: h1 = History(member=m1)
In [8]: h1.save()
</code></pre></div>
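<p>You work with data as objects, just as the name ORM suggests. Note that nothing is written to the database until <code>.save()</code> is called. Using an unsaved object in a database operation raises an exception.</p>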
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">h_failed</span> <span class="o">=</span> <span class="n">History</span><span class="p">(</span><span class="n">member</span><span class="o">=</span><span class="n">Member</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">'FF'</span><span class="p">,</span> <span class="n">group_name</span><span class="o">=</span><span class="s1">'f'</span><span class="p">))</span>
<span class="gp">>>> </span><span class="n">h_failed</span><span class="o">.</span><span class="n">save</span><span class="p">()</span>
<span class="gt">Traceback (most recent call last):</span>
<span class="c">...</span>
<span class="gr">IntegrityError</span>: <span class="n">NOT NULL constraint failed: draw_member_history.member_id</span>
</code></pre></div>
<p>If that feels tedious, <code>Model.objects.create()</code> does it in one step. Once everything is saved properly, the database has data. Let's confirm in SQLite first.</p>
<div class="highlight"><pre><span></span><code><span class="go">-- sqlite3 db.sqlite3</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="p">.</span><span class="n">header</span><span class="w"> </span><span class="k">on</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">draw_member_member</span><span class="p">;</span>
<span class="go">id|name|group_name</span>
<span class="go">1|高坂 穂乃果|μ's</span>
<span class="go">2|平沢 唯|K-ON!</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">draw_member_history</span><span class="p">;</span>
<span class="go">id|time|member_id</span>
<span class="go">1|2015-10-05 15:17:32.061384|1</span>
</code></pre></div>
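<p>Through object operations like the ones above, we build the same database we would with hand-written SQL. Columns such as <code>id</code> and <code>member_id</code> are created automatically by the ORM engine; they can be customized, but the default behavior is easy to understand.</p>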
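<p>For comparison, here is roughly what the same operations look like as hand-written SQL, using the standard-library <code>sqlite3</code> module on an in-memory database. The table and column names follow the sqlite3 output above; the column types and constraints are assumptions, since the exact DDL Django emits is not shown in this post.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Column types here are assumptions made for the sketch.
conn.execute(
    "CREATE TABLE draw_member_member ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "name TEXT NOT NULL, group_name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE draw_member_history ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, time TEXT NOT NULL, "
    "member_id INTEGER NOT NULL REFERENCES draw_member_member (id))"
)

# Roughly what Member(...).save() does: insert a row and learn its new id.
cur = conn.execute(
    "INSERT INTO draw_member_member (name, group_name) VALUES (?, ?)",
    ("高坂 穂乃果", "μ's"),
)
member_id = cur.lastrowid

# Roughly what History(member=m1).save() does: store the member's id.
conn.execute(
    "INSERT INTO draw_member_history (time, member_id) VALUES (?, ?)",
    ("2015-10-05 15:17:32.061384", member_id),
)

# Roughly what h1.member.name does: follow member_id back to the member row.
row = conn.execute(
    "SELECT m.name FROM draw_member_history AS h "
    "JOIN draw_member_member AS m ON m.id = h.member_id"
).fetchone()
print(member_id, row)
```

<p>The ORM version hides the id bookkeeping and the join, which is most of what "mapping objects to relations" buys in a small project like this one.</p>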
<p>How do we fetch data through the ORM the way we just did with SQL?</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">draw_member.models</span> <span class="kn">import</span> <span class="n">Member</span><span class="p">,</span> <span class="n">History</span>
<span class="gp">>>> </span><span class="n">Member</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">all</span><span class="p">()</span>
<span class="go">[<Member: 高坂 穂乃果 of μ's>, <Member: 平沢 唯 of K-ON!>]</span>
<span class="gp">>>> </span><span class="n">History</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">all</span><span class="p">()</span>
<span class="go">[<History: 高坂 穂乃果 at 2015-10-05 15:17:32.061384+00:00>]</span>
</code></pre></div>
<p>Data is queried through the <code>Model.objects</code> Manager; for details see Django's <a href="https://docs.djangoproject.com/en/1.8/topics/db/queries">Making queries</a>. A query returns a <a href="https://docs.djangoproject.com/en/1.8/ref/models/querysets/#django.db.models.query.QuerySet">QuerySet</a>, which does not hit the database right away; it stores the instructions and computes only when the values are actually needed, i.e. lazy evaluation.</p>
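<p>Lazy evaluation can be illustrated without a database at all. The class below is a toy written for this explanation, not Django's QuerySet: <code>LazyQuery</code>, its <code>evaluations</code> counter, and the sample rows are all hypothetical. Calling <code>filter()</code> only records a condition; the data is scanned when the object is iterated.</p>

```python
class LazyQuery:
    """Collects filter conditions and touches the data only when iterated."""

    def __init__(self, rows, conditions=()):
        self._rows = rows
        self._conditions = tuple(conditions)
        self.evaluations = 0  # how many times the data was actually scanned

    def filter(self, **conditions):
        # Return a new query object; nothing is evaluated here.
        return LazyQuery(self._rows, self._conditions + tuple(conditions.items()))

    def __iter__(self):
        self.evaluations += 1
        for row in self._rows:
            if all(row.get(key) == value for key, value in self._conditions):
                yield row


ROWS = [
    {"name": "高坂 穂乃果", "group_name": "μ's"},
    {"name": "平沢 唯", "group_name": "K-ON!"},
]

query = LazyQuery(ROWS).filter(group_name="K-ON!")
before = query.evaluations  # still 0: building the query did no work
result = list(query)        # iteration triggers the scan
after = query.evaluations
print(before, result, after)
```

<p>In an interactive shell a query appears to run immediately, but that is because displaying the result forces the evaluation.</p>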
<p>QuerySet has many methods that map to SQL statements, such as <code>QuerySet.all()</code>, which returns all objects and was used above, or <code>QuerySet.filter()</code> for filtering:</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">Member</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">group_name</span><span class="o">=</span><span class="s1">'K-ON!'</span><span class="p">)</span>
<span class="go">[<Member: 平沢 唯 of K-ON!>]</span>
<span class="gp">>>> </span><span class="n">Member</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">group_name__contains</span><span class="o">=</span><span class="s1">'!'</span><span class="p">)</span>
<span class="go">[<Member: 平沢 唯 of K-ON!>]</span>
</code></pre></div>
<p>Here <code>&lt;field&gt;__contains</code> is the Django ORM's field lookup that corresponds to the SQL <code>LIKE</code> operator.</p>
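<p>To show how a keyword such as <code>group_name__contains</code> could end up as a <code>LIKE</code> clause, here is a small sketch using the standard-library <code>sqlite3</code> module. <code>build_where()</code> is a hypothetical helper written for this example and supports only exact match and <code>contains</code>; it is not how Django compiles queries, and it assumes the field names come from trusted code.</p>

```python
import sqlite3


def build_where(**lookups):
    """Translate field lookups into a WHERE clause and its parameters."""
    clauses, params = [], []
    for key, value in sorted(lookups.items()):
        field, _, op = key.partition("__")
        if op == "":
            clauses.append(field + " = ?")
            params.append(value)
        elif op == "contains":
            # Wildcards inside the value are not escaped in this sketch.
            clauses.append(field + " LIKE ?")
            params.append("%" + value + "%")
        else:
            raise ValueError("unsupported lookup: " + op)
    return " AND ".join(clauses), params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE member (name TEXT, group_name TEXT)")
conn.executemany(
    "INSERT INTO member VALUES (?, ?)",
    [("高坂 穂乃果", "μ's"), ("平沢 唯", "K-ON!")],
)

where, params = build_where(group_name__contains="!")
names = [r[0] for r in conn.execute("SELECT name FROM member WHERE " + where, params)]
print(where, params, names)
```

<p>The double underscore acts as a separator between the field name and the operation, which keeps the whole condition expressible as an ordinary Python keyword argument.</p>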
<p>A few related points. Every Model has a primary key <code>pk</code>, which by default points to the <code>Model.id</code> field. <code>QuerySet.get()</code> returns a single object, and that's where the all-purpose pk comes in handy.</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">Member</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">pk</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="go"><Member: 高坂 穂乃果 of μ's></span>
</code></pre></div>
<p>Following a relation is just as easy:</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">h1</span> <span class="o">=</span> <span class="n">History</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">pk</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">h1</span><span class="o">.</span><span class="n">member</span>
<span class="go"><Member: 高坂 穂乃果 of μ's></span>
<span class="gp">>>> </span><span class="n">h1</span><span class="o">.</span><span class="n">member</span><span class="o">.</span><span class="n">name</span>
<span class="go">'高坂 穂乃果'</span>
</code></pre></div>
<p>Recall that we set <code>related_name="draw_histories"</code> earlier, which means we can look up a member's draw history from the <strong>Member</strong> side:</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">m1</span> <span class="o">=</span> <span class="n">Member</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">pk</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">m1</span><span class="o">.</span><span class="n">draw_histories</span><span class="o">.</span><span class="n">all</span><span class="p">()</span>
<span class="go">[<History: 高坂 穂乃果 at 2015-10-05 15:17:32.061384+00:00>]</span>
</code></pre></div>
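<p>Under the hood, both directions of the relation come down to the foreign key column. The sketch below uses plain <code>sqlite3</code> with assumed table and column names (<code>member</code>, <code>history</code>, <code>member_id</code>); the tables Django creates are named differently, so treat this as an illustration of the idea only.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified schema for illustration.
conn.executescript("""
CREATE TABLE member (id INTEGER PRIMARY KEY, name TEXT, group_name TEXT);
CREATE TABLE history (
    id INTEGER PRIMARY KEY,
    member_id INTEGER REFERENCES member(id),
    time TEXT
);
INSERT INTO member VALUES (1, '高坂 穂乃果', 'μ''s');
INSERT INTO history VALUES (1, 1, '2015-10-05 15:17:32');
""")

# Forward direction (like h1.member.name): follow the foreign key to the member.
forward = conn.execute(
    "SELECT m.name FROM history h JOIN member m ON m.id = h.member_id "
    "WHERE h.id = ?",
    (1,),
).fetchone()

# Reverse direction (like m1.draw_histories.all()): rows pointing at the member.
reverse = conn.execute(
    "SELECT time FROM history WHERE member_id = ?", (1,)
).fetchall()
conn.close()
```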
<p>Finally, let's delete the data:</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">Member</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">all</span><span class="p">()</span><span class="o">.</span><span class="n">delete</span><span class="p">()</span>
<span class="gp">>>> </span><span class="n">History</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">all</span><span class="p">()</span><span class="o">.</span><span class="n">delete</span><span class="p">()</span>
</code></pre></div>
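<p>Of course, at this early stage we could take the brute-force route: delete <code>db.sqlite3</code> altogether and run <code>python manage.py migrate</code> again to recreate all the tables. That only works for SQLite, though. The proper way to empty the database is the <code>flush</code> command:</p>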
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>python<span class="w"> </span>manage.py<span class="w"> </span>flush
<span class="go">You have requested a flush of the database.</span>
<span class="go">This will IRREVERSIBLY DESTROY all data currently in the 'draw_site/db.sqlite3' database,</span>
<span class="go">and return each table to an empty state.</span>
<span class="go">Are you sure you want to do this?</span>
<span class="go"> Type 'yes' to continue, or 'no' to cancel: yes</span>
<span class="go">Installed 0 object(s) from 0 fixture(s)</span>
<span class="go">Installed 0 object(s) from 0 fixture(s)</span>
</code></pre></div>
<h4 id="data-in-orm-and-fixtures">Data in ORM and fixtures</h4>
<p>Let's load the data in <code>members.csv</code> into the database. This needs little explanation.</p>
<div class="highlight"><pre><span></span><code>In [1]: import csv
In [2]: with open('../../draw_member/members.csv', newline='') as f:
   ...:     csv_reader = csv.DictReader(f)
   ...:     members = [
   ...:         (row['名字'], row['團體'])
   ...:         for row in csv_reader
   ...:     ]
In [3]: from draw_member.models import Member
In [4]: for m in members:
   ...:     Member(name=m[0], group_name=m[1]).save()
   ...:
</code></pre></div>
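<p>You can check for yourself that all 14 members were written to the database.</p>
<p>There is one problem, though: we will probably wipe and rebuild the database often, or need to load this data (possibly from many sources) into it. Re-reading and re-writing everything each time works, but is there a way to save the data up front?</p>
<p>This is where <a href="https://docs.djangoproject.com/en/1.8/howto/initial-data/#providing-initial-data-with-fixtures">Django fixtures</a> come in. They can store database contents in formats such as JSON or YAML (the latter requires <a href="http://pyyaml.org/">PyYAML</a>).</p>
<p>Fixtures normally live under <code><app>/fixtures/</code>, so remember to create the directory first.</p>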
<div class="highlight"><pre><span></span><code>mkdir<span class="w"> </span>draw_member/fixtures
</code></pre></div>
<p>To create fixtures from what is in the database, use the <code>dumpdata</code> command:</p>
<div class="highlight"><pre><span></span><code>python<span class="w"> </span>manage.py<span class="w"> </span>dumpdata<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--format<span class="o">=</span>yaml<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--indent<span class="o">=</span><span class="m">4</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--output<span class="w"> </span>draw_member/fixtures/anime_members.yaml<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>draw_member.Member
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="c1"># draw_member/fixtures/anime_members.yaml</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">fields</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">{</span><span class="nt">group_name</span><span class="p">:</span><span class="w"> </span><span class="s">"</span><span class="se">\u03BC</span><span class="s">'s"</span><span class="p p-Indicator">,</span><span class="nt"> name</span><span class="p">:</span><span class="w"> </span><span class="s">"</span><span class="se">\u9AD8\u5742</span><span class="nv"> </span><span class="se">\u7A42\u4E43\u679C</span><span class="s">"</span><span class="p p-Indicator">}</span>
<span class="w"> </span><span class="nt">model</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">draw_member.member</span>
<span class="w"> </span><span class="nt">pk</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">fields</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">{</span><span class="nt">group_name</span><span class="p">:</span><span class="w"> </span><span class="s">"</span><span class="se">\u03BC</span><span class="s">'s"</span><span class="p p-Indicator">,</span><span class="nt"> name</span><span class="p">:</span><span class="w"> </span><span class="s">"</span><span class="se">\u7D62\u702C</span><span class="nv"> </span><span class="se">\u7D75\u91CC</span><span class="s">"</span><span class="p p-Indicator">}</span>
<span class="w"> </span><span class="nt">model</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">draw_member.member</span>
<span class="w"> </span><span class="nt">pk</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">2</span>
<span class="c1"># ...</span>
</code></pre></div>
<p>JSON output works too; just switch to <code>--format=json</code>.</p>
<div class="highlight"><pre><span></span><code><span class="p">[</span>
<span class="p">{</span>
<span class="w"> </span><span class="nt">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"draw_member.member"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"pk"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"fields"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"\u9ad8\u5742 \u7a42\u4e43\u679c"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"group_name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"\u03bc's"</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">},</span>
</code></pre></div>
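<p>A fixture is plain serialized data, so it can be inspected without Django. The sketch below parses a one-record JSON fixture, completed from the excerpt above, with the standard <code>json</code> module. Each record carries three keys: <code>model</code>, <code>pk</code> and <code>fields</code>.</p>

```python
import json

# A complete, one-record version of the excerpt above (raw string keeps the \u escapes).
fixture_text = r"""
[
  {
    "model": "draw_member.member",
    "pk": 1,
    "fields": {
      "name": "\u9ad8\u5742 \u7a42\u4e43\u679c",
      "group_name": "\u03bc's"
    }
  }
]
"""

records = json.loads(fixture_text)
for record in records:
    # Each record names its model, its primary key and the field values.
    fields = record["fields"]
    print(record["model"], record["pk"], fields["name"], fields["group_name"])
```

<p>The <code>\uXXXX</code> escapes are only how the serializer writes non-ASCII characters; once parsed they are ordinary strings again.</p>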
<p>We can clear the database with <code>python manage.py flush</code> and then load the fixture back to simulate reading the data in.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>python<span class="w"> </span>manage.py<span class="w"> </span>loaddata<span class="w"> </span>anime_members.yaml
<span class="go">Installed 14 object(s) from 1 fixture(s)</span>
</code></pre></div>
<p>That about covers storing and retrieving data. For more details, see the <a href="https://docs.djangoproject.com/en/1.8/#the-model-layer">model layer</a> section of the official documentation.</p>
<h3 id="django-template">Django Template</h3>
<p>Before going further, let's make sure our directory structures match.</p>
<div class="highlight"><pre><span></span><code>demo_django_draw_member/
└── draw_site/
    ├── db.sqlite3
    ├── draw_member/
    │   ├── __init__.py
    │   ├── admin.py
    │   ├── fixtures/
    │   │   ├── anime_members.json
    │   │   └── anime_members.yaml
    │   ├── migrations/
    │   │   ├── 0001_initial.py
    │   │   └── __init__.py
    │   ├── models.py
    │   ├── tests.py
    │   ├── urls.py
    │   └── views.py
    ├── draw_site/
    │   ├── __init__.py
    │   ├── settings.py
    │   ├── urls.py
    │   └── wsgi.py
    └── manage.py*
</code></pre></div>
<p>By default, Django looks for templates under <code><app>/templates/</code>. To avoid name clashes across apps, though, we wrap them in one more directory named after the app.</p>
<div class="highlight"><pre><span></span><code>mkdir<span class="w"> </span>-p<span class="w"> </span>draw_member/templates/draw_member
</code></pre></div>
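<p>At first glance Django templates look very similar to the Jinja2 templates used by Flask (Jinja2 was modeled on Django templates). The biggest difference is that Jinja2 lets you call Python functions quite freely, whereas Django relies on template tags and filters. For our example the two are not very different.</p>
<p>As before, create <code>base.html</code> and <code>home.html</code> first. We also write the form now, using GET for the time being.</p>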
<div class="highlight"><pre><span></span><code><span class="c">{# draw_member/templates/draw_member/base.html #}</span>
<span class="cp"><!DOCTYPE html></span>
<span class="p"><</span><span class="nt">html</span> <span class="na">lang</span><span class="o">=</span><span class="s">"en"</span><span class="p">></span>
<span class="p"><</span><span class="nt">head</span><span class="p">></span>
<span class="p"><</span><span class="nt">meta</span> <span class="na">charset</span><span class="o">=</span><span class="s">"UTF-8"</span><span class="p">></span>
<span class="p"><</span><span class="nt">meta</span> <span class="na">name</span><span class="o">=</span><span class="s">"viewport"</span> <span class="na">content</span><span class="o">=</span><span class="s">"width=device-width"</span><span class="p">></span>
<span class="p"><</span><span class="nt">title</span><span class="p">></span><span class="cp">{%</span> <span class="k">block</span> <span class="nv">title</span> <span class="cp">%}</span>抽籤系統<span class="cp">{%</span> <span class="k">endblock</span> <span class="nv">title</span> <span class="cp">%}</span><span class="p"></</span><span class="nt">title</span><span class="p">></span>
<span class="p"></</span><span class="nt">head</span><span class="p">></span>
<span class="p"><</span><span class="nt">body</span><span class="p">></span>
<span class="cp">{%</span> <span class="k">block</span> <span class="nv">content</span> <span class="cp">%}{%</span> <span class="k">endblock</span> <span class="nv">content</span> <span class="cp">%}</span>
<span class="p"><</span><span class="nt">hr</span><span class="p">></span>
<span class="p"><</span><span class="nt">h3</span><span class="p">></span>功能列<span class="p"></</span><span class="nt">h3</span><span class="p">></span>
<span class="p"><</span><span class="nt">ul</span><span class="p">></span>
<span class="p"><</span><span class="nt">li</span><span class="p">><</span><span class="nt">a</span> <span class="na">href</span><span class="o">=</span><span class="s">"</span><span class="cp">{%</span> <span class="k">url</span> <span class="s1">'home'</span> <span class="cp">%}</span><span class="s">"</span><span class="p">></span>首頁(抽籤)<span class="p"></</span><span class="nt">a</span><span class="p">></</span><span class="nt">li</span><span class="p">></span>
<span class="p"><</span><span class="nt">li</span><span class="p">><</span><span class="nt">a</span> <span class="na">href</span><span class="o">=</span><span class="s">"</span><span class="cp">{%</span> <span class="k">url</span> <span class="s1">'history'</span> <span class="cp">%}</span><span class="s">"</span><span class="p">></span>歷史記錄<span class="p"></</span><span class="nt">a</span><span class="p">></</span><span class="nt">li</span><span class="p">></span>
<span class="p"></</span><span class="nt">ul</span><span class="p">></span>
<span class="p"></</span><span class="nt">body</span><span class="p">></span>
<span class="p"></</span><span class="nt">html</span><span class="p">></span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="c">{# draw_member/templates/draw_member/home.html #}</span>
<span class="cp">{%</span> <span class="k">extends</span> <span class="s1">'draw_member/base.html'</span> <span class="cp">%}</span>
<span class="cp">{%</span> <span class="k">block</span> <span class="nv">content</span> <span class="cp">%}</span>
<span class="p"><</span><span class="nt">h1</span><span class="p">></span>來抽出快樂的夥伴吧!<span class="p"></</span><span class="nt">h1</span><span class="p">></span>
<span class="p"><</span><span class="nt">p</span><span class="p">></span>選擇要被抽的團體<span class="p"></</span><span class="nt">p</span><span class="p">></span>
<span class="p"><</span><span class="nt">form</span> <span class="na">action</span><span class="o">=</span><span class="s">"</span><span class="cp">{%</span> <span class="k">url</span> <span class="s1">'draw'</span> <span class="cp">%}</span><span class="s">"</span> <span class="na">method</span><span class="o">=</span><span class="s">"get"</span><span class="p">></span>
<span class="p"><</span><span class="nt">label</span> <span class="na">for</span><span class="o">=</span><span class="s">"group_name"</span><span class="p">></span>團隊名稱:<span class="p"></</span><span class="nt">label</span><span class="p">></span>
<span class="p"><</span><span class="nt">input</span> <span class="na">type</span><span class="o">=</span><span class="s">"radio"</span> <span class="na">name</span><span class="o">=</span><span class="s">"group_name"</span> <span class="na">value</span><span class="o">=</span><span class="s">"μ's"</span><span class="p">></span>μ's
<span class="p"><</span><span class="nt">input</span> <span class="na">type</span><span class="o">=</span><span class="s">"radio"</span> <span class="na">name</span><span class="o">=</span><span class="s">"group_name"</span> <span class="na">value</span><span class="o">=</span><span class="s">"K-ON!"</span><span class="p">></span>K-ON!
<span class="p"><</span><span class="nt">input</span> <span class="na">type</span><span class="o">=</span><span class="s">"radio"</span> <span class="na">name</span><span class="o">=</span><span class="s">"group_name"</span> <span class="na">value</span><span class="o">=</span><span class="s">"ALL"</span> <span class="na">checked</span><span class="p">></span>(全)
<span class="p"><</span><span class="nt">input</span> <span class="na">type</span><span class="o">=</span><span class="s">"submit"</span> <span class="na">value</span><span class="o">=</span><span class="s">"Submit"</span><span class="p">></span>
<span class="p"></</span><span class="nt">form</span><span class="p">></span>
<span class="cp">{%</span> <span class="k">endblock</span> <span class="nv">content</span> <span class="cp">%}</span>
</code></pre></div>
<p>The overall idea should be easy to follow. <code>{% url 'xxxx' %}</code> is the URL resolver: remember the <code>name</code> argument we passed when configuring <code>urls.py</code>? The tag returns the correct URL for that name.</p>
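<p>The idea of resolving a URL by name can be sketched in a few lines of plain Python. The <code>URL_NAMES</code> table and the <code>reverse_url</code> helper are hypothetical and much simpler than Django's resolver; the paths assume the three named patterns used in this post are mounted at the site root.</p>

```python
# Hypothetical name -> path table, mirroring the `name` arguments given in urls.py.
URL_NAMES = {
    "home": "/",
    "draw": "/draw/",
    "history": "/history/",
}


def reverse_url(name):
    """Return the path registered under `name`, or fail loudly if it is unknown."""
    if name not in URL_NAMES:
        raise LookupError("no URL pattern named %r" % name)
    return URL_NAMES[name]


# Templates refer to the name only, so a path can change in one place.
link = '<a href="%s">Home</a>' % reverse_url("home")
```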
<p>While we are at it, update the URL configuration and add these views now, otherwise runserver will complain that it cannot find these URLs.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># draw_member/urls.py</span>
<span class="kn">from</span> <span class="nn">django.conf.urls</span> <span class="kn">import</span> <span class="n">include</span><span class="p">,</span> <span class="n">url</span>
<span class="kn">from</span> <span class="nn">.views</span> <span class="kn">import</span> <span class="n">home</span><span class="p">,</span> <span class="n">draw</span><span class="p">,</span> <span class="n">history</span>
<span class="n">urlpatterns</span> <span class="o">=</span> <span class="p">[</span>
    <span class="n">url</span><span class="p">(</span><span class="sa">r</span><span class="s1">'^$'</span><span class="p">,</span> <span class="n">home</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s2">"home"</span><span class="p">),</span>
    <span class="n">url</span><span class="p">(</span><span class="sa">r</span><span class="s1">'^draw/$'</span><span class="p">,</span> <span class="n">draw</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s2">"draw"</span><span class="p">),</span>
    <span class="n">url</span><span class="p">(</span><span class="sa">r</span><span class="s1">'^history/$'</span><span class="p">,</span> <span class="n">history</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s2">"history"</span><span class="p">)</span>
<span class="p">]</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="c1"># draw_member/views.py</span>
<span class="kn">from</span> <span class="nn">django.shortcuts</span> <span class="kn">import</span> <span class="n">render</span>
<span class="kn">from</span> <span class="nn">django.http</span> <span class="kn">import</span> <span class="n">HttpResponse</span>
<span class="k">def</span> <span class="nf">home</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">HttpResponse</span><span class="p">(</span><span class="s2">"<p>Hello World!</p>"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">draw</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">HttpResponse</span><span class="p">(</span><span class="s2">"<p>Draw</p>"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">history</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">HttpResponse</span><span class="p">(</span><span class="s2">"<p>History</p>"</span><span class="p">)</span>
</code></pre></div>
<p>Next, rewrite the home page so that it uses <code>home.html</code>.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">home</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">render</span><span class="p">(</span><span class="n">request</span><span class="p">,</span> <span class="s1">'draw_member/home.html'</span><span class="p">)</span>
</code></pre></div>
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/10/django-draw-member/pics/django_home.png"/>
<figcaption>The home page with the template applied</figcaption>
</figure>
<p>For more on templates, see the <a href="https://docs.djangoproject.com/en/1.8/#the-template-layer">template layer</a> section of the official documentation.</p>
<h3 id="more-on-djangos-model-template-and-view-mtv">More on Django’s model, template and view (MTV)</h3>
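<p>Let's implement the most important feature: the draw itself.</p>
<p>The thing to understand here is that Django stores the parameters passed via GET / POST in the dict-like <code>request.GET</code> / <code>request.POST</code>, and that <code>@require_GET</code> restricts the view to GET requests only.</p>
<p>The rest of the logic is copied from before.</p>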
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">random</span>
<span class="kn">from</span> <span class="nn">django.shortcuts</span> <span class="kn">import</span> <span class="n">render</span>
<span class="kn">from</span> <span class="nn">django.http</span> <span class="kn">import</span> <span class="n">HttpResponse</span><span class="p">,</span> <span class="n">Http404</span>
<span class="kn">from</span> <span class="nn">django.views.decorators.http</span> <span class="kn">import</span> <span class="n">require_GET</span>
<span class="kn">from</span> <span class="nn">.models</span> <span class="kn">import</span> <span class="n">Member</span><span class="p">,</span> <span class="n">History</span>
<span class="nd">@require_GET</span>
<span class="k">def</span> <span class="nf">draw</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
    <span class="c1"># Retrieve all related members</span>
    <span class="n">group_name</span> <span class="o">=</span> <span class="n">request</span><span class="o">.</span><span class="n">GET</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'group_name'</span><span class="p">,</span> <span class="s1">'ALL'</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">group_name</span> <span class="o">==</span> <span class="s1">'ALL'</span><span class="p">:</span>
        <span class="n">valid_members</span> <span class="o">=</span> <span class="n">Member</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">all</span><span class="p">()</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">valid_members</span> <span class="o">=</span> <span class="n">Member</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">group_name</span><span class="o">=</span><span class="n">group_name</span><span class="p">)</span>
    <span class="c1"># Raise 404 if no members are found given the group name</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">valid_members</span><span class="o">.</span><span class="n">exists</span><span class="p">():</span>
        <span class="k">raise</span> <span class="n">Http404</span><span class="p">(</span><span class="s2">"No member in group '</span><span class="si">%s</span><span class="s2">'"</span> <span class="o">%</span> <span class="n">group_name</span><span class="p">)</span>
    <span class="c1"># Lucky draw</span>
    <span class="n">lucky_member</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">valid_members</span><span class="p">)</span>
    <span class="c1"># Update history</span>
    <span class="n">draw_history</span> <span class="o">=</span> <span class="n">History</span><span class="p">(</span><span class="n">member</span><span class="o">=</span><span class="n">lucky_member</span><span class="p">)</span>
    <span class="n">draw_history</span><span class="o">.</span><span class="n">save</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">HttpResponse</span><span class="p">(</span>
        <span class="s2">"<p></span><span class="si">{0.name}</span><span class="s2">(團體:</span><span class="si">{0.group_name}</span><span class="s2">)</p>"</span>
        <span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">lucky_member</span><span class="p">)</span>
    <span class="p">)</span>
</code></pre></div>
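<p>Written with the ORM this is much cleaner than raw SQL, although you do have to learn the corresponding functions first.
Let's test it right away, again taking a shortcut and skipping the template for now.</p>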
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>curl<span class="w"> </span>-XGET<span class="w"> </span><span class="s2">"localhost:8000/draw/?group_name=ALL"</span>
<span class="go"><p>小泉 花陽(團體:μ's)</p></span>
</code></pre></div>
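<p>If you get there by clicking through from the home page, watch how the URL changes. For example, in <code>http://localhost:8000/draw/?group_name=K-ON!</code> the form selection is written straight into the address bar. This is the most visible difference between GET and POST.</p>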
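<p>To see what the server receives from such a URL, the query string can be parsed with the standard library. This is a rough sketch of what <code>request.GET.get('group_name', 'ALL')</code> does in the view; the <code>get_group_name</code> helper is hypothetical, not part of Django.</p>

```python
from urllib.parse import parse_qs, urlsplit


def get_group_name(url, default="ALL"):
    """Return the group_name query parameter of `url`, or `default` if absent."""
    params = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; use the last one if present.
    return params.get("group_name", [default])[-1]


picked = get_group_name("http://localhost:8000/draw/?group_name=K-ON!")
fallback = get_group_name("http://localhost:8000/draw/")
```

<p>When the parameter is missing, the default <code>'ALL'</code> applies, so the draw is made from every member.</p>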
<p>Next, let's write the history page as well and fill in all the templates.</p>
<div class="highlight"><pre><span></span><code><span class="c">{# draw_member/templates/draw_member/history.html #}</span>
<span class="cp">{%</span> <span class="k">extends</span> <span class="s1">'draw_member/base.html'</span> <span class="cp">%}</span>
<span class="cp">{%</span> <span class="k">block</span> <span class="nv">title</span> <span class="cp">%}</span>抽籤歷史<span class="cp">{%</span> <span class="k">endblock</span> <span class="nv">title</span> <span class="cp">%}</span>
<span class="cp">{%</span> <span class="k">block</span> <span class="nv">content</span> <span class="cp">%}</span>
<span class="p"><</span><span class="nt">h1</span><span class="p">></span>抽籤歷史(最近 10 筆)<span class="p"></</span><span class="nt">h1</span><span class="p">></span>
<span class="p"><</span><span class="nt">table</span><span class="p">></span>
<span class="p"><</span><span class="nt">thead</span><span class="p">></span>
<span class="p"><</span><span class="nt">tr</span><span class="p">></span>
<span class="p"><</span><span class="nt">th</span><span class="p">></span>名字<span class="p"></</span><span class="nt">th</span><span class="p">></span>
<span class="p"><</span><span class="nt">th</span><span class="p">></span>團體<span class="p"></</span><span class="nt">th</span><span class="p">></span>
<span class="p"><</span><span class="nt">th</span><span class="p">></span>抽中時間<span class="p"></</span><span class="nt">th</span><span class="p">></span>
<span class="p"></</span><span class="nt">tr</span><span class="p">></span>
<span class="p"></</span><span class="nt">thead</span><span class="p">></span>
<span class="p"><</span><span class="nt">tbody</span><span class="p">></span>
<span class="cp">{%</span> <span class="k">for</span> <span class="nv">history</span> <span class="k">in</span> <span class="nv">recent_histories</span> <span class="cp">%}</span>
<span class="p"><</span><span class="nt">tr</span><span class="p">></span>
<span class="p"><</span><span class="nt">td</span><span class="p">></span><span class="cp">{{</span> <span class="nv">history.member.name</span> <span class="cp">}}</span><span class="p"></</span><span class="nt">td</span><span class="p">></span>
<span class="p"><</span><span class="nt">td</span><span class="p">></span><span class="cp">{{</span> <span class="nv">history.member.group_name</span> <span class="cp">}}</span><span class="p"></</span><span class="nt">td</span><span class="p">></span>
<span class="p"><</span><span class="nt">td</span><span class="p">></span><span class="cp">{{</span> <span class="nv">history.time</span><span class="o">|</span><span class="nf">date</span><span class="s2">:"r"</span><span class="cp">}}</span><span class="p"></</span><span class="nt">td</span><span class="p">></span>
<span class="p"></</span><span class="nt">tr</span><span class="p">></span>
<span class="cp">{%</span> <span class="k">endfor</span> <span class="cp">%}</span>
<span class="p"></</span><span class="nt">tbody</span><span class="p">></span>
<span class="p"></</span><span class="nt">table</span><span class="p">></span>
<span class="cp">{%</span> <span class="k">endblock</span> <span class="nv">content</span> <span class="cp">%}</span>
</code></pre></div>
<p>Where history.html differs from the original Flask version is the <code>date:"r"</code> filter; a filter's argument goes after the <code>:</code>. Update the corresponding view as well:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">history</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
    <span class="n">recent_draws</span> <span class="o">=</span> <span class="n">History</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">order_by</span><span class="p">(</span><span class="s1">'-time'</span><span class="p">)</span><span class="o">.</span><span class="n">all</span><span class="p">()[:</span><span class="mi">10</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">render</span><span class="p">(</span><span class="n">request</span><span class="p">,</span> <span class="s1">'draw_member/history.html'</span><span class="p">,</span> <span class="p">{</span>
        <span class="s1">'recent_histories'</span><span class="p">:</span> <span class="n">recent_draws</span><span class="p">,</span>
    <span class="p">})</span>
</code></pre></div>
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/10/django-draw-member/pics/django_history.png"/>
</figure>
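<p>As far as I can tell from the Django documentation, the <code>"r"</code> format character used in history.html stands for the RFC 2822 / RFC 5322 date format. The same style of string can be produced with Python's standard library, which gives a feel for what the filter renders. The timestamp below is the one from the earlier shell session, truncated to whole seconds.</p>

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# An aware datetime in UTC, like the values Django stores when time zone support is on.
drawn_at = datetime(2015, 10, 5, 15, 17, 32, tzinfo=timezone.utc)

# RFC 2822 style: weekday, day, month, year, time and numeric UTC offset.
formatted = format_datetime(drawn_at)
print(formatted)  # Mon, 05 Oct 2015 15:17:32 +0000
```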
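<p>As you can see, the UTC time zone is used by default; the details of time zone conversion are left to the end of this post. We can change the time zone used for display inside the view:</p>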
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">django.utils.timezone</span> <span class="kn">import</span> <span class="n">activate</span>
<span class="k">def</span> <span class="nf">history</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
    <span class="n">activate</span><span class="p">(</span><span class="s1">'Asia/Taipei'</span><span class="p">)</span>
    <span class="c1"># ...</span>
</code></pre></div>
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/10/django-draw-member/pics/django_history_tz.png"/>
</figure>
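<p>Activating a time zone changes only how the stored instant is displayed, not the instant itself. A small standard-library sketch of that conversion follows; it assumes a fixed UTC+08:00 offset as a stand-in for <code>'Asia/Taipei'</code> rather than a real time zone database lookup.</p>

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+08:00 offset stands in for 'Asia/Taipei'.
TAIPEI = timezone(timedelta(hours=8), "Asia/Taipei")

stored = datetime(2015, 10, 5, 15, 17, 32, tzinfo=timezone.utc)  # as stored, in UTC
shown = stored.astimezone(TAIPEI)  # same instant, local wall-clock time

print(shown.isoformat())  # 2015-10-05T23:17:32+08:00
```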
<p>That takes care of the basic functionality! As before, see the <a href="https://docs.djangoproject.com/en/1.8/#the-view-layer">view layer</a> section of the official documentation for details.</p>
<h3 id="django-form">Django Form</h3>
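<p>Writing the form directly in the template works, but a form is often closely tied to a model, and once the form has many inputs, reading and writing every field by hand gets unwieldy. Validating user input makes things more complicated still.</p>
<p>Hence Django Form. Let's see right away what it looks like in use.</p>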
<div class="highlight"><pre><span></span><code># draw_member/forms.py
from django import forms
class DrawForm(forms.Form):
    GROUP_CHOICES = [
        ("μ's", "μ's"),
        ("K-ON!", "K-ON!"),
        ("ALL", "(全)"),
    ]
    group = forms.ChoiceField(
        choices=GROUP_CHOICES,
        label='團隊名稱',
        label_suffix=':',
        widget=forms.RadioSelect,
        initial='ALL'
    )
</code></pre></div>
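<p>We created a new form class which, like a Model, declares the properties of each field. Here there is only one field, <code>group</code>, a single-choice ChoiceField. <code>choices</code> is a list of two-item tuples: the first item is the internal value and the second is the label shown to the user. Everything else is fine-tuning.</p>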
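<p>The value/label split in <code>choices</code> can be illustrated without Django. The <code>clean_group</code> helper below is hypothetical; it only mimics the idea that a submitted value is accepted when it matches one of the internal values, and that the label is a separate, display-only string.</p>

```python
GROUP_CHOICES = [
    ("μ's", "μ's"),
    ("K-ON!", "K-ON!"),
    ("ALL", "(全)"),
]


def clean_group(raw_value, choices=GROUP_CHOICES):
    """Hypothetical helper: accept raw_value only if it is an internal value."""
    labels = dict(choices)  # internal value -> display label
    if raw_value not in labels:
        raise ValueError("%r is not one of the available choices" % raw_value)
    return raw_value, labels[raw_value]


value, label = clean_group("ALL")
```

<p>Submitting the label <code>(全)</code> itself would be rejected, because only the internal value <code>ALL</code> is valid input.</p>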
<p>Now use the form in the view: create a new form object <code>form</code> and pass that variable into the template.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">.forms</span> <span class="kn">import</span> <span class="n">DrawForm</span>
<span class="k">def</span> <span class="nf">home</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
    <span class="n">form</span> <span class="o">=</span> <span class="n">DrawForm</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">render</span><span class="p">(</span><span class="n">request</span><span class="p">,</span> <span class="s1">'draw_member/home.html'</span><span class="p">,</span> <span class="p">{</span>
        <span class="s1">'form'</span><span class="p">:</span> <span class="n">form</span><span class="p">,</span>
    <span class="p">})</span>
</code></pre></div>
<p>Then modify the template. There is no need to write the form contents by hand any more; replace them with <code>{{ form }}</code> and Django generates them automatically.</p>
<div class="highlight"><pre><span></span><code><span class="c">{# draw_member/home.html #}</span>
<span class="cp">{%</span> <span class="k">extends</span> <span class="s1">'draw_member/base.html'</span> <span class="cp">%}</span>
<span class="cp">{%</span> <span class="k">block</span> <span class="nv">content</span> <span class="cp">%}</span>
<span class="p"><</span><span class="nt">h1</span><span class="p">></span>來抽出快樂的夥伴吧!<span class="p"></</span><span class="nt">h1</span><span class="p">></span>
<span class="p"><</span><span class="nt">p</span><span class="p">></span>選擇要被抽的團體<span class="p"></</span><span class="nt">p</span><span class="p">></span>
<span class="p"><</span><span class="nt">form</span> <span class="na">action</span><span class="o">=</span><span class="s">"</span><span class="cp">{%</span> <span class="k">url</span> <span class="s1">'draw'</span> <span class="cp">%}</span><span class="s">"</span> <span class="na">method</span><span class="o">=</span><span class="s">"get"</span><span class="p">></span>
<span class="cp">{{</span> <span class="nv">form</span> <span class="cp">}}</span>
<span class="p"><</span><span class="nt">input</span> <span class="na">type</span><span class="o">=</span><span class="s">"submit"</span> <span class="na">value</span><span class="o">=</span><span class="s">"Submit"</span><span class="p">></span>
<span class="p"></</span><span class="nt">form</span><span class="p">></span>
<span class="cp">{%</span> <span class="k">endblock</span> <span class="nv">content</span> <span class="cp">%}</span>
</code></pre></div>
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/10/django-draw-member/pics/django_form.png"/>
</figure>
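<p>This does not look like our original form, though. Fortunately Django forms are flexible, and the way a form is rendered to HTML can be adjusted in detail; see <a href="https://docs.djangoproject.com/en/1.8/topics/forms/#form-rendering-options">Form rendering options</a> in the official documentation. I will just give the tuned result.</p>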
<div class="highlight"><pre><span></span><code> <span class="p"><</span><span class="nt">form</span> <span class="na">action</span><span class="o">=</span><span class="s">"</span><span class="cp">{%</span> <span class="k">url</span> <span class="s1">'draw'</span> <span class="cp">%}</span><span class="s">"</span> <span class="na">method</span><span class="o">=</span><span class="s">"get"</span><span class="p">></span>
<span class="cp">{{</span> <span class="nv">form.group.label_tag</span> <span class="cp">}}</span>
<span class="cp">{%</span> <span class="k">for</span> <span class="nv">radio</span> <span class="k">in</span> <span class="nv">form.group</span> <span class="cp">%}</span>
<span class="cp">{{</span> <span class="nv">radio.tag</span> <span class="cp">}}{{</span> <span class="nv">radio.choice_label</span> <span class="cp">}}</span>
<span class="cp">{%</span> <span class="k">endfor</span> <span class="cp">%}</span>
<span class="p"><</span><span class="nt">input</span> <span class="na">type</span><span class="o">=</span><span class="s">"submit"</span> <span class="na">value</span><span class="o">=</span><span class="s">"Submit"</span><span class="p">></span>
<span class="p"></</span><span class="nt">form</span><span class="p">></span>
</code></pre></div>
<p>Compare the rendered output with the template to see which HTML element each <code>{{ ... }}</code> piece produces.</p>
<h4 id="more-django-form-in-view">More Django form in view</h4>
<p>Forms can do more than that. When creating a DrawForm, you can pass <code>request.GET</code> in directly.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># draw_member/views.py</span>
<span class="k">def</span> <span class="nf">draw</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
<span class="c1"># Retrieve all related members</span>
<span class="n">form</span> <span class="o">=</span> <span class="n">DrawForm</span><span class="p">(</span><span class="n">request</span><span class="o">.</span><span class="n">GET</span><span class="p">)</span>
<span class="k">if</span> <span class="n">form</span><span class="o">.</span><span class="n">is_valid</span><span class="p">():</span>
<span class="n">group_name</span> <span class="o">=</span> <span class="n">form</span><span class="o">.</span><span class="n">cleaned_data</span><span class="p">[</span><span class="s1">'group'</span><span class="p">]</span>
<span class="k">if</span> <span class="n">group_name</span> <span class="o">==</span> <span class="s1">'ALL'</span><span class="p">:</span>
<span class="n">valid_members</span> <span class="o">=</span> <span class="n">Member</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">all</span><span class="p">()</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">valid_members</span> <span class="o">=</span> <span class="n">Member</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">group_name</span><span class="o">=</span><span class="n">group_name</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="c1"># Raise 404 if no members are found given the group name</span>
<span class="k">raise</span> <span class="n">Http404</span><span class="p">(</span><span class="s2">"No member in group '</span><span class="si">%s</span><span class="s2">'"</span> <span class="o">%</span>
<span class="n">form</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'group'</span><span class="p">,</span> <span class="s1">''</span><span class="p">))</span>
<span class="c1"># Lucky draw</span>
<span class="n">lucky_member</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">valid_members</span><span class="p">)</span>
<span class="c1"># ...</span>
</code></pre></div>
<p><code>form.is_valid()</code> checks whether the data in every field is valid.</p>
<p>While we're at it, let's give /draw a template as well.</p>
<div class="highlight"><pre><span></span><code><span class="c">{# draw_member/draw.html #}</span>
<span class="cp">{%</span> <span class="k">extends</span> <span class="s1">'draw_member/base.html'</span> <span class="cp">%}</span>
<span class="cp">{%</span> <span class="k">block</span> <span class="nv">title</span> <span class="cp">%}</span>抽籤結果<span class="cp">{%</span> <span class="k">endblock</span> <span class="nv">title</span> <span class="cp">%}</span>
<span class="cp">{%</span> <span class="k">block</span> <span class="nv">content</span> <span class="cp">%}</span>
<span class="p"><</span><span class="nt">h1</span><span class="p">></span>抽籤結果<span class="p"></</span><span class="nt">h1</span><span class="p">></span>
<span class="p"><</span><span class="nt">p</span><span class="p">></span><span class="cp">{{</span> <span class="nv">lucky_member.name</span> <span class="cp">}}</span>(團體:<span class="cp">{{</span> <span class="nv">lucky_member.group_name</span> <span class="cp">}}</span>)<span class="p"></</span><span class="nt">p</span><span class="p">></span>
<span class="cp">{%</span> <span class="k">endblock</span> <span class="nv">content</span> <span class="cp">%}</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="c1"># draw_member/views.py</span>
<span class="k">def</span> <span class="nf">draw</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
<span class="c1"># ...</span>
<span class="k">return</span> <span class="n">render</span><span class="p">(</span><span class="n">request</span><span class="p">,</span> <span class="s1">'draw_member/draw.html'</span><span class="p">,</span> <span class="p">{</span>
<span class="s1">'lucky_member'</span><span class="p">:</span> <span class="n">lucky_member</span>
<span class="p">})</span>
</code></pre></div>
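<p>For more on forms, again see the <a href="https://docs.djangoproject.com/en/1.8/#forms">official documentation</a>.</p>
<h3 id="_2">Summary</h3>
<p>The finished project is on <a href="https://github.com/ccwang002/draw_member_django">GitHub</a>; follow the README to set up the environment.</p>
<p>That covers the most basic parts of Django: Model, View, Template, and Form. You can tell that Django provides far more features than Flask, which also means it takes more time to learn. In the end our own code is quite short after the rewrite; the code that exists just for structure may well be longer.</p>
<p>Of course, this doesn't mean you have learned Django. To wrap up, here are a few resources for continuing with Django:</p>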
<ul>
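<li><a href="https://github.com/uranusjr/django-tutorial-for-programmers">《為程式人寫的 Django Tutorial 》</a> (a Django tutorial written for programmers) is a true zero-to-one, 30-day study plan (although it took me several months T___T). Reading it again with this lucky-draw app in mind should make Django's overall design clearer. Author: Tzu-ping Chung (@uranusjr)</li>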
<li><a href="http://masteringdjango.com"><em>Mastering Django: Core</em></a>, the successor to <a href="http://www.djangobook.com/en/2.0/index.html"><em>The Django Book</em></a> last updated in 2009, is the definitive guide to Django targeting the latest Django version 1.8 at the time of writing.</li>
</ul>
<p>For more branches of the Django skill tree, see TP's <a href="https://github.com/uranusjr/django-tutorial-for-programmers/blob/master/30-moving-on.md">lesson 30</a>.</p>
<h3 id="details">Details</h3>
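<p>As in the Flask post, this section collects details and improvements that were moved here to keep the main text from getting too long (it is already too long).</p>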
<h4 id="raw-sql">Raw SQL</h4>
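<p>The introduction to Django models went straight to the ORM, but Django can in fact run raw SQL, and there is even a "smart" flavor of raw SQL that returns the corresponding model objects. Let's see how it works.</p>
<p>First the smart raw SQL: use <code>Model.objects.raw</code> to fetch all members whose group name starts with K-ON.</p>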
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="nb">list</span><span class="p">(</span><span class="n">Member</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">raw</span><span class="p">(</span><span class="s2">"""</span>
<span class="gp">... </span><span class="s2">SELECT id, name, group_name</span>
<span class="gp">... </span><span class="s2">FROM draw_member_member</span>
<span class="gp">... </span><span class="s2">WHERE group_name LIKE 'K-ON</span><span class="si">%%</span><span class="s2">'</span>
<span class="gp">... </span><span class="s2">"""</span><span class="p">))</span>
<span class="go">[<Member: 平沢 唯 of K-ON!>,</span>
<span class="go"> <Member: 秋山 澪 of K-ON!>,</span>
<span class="go"> <Member: 田井中 律 of K-ON!>,</span>
<span class="go"> <Member: 琴吹 紬 of K-ON!>,</span>
<span class="go"> <Member: 中野 梓 of K-ON!>]</span>
</code></pre></div>
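<p>This returns a RawQuerySet whose items are also Member objects. Django maps rows to objects through the primary key, so the SQL query passed to raw() must always include the primary key. Note that the <code>%</code> has to be escaped, because the SQL query passed to raw() can take parameters (just like Python's built-in str %-formatting).</p>
<p>But how do we know which table stores Member? The default is <code><app>_<model></code>; the actual name is kept in <code>db_table</code> in the model's meta options, where it can also be overridden.</p>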
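<p>The escaping rule can be tried out with plain Python %-formatting. This is only an analogy, a minimal sketch: Django hands the parameters to the database driver rather than formatting the string itself, and real values should go through the <code>params</code> argument of <code>raw()</code> instead of being formatted into the SQL. The query text and the value <code>3</code> are made up for illustration.</p>

```python
# Analogy only: str %-formatting follows the same "%% means a literal %" rule.
escaped = "SELECT id FROM draw_member_member WHERE group_name LIKE 'K-ON%%' AND id > %s"
rendered = escaped % (3,)

# Without the escape, the lone % is read as the start of a placeholder and fails.
unescaped = "SELECT id FROM draw_member_member WHERE group_name LIKE 'K-ON%' AND id > %s"
try:
    unescaped % (3,)
    unescaped_failed = False
except (ValueError, TypeError):
    unescaped_failed = True

print(rendered)
print(unescaped_failed)
```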
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">Member</span><span class="o">.</span><span class="n">_meta</span><span class="o">.</span><span class="n">db_table</span>
<span class="go">'draw_member_member'</span>
</code></pre></div>
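<p>Member has fields such as name and group_name, and a query does not have to list all of them in SELECT. The fields left out are in a deferred state: Django only asks the database for them when their values are actually accessed. In ordinary use you won't notice the difference.</p>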
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">m</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">Member</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">raw</span><span class="p">(</span>
<span class="gp">... </span> <span class="s2">"SELECT id FROM draw_member_member"</span>
<span class="gp">... </span><span class="p">))[</span><span class="mi">0</span><span class="p">]</span>
<span class="gp">>>> </span><span class="nb">type</span><span class="p">(</span><span class="n">m</span><span class="p">)</span>
<span class="go">draw_member.models.Member_Deferred_group_name_name</span>
<span class="gp">>>> </span><span class="n">m</span><span class="o">.</span><span class="n">get_deferred_fields</span><span class="p">()</span>
<span class="go">{'group_name', 'name'}</span>
</code></pre></div>
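<p>But what if I simply don't want the ORM: it's slow, and it can't express complex queries (let the flame war begin). Then we are back to the traditional concepts of a database connection and a cursor, just like Flask without SQLAlchemy.</p>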
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">django.db</span> <span class="kn">import</span> <span class="n">connection</span>
<span class="gp">>>> </span><span class="n">c</span> <span class="o">=</span> <span class="n">connection</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
<span class="gp">>>> </span><span class="nb">list</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">"""</span>
<span class="gp">... </span><span class="s2">SELECT name</span>
<span class="gp">... </span><span class="s2">FROM draw_member_member</span>
<span class="gp">... </span><span class="s2">WHERE group_name LIKE </span><span class="si">%s</span>
<span class="gp">... </span><span class="s2">"""</span><span class="p">,</span> <span class="p">[</span><span class="s2">"K-ON%"</span><span class="p">]))</span>
<span class="go">[('平沢 唯',), ('秋山 澪',), ('田井中 律',), ('琴吹 紬',), ('中野 梓',)]</span>
<span class="gp">>>> </span><span class="nb">list</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">"""</span>
<span class="gp">... </span><span class="s2">SELECT member_id, time</span>
<span class="gp">... </span><span class="s2">FROM draw_member_history</span>
<span class="gp">... </span><span class="s2">LIMIT 3</span>
<span class="gp">... </span><span class="s2">"""</span><span class="p">))</span>
<span class="go">[(8, datetime.datetime(2015, 10, 5, 17, 36, 41, 608078, tzinfo=<UTC>)),</span>
<span class="go"> (11, datetime.datetime(2015, 10, 5, 17, 37, 26, 164830, tzinfo=<UTC>)),</span>
<span class="go"> (11, datetime.datetime(2015, 10, 5, 17, 37, 37, 483697, tzinfo=<UTC>))]</span>
</code></pre></div>
<p>Here you go.</p>
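<p>When a LIKE pattern is passed as a parameter, the wildcard has to be part of the parameter value. The sketch below shows this with the standard-library <code>sqlite3</code> module and an in-memory table, so it uses <code>?</code> placeholders instead of Django's <code>%s</code>. The table layout is reduced to what the example needs, the "Other group" row is made up, and <code>names_in_group</code> is a hypothetical helper.</p>

```python
import sqlite3

# In-memory stand-in for the members table (simplified, assumed layout).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE draw_member_member ("
    "id INTEGER PRIMARY KEY, name TEXT, group_name TEXT)"
)
conn.executemany(
    "INSERT INTO draw_member_member (name, group_name) VALUES (?, ?)",
    [
        ("平沢 唯", "K-ON!"),
        ("秋山 澪", "K-ON!"),
        ("Example member", "Other group"),  # made-up row for contrast
    ],
)


def names_in_group(pattern):
    """Return member names whose group_name matches the LIKE pattern."""
    rows = conn.execute(
        "SELECT name FROM draw_member_member "
        "WHERE group_name LIKE ? ORDER BY id",
        (pattern,),
    )
    return [name for (name,) in rows]


# Without a wildcard, LIKE must match the whole value, so "K-ON!" is missed.
exact = names_in_group("K-ON")
# The wildcard travels inside the parameter value.
prefix = names_in_group("K-ON%")
print(exact, prefix)
```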
<h4 id="better-queryset">Better QuerySet</h4>
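<p>Having seen raw SQL, let's think about how to improve our use of the ORM. We could assemble a query every time we need one, much like writing SQL, but the benefit of an ORM should be that it wraps up such implementation details. Examples are the most recent n draw records, or the group names of all members (currently hard-coded in DrawForm).</p>
<p>Common queries can then become methods; for example, the 10 most recent draw records are just <code>History.objects.recent(10)</code>.</p>
<p>There are several ways to do this: write a classmethod, override the default Manager, or override the default QuerySet. Which is better? It has been discussed on <a href="http://stackoverflow.com/a/2213341">StackOverflow</a> and on the <a href="https://groups.google.com/forum/#!topic/django-users/0WSdnWFTuUg">mailing list</a>. All three achieve the same effect, but the latter two are preferred: a Manager (or a QuerySet for Django 1.7+) handles operations at the level of the database table that the model maps to, whereas a classmethod should handle operations on a single model object already taken out of a table row. If code of the same nature is to be kept together, use a Manager (QuerySet).</p>
<p>TP also weighed in on Gitter, so that settles it (case closed). Let's rewrite the models.</p>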
<div class="highlight"><pre><span></span><code><span class="c1"># draw_member/models.py</span>
<span class="k">class</span> <span class="nc">MemberQuerySet</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">QuerySet</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">unique_groups</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">values_list</span><span class="p">(</span><span class="s1">'group_name'</span><span class="p">,</span> <span class="n">flat</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span><span class="o">.</span><span class="n">distinct</span><span class="p">()</span>
<span class="k">class</span> <span class="nc">HistoryQuerySet</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">QuerySet</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">recent</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">order_by</span><span class="p">(</span><span class="s1">'-time'</span><span class="p">)[:</span><span class="n">n</span><span class="p">]</span>
<span class="k">class</span> <span class="nc">Member</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">):</span>
<span class="c1"># ...</span>
<span class="n">objects</span> <span class="o">=</span> <span class="n">MemberQuerySet</span><span class="o">.</span><span class="n">as_manager</span><span class="p">()</span>
<span class="k">class</span> <span class="nc">History</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">):</span>
<span class="c1"># ...</span>
<span class="n">objects</span> <span class="o">=</span> <span class="n">HistoryQuerySet</span><span class="o">.</span><span class="n">as_manager</span><span class="p">()</span>
</code></pre></div>
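<p>On Member we define <code>unique_groups</code>, which returns all group names; on History we define <code>recent</code>, which returns the n most recent records ordered by time. The manager created by <code>QuerySet.as_manager()</code> replaces the original <code>Model.objects</code>.</p>
<p>Next, rewrite the view to replace the query we wrote earlier.</p>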
<div class="highlight"><pre><span></span><code><span class="c1">#draw_member/views.py</span>
<span class="k">def</span> <span class="nf">history</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
<span class="c1"># ...</span>
<span class="n">recent_draws</span> <span class="o">=</span> <span class="n">History</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">recent</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="c1"># ...</span>
</code></pre></div>
<p>That is a bit cleaner. While we're at it, let's make the form more flexible so that the group names are no longer hard-coded.</p>
<div class="highlight"><pre><span></span><code><span class="c1">#draw_member/forms.py</span>
<span class="kn">from</span> <span class="nn">.models</span> <span class="kn">import</span> <span class="n">Member</span>
<span class="k">def</span> <span class="nf">member_group_choices</span><span class="p">():</span>
<span class="n">valid_groups</span> <span class="o">=</span> <span class="n">Member</span><span class="o">.</span><span class="n">objects</span><span class="o">.</span><span class="n">unique_groups</span><span class="p">()</span>
<span class="n">choices</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">grp</span> <span class="ow">in</span> <span class="n">valid_groups</span><span class="p">:</span>
<span class="n">choices</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">grp</span><span class="p">,</span> <span class="n">grp</span><span class="p">))</span>
<span class="n">choices</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="s1">'ALL'</span><span class="p">,</span> <span class="s1">'(全)'</span><span class="p">))</span>
<span class="k">return</span> <span class="n">choices</span>
<span class="k">class</span> <span class="nc">DrawForm</span><span class="p">(</span><span class="n">forms</span><span class="o">.</span><span class="n">Form</span><span class="p">):</span>
<span class="n">group</span> <span class="o">=</span> <span class="n">forms</span><span class="o">.</span><span class="n">ChoiceField</span><span class="p">(</span>
<span class="n">choices</span><span class="o">=</span><span class="n">member_group_choices</span><span class="p">,</span>
<span class="c1"># ...</span>
<span class="p">)</span>
</code></pre></div>
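<p>The choice-building loop can also be written as a small function that takes any iterable of group names, which makes it easy to try without a database. This is only a sketch: <code>build_group_choices</code> is a hypothetical helper, not part of the project.</p>

```python
def build_group_choices(group_names):
    """Turn group names into (value, label) pairs and append the catch-all."""
    choices = [(name, name) for name in group_names]
    # 'ALL' is the special value the draw view checks for.
    choices.append(("ALL", "(全)"))
    return choices


example_choices = build_group_choices(["K-ON!"])
print(example_choices)
```

<p>In the form, <code>member_group_choices</code> could then simply return <code>build_group_choices(Member.objects.unique_groups())</code>.</p>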
<h4 id="timezone">Timezone</h4>
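<p>It feels like I have been writing about <a href="../../09/datetime-sqlite/">time zones</a> a lot lately. Basically, if the server records all times in UTC there are far fewer problems, but in the end the time still has to be shown in the user's time zone.</p>
<p>The trouble is that HTTP headers carry no such information, so you either guess with GeoIP or write some JavaScript that detects the time zone when the user loads the page. Either way, it is something that has to be recorded separately. The <a href="https://docs.djangoproject.com/en/1.8/topics/i18n/timezones/#selecting-the-current-time-zone">official documentation</a> covers the details.</p>
<p>The main text used <code>activate('Asia/Taipei')</code> to switch the time zone to UTC+8. Here is another way, written in the template.</p>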
<div class="highlight"><pre><span></span><code><span class="c">{# draw_member/templates/draw_member/history.html #}</span>
<span class="cp">{%</span> <span class="k">block</span> <span class="nv">content</span> <span class="cp">%}</span>
<span class="cp">{%</span> <span class="k">load</span> <span class="nv">tz</span> <span class="cp">%}</span>
<span class="p"><</span><span class="nt">h1</span><span class="p">></span>抽籤歷史(最近 10 筆)<span class="p"></</span><span class="nt">h1</span><span class="p">></span>
<span class="p"><</span><span class="nt">table</span><span class="p">></span>
<span class="c">{# ... #}</span>
<span class="p"><</span><span class="nt">tbody</span><span class="p">></span>
<span class="cp">{%</span> <span class="k">timezone</span> <span class="s1">'Asia/Taipei'</span> <span class="cp">%}</span>
<span class="cp">{%</span> <span class="k">for</span> <span class="nv">history</span> <span class="k">in</span> <span class="nv">recent_histories</span> <span class="cp">%}</span>
<span class="p"><</span><span class="nt">tr</span><span class="p">></span><span class="c">{# ... #}</span><span class="p"></</span><span class="nt">tr</span><span class="p">></span>
<span class="cp">{%</span> <span class="k">endfor</span> <span class="cp">%}</span>
<span class="cp">{%</span> <span class="k">endtimezone</span> <span class="cp">%}</span>
<span class="p"></</span><span class="nt">tbody</span><span class="p">></span>
<span class="p"></</span><span class="nt">table</span><span class="p">></span>
<span class="cp">{%</span> <span class="k">endblock</span> <span class="nv">content</span> <span class="cp">%}</span>
</code></pre></div>
<h4 id="post-form-and-csrf">POST form and CSRF</h4>
<p>I forgot to mention: our form currently uses <code>method="get"</code>. It can of course be switched back to POST, and that is easy: just replace GET with POST.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># draw_site/views.py</span>
<span class="kn">from</span> <span class="nn">django.views.decorators.http</span> <span class="kn">import</span> <span class="n">require_POST</span>
<span class="nd">@require_POST</span>
<span class="k">def</span> <span class="nf">draw</span><span class="p">(</span><span class="n">request</span><span class="p">):</span>
<span class="c1"># Retrieve all related members</span>
<span class="n">form</span> <span class="o">=</span> <span class="n">DrawForm</span><span class="p">(</span><span class="n">request</span><span class="o">.</span><span class="n">POST</span><span class="p">)</span>
<span class="c1"># ...</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="c">{# draw_site/templates/home.html #}</span>
<span class="p"><</span><span class="nt">form</span> <span class="na">action</span><span class="o">=</span><span class="s">"</span><span class="cp">{%</span> <span class="k">url</span> <span class="s1">'draw'</span> <span class="cp">%}</span><span class="s">"</span> <span class="na">method</span><span class="o">=</span><span class="s">"post"</span><span class="p">></span>
</code></pre></div>
<p>Let's try it right away.</p>
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/10/django-draw-member/pics/django_csrf_failed.png"/>
<figcaption>POST form without CSRF token</figcaption>
</figure>
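<p>We get a 403 Forbidden: “CSRF verification failed.” CSRF (Cross Site Request Forgery) is explained more fully on <a href="https://zh.wikipedia.org/wiki/%E8%B7%A8%E7%AB%99%E8%AF%B7%E6%B1%82%E4%BC%AA%E9%80%A0">Wikipedia</a>. It is an attack in which, after a user has logged in to a site (the session is in a logged-in state), the attacker forges a form identical to one on the site to impersonate the user. Take buying tickets on a ticketing system: without a check, I could ride on the user's session to buy tickets at will, and the site would believe the user did it.</p>
<p>A <a href="https://docs.djangoproject.com/en/1.8/ref/csrf/">CSRF token</a> guards against this forgery. When generating a form, the site also generates a value for an extra field. This value changes every time, which confirms that the form really came from the site. Django handles CSRF protection through <a href="https://docs.djangoproject.com/en/1.8/topics/http/middleware/">Middleware</a>, a concept not mentioned so far, which indicates that it sits at a lower level. Accordingly, we don't need to change our code; in this example we only need to add <code>{% csrf_token %}</code> to the form.</p>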
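<p>The idea behind the token check can be sketched with the standard library alone. This is a simplified illustration, not Django's actual implementation: <code>issue_token</code> and <code>is_valid_submission</code> are hypothetical names, and a real middleware also has to decide where the expected token is stored.</p>

```python
import hmac
import secrets


def issue_token():
    """Create an unpredictable token to embed in a form as a hidden field."""
    return secrets.token_urlsafe(32)


def is_valid_submission(expected_token, submitted_token):
    """Accept the form only if it carries the token the site issued."""
    if not submitted_token:
        return False
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(
        expected_token.encode("utf-8"), submitted_token.encode("utf-8")
    )


session_token = issue_token()
genuine_ok = is_valid_submission(session_token, session_token)
forged_ok = is_valid_submission(session_token, issue_token())
missing_ok = is_valid_submission(session_token, "")
print(genuine_ok, forged_ok, missing_ok)
```

<p>A forged form cannot know the token issued to the user, so its submission is rejected. With Django, the template tag takes care of all of this.</p>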
<div class="highlight"><pre><span></span><code><span class="c">{# draw_site/templates/home.html #}</span>
<span class="p"><</span><span class="nt">form</span> <span class="na">action</span><span class="o">=</span><span class="s">"</span><span class="cp">{%</span> <span class="k">url</span> <span class="s1">'draw'</span> <span class="cp">%}</span><span class="s">"</span> <span class="na">method</span><span class="o">=</span><span class="s">"post"</span><span class="p">></span>
<span class="c">{# ... #}</span>
<span class="cp">{%</span> <span class="k">csrf_token</span> <span class="cp">%}</span>
<span class="p"><</span><span class="nt">input</span> <span class="na">type</span><span class="o">=</span><span class="s">"submit"</span> <span class="na">value</span><span class="o">=</span><span class="s">"Submit"</span><span class="p">></span>
<span class="p"></</span><span class="nt">form</span><span class="p">></span>
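</code></pre></div>Datetime in SQLite and Python2015-09-28T12:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2015-09-28:/posts/2015/09/datetime-sqlite/<p>Notes on handling time zones in Python, and on storing and reading time-zone-aware times in SQLite</p><p>Handling time correctly is not easy. Continuing from <a href="../flask-draw-member">our earlier example</a>, what we actually did was store the time in SQLite as a string converted directly from the datetime. This has several problems.</p>
<p>The first is the time zone. We stored the time in the server's local time zone directly in the database; Taipei's time zone is <a href="https://en.wikipedia.org/wiki/Asia/Taipei">Asia/Taipei</a> (UTC+8). If the server moves to another time zone, say Tokyo, Asia/Tokyo (UTC+9), the database now holds times from two time zones, and because they are strings the difference is invisible.</p>
<p>Storing time as strings has problems of its own. First, sorting: our example happens to sort correctly, but if the time format changes (say to <code>%H:%M:%S %Y-%m-%d</code>) that is no longer guaranteed. Second, any later processing of the time becomes more complicated. Part of this is because SQLite has no dedicated date/time data type, whereas PostgreSQL, for example, understands them.</p>
<p>So handling time correctly comes down to a few points:</p>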
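<p>The sorting problem is easy to reproduce. The sketch below formats two made-up moments in both layouts and checks whether sorting the strings keeps them in chronological order.</p>

```python
from datetime import datetime

# Two moments, already in chronological order (made-up values).
moments = [
    datetime(2015, 9, 29, 23, 0, 0),
    datetime(2015, 9, 30, 1, 0, 0),
]

date_first = [m.strftime("%Y-%m-%d %H:%M:%S") for m in moments]
time_first = [m.strftime("%H:%M:%S %Y-%m-%d") for m in moments]

# String sorting is chronological only when the most significant
# part (the year) comes first.
date_first_ok = sorted(date_first) == date_first
time_first_ok = sorted(time_first) == time_first
print(date_first_ok, time_first_ok)
```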
<ul>
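<li>Times stored in the database should be expressed in UTC</li>
<li>When working with times (sorting, displaying, handling time zones), convert them to a proper data type (for example <a href="https://docs.python.org/3.5/library/datetime.html#datetime-objects">datetime</a>)</li>
<li>Convert to the time zone of the user (or the server) only when presenting the time</li>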
</ul>
<p>Below is a more correct way to handle time.</p>
<div class="toc">
<ul>
<li><a href="#timezone">Time zones</a></li>
<li><a href="#datetime-in-sqlite-again">Datetime in SQLite, again</a><ul>
<li><a href="#python-3-timezone">Built-in timezone support in Python 3</a></li>
</ul>
</li>
<li><a href="#python-sqlite-adapter">Making Python's built-in SQLite adapter support time zones</a></li>
<li><a href="#_1">Summary</a></li>
</ul>
</div>
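<h3 id="timezone">Time zones</h3>
<p>We haven't dealt with time zones yet. In Python's built-in datetime, a time zone is only a "concept": users can pass in different time zones (stored in <code>datetime.tzinfo</code>), and Python handles datetimes that carry a time zone correctly. But it does not know what Taipei's time zone is, or New York's.</p>
<p>Why not handle something as important as local time zones? Because time zone rules change quickly, and daylight saving time may differ from year to year; the time zone data would be updated many times before the next Python release comes out.</p>
<p>So in practice, time zone handling in Python relies on the third-party package <a href="http://pythonhosted.org/pytz/">pytz</a>. As with Flask, <code>pip install pytz</code> is all it takes.</p>
<p>Let's try it.</p>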
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>
<span class="gp">>>> </span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span> <span class="c1"># local time</span>
<span class="go">datetime.datetime(2015, 9, 29, 16, 33, 39, 537111)</span>
<span class="gp">>>> </span><span class="n">datetime</span><span class="o">.</span><span class="n">utcnow</span><span class="p">()</span> <span class="c1"># UTC time</span>
<span class="go">datetime.datetime(2015, 9, 29, 8, 33, 39, 538745)</span>
</code></pre></div>
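<p>First, datetime itself provides two functions, <code>now()</code> and <code>utcnow()</code>, to get the current time. Taipei is UTC+8, so its wall-clock time is 8 hours ahead of UTC. Note that neither of the returned datetime objects carries any time zone information.</p>
<p>As a rule, handle time with UTC as the baseline. We store the current UTC time in the variable <code>utcnow</code> and process it with pytz: import pytz and define two time zones, UTC and TPE (Taipei time).</p>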
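<p>The difference between a datetime with and without time zone information can be seen with the standard library alone. A minimal sketch, using the built-in <code>timezone.utc</code> instead of pytz: an aware datetime carries a <code>tzinfo</code>, a naive one does not, and Python refuses to order one against the other.</p>

```python
from datetime import datetime, timezone

# An aware datetime: the current time with UTC attached.
aware_now = datetime.now(timezone.utc)
# The same wall-clock values with the time zone stripped (naive).
naive_now = aware_now.replace(tzinfo=None)

# Ordering a naive datetime against an aware one raises TypeError.
try:
    naive_now < aware_now
    comparable = True
except TypeError:
    comparable = False

print(aware_now.tzinfo, naive_now.tzinfo, comparable)
```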
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="kn">import</span> <span class="nn">pytz</span>
<span class="gp">>>> </span><span class="n">utc</span> <span class="o">=</span> <span class="n">pytz</span><span class="o">.</span><span class="n">utc</span>
<span class="gp">>>> </span><span class="n">tpe</span> <span class="o">=</span> <span class="n">pytz</span><span class="o">.</span><span class="n">timezone</span><span class="p">(</span><span class="s1">'Asia/Taipei'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">utcnow</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">utcnow</span><span class="p">()</span>
<span class="gp">>>> </span><span class="n">utcnow</span>
<span class="go">datetime.datetime(2015, 9, 29, 8, 38, 14, 738241)</span>
<span class="gp">>>> </span><span class="n">utc</span><span class="o">.</span><span class="n">fromutc</span><span class="p">(</span><span class="n">utcnow</span><span class="p">)</span>
<span class="go">datetime.datetime(2015, 9, 29, 8, 38, 14, 738241, tzinfo=<UTC>)</span>
<span class="gp">>>> </span><span class="n">tpe</span><span class="o">.</span><span class="n">fromutc</span><span class="p">(</span><span class="n">utcnow</span><span class="p">)</span>
<span class="go">datetime.datetime(2015, 9, 29, 16, 38, 14, 738241, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>)</span>
<span class="gp">>>> </span><span class="n">utc</span><span class="o">.</span><span class="n">fromutc</span><span class="p">(</span><span class="n">utcnow</span><span class="p">)</span> <span class="o">==</span> <span class="n">tpe</span><span class="o">.</span><span class="n">fromutc</span><span class="p">(</span><span class="n">utcnow</span><span class="p">)</span>
<span class="go">True</span>
</code></pre></div>
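<p>Once a datetime has been processed with a time zone defined by pytz, it gains <code>tzinfo</code>. Times in different time zones can then be compared correctly.</p>
<p>How about an arbitrary time? Take New Year's Day 2016 in Taipei:</p>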
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="nb">str</span><span class="p">(</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2016</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span>
<span class="go">'2016-01-01 00:00:00'</span>
<span class="gp">>>> </span><span class="n">tpe_2016_newyear</span> <span class="o">=</span> <span class="n">tpe</span><span class="o">.</span><span class="n">localize</span><span class="p">(</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2016</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span>
<span class="gp">>>> </span><span class="n">tpe_2016_newyear</span>
<span class="go">datetime.datetime(2016, 1, 1, 0, 0, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>)</span>
<span class="gp">>>> </span><span class="n">utc</span><span class="o">.</span><span class="n">normalize</span><span class="p">(</span><span class="n">tpe_2016_newyear</span><span class="p">)</span>
<span class="go">datetime.datetime(2015, 12, 31, 16, 0, tzinfo=<UTC>)</span>
</code></pre></div>
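<p>Use <code>.localize(<datetime>)</code> to attach a time zone to a <code>datetime</code> that initially has none. Once it has a time zone, use <code>.normalize(<datetime>)</code> to convert between time zones.</p>
<p>We can also check what time it is on the US East Coast at Taipei's New Year 2016.</p>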
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">est</span> <span class="o">=</span> <span class="n">pytz</span><span class="o">.</span><span class="n">timezone</span><span class="p">(</span><span class="s1">'US/Eastern'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">est</span><span class="o">.</span><span class="n">normalize</span><span class="p">(</span><span class="n">tpe_2016_newyear</span><span class="p">)</span>
<span class="go">datetime.datetime(2015, 12, 31, 11, 0, tzinfo=<DstTzInfo 'US/Eastern' EST-1 day, 19:00:00 STD>)</span>
</code></pre></div>
<p>No more confusion about the time of a game broadcast or an important announcement.</p>
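<p>For comparison, the same conversion can be sketched with only the standard library, using fixed UTC offsets and <code>astimezone()</code>. The fixed offsets are an assumption that holds for these particular dates (Taipei at UTC+8, US Eastern standard time at UTC-5); unlike pytz, they know nothing about daylight saving time or historical rule changes.</p>

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-ins for the pytz time zones (assumed offsets).
TPE = timezone(timedelta(hours=8), "TPE")
EST = timezone(timedelta(hours=-5), "EST")

tpe_2016_newyear = datetime(2016, 1, 1, 0, 0, tzinfo=TPE)

# astimezone() re-expresses the same instant in another time zone.
in_utc = tpe_2016_newyear.astimezone(timezone.utc)
in_est = tpe_2016_newyear.astimezone(EST)

print(in_utc.isoformat())
print(in_est.isoformat())
```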
<h3 id="datetime-in-sqlite-again">Datetime in SQLite, again</h3>
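<p>Now that we can handle datetimes and time zones, let's rewrite the way time is stored in SQLite. Python's sqlite3 module actually supports converting datetime values; the following is again taken from the <a href="https://docs.python.org/3.5/library/sqlite3.html#default-adapters-and-converters">Python module documentation</a>.</p>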
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>
<span class="gp">>>> </span><span class="kn">import</span> <span class="nn">sqlite3</span>
<span class="gp">>>> </span><span class="kn">import</span> <span class="nn">pytz</span>
<span class="gp">>>> </span><span class="n">utc</span> <span class="o">=</span> <span class="n">pytz</span><span class="o">.</span><span class="n">utc</span>
<span class="gp">>>> </span><span class="n">tpe</span> <span class="o">=</span> <span class="n">pytz</span><span class="o">.</span><span class="n">timezone</span><span class="p">(</span><span class="s1">'Asia/Taipei'</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">db</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span>
<span class="gp">... </span> <span class="s2">"test.db"</span><span class="p">,</span>
<span class="gp">... </span> <span class="n">detect_types</span><span class="o">=</span><span class="n">sqlite3</span><span class="o">.</span><span class="n">PARSE_DECLTYPES</span><span class="o">|</span><span class="n">sqlite3</span><span class="o">.</span><span class="n">PARSE_COLNAMES</span>
<span class="gp">... </span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s1">'CREATE TABLE test(dt timestamp)'</span><span class="p">)</span>
<span class="go"><sqlite3.Cursor object at 0x10a59b960></span>
</code></pre></div>
<p>The column type is set to <code>timestamp</code>, and <code>PARSE_DECLTYPES</code> and <code>PARSE_COLNAMES</code> are set when connecting; their effect will be visible shortly.
Let's store some times.</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">utcnow</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">utcnow</span><span class="p">()</span>
<span class="gp">>>> </span><span class="n">utcnow</span>
<span class="go">datetime.datetime(2015, 9, 29, 12, 48, 16, 671538)</span>
<span class="gp">>>> </span><span class="k">with</span> <span class="n">db</span><span class="p">:</span>
<span class="gp">... </span> <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span>
<span class="gp">... </span> <span class="s1">'INSERT INTO test(dt) VALUES (?)'</span><span class="p">,</span>
<span class="gp">... </span> <span class="p">(</span><span class="n">utcnow</span><span class="p">,</span> <span class="p">)</span>
<span class="gp">... </span> <span class="p">)</span>
<span class="gp">...</span>
<span class="go"><sqlite3.Cursor object at 0x1082380a0></span>
<span class="gp">>>> </span><span class="n">tpe_2016_newyear</span> <span class="o">=</span> <span class="n">tpe</span><span class="o">.</span><span class="n">localize</span><span class="p">(</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2016</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span>
<span class="gp">>>> </span><span class="n">utc</span><span class="o">.</span><span class="n">normalize</span><span class="p">(</span><span class="n">tpe_2016_newyear</span><span class="p">)</span>
<span class="go">datetime.datetime(2015, 12, 31, 16, 0, tzinfo=<UTC>)</span>
<span class="gp">>>> </span><span class="n">utc_dt</span> <span class="o">=</span> <span class="n">utc</span><span class="o">.</span><span class="n">normalize</span><span class="p">(</span><span class="n">tpe_2016_newyear</span><span class="p">)</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">tzinfo</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span>
<span class="gp">>>> </span><span class="k">with</span> <span class="n">db</span><span class="p">:</span>
<span class="gp">... </span> <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span>
<span class="gp">... </span> <span class="s1">'INSERT INTO test(dt) VALUES (?)'</span><span class="p">,</span>
<span class="gp">... </span> <span class="p">(</span><span class="n">utc_dt</span><span class="p">,</span> <span class="p">)</span>
<span class="gp">... </span> <span class="p">)</span>
<span class="gp">...</span>
<span class="go"><sqlite3.Cursor object at 0x10a59b960></span>
</code></pre></div>
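<p>Two times are stored: the current time in UTC, and New Year's Day 2016 in Taipei expressed in UTC. Note that both values are stored without the UTC time zone attached, because in some cases the datetime adapter between SQLite and Python cannot parse a time zone (this is <a href="https://bugs.python.org/issue19065">bug #19065</a>).</p>
<p>Opening the database with SQLite shows that both times are in UTC. Its built-in <code>datetime('<UTC time>', 'localtime')</code> can still convert a UTC time string into the computer's local time. Storing times this way stays compatible with other applications.</p>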
<div class="highlight"><pre><span></span><code><span class="go">-- sqlite3 test.db</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="p">.</span><span class="k">schema</span>
<span class="go">CREATE TABLE test(dt timestamp);</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="n">dt</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">test</span><span class="p">;</span>
<span class="go">2015-09-29 12:48:16.671538</span>
<span class="go">2015-12-31 16:00:00</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="n">datetime</span><span class="p">(</span><span class="n">dt</span><span class="p">,</span><span class="w"> </span><span class="s1">'localtime'</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">test</span><span class="p">;</span>
<span class="go">2015-09-29 20:48:16</span>
<span class="go">2016-01-01 00:00:00</span>
</code></pre></div>
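<p>Since <code>'localtime'</code> depends on the time zone of the machine running SQLite, the output above only matches on a computer set to Taipei time. As a small sketch (not from the original post), a fixed modifier such as <code>'+8 hours'</code> applies the same shift everywhere. The in-memory database and the variable names are only for illustration.</p>

```python
import sqlite3

# In-memory database: only SQLite's date functions are needed here.
db = sqlite3.connect(":memory:")

utc_text = "2015-12-31 16:00:00"  # a UTC time stored as plain text

# Fixed offset: always shifts by eight hours, whatever the machine's zone is.
taipei_text = db.execute(
    "SELECT datetime(?, '+8 hours')", (utc_text,)
).fetchone()[0]

# 'localtime': shifts by the offset of the computer running the query.
local_text = db.execute(
    "SELECT datetime(?, 'localtime')", (utc_text,)
).fetchone()[0]

print(taipei_text)  # 2016-01-01 00:00:00
print(local_text)   # depends on the machine's time zone
db.close()
```

<p>A fixed offset is fine for Taipei, which has no daylight saving time; zones that do have it need more care.</p>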
<p>Reading the values back in Python still gives datetime objects:</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">ret_vals</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s1">'SELECT dt AS "[timestamp]" FROM test'</span><span class="p">)</span><span class="o">.</span><span class="n">fetchall</span><span class="p">()</span>
<span class="gp">>>> </span><span class="n">ret_vals</span>
<span class="go">[(datetime.datetime(2015, 9, 29, 12, 48, 16, 671538),),</span>
<span class="go"> (datetime.datetime(2015, 12, 31, 16, 0),)]</span>
<span class="gp">>>> </span><span class="p">[</span><span class="n">tpe</span><span class="o">.</span><span class="n">fromutc</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">ret_vals</span><span class="p">]</span>
<span class="go">[datetime.datetime(2015, 9, 29, 20, 48, 16, 671538, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>),</span>
<span class="go"> datetime.datetime(2016, 1, 1, 0, 0, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>)]</span>
</code></pre></div>
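<p>To see what the two flags do on their own, here is a rough sketch. <code>PARSE_DECLTYPES</code> picks a converter from the declared column type, while <code>PARSE_COLNAMES</code> picks it from a <code>[type]</code> marker in the column alias, which helps for expressions that have no declared type. The helper <code>roundtrip</code> is made up for this example, and the adapter and converter are registered explicitly (an assumption of this sketch) so it does not rely on the defaults that ship with <code>sqlite3</code>.</p>

```python
import sqlite3
from datetime import datetime

# Explicit adapter/converter so the sketch does not depend on sqlite3's
# built-in defaults (those are deprecated since Python 3.12).
sqlite3.register_adapter(datetime, lambda d: d.isoformat(" "))
sqlite3.register_converter(
    "timestamp", lambda raw: datetime.fromisoformat(raw.decode())
)


def roundtrip(detect_types, query):
    """Store one datetime, then read it back with the given settings."""
    db = sqlite3.connect(":memory:", detect_types=detect_types)
    db.execute("CREATE TABLE test(dt timestamp)")
    db.execute(
        "INSERT INTO test(dt) VALUES (?)",
        (datetime(2015, 9, 29, 12, 48, 16),),
    )
    value = db.execute(query).fetchone()[0]
    db.close()
    return value


# No type detection: the stored text comes back as a plain string.
plain = roundtrip(0, "SELECT dt FROM test")

# PARSE_DECLTYPES: the declared column type "timestamp" selects the converter.
by_decl = roundtrip(sqlite3.PARSE_DECLTYPES, "SELECT dt FROM test")

# PARSE_COLNAMES: max(dt) has no declared type, so the alias names it instead.
by_col = roundtrip(
    sqlite3.PARSE_COLNAMES, 'SELECT max(dt) AS "latest [timestamp]" FROM test'
)

print(repr(plain), repr(by_decl), repr(by_col))
```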
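<h4 id="python-3-timezone">Built-in timezone support in Python 3</h4>
<p>While writing this post I also looked into the built-in datetime.timezone. Python 2 does not have it, but it does have the basic timedelta, so rolling your own should be doable… I think?</p>
<p>The built-in datetime.timezone is created from a utcoffset, that is, the difference from UTC expressed as a datetime.timedelta. It also ships with a UTC time zone. Here I build the time zones for Taipei and Tokyo.</p>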
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="kn">import</span> <span class="nn">datetime</span> <span class="k">as</span> <span class="nn">dt</span>
<span class="gp">>>> </span><span class="n">tpe_now</span> <span class="o">=</span> <span class="n">dt</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span>
<span class="gp">>>> </span><span class="n">tpe_now</span>
<span class="go">datetime.datetime(2015, 9, 29, 20, 40, 49, 347568)</span>
<span class="gp">>>> </span><span class="n">utc</span> <span class="o">=</span> <span class="n">dt</span><span class="o">.</span><span class="n">timezone</span><span class="o">.</span><span class="n">utc</span>
<span class="gp">>>> </span><span class="n">tpe_delta</span> <span class="o">=</span> <span class="n">dt</span><span class="o">.</span><span class="n">timedelta</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">tpe</span> <span class="o">=</span> <span class="n">dt</span><span class="o">.</span><span class="n">timezone</span><span class="p">(</span><span class="n">tpe_delta</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">jpn_delta</span> <span class="o">=</span> <span class="n">dt</span><span class="o">.</span><span class="n">timedelta</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="mi">9</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">jpn</span> <span class="o">=</span> <span class="n">dt</span><span class="o">.</span><span class="n">timezone</span><span class="p">(</span><span class="n">jpn_delta</span><span class="p">)</span>
</code></pre></div>
<p>I am in Taipei, so datetime.datetime.now() gives me Taipei time. From there I compute the time in each zone by hand with timedelta.</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">utc_now</span> <span class="o">=</span> <span class="n">tpe_now</span> <span class="o">-</span> <span class="n">tpe_delta</span> <span class="c1"># manually time shift</span>
<span class="gp">>>> </span><span class="n">jpn_now</span> <span class="o">=</span> <span class="n">utc_now</span> <span class="o">+</span> <span class="n">jpn_delta</span>
<span class="gp">>>> </span><span class="n">utc_now</span>
<span class="go">datetime.datetime(2015, 9, 29, 12, 40, 49, 347568)</span>
<span class="gp">>>> </span><span class="n">tpe_now</span> <span class="o">==</span> <span class="n">utc_now</span>
<span class="go">False</span>
</code></pre></div>
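<p>Comparing the computed times directly shows, unsurprisingly, that they are not equal: tzinfo is empty by default, so naive datetimes are compared by their wall-clock values alone.</p>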
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">tpe_now</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">tzinfo</span><span class="o">=</span><span class="n">tpe</span><span class="p">)</span>
<span class="go">datetime.datetime(2015, 9, 29, 20, 40, 49, 347568, tzinfo=datetime.timezone(datetime.timedelta(0, 28800)))</span>
<span class="gp">>>> </span><span class="n">tpe_now</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">tzinfo</span><span class="o">=</span><span class="n">tpe</span><span class="p">)</span> <span class="o">==</span> <span class="n">utc_now</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">tzinfo</span><span class="o">=</span><span class="n">utc</span><span class="p">)</span>
<span class="go">True</span>
<span class="gp">>>> </span><span class="n">jpn_now</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">tzinfo</span><span class="o">=</span><span class="n">jpn</span><span class="p">)</span> <span class="o">==</span> <span class="n">tpe_now</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">tzinfo</span><span class="o">=</span><span class="n">tpe</span><span class="p">)</span>
<span class="go">True</span>
</code></pre></div>
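<p>Once each value carries the tzinfo of its own zone, datetime comparisons do take the UTC offset into account.</p>
<p>Next, let's check how well pytz works together with the built-in datetime.timezone.</p>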
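<p>The manual shifting above can also be left to <code>astimezone()</code> once a datetime is aware. A minimal sketch, assuming a fixed Taipei timestamp instead of <code>now()</code> so that the result is reproducible; the zone labels <code>"TPE"</code> and <code>"JPN"</code> are just names chosen here.</p>

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zones, built the same way as in the session above.
tpe = timezone(timedelta(hours=8), "TPE")
jpn = timezone(timedelta(hours=9), "JPN")

# An aware Taipei time (a fixed value stands in for datetime.now()).
tpe_now = datetime(2015, 9, 29, 20, 40, 49, tzinfo=tpe)

# astimezone() applies the offset difference; no manual timedelta arithmetic.
utc_now = tpe_now.astimezone(timezone.utc)
jpn_now = tpe_now.astimezone(jpn)

print(utc_now.isoformat())  # 2015-09-29T12:40:49+00:00
print(jpn_now.isoformat())  # 2015-09-29T21:40:49+09:00

# All three describe the same instant, so they compare equal.
print(tpe_now == utc_now == jpn_now)  # True
```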
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="kn">import</span> <span class="nn">pytz</span>
<span class="gp">>>> </span><span class="n">pytz_utc</span> <span class="o">=</span> <span class="n">pytz</span><span class="o">.</span><span class="n">utc</span>
<span class="gp">>>> </span><span class="n">pytz_tpe</span> <span class="o">=</span> <span class="n">pytz</span><span class="o">.</span><span class="n">timezone</span><span class="p">(</span><span class="s1">'Asia/Taipei'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">pytz</span><span class="o">.</span><span class="n">utc</span><span class="o">.</span><span class="n">normalize</span><span class="p">(</span><span class="n">jpn_now</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">tzinfo</span><span class="o">=</span><span class="n">jpn</span><span class="p">))</span>
<span class="go">datetime.datetime(2015, 9, 29, 12, 40, 49, 347568, tzinfo=<UTC>)</span>
<span class="gp">>>> </span><span class="n">pytz_tpe</span><span class="o">.</span><span class="n">localize</span><span class="p">(</span><span class="n">tpe_now</span><span class="p">)</span> <span class="o">==</span> <span class="n">utc_now</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">tzinfo</span><span class="o">=</span><span class="n">utc</span><span class="p">)</span>
<span class="go">True</span>
</code></pre></div>
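<p>Both comparison and conversion work, so the two can be mixed safely.</p>
<h3 id="python-sqlite-adapter">Making Python's built-in SQLite adapter time zone aware</h3>
<p>I had a look at <a href="https://bugs.python.org/issue19065">Python issue 19065</a>. The reason it has not been resolved is simply a missing patch: the current patch is not compatible with Python 2.x (which has no datetime.timezone), and the pysqlite maintainer does not seem interested in supporting time zones.</p>
<p>That only concerns the built-in adapter for datetime.datetime objects, though, so writing your own is not a problem. The code below follows the solution offered in the issue (hosted as a GitHub <a href="https://gist.github.com/acdha/6655391">gist</a>).</p>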
<div class="highlight"><pre><span></span><code><span class="c1"># tz_aware_adpater.py</span>
<span class="c1"># Adapt from https://gist.github.com/acdha/6655391</span>
<span class="kn">import</span> <span class="nn">datetime</span>
<span class="kn">import</span> <span class="nn">sqlite3</span>
<span class="k">def</span> <span class="nf">tz_aware_timestamp_adapter</span><span class="p">(</span><span class="n">val</span><span class="p">):</span>
    <span class="n">datepart</span><span class="p">,</span> <span class="n">timepart</span> <span class="o">=</span> <span class="n">val</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sa">b</span><span class="s2">" "</span><span class="p">)</span>
    <span class="n">year</span><span class="p">,</span> <span class="n">month</span><span class="p">,</span> <span class="n">day</span> <span class="o">=</span> <span class="nb">map</span><span class="p">(</span><span class="nb">int</span><span class="p">,</span> <span class="n">datepart</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sa">b</span><span class="s2">"-"</span><span class="p">))</span>
    <span class="k">if</span> <span class="sa">b</span><span class="s2">"+"</span> <span class="ow">in</span> <span class="n">timepart</span><span class="p">:</span>
        <span class="n">timepart</span><span class="p">,</span> <span class="n">tz_offset</span> <span class="o">=</span> <span class="n">timepart</span><span class="o">.</span><span class="n">rsplit</span><span class="p">(</span><span class="sa">b</span><span class="s2">"+"</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">tz_offset</span> <span class="o">==</span> <span class="sa">b</span><span class="s1">'00:00'</span><span class="p">:</span>
            <span class="n">tzinfo</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">timezone</span><span class="o">.</span><span class="n">utc</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">hours</span><span class="p">,</span> <span class="n">minutes</span> <span class="o">=</span> <span class="nb">map</span><span class="p">(</span><span class="nb">int</span><span class="p">,</span> <span class="n">tz_offset</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sa">b</span><span class="s1">':'</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
            <span class="n">tzinfo</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">timezone</span><span class="p">(</span>
                <span class="n">datetime</span><span class="o">.</span><span class="n">timedelta</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="n">hours</span><span class="p">,</span> <span class="n">minutes</span><span class="o">=</span><span class="n">minutes</span><span class="p">))</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">tzinfo</span> <span class="o">=</span> <span class="kc">None</span>
    <span class="n">timepart_full</span> <span class="o">=</span> <span class="n">timepart</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sa">b</span><span class="s2">"."</span><span class="p">)</span>
    <span class="n">hours</span><span class="p">,</span> <span class="n">minutes</span><span class="p">,</span> <span class="n">seconds</span> <span class="o">=</span> <span class="nb">map</span><span class="p">(</span><span class="nb">int</span><span class="p">,</span> <span class="n">timepart_full</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sa">b</span><span class="s2">":"</span><span class="p">))</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">timepart_full</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span><span class="p">:</span>
        <span class="n">microseconds</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="s1">'</span><span class="si">{:0<6.6}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">timepart_full</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">decode</span><span class="p">()))</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">microseconds</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">val</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="p">(</span>
        <span class="n">year</span><span class="p">,</span> <span class="n">month</span><span class="p">,</span> <span class="n">day</span><span class="p">,</span> <span class="n">hours</span><span class="p">,</span> <span class="n">minutes</span><span class="p">,</span> <span class="n">seconds</span><span class="p">,</span> <span class="n">microseconds</span><span class="p">,</span>
        <span class="n">tzinfo</span>
    <span class="p">)</span>
    <span class="k">return</span> <span class="n">val</span>
<span class="n">sqlite3</span><span class="o">.</span><span class="n">register_converter</span><span class="p">(</span><span class="s1">'timestamp'</span><span class="p">,</span> <span class="n">tz_aware_timestamp_adapter</span><span class="p">)</span>
</code></pre></div>
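<p>One thing to keep in mind: the converter above only looks for a <code>+</code> sign, so a negative offset such as <code>-05:00</code> would not be parsed. As an alternative sketch (not what the gist does), on Python 3.7 or later the parsing could be handed to <code>datetime.fromisoformat</code>, which accepts both signs. The name <code>iso_timestamp_converter</code> and the explicit adapter are assumptions made for this example.</p>

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def iso_timestamp_converter(raw: bytes) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM:SS[.ffffff][+HH:MM or -HH:MM]' from SQLite."""
    return datetime.fromisoformat(raw.decode())


# Write datetimes in the same text form that str(datetime) produces.
sqlite3.register_adapter(datetime, lambda d: d.isoformat(" "))
sqlite3.register_converter("timestamp", iso_timestamp_converter)

db = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
db.execute("CREATE TABLE test(dt timestamp)")

stored = [
    datetime(2016, 1, 1, 0, 0, tzinfo=timezone(timedelta(hours=-5))),  # negative offset
    datetime(2016, 6, 3, 8, 0, tzinfo=timezone(timedelta(hours=8))),   # positive offset
    datetime(2015, 12, 31, 16, 0),                                     # naive
]
db.executemany("INSERT INTO test(dt) VALUES (?)", [(d,) for d in stored])

loaded = [row[0] for row in db.execute("SELECT dt FROM test ORDER BY rowid")]
for value in loaded:
    print(repr(value))
db.close()
```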
<div class="highlight"><pre><span></span><code><span class="go">python3 -i tz_aware_adpater.py</span>
<span class="gp">>>> </span><span class="n">db</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span>
<span class="gp">... </span> <span class="s1">'test.db'</span><span class="p">,</span>
<span class="gp">... </span> <span class="n">detect_types</span><span class="o">=</span><span class="n">sqlite3</span><span class="o">.</span><span class="n">PARSE_DECLTYPES</span><span class="o">|</span><span class="n">sqlite3</span><span class="o">.</span><span class="n">PARSE_COLNAMES</span>
<span class="gp">... </span><span class="p">)</span>
<span class="gp">>>> </span><span class="kn">import</span> <span class="nn">pytz</span>
<span class="gp">>>> </span><span class="n">tpe</span> <span class="o">=</span> <span class="n">pytz</span><span class="o">.</span><span class="n">timezone</span><span class="p">(</span><span class="s1">'Asia/Taipei'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">pycontw</span> <span class="o">=</span> <span class="n">tpe</span><span class="o">.</span><span class="n">localize</span><span class="p">(</span><span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2016</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span>
<span class="gp">>>> </span><span class="nb">str</span><span class="p">(</span><span class="n">pycontw</span><span class="p">)</span>
<span class="go">'2016-06-03 08:00:00+08:00'</span>
<span class="gp">>>> </span><span class="n">db</span><span class="o">.</span><span class="n">executemany</span><span class="p">(</span>
<span class="gp">... </span> <span class="s1">'INSERT INTO test(dt) VALUES (?)'</span><span class="p">,</span>
<span class="gp">... </span> <span class="p">[(</span><span class="n">pycontw</span><span class="p">,),</span> <span class="p">(</span><span class="n">pytz</span><span class="o">.</span><span class="n">utc</span><span class="o">.</span><span class="n">normalize</span><span class="p">(</span><span class="n">pycontw</span><span class="p">),)]</span>
<span class="gp">... </span><span class="p">)</span>
<span class="gp">>>> </span><span class="n">db</span><span class="o">.</span><span class="n">commit</span><span class="p">()</span>
</code></pre></div>
<p>This stores two time zone aware times (both refer to the same instant). Let's read them from SQLite first.</p>
<div class="highlight"><pre><span></span><code><span class="gp">sqlite></span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="n">dt</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">test</span><span class="p">;</span>
<span class="go">2015-09-29 12:48:16.671538</span>
<span class="go">2015-12-31 16:00:00</span>
<span class="go">2016-06-03 08:00:00+08:00</span>
<span class="go">2016-06-03 00:00:00+00:00</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="n">datetime</span><span class="p">(</span><span class="n">dt</span><span class="p">,</span><span class="w"> </span><span class="s1">'localtime'</span><span class="p">)</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">test</span><span class="p">;</span>
<span class="go">2015-09-29 20:48:16</span>
<span class="go">2016-01-01 00:00:00</span>
<span class="go">2016-06-03 08:00:00</span>
<span class="go">2016-06-03 08:00:00</span>
</code></pre></div>
<p>The UTC offset is written directly into SQLite; a value without one is treated as UTC.</p>
<p>Now read the values back in Python to test the modified adapter.</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">dts</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s1">'SELECT dt FROM test'</span><span class="p">)</span><span class="o">.</span><span class="n">fetchall</span><span class="p">()</span>
<span class="gp">>>> </span><span class="n">dts</span>
<span class="go">[(datetime.datetime(2015, 9, 29, 12, 48, 16, 671538),),</span>
<span class="go"> (datetime.datetime(2015, 12, 31, 16, 0),),</span>
<span class="go"> (datetime.datetime(2016, 6, 3, 8, 0, tzinfo=datetime.timezone(datetime.timedelta(0, 28800))),),</span>
<span class="go"> (datetime.datetime(2016, 6, 3, 0, 0, tzinfo=datetime.timezone.utc),)]</span>
<span class="gp">>>> </span><span class="p">[</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">astimezone</span><span class="p">(</span><span class="n">tpe</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">dts</span><span class="p">[</span><span class="mi">2</span><span class="p">:]]</span>
<span class="go">[datetime.datetime(2016, 6, 3, 8, 0, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>),</span>
<span class="go"> datetime.datetime(2016, 6, 3, 8, 0, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>)]</span>
</code></pre></div>
<p>Reading back works. To handle every case (the first two datetimes are naive, with no time zone):</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="p">[(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">astimezone</span><span class="p">(</span><span class="n">utc</span><span class="p">)</span> <span class="k">if</span> <span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">tzinfo</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span>
<span class="gp">... </span> <span class="k">else</span> <span class="n">utc</span><span class="o">.</span><span class="n">localize</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
<span class="gp">... </span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">dts</span><span class="p">]</span>
<span class="go">[datetime.datetime(2015, 9, 29, 12, 48, 16, 671538, tzinfo=<UTC>),</span>
<span class="go"> datetime.datetime(2015, 12, 31, 16, 0, tzinfo=<UTC>),</span>
<span class="go"> datetime.datetime(2016, 6, 3, 0, 0, tzinfo=<UTC>),</span>
<span class="go"> datetime.datetime(2016, 6, 3, 0, 0, tzinfo=<UTC>)]</span>
<span class="gp">>>> </span><span class="p">[(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">astimezone</span><span class="p">(</span><span class="n">tpe</span><span class="p">)</span> <span class="k">if</span> <span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">tzinfo</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span>
<span class="gp">... </span> <span class="k">else</span> <span class="n">tpe</span><span class="o">.</span><span class="n">fromutc</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
<span class="gp">... </span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">dts</span><span class="p">]</span>
<span class="go">[datetime.datetime(2015, 9, 29, 20, 48, 16, 671538, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>),</span>
<span class="go"> datetime.datetime(2016, 1, 1, 0, 0, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>),</span>
<span class="go"> datetime.datetime(2016, 6, 3, 8, 0, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>),</span>
<span class="go"> datetime.datetime(2016, 6, 3, 8, 0, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>)]</span>
</code></pre></div>
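<p>The same "naive means UTC" rule can be wrapped in a small helper using only the standard library. This is a sketch under that assumption; <code>to_utc</code> is a name invented here, and the sample values mirror the rows stored above.</p>

```python
from datetime import datetime, timedelta, timezone


def to_utc(value: datetime) -> datetime:
    """Return an aware UTC datetime; naive input is assumed to already be UTC."""
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


tpe = timezone(timedelta(hours=8))

rows = [
    datetime(2015, 12, 31, 16, 0),             # naive, stored as UTC
    datetime(2016, 6, 3, 8, 0, tzinfo=tpe),    # aware, Taipei offset
    datetime(2016, 6, 3, 0, 0, tzinfo=timezone.utc),
]

normalized = [to_utc(r) for r in rows]
in_taipei = [n.astimezone(tpe) for n in normalized]

for utc_value, local_value in zip(normalized, in_taipei):
    print(utc_value.isoformat(), "->", local_value.isoformat())
```

<p>Normalizing once, right after reading from the database, keeps the rest of the code free of naive/aware checks.</p>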
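<h3 id="_1">Summary</h3>
<p>Time zones are a real pain, especially since many tools do not fully support them. The safest approach is still to communicate in UTC and convert to local time only when it is really needed.</p>
<p>If you are interested in time zones, <a href="https://www.python.org/dev/peps/pep-0495/">PEP 495</a> was accepted not long ago and, barring surprises, should land in Python 3.6. It deals with daylight saving time. (In Taiwan we have practically no feel for daylight saving time.)</p>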
<p>I have to say, handling time correctly… is a lot of trouble.</p>Building a member-drawing website with Flask and SQLite2015-09-28T12:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2015-09-28:/posts/2015/09/flask-draw-member/<p>Written for the undergraduate project students in our lab.</p>
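<p>The real goal is Django + Django ORM + PostgreSQL, but meeting too many things at once backfires; starting with something simpler makes it easier to get going. So what follows is not best practice, but it uses the least amount of (low-level) knowledge and tooling. If too many packages (such as SQLAlchemy) take care of the details from the very beginning, it becomes hard to grasp the important concepts and why those packages are needed in the first place.</p>
<p><strong>This post is very long and cannot be read in a few minutes. It is aimed at beginners learning to build a simple website.</strong></p>
<p>The goal of this project: nobody asks questions during our meetings, so the professor needs a tool that draws someone at random to ask a question. Our lab is split into several groups, so the draw must also work within a single group.</p>
<p>The examples below use the members of <a href="https://zh.wikipedia.org/wiki/LoveLive!">LoveLive!</a> and <a href="https://zh.wikipedia.org/wiki/K-ON!輕音部">K-ON!</a>. <strong>A disclaimer: I have watched neither anime, so if a name is misspelled please tell me; it is definitely not on purpose.</strong> (Update 2016-06-14: I have now finished watching both!)</p>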
<div class="toc">
<ul>
<li><a href="#_1">Data design</a></li>
<li><a href="#_2">Site architecture planning</a></li>
<li><a href="#_3">Development environment setup</a><ul>
<li><a href="#python">Python environment</a></li>
<li><a href="#flaskjinja2">Installing Flask, Jinja2, and other packages</a></li>
<li><a href="#sqlite-database">SQLite Database</a></li>
</ul>
</li>
<li><a href="#csv">Writing the CSV into the database</a></li>
<li><a href="#flask">Basic structure of a Flask app</a></li>
<li><a href="#flask-sqlite">Reading the SQLite database from Flask</a></li>
<li><a href="#first-view-first-template">First view, first template</a></li>
<li><a href="#_4">The draw feature</a><ul>
<li><a href="#get-vs-post">GET vs POST</a></li>
<li><a href="#form">Form</a></li>
<li><a href="#request-form-post-handling-in-flask">Request (Form / POST) handling in Flask</a></li>
<li><a href="#demo">Demo</a></li>
</ul>
</li>
<li><a href="#more-on-templates">More on templates</a></li>
<li><a href="#_5">Draw history</a><ul>
<li><a href="#datetime">Handling time with datetime</a></li>
</ul>
</li>
<li><a href="#whats-next">What’s Next</a></li>
<li><a href="#static-files-and-better-theme">Static files and better theme</a></li>
<li><a href="#more-how-web-works">More how web works</a></li>
<li><a href="#object-relational-model-orm">Object Relational Model (ORM)</a></li>
<li><a href="#django">Django</a></li>
<li><a href="#_6">Summary</a></li>
<li><a href="#details">Details</a><ul>
<li><a href="#sqlite-table-info">SQLite table info</a></li>
<li><a href="#sqlite-foreign-key-check">SQLite foreign key check</a></li>
<li><a href="#_7">Reloading the data</a></li>
<li><a href="#datetime-in-sqlite-and-python">Datetime in SQLite and Python</a></li>
</ul>
</li>
</ul>
</div>
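<h3 id="_1">Data design</h3>
<p>Let's assume all files live in the same folder, called <code>draw_member</code> for now. Unless stated otherwise, every command from here on is run inside this directory.</p>
<p>The raw data is stored as CSV with two columns, 「名字」 (name) and 「團體」 (group). Since the data may later be exported back to a file, the raw file gets one more column, 「最近被抽到的日期」 (date last drawn). The idea is that someone drawn recently should be a little less likely to be drawn again than everyone else.</p>
<p>The CSV file is named <code>members.csv</code>. Nobody has been drawn yet, so the last column is left empty for now<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>. The first line holds the column names. When the file is exported from the database, this column will contain values.</p>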
<div class="highlight"><pre><span></span><code>"名字","團體","最近被抽到的日期"
"高坂 穂乃果","μ's",""
"絢瀬 絵里","μ's",""
"南 ことり","μ's",""
"園田 海未","μ's",""
"星空 凛","μ's",""
"西木野 真姫","μ's",""
"東條 希","μ's",""
"小泉 花陽","μ's",""
"矢澤 にこ","μ's",""
"平沢 唯","K-ON!",""
"秋山 澪","K-ON!",""
"田井中 律","K-ON!",""
"琴吹 紬","K-ON!",""
"中野 梓","K-ON!",""
</code></pre></div>
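<p>The 「最近被抽到的日期」 column exists so that recently drawn members are less likely to be drawn again. One possible way to turn that idea into code is a weighted random pick, sketched below. The function <code>draw_weight</code>, the <code>cooldown_days</code> parameter, the linear recovery rule, and the sample dates are all assumptions made for illustration; this is not necessarily how the site ends up doing it.</p>

```python
import random
from datetime import date


def draw_weight(last_drawn, today, cooldown_days=14):
    """Weight in (0, 1]: never-drawn members get 1.0, recent ones get less."""
    if last_drawn is None:
        return 1.0
    days_since = (today - last_drawn).days
    # Recover linearly until the cooldown has passed, never dropping to zero.
    return min(1.0, max(days_since, 1) / cooldown_days)


# Hypothetical "date last drawn" values for three members.
members = {
    "平沢 唯": None,                 # never drawn
    "秋山 澪": date(2015, 9, 27),    # drawn yesterday
    "田井中 律": date(2015, 9, 1),   # drawn weeks ago
}
today = date(2015, 9, 28)

names = list(members)
weights = [draw_weight(members[name], today) for name in names]

# random.choices picks proportionally to the weights (Python 3.6+).
rng = random.Random(0)  # seeded only to make the sketch repeatable
picked = rng.choices(names, weights=weights, k=1)[0]
print(dict(zip(names, weights)), picked)
```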
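<p>First, let's make sure we can read the data with Python. Python has a built-in module called <code>csv</code> that reads and writes CSV files. Here we use <a href="https://docs.python.org/3.5/library/csv.html#csv.DictReader">csv.DictReader</a>. By default it treats the first line of the file as the column names and then produces one <code>dict</code> object per line, keyed by those names.</p>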
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">csv</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'./members.csv'</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s1">''</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictReader</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">r</span><span class="p">:</span>
        <span class="nb">print</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
</code></pre></div>
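<p>You can type this code straight into the Python REPL, or save it to a file and run it with Python. Either way the result looks like this (the listing leaves out the empty 「最近被抽到的日期」 field):</p>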
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s1">'名字'</span><span class="p">:</span> <span class="s1">'高坂 穂乃果'</span><span class="p">,</span> <span class="s1">'團體'</span><span class="p">:</span> <span class="s2">"μ's"</span><span class="p">}</span>
<span class="p">{</span><span class="s1">'名字'</span><span class="p">:</span> <span class="s1">'絢瀬 絵里'</span><span class="p">,</span> <span class="s1">'團體'</span><span class="p">:</span> <span class="s2">"μ's"</span><span class="p">}</span>
<span class="p">{</span><span class="s1">'名字'</span><span class="p">:</span> <span class="s1">'南 ことり'</span><span class="p">,</span> <span class="s1">'團體'</span><span class="p">:</span> <span class="s2">"μ's"</span><span class="p">}</span>
<span class="p">{</span><span class="s1">'名字'</span><span class="p">:</span> <span class="s1">'園田 海未'</span><span class="p">,</span> <span class="s1">'團體'</span><span class="p">:</span> <span class="s2">"μ's"</span><span class="p">}</span>
<span class="p">{</span><span class="s1">'名字'</span><span class="p">:</span> <span class="s1">'星空 凛'</span><span class="p">,</span> <span class="s1">'團體'</span><span class="p">:</span> <span class="s2">"μ's"</span><span class="p">}</span>
<span class="p">{</span><span class="s1">'名字'</span><span class="p">:</span> <span class="s1">'西木野 真姫'</span><span class="p">,</span> <span class="s1">'團體'</span><span class="p">:</span> <span class="s2">"μ's"</span><span class="p">}</span>
<span class="p">{</span><span class="s1">'名字'</span><span class="p">:</span> <span class="s1">'東條 希'</span><span class="p">,</span> <span class="s1">'團體'</span><span class="p">:</span> <span class="s2">"μ's"</span><span class="p">}</span>
<span class="p">{</span><span class="s1">'名字'</span><span class="p">:</span> <span class="s1">'小泉 花陽'</span><span class="p">,</span> <span class="s1">'團體'</span><span class="p">:</span> <span class="s2">"μ's"</span><span class="p">}</span>
<span class="p">{</span><span class="s1">'名字'</span><span class="p">:</span> <span class="s1">'矢澤 にこ'</span><span class="p">,</span> <span class="s1">'團體'</span><span class="p">:</span> <span class="s2">"μ's"</span><span class="p">}</span>
<span class="p">{</span><span class="s1">'名字'</span><span class="p">:</span> <span class="s1">'平沢 唯'</span><span class="p">,</span> <span class="s1">'團體'</span><span class="p">:</span> <span class="s1">'K-ON!'</span><span class="p">}</span>
<span class="p">{</span><span class="s1">'名字'</span><span class="p">:</span> <span class="s1">'秋山 澪'</span><span class="p">,</span> <span class="s1">'團體'</span><span class="p">:</span> <span class="s1">'K-ON!'</span><span class="p">}</span>
<span class="p">{</span><span class="s1">'名字'</span><span class="p">:</span> <span class="s1">'田井中 律'</span><span class="p">,</span> <span class="s1">'團體'</span><span class="p">:</span> <span class="s1">'K-ON!'</span><span class="p">}</span>
<span class="p">{</span><span class="s1">'名字'</span><span class="p">:</span> <span class="s1">'琴吹 紬'</span><span class="p">,</span> <span class="s1">'團體'</span><span class="p">:</span> <span class="s1">'K-ON!'</span><span class="p">}</span>
<span class="p">{</span><span class="s1">'名字'</span><span class="p">:</span> <span class="s1">'中野 梓'</span><span class="p">,</span> <span class="s1">'團體'</span><span class="p">:</span> <span class="s1">'K-ON!'</span><span class="p">}</span>
</code></pre></div>
<p>Instead of calling <code>print(row)</code> directly, we can tidy the data up a little before printing:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">csv</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'./members.csv'</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s1">''</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictReader</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">r</span><span class="p">:</span>
        <span class="nb">print</span><span class="p">(</span><span class="s1">'</span><span class="si">{}</span><span class="s1"> of </span><span class="si">{}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s1">'名字'</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="s1">'團體'</span><span class="p">]))</span>
</code></pre></div>
<p>The output then becomes:</p>
<div class="highlight"><pre><span></span><code>高坂 穂乃果 of μ's
絢瀬 絵里 of μ's
南 ことり of μ's
園田 海未 of μ's
星空 凛 of μ's
西木野 真姫 of μ's
東條 希 of μ's
小泉 花陽 of μ's
矢澤 にこ of μ's
平沢 唯 of K-ON!
秋山 澪 of K-ON!
田井中 律 of K-ON!
琴吹 紬 of K-ON!
中野 梓 of K-ON!
</code></pre></div>
<p>This confirms that we can read the data with Python. Getting the value of a single field is just as easy: for the name, use <code>row['名字']</code>.</p>
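<p>As a further illustration of field access, here is a minimal sketch that groups the names by their group. The in-memory <code>SAMPLE_CSV</code> string and the <code>group_members</code> helper are assumptions made for this example; they stand in for the real <code>members.csv</code> and are not part of the site's code.</p>

```python
import csv
import io
from collections import defaultdict

# Hypothetical in-memory stand-in for members.csv (same header names).
SAMPLE_CSV = "名字,團體\n高坂 穂乃果,μ's\n絢瀬 絵里,μ's\n平沢 唯,K-ON!\n"


def group_members(csv_file):
    """Return a dict mapping each group name to the list of its members."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # Each row is a dict keyed by the CSV header names.
        groups[row['團體']].append(row['名字'])
    return dict(groups)


members_by_group = group_members(io.StringIO(SAMPLE_CSV))
for group, names in members_by_group.items():
    print(group, len(names))
```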
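<h3 id="_2">Planning the site structure</h3>
<p>The drawing site has only a few features:</p>
<ul>
<li>The <strong>home page</strong> lets the user pick one group, or everyone, to draw from<ul>
<li>After submitting, the user sees the result</li>
<li>The result is also added to the draw history</li>
</ul>
</li>
<li><strong>History</strong> lists the members drawn in the past</li>
<li><strong>Update members</strong> clears all data and reloads it</li>
</ul>
<p>Every page has a menu bar for switching between these features.</p>
<p>The database therefore has two tables, <strong>members</strong> and <strong>draw_histories</strong>, which record the members and the times they were drawn, respectively.</p>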
<pre style="font-family: Consolas, 'Courier New', monospace">
┌─────────────────────┐
│       members       │
├─────────────────────┤
│ id          INTEGER │ <─┐
│ name        TEXT    │   │
│ group_name  TEXT    │   │
└─────────────────────┘   │
                          │
┌─────────────────────┐   │
│   draw_histories    │   │ foreign
├─────────────────────┤   │   key
│ memberid    INTEGER │ ──┘
│ time        DATETIME│
└─────────────────────┘
</pre>
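<p>Table <strong>members</strong> should be easy to understand: one column holds the name (name) and another the group name (group_name). The id column is used internally by the program and is generated automatically when the data is loaded.</p>
<p>Table <strong>draw_histories</strong> records when each draw happened (time) and who was drawn (memberid). Because memberid stores a member's id, we add a foreign key constraint: every value in this column must already exist as an id in members.</p>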
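<p>To see what such a constraint does in practice, here is a rough sketch using Python's built-in <code>sqlite3</code> module with an in-memory database. The cut-down table definitions are assumptions for this example only. Note that SQLite does not enforce foreign keys by default; each connection has to switch the check on with <code>PRAGMA foreign_keys = ON</code>.</p>

```python
import sqlite3

db = sqlite3.connect(':memory:')
# Foreign key checks are off by default and must be enabled per connection.
db.execute('PRAGMA foreign_keys = ON')
db.executescript('''
    CREATE TABLE members (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL
    );
    CREATE TABLE draw_histories (
        memberid INTEGER,
        FOREIGN KEY (memberid) REFERENCES members(id)
    );
''')

db.execute('INSERT INTO members (name) VALUES (?)', ('平沢 唯',))
# Member id 1 exists, so this row is accepted.
db.execute('INSERT INTO draw_histories (memberid) VALUES (1)')

try:
    # No member has id 999, so the constraint rejects this row.
    db.execute('INSERT INTO draw_histories (memberid) VALUES (999)')
    rejected = False
except sqlite3.IntegrityError as exc:
    rejected = True
    print('rejected:', exc)
```

<p>Without the <code>PRAGMA</code> line, the second insert would be accepted silently.</p>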
<h3 id="_3">Setting up the environment</h3>
<p>We use <a href="http://flask.pocoo.org/">Flask</a> to run the server because it is easy to get started with. The data is loaded into a <a href="https://www.sqlite.org/">SQLite</a> database.</p>
<blockquote>
<p><em>Flask</em> is a microframework for Python based on Werkzeug, Jinja 2 and good intentions. (Flask official site)</p>
<p><em>SQLite</em> is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. (SQLite official site)</p>
</blockquote>
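<h4 id="python">Python environment</h4>
<p>We use <a href="https://www.python.org/downloads/">Python 3.5</a>. SQLite should already be available and ready to use. Python development is usually done inside a virtual environment: the packages installed there are managed independently and are unaffected by the system or by other virtual environments. We create a virtual environment named <code>VENV</code> with the built-in <a href="https://docs.python.org/3.5/library/venv.html#module-venv">venv</a> module:</p>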
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>python3.5<span class="w"> </span>-m<span class="w"> </span>venv<span class="w"> </span>VENV
</code></pre></div>
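<p>A <code>VENV</code> directory now appears in the current directory. It contains a complete Python layout, as if Python had been installed at that path. Never mind for now how the isolation works; knowing how to use it is enough. To activate and to leave the environment, respectively:</p>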
<div class="highlight"><pre><span></span><code>$<span class="w"> </span><span class="nb">source</span><span class="w"> </span>VENV/bin/activate
<span class="o">(</span>VENV<span class="o">)</span><span class="w"> </span>$<span class="w"> </span>which<span class="w"> </span>python
<span class="c1"># /path/to/draw_member/VENV/bin/python</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="o">(</span>VENV<span class="o">)</span><span class="w"> </span>$<span class="w"> </span>deactivate
$<span class="w"> </span><span class="c1"># the leading (VENV) disappears</span>
</code></pre></div>
<h4 id="flaskjinja2">Installing Flask, Jinja2 and other packages</h4>
<p>Python manages installed packages with pip:</p>
<div class="highlight"><pre><span></span><code><span class="o">(</span>VENV<span class="o">)</span><span class="w"> </span>$<span class="w"> </span>pip<span class="w"> </span>install<span class="w"> </span>flask<span class="w"> </span>jinja2
<span class="c1"># Collecting flask</span>
<span class="c1"># Collecting jinja2</span>
<span class="c1"># ... (related packages are installed along with them)</span>
</code></pre></div>
<p>Listing the installed packages now shows:</p>
<div class="highlight"><pre><span></span><code><span class="o">(</span>VENV<span class="o">)</span><span class="w"> </span>$<span class="w"> </span>pip<span class="w"> </span>freeze
<span class="c1"># Flask==0.10.1</span>
<span class="c1"># itsdangerous==0.24</span>
<span class="c1"># Jinja2==2.8</span>
<span class="c1"># MarkupSafe==0.23</span>
<span class="c1"># Werkzeug==0.10.4</span>
</code></pre></div>
<p>To make it easy to recreate the environment on another machine later, remember to run</p>
<div class="highlight"><pre><span></span><code>pip<span class="w"> </span>freeze<span class="w"> </span>><span class="w"> </span>requirements.txt
</code></pre></div>
<p>With the package versions saved in a file, setting up the environment next time only takes</p>
<div class="highlight"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>-r<span class="w"> </span>requirements.txt
</code></pre></div>
<p>and the setup is done.</p>
<h4 id="sqlite-database">SQLite Database</h4>
<p>We set up the SQLite tables first, so that the code we write later only has to read and write data. The model sketched earlier translates into SQL as follows:</p>
<div class="highlight"><pre><span></span><code><span class="c1">-- create_db.sql</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">members</span><span class="w"> </span><span class="p">(</span>
<span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="k">ASC</span><span class="w"> </span><span class="n">AUTOINCREMENT</span><span class="p">,</span>
<span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="nb">TEXT</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span>
<span class="w"> </span><span class="n">group_name</span><span class="w"> </span><span class="nb">TEXT</span>
<span class="p">);</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">draw_histories</span><span class="w"> </span><span class="p">(</span>
<span class="w"> </span><span class="n">memberid</span><span class="w"> </span><span class="nb">INTEGER</span><span class="p">,</span>
<span class="w"> </span><span class="k">time</span><span class="w"> </span><span class="n">DATETIME</span><span class="w"> </span><span class="k">DEFAULT</span><span class="w"> </span><span class="p">(</span><span class="n">datetime</span><span class="p">(</span><span class="s1">'now'</span><span class="p">,</span><span class="w"> </span><span class="s1">'localtime'</span><span class="p">)),</span>
<span class="w"> </span><span class="k">FOREIGN</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">memberid</span><span class="p">)</span><span class="w"> </span><span class="k">REFERENCES</span><span class="w"> </span><span class="n">members</span><span class="p">(</span><span class="n">id</span><span class="p">)</span>
<span class="p">);</span>
</code></pre></div>
<p>After saving this SQL to a file named <code>create_db.sql</code>, we can try it out. Let's insert two members into a test database:</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>sqlite3<span class="w"> </span>-init<span class="w"> </span>create_db.sql<span class="w"> </span>test.db
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="go">-- Loading resources from create_db.sql</span>
<span class="go">SQLite version 3.8.11.1 2015-07-29 20:00:57</span>
<span class="go">Enter ".help" for usage hints.</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">members</span><span class="w"> </span><span class="p">(</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">group_name</span><span class="p">)</span>
<span class="gp"> ...></span><span class="w"> </span><span class="k">VALUES</span>
<span class="gp"> ...></span><span class="w"> </span><span class="p">(</span><span class="s1">'高坂 穂乃果'</span><span class="p">,</span><span class="w"> </span><span class="s1">'μ''s'</span><span class="p">),</span>
<span class="gp"> ...></span><span class="w"> </span><span class="p">(</span><span class="s1">'平沢 唯'</span><span class="p">,</span><span class="w"> </span><span class="s1">'K-ON!'</span><span class="p">);</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="p">.</span><span class="n">header</span><span class="w"> </span><span class="k">on</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">members</span><span class="p">;</span>
<span class="go">id|name|group_name</span>
<span class="go">1|高坂 穂乃果|μ's</span>
<span class="go">2|平沢 唯|K-ON!</span>
</code></pre></div>
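<p><code>sqlite3 -init xxx.sql</code> runs every SQL statement in <code>xxx.sql</code> first, so the tables already exist when the SQLite shell starts.</p>
<p>Next we simulate a few draws. Recall that we gave <strong>draw_histories</strong>.time a default value, so a draw only needs to say who was drawn; SQLite fills in the time from the moment the statement runs. Let's try it both ways.</p>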
<div class="highlight"><pre><span></span><code><span class="gp">sqlite></span><span class="w"> </span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">draw_histories</span><span class="w"> </span><span class="p">(</span><span class="n">memberid</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">1</span><span class="p">),</span><span class="w"> </span><span class="p">(</span><span class="mi">2</span><span class="p">);</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">draw_histories</span><span class="w"> </span><span class="p">(</span><span class="n">memberid</span><span class="p">,</span><span class="w"> </span><span class="k">time</span><span class="p">)</span>
<span class="gp"> ...></span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="n">datetime</span><span class="p">(</span><span class="s1">'2015-09-25 16:30'</span><span class="p">));</span>
</code></pre></div>
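<p>The first INSERT draws Honoka (高坂 穂乃果) and Yui (平沢 唯) once each. The second INSERT draws Yui again, this time with an explicit time of 4:30 PM on September 25. For more on <code>datetime</code> in SQLite see the <a href="https://sqlite.org/lang_datefunc.html">official documentation</a>; this much is enough for our example.</p>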
<div class="highlight"><pre><span></span><code><span class="gp">sqlite></span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">draw_histories</span><span class="p">;</span>
<span class="go">memberid|time</span>
<span class="go">1|2015-09-28 16:55:03</span>
<span class="go">2|2015-09-28 16:55:03</span>
<span class="go">2|2015-09-25 16:30:00</span>
</code></pre></div>
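<p>The first two rows come from the first INSERT, so their time depends on when you ran the statement. The second INSERT specified the time, so that row always reads the afternoon of September 25.</p>
<p><strong>draw_histories</strong> stores only the memberid, so we can write a slightly more complex query that also lists each member's name and group.</p>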
<div class="highlight"><pre><span></span><code><span class="gp">sqlite></span><span class="w"> </span><span class="k">SELECT</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">group_name</span><span class="p">,</span><span class="w"> </span><span class="n">d</span><span class="p">.</span><span class="k">time</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">draw_time</span>
<span class="gp"> ...></span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">draw_histories</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">members</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="n">m</span>
<span class="gp"> ...></span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">id</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">d</span><span class="p">.</span><span class="n">memberid</span>
<span class="gp"> ...></span><span class="w"> </span><span class="k">ORDER</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">d</span><span class="p">.</span><span class="k">time</span><span class="w"> </span><span class="k">ASC</span><span class="p">;</span>
<span class="go">id|name|group_name|draw_time</span>
<span class="go">2|平沢 唯|K-ON!|2015-09-25 16:30:00</span>
<span class="go">1|高坂 穂乃果|μ's|2015-09-28 16:55:03</span>
<span class="go">2|平沢 唯|K-ON!|2015-09-28 16:55:03</span>
</code></pre></div>
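<p>The same lookup can also be run from Python and written with an explicit <code>JOIN ... ON</code>, which returns the same rows as the comma-separated <code>FROM</code> with a <code>WHERE</code> condition. The following is only a sketch: it uses an in-memory database with simplified table definitions and the three sample draws from above, and it adds <code>m.id</code> as a second sort key (an assumption) so that rows with identical times come out in a fixed order.</p>

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.executescript('''
    CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT, group_name TEXT);
    CREATE TABLE draw_histories (memberid INTEGER, time DATETIME);
    INSERT INTO members VALUES
        (1, '高坂 穂乃果', 'μ''s'),
        (2, '平沢 唯', 'K-ON!');
    INSERT INTO draw_histories VALUES
        (1, '2015-09-28 16:55:03'),
        (2, '2015-09-28 16:55:03'),
        (2, '2015-09-25 16:30:00');
''')

# Explicit JOIN form of the query; each draw is matched to its member.
rows = db.execute('''
    SELECT m.id, m.name, m.group_name, d.time AS draw_time
    FROM draw_histories AS d
    JOIN members AS m ON m.id = d.memberid
    ORDER BY d.time ASC, m.id ASC
''').fetchall()

for row in rows:
    print(row)
```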
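<h3 id="csv">Writing the CSV into the database</h3>
<p>We name the database used from here on <code>members.db</code> and start by writing the initial data into it.</p>
<p>The only new step is operating SQLite from Python, which the built-in <a href="https://docs.python.org/3.5/library/sqlite3.html">sqlite3</a> module handles. First make sure these files exist:</p>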
<ul>
<li><code>members.csv</code>: data for all members</li>
<li><code>create_db.sql</code>: the database schema</li>
</ul>
<p>First, import the modules we need:</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="kn">import</span> <span class="nn">sqlite3</span>
<span class="gp">>>> </span><span class="kn">import</span> <span class="nn">csv</span>
</code></pre></div>
<p>Read the member data from the CSV as before, except that this time we reshape it slightly and store it in the variable <code>members</code>.</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'./members.csv'</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s1">''</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="gp">... </span> <span class="n">csv_reader</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictReader</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
<span class="gp">... </span> <span class="n">members</span> <span class="o">=</span> <span class="p">[</span>
<span class="gp">... </span> <span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s1">'名字'</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="s1">'團體'</span><span class="p">])</span>
<span class="gp">... </span> <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">csv_reader</span>
<span class="gp">... </span> <span class="p">]</span>
<span class="gp">...</span>
<span class="gp">>>> </span><span class="n">members</span>
<span class="go">[('高坂 穂乃果', "μ's"),</span>
<span class="go"> ('絢瀬 絵里', "μ's"),</span>
<span class="go"> ('南 ことり', "μ's"),</span>
<span class="go"> ('園田 海未', "μ's"),</span>
<span class="go"> # ...</span>
<span class="go"> ('中野 梓', 'K-ON!')]</span>
</code></pre></div>
<p>Now for the new part: create a connection to the SQLite database with <code>sqlite3.connect()</code>, then issue SQL statements through that connection. First, create all the tables:</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'create_db.sql'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="gp">... </span> <span class="n">create_db_sql</span> <span class="o">=</span> <span class="n">f</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="gp">...</span>
<span class="gp">>>> </span><span class="n">db</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="s1">'members.db'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="k">with</span> <span class="n">db</span><span class="p">:</span>
<span class="gp">... </span> <span class="n">db</span><span class="o">.</span><span class="n">executescript</span><span class="p">(</span><span class="n">create_db_sql</span><span class="p">)</span>
<span class="gp">...</span>
</code></pre></div>
<p><code>db.executescript('...')</code> runs a series of SQL statements (note that they must be separated by semicolons). Wrapping the call in <code>with db: ...</code> makes the sqlite3 module commit the enclosed SQL statements for us automatically<sup id="fnref:sqlite3 auto commit"><a class="footnote-ref" href="#fn:sqlite3 auto commit">2</a></sup>, which is equivalent to:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="n">c</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">c</span><span class="o">.</span><span class="n">executescript</span><span class="p">(</span><span class="n">create_db_sql</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">db</span><span class="o">.</span><span class="n">commit</span><span class="p">()</span>
</code></pre></div>
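<p>A minimal sketch of what the connection's context manager does, using an in-memory database and a cut-down <code>members</code> table (both assumptions for illustration): the block commits when it finishes normally and rolls back when it raises.</p>

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE members (name TEXT NOT NULL)')

# The block finishes normally, so the insert is committed.
with db:
    db.execute("INSERT INTO members (name) VALUES ('平沢 唯')")

try:
    with db:
        db.execute("INSERT INTO members (name) VALUES ('秋山 澪')")
        # Violates NOT NULL, so the whole block is rolled back.
        db.execute('INSERT INTO members (name) VALUES (NULL)')
except sqlite3.IntegrityError:
    pass

names = [row[0] for row in db.execute('SELECT name FROM members')]
print(names)
```

<p>Only the first member remains, because the second block was rolled back as a whole.</p>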
<p>Next, write the <code>members</code> variable we just read into the table.</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="k">with</span> <span class="n">db</span><span class="p">:</span>
<span class="gp">... </span> <span class="n">db</span><span class="o">.</span><span class="n">executemany</span><span class="p">(</span>
<span class="gp">... </span> <span class="s1">'INSERT INTO members (name, group_name) VALUES (?, ?)'</span><span class="p">,</span>
<span class="gp">... </span> <span class="n">members</span>
<span class="gp">... </span> <span class="p">)</span>
</code></pre></div>
<p>Read the data back to confirm that it was really stored.</p>
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="n">c</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s1">'SELECT * FROM members LIMIT 3'</span><span class="p">)</span>
<span class="gp">>>> </span><span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">c</span><span class="p">:</span>
<span class="gp">... </span>    <span class="nb">print</span><span class="p">(</span><span class="n">row</span><span class="p">)</span>
<span class="go">(1, '高坂 穂乃果', "μ's")</span>
<span class="go">(2, '絢瀬 絵里', "μ's")</span>
<span class="go">(3, '南 ことり', "μ's")</span>
</code></pre></div>
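<p>The REPL steps above could be collected into one function, roughly as follows. This is a sketch under stated assumptions: <code>load_members</code> is a hypothetical helper name, the schema is reduced to the <code>members</code> table, and an in-memory database plus a two-row CSV string replace <code>members.db</code> and <code>members.csv</code> so the example runs on its own.</p>

```python
import csv
import io
import sqlite3

# Reduced schema for this example (only the members table).
SCHEMA = '''
CREATE TABLE members (
    id INTEGER PRIMARY KEY ASC AUTOINCREMENT,
    name TEXT NOT NULL,
    group_name TEXT
);
'''


def load_members(db, csv_file):
    """Insert every CSV row into the members table; return the row count."""
    rows = [(r['名字'], r['團體']) for r in csv.DictReader(csv_file)]
    with db:  # commit on success, roll back on error
        db.executemany(
            'INSERT INTO members (name, group_name) VALUES (?, ?)', rows
        )
    return len(rows)


db = sqlite3.connect(':memory:')
db.executescript(SCHEMA)
sample = io.StringIO("名字,團體\n高坂 穂乃果,μ's\n平沢 唯,K-ON!\n")
inserted = load_members(db, sample)
stored = db.execute('SELECT * FROM members ORDER BY id').fetchall()
print(inserted, stored)
```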
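<p>At this point the data side is in good shape. Next we deal with the site itself.</p>
<h3 id="flask">Basic structure of a Flask app</h3>
<p>A <a href="http://flask.pocoo.org/">Flask</a> web server can keep all its functionality in a single file; here we use <code>draw_member.py</code> as the example.</p>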
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">flask</span> <span class="kn">import</span> <span class="n">Flask</span>
<span class="n">app</span> <span class="o">=</span> <span class="n">Flask</span><span class="p">(</span><span class="vm">__name__</span><span class="p">)</span>
<span class="nd">@app</span><span class="o">.</span><span class="n">route</span><span class="p">(</span><span class="s1">'/'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">index</span><span class="p">():</span>
    <span class="k">return</span> <span class="s2">"<p>Hello World!</p>"</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
    <span class="n">app</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">debug</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div>
<p>That is the most basic structure of a Flask server. After more than 1,600 words of waiting, it is about time to test it. Start the server:</p>
<div class="highlight"><pre><span></span><code><span class="o">(</span>VENV<span class="o">)</span><span class="w"> </span>$<span class="w"> </span>python<span class="w"> </span>draw_member.py
<span class="w"> </span>*<span class="w"> </span>Running<span class="w"> </span>on<span class="w"> </span>http://127.0.0.1:5000/<span class="w"> </span><span class="o">(</span>Press<span class="w"> </span>CTRL+C<span class="w"> </span>to<span class="w"> </span>quit<span class="o">)</span>
<span class="w"> </span>*<span class="w"> </span>Restarting<span class="w"> </span>with<span class="w"> </span>stat
</code></pre></div>
<p>Then open <a href="http://localhost:5000/">http://localhost:5000/</a> in a browser, or visit it from the command line:</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>curl<span class="w"> </span><span class="s1">'http://localhost:5000/'</span>
<span class="c1"># <p>Hello World!</p></span>
</code></pre></div>
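<p>The server responds with "Hello World!". How moving! Now let's go through how this flow maps onto the code.</p>
<p><code>app</code> is the core object of the whole Flask application. At the end of the file we call its <code>.run()</code> method to start a working web server. <code>debug=True</code> tells Flask to give us a full error report when the server hits an error, including which Python function raised it and the values of the variables at that moment. Because this also reveals how the site works to anyone with bad intentions, the option is turned off when the site goes live (in production).</p>
<p>We define an <code>index</code> function and use a decorator to bind it to the path <code>/</code>, the home page. When a user visits <code>/</code>, the request is handled by this function.</p>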
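<p>To make the idea of a decorator binding a function to a path more concrete, here is a toy sketch in plain Python. <code>TinyApp</code>, its <code>routes</code> dict and its <code>dispatch</code> method are invented for this illustration; this is not how Flask is implemented, only the general pattern of registering a function under a path and looking it up later.</p>

```python
class TinyApp:
    """Toy stand-in for a web app object; not Flask's real implementation."""

    def __init__(self):
        self.routes = {}  # maps a path such as '/' to a view function

    def route(self, path):
        def decorator(func):
            # Remember which function handles this path, then return the
            # function unchanged so it can still be called directly.
            self.routes[path] = func
            return func
        return decorator

    def dispatch(self, path):
        view = self.routes.get(path)
        if view is None:
            return '404 Not Found'
        return view()


toy_app = TinyApp()


@toy_app.route('/')
def index():
    return '<p>Hello World!</p>'


print(toy_app.dispatch('/'))
print(toy_app.dispatch('/missing'))
```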
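<h3 id="flask-sqlite">Accessing the SQLite database from Flask</h3>
<p>Let's write all the database-related functions first. They essentially follow the <a href="http://flask.pocoo.org/docs/0.10/patterns/sqlite3/#using-sqlite-3-with-flask">SQLite usage pattern in the official Flask documentation</a>.</p>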
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">csv</span>
<span class="kn">import</span> <span class="nn">sqlite3</span>
<span class="kn">from</span> <span class="nn">flask</span> <span class="kn">import</span> <span class="n">Flask</span><span class="p">,</span> <span class="n">g</span>
<span class="n">app</span> <span class="o">=</span> <span class="n">Flask</span><span class="p">(</span><span class="vm">__name__</span><span class="p">)</span>
<span class="n">SQLITE_DB_PATH</span> <span class="o">=</span> <span class="s1">'members.db'</span>
<span class="n">SQLITE_DB_SCHEMA</span> <span class="o">=</span> <span class="s1">'create_db.sql'</span>
<span class="n">MEMBER_CSV_PATH</span> <span class="o">=</span> <span class="s1">'members.csv'</span>
<span class="c1"># SQLite3-related operations</span>
<span class="c1"># See SQLite3 usage pattern from Flask official doc</span>
<span class="c1"># http://flask.pocoo.org/docs/0.10/patterns/sqlite3/</span>
<span class="k">def</span> <span class="nf">get_db</span><span class="p">():</span>
    <span class="n">db</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="s1">'_database'</span><span class="p">,</span> <span class="kc">None</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">db</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
        <span class="n">db</span> <span class="o">=</span> <span class="n">g</span><span class="o">.</span><span class="n">_database</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="n">SQLITE_DB_PATH</span><span class="p">)</span>
        <span class="c1"># Enable foreign key check</span>
        <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">"PRAGMA foreign_keys = ON"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">db</span>

<span class="nd">@app</span><span class="o">.</span><span class="n">teardown_appcontext</span>
<span class="k">def</span> <span class="nf">close_connection</span><span class="p">(</span><span class="n">exception</span><span class="p">):</span>
    <span class="n">db</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="s1">'_database'</span><span class="p">,</span> <span class="kc">None</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">db</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
        <span class="n">db</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>

<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
    <span class="n">app</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">debug</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div>
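<p>That is a lot of new code at once. If it feels too complicated, just take it as given for now.</p>
<p>There are two things to understand. First, <code>g</code> is an object that can hold variables that need to be passed around, so we store the database connection in <code>g._database</code>. Whenever we need the connection, we get it with <code>db = get_db()</code>.</p>
<p>Second, we define the data paths and similar settings as variables at the very top of the code. Keeping constants separate from logic is a good habit that makes them easier to manage<sup id="fnref:flask-config"><a class="footnote-ref" href="#fn:flask-config">3</a></sup>.</p>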
<h3 id="first-view-first-template">First view, first template</h3>
<p>Let's build the home page first, putting the HTML in <code>templates/index.html</code>.</p>
<div class="highlight"><pre><span></span><code><span class="cp"><!DOCTYPE html></span>
<span class="p"><</span><span class="nt">html</span><span class="p">></span>
<span class="p"><</span><span class="nt">head</span><span class="p">></span>
<span class="p"><</span><span class="nt">meta</span> <span class="na">charset</span><span class="o">=</span><span class="s">"utf-8"</span><span class="p">></span>
<span class="p"><</span><span class="nt">meta</span> <span class="na">name</span><span class="o">=</span><span class="s">"viewport"</span> <span class="na">content</span><span class="o">=</span><span class="s">"width=device-width"</span><span class="p">></span>
<span class="p"><</span><span class="nt">title</span><span class="p">></span>成員抽籤<span class="p"></</span><span class="nt">title</span><span class="p">></span>
<span class="p"></</span><span class="nt">head</span><span class="p">></span>
<span class="p"><</span><span class="nt">body</span><span class="p">></span>
<span class="p"><</span><span class="nt">h1</span><span class="p">></span>來抽出快樂的夥伴吧!<span class="p"></</span><span class="nt">h1</span><span class="p">></span>
<span class="p"><</span><span class="nt">h3</span><span class="p">></span>功能列<span class="p"></</span><span class="nt">h3</span><span class="p">></span>
<span class="p"><</span><span class="nt">ul</span><span class="p">></span>
<span class="p"><</span><span class="nt">li</span><span class="p">></span>首頁(抽籤)<span class="p"></</span><span class="nt">li</span><span class="p">></span>
<span class="p"><</span><span class="nt">li</span><span class="p">></span>歷史記錄<span class="p"></</span><span class="nt">li</span><span class="p">></span>
<span class="p"><</span><span class="nt">li</span><span class="p">></span>清除記錄、更新成員資料<span class="p"></</span><span class="nt">li</span><span class="p">></span>
<span class="p"></</span><span class="nt">ul</span><span class="p">></span>
<span class="p"></</span><span class="nt">body</span><span class="p">></span>
<span class="p"></</span><span class="nt">html</span><span class="p">></span>
</code></pre></div>
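<p>This is just a plain home page with a title and a menu, neither of which does anything yet. We change the index view defined in <code>draw_member.py</code> so that it renders this template:</p>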
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">flask</span> <span class="kn">import</span> <span class="n">render_template</span>
<span class="nd">@app</span><span class="o">.</span><span class="n">route</span><span class="p">(</span><span class="s1">'/'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">index</span><span class="p">():</span>
    <span class="k">return</span> <span class="n">render_template</span><span class="p">(</span><span class="s1">'index.html'</span><span class="p">)</span>
</code></pre></div>
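<p>Run it right away, the same way as before. The previous server may still be running, though, and a port can serve only one service at a time. So either keep using the old process (Flask is smart: with <code>debug=True</code> it reloads the code when a file changes) or stop it and start a new one.</p>
<p>Opening <a href="http://localhost:5000">http://localhost:5000</a> in a browser should show the page below.</p>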
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/09/flask-draw-member/pics/flask_helloworld.png"/>
<figcaption>Flask Hello World</figcaption>
</figure>
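<h3 id="_4">The drawing feature</h3>
<p>Now it is time to implement the draw. As described earlier, the home page shows a list of groups and the user picks one to draw from.</p>
<p>Before implementing it we need some background: the difference between GET and POST.</p>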
<h4 id="get-vs-post">GET vs POST</h4>
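<p>When a user browses the web by entering an address or clicking a hyperlink, the browser sends a GET request to the corresponding server and path. The server (usually) returns the corresponding HTML file, and the browser displays it on screen.</p>
<p>When the user interacts with a site more actively, typically by sending their own information to it (logging in, filling in a questionnaire, or, here, picking a group to draw from), the request goes through POST.</p>
<p>For more about GET/POST and other HTTP requests, see this <a href="https://archer1609wp.wordpress.com/2014/03/02/httppost%E8%88%87get/">one-page introduction (in Chinese)</a> or the <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP">very thorough introduction on the Mozilla Developer Network (MDN) (in English)</a>.</p>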
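<p>The distinction can also be seen from the client side without a browser. The sketch below only builds request objects with the standard library's <code>urllib.request</code> and sends nothing over the network; the URL is the local development server from earlier, and the <code>key=value</code> body is a made-up placeholder. A request without a body defaults to GET, while attaching data turns it into a POST.</p>

```python
from urllib.request import Request

url = 'http://localhost:5000/'  # local dev server from earlier; nothing is sent

# No body: urllib treats this as a GET request.
get_req = Request(url)

# With a body (placeholder content): urllib treats this as a POST request.
post_req = Request(url, data=b'key=value')

print(get_req.get_method())
print(post_req.get_method())
```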
<h4 id="form">Form</h4>
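<p>POST is most commonly used together with a <a href="https://developer.mozilla.org/en/docs/Web/HTML/Element/form">form</a>. Entering a username and password to log in, or questions and answers in a questionnaire, are typically implemented with forms. A form contains various kinds of input, each a field the user can fill in: single choice, multiple choice, single-line or multi-line text, password, and so on.</p>
<p>Let's look at what a form actually looks like. Edit <code>templates/index.html</code> and add a form for choosing the group to draw from.</p>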
<div class="highlight"><pre><span></span><code><h1>來抽出快樂的夥伴吧!</h1><!-- 本來有的 -->
<p>選擇要被抽的團體</p>
<form action="/draw" method="post">
<label for="group_name">團隊名稱:</label>
<input type="radio" name="group_name" value="μ's">μ's
<input type="radio" name="group_name" value="K-ON!">K-ON!
<input type="radio" name="group_name" value="ALL" checked>(全)
<input type="submit" value="Submit">
</form>
<hr><!-- a horizontal rule -->
</code></pre></div>
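<p>Basically it can go anywhere inside the body. The form contains a label for the input named <code>group_name</code>, followed by four input tags, which really fall into two groups.</p>
<p>The first group is the three radio inputs for choosing a group. Note that they all share the <code>name</code> <code>group_name</code> but have different <code>value</code>s, each followed by its display text. The option 「(全)」 (all) also carries <code>checked</code>, which makes it the default selection.</p>
<p>The other is the input with <code>type=submit</code>: the button that submits the form.</p>
<p>Now look at the form tag itself. <code>method="post"</code> should be easy to understand: the form is sent as a POST request. <code>action="/draw"</code> means the POST goes to the path <code>/draw</code>.</p>
<p>As before, forms have many more details; see <a href="https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Forms">MDN</a> to learn more.</p>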
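<p>As a rough illustration of what the browser sends when this form is submitted, the sketch below encodes a form field into a URL-encoded request body and decodes it again, using only the standard library. It assumes the default form encoding (<code>application/x-www-form-urlencoded</code>); <code>encode_form</code> and <code>decode_form</code> are hypothetical helper names, not part of Flask.</p>

```python
from urllib.parse import parse_qs, urlencode


def encode_form(fields):
    """Encode a dict of form fields the way a browser encodes a POST body."""
    return urlencode(fields)


def decode_form(body):
    """Decode a URL-encoded body into a dict, keeping the first value per key."""
    return {key: values[0] for key, values in parse_qs(body).items()}


body = encode_form({"group_name": "K-ON!"})
fields = decode_form(body)
```

<p>The input's <code>name</code> becomes the key and the selected <code>value</code> becomes its value, which is why every radio input shares one name.</p>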
<h4 id="request-form-post-handling-in-flask">Request (Form / POST) handling in Flask</h4>
<p>So let's write the view that handles the <code>/draw</code> POST.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">random</span>
<span class="kn">from</span> <span class="nn">flask</span> <span class="kn">import</span> <span class="n">request</span>

<span class="nd">@app</span><span class="o">.</span><span class="n">route</span><span class="p">(</span><span class="s1">'/draw'</span><span class="p">,</span> <span class="n">methods</span><span class="o">=</span><span class="p">[</span><span class="s1">'POST'</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">draw</span><span class="p">():</span>
    <span class="c1"># Get the database connection</span>
    <span class="n">db</span> <span class="o">=</span> <span class="n">get_db</span><span class="p">()</span>
    <span class="c1"># Draw member ids from given group</span>
    <span class="c1"># If ALL is given then draw from all members</span>
    <span class="n">group_name</span> <span class="o">=</span> <span class="n">request</span><span class="o">.</span><span class="n">form</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'group_name'</span><span class="p">,</span> <span class="s1">'ALL'</span><span class="p">)</span>
    <span class="n">valid_members_sql</span> <span class="o">=</span> <span class="s1">'SELECT id FROM members '</span>
    <span class="k">if</span> <span class="n">group_name</span> <span class="o">==</span> <span class="s1">'ALL'</span><span class="p">:</span>
        <span class="n">cursor</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">valid_members_sql</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">valid_members_sql</span> <span class="o">+=</span> <span class="s1">'WHERE group_name = ?'</span>
        <span class="n">cursor</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">valid_members_sql</span><span class="p">,</span> <span class="p">(</span><span class="n">group_name</span><span class="p">,</span> <span class="p">))</span>
    <span class="n">valid_member_ids</span> <span class="o">=</span> <span class="p">[</span>
        <span class="n">row</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">cursor</span>
    <span class="p">]</span>
    <span class="c1"># If no valid members return 404 (unlikely)</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">valid_member_ids</span><span class="p">:</span>
        <span class="n">err_msg</span> <span class="o">=</span> <span class="s2">"<p>No members in group '</span><span class="si">%s</span><span class="s2">'</p>"</span> <span class="o">%</span> <span class="n">group_name</span>
        <span class="k">return</span> <span class="n">err_msg</span><span class="p">,</span> <span class="mi">404</span>
    <span class="c1"># Randomly choose a member</span>
    <span class="n">lucky_member_id</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">valid_member_ids</span><span class="p">)</span>
    <span class="c1"># Obtain the lucky member's information</span>
    <span class="n">member_name</span><span class="p">,</span> <span class="n">member_group_name</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span>
        <span class="s1">'SELECT name, group_name FROM members WHERE id = ?'</span><span class="p">,</span>
        <span class="p">(</span><span class="n">lucky_member_id</span><span class="p">,</span> <span class="p">)</span>
    <span class="p">)</span><span class="o">.</span><span class="n">fetchone</span><span class="p">()</span>
    <span class="k">return</span> <span class="s1">'<p></span><span class="si">%s</span><span class="s1">(團體:</span><span class="si">%s</span><span class="s1">)</p>'</span> <span class="o">%</span> <span class="p">(</span><span class="n">member_name</span><span class="p">,</span> <span class="n">member_group_name</span><span class="p">)</span>
</code></pre></div>
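<p>Flask stores the request a user sends to the server in <code>request</code>. Users actually send quite a lot of information, such as their language, browser, and time, and all of it can be found in <code>request</code>. The content of the submitted form is stored in <code>request.form</code>, where the input names we defined in the form become the dict keys.</p>
<p>So to get the <code>group_name</code> the user chose, we write <code>request.form.get('group_name', 'ALL')</code>. This is equivalent to <code>request.form['group_name']</code>, except that it returns the default value <code>'ALL'</code> when the key is missing. In normal use the key is always there, but a web developer should never trust users to send back what is expected.</p>
<p>With the group name in hand, we use it in the SQL query. Likewise the name may match nothing, in which case we return an error message with HTTP status code 404. In general, 4XX codes mean something is wrong with what the user sent.</p>
<p>Once we have all the member ids, <code>random.choice</code> picks one at random. As the name suggests, <a href="https://docs.python.org/3.5/library/random.html#random.choice">random</a> is a built-in Python module. The chosen id is then used to look up the name and group.</p>
<p>In total we make two database queries: the first returns all candidate member ids, and the second fetches the name and group of the drawn member. Writing to the history is not implemented yet, but that is not hard and comes later. For now we skip the template and return the result wrapped in the most basic HTML element, <code>&lt;p&gt;</code>.</p>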
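<p>The two-query draw logic can also be tried outside Flask. The following is a minimal sketch against an in-memory SQLite database. The table layout and the member names are assumptions made up for the demo, and <code>draw_member</code> and <code>make_demo_db</code> are hypothetical helpers rather than part of the app.</p>

```python
import random
import sqlite3


def make_demo_db():
    """Create an in-memory database with a few made-up members."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE members ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, group_name TEXT)"
    )
    db.executemany(
        "INSERT INTO members (name, group_name) VALUES (?, ?)",
        [("Alice", "A"), ("Bob", "A"), ("Carol", "B")],
    )
    return db


def draw_member(db, group_name="ALL"):
    """Return (name, group_name) of a random member, or None if no candidates."""
    sql = "SELECT id FROM members"
    params = ()
    if group_name != "ALL":
        sql += " WHERE group_name = ?"
        params = (group_name,)
    # First query: collect all candidate member ids
    member_ids = [row[0] for row in db.execute(sql, params)]
    if not member_ids:
        return None
    # Second query: look up the drawn member's details
    lucky_id = random.choice(member_ids)
    return db.execute(
        "SELECT name, group_name FROM members WHERE id = ?", (lucky_id,)
    ).fetchone()
```

<p>Returning <code>None</code> for an unknown group corresponds to the 404 branch of the view.</p>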
<h4 id="demo">Demo</h4>
<p>Reload the home page and a form appears (obviously). Flask's web server is smart: there is no need to restart it, because it notices that a file changed and reloads automatically. Compare the HTML you wrote in <code>index.html</code> with how the browser renders it.</p>
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/09/flask-draw-member/pics/flask_index_form.png"/>
<figcaption>The new home page, now with a form</figcaption>
</figure>
<p>Pressing Submit takes you to the draw result (note how the URL changes).</p>
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/09/flask-draw-member/pics/flask_draw_result.png"/>
<figcaption>Draw result</figcaption>
</figure>
<p>By default it draws from everyone; you can also go back to the previous page and pick the group you want.</p>
<p>The most important feature is done! If your own code runs into trouble, see <a href="https://github.com/ccwang002/draw_member/blob/169d81650d8ca649c5484c43c05324885e7cb7fb/draw_member.py">my complete version</a>.</p>
<h3 id="more-on-templates">More on templates</h3>
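<p>So far we have only passed complete HTML files to <code>render_template</code>, without using any template features. Templates are useful for:</p>
<ul>
<li>Collecting repeated fragments and structure in one place</li>
<li>Letting variables control part of the HTML content</li>
</ul>
<p>Let's rewrite things right away. The menu should appear on every page, and we want the <code>/draw</code> page to be a complete HTML document too.</p>
<p>First, move the common parts into <code>templates/base.html</code>.</p>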
<div class="highlight"><pre><span></span><code><!-- templates/base.html -->
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width">
<title>{% block title %}成員抽籤{% endblock title %}</title>
</head>
<body>
{% block content %}{% endblock content %}
<hr>
<h3>功能列</h3>
<ul>
<li><a href="/">首頁(抽籤)</a></li>
<li><a href="/history">歷史記錄</a></li>
<li><a href="/reset">清除記錄、更新成員資料</a></li>
</ul>
</body>
</html>
</code></pre></div>
<p>Things that never change, like the menu, belong here. Our home page can then reuse this structure:</p>
<div class="highlight"><pre><span></span><code><!-- templates/index.html -->
{% extends "base.html" %}
{% block content %}
<h1>來抽出快樂的夥伴吧!</h1>
<p>選擇要被抽的團體</p>
<form action="/draw" method="post">
<label for="group_name">團隊名稱:</label>
<input type="radio" name="group_name" value="μ's">μ's
<input type="radio" name="group_name" value="K-ON!">K-ON!
<input type="radio" name="group_name" value="ALL" checked>(全)
<input type="submit" value="Submit">
</form>
{% endblock content %}
</code></pre></div>
<p>The biggest difference is that <code>index.html</code> got simpler. It works like class inheritance: <code>{% extends "base.html" %}</code> pulls in the content of <code>base.html</code>, which defines two blocks, <code>title</code> and <code>content</code>. Index defines content, so it replaces the empty content block in base. Index does not define title, so the block's original value, 「成員抽籤」 (member draw), is used.</p>
<p>Next comes <code>/draw</code>. Besides reusing <code>base.html</code>, we also introduce the concept of template variables.</p>
<div class="highlight"><pre><span></span><code><!-- templates/draw.html -->
{% extends "base.html" %}
{% block title %}抽籤結果{% endblock title %}
{% block content %}
<h1>抽籤結果</h1>
<p>{{ name }}(團體:{{ group }})</p>
{% endblock content %}
</code></pre></div>
<p>The special parts are <code>{{ name }}</code> and <code>{{ group }}</code>. This syntax means their values come from the variables <code>name</code> and <code>group</code>, which are only determined when <code>render_template</code> is called. So how do we pass the values into the template?</p>
<div class="highlight"><pre><span></span><code><span class="nd">@app</span><span class="o">.</span><span class="n">route</span><span class="p">(</span><span class="s1">'/draw'</span><span class="p">,</span> <span class="n">methods</span><span class="o">=</span><span class="p">[</span><span class="s1">'POST'</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">draw</span><span class="p">():</span>
    <span class="c1"># ...</span>
    <span class="c1"># return '<p>%s(團體:%s)</p>' % (member_name, member_group_name)</span>
    <span class="k">return</span> <span class="n">render_template</span><span class="p">(</span>
        <span class="s1">'draw.html'</span><span class="p">,</span>
        <span class="n">name</span><span class="o">=</span><span class="n">member_name</span><span class="p">,</span>
        <span class="n">group</span><span class="o">=</span><span class="n">member_group_name</span><span class="p">,</span>
    <span class="p">)</span>
</code></pre></div>
<p>The rewritten draw view uses the template <code>templates/draw.html</code> and passes the variable values in when calling <code>render_template</code>.</p>
<p>Draw again now and you will see the output of the new template, with the menu showing up as well.</p>
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/09/flask-draw-member/pics/flask_new_draw_result.png"/>
</figure>
<h3 id="_5">History</h3>
<p>Remember to add a record to the database on every draw. Because the schema we set up earlier defaults the draw time to the current time, time handling is left entirely to SQLite. The SQL clauses <code>ORDER BY</code> and <code>LIMIT 10</code> select the ten most recent records, and the same query also looks up the matching name and group in the <strong>members</strong> table. The technical term for this is a <a href="https://en.wikipedia.org/wiki/Join_%28SQL%29">JOIN</a>.</p>
<p>Put this view at the path <code>/history</code>.</p>
<div class="highlight"><pre><span></span><code><span class="nd">@app</span><span class="o">.</span><span class="n">route</span><span class="p">(</span><span class="s1">'/draw'</span><span class="p">,</span> <span class="n">methods</span><span class="o">=</span><span class="p">[</span><span class="s1">'POST'</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">draw</span><span class="p">():</span>
    <span class="c1"># ...</span>
    <span class="c1"># Update draw history</span>
    <span class="k">with</span> <span class="n">db</span><span class="p">:</span>
        <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s1">'INSERT INTO draw_histories (memberid) VALUES (?)'</span><span class="p">,</span>
                   <span class="p">(</span><span class="n">lucky_member_id</span><span class="p">,</span> <span class="p">))</span>
    <span class="c1"># Render template</span>
    <span class="k">return</span> <span class="o">...</span>

<span class="nd">@app</span><span class="o">.</span><span class="n">route</span><span class="p">(</span><span class="s1">'/history'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">history</span><span class="p">():</span>
    <span class="n">db</span> <span class="o">=</span> <span class="n">get_db</span><span class="p">()</span>
    <span class="n">recent_histories</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span>
        <span class="s1">'SELECT m.name, m.group_name, d.time '</span>
        <span class="s1">'FROM draw_histories AS d, members as m '</span>
        <span class="s1">'WHERE m.id == d.memberid '</span>
        <span class="s1">'ORDER BY d.time DESC '</span>
        <span class="s1">'LIMIT 10'</span>
    <span class="p">)</span><span class="o">.</span><span class="n">fetchall</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">render_template</span><span class="p">(</span>
        <span class="s1">'history.html'</span><span class="p">,</span>
        <span class="n">recent_histories</span><span class="o">=</span><span class="n">recent_histories</span>
    <span class="p">)</span>
</code></pre></div>
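<p>The history query above joins the two tables implicitly through its <code>WHERE</code> clause. For comparison, here is a minimal sketch of the same query written with an explicit <code>JOIN ... ON</code>, run against an in-memory database. The simplified schema (including the <code>time</code> column name taken from the query above) and the sample rows are assumptions for the demo.</p>

```python
import sqlite3

RECENT_SQL = """
    SELECT m.name, m.group_name, d.time
    FROM draw_histories AS d
    JOIN members AS m ON m.id = d.memberid
    ORDER BY d.time DESC
    LIMIT ?
"""


def make_demo_db():
    """Create an in-memory database with made-up members and draw records."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT, group_name TEXT);
        CREATE TABLE draw_histories (memberid INTEGER, time DATETIME);
        INSERT INTO members (id, name, group_name) VALUES (1, 'Alice', 'A');
        INSERT INTO members (id, name, group_name) VALUES (2, 'Bob', 'B');
        INSERT INTO draw_histories VALUES (1, '2015-09-12 10:00:00');
        INSERT INTO draw_histories VALUES (2, '2015-09-12 11:00:00');
        INSERT INTO draw_histories VALUES (1, '2015-09-12 12:00:00');
    """)
    return db


def recent_histories(db, limit=10):
    """Return the most recent draws as (name, group_name, time) tuples."""
    return db.execute(RECENT_SQL, (limit,)).fetchall()
```

<p>Both spellings produce the same rows; the explicit form keeps the join condition separate from any other filtering.</p>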
<p>Likewise, create the corresponding template.</p>
<div class="highlight"><pre><span></span><code><!-- templates/history.html -->
{% extends "base.html" %}
{% block title %}抽籤歷史{% endblock title %}
{% block content %}
<h1>抽籤歷史(最近 10 筆)</h1>
<table>
<thead>
<tr>
<th>名字</th>
<th>團體</th>
<th>抽中時間</th>
</tr>
</thead>
<tbody>
{% for history in recent_histories %}
<tr>
<td>{{ history.0 }}</td>
<td>{{ history.1 }}</td>
<td>{{ history.2 }}</td>
</tr>
{% endfor %}
</tbody>
</table>
{% endblock content %}
</code></pre></div>
<p>This uses a new piece of template syntax, the for loop. The value of <code>history</code> changes on each iteration, and its items can be accessed as well. In Python it would look like:</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">history</span> <span class="ow">in</span> <span class="n">recent_histories</span><span class="p">:</span>
    <span class="n">history</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div>
<p>The Jinja2 templates used by Flask have many features. Now that you understand better how the server works, read the <a href="http://jinja.pocoo.org/docs/dev/templates/">official Jinja2 documentation</a> for the full usage.</p>
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/09/flask-draw-member/pics/flask_history.png"/>
</figure>
<h4 id="datetime">Handling time with datetime</h4>
<p>You may have noticed that the time SQLite returns is actually a string. What if we want to change the time format? This is where the <code>datetime</code> object from the built-in module <a href="https://docs.python.org/3.5/library/datetime.html#datetime-objects">datetime</a> comes in. While we are at it, we also turn each row returned by <code>fetchall()</code> into a dict representing one history record.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>

<span class="nd">@app</span><span class="o">.</span><span class="n">route</span><span class="p">(</span><span class="s1">'/history'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">history</span><span class="p">():</span>
    <span class="n">db</span> <span class="o">=</span> <span class="n">get_db</span><span class="p">()</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span>
        <span class="s1">'SELECT m.name, m.group_name, d.time AS "draw_time [timestamp]" '</span>
        <span class="s1">'FROM draw_histories AS d, members as m '</span>
        <span class="s1">'WHERE m.id == d.memberid '</span>
        <span class="s1">'ORDER BY d.time DESC '</span>
        <span class="s1">'LIMIT 10'</span>
    <span class="p">)</span><span class="o">.</span><span class="n">fetchall</span><span class="p">()</span>
    <span class="n">recent_histories</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">c</span><span class="p">:</span>
        <span class="n">recent_histories</span><span class="o">.</span><span class="n">append</span><span class="p">({</span>
            <span class="s1">'name'</span><span class="p">:</span> <span class="n">row</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span>
            <span class="s1">'group'</span><span class="p">:</span> <span class="n">row</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span>
            <span class="s1">'draw_time'</span><span class="p">:</span> <span class="n">datetime</span><span class="o">.</span><span class="n">strptime</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="s1">'%Y-%m-</span><span class="si">%d</span><span class="s1"> %H:%M:%S'</span><span class="p">),</span>
        <span class="p">})</span>
    <span class="k">return</span> <span class="n">render_template</span><span class="p">(</span><span class="o">...</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{% for history in recent_histories %}
<tr>
<td>{{ history.name }}</td>
<td>{{ history.group }}</td>
<td>{{ history.draw_time.strftime("%Y 年 %m 月 %d 日 %H 時 %M 分") }}</td>
</tr>
{% endfor %}
</code></pre></div>
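<p>The for loop no longer uses 0, 1, 2 to get the fields of each history record but field names instead, equivalent to <code>history['name']</code>. This is the better approach: numeric indices are easy to forget, and they stop matching as soon as the column order in the view changes. The template is also understandable when read on its own.</p>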
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/09/flask-draw-member/pics/flask_history_zh.png"/>
</figure>
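<p>As a side note, the standard library can provide name-based access to rows directly. The sketch below is one possible variation, not what the app does: it sets <code>sqlite3.Row</code> as the row factory and parses the time string with <code>datetime.strptime</code>. The sample row and the helper <code>to_history</code> are assumptions for illustration.</p>

```python
import sqlite3
from datetime import datetime

# Format of the strings produced by SQLite's datetime() function.
SQLITE_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_history(row):
    """Convert one result row into a dict with a parsed draw time."""
    return {
        "name": row["name"],
        "group": row["group_name"],
        "draw_time": datetime.strptime(row["draw_time"], SQLITE_TIME_FORMAT),
    }


db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row  # rows now support access by column name
row = db.execute(
    "SELECT 'Alice' AS name, 'A' AS group_name, "
    "'2015-09-12 10:30:00' AS draw_time"
).fetchone()
history = to_history(row)
```

<p>Once the value is a <code>datetime</code> object, any output format is a single <code>strftime</code> call away, which is what the template relies on.</p>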
<h3 id="whats-next">What’s Next</h3>
<h3 id="static-files-and-better-theme">Static files and better theme</h3>
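<p>We only used HTML templates. To make the site look nicer, you need to write CSS and JavaScript (JS). "Frameworks" such as Bootstrap, PureCSS, and Semantic UI let you build a good-looking, practical layout in a short time.</p>
<p>CSS, JS, and all the other files on the site, big and small, have to be sent from the server to the client. That is what static file handling is for.</p>
<h3 id="more-how-web-works">More how web works</h3>
<p>Besides HTTP GET and POST, HTTPS, sessions, and cookies are also very common technologies.</p>
<h3 id="object-relational-model-orm">Object-Relational Mapping (ORM)</h3>
<p>We only showed raw SQL, but as a project grows, raw SQL becomes harder and harder to manage. An ORM is one solution.</p>
<h3 id="django">Django</h3>
<p>Of course you can keep digging into Flask; it is a fine web framework. Our main code base, however, is Django, so I hope that once you know what a web server (app) looks like, you can start learning Django. The biggest design difference between the two is that Django ships with many modules and features from the start and feels "heavy", while Flask only provides the essentials.</p>
<h3 id="_6">Summary</h3>
<p>That makes a complete draw site. Most of what you need to know to build a website is covered here; the rest is details and deeper knowledge.</p>
<p>The finished product is on <a href="https://github.com/ccwang002/draw_member">GitHub</a>. Its commit log records the important steps, so to see the result of each step you can <code>git checkout</code> the corresponding commit. For example, the version with the draw feature finished and templates in use is at <code>git checkout f39fc1</code>.</p>
<p>PS: I did not expect this to get so long...</p>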
<h3 id="details">Details</h3>
<p>Many technical details are noted below; read them if you are interested.</p>
<h4 id="sqlite-table-info">SQLite table info</h4>
<p>Besides <code>.schema</code>, which shows the statement used to create each table, <code>PRAGMA table_info</code> shows the settings of every column of a table.</p>
<div class="highlight"><pre><span></span><code><span class="go">-- Run `sqlite3 -init create_db.sql`</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="p">.</span><span class="n">header</span><span class="w"> </span><span class="k">on</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="p">.</span><span class="k">mode</span><span class="w"> </span><span class="k">column</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="n">PRAGMA</span><span class="w"> </span><span class="n">table_info</span><span class="p">(</span><span class="n">members</span><span class="p">);</span>
<span class="go">cid name type notnul dflt_value pk</span>
<span class="go">--- ----------- --------- ------ ---------------------------- --</span>
<span class="go">0 id INTEGER 0 1</span>
<span class="go">1 name TEXT 1 0</span>
<span class="go">2 group_name TEXT 0 0</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="n">PRAGMA</span><span class="w"> </span><span class="n">table_info</span><span class="p">(</span><span class="n">draw_histories</span><span class="p">);</span>
<span class="go">cid name type notnul dflt_value pk</span>
<span class="go">--- ----------- --------- ------ ---------------------------- --</span>
<span class="go">0 memberid INTEGER 0 0</span>
<span class="go">1 draw_time DATETIME 0 datetime('now', 'localtime') 0</span>
</code></pre></div>
<h4 id="sqlite-foreign-key-check">SQLite foreign key check</h4>
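<p>SQLite only enforces foreign key constraints in relatively recent versions (3.6.19 and later), and the check is off by default; see the <a href="https://www.sqlite.org/foreignkeys.html#fk_enable">official documentation</a>.</p>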
<div class="highlight"><pre><span></span><code><span class="gp">sqlite></span><span class="w"> </span><span class="n">PRAGMA</span><span class="w"> </span><span class="n">foreign_keys</span><span class="p">;</span>
<span class="go">0</span>
</code></pre></div>
<p>A value of 0 means SQLite does not check foreign keys. The check can be turned on manually.</p>
<div class="highlight"><pre><span></span><code><span class="gp">sqlite></span><span class="w"> </span><span class="n">PRAGMA</span><span class="w"> </span><span class="n">foreign_keys</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">ON</span><span class="p">;</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="n">PRAGMA</span><span class="w"> </span><span class="n">foreign_keys</span><span class="p">;</span>
<span class="go">1</span>
<span class="gp">sqlite></span><span class="w"> </span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">draw_histories</span><span class="p">(</span><span class="n">memberid</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">1000</span><span class="p">);</span>
<span class="go">Error: FOREIGN KEY constraint failed</span>
</code></pre></div>
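<p>The same setting can be applied from Python. Because the pragma is per connection, a sketch like the following turns it on right after connecting. The simplified two-table schema and the helper names are assumptions for the demo.</p>

```python
import sqlite3


def connect_with_foreign_keys(path=":memory:"):
    """Open a connection and enable foreign key enforcement on it."""
    db = sqlite3.connect(path)
    db.execute("PRAGMA foreign_keys = ON")
    return db


def try_insert_history(db, member_id):
    """Insert a draw record; return False if the foreign key check rejects it."""
    try:
        with db:
            db.execute(
                "INSERT INTO draw_histories (memberid) VALUES (?)",
                (member_id,),
            )
    except sqlite3.IntegrityError:
        return False
    return True


db = connect_with_foreign_keys()
db.executescript("""
    CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE draw_histories (
        memberid INTEGER,
        FOREIGN KEY (memberid) REFERENCES members (id)
    );
    INSERT INTO members (id, name) VALUES (1, 'Alice');
""")
```

<p>With enforcement on, a record pointing at a non-existent member raises <code>sqlite3.IntegrityError</code> instead of being stored silently.</p>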
<h4 id="_7">Reloading the data</h4>
<p>First we wrap this in a function, <code>reset_db</code>.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># draw_member.py</span>
<span class="kn">import</span> <span class="nn">csv</span>

<span class="k">def</span> <span class="nf">reset_db</span><span class="p">():</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">SQLITE_DB_SCHEMA</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">create_db_sql</span> <span class="o">=</span> <span class="n">f</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
    <span class="n">db</span> <span class="o">=</span> <span class="n">get_db</span><span class="p">()</span>
    <span class="c1"># Reset database</span>
    <span class="c1"># Note that CREATE/DROP table are *immediately* committed</span>
    <span class="c1"># even inside a transaction</span>
    <span class="k">with</span> <span class="n">db</span><span class="p">:</span>
        <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">"DROP TABLE IF EXISTS draw_histories"</span><span class="p">)</span>
        <span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">"DROP TABLE IF EXISTS members"</span><span class="p">)</span>
        <span class="n">db</span><span class="o">.</span><span class="n">executescript</span><span class="p">(</span><span class="n">create_db_sql</span><span class="p">)</span>
    <span class="c1"># Read members CSV data</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">MEMBER_CSV_PATH</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s1">''</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">csv_reader</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">DictReader</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
        <span class="n">members</span> <span class="o">=</span> <span class="p">[</span>
            <span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s1">'名字'</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="s1">'團體'</span><span class="p">])</span>
            <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">csv_reader</span>
        <span class="p">]</span>
    <span class="c1"># Write members into database</span>
    <span class="k">with</span> <span class="n">db</span><span class="p">:</span>
        <span class="n">db</span><span class="o">.</span><span class="n">executemany</span><span class="p">(</span>
            <span class="s1">'INSERT INTO members (name, group_name) VALUES (?, ?)'</span><span class="p">,</span>
            <span class="n">members</span>
        <span class="p">)</span>
</code></pre></div>
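<p><code>reset_db()</code> drops the old tables and then loads the data from the CSV again, using the method introduced earlier.</p>
<p>So how is this function used?</p>
<p>One way is to bind it to a path as before, with <code>@app.route('/reset')</code>; another is to call it from a Python shell.</p>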
<div class="highlight"><pre><span></span><code><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">draw_member</span> <span class="kn">import</span> <span class="n">app</span><span class="p">,</span> <span class="n">reset_db</span>
<span class="gp">>>> </span><span class="k">with</span> <span class="n">app</span><span class="o">.</span><span class="n">app_context</span><span class="p">():</span>
<span class="gp">... </span>    <span class="n">reset_db</span><span class="p">()</span>
</code></pre></div>
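<p>The <code>with db:</code> blocks inside <code>reset_db</code> rely on the connection acting as a context manager: a normal exit commits, and an unhandled exception rolls back. Here is a minimal sketch of that behavior on an in-memory database, with a made-up one-column table.</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE members (name TEXT NOT NULL)")


def count_members(db):
    """Return the number of rows currently in the members table."""
    return db.execute("SELECT COUNT(*) FROM members").fetchone()[0]


# Normal exit from the block: the insert is committed.
with db:
    db.execute("INSERT INTO members (name) VALUES ('Alice')")

# An exception inside the block: the insert is rolled back.
try:
    with db:
        db.execute("INSERT INTO members (name) VALUES ('Bob')")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
```

<p>After both blocks only the first row remains, so a failure halfway through a batch of inserts does not leave partial data behind.</p>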
<h4 id="datetime-in-sqlite-and-python">Datetime in SQLite and Python</h4>
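<p>This post is already too long, so this part continues in <a href="../datetime-sqlite/#datetime-sqlite">the next post</a>.</p>
<p>Updated 2016-06-14: added an explanation of using <code>datetime.datetime</code> to avoid confusion with the module name (credit: 馬國薰).</p>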
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
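<p>Data processing usually has an NA value to distinguish "empty" from "an empty value". Handling that with Python's built-in <code>csv.reader</code> would be too complicated, so we skip it here. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>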
</li>
<li id="fn:sqlite3 auto commit">
<p>See the <a href="https://docs.python.org/3.5/library/sqlite3.html#using-the-connection-as-a-context-manager">official documentation</a>: a <code>with db: ...</code> code block wraps its statements in a transaction that is committed automatically on normal exit. If an unhandled exception occurs inside the block, the transaction is rolled back automatically. <a class="footnote-backref" href="#fnref:sqlite3 auto commit" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
<li id="fn:flask-config">
<p>Flask-related settings usually go in <code>app.config</code>, but it makes no difference for our example. <a class="footnote-backref" href="#fnref:flask-config" title="Jump back to footnote 3 in the text">↩</a></p>
</li>
</ol>
</div>FASTA/Q sequence processing toolkit -- seqtk2015-09-27T14:11:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2015-09-27:/posts/2015/09/seqtk/<p>This post demonstrates the FASTQ to FASTA conversion and sequence quality check using seqtk.</p><p>This is the first post of the series of my common NGS processing workflows and notes.</p>
<p>One of the most common operations in sequence processing is FASTQ → FASTA conversion. Tons of conversion scripts using either sed or awk can be found with a quick search. For example,</p>
<div class="highlight"><pre><span></span><code><span class="c1"># FASTQ to FASTA</span>
<span class="c1"># Assume every read record takes exactly 4 lines</span>
<span class="c1"># Ref: http://stackoverflow.com/a/10359425</span>
$<span class="w"> </span>sed<span class="w"> </span>-n<span class="w"> </span><span class="s1">'1~4s/^@/>/p;2~4p'</span>
</code></pre></div>
<p>The assumption of 4 lines per read usually holds for recent NGS sequencing data, so this is not a big deal.</p>
<p>In many cases the sequences are gzip’d. That is still a piece of cake with a pipeline,</p>
<div class="highlight"><pre><span></span><code>gzcat<span class="w"> </span>myseq.fq.gz<span class="w"> </span><span class="p">|</span><span class="w"> </span>sed<span class="w"> </span>-n<span class="w"> </span><span class="s1">'1~4s/^@/>/p;2~4p'</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>gzip<span class="w"> </span>><span class="w"> </span>myseq.fa.gz
</code></pre></div>
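<p>For readers who prefer Python, the same conversion can be sketched with the standard library under the same 4-lines-per-read assumption. This is only an illustration of the idea, not a replacement for the tools discussed in this post; <code>fastq_to_fasta</code> and <code>open_maybe_gzip</code> are hypothetical helper names.</p>

```python
import gzip
import io


def open_maybe_gzip(path):
    """Open a plain or gzip'd text file for reading, based on its extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def fastq_to_fasta(lines):
    """Yield FASTA lines from FASTQ lines, assuming exactly 4 lines per read."""
    for i, line in enumerate(lines):
        if i % 4 == 0 and line.startswith("@"):
            # Header line: replace the leading '@' with '>'
            yield ">" + line[1:]
        elif i % 4 == 1:
            # Sequence line: keep as-is
            yield line
        # The '+' separator and quality lines are dropped


demo_fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
demo_fasta = "".join(fastq_to_fasta(demo_fastq))
```

<p>For a real file, the generator can be fed with <code>open_maybe_gzip("myseq.fq.gz")</code> and its output written line by line.</p>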
<p>However, things can get complex really fast when one additionally wants to reverse complement, randomly sample a subset of reads, or do many other types of sequence manipulation. Efficiency matters when those tasks are applied to tens of millions of reads. Even a slightly longer computing time per read adds up at this scale.</p>
<h3 id="seqtk">Seqtk</h3>
<p>So <a href="https://github.com/lh3/seqtk">seqtk</a> comes to the rescue. It is written in C and MIT licensed. <a href="https://www.biostars.org/p/85929/#86082">A quick comparison</a> shows it is generally faster than other UNIX-based solutions, let alone implementations based on scripting languages.</p>
<p>Seqtk bundles many other operations, but I’ll just mention those I frequently use.</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>seqtk
Usage:<span class="w"> </span>seqtk<span class="w"> </span><command><span class="w"> </span><arguments>
Version:<span class="w"> </span><span class="m">1</span>.0-r77-dirty
Command:<span class="w"> </span>seq<span class="w"> </span>common<span class="w"> </span>transformation<span class="w"> </span>of<span class="w"> </span>FASTA/Q
<span class="w"> </span>comp<span class="w"> </span>get<span class="w"> </span>the<span class="w"> </span>nucleotide<span class="w"> </span>composition<span class="w"> </span>of<span class="w"> </span>FASTA/Q
<span class="w"> </span>sample<span class="w"> </span>subsample<span class="w"> </span>sequences
<span class="w"> </span>subseq<span class="w"> </span>extract<span class="w"> </span>subsequences<span class="w"> </span>from<span class="w"> </span>FASTA/Q
<span class="w"> </span>fqchk<span class="w"> </span>fastq<span class="w"> </span>QC<span class="w"> </span><span class="o">(</span>base/quality<span class="w"> </span>summary<span class="o">)</span>
<span class="w"> </span>mergepe<span class="w"> </span>interleave<span class="w"> </span>two<span class="w"> </span>PE<span class="w"> </span>FASTA/Q<span class="w"> </span>files
<span class="w"> </span>trimfq<span class="w"> </span>trim<span class="w"> </span>FASTQ<span class="w"> </span>using<span class="w"> </span>the<span class="w"> </span>Phred<span class="w"> </span>algorithm
<span class="w"> </span>hety<span class="w"> </span>regional<span class="w"> </span>heterozygosity
<span class="w"> </span>mutfa<span class="w"> </span>point<span class="w"> </span>mutate<span class="w"> </span>FASTA<span class="w"> </span>at<span class="w"> </span>specified<span class="w"> </span>positions
<span class="w"> </span>mergefa<span class="w"> </span>merge<span class="w"> </span>two<span class="w"> </span>FASTA/Q<span class="w"> </span>files
<span class="w"> </span>dropse<span class="w"> </span>drop<span class="w"> </span>unpaired<span class="w"> </span>from<span class="w"> </span>interleaved<span class="w"> </span>PE<span class="w"> </span>FASTA/Q
<span class="w"> </span>rename<span class="w"> </span>rename<span class="w"> </span>sequence<span class="w"> </span>names
<span class="w"> </span>randbase<span class="w"> </span>choose<span class="w"> </span>a<span class="w"> </span>random<span class="w"> </span>base<span class="w"> </span>from<span class="w"> </span>hets
<span class="w"> </span>cutN<span class="w"> </span>cut<span class="w"> </span>sequence<span class="w"> </span>at<span class="w"> </span>long<span class="w"> </span>N
<span class="w"> </span>listhet<span class="w"> </span>extract<span class="w"> </span>the<span class="w"> </span>position<span class="w"> </span>of<span class="w"> </span>each<span class="w"> </span>het
</code></pre></div>
<h3 id="fastq-fasta">FASTQ → FASTA</h3>
<p>Read (gzip’d) FASTQ and write out as FASTA,</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>seqtk<span class="w"> </span>seq<span class="w"> </span>-A<span class="w"> </span><span class="k">in</span>.fq<span class="o">[</span>.gz<span class="o">]</span><span class="w"> </span>><span class="w"> </span>out.fa
</code></pre></div>
<p>To make the output gzip’d again, pipe it through gzip,</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>seqtk<span class="w"> </span>seq<span class="w"> </span>-A<span class="w"> </span><span class="k">in</span>.fq<span class="o">[</span>.gz<span class="o">]</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>gzip<span class="w"> </span>><span class="w"> </span>out.fa.gz
</code></pre></div>
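<p>For intuition, the conversion can be sketched in a few lines of Python. This is only an illustration and not how seqtk is implemented: it assumes strict 4-line FASTQ records (no wrapped sequences), and <code>fastq_to_fasta</code> and <code>open_maybe_gzip</code> are hypothetical helper names.</p>

```python
import gzip
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines from 4-line FASTQ records (header, seq, '+', qual)."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            yield ">" + line[1:]  # '@name' becomes '>name'
        elif i % 4 == 1:
            yield line  # the sequence line; '+' and quality lines are dropped


def open_maybe_gzip(path: str):
    """Open a plain or gzip'd text file based on its extension."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


fasta_lines = list(fastq_to_fasta(["@r1\n", "ACGT\n", "+\n", "IIII\n"]))
print("\n".join(fasta_lines))
# >r1
# ACGT
```

<p>For real data, seqtk remains the better choice since it handles the format's edge cases and is much faster.</p>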
<h3 id="reverse-complement">Reverse complement</h3>
<p>When debugging the R2 reads of paired-end sequencing (the second read), remember that they contain the reverse complement of the insert DNA. To read them directly by eye, reverse complement the R2 reads again.</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>seqtk<span class="w"> </span>seq<span class="w"> </span>-r<span class="w"> </span>R2.fq<span class="w"> </span>><span class="w"> </span>R2_rc.fq
$<span class="w"> </span><span class="nb">echo</span><span class="w"> </span><span class="s1">'> Example R2 seq</span>
<span class="s1"> GCATTGGTGGTTCAGTGGTAGAATTCT'</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>seqtk<span class="w"> </span>seq<span class="w"> </span>-r
<span class="c1"># > Example R2 seq</span>
<span class="c1"># AGAATTCTACCACTGAACCACCAATGC</span>
</code></pre></div>
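<p>The same operation is easy to reproduce in Python when a quick check is needed inside a script. A minimal sketch, where <code>reverse_complement</code> is a hypothetical helper that only handles A, C, G, T, and N:</p>

```python
# Illustrative helper (not part of seqtk): reverse complement a DNA string.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the whole sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("GCATTGGTGGTTCAGTGGTAGAATTCT"))
# AGAATTCTACCACTGAACCACCAATGC
```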
<h3 id="quality-check">Quality check</h3>
<p>To be honest, <a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/">FastQC</a> is more frequently used for quality check because it generates <a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html">reports with beautiful figures</a>.</p>
<p>But for a detailed report on each read position, consider <code>seqtk fqchk</code>.</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>seqtk<span class="w"> </span>fqchk<span class="w"> </span>myseq.fq<span class="o">[</span>.gz<span class="o">]</span>
</code></pre></div>
<p>By default it sets <code>-q 20</code>. This threshold decides whether a base counts as low or high quality, reported as <code>%low</code> and <code>%high</code> per read position. With the default, bases with a quality score higher than 20 are treated as high quality.</p>
<div class="highlight"><pre><span></span><code>min_len: 10; max_len: 174; avg_len: 28.92; 37 distinct quality values
POS #bases %A %C %G %T %N avgQ errQ %low %high
ALL 236344886 17.0 22.5 31.3 29.2 0.0 39.9 37.6 0.1 99.9
1 8172342 8.9 12.4 57.0 21.7 0.0 39.6 29.0 0.5 99.5
2 8172342 7.7 62.5 16.2 13.7 0.0 39.8 37.8 0.2 99.8
3 8172342 50.3 24.1 11.9 13.6 0.0 39.8 38.2 0.1 99.9
4 8172342 10.4 22.9 15.3 51.3 0.0 39.9 38.7 0.1 99.9
5 8172342 14.3 12.9 22.3 50.5 0.0 39.8 37.0 0.2 99.8
# ... (trimmed)
</code></pre></div>
<p>Two of the columns, <code>avgQ</code> and <code>errQ</code>, need more explanation. Average quality (<code>avgQ</code>) is the mean quality score, weighted by the number of bases at each score,</p>
<p><span class="katex-display"><span class="katex"><span class="katex-mathml"><math display="block" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mrow><mi mathvariant="normal">a</mi><mi mathvariant="normal">v</mi><mi mathvariant="normal">g</mi><mi mathvariant="normal">Q</mi></mrow><mo>=</mo><mfrac><mrow><munderover><mo>∑</mo><mrow><mi>q</mi><mo>=</mo><mn>0</mn></mrow><mn>93</mn></munderover><mi>q</mi><mo>⋅</mo><msub><mi>n</mi><mi>q</mi></msub></mrow><mrow><munderover><mo>∑</mo><mrow><mi>q</mi><mo>=</mo><mn>0</mn></mrow><mn>93</mn></munderover><msub><mi>n</mi><mi>q</mi></msub></mrow></mfrac><mo separator="true">,</mo></mrow><annotation encoding="application/x-tex">
\mathrm{avgQ} = \dfrac{\sum_{q=0}^{93} q \cdot n_q}{\sum_{q = 0}^{93} n_q},
</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.8778em;vertical-align:-0.1944em;"></span><span class="mord"><span class="mord mathrm">avgQ</span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:3.0597em;vertical-align:-1.2798em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.7798em;"><span style="top:-2.156em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mop"><span class="mop op-symbol small-op" style="position:relative;top:0em;">∑</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.954em;"><span style="top:-2.4003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em;">q</span><span class="mrel mtight">=</span><span class="mord mtight">0</span></span></span></span><span style="top:-3.2029em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">93</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.4358em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span 
style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em;">q</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.8258em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mop"><span class="mop op-symbol small-op" style="position:relative;top:0em;">∑</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.954em;"><span style="top:-2.4003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em;">q</span><span class="mrel mtight">=</span><span class="mord mtight">0</span></span></span></span><span style="top:-3.2029em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">93</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.4358em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t 
vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em;">q</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.2798em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mpunct">,</span></span></span></span></span></p>
<p>where <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>n</mi><mi>q</mi></msub></mrow><annotation encoding="application/x-tex">n_q</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.7167em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em;">q</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span> is the number of bases with quality score being <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>q</mi></mrow><annotation encoding="application/x-tex">q</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span></span></span></span>. The magic number 93 comes from the quality score of Sanger sequencing<sup id="fnref:sanger-qual-score"><a class="footnote-ref" href="#fn:sanger-qual-score">1</a></sup>, whose score ranges from 0 to 93.</p>
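<p>As a small worked example of the formula, here is a sketch in Python. The function name <code>avg_q</code> and the base counts are made up for illustration; they are not taken from seqtk.</p>

```python
def avg_q(counts: dict[int, int]) -> float:
    """Weighted mean quality; counts maps a quality score q to n_q."""
    total = sum(counts.values())
    return sum(q * n for q, n in counts.items()) / total


# Hypothetical counts at one read position: 10 bases at Q20, 90 bases at Q40.
print(avg_q({20: 10, 40: 90}))  # 38.0
```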
<p>For <code>errQ</code> we need more background knowledge about how quality score is computed. A base with quality score <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>q</mi></mrow><annotation encoding="application/x-tex">q</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span></span></span></span> implies the probability of being erroneously called, <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>P</mi><mi>q</mi></msub></mrow><annotation encoding="application/x-tex">P_q</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.9694em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em;">q</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span></span>, is</p>
<p><span class="katex-display"><span class="katex"><span class="katex-mathml"><math display="block" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>P</mi><mi>q</mi></msub><mo>=</mo><mn>1</mn><msup><mn>0</mn><mfrac><mrow><mo>−</mo><mi>q</mi></mrow><mn>10</mn></mfrac></msup><mo separator="true">,</mo><mspace width="1em"></mspace><mi>q</mi><mo>=</mo><mo>−</mo><mn>10</mn><msub><mrow><mi>log</mi><mo></mo></mrow><mn>10</mn></msub><msub><mi>P</mi><mi>q</mi></msub><mi mathvariant="normal">.</mi></mrow><annotation encoding="application/x-tex">
P_q = 10^{\frac{-q}{10}}, \hspace{1em} q = -10\log_{10}{P_q}.
</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.9694em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em;">q</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.2228em;vertical-align:-0.1944em;"></span><span class="mord">1</span><span class="mord"><span class="mord">0</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:1.0283em;"><span style="top:-3.413em;margin-right:0.05em;"><span class="pstrut" style="height:3em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mopen nulldelimiter sizing reset-size3 size6"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.879em;"><span style="top:-2.656em;"><span class="pstrut" style="height:3em;"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mtight">10</span></span></span></span><span style="top:-3.2255em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line mtight" style="border-bottom-width:0.049em;"></span></span><span 
style="top:-3.4624em;"><span class="pstrut" style="height:3em;"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mtight">−</span><span class="mord mathnormal mtight" style="margin-right:0.03588em;">q</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.344em;"><span></span></span></span></span></span><span class="mclose nulldelimiter sizing reset-size3 size6"></span></span></span></span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:1em;"></span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.9805em;vertical-align:-0.2861em;"></span><span class="mord">−</span><span class="mord">10</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mop"><span class="mop">lo<span style="margin-right:0.01389em;">g</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.207em;"><span style="top:-2.4559em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">10</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.2441em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" 
style="height:0.1514em;"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em;">q</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span><span class="mord">.</span></span></span></span></span></p>
<p>Therefore, given <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>q</mi></mrow><annotation encoding="application/x-tex">q</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span></span></span></span> being <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0</mn><mo separator="true">,</mo><mn>1</mn><mo separator="true">,</mo><mn>2</mn><mo separator="true">,</mo><mo>…</mo></mrow><annotation encoding="application/x-tex">0, 1, 2, \ldots</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.8389em;vertical-align:-0.1944em;"></span><span class="mord">0</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord">1</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord">2</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="minner">…</span></span></span></span>, seqtk has a conversion table <code>perr</code> from quality score to probability,</p>
<table>
<thead>
<tr>
<th style="text-align: left;">Q</th>
<th style="text-align: right;">0</th>
<th style="text-align: right;">1</th>
<th style="text-align: right;">2</th>
<th style="text-align: right;">3<sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;"><strong>P</strong></td>
<td style="text-align: right;">0.5</td>
<td style="text-align: right;">0.5</td>
<td style="text-align: right;">0.5</td>
<td style="text-align: right;">0.5</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: left;">Q</th>
<th style="text-align: right;">4</th>
<th style="text-align: right;">5</th>
<th style="text-align: center;">…</th>
<th style="text-align: right;">38</th>
<th style="text-align: right;">39</th>
<th style="text-align: right;">40</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;"><strong>P</strong></td>
<td style="text-align: right;">0.398107</td>
<td style="text-align: right;">0.316228</td>
<td style="text-align: center;">…</td>
<td style="text-align: right;">0.000158</td>
<td style="text-align: right;">0.000126</td>
<td style="text-align: right;">0.000100</td>
</tr>
</tbody>
</table>
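<p>The table can be rebuilt from the Phred formula in a few lines. This is a sketch that follows the values shown above, not a copy of seqtk's code; <code>build_perr</code> is a hypothetical name.</p>

```python
def build_perr(max_q: int = 93) -> list[float]:
    """Error probability per quality score, indexed by q = 0..max_q."""
    # Scores below 4 are pinned to 0.5, as in the table above;
    # the rest follow P_q = 10 ** (-q / 10).
    return [0.5 if q < 4 else 10 ** (-q / 10) for q in range(max_q + 1)]


perr = build_perr()
print(f"{perr[4]:.6f} {perr[5]:.6f} {perr[40]:.6f}")
# 0.398107 0.316228 0.000100
```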
<p>Based on the probability, it computes the expected number of base call errors, num_err, and the empirical probability of having a base call error at this position, errP,</p>
<p><span class="katex-display"><span class="katex"><span class="katex-mathml"><math display="block" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mrow><mi mathvariant="normal">n</mi><mi mathvariant="normal">u</mi><mi mathvariant="normal">m</mi></mrow><mtext>err</mtext></msub><mo>=</mo><munder><mo>∑</mo><mi>q</mi></munder><msub><mi>P</mi><mi>q</mi></msub><mo>⋅</mo><msub><mi>n</mi><mi>q</mi></msub><mo separator="true">,</mo><mspace width="1em"></mspace><mtext>errP</mtext><mo>=</mo><mfrac><msub><mrow><mi mathvariant="normal">n</mi><mi mathvariant="normal">u</mi><mi mathvariant="normal">m</mi></mrow><mtext>err</mtext></msub><mrow><munder><mo>∑</mo><mi>q</mi></munder><msub><mi>n</mi><mi>q</mi></msub></mrow></mfrac><mi mathvariant="normal">.</mi></mrow><annotation encoding="application/x-tex">
\mathrm{num}_\text{err} = \sum_q P_q \cdot n_q, \hspace{1em} \text{errP} = \frac{\mathrm{num}_\text{err}}{\sum_q n_q}.
</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord"><span class="mord mathrm">num</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord text mtight"><span class="mord mtight">err</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:2.4361em;vertical-align:-1.3861em;"></span><span class="mop op-limits"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.05em;"><span style="top:-1.9em;margin-left:0em;"><span class="pstrut" style="height:3.05em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em;">q</span></span></span><span style="top:-3.05em;"><span class="pstrut" style="height:3.05em;"></span><span><span class="mop op-symbol large-op">∑</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.3861em;"><span></span></span></span></span></span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span 
style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em;">q</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.9694em;vertical-align:-0.2861em;"></span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em;">q</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:1em;"></span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord text"><span class="mord">errP</span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:2.2294em;vertical-align:-1.1218em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.1076em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mop"><span class="mop 
op-symbol small-op" style="position:relative;top:0em;">∑</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.0017em;"><span style="top:-2.4003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em;">q</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.4358em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.03588em;">q</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em;"><span></span></span></span></span></span></span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span class="mord"><span class="mord mathrm">num</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord text mtight"><span class="mord mtight">err</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span 
class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:1.1218em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mord">.</span></span></span></span></span></p>
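<p>Thus <code>errQ</code> is the quality score equivalent to errP. Because <code>avgQ</code> averages scores on a logarithmic scale, it under-weights low-quality bases, so <code>errQ</code> reflects the probability of a base call error better than <code>avgQ</code> does,</p>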
<p><span class="katex-display"><span class="katex"><span class="katex-mathml"><math display="block" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mrow><mi mathvariant="normal">e</mi><mi mathvariant="normal">r</mi><mi mathvariant="normal">r</mi><mi mathvariant="normal">Q</mi></mrow><mo>=</mo><mo>−</mo><mn>10</mn><msub><mrow><mi>log</mi><mo></mo></mrow><mn>10</mn></msub><mrow><mi mathvariant="normal">e</mi><mi mathvariant="normal">r</mi><mi mathvariant="normal">r</mi><mi mathvariant="normal">P</mi></mrow><mi mathvariant="normal">.</mi></mrow><annotation encoding="application/x-tex">
\mathrm{errQ} = -10\log_{10}{\mathrm{errP}}.
</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.8778em;vertical-align:-0.1944em;"></span><span class="mord"><span class="mord mathrm">errQ</span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.9386em;vertical-align:-0.2441em;"></span><span class="mord">−</span><span class="mord">10</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mop"><span class="mop">lo<span style="margin-right:0.01389em;">g</span></span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.207em;"><span style="top:-2.4559em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">10</span></span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.2441em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord"><span class="mord mathrm">errP</span></span></span><span class="mord">.</span></span></span></span></span></p>
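<p>To see how <code>errQ</code> differs from <code>avgQ</code>, here is a sketch that applies the formulas above to made-up counts. The function name <code>err_q</code> and the input are assumptions for illustration, and the code follows the formulas in this post, not seqtk's exact arithmetic.</p>

```python
import math

# Error probability per quality score: pinned to 0.5 below Q4, Phred otherwise.
PERR = [0.5 if q < 4 else 10 ** (-q / 10) for q in range(94)]


def err_q(counts: dict[int, int]) -> float:
    """Quality score equivalent to the expected error rate at one position."""
    num_err = sum(PERR[q] * n for q, n in counts.items())  # expected errors
    err_p = num_err / sum(counts.values())                 # per-base error rate
    return -10 * math.log10(err_p)


# Hypothetical counts: 10 bases at Q20 and 90 bases at Q40.
print(f"{err_q({20: 10, 40: 90}):.1f}")  # 29.6
```

<p>For the same counts the weighted mean quality is 38.0, so the few Q20 bases pull <code>errQ</code> down much more than they pull <code>avgQ</code>.</p>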
<p>By passing <code>-q 0</code> to <code>seqtk fqchk</code>, one can get the proportion of all distinct quality scores at each position. This information is pretty useful if the sequencing data is all a mess and one needs to figure out the cause.</p>
<p>Though some of <code>seqtk fqchk</code>’s behavior is not documented, it is straightforward enough to understand. The details can always be found in the <a href="https://github.com/lh3/seqtk/blob/4feb6e81444ab6bc44139dd3a125068f81ae4ad8/seqtk.c#L1483">source code</a>.</p>
<h3 id="summary">Summary</h3>
<p><a href="https://github.com/lh3/seqtk">Seqtk</a> is fast and handy for daily routines of FASTA/Q conversion. On top of that, it provides various functionalities such as random read sampling, quality check, and many more I haven’t tried or mentioned.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:sanger-qual-score">
<p>See the multiple specifications of quality score in the <a href="http://scikit-bio.org/docs/latest/generated/skbio.io.format.fastq.html#quality-score-variants">scikit-bio doc</a>. The score is the <a href="https://en.wikipedia.org/wiki/Phred_quality_score">Phred quality score</a>. Other score representations can be found at the <a href="https://en.wikipedia.org/wiki/FASTQ_format">FASTQ wiki</a>. <a class="footnote-backref" href="#fnref:sanger-qual-score" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>Note that the probability for q less than 4 is fixed at 0.5. A quick computation shows that when <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>q</mi><mo>=</mo><mn>3</mn></mrow><annotation encoding="application/x-tex">q = 3</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">3</span></span></span></span>, its actual Phred probability is <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>1</mn><msup><mn>0</mn><mrow><mo>−</mo><mn>0.3</mn></mrow></msup><mo>=</mo><mn>0.501</mn></mrow><annotation encoding="application/x-tex">10 ^ {-0.3} = 0.501</annotation></semantics></math></span><span aria-hidden="true" class="katex-html"><span class="base"><span class="strut" style="height:0.8141em;"></span><span class="mord">1</span><span class="mord"><span class="mord">0</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">−</span><span class="mord mtight">0.3</span></span></span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span 
class="mord">0.501</span></span></span></span>. <a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
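</div>Removing ^H2015-09-27T02:28:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2015-09-27:/posts/2015/09/remove-^H/<p>I type Chinese with the Boshiamy (嘸蝦米) input method, and when switching between Chinese and English it is easy to accidentally type <code>\x08</code>. In vim it shows up as <code>^H</code> and acts as <kbd>Backspace</kbd>, but in a typical GUI environment it may …</p><p>I type Chinese with the Boshiamy (嘸蝦米) input method, and when switching between Chinese and English it is easy to accidentally type <code>\x08</code>. In vim it shows up as <code>^H</code> and acts as <kbd>Backspace</kbd>, but in a typical GUI environment it may eat the preceding character.</p>
<p>So today I wrote a small script that removes <code>^H</code> from the text files under the current directory:</p>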
<div class="highlight"><pre><span></span><code>ag<span class="w"> </span>-l<span class="w"> </span><span class="s1">'\x08'</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>xargs<span class="w"> </span>sed<span class="w"> </span>-i<span class="w"> </span><span class="s1">''</span><span class="w"> </span><span class="s1">'s/\x08//g'</span>
</code></pre></div>
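<p><a href="https://github.com/ggreer/the_silver_searcher">ag</a> can be replaced with grep, which is slower but installed by default; the options used here are compatible between the two.</p>
<p>To also print which files were changed:</p>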
<div class="highlight"><pre><span></span><code><span class="nb">echo</span><span class="w"> </span><span class="s1">'Found ^H in the following files:'</span>
ag<span class="w"> </span>-l<span class="w"> </span><span class="s1">'\x08'</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>tee<span class="w"> </span>/dev/fd/2<span class="w"> </span><span class="p">|</span><span class="w"> </span>xargs<span class="w"> </span>sed<span class="w"> </span>-i<span class="w"> </span><span class="s1">''</span><span class="w"> </span><span class="s1">'s/\x08//g'</span>
</code></pre></div>
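<p>For comparison, the same cleanup can be sketched in Python. This is only a rough equivalent under my own assumptions, not the script from this post: the function name <code>strip_backspaces</code> is made up, hidden folders such as <code>.git</code> are skipped, and it makes no attempt to tell text files from binary ones.</p>

```python
from pathlib import Path

BACKSPACE = b"\x08"  # the control character that vim displays as ^H


def strip_backspaces(root):
    """Remove every \\x08 byte from files under root and return the changed paths."""
    changed = []
    for path in sorted(Path(root).rglob("*")):
        relative_parts = path.relative_to(root).parts
        # Assumption: skip hidden files and folders (e.g. .git) to be safe.
        if any(part.startswith(".") for part in relative_parts):
            continue
        if not path.is_file():
            continue
        data = path.read_bytes()
        if BACKSPACE in data:
            path.write_bytes(data.replace(BACKSPACE, b""))
            changed.append(path)
    return changed


if __name__ == "__main__":
    print("Found ^H in the following files:")
    for changed_path in strip_backspaces(Path(".")):
        print(changed_path)
```

<p>Returning the list of changed files plays the role of the <code>tee</code> trick in the shell version.</p>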
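<p>Using tee to copy stdout to stderr is rather neat; I had not known it could be used this way.</p>Managing references with Zotero2015-09-26T00:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2015-09-26:/posts/2015/09/ref-management-zotero/<p><strong>TL;DR</strong> Use Zotero to sync references, webpages, and everything.</p>
<p>The reason to start collecting references is, of course, research. Papers and regular progress reports need citations, and in academia what you cite most often is other …</p><p><strong>TL;DR</strong> Use Zotero to sync references, webpages, and everything.</p>
<p>The reason to start collecting references is, of course, research. Papers and regular progress reports need citations, and in academia what you cite most often is other people's journal articles. Journal citations follow specific formats, every journal uses a different one, typing them by hand is error-prone, and they are hard to maintain. So the best approach is to store the complete article information in a database and insert it into the document when citing.</p>
<p>The whole problem then becomes how to manage this article information.</p>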
<div class="toc">
<ul>
<li><a href="#bibtex-is-for-latex">BibTeX is for LaTeX</a><ul>
<li><a href="#bibdesk">BibDesk</a></li>
</ul>
</li>
<li><a href="#endnote-is-for-word">EndNote is for Word</a></li>
<li><a href="#zotero-bridges-the-both-world">Zotero bridges both worlds</a><ul>
<li><a href="#zotfile">Zotfile</a></li>
<li><a href="#how-to-sync-data-storage">How to sync data storage</a></li>
</ul>
</li>
<li><a href="#_1">Summary</a></li>
</ul>
</div>
<h3 id="bibtex-is-for-latex">BibTeX is for LaTeX</h3>
<p>In <a href="https://www.latex-project.org/">LaTeX</a>, citations and references can be handled with the workflow provided by <a href="http://www.bibtex.org/">BibTeX</a> (or the newer <a href="https://www.ctan.org/pkg/biblatex">BibLaTeX</a>), that is, <a href="https://en.wikibooks.org/wiki/LaTeX/Bibliography_Management">bibliography management</a>. It keeps all references in a single plain-text file,</p>
<div class="highlight"><pre><span></span><code>@article<span class="nb">{</span>Calin:2006aa,
Author = <span class="nb">{</span>Calin, George A. and Croce, Carlo M.<span class="nb">}</span>,
Journal = <span class="nb">{</span>Nat Rev Cancer<span class="nb">}</span>,
Month = <span class="nb">{</span>11<span class="nb">}</span>,
Number = <span class="nb">{</span>11<span class="nb">}</span>,
Pages = <span class="nb">{</span>857--866<span class="nb">}</span>,
Title = <span class="nb">{</span>MicroRNA signatures in human cancers<span class="nb">}</span>,
Volume = <span class="nb">{</span>6<span class="nb">}</span>,
Year = <span class="nb">{</span>2006<span class="nb">}}</span>
</code></pre></div>
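<p>Each article has a cite key that is used to cite it in the text. BibTeX then formats the citation according to the currently defined style and adds the corresponding reference at the end of the document.</p>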
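<p>The idea of a cite key plus a style can be illustrated with a toy Python sketch. It is not how BibTeX works internally; the <code>LIBRARY</code> dictionary, the <code>format_reference</code> function, and the output style are all made up for illustration, and only the field values come from the entry above.</p>

```python
# One entry, indexed by its cite key, mirroring the BibTeX record above.
LIBRARY = {
    "Calin:2006aa": {
        "author": "Calin, George A. and Croce, Carlo M.",
        "journal": "Nat Rev Cancer",
        "volume": "6",
        "pages": "857--866",
        "title": "MicroRNA signatures in human cancers",
        "year": "2006",
    },
}


def format_reference(cite_key):
    """Render one entry in a made-up style: Authors. Title. Journal volume, pages (year)."""
    entry = LIBRARY[cite_key]
    authors = entry["author"].rstrip(".")  # avoid a doubled period after initials
    pages = entry["pages"].replace("--", "-")
    return (
        f"{authors}. {entry['title']}. "
        f"{entry['journal']} {entry['volume']}, {pages} ({entry['year']})."
    )


print(format_reference("Calin:2006aa"))
```

<p>Switching to another journal's format would only mean swapping the formatting function, while the stored entry stays the same.</p>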
<h4 id="bibdesk">BibDesk</h4>
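<p>Part of what makes BibTeX truly usable day to day is graphical tools such as <a href="http://bibdesk.sourceforge.net/">BibDesk</a>.</p>
<p><a href="http://bibdesk.sourceforge.net/">BibDesk</a> is an OS X application included in the <a href="https://tug.org/mactex/">MacTeX</a> distribution. Besides automatically importing citations from websites or from various formats, it supports file attachments: it can link files such as a paper's PDF and supplementary files to the corresponding entry, rename them, and place them in a structured folder. Both the renaming and the filing schemes are customizable. For example, you can sort files by journal name/year, and putting that folder on Dropbox gives you automatic syncing.</p>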
<figure>
<img src="https://blog.liang2.tw/posts/2015/09/ref-management-zotero/pics/bibdesk_usage.png"/>
<p class="caption center"><span class="fig">BibDesk in use</span></p>
</figure>
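<p>This solves almost every problem you run into when writing a paper.</p>
<p>Day to day I keep one huge BibTeX file with all kinds of references. When writing a paper, I export the relevant papers into a small file and strip the irrelevant information from the entries, and then I no longer have to think about citations. I managed references this way for several years.</p>
<h3 id="endnote-is-for-word">EndNote is for Word</h3>
<p>But not everyone writes papers in LaTeX. In our lab, for example, I am the only one using LaTeX; everyone else uses Word. Word has no management tool that is as simple and handy. The most widely used one is EndNote. It is paid software, but since I am a student at a public university, thank you, taxpayers, for letting me use it for free (bows).</p>
<p>EndNote does the same job as BibTeX: once you use it, you no longer have to manage bibliography formatting in Word. However, I usually export references from elsewhere and drop them into EndNote, so I do not know what other features it has.</p>
<p>Oh, one advantage is that it works equally well on OS X and Windows.</p>
<h3 id="zotero-bridges-the-both-world">Zotero bridges both worlds</h3>
<p>BibDesk (BibTeX) is so convenient that I sometimes want to use it for classic technical articles too, and to save useful PDFs, videos, and websites in it. But these sources do not provide a BibTeX citation to copy and paste, and the BibTeX syntax was not designed to handle so many kinds of sources, so writing the entries is clumsy and time-consuming.</p>
<p>On the other hand, literature search now happens in the browser. When I find a paper, going through Export citation, open BibDesk, Import citation, Download PDF(s), and Link PDF(s) is also tedious.</p>
<p>Hence <a href="https://www.zotero.org/">Zotero</a> [zoh-TAIR-oh], a tool integrated into the browser. It currently supports Firefox, Chrome, and Safari, and also provides plugins for Word and LibreOffice. So it should be enough to replace the tools above, although I have not used it with Word.</p>
<p>The basic interface is quite simple and probably similar to every other reference manager, except that it lives in the browser,</p>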
<figure>
<img src="https://blog.liang2.tw/posts/2015/09/ref-management-zotero/pics/zotero.png"/>
<p class="caption center"><span class="fig">Zotero in use</span></p>
</figure>
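<p>Usage is simple, with just two buttons <img src="https://blog.liang2.tw/posts/2015/09/ref-management-zotero/pics/zotero_icon.png" style="height: 1.6em;"/>: the left one opens the Zotero window, and the right one saves the current page into your library. A progress message then appears at the bottom right, and if the page is a journal site and you have access to the full-text PDF, the PDF is saved along with it.</p>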
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/09/ref-management-zotero/pics/zotero_saving.png"/>
</figure>
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/09/ref-management-zotero/pics/zotero_citation.png"/>
</figure>
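<p>When I need citations in a paper, I still export to BibTeX or EndNote first. But Zotero has another handy feature: it can output citations as an RTF/HTML bibliography, which can be pasted directly into PowerPoint, very convenient for slides.</p>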
<blockquote>
<ol>
<li>Torsten Thomas, Jack Gilbert & Folker Meyer. Metagenomics - a guide from sampling to data analysis. <em>Microbial Informatics and Experimentation</em> <strong>2</strong>, 3 (2012).</li>
</ol>
</blockquote>
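<p>Zotero offers 300MB of free storage for syncing your library, which is plenty for the citations alone. It also supports syncing to a self-hosted WebDAV server.</p>
<p>PS: Zotero is licensed under AGPL v3, and its source code is on <a href="https://github.com/zotero/zotero">GitHub</a>.</p>
<h4 id="zotfile">Zotfile</h4>
<p>Zotero's built-in PDF attachment feature is not as complete as BibDesk's, so <a href="http://zotfile.com/">Zotfile</a> adds extra features for managing PDF files. Moreover, Zotero's storage is limited, so you may want to keep large files such as PDFs somewhere like Dropbox rather than syncing everything through Zotero.</p>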
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/09/ref-management-zotero/pics/zotfile_file_location.png"/>
<p class="caption left">Custom location for (PDF) files; subfolders can be configured underneath. Here files are sorted into folders by <code>journal name/year</code>.</p>
</figure>
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/09/ref-management-zotero/pics/zotfile_rename_setting.png"/>
<figcaption>Custom file renaming rules</figcaption>
</figure>
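<p>However, when syncing through Dropbox, the path may differ on each computer. For example, it may be <code>/Users/me/Dropbox</code> on OS X but <code>/home/me/Dropbox</code> on Debian. In that case the storage path has to be changed to a relative path.</p>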
<figure class="align-center">
<img src="https://blog.liang2.tw/posts/2015/09/ref-management-zotero/pics/zotero_file_location.png"/>
<figcaption>Changing the paths of library-related files in Zotero's Advanced settings.</figcaption>
</figure>
<p>The difference between the Linked Attachment Base Directory and the Data Directory deserves an extra note. Files managed by Zotfile, such as PDFs, or files attached manually with "Attach Link …", are linked attachments, and their icon carries a link symbol <img src="https://blog.liang2.tw/posts/2015/09/ref-management-zotero/pics/zotfile_fileicon.png" style="height: 1.6em;"/>. Everything else, such as webpage snapshots or PDFs saved the default way, is stored in the Data Directory.</p>
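<p>Why a relative path plus a per-machine base directory works can be shown with a few lines of Python. This only illustrates the idea and is not Zotero's implementation; <code>resolve_attachment</code> and the PDF file name are hypothetical, and only the two Dropbox prefixes come from the text above.</p>

```python
from pathlib import PurePosixPath


def resolve_attachment(base_dir, relative_path):
    """Join the per-machine base directory with the stored relative path."""
    return PurePosixPath(base_dir) / relative_path


# Hypothetical stored path; the library would only record this relative part.
stored = "papers/Nat Rev Cancer/2006/Calin_2006.pdf"

on_osx = resolve_attachment("/Users/me/Dropbox", stored)
on_debian = resolve_attachment("/home/me/Dropbox", stored)

print(on_osx)
print(on_debian)
```

<p>The stored part is identical on every machine, so only the base directory needs to be configured once per computer.</p>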
<h4 id="how-to-sync-data-storage">How to sync data storage</h4>
<p>To go one step further and sync the data storage through Dropbox as well, see the official <a href="https://www.zotero.org/support/sync">introduction to sync</a>. On OS X, the data of Zotero for Firefox is stored in</p>
<div class="highlight"><pre><span></span><code>~/Library/Application Support/Firefox/Profiles/xxxxxxxx.default/zotero
</code></pre></div>
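<p>The data storage is the <code>storage</code> folder underneath it. The official site advises against syncing Zotero's SQLite database and similar files through Dropbox, so it is enough to move this one folder to Dropbox and soft link it back.</p>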
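<p>The move-and-link-back step could look like the following Python sketch. The function name <code>move_and_link_back</code> and the example paths in the comment are my own placeholders, so treat it as an outline of the idea rather than a tested migration script.</p>

```python
import shutil
from pathlib import Path


def move_and_link_back(original, new_home):
    """Move a folder to new_home and leave a symlink at the original location."""
    original = Path(original)
    new_home = Path(new_home)
    new_home.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(original), str(new_home))
    # The application keeps using the old path, which now points into new_home.
    original.symlink_to(new_home, target_is_directory=True)


# Hypothetical usage; replace both paths with the real ones on your machine:
# move_and_link_back(zotero_data_dir / "storage", dropbox_dir / "zotero-storage")
```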
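<h3 id="_1">Summary</h3>
<p><a href="https://www.zotero.org/">Zotero</a> is a practical reference (bibliography) manager that is integrated with the browser. It also handles other formats common on the web, such as webpages, works with existing tools and document editors, and has syncing built in, which makes it well suited as an external memory.</p>
<p>(This should have been written in English. When will I ever write my first English blog post? QAQ)</p>PNG Optimizer2015-09-22T00:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2015-09-22:/posts/2015/09/png-optim/<p>A freshly launched blog is always exciting, so I measured how much bandwidth it takes to browse a site like this. The front page stays around 600KB, because the article summaries have no images yet …</p><p>A freshly launched blog is always exciting, so I measured how much bandwidth it takes to browse a site like this. The front page stays around 600KB, because the article summaries have no images yet. But a post like the one on setting up the blog, which contains several screenshots, takes almost 2MB to transfer.</p>
<p>So I looked into ways to compress the images. For JPG, <a href="https://github.com/tjko/jpegoptim">jpegoptim</a> is simple and effective. For PNG, I used to use <a href="http://optipng.sourceforge.net/">OptiPNG</a>, but its effect is limited because it is lossless. For screenshots I do not mind a few pixels being slightly off in color (the human eye cannot really tell).</p>
<p>So I needed to compare the PNG compression methods that are available. I happened to find <a href="http://css-ig.net/png-tools-overview.html">http://css-ig.net/png-tools-overview.html</a>, a comparison dedicated to PNG optimization, and picked a few tools to try.</p>
<p>The results are summarized in the table below:</p>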
<table>
<thead>
<tr>
<th style="text-align: left;">Filename</th>
<th style="text-align: right;">Original size</th>
<th style="text-align: right;"><a href="http://optipng.sourceforge.net/">OptiPNG</a> (lossless)</th>
<th style="text-align: right;"><a href="https://github.com/google/zopfli">Zopfli</a> (lossless)</th>
<th style="text-align: right;"><a href="https://pngquant.org/">pngquant</a> (lossy)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;"><a href="../blog-internals/pics/blog_desktop.png">blog_desktop.png</a></td>
<td style="text-align: right;">180K</td>
<td style="text-align: right;">108K</td>
<td style="text-align: right;">100K</td>
<td style="text-align: right;">56K</td>
</tr>
<tr>
<td style="text-align: left;"><a href="../blog-internals/pics/blog_mobile.png">blog_mobile.png</a></td>
<td style="text-align: right;">72K</td>
<td style="text-align: right;">52K</td>
<td style="text-align: right;">44K</td>
<td style="text-align: right;">28K</td>
</tr>
<tr>
<td style="text-align: left;"><a href="../blog-internals/pics/justfont_setting.png">justfont_setting.png</a></td>
<td style="text-align: right;">272K</td>
<td style="text-align: right;">196K</td>
<td style="text-align: right;">164K</td>
<td style="text-align: right;">84K</td>
</tr>
<tr>
<td style="text-align: left;"><a href="../blog-internals/pics/oldsite.png">oldsite.png</a></td>
<td style="text-align: right;">604K</td>
<td style="text-align: right;">536K</td>
<td style="text-align: right;">492K</td>
<td style="text-align: right;">288K</td>
</tr>
<tr>
<td style="text-align: left;"><a href="https://blog.liang2.tw/posts/2015/09/png-optim/pics/oldsite_full.png">oldsite_full.png</a></td>
<td style="text-align: right;">816K</td>
<td style="text-align: right;">684K</td>
<td style="text-align: right;">644K</td>
<td style="text-align: right;">376K</td>
</tr>
</tbody>
</table>
<h3 id="optipng">OptiPNG</h3>
<p><a href="http://optipng.sourceforge.net/">OptiPNG</a> does help somewhat, but it is not fast.</p>
<h3 id="zopfli-zopflipng">Zopfli (Zopflipng)</h3>
<p><a href="https://github.com/google/zopfli">Zopfli</a> is a compression algorithm developed by Google that is compatible with the deflate, gzip, and zlib formats, which is why it can be used on PNG. It is also lossless compression.</p>
<blockquote>
<p>Zopfli Compression Algorithm is a compression library programmed in C to perform very good, but slow, deflate or zlib compression.
(<a href="https://github.com/google/zopfli">Zopfli Readme</a>)</p>
</blockquote>
<p><a href="https://github.com/google/zopfli/blob/master/README.zopflipng">Zopflipng</a> is maintained in the same repo. Building it yourself is simple,</p>
<div class="highlight"><pre><span></span><code>git clone https://github.com/google/zopfli.git
cd zopfli
make zopflipng
</code></pre></div>
<p>To compress a batch of PNGs in one go:</p>
<div class="highlight"><pre><span></span><code>zopflipng --lossy_transparent --prefix *.png
</code></pre></div>
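<p>It is also fairly slow; there is a <code>-q</code> option to speed it up. But it compresses better than OptiPNG.</p>
<p>PS: Just this morning I saw that Google released another compression algorithm, <a href="https://github.com/google/brotli">Brotli</a>, but it is not compatible with deflate, so it probably cannot be used on PNG.</p>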
<h3 id="pngquant">pngquant</h3>
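<p>For lossy PNG there is <a href="https://pngquant.org/">pngquant</a>. Its website emphasizes that transparency information is preserved and that you can set a quality level as with JPEG. In general, the lower the quality you allow, the higher the compression.</p>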
<div class="highlight"><pre><span></span><code>pngquant -f --ext=.png --quality=70-85 --skip-if-larger *.png
</code></pre></div>
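<p>As the table above shows, pngquant easily gets files below 50% of their original size. In my examples I cannot really tell where the screenshots are distorted, and even if they are, it does not matter much.</p>
<h3 id="pngquant-zopflipng">pngquant + Zopflipng</h3>
<p>According to related discussions, pngquant output can still be compressed further, so applying Zopflipng at the end makes the files even smaller, which is quite impressive.</p>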
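<p>Chaining the two tools could be scripted roughly as follows. The flags are copied from the commands earlier in this post; everything else (the function names, skipping a tool that is not installed, not aborting on a non-zero exit code) is my own assumption. Note that the <code>zopflipng --prefix</code> call, as used above, writes prefixed copies instead of overwriting the inputs.</p>

```python
import shutil
import subprocess


def build_commands(png_files):
    """Return the two command lines from this post, in the order they run."""
    quantize = [
        "pngquant", "-f", "--ext=.png", "--quality=70-85",
        "--skip-if-larger", *png_files,
    ]
    recompress = ["zopflipng", "--lossy_transparent", "--prefix", *png_files]
    return [quantize, recompress]


def run_pipeline(png_files):
    """Run pngquant, then zopflipng; skip a tool that is not installed."""
    for command in build_commands(png_files):
        if shutil.which(command[0]) is None:
            print(f"{command[0]} not found, skipping")
            continue
        # Assumption: a non-zero exit (e.g. a skipped file) should not stop the batch.
        subprocess.run(command, check=False)
```

<p>Calling <code>run_pipeline(["screenshot.png"])</code> with a hypothetical file name would run both steps on that one file.</p>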
<table>
<thead>
<tr>
<th style="text-align: left;">Filename</th>
<th style="text-align: right;">Orig. size</th>
<th style="text-align: right;">pngquant</th>
<th style="text-align: right;">pngquant + Zopfli</th>
<th style="text-align: right;">Compression ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;"><a href="../blog-internals/pics/blog_desktop.png">blog_desktop.png</a></td>
<td style="text-align: right;">180K</td>
<td style="text-align: right;">56K</td>
<td style="text-align: right;">60K</td>
<td style="text-align: right;">0.33</td>
</tr>
<tr>
<td style="text-align: left;"><a href="../blog-internals/pics/blog_mobile.png">blog_mobile.png</a></td>
<td style="text-align: right;">72K</td>
<td style="text-align: right;">28K</td>
<td style="text-align: right;">28K</td>
<td style="text-align: right;">0.39</td>
</tr>
<tr>
<td style="text-align: left;"><a href="../blog-internals/pics/justfont_setting.png">justfont_setting.png</a></td>
<td style="text-align: right;">272K</td>
<td style="text-align: right;">84K</td>
<td style="text-align: right;">76K</td>
<td style="text-align: right;">0.28</td>
</tr>
<tr>
<td style="text-align: left;"><a href="../blog-internals/pics/oldsite.png">oldsite.png</a></td>
<td style="text-align: right;">604K</td>
<td style="text-align: right;">288K</td>
<td style="text-align: right;">268K</td>
<td style="text-align: right;">0.44</td>
</tr>
<tr>
<td style="text-align: left;"><a href="https://blog.liang2.tw/posts/2015/09/png-optim/pics/oldsite_full.png">oldsite_full.png</a></td>
<td style="text-align: right;">816K</td>
<td style="text-align: right;">376K</td>
<td style="text-align: right;">348K</td>
<td style="text-align: right;">0.43</td>
</tr>
</tbody>
</table>
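<p>The last column is simply the final size divided by the original size. A quick Python check using the numbers from the table above (the variable and function names are mine):</p>

```python
# (original size in KB, pngquant + Zopfli size in KB), copied from the table above.
SIZES_KB = {
    "blog_desktop.png": (180, 60),
    "blog_mobile.png": (72, 28),
    "justfont_setting.png": (272, 76),
    "oldsite.png": (604, 268),
    "oldsite_full.png": (816, 348),
}


def compression_ratio(original_kb, compressed_kb):
    """Compressed size as a fraction of the original, rounded to two decimals."""
    return round(compressed_kb / original_kb, 2)


ratios = {name: compression_ratio(*sizes) for name, sizes in SIZES_KB.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```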
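<h3 id="truepng">TruePNG, not tested</h3>
<p>The original site mentions that <a href="http://css-ig.net/articles/truepng">TruePNG</a> performs very well, but it is not open source and seems to run only on Windows, so never mind.</p>
<h3 id="_1">Conclusion</h3>
<p>From now on I will run screenshots through pngquant as a matter of course; where no color difference is acceptable, I will consider switching from OptiPNG to Zopfli.</p>Notes on setting up the blog2015-09-21T00:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2015-09-21:/posts/2015/09/blog-internals/<p>For me, the most important thing about a blog is how comfortable it is to write in.</p>
<p>When I first set up the GitHub CNAME I chose <code>blog.liang2.tw</code>, but all along it was only a one-page self-intro …</p><p>For me, the most important thing about a blog is how comfortable it is to write in.</p>
<p>When I first set up the GitHub CNAME I chose <code>blog.liang2.tw</code>, but all along it was only a one-page self-introduction<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup>, hand-built with <a href="http://semantic-ui.com/">SemanticUI</a>. If every blog post had to be hand-built too, I would probably have no energy left to write the content.</p>
<p>I settled on a few goals:</p>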
<ul>
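<li>Static sites only, because I do not want to maintain a server, and a blog has nothing fancy to show off; the front end alone can do plenty of interactive features these days</li>
<li>Ideally the site generator is implemented in Python, so that I have a better idea how to change it when I want to adjust its behavior</li>
<li>Support for both markdown and reStructuredText would be best</li>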
</ul>
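<p>After filtering, few options remained: <a href="http://docs.getpelican.com/">Pelican</a> and <a href="http://sphinx-doc.org/">Sphinx</a>. Sphinx probably has fewer features built for blogs, and Pelican is likely the most widely used, so I decided on Pelican.</p>
<p>I ended up making quite a few adjustments, so here they are point by point:</p>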
<div class="toc">
<ul>
<li><a href="#pelican">Introduction to Pelican</a></li>
<li><a href="#all-is-about-the-theme">All is about the theme</a></li>
<li><a href="#_1">Fonts</a><ul>
<li><a href="#webfont">Chinese webfont</a></li>
<li><a href="#_2">Chinese typesetting</a></li>
</ul>
</li>
<li><a href="#figure-caption">Figure caption</a></li>
<li><a href="#markdown-or-rst">Markdown or rst?</a></li>
<li><a href="#to-do">To do</a></li>
<li><a href="#edit-2015-09-22">EDIT (2015-09-22)</a></li>
<li><a href="#edit-2015-09-23">EDIT (2015-09-23)</a></li>
</ul>
</div>
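<h3 id="pelican">Introduction to Pelican</h3>
<p><a href="http://docs.getpelican.com/">Pelican</a> is, in short, not hard to understand, and customizing a blog theme is not very complicated either. First, as with Sphinx, running the built-in <code>pelican-quickstart</code> with its default values sets up a working site. The directory looks roughly like this</p>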
<div class="highlight"><pre><span></span><code>my_blog/
├── content/
│ ├── blog_post_1.md
│ └── blog_post_2.rst
├── output/
├── develop_server.sh*
├── Makefile
├── fabfile.py
├── pelicanconf.py
├── publishconf.py
└── requirements.txt
</code></pre></div>
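<p>Blog sources go under content, and the settings are split into <code>pelicanconf.py</code> for local use and <code>publishconf.py</code> for deployment. Pelican also provides automation scripts (Fabric, Make, and a shell script) that render the sources through the theme templates into a static site,</p>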
<div class="highlight"><pre><span></span><code>make<span class="w"> </span>html
</code></pre></div>
<p>The output goes to <code>output/</code> by default; to deploy, you upload the contents of this folder to the server.</p>
<p>Each post can be written in markdown or reStructuredText (rst). Conceptually it looks like this:</p>
<div class="highlight"><pre><span></span><code>---
Title: Hello World
Date: 2016-01-16 18:00
Tags: world, programming
Category: test
<span class="gu">Slug: hello-world</span>
<span class="gu">---</span>
Hello [World]
[<span class="nl">World</span>]: <span class="na">https://en.wikipedia.org/wiki/World</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="gh">Hello World</span>
<span class="gh">##############</span>
<span class="nc">:date:</span> 2016-01-16 18:00
<span class="nc">:tags:</span> world, programming
<span class="nc">:category:</span> test
<span class="nc">:slug:</span> hello-world
Hello World_
<span class="p">..</span> <span class="nt">_World:</span> https://en.wikipedia.org/wiki/World
</code></pre></div>
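<p>This sets the title, category, tags, publication date, and slug (something like the post's ID), which is quite complete. The minimum requirement is a title.</p>
<p>Finally I adjusted the static file paths. I split posts by year and month, and each subfolder holds that month's images, files, and so on. URLs are organized by year and month as well. Ideally there would be something like a hash, so that <code>/posts/2015/09/abcd/</code> is equivalent to <code>/posts/2015/09/abcd-my-post/</code>, which is easier to share. I looked around and there seems to be no such feature, but not having it is no big deal, so I am leaving it for now.</p>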
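<p>A year/month URL scheme like this can be expressed with Pelican's <code>ARTICLE_URL</code> and <code>ARTICLE_SAVE_AS</code> settings. The patterns below are my guess at what such a configuration might look like, not the actual values from this blog's <code>pelicanconf.py</code>, and the <code>expand</code> helper only mimics the substitution with plain <code>str.format</code> for illustration.</p>

```python
from datetime import datetime

# Settings as they might appear in pelicanconf.py (assumed patterns).
ARTICLE_URL = "posts/{date:%Y}/{date:%m}/{slug}/"
ARTICLE_SAVE_AS = "posts/{date:%Y}/{date:%m}/{slug}/index.html"


def expand(pattern, date, slug):
    """Fill a URL pattern using ordinary string formatting."""
    return pattern.format(date=date, slug=slug)


url = expand(ARTICLE_URL, datetime(2015, 9, 21), "blog-internals")
save_as = expand(ARTICLE_SAVE_AS, datetime(2015, 9, 21), "blog-internals")
print(url)
print(save_as)
```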
<h3 id="all-is-about-the-theme">All is about the theme</h3>
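<p>At first, the most time-consuming part was finding a good theme. The built-in theme is practical enough, but its structure looks too complex at first sight; besides, I prefer a cleaner layout and wanted a responsive layout that was already written.</p>
<p>Most Pelican themes are collected at <a href="http://pelicanthemes.com/">http://pelicanthemes.com/</a>, with thumbnails that make choosing easy. Themes are separate from content, so switching themes is just a matter of changing the <code>THEME</code> variable in the config; if you do not like one, swap it. After a while I picked <a href="https://github.com/alexandrevicenzi/flex">Flex</a>. It is not my favorite layout, since I prefer a single centered column, but surprisingly few themes met the conditions above.</p>
<p>Theme templates use Jinja2. To begin with, adjusting just <code>base</code>, <code>index</code>, <code>article</code>, and <code>page</code>, the pages most relevant to a blog, changes the main look. Fortunately the code of the two-column pages is pleasant to read. It looked like I only needed to tweak the responsive rules, so that the text is wide enough on phones and does not span the full width on very large screens, and the overall look would be about right. That more or less settled the theme.</p>
<p>For detailed CSS fixes, Flex uses <a href="http://lesscss.org/">LESS</a> and <a href="http://gulpjs.com/">gulp</a> for its front-end setup. LESS variables and nesting rules keep the CSS from getting messy, and running gulp after each change produces a new <code>style.min.css</code>, which is convenient.</p>
<p>The only thing I dislike is the avatar on the left. It is really annoying, and it adds 54KB of traffic. I am still thinking about what to put there so I can get rid of it; maybe Hatsune Miku.</p>
<h3 id="_1">Fonts</h3>
<p>Because I use OS X myself, I sometimes forget how sad the fonts on Windows are.</p>
<p>Flex uses Google webfonts for English text out of the box. For blockquotes and for completeness I added a serif typeface, <a href="https://www.google.com/fonts/specimen/Crimson+Text">Crimson Text</a>. I like this kind of Garamond-style classical serif. I just found out that it is <a href="https://github.com/skosch/Crimson">open source (SIL 1.1)</a>, nice. (Readers in mainland China: ……)</p>
<h4 id="webfont">Chinese webfont</h4>
<p>The troublesome part is Chinese fonts. I gave up on the system defaults outright, although in the end I added Noto Sans CJK and Source Han Sans as fallbacks. I had long wanted to try the webfont service from <a href="http://justfont.com">justfont</a>. It works by embedding a piece of JavaScript that checks which Chinese characters the page uses and requests the font data for only those characters, which speeds up loading. Using it is just like using Google webfonts. The official tutorial covers many use cases, and it really worked with hardly any setup: I expected to tweak many things before seeing results, but in the end I only changed <code>font-family</code> and was done. Its settings can also keep the original font for English text.</p>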
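<p>The "only fetch the characters the page uses" idea can be sketched in a few lines of Python. This is not justfont's implementation, which runs as JavaScript in the browser; <code>cjk_characters_used</code> is a made-up helper, and it only looks at the basic CJK Unified Ideographs block.</p>

```python
def cjk_characters_used(text):
    """Collect the distinct characters in the CJK Unified Ideographs block."""
    return {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}


# Sample sentence: the opening line of this post in its original Chinese.
page_text = "Blog 對我來說,最重要的就是書寫的舒適度。"
needed = cjk_characters_used(page_text)

print(len(needed), "distinct Chinese characters:", "".join(sorted(needed)))
```

<p>A page typically uses far fewer distinct characters than a full Chinese font contains, which is presumably why requesting only this subset loads so much faster.</p>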
<figure>
<img src="https://blog.liang2.tw/posts/2015/09/blog-internals/pics/justfont_setting.png"/>
<p class="caption center"><span class="fig">Justfont settings</span></p>
</figure>
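<p>Once the free trial worked, I paid up. Frankly, the free plan only allows two fonts, so setting the body text and the body bold used up the quota; at the moment 100,000 page views/year costs about NTD 350/year, which is not expensive. Having paid, of course I had to try Xin Gothic (信黑體); I still cannot afford the desktop version. I set two weights, and likewise added a Kai typeface for blockquotes. For Kai I also chose the more delicate cwTeX Kai.</p>
<p>Maybe I will try a Fangsong (imitation Song) typeface in the future, but I am a little worried about how it renders on screen (being on Retina, I cannot judge resolution issues), and I do not like the (Fang)song faces justfont offers more than Xin Gothic, so this experiment is on hold.</p>
<h4 id="_2">Chinese typesetting</h4>
<p>Inspired by Bobby Tung's COSCUP 2015 talk <a href="http://www.slideshare.net/bobby3302/w3c-51661297">"Requirements for Chinese Text Layout and What I Saw at W3C"</a>, I felt that if I did not do Chinese web typesetting properly from the start, I would be even lazier about fixing it later.</p>
<p>But in the end I still compromised (kneels).</p>
<p>First, there is still white space before and after paragraphs. This mainly accommodates English typesetting: I do not know how to apply different layouts per language, and English paragraphs use large spacing before and after. Besides, in plain text I am also used to leaving a blank line between paragraphs, which feels visually more comfortable (maybe the line height is not enough……). <del><code>margin</code> is also set to <code>1em</code>.</del> (EDIT: see the end of the post)</p>
<p>I did not add first-line indentation either, mainly because the sentences are rather short and it looks odd. Also, the markdown parser swallows my full-width spaces for reasons I do not understand (rst does not), so the only way would be to force it with <code>&#x3000;</code>. In paragraphs that mix Chinese and English the Chinese characters will not line up, but never mind for now; having matching weights for Chinese and English is already touching.</p>
<p>Chinese characters at weight 300 are indeed a bit thin, so I bumped the size up to 18px, and even showed it to my parents to make sure they can read the text XD</p>
<p>At this point I was quite satisfied. It looks like this:</p>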
<figure class="align-center" style="width: 250px">
<img src="https://blog.liang2.tw/posts/2015/09/blog-internals/pics/blog_mobile.png"/>
<p class="caption center"><span class="fig">On a phone</span></p>
</figure>
<figure>
<img src="https://blog.liang2.tw/posts/2015/09/blog-internals/pics/blog_desktop.png"/>
<p class="caption center"><span class="fig">On a desktop screen</span></p>
</figure>
<h3 id="figure-caption">Figure caption</h3>
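<p>Below a figure I quite often put a caption, references, and the like. There are examples above. This is not easy in markdown because its syntax is not that elaborate, but rst supports it natively:</p>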
<div class="highlight"><pre><span></span><code><span class="p">..</span> <span class="ow">role</span><span class="p">::</span> fig

<span class="p">..</span> <span class="ow">figure</span><span class="p">::</span> {filename}pics.jpg
    <span class="nc">:align:</span> center

    <span class="na">:fig:</span><span class="nv">`Figure 1:`</span> The figure caption.

    The legend consists of all elements after the caption.
</code></pre></div>
<p>which becomes</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">div</span> <span class="na">class</span><span class="o">=</span><span class="s">"figure align-center"</span><span class="p">></span>
<span class="p"><</span><span class="nt">img</span> <span class="na">alt</span><span class="o">=</span><span class="s">""</span> <span class="na">src</span><span class="o">=</span><span class="s">"{filename}pics.jpg"</span><span class="p">></span>
<span class="p"><</span><span class="nt">figcaption</span><span class="p">><</span><span class="nt">span</span> <span class="na">class</span><span class="o">=</span><span class="s">"fig"</span><span class="p">></span>Figure 1:<span class="p"></</span><span class="nt">span</span><span class="p">></span> The figure caption<span class="p"></</span><span class="nt">figcaption</span><span class="p">></span>
<span class="p"><</span><span class="nt">div</span> <span class="na">class</span><span class="o">=</span><span class="s">"legend"</span><span class="p">></span>The legend consists of all elements after the caption.<span class="p"></</span><span class="nt">div</span><span class="p">></span>
<span class="p"></</span><span class="nt">div</span><span class="p">></span>
</code></pre></div>
<p>In markdown you basically type that chunk of HTML by hand, which is fine, just a bit ugly. When it gets really annoying I will think about writing a plugin for it.</p>
<h3 id="markdown-or-rst">Markdown or rst?</h3>
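<p>Day-to-day editing will still mostly be in markdown; just look at how nice the beautiful <a href="http://macdown.uranusjr.com/">Macdown</a> editor is. For very complex documents (analyses with formulas, figures, and so on) I might consider rst. The downsides of rst are that its syntax is somewhat complex, that much of it relies on spaces within a sentence, which makes it ill-suited to Chinese, and that my vim linter keeps complaining about directives it has never seen.</p>
<p>But I am glad Pelican integrates the two so well. Being able to use both lets me switch as the situation calls for, without writing the templates twice.</p>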
<h3 id="to-do">To do</h3>
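<p>Beyond this I also added small details such as LaTeX MathJax and Smartypants, but on the whole the blog customization is done. Maybe I will add more as I need it.</p>
<p>Some issues I can think of so far:</p>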
<ul>
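<li>Heading weight: headings originally had the same weight as the body text, but long posts felt hard to navigate, so I made them bold for now; hopefully short posts do not become cluttered as a result.</li>
<li>Jupyter notebook include: I have not tried embedding notebooks directly yet; I imagine it is also the CSS-tweaking sort of work (front end is so tiring and hard…)</li>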
</ul>
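<p><a href="https://github.com/getpelican/pelican-plugins">Pelican plugin</a> contains a wide variety of plugins; I guess many of the problems I will run into have already been solved by those before me? ……Right? xdd</p>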
<h3 id="edit-2015-09-22">EDIT (2015-09-22)</h3>
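<p>After looking it over again and again, I adjusted many more things.</p>
<p>First, I reduced the font size to 16px and then changed it back to 18px. I chose 16px because reading on a 13" laptop felt cramped. I changed it back because it is really too small on a large screen; even I had to zoom in. I also discovered that the cramped feeling on 13" was not caused by the font size but by the number of characters per line.</p>
<p>Too many characters per line hurts reading efficiency. PTT allows at most 39 Chinese characters per line, but few posts fill the whole line; most use about 50% to 80% of the width, that is, 20-32 Chinese characters. For English it is about 12-15 words. The many versions I tuned myself came out to about the same numbers.</p>
<p>So the ideal article width should satisfy both the Chinese and the English counts. The width of a Chinese character is fixed, so after deciding how many Chinese characters go on a line, the English font has to be adjusted so that the number of English words per line comes out right. The Source Sans Pro I was using is slightly narrow, which makes English-only pages look a bit cramped. At weight 400 it is much better, but then the Chinese is no longer suitable for body text. In the end I switched to Lato, also a very popular typeface, though it is not actually much wider. If it still feels cramped I will have to switch to Open Sans, but I find that one a bit loose.</p>
<p>In the end the content is 738px (41em) or 828px (46em) wide, and a line of text is at most 612px (34em) wide. A line holds at most 34 Chinese characters or about 14 English words (80 characters); a line of code fits at most 74 characters, slightly short but acceptable.</p>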
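<p>The em values follow directly from the 18px body font size, and since a Chinese character is one em wide, the line width in em equals the number of Chinese characters per line. A quick check in Python (the names are mine):</p>

```python
FONT_SIZE_PX = 18  # body font size chosen above


def px_to_em(width_px, font_size_px=FONT_SIZE_PX):
    """Convert a width in pixels to em units for the given font size."""
    return width_px / font_size_px


widths_em = {px: px_to_em(px) for px in (738, 828, 612)}
for px, em in widths_em.items():
    print(f"{px}px = {em:g}em")

# One full-width Chinese character occupies 1em, so a 34em line holds 34 of them.
chinese_chars_per_line = int(px_to_em(612))
```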
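<p>A small unexpected discovery: with the narrower content, there is room for sidenotes on the right, as in <a href="http://www.daveliepmann.com/tufte-css/">Tufte CSS</a>. Sometimes they work better than footnotes, but they may bring back the cramped feeling.</p>
<p>Finally, paragraph spacing: I reduced the gap between a heading and the text that follows it, and increased the gap between paragraphs. I learned some CSS syntax I had not known, such as</p>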
<div class="highlight"><pre><span></span><code><span class="nt">p</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nt">p</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">margin-top</span><span class="p">:</span><span class="w"> </span><span class="mf">1.5</span><span class="kt">em</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div>
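<p>which selects a p element that immediately follows another p. This avoids changing the margin of p directly, which would make the spacing between p and h*, ul, pre, and so on too wide. <strong>The front end is truly marvelous.</strong></p>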
<h3 id="edit-2015-09-23">EDIT (2015-09-23)</h3>
<p>Also, smartypants is sometimes a bit annoying, for example when writing 13 inches</p>
<p class="center"><span style="font-size: 4em; line-height: 1em;">13” vs 13"</span></p>
<p>Unless <code>"</code> (QUOTATION MARK <code>\u0022</code>) is written directly as <code>&#34;</code>, it is converted into the one on the left, <code>”</code> (RIGHT DOUBLE QUOTATION MARK <code>\u201D</code>).</p>
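<p>The two characters and the numeric entity can be checked with Python's standard library. This only confirms the character names and the entity decoding; it does not run smartypants itself.</p>

```python
import html
import unicodedata

straight = "\u0022"  # the plain quote typed on the keyboard
curly = "\u201d"     # what the typographic conversion produces

names = [unicodedata.name(ch) for ch in (straight, curly)]
decoded = html.unescape("&#34;")  # the numeric entity decodes to the plain quote

print(names)
print(decoded == straight)
```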
<p>I also added table styling; imitating bootstrap, a table becomes a block that can be scrolled when it overflows.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>What the blog used to look like:
<figure>
<img src="https://blog.liang2.tw/posts/2015/09/blog-internals/pics/oldsite.png"/>
</figure> <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
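</div>First blog post2015-09-20T00:00:00-05:002023-08-13T23:05:21-05:00Liang-Bo Wangtag:blog.liang2.tw,2015-09-20:/posts/2015/09/blog-first-post/<p>So, this is my new blog.</p>
<p>I have always had the habit of keeping records of myself. In junior high there were weekly journals; in high school, weekly journals again; in my freshman year there was Facebook, and from sophomore year on, a personal board.
Keeping …</p><p>So, this is my new blog.</p>
<p>I have always had the habit of keeping records of myself. In junior high there were weekly journals; in high school, weekly journals again; in my freshman year there was Facebook, and from sophomore year on, a personal board.
Keeping a record of myself really only needs plain text. So this blog will not be used for that, at least for now.</p>
<p>But plain text is sometimes quite a hassle. Programming notes, or image-heavy posts (for example, otaku stuff),
are troublesome to keep in plain text: readability is very poor, and editing and layout are hard too. Taken together, these become a big problem.</p>
<p>So I got the idea of building my own blog.</p>
<h3 id="_1">What will be here?</h3>
<p>At the moment, at least these:</p>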
<ul>
<li>Notes on software development</li>
<li>Research notes</li>
<li>Records of community events (e.g. PyCon)</li>
<li>Otaku stuff, mainly anime</li>
</ul>
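<p>There may also be some posts in English, since I will be applying to schools too. These can probably be filtered by Tag or Category. But mainly this should be a technical blog.</p>
<h3 id="_2">Past blogs, or: a dark history</h3>
<p>As far as I remember, my first blog was on MSN/Hotmail, which I used in junior high. But when I could no longer log into Hotmail and the account was frozen, the blog probably vanished from this world as well.</p>
<p>My second blog was probably <a href="http://blog.yam.com/myway">醉客樓</a>, started when I had just entered high school and hosted on Yam (蕃薯藤), back when Google had not yet monopolized Chinese search. Surprisingly it has not shut down, and hardly anyone else uses the name 「醉客樓」, so it still shows up on the first page of search results, which is simply…… can I write to yam and have it taken down? This is so embarrassing. The account name <code>myway</code> is also pretty bold; you probably could not grab such a cool ID now. The nickname was <strong>問路客阿博</strong>, which is really…… okay, I still quite like it.</p>
<p>The latest post is from February 2007, when my chuunibyou was in full swing (okay, actually I was in my first year of high school). The post is a short story I wrote, although only the first chapter:</p>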
<blockquote>
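<p>What beautiful flowers.</p>
<p>I crouched before the flower bed, gently stroking each blooming beauty with my hand. Such a small plot, and yet so full of color. Rows of neatly arranged flowers bloomed in their most dazzling colors, each on its own little patch. As I gazed at this scene of flowers vying in splendor, I was dazzled before I knew it. A vast sea of flowers spread through my mind, and I tried hard to imagine myself running in it: my hand sweeping quickly over the flowers that kept vanishing from view beside me, my body spinning along the path. In my mind I was so much lighter! I imagined myself lying down in that sea of flowers, arms spread, savoring the gentle touch of the petals and that faint floral scent, looking up at the blue sky, laughing happily…… </p>
<p>(If you really want to read it, the full text is <a href="http://blog.yam.com/myway/article/8033752">here</a>)</p>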
</blockquote>
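<p>The story was eventually finished and submitted; I completed it at the Taipower study center in Shilin. But back then it was written with pen on manuscript paper, and afterwards I was too lazy to type it up. I wrote it after watching the first season of Hell Girl (地獄少女), and it is heavily influenced by that show's worldview and narrative style.</p>
<p>Hey, what happened to the technical blog?</p>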
<p>Let the journey begin.</p>