FASTA/Q sequence processing toolkit -- seqtk
This is the first post of the series of my common NGS processing workflows and notes.
Some of the most common operation in sequence processing is FASTQ → FASTA conversion. Tons of conversion scripts using either sed or awk can be found by search. For example,
# FASTQ to FASTA
# Assume every read record takes exactly 4 line
# Ref: http://stackoverflow.com/a/10359425
$ sed -n '1~4s/^@/>/p;2~4p'
The assumption of 4 lines per read usually holds for recent NGS sequencing data, so not a big deal.
In many case the sequence is gzip’d. It is still a piece of cake when combining with pipe editing,
gzcat myseq.fq.gz | sed -n '1~4s/^@/>/p;2~4p' | gzip > myseq.fa.gz
However, things can get complex really fast when one wants to additionally do reverse complement, randomly sample a subset of reads, and many other types of sequence manipulation. Efficiency matters if those tasks are applied to tens of millions of reads. Even a few nanoseconds longer of computing time difference per read can make a difference at this scale of reads.
Seqtk
So seqtk comes into rescue. It is written in C and MIT licensed. A quick comparison shows it is generally faster than other UNIX-based solutions, let alone implementations based on scripting languages.
Seqtk bundles many other operations, but I’ll just mention those I frequently use.
$seqtk
Usage: seqtk <command> <arguments>
Version: 1.0-r77-dirty
Command: seq common transformation of FASTA/Q
comp get the nucleotide composition of FASTA/Q
sample subsample sequences
subseq extract subsequences from FASTA/Q
fqchk fastq QC (base/quality summary)
mergepe interleave two PE FASTA/Q files
trimfq trim FASTQ using the Phred algorithm
hety regional heterozygosity
mutfa point mutate FASTA at specified positions
mergefa merge two FASTA/Q files
dropse drop unpaired from interleaved PE FASTA/Q
rename rename sequence names
randbase choose a random base from hets
cutN cut sequence at long N
listhet extract the position of each het
FASTQ → FASTA
Read (gzip’d) FASTQ and write out as FASTA,
$ seqtk seq -A in.fq[.gz] > out.fa
To make the output gzip’d again, piped with gzip,
$ seqtk seq -A in.fq[.gz] | gzip > out.fa.gz
Reverse complement
If one wants to debug the R2 reads of pair-end sequencing (second read on forward strand), since they contain reverse complement sequence of the insert DNA, one needs to reverse complement R2 reads again to debug directly by bare human eyes.
$ seqtk seq -r R2.fq > R2_rc.fq
$ echo '> Example R2 seq
GCATTGGTGGTTCAGTGGTAGAATTCT' | seqtk seq -r
# > Example R2 seq
# AGAATTCTACCACTGAACCACCAATGC
Quality check
To be honest, FastQC is more frequently used for quality check because it generates reports with beautiful figures.
But for a detail report on each read position, one should consider seqtk fqchk
.
$ seqtk fqchk myseq.fq[.gz]
By default it sets -q 20
. This quality threshold determines the threshold of counting a base as low or high quality, shown as %low
and %high
per read position. In the default case, quality score higher than 20 will be treated as high quality bases.
min_len: 10; max_len: 174; avg_len: 28.92; 37 distinct quality values
POS #bases %A %C %G %T %N avgQ errQ %low %high
ALL 236344886 17.0 22.5 31.3 29.2 0.0 39.9 37.6 0.1 99.9
1 8172342 8.9 12.4 57.0 21.7 0.0 39.6 29.0 0.5 99.5
2 8172342 7.7 62.5 16.2 13.7 0.0 39.8 37.8 0.2 99.8
3 8172342 50.3 24.1 11.9 13.6 0.0 39.8 38.2 0.1 99.9
4 8172342 10.4 22.9 15.3 51.3 0.0 39.9 38.7 0.1 99.9
5 8172342 14.3 12.9 22.3 50.5 0.0 39.8 37.0 0.2 99.8
# ... (trimmed)
The following columns, avgQ
and errQ
, need more explanation. Average quality (avgQ
) is computed by weighted mean of each base’s quality,
where is the number of bases with quality score being . The magic number 93 comes from the quality score of Sanger sequencing1, whose score ranges from 0 to 93.
For errQ
we need more background knowledge about how quality score is computed. A base with quality score implies the probability of being erroneously called, , is
Therefore, given being , seqtk has a conversion table perr
from quality score to probability,
Q | 0 | 1 | 2 | 32 |
---|---|---|---|---|
P | 0.5 | 0.5 | 0.5 | 0.5 |
Q | 4 | 5 | … | 38 | 39 | 40 |
---|---|---|---|---|---|---|
P | 0.398107 | 0.316228 | … | 0.000158 | 0.000126 | 0.000100 |
Based on the probability, it computes the expected number of base call errors, num_err, and the empirical probability of having a base call error at this position, errP,
Thus the errQ
is the equivalent quality score of errP, which better interprets the probability of base call error than avgQ
,
By passing -q 0
to seqtk fqchk
, one can get the proportion of all distinct quality scores at each position. This information is pretty useful if the sequencing data is all a mess and one needs to figure out the cause.
Though some of the seqtk fqchk
’s behavior is not documented, it should be straight forward enough to understand. All in all, the details can always be found in the source code.
Summary
Seqtk is fast to use for daily routines of FASTA/Q conversion. On top of that it provide various functionalities such as read random sampling, quality check, and many I haven’t tried or mentioned.
-
See multiple specifications of quality score at sckit-bio doc. The score is Phred quality score. More other score representations can be found at FASTQ wiki. ↩
-
Note that the probability of q less than 4 is fixed with 0.5. A quick computation can see when , its actual Phred probability is . ↩