FASTA/Q sequence processing toolkit -- seqtk
This is the first post of the series of my common NGS processing workflows and notes.
Some of the most common operation in sequence processing is FASTQ → FASTA conversion. Tons of conversion scripts using either sed or awk can be found by search. For example,
# FASTQ to FASTA # Assume every read record takes exactly 4 line # Ref: http://stackoverflow.com/a/10359425 $ sed -n '1~4s/^@/>/p;2~4p'
The assumption of 4 lines per read usually holds for recent NGS sequencing data, so not a big deal.
In many case the sequence is gzip’d. It is still a piece of cake when combining with pipe editing,
gzcat myseq.fq.gz | sed -n '1~4s/^@/>/p;2~4p' | gzip > myseq.fa.gz
However, things can get complex really fast when one wants to additionally do reverse complement, randomly sample a subset of reads, and many other types of sequence manipulation. Efficiency matters if those tasks are applied to tens of millions of reads. Even a few nanoseconds longer of computing time difference per read can make a difference at this scale of reads.
So seqtk comes into rescue. It is written in C and MIT licensed. A quick comparison shows it is generally faster than other UNIX-based solutions, let alone implementations based on scripting languages.
Seqtk bundles many other operations, but I’ll just mention those I frequently use.
$seqtk Usage: seqtk <command> <arguments> Version: 1.0-r77-dirty Command: seq common transformation of FASTA/Q comp get the nucleotide composition of FASTA/Q sample subsample sequences subseq extract subsequences from FASTA/Q fqchk fastq QC (base/quality summary) mergepe interleave two PE FASTA/Q files trimfq trim FASTQ using the Phred algorithm hety regional heterozygosity mutfa point mutate FASTA at specified positions mergefa merge two FASTA/Q files dropse drop unpaired from interleaved PE FASTA/Q rename rename sequence names randbase choose a random base from hets cutN cut sequence at long N listhet extract the position of each het
FASTQ → FASTA
Read (gzip’d) FASTQ and write out as FASTA,
$ seqtk seq -A in.fq[.gz] > out.fa
To make the output gzip’d again, piped with gzip,
$ seqtk seq -A in.fq[.gz] | gzip > out.fa.gz
If one wants to debug the R2 reads of pair-end sequencing (second read on forward strand), since they contain reverse complement sequence of the insert DNA, one needs to reverse complement R2 reads again to debug directly by bare human eyes.
$ seqtk seq -r R2.fq > R2_rc.fq $ echo '> Example R2 seq GCATTGGTGGTTCAGTGGTAGAATTCT' | seqtk seq -r # > Example R2 seq # AGAATTCTACCACTGAACCACCAATGC
To be honest, FastQC is more frequently used for quality check because it generates reports with beautiful figures.
But for a detail report on each read position, one should consider
$ seqtk fqchk myseq.fq[.gz]
By default it sets
-q 20. This quality threshold determines the threshold of counting a base as low or high quality, shown as
%high per read position. In the default case, quality score higher than 20 will be treated as high quality bases.
min_len: 10; max_len: 174; avg_len: 28.92; 37 distinct quality values POS #bases %A %C %G %T %N avgQ errQ %low %high ALL 236344886 17.0 22.5 31.3 29.2 0.0 39.9 37.6 0.1 99.9 1 8172342 8.9 12.4 57.0 21.7 0.0 39.6 29.0 0.5 99.5 2 8172342 7.7 62.5 16.2 13.7 0.0 39.8 37.8 0.2 99.8 3 8172342 50.3 24.1 11.9 13.6 0.0 39.8 38.2 0.1 99.9 4 8172342 10.4 22.9 15.3 51.3 0.0 39.9 38.7 0.1 99.9 5 8172342 14.3 12.9 22.3 50.5 0.0 39.8 37.0 0.2 99.8 # ... (trimmed)
The following columns,
errQ, need more explanation. Average quality (
avgQ) is computed by weighted mean of each base’s quality,
where is the number of bases with quality score being . The magic number 93 comes from the quality score of Sanger sequencing1, whose score ranges from 0 to 93.
errQ we need more background knowledge about how quality score is computed. A base with quality score implies the probability of being erroneously called, , is
Therefore, given being , seqtk has a conversion table
perr from quality score to probability,
Based on the probability, it computes the expected number of base call errors, num_err, and the empirical probability of having a base call error at this position, errP,
errQ is the equivalent quality score of errP, which better interprets the probability of base call error than
-q 0 to
seqtk fqchk, one can get the proportion of all distinct quality scores at each position. This information is pretty useful if the sequencing data is all a mess and one needs to figure out the cause.
Though some of the
seqtk fqchk’s behavior is not documented, it should be straight forward enough to understand. All in all, the details can always be found in the source code.
Seqtk is fast to use for daily routines of FASTA/Q conversion. On top of that it provide various functionalities such as read random sampling, quality check, and many I haven’t tried or mentioned.