Following system color scheme Selected dark color scheme Selected light color scheme

Liang-Bo Wang's Blog

About | Talks | Archives |

Coding ๅˆๅญธๆŒ‡ๅ—้™„้Œ„ - Bioinfo Practices using Python

Last Edited: Jan, 2016 ๏ผˆๅฆ‚ๆžœๅ…งๅฎนๆœ‰่ชค๏ผŒไฝ ๅฏไปฅ็•™่จ€๏ผŒๆˆ–็”จไปปไฝ•็ฎก้“ๅ‘Š่จดๆˆ‘๏ผ‰

We are going to walk through a series of practice created by Rosalind Team.

Once you register an account at Rosalind, you can use their judging system to work through all problems. However, in this case you cannot arbitrarily skip easy levels and it sucks. So I’m not going to force you using the system. Luckily, in each problem one set of example data and expected output is given, which can be used for checking our answer.

Note: Their code assumes Python 2 but everything I mention here is Python 3.

ๅ…ถไป– Coding ๅˆๅญธๆŒ‡ๅ—็ณปๅˆ—ๆ–‡็ซ ๏ผš

ๆˆ–่€…๏ผŒ็”จ labcoding ้€™ๅ€‹ tag ไนŸๅฏไปฅๆ‰พๅˆฐๆ‰€ๆœ‰็š„ๆ–‡็ซ ใ€‚

Python Basics

Do their Python Village problem sets. If any topic you don’t know, go read your Python reference.

Should be very trivial.

Bininfo First Try

Q DNA: Counting DNA Nucleotides

Link: http://rosalind.info/problems/dna/

  • Hint: use collections.Counter provided by Python’s stdlib
  • More Hint: use ' '.join and list comprehension to output the answer

Q REVC: The Secondary and Tertiary Structures of DNA

Link: http://rosalind.info/problems/revc/

  • Hint: reversed for any sequence object and a dict for nucleotide code mapping
  • More Hint: done in a list comprehension

Q: GC: Computing GC Content

Link: http://rosalind.info/problems/gc/

This is the first complicated problem that some abstraction should help you come up the solution. Try write some re-usable code blocks, for example, function calls and class definitions.

Don’t worry about the computation complexity

Workthrough

You should implement by yourself before looking my solution. Also I didn’t see their official solution so my solution can differ a lot from theirs.

Intuitively, we need to implement a FASTA file parser. FASTA contains a series of sequence reads with unique ID. From a object-oriented viewpoint, we create classes Read for reads and Fasta for fasta files.

Read is easy to design and understand,

class Read:
    def __init__(self, id, seq):
        self.id = id
        self.seq = seq

Since we need to compute their GC content, add a method for Read.

class Read:
    # ... skipped
    def gc(self):
        """Compute the GC content (in %) of the read."""
        # put the logic here (think of problem Q DNA)
        gc_percent = f(self.seq)
        return gc_percent

Then we have to implement the FASTA parser, which reads all read entries and converts them through Read. In real world we are dealing with myfasta.fa-like files, but here the input is string.

class Fasta:
    def __init__(self, raw_str):
        """Parse a FASTA formated string."""
        self.raw_str = raw_str
        # convert string into structured reads.
        self.reads = list(self.parse())

    def parse(self):
        """Parse the string and yield read in Read class."""
        # though we have no idea right now, the code structure
        # should be something like the following.
        raw_lines = self.raw_str.splitlines()
        for line in raw_lines:
            yield Read(...)

Here I use yield Read(...), which may be unfamiliar for Python beginners. It turns parse(self) function as a generator. Generator makes you focus on the incoming data. Once data is parsed and converted, the result is immediated thrown out by yield. We don’t care about how to collect all the results. In our case, we catch all the results into a list by list(...).

So how should we read FASTA file? A simple rule in this case is that every read consists by two continuous row. Also, the first row will always be the first read id.

All we need is read two lines at the same time. Here a Pythonic idiom is introduced. The following code read two non-overlapping lines,

for first_line, second_line in zip(*[iter(raw_lines)]*2):
    yield Read(id=first_line, seq=second_line)

By zip(*[iter(s)]*n) magic, we are very close to implement a full parser. You could find a lot of explanations for this magic.

Read id line percedes with a > sign, so we could use something like first_line[1:] or first_line[len('>'):] for explicity.

Then sorting the GC% of reads in a FASTA file is easy.

fasta = Fasta('...')
sorted_reads = sorted(fasta.reads, key=lambda r: r.gc())  # note 1
top_gc_read = sorted_reads[-1]  # note 2
print(
    '>{0:s}\n'
    '{1:.6f}'  # note 3, 4
    .format(top_gc_read.id, top_gc_read.gc())
)

The code above completes the following steps:

  1. sorted(list, key=key_func) sorts the list based on the return value of key_func applied to each element.
  2. or top_gc_read = sorted(..., reversed=True)[0]
  3. two string with no operands in between will be joint automatically. In this case it is exactly >{0:s}\n{1:.6f}. This is useful to tidy a super long string.
  4. '...'.format() fills the string with given values. See doc.

In real case FASTA can span across multiple lines, also likely the file we parse is broken. How could we modify this parser to handle these situations?

Q: (next?)

I’m super tired now so I’ll leave the rest for you. Try those problems within yellow correct ratio range.