Read UniProtKB in XML format

UniProt Knowledge Base (UniProtKB) provides several ways to access its data. I settled on the XML format because it requires no custom parsing code and is well defined by an accompanying schema. It also turns out that other databases, such as PDB, export their data in XML along with a corresponding schema, so the same method can be applied elsewhere.

Here I will show how to read XML with its schema in Python using xmlschema.

Other ways to read UniProtKB data in bulk
XML and XML schema
Read UniProt XML by xmlschema
Summary

XML and XML schema

XML data are structured. For example, this is what entry P51587 looks like in XML format:

<?xml version="1.0" encoding="UTF-8"?>
<uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
<entry dataset="Swiss-Prot" created="1996-10-01" modified="2017-12-20" version="201">
<accession>P51587</accession>
<accession>O00183</accession>
<accession>O15008</accession>
<accession>Q13879</accession>
<accession>Q5TBJ7</accession>
<name>BRCA2_HUMAN</name>
<protein>
<recommendedName>
<fullName>Breast cancer type 2 susceptibility protein</fullName>
</recommendedName>
<alternativeName>
<fullName>Fanconi anemia group D1 protein</fullName>
</alternativeName>
</protein>
<!-- ...  -->
</entry>
</uniprot>

The file is available at https://www.uniprot.org/uniprot/P51587.xml. In principle, all the information about this entry is in this file, as long as one knows how to query the XML via XPath. However, I find an XML file hard to read on its own, especially without a guide to how the file is structured.
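For comparison, here is a minimal sketch of the XPath route using only the standard library's xml.etree.ElementTree. It assumes a trimmed, hand-copied fragment of the entry shown above (SAMPLE) rather than the full download, and the prefix "up" is just a label I chose for the UniProt namespace.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment of the P51587 entry shown above (not the full file).
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
<entry dataset="Swiss-Prot" created="1996-10-01" modified="2017-12-20" version="201">
<accession>P51587</accession>
<accession>O00183</accession>
<name>BRCA2_HUMAN</name>
<protein>
<recommendedName>
<fullName>Breast cancer type 2 susceptibility protein</fullName>
</recommendedName>
</protein>
</entry>
</uniprot>"""

# Every element lives in the UniProt namespace, so each step of the
# path needs a prefix that maps to it. "up" is an arbitrary choice.
NS = {"up": "http://uniprot.org/uniprot"}

root = ET.fromstring(SAMPLE)

# All accession IDs of the entry, in document order.
accessions = [el.text for el in root.findall("./up:entry/up:accession", NS)]

# The recommended full name, or None if the path does not match.
recommended = root.findtext(
    "./up:entry/up:protein/up:recommendedName/up:fullName", namespaces=NS
)

print(accessions)
print(recommended)
```

This works, but every path has to be written with the document layout already in mind, which is exactly where the schema helps.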

UniProt XML is built according to its XML schema, available as an XSD file at http://www.uniprot.org/support/docs/uniprot.xsd. The schema not only helps in understanding the XML content, it can also be used to check whether an XML file is valid. In other words, since every UniProt XML file is validated against the schema, one can expect to parse all of them exactly as the schema defines. XML Schema is also a W3C standard and widely used.

Read UniProt XML by xmlschema

I use xmlschema to read XML with its schema in Python. Instead of querying with XPath, one can convert the XML content into a dictionary-like structure, which can easily be passed to other Python functions.

Using the same entry P51587 as an example,

>>> import xmlschema
>>> schema = xmlschema.XMLSchema('https://www.uniprot.org/docs/uniprot.xsd')
>>> entry_dict = schema.to_dict('./P51587.xml')
>>> entry_dict.keys()
dict_keys(['@xsi:schemaLocation', 'entry', 'copyright'])
>>> content = entry_dict['entry'][0]
>>> list(content)[:6]
['@dataset', '@created', '@modified', '@version', 'accession', 'name']

I don’t need any custom code to read the XML content in a structured way. For example, to get all the accession IDs of this entry,

>>> content['accession']
['P51587', 'O00183', 'O15008', 'Q13879', 'Q5TBJ7']

To get the protein names,

>>> content['protein']
{'alternativeName': [{'fullName': 'Fanconi anemia group D1 protein'}],
 'recommendedName': {'fullName': 'Breast cancer type 2 susceptibility protein'}}

One can compare the converted dictionary with the original XML. I’d like to end the demo with a more complicated example that finds all the sequence variants:

>>> seq_variants = filter(
...     lambda d: d['@type'] == 'sequence variant' and 'variation' in d,
...     content['feature'])
>>> [(d['location']['position']['@position'], 
...   d['original'], d['variation'][0])
...   for d in seq_variants][:10]
[(25, 'G', 'R'),
 (31, 'W', 'C'),
 (31, 'W', 'R'),
 (32, 'F', 'L'),
 (42, 'Y', 'C'),
 (53, 'K', 'R'),
 (60, 'N', 'S'),
 (64, 'T', 'I'),
 (75, 'A', 'P'),
 (81, 'F', 'L')]
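Since the converted features are plain dictionaries, the same filtering can be wrapped in an ordinary function. The sketch below is only an illustration: variant_labels is a name I made up, and sample_features is hand-written to mimic the shape of content['feature'] seen above, not real output. It formats each variant in the usual compact form, such as G25R.

```python
def variant_labels(features):
    """Return labels like 'G25R' for single-position sequence variants."""
    labels = []
    for feat in features:
        # Same filter as above: keep sequence variants that list a variation.
        if feat.get("@type") != "sequence variant" or "variation" not in feat:
            continue
        # Skip features whose location is a range rather than one position.
        position = feat["location"].get("position")
        if position is None:
            continue
        # One label per alternative residue.
        for alt in feat["variation"]:
            labels.append(f"{feat['original']}{position['@position']}{alt}")
    return labels


# Hand-written stand-in for content['feature'] (an assumption, not real data).
sample_features = [
    {"@type": "sequence variant", "original": "G", "variation": ["R"],
     "location": {"position": {"@position": 25}}},
    {"@type": "sequence variant", "original": "W", "variation": ["C"],
     "location": {"position": {"@position": 31}}},
    # A different feature type, described by a range: ignored.
    {"@type": "chain",
     "location": {"begin": {"@position": 1}, "end": {"@position": 3418}}},
    # A sequence variant without a 'variation' key: ignored.
    {"@type": "sequence variant",
     "location": {"position": {"@position": 2}}},
]

labels = variant_labels(sample_features)
print(labels)
```

With the real data, one would call variant_labels(content['feature']) instead.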

Summary

With UniProt’s XML and its schema, one can read all the data in a structured fashion without a custom parser. Once the XML files of interest are downloaded, everything can be queried locally, which is very helpful when retrieving a substantial amount of information from UniProt, say, extracting all the citations for a certain protein feature.

An XML schema really helps users understand the data structure, and it also helps database developers validate their data exports. I hope that someday all databases will enforce this kind of validation.

However, one may find the XML format tedious and not human-friendly to read. JSON has become popular and is used heavily by RESTful APIs. A JSON Schema specification exists, but it is not a W3C standard yet.

SPARQL and RDF, part of the Semantic Web effort, could serve as a universal query interface that solves the same problem more elegantly, though the barrier to entry is rather high and learning resources are limited.

For now, reading bulk data in XML with its schema seems to be the mature way to go, with abundant support.

Liang-Bo Wang's Blog
