SIFT-ing through sequence space in the PDB

You’ve probably noticed that many of the protein structures in the PDB archive are cross-referenced to other data resources. These cross-references, for example to UniProt, Pfam, CATH and SCOP, are displayed on our website and on those of our wwPDB partners. So how are cross-references added and kept current?

 

This is done through the SIFTS resource (pdbe.org/sifts) run by the PDBe and UniProt teams at EMBL-EBI. SIFTS maintains up-to-date residue-level mapping between UniProt and PDB entries and also furnishes users with annotations from IntEnz, GO, Pfam, InterPro, SCOP, CATH and PubMed.
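As a rough illustration of what a residue-level mapping looks like in practice, the sketch below flattens a JSON-like payload into one row per mapped segment. The payload layout, the field names (`chain_id`, `unp_start`, `unp_end`) and the identifiers `1abc` and `P00000` are assumptions made up for this example, so check them against the SIFTS documentation before relying on them.

```python
from typing import Any

# Hypothetical payload shaped like a residue-range mapping for one PDB entry.
# Field names and values are illustrative assumptions, not real SIFTS output.
SAMPLE = {
    "1abc": {
        "UniProt": {
            "P00000": {
                "mappings": [
                    {"chain_id": "A", "unp_start": 10, "unp_end": 45},
                    {"chain_id": "B", "unp_start": 10, "unp_end": 45},
                ]
            }
        }
    }
}


def list_segments(payload: dict[str, Any]) -> list[tuple[str, str, str, int, int]]:
    """Flatten a payload into (pdb_id, accession, chain, unp_start, unp_end) rows."""
    rows = []
    for pdb_id, resources in payload.items():
        for accession, record in resources.get("UniProt", {}).items():
            for seg in record.get("mappings", []):
                rows.append(
                    (pdb_id, accession, seg["chain_id"], seg["unp_start"], seg["unp_end"])
                )
    return rows


for row in list_segments(SAMPLE):
    print(row)
```

Each row ties one chain of a PDB entry to a stretch of a UniProt sequence, which is the kind of information the cross-references on the site are built from.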

 

Recently we rolled out an improvement to SIFTS so that structures are mapped not only to a UniProt accession but also to the most appropriate isoform within that accession. If a structure represents a specific isoform (generated by alternative splicing of the mRNA), you’ll now see that clearly on our site. A good example is PDB entry 1loi, the N-terminal splice region of a cyclic AMP-specific phosphodiesterase. The structure covers the N-terminal region of isoform 3 only, and its sequence is completely different from the N-terminus of isoform 1 (termed the ‘canonical’ sequence).

 

 

Diagram of isoforms for UniProt P54748

PDB entry 1loi contains the N-terminal region of UniProt accession P54748, specifically the sequence of isoform 3, which is not present in the canonical sequence.

 

 

The new SIFTS improvements also update the mapping between structures in the PDB and Pfam. Structures are now mapped to Pfam only if the whole Pfam domain is present in the structure. So no more ‘false’ Pfam mappings like that of PDB entry 5d9s, which was previously listed as containing the Pfam ubiquitin domain even though it contains only an 11-residue peptide of that domain.
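The whole-domain rule can be pictured as a simple range check. This is a minimal sketch, assuming both the structure and the domain are given as inclusive residue ranges on the same UniProt sequence. The function name and the coordinates are hypothetical, and the exact criterion SIFTS applies (for example, whether small gaps are tolerated) is not stated here.

```python
def covers_whole_domain(observed: tuple[int, int], domain: tuple[int, int]) -> bool:
    """Return True if the observed residue range spans the entire domain range.

    Both ranges are inclusive (start, end) positions on the same sequence.
    """
    obs_start, obs_end = observed
    dom_start, dom_end = domain
    return obs_start <= dom_start and obs_end >= dom_end


# Hypothetical coordinates: a domain occupying positions 1-70.
domain = (1, 70)
print(covers_whole_domain((1, 76), domain))   # structure spanning the whole domain
print(covers_whole_domain((30, 40), domain))  # 11-residue peptide inside the domain
```

Under this check a short peptide that lies inside a domain is no longer reported as containing that domain, which is the behaviour described above.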

A further feature of the updated SIFTS is the mapping to UniRef90, available via the SIFTS API. UniRef90 clusters all sequences in UniProt that have 90% or greater sequence identity. This mapping shows, for instance, that while the PDB contains fewer than 3000 unique human proteins, there are a further 3300 structures from other organisms that are highly similar, at the sequence level, to human proteins for which structures are not available in the PDB.
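To show how a cluster mapping supports that kind of count, here is a small sketch that finds human proteins without a structure whose cluster contains a structured protein from another organism. The record layout, accessions and cluster identifiers are toy values invented for the example, not data retrieved from SIFTS or UniProt.

```python
from collections import defaultdict

# Hypothetical records: (accession, organism, cluster id, has a PDB structure).
RECORDS = [
    ("H1", "human", "C1", True),
    ("H2", "human", "C2", False),
    ("M2", "mouse", "C2", True),
    ("H3", "human", "C3", False),
    ("Y3", "yeast", "C3", False),
]


def human_gaps_with_homolog_structures(records):
    """Map each human protein lacking a structure to the structured
    non-human members of the same cluster."""
    structured = defaultdict(list)
    for acc, organism, cluster, has_structure in records:
        if has_structure and organism != "human":
            structured[cluster].append(acc)
    return {
        acc: structured[cluster]
        for acc, organism, cluster, has_structure in records
        if organism == "human" and not has_structure and cluster in structured
    }


print(human_gaps_with_homolog_structures(RECORDS))
```

In this toy set only `H2` qualifies: it has no structure of its own, but the mouse protein `M2` in the same cluster does.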