A new process for aggregating and naming assemblies in the PDB

Image of hemoglobin complex highlighting the complex ID and components of the complex

PDBe has created a new process to aggregate and name assemblies in the PDB based on unique assembly composition. This process allows assemblies with the same composition to be aggregated together, with over 90,000 distinct compositions identified by mapping each component in an assembly to external databases. Over 90% of the assemblies in the PDB have now been named using external resources such as Complex Portal and Gene Ontology, together with manual curation by PDBe annotators.

This new process allows users to find all assemblies containing specific combinations of macromolecules, using accessions from resources such as UniProt and Rfam. The process also identifies super-assemblies of a complex, helping researchers to understand how different protein complexes interact with each other. Furthermore, users can search for assemblies by name, helping researchers to identify and analyse the protein complexes that are relevant to their research.

Searching for complexes at PDBe

Integration of this data on complexes in the PDBe search has improved the accessibility of structure data for a given complex. Previously, this type of search relied on the PDB entry title or the individual components in the assembly having a common naming convention. This made it challenging to recover all the relevant information. The standardisation of assembly names and the assignment of a unique identifier to each PDB complex now allow users to specify the PDBe complex identifier or the complex name under the advanced search option to find all the relevant information for a complex of interest.

You can use the advanced search form at PDBe to query the whole PDB archive for specific complexes. To access the advanced search form, click on the link underneath the search bar on any PDBe or PDBe-KB page. You can narrow down the search fields by typing ‘complex’ and using the autocomplete. There are two search fields related to complexes: ‘PDBe complex ID’ allows searching by the specific complex ID, while ‘complex name’ uses the name generated for the complex. Searching using these fields returns only those PDB entries which contain those specific complexes. You can also view complex information in the search results for any other PDBe search. Each entry in the search results displays the complex name and PDBe complex ID, which can in turn be used to quickly search for other entries containing that complex.

Search result for hemoglobin (PDB ID 4N7P). The result card displays the complex name and PDBe complex ID, each with a link to search for that complex across the whole PDB archive.

The improvement in finding complexes in the PDB can be demonstrated through the comparatively simple example of hemoglobin. Although this protein is ubiquitous in the PDB, finding specific hemoglobin complexes can be difficult because the overall complex contains different subunits: the mammalian complex has two copies each of alpha and beta hemoglobin. Furthermore, some specific variants of hemoglobin complexes can get lost amongst the huge number of typical mammalian hemoglobin complexes.

For example, a PDBe search for hemoglobin returns multiple variants, such as mutant adult human hemoglobin (PDB-CPX-159518), hemoglobin from the parasitic flatworm Fasciola hepatica (PDB-CPX-163279) and foetal hemoglobin (PDB-CPX-159679). By clicking on an individual complex name, users can discover the same assembly across different species. Alternatively, clicking on the PDBe complex identifier lets users easily find all PDB entries containing a given assembly with identical composition and species.

Accessing this data via FTP

The process works by mapping each component in an assembly to the respective UniProt or Rfam accession whenever possible. Where this is not possible, each component is labelled as follows: <component_type>_<entry_id>_<entity_id>_<stoichiometry>.
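The labelling rule above can be illustrated with a minimal Python sketch. The function names, the example component type and entry ID, and the way a mapped accession is paired with its stoichiometry are all assumptions made for illustration; only the fallback pattern <component_type>_<entry_id>_<entity_id>_<stoichiometry> comes from the description above.

```python
from typing import Iterable, Optional


def component_label(component_type: str, entry_id: str, entity_id: int,
                    stoichiometry: int, accession: Optional[str] = None) -> str:
    """Return a label for one component of an assembly.

    If an external accession (e.g. UniProt or Rfam) is available, use it.
    Appending the stoichiometry to the accession is an assumption of this
    sketch, not a documented PDBe format.
    Otherwise fall back to <component_type>_<entry_id>_<entity_id>_<stoichiometry>.
    """
    if accession:
        return f"{accession}_{stoichiometry}"
    return f"{component_type}_{entry_id}_{entity_id}_{stoichiometry}"


def composition_key(labels: Iterable[str]) -> str:
    """Join component labels in sorted order so that assemblies with the same
    composition produce the same key (hypothetical helper)."""
    return ",".join(sorted(labels))


# Two mapped protein chains (human hemoglobin alpha and beta UniProt accessions)
# plus one unmapped component from a hypothetical entry "1abc".
labels = [
    component_label("protein", "1abc", 1, 2, accession="P69905"),
    component_label("protein", "1abc", 2, 2, accession="P68871"),
    component_label("DNA", "1abc", 3, 1),
]
print(composition_key(labels))
```

Sorting the labels before joining them makes the key independent of the order in which components are listed, which is one simple way to group assemblies with the same composition.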

PDBe provides two different files in the FTP area. The first file is the mapping file, which contains information about each assembly in the PDB, such as the complex identifier, composition, Complex Portal entry (if any), assembly name, and assembly name source (how the name was generated). The second file contains subassembly information, highlighting where other assemblies exist that contain only a subset of the entities from the complete assembly.
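The subassembly relationship described above is essentially a subset test on compositions. The following Python sketch shows that idea on in-memory data; it does not read the actual PDBe files, and the complex identifiers, the accessions X00001 and X00002, and the choice to ignore stoichiometry are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List


def find_subassemblies(
    compositions: Dict[str, FrozenSet[str]]
) -> Dict[str, List[str]]:
    """For each complex, list the complexes whose components form a proper
    subset of its own components."""
    result: Dict[str, List[str]] = {}
    for complex_id, components in compositions.items():
        result[complex_id] = sorted(
            other_id
            for other_id, other_components in compositions.items()
            if other_components < components  # proper-subset comparison
        )
    return result


# Hypothetical identifiers and accessions, for illustration only.
example = {
    "CPX-A": frozenset({"X00001"}),
    "CPX-B": frozenset({"X00002"}),
    "CPX-C": frozenset({"X00001", "X00002"}),
}
for complex_id, subs in find_subassemblies(example).items():
    print(complex_id, subs)
```

Here CPX-C lists CPX-A and CPX-B as subassemblies, because each contains only a subset of its components. The pairwise comparison is quadratic in the number of complexes, which is fine for a sketch but would need indexing for the whole archive.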

The files can be downloaded from the following FTP area: