Improved access to ligand data and annotations in the PDBe FTP


PDBe has overhauled its ligand FTP data structure, making data on all types of ligands in the PDB more easily accessible. New ligand data files also give users streamlined access to all protein-ligand interaction data, combined with relevant functional annotations for ligands in the PDB archive.

 

Structuring Ligand Reference Class Data 

PDBe recently introduced Covalently Linked Components (CLCs), a new class of reference small molecules that facilitates identification of covalently linked multi-component ligands across the PDB archive. This new reference definition expands upon, and fills the gaps between, the existing wwPDB reference dictionaries of Chemical Component Definitions (CCDs) and Peptide-like molecules Reference Dictionary (PRD).

This enhancement greatly improves the interpretation of PDB small molecule data for users. However, the organisation of these data in the PDBe FTP directory structure remained focused on CCD definitions. This necessitated a reorganisation of the FTP directory structure for ligand information at PDBe, and at the same time offered an opportunity to provide additional interaction data and annotations for all these molecules in the PDB archive.

 

Summarising Ligand Interactions Data:

Until now, ligand interaction data at PDBe could only be accessed by specific PDB ID. This required querying each PDB entry separately and hindered comprehensive analysis of ligand interactions across the full archive.

To address this challenge, PDBe has introduced two new data summary files:

 

Interacting_chains_with_ligand_functions.tsv

This comprehensive file provides a summary of all ligand interactions across the PDB archive. It includes details about interacting macromolecule chains, along with their mapped UniProt accessions (when available). Additionally, the file provides further information about each ligand, such as its functional annotation (cofactor-like, drug-like or reactant-like) and essential identifiers (InChIKey, bmID, LigandType).

PDBID   Chain_Symmetry  BestUnpAccession        bmID    inchikey        LigandID        LigandType    annotation
100d    A       None    bm1     PFNFFQXMRSDOHW-UHFFFAOYSA-N     SPM     CCD     None
101d    A       None    bm1     JLVVSXFLKOJNIY-UHFFFAOYSA-N     MG      CCD     ion
101d    A       None    bm2     IDBIFFKSXLYUOT-UHFFFAOYSA-N     NT      CCD     None
101d    B       None    bm1     JLVVSXFLKOJNIY-UHFFFAOYSA-N     MG      CCD     ion
101d    B       None    bm2     IDBIFFKSXLYUOT-UHFFFAOYSA-N     NT      CCD     None
101m    A       P02185  bm1     QAOWNCQODCNURD-UHFFFAOYSA-L     SO4     CCD     ion
101m    A       P02185  bm2     KABFMIBPWCXCRK-RGGAHWMASA-L     HEM     CCD     reactant-like
101m    A       P02185  bm3     FSBLVBBRXSCOKU-UHFFFAOYSA-N     NBN     CCD     None
102d    A       None    bm1     WTFXJFJYEJZMFO-UHFFFAOYSA-N     TNT     CCD     None
102d    B       None    bm1     WTFXJFJYEJZMFO-UHFFFAOYSA-N     TNT     CCD     None
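As a minimal sketch of how this file might be queried, the Python below groups ligand IDs by UniProt accession. It assumes the file is tab-separated with the header shown above and that the literal string "None" marks a missing UniProt mapping, as in the excerpt. The function name `ligands_by_accession` is hypothetical, and the sample rows are copied from the excerpt rather than read from the FTP area.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the excerpt above; tab separation is an assumption
# based on the .tsv extension.
SAMPLE = "\n".join("\t".join(row) for row in [
    ["PDBID", "Chain_Symmetry", "BestUnpAccession", "bmID",
     "inchikey", "LigandID", "LigandType", "annotation"],
    ["100d", "A", "None", "bm1", "PFNFFQXMRSDOHW-UHFFFAOYSA-N", "SPM", "CCD", "None"],
    ["101m", "A", "P02185", "bm1", "QAOWNCQODCNURD-UHFFFAOYSA-L", "SO4", "CCD", "ion"],
    ["101m", "A", "P02185", "bm2", "KABFMIBPWCXCRK-RGGAHWMASA-L", "HEM", "CCD", "reactant-like"],
    ["101m", "A", "P02185", "bm3", "FSBLVBBRXSCOKU-UHFFFAOYSA-N", "NBN", "CCD", "None"],
])


def ligands_by_accession(handle):
    """Map each UniProt accession to the set of LigandIDs that interact with it."""
    result = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        accession = row["BestUnpAccession"]
        if accession == "None":  # chain has no UniProt mapping in the excerpt
            continue
        result[accession].add(row["LigandID"])
    return dict(result)


ligands = ligands_by_accession(io.StringIO(SAMPLE))
print(sorted(ligands["P02185"]))  # ['HEM', 'NBN', 'SO4']
```

To run this against the real file, pass an open file handle instead of the `io.StringIO` wrapper.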

 

pdb_bound_molecules.tsv

During refinement and annotation, complex ligands are often fragmented into individual chemical components (CCDs), which makes them harder to identify and to map to other databases. This file gives details on each complete ligand within PDB entries, composed of the covalently linked, non-polymeric entities. Each complete small molecule in a PDB entry is assigned a unique identifier (bmID), and the file defines its composition as a list of the constituent components, revealing how these chemical components (CCDs) are connected within PDB structures.

PDBID   bmID    composition(list:ResName:ResNumber:Chain_Symmetry)      inchikey        LigandID      LigandType
100d    bm1     SPM:21:A        PFNFFQXMRSDOHW-UHFFFAOYSA-N     SPM     CCD
101d    bm1     MG:26:A         JLVVSXFLKOJNIY-UHFFFAOYSA-N     MG      CCD
101d    bm2     NT:25:B         IDBIFFKSXLYUOT-UHFFFAOYSA-N     NT      CCD
101m    bm1     SO4:157:A       QAOWNCQODCNURD-UHFFFAOYSA-L     SO4     CCD
101m    bm2     HEM:155:A       KABFMIBPWCXCRK-RGGAHWMASA-L     HEM     CCD
101m    bm3     NBN:156:A       FSBLVBBRXSCOKU-UHFFFAOYSA-N     NBN     CCD
102d    bm1     TNT:25:B        WTFXJFJYEJZMFO-UHFFFAOYSA-N     TNT     CCD
102l    bm1     BME:901:A,BME:902:A     DGVVWUTYPXICAM-UHFFFAOYSA-N     BME     CCD
102l    bm2     CL:173:A        VEXZGXHMUGYJMC-UHFFFAOYSA-M     CL      CCD
102l    bm3     CL:178:A        VEXZGXHMUGYJMC-UHFFFAOYSA-M     CL      CCD
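The composition column packs several components into one field. The sketch below splits such a value into its parts, assuming the format suggested by the column header and the excerpt: comma-separated items, each of the form ResName:ResNumber:Chain_Symmetry. The names `Component` and `parse_composition` are hypothetical helpers, not part of any PDBe tooling.

```python
from typing import List, NamedTuple


class Component(NamedTuple):
    res_name: str
    res_number: str  # kept as text, in case residue numbers are not plain integers
    chain: str       # the Chain_Symmetry part of the item


def parse_composition(value: str) -> List[Component]:
    """Split a composition field such as 'BME:901:A,BME:902:A' into components."""
    components = []
    for item in value.split(","):
        # Split on the first two colons only, so any further colons
        # stay inside the chain field.
        res_name, res_number, chain = item.strip().split(":", 2)
        components.append(Component(res_name, res_number, chain))
    return components


for component in parse_composition("BME:901:A,BME:902:A"):
    print(component.res_name, component.res_number, component.chain)
```

For the 102l bm1 row in the excerpt, this yields two BME components at residue numbers 901 and 902 on chain A.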

 

These data files empower researchers to:

  • Perform large-scale analyses of ligand interactions across the entire PDB using complete ligand representations.
  • Quickly access all the ligands bound to a specific protein, or identify all the proteins that bind a specific ligand.
  • Gain deeper understanding of protein function by relating ligand interactions to functional categories.
  • Easily access and navigate ligand data through unique identifiers and clear file organisation.
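One way the two files could be combined is to join them on the (PDBID, bmID) pair, which appears in both headers. The sketch below selects bound molecules by functional annotation and looks up their composition. It assumes tab-separated files with the headers shown in the excerpts. The helpers `to_tsv`, `read_tsv` and `compositions_for_annotation` are hypothetical, and the rows are copied from the excerpts rather than downloaded.

```python
import csv
import io

# Column name exactly as it appears in the pdb_bound_molecules.tsv excerpt.
COMPOSITION = "composition(list:ResName:ResNumber:Chain_Symmetry)"


def to_tsv(rows):
    """Build tab-separated text from rows (a stand-in for the real files)."""
    return "\n".join("\t".join(row) for row in rows)


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


interactions = read_tsv(to_tsv([
    ["PDBID", "Chain_Symmetry", "BestUnpAccession", "bmID",
     "inchikey", "LigandID", "LigandType", "annotation"],
    ["101d", "A", "None", "bm1", "JLVVSXFLKOJNIY-UHFFFAOYSA-N", "MG", "CCD", "ion"],
    ["101d", "B", "None", "bm1", "JLVVSXFLKOJNIY-UHFFFAOYSA-N", "MG", "CCD", "ion"],
    ["101m", "A", "P02185", "bm1", "QAOWNCQODCNURD-UHFFFAOYSA-L", "SO4", "CCD", "ion"],
    ["101m", "A", "P02185", "bm2", "KABFMIBPWCXCRK-RGGAHWMASA-L", "HEM", "CCD", "reactant-like"],
]))

bound = read_tsv(to_tsv([
    ["PDBID", "bmID", COMPOSITION, "inchikey", "LigandID", "LigandType"],
    ["101d", "bm1", "MG:26:A", "JLVVSXFLKOJNIY-UHFFFAOYSA-N", "MG", "CCD"],
    ["101m", "bm1", "SO4:157:A", "QAOWNCQODCNURD-UHFFFAOYSA-L", "SO4", "CCD"],
    ["101m", "bm2", "HEM:155:A", "KABFMIBPWCXCRK-RGGAHWMASA-L", "HEM", "CCD"],
]))


def compositions_for_annotation(interaction_rows, bound_rows, annotation):
    """Return {(PDBID, bmID): composition} for ligands with the given annotation."""
    wanted = {
        (row["PDBID"], row["bmID"])
        for row in interaction_rows
        if row["annotation"] == annotation
    }
    return {
        (row["PDBID"], row["bmID"]): row[COMPOSITION]
        for row in bound_rows
        if (row["PDBID"], row["bmID"]) in wanted
    }


print(compositions_for_annotation(interactions, bound, "reactant-like"))
# {('101m', 'bm2'): 'HEM:155:A'}
```

Collecting the keys into a set first means a ligand that interacts with several chains, such as MG in 101d, is reported once.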

 

Streamlining Data Access:

Beyond these new ligand interaction files, PDBe has also simplified data access for ligands by restructuring the ligand FTP directory. Ligand data is now categorised into dedicated folders for CCDs, PRDs, CLCs, and additional interaction data files. This intuitive organisation allows researchers to easily locate specific ligand information and provides consistency of access for the different ligand definition types.

 

Image displaying the updated directory structure for ligands data in the PDBe FTP. Within the pdbechem_v2 directory, there are now separate directories for each of ‘ccd’, ‘prd’ and ‘clc’ reference definitions, plus a further directory for ‘additional_data’. For each reference definition, the directories are further subdivided by single characters representing the first character for CCD IDs, and the final character for PRD and CLC IDs.
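The subdivision rule can be sketched in a few lines of Python. The rule itself (first character for CCD IDs, final character for PRD and CLC IDs) comes from the description above. The exact path layout in `relative_directory`, including the per-ID directory beneath the single-character directory, is an assumption for illustration and should be checked against the FTP area. Both function names are hypothetical.

```python
def bucket_character(ligand_id: str, ligand_type: str) -> str:
    """Pick the single-character subdirectory for a ligand ID."""
    kind = ligand_type.lower()
    if kind == "ccd":
        return ligand_id[0]   # CCD IDs: first character
    if kind in ("prd", "clc"):
        return ligand_id[-1]  # PRD and CLC IDs: final character
    raise ValueError(f"unknown ligand type: {ligand_type!r}")


def relative_directory(ligand_id: str, ligand_type: str) -> str:
    """Assumed layout: pdbechem_v2/<type>/<character>/<id> (not confirmed by the source)."""
    kind = ligand_type.lower()
    return "/".join(["pdbechem_v2", kind, bucket_character(ligand_id, kind), ligand_id])


print(relative_directory("HEM", "CCD"))         # pdbechem_v2/ccd/H/HEM
print(relative_directory("PRD_000001", "PRD"))  # pdbechem_v2/prd/1/PRD_000001
```

The LigandType column in the summary files could supply the second argument.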

 

The revamped ligand data structure will support structural bioinformatics research by providing a clear and consistent approach for access to these data. By unlocking comprehensive interaction insights and offering streamlined access, PDBe empowers researchers to delve deeper into protein function, paving the way for advancements in drug discovery and targeted therapies.

To access the data and explore the revamped ligand structure, visit the PDBe FTP area.