Introducing ‘Covalently Linked Components’ and enriched ligand information at PDBe

Title image introducing covalently linked components with complex, multi-component ligand image displayed in the background.

We are excited to announce the introduction of Covalently Linked Components (CLCs), a new class of reference small molecules that facilitates identification of covalently linked multi-component ligands across the whole PDB archive. This enhancement vastly improves the accessibility and usability of small molecule data for users. Alongside CLCs, we have further enriched the existing Peptide-like molecules Reference Dictionary (PRD) files by providing additional metadata, such as chemical properties and mapping between these molecules and other chemical data resources.

 

Addressing challenges of multi-component PDB ligands

The wwPDB Chemical Component Dictionary (CCD) has long been the go-to resource for detailed information on small molecules found within PDB structures. Traditionally, the annotation process fragmented ligands with multiple components into individual Chemical Components (CCDs) because of structure determination practices and annotation policies. Unfortunately, this fragmentation led to difficulties in accurately identifying complete ligands, understanding their interactions with macromolecules, and mapping them to other small-molecule databases. To overcome this hurdle, the wwPDB introduced the Peptide-like molecules Reference Dictionary (PRD) in 2013, offering a partial solution for specific cases of peptide-like inhibitors and antibiotic ligands. However, this approach is applied on a case-by-case basis, leaving a considerable number of fragmented multi-component ligands unresolved.

To address this fragmentation challenge holistically, we developed a systematic process to identify Covalently Linked Components (CLCs) for all ligands consisting of multiple covalently linked Chemical Components (CCDs). CLCs provide a more accurate and comprehensive representation of these multi-component ligands, filling the gaps left by fragmented CCDs and PRDs.

By assigning unique identifiers based on InChIKey, we have streamlined the identification and analysis of multi-component ligands in the PDB. Using the identifiers of the complete CLCs, we can now map these molecules to other chemical databases such as PubChem, ChEMBL and KEGG. These mappings were previously missed because the ligands were represented only as incomplete fragments.

 

Finding important CLCs in the PDB

There are a number of CLCs for which this change will greatly improve the accessibility of the relevant structural data in the PDB. For example, Myristoyl-CoA is a fatty-acid modified variant of Coenzyme A (CoA), which functions as a human metabolite. Previously, you could only find the separate components of Myristoyl-CoA in the PDB, because it was defined as two separate, linked CCDs: Myristic acid (MYR) and CoA (COA). Now, this molecule is defined as a complete CLC (CLC_002763) with MYR and COA as subcomponents, allowing users to find the complete molecule when searching.

Some other notable examples of combined components in CLCs include:

  • Chromomycin A3: an antibiotic and chromosome dye, previously separated into ARI, 1GL, CPH, CDR and ERI CCD components but now combined into CLC_000153.
  • Peplomycin: an antineoplastic agent, previously separated into PMY, GUP and 3FM CCD components but now combined into CLC_000034.
  • 4-O-α-D-glucopyranosylmoranoline: a human metabolite, previously separated into NOJ and GLG CCD components but now combined into CLC_001242.
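
To illustrate the relationship between a CLC and its subcomponents, the examples above can be written as a small lookup table. The following is a minimal sketch, not the PDBe data model: the identifiers are the ones listed in this article, while the function names `build_ccd_index` and `clcs_containing` are hypothetical helpers introduced here for illustration.

```python
from collections import defaultdict

# CLC identifiers and their CCD subcomponents, exactly as listed in this article.
CLC_SUBCOMPONENTS = {
    "CLC_002763": ["MYR", "COA"],                       # Myristoyl-CoA
    "CLC_000153": ["ARI", "1GL", "CPH", "CDR", "ERI"],  # Chromomycin A3
    "CLC_000034": ["PMY", "GUP", "3FM"],                # Peplomycin
    "CLC_001242": ["NOJ", "GLG"],                       # 4-O-alpha-D-glucopyranosylmoranoline
}


def build_ccd_index(clc_subcomponents):
    """Map each CCD identifier to the sorted list of CLCs that contain it."""
    index = defaultdict(set)
    for clc_id, ccd_ids in clc_subcomponents.items():
        for ccd_id in ccd_ids:
            index[ccd_id].add(clc_id)
    return {ccd_id: sorted(clc_ids) for ccd_id, clc_ids in index.items()}


def clcs_containing(ccd_id, index):
    """Return the CLCs that include the given CCD (hypothetical helper)."""
    return index.get(ccd_id.strip().upper(), [])


CCD_INDEX = build_ccd_index(CLC_SUBCOMPONENTS)

if __name__ == "__main__":
    # Starting from a single fragment, recover the complete ligand it belongs to.
    print(clcs_containing("coa", CCD_INDEX))  # ['CLC_002763']
```

A reverse index of this kind reflects the search behaviour described above: starting from a fragment such as COA, the complete molecule can be found.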

 

Image showing two examples of CLCs (left: Peplomycin, CLC_000034; right: Chromomycin A3, CLC_000153) with the individual subcomponents displayed in unique colours.

 

Accessing CLC and updated PRD data

All CLC files are generated by the PDBeChem pipeline and are available through the PDBe FTP area for small molecules. The pipeline also enriches the existing CCD, PRD and CLC files with additional metadata and cross-references to external resources. For each CLC and PRD molecule, we provide coordinates for 2D depiction, and 3D model coordinates in .sdf, .cml and .pdb formats. The reference CIF files for individual CLC and PRD molecules will also be enriched with the following additional annotations:

  • Mapping to other popular resources like ChEMBL, PubChem and KEGG
  • Regenerated idealized conformers using RDKit
  • Physicochemical properties
  • Murcko scaffolds
  • Substructures/fragments
  • Synonyms
  • Ligand classification information
  • Known polymer targets based on DrugBank mapping.

Exact details of the files containing the above information can be found in the README file in the PDBe FTP area for small molecules.

The PDBeChem pipeline uses PDBe CCDUtils as its primary package. PDBe CCDUtils is an RDKit-based toolkit for handling and analysing small molecules in the PDB. For more information and access to PDBe CCDUtils, please refer to the PDBe CCDUtils publication.

 

Important Update to PDBeChem FTP directory structure

As part of this update, we are also changing the PDBeChem FTP directory structure to make small molecule data in the Protein Data Bank in Europe (PDBe) more accessible. We have introduced separate folders for three types of reference files: Chemical Components (CCD), Covalently Linked Components (CLC), and Peptide-like molecules Reference Dictionary (PRD), making it easier for users to access specific data.

All CCD data previously presented in the top-level folder is also duplicated in the corresponding ccd subfolder. For example, data for the CCD identifier ATP, previously available in the top-level folder, can now also be accessed via the ccd subfolder. To simplify the download process, we have provided a gzipped tar archive with all the files in the CCD folder.
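
For scripts that need to pick the right subfolder for a given identifier, the routing can be sketched as below. This is an illustration under stated assumptions rather than a description of the actual server layout: it assumes lowercase folder names (`ccd`, `clc`, `prd`), assumes that PRD identifiers carry a `PRD_` prefix in the same way that the CLC identifiers in this article carry `CLC_`, and treats everything else as a CCD. The base location is a placeholder argument, and the layout inside each folder is deliberately not modelled.

```python
def reference_folder(identifier: str) -> str:
    """Pick the reference-type folder for an identifier (assumed naming rules)."""
    ident = identifier.strip().upper()
    if ident.startswith("CLC_"):
        return "clc"
    if ident.startswith("PRD_"):  # assumption: PRD identifiers use a PRD_ prefix
        return "prd"
    return "ccd"  # everything else is treated as a Chemical Component


def folder_url(identifier: str, base: str) -> str:
    """Join a caller-supplied base location with the reference-type folder."""
    return f"{base.rstrip('/')}/{reference_folder(identifier)}/"


if __name__ == "__main__":
    # "<pdbechem-root>" is a placeholder, not a real address.
    for ident in ("ATP", "CLC_002763"):
        print(ident, "->", folder_url(ident, "<pdbechem-root>"))
```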

 

Schematic of the updated FTP folder structure for ligand data at PDBe.

 

Furthermore, starting from January 1, 2024, the pdbechem.tar.gz file in the top-level directory will also include the PRD and CLC folders alongside CCD. We will retire the CCD-related contents in the top-level directory on the same date, retaining only the CCD, PRD and CLC folders, pdbechem.tar.gz, and the README.txt file.
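
After downloading pdbechem.tar.gz, a quick way to check which reference types it contains is to count the files under each top-level folder. The sketch below uses only the Python standard library. The folder names inside the archive are not assumed: the function `count_by_top_level` (a name introduced here for illustration) simply reports whatever top-level entries it finds.

```python
import tarfile
from collections import Counter


def count_by_top_level(archive_path):
    """Count regular files in a gzipped tar archive, grouped by top-level folder."""
    counts = Counter()
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            # Drop empty and "." segments so "./ccd/x" and "ccd/x" group together.
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            top = parts[0] if len(parts) > 1 else "(top level)"
            counts[top] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_archive.py path/to/pdbechem.tar.gz
    for folder, n_files in sorted(count_by_top_level(sys.argv[1]).items()):
        print(f"{folder}\t{n_files}")
```

Files that sit directly in the archive root, such as a README, are reported under "(top level)".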

We believe that this restructuring will greatly improve user navigation and access to the specific data they require. Rest assured, all the existing data will be preserved and transferred to the relevant subfolders, ensuring a seamless transition for our users. If you have any questions or need assistance, please feel free to reach out to us at [email protected].