Your domain: Structural Bioinformatics

Introduction

Structural bioinformatics provides scientific methods to analyse, predict, and validate the three-dimensional structure of biological macromolecules such as proteins, RNA, DNA, or carbohydrates, including small molecules bound to them. It also provides an important link between the genomics and structural biology communities. One objective of structural bioinformatics is to create new methods for analysing and manipulating biological macromolecular data in order to predict structures, functions, and interactions. Currently, atomic structures of macromolecules can be obtained by both experimental and computational methods. This document describes guidelines for depositing computationally or experimentally solved structure models together with relevant metadata according to FAIR principles. Although the guidelines focus on the deposition process, predictors usually need to collect the relevant metadata while making the predictions so that the data are available at deposition time.

Description

Researchers in the field should be able to find predictions of macromolecular structures, access their coordinates, understand how and why they were produced, and have estimates of model quality to assess the applicability of the model for specific applications. The considerations and solutions described below are written from the perspective of protein structure predictions but they also apply to other types of macromolecular structures.

Considerations

- Is your prediction based on experimental data (i.e. integrative or hybrid modelling) or purely in silico?
  - This is important to define the appropriate deposition system.
- What is the purpose of the structure prediction? Is it a large-scale modelling effort using automated prediction methods to (for instance) generally increase structural coverage of known proteins or a single modelling effort performed, possibly with manual intervention, for a specific application?
  - This is important to define the appropriate deposition system.
- What is the source for the sequences of the modelled proteins?
  - This is important to cross-link with existing databases such as UniProtKB.
- What modelling steps were performed?
  - Descriptions here can vary widely among modelling methods but should be detailed enough to enable reproducibility and include references to methods described in manuscripts and publicly available software or web services.
- What input data were used for the modelling steps?
  - For protein structure predictions, this commonly includes the identification of homologous proteins from sequence databases with or without coverage by experimental structures. Knowing the input data greatly facilitates further analysis and reproducibility of the structure prediction.
- What is the expected accuracy of the structure prediction?
  - This is commonly referred to as “model quality” or “model confidence” and is of major relevance to determine whether a given model can be used for downstream analysis. Quality estimates should enable users to judge the expected accuracy of the prediction both globally and locally.
- Under which licence terms can others use your models?
  - Depending on the deposition system, there will be predefined and commonly permissive terms of use, but if this is to be restricted or if models are made available in a self-hosted system, an appropriate usage policy must be defined.

Solutions

There are three main options to make your models available:
- Deposit in ModelArchive for theoretical models of macromolecular structures. Models deposited in the ModelArchive are made available under the CC BY-SA 4.0 licence (see here for details).
- Deposit in PDB-Dev for models using integrative or hybrid modelling. Models deposited in PDB-Dev are made available under the CC0 1.0 licence (see here for details). If theoretical models were used as part of the modelling, they can either be included in the PDB-Dev deposition or, if they are expected to be useful on their own, deposited in ModelArchive and referenced.
- Make models available through a dedicated web service for large-scale modelling efforts that are updated regularly using automated prediction methods. One way to build such a service quickly is to deploy the MineProt application, which can curate data from most AlphaFold-like systems (see here for details). Unified access to these services can be provided through the 3D-Beacons network, which is being developed by the ELIXIR 3D-BioInfo Community. The data providers currently connected to the network are listed in the 3D-Beacons documentation. An appropriate licence must be associated with the models (check the licensing page for guidance on this) and must be compatible with CC-BY 4.0 if the models are to be distributed in the 3D-Beacons network.
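As a rough illustration, the choice between the three options above can be written as a small decision helper. This is only a sketch of the guidance given here: the names `Route` and `choose_route` and the two boolean inputs are hypothetical, and real cases (for example, theoretical models used inside an integrative study) may need more nuance.

```python
from enum import Enum


class Route(Enum):
    """The three ways of making models available listed above."""
    MODEL_ARCHIVE = "ModelArchive"
    PDB_DEV = "PDB-Dev"
    WEB_SERVICE = "dedicated web service (optionally connected to 3D-Beacons)"


def choose_route(uses_experimental_data: bool, large_scale_automated: bool) -> Route:
    """Pick a deposition route from two of the considerations above.

    Assumption: integrative or hybrid models always go to PDB-Dev, and
    only purely in silico models are split by the scale of the effort.
    """
    if uses_experimental_data:
        # Integrative or hybrid modelling
        return Route.PDB_DEV
    if large_scale_automated:
        # Regularly updated, automated large-scale predictions
        return Route.WEB_SERVICE
    # Single, purely theoretical modelling effort
    return Route.MODEL_ARCHIVE


if __name__ == "__main__":
    print(choose_route(uses_experimental_data=False, large_scale_automated=False).value)
```

The licence still has to be checked separately for each route, as described in the list above.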
Model coordinates are preferably stored in PDBx/mmCIF, the standard PDB archive format (see PDBx/mmCIF format and tools). Although the legacy PDB format suffices for many purposes and is still widely used, it is no longer being modified or extended.
Model quality estimates can be computed globally, per-residue, and per-residue-pair. The estimates should be computed using a relatively recent and well-benchmarked tool or by the structure prediction method itself. Please check CAMEO, CASP, and CAPRI to find suitable quality estimators. The 3D-BioInfo Community is also currently working to further improve benchmarking for protein complexes, protein-ligand interactions, and nucleic acid structures. By convention, the main per-residue quality estimates are stored in place of B-factors in model coordinate files. In mmCIF files, any number of quality estimates can be properly described and stored in the ma_qa_metric category of the PDBx/mmCIF ModelCIF Extension Dictionary described below.
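To make the B-factor convention concrete, the sketch below reads per-residue quality estimates back out of legacy PDB-format text using the fixed column layout of ATOM records (temperature factor in columns 61-66). The function names are hypothetical, and taking the value from the CA atom assumes a protein model in which every atom of a residue carries the same score. `make_atom_line` exists only to build well-aligned demo records.

```python
def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z, score):
    """Build one fixed-width ATOM record (demo helper, occupancy fixed at 1.00)."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.00:6.2f}{score:6.2f}")


def per_residue_scores(pdb_text):
    """Map (chain, residue number, residue name) to the B-factor column value.

    Only CA atoms are read, assuming one representative atom per residue.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":      # atom name, columns 13-16
            continue
        key = (line[21],                     # chain ID, column 22
               int(line[22:26]),             # residue number, columns 23-26
               line[17:20].strip())          # residue name, columns 18-20
        scores[key] = float(line[60:66])     # B-factor, columns 61-66
    return scores


if __name__ == "__main__":
    demo = "\n".join([
        make_atom_line(1, " N", "MET", "A", 1, 1.0, 2.0, 3.0, 90.5),
        make_atom_line(2, " CA", "MET", "A", 1, 2.0, 3.0, 4.0, 90.5),
        make_atom_line(3, " CA", "LYS", "A", 2, 5.0, 6.0, 7.0, 42.25),
    ])
    for residue, score in per_residue_scores(demo).items():
        print(residue, score)
```

For mmCIF files, a dedicated mmCIF parser is the safer choice, since the quality estimates there live in their own categories and not in fixed columns.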
Metadata for theoretical models of macromolecular structures should preferably be stored using the PDBx/mmCIF ModelCIF Extension Dictionary, independently of the deposition process. The extension is being developed by the ModelCIF working group with input from the community. Feedback and change requests are welcome and can be submitted on GitHub. The same information can also be provided manually during deposition in ModelArchive, and additional documentation describes how to provide metadata and the minimal requirements for it. Generally, the metadata must include:
- a short description of the study for which the model was generated;
- if available, a citation to the manuscript referring to the models;
- the source for the sequences of modelled proteins with references to databases such as UniProt;
- modelling steps with references to available software or web services used and to manuscripts describing the method;
- input data needed for the modelling steps. For instance in homology modelling this could include the PDB identifiers for the template structures used for modelling and their alignments to the target protein;
- model quality estimates.
If necessary, accompanying data can be provided in separate files using different file formats. The files can be added to ModelArchive depositions and referred to in the PDBx/mmCIF ModelCIF Extension Dictionary format.
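Because this metadata is best collected while the predictions are being made, it can help to keep a simple checklist alongside each model. The sketch below mirrors the list above as a Python dataclass. All field and function names are hypothetical and do not correspond to ModelCIF category names. The citation is treated as optional because the list asks for it only "if available".

```python
from dataclasses import dataclass, field, fields


@dataclass
class ModelMetadata:
    """Checklist of the metadata items listed above (illustrative names)."""
    study_description: str = ""
    citation: str = ""                                      # optional
    sequence_sources: list = field(default_factory=list)    # e.g. UniProt accessions
    modelling_steps: list = field(default_factory=list)     # software, web services, references
    input_data: list = field(default_factory=list)          # e.g. template IDs, alignments
    quality_estimates: dict = field(default_factory=dict)   # metric name -> value


OPTIONAL_FIELDS = {"citation"}


def missing_fields(meta):
    """Return the names of required items that are still empty."""
    return [f.name for f in fields(meta)
            if f.name not in OPTIONAL_FIELDS and not getattr(meta, f.name)]


if __name__ == "__main__":
    draft = ModelMetadata(study_description="Homology model of an example protein")
    print(missing_fields(draft))
```

Running the check before deposition gives an early warning about items that were not recorded during modelling.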

Description

Experimentally solved atomic structures of molecules can be obtained by several methods, such as X-ray crystallography, Nuclear Magnetic Resonance (NMR) spectroscopy, and 3D Electron Microscopy. Here you can find useful tools and guides for storing and sharing structure models based on these methods.

Structure models resulting from experimental methods are broadly available in the PDB under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.
Additionally, raw and intermediate data associated with the structure model can also be published in different curated data archives, depending on the methods used. Raw EM data can be stored in EMPIAR, and processed 3D volumes and tomograms in the Electron Microscopy Data Bank (EMDB). Raw data from NMR studies can be stored in BMRB. For X-ray diffraction experiments, raw data can be stored in IRRMC and SBGrid Data Bank. Data submitted to these repositories can be cross-referenced to related PDB entries.
To extend molecular context, structure models can be visualised and analysed along with their respective volume maps and UniProt sequences using data aggregators such as 3DBioNotes. This can be done regardless of whether they have been published in PDB and EMDB. A collection of biochemical, biomedical, and validation annotations is mapped onto the structure model coordinates and sequence to help the user better understand macromolecular function and interactions. Users can also use the COVID-19 Structural Hub, a dedicated summary view for all published SARS-CoV-2 structure models.
As with computationally solved structures, model coordinates for experimentally solved structures are preferably stored in PDBx/mmCIF, the standard PDB archive format (see PDBx/mmCIF format and tools), and their metadata using the PDBx/mmCIF ModelCIF Extension Dictionary.
Data models and metadata standards for submitting data underpinning macromolecular models depend on the experimental method used. The EMDB map distribution format broadly follows the CCP4 and MRC map formats. Metadata is contained in a header file, an XML file that follows the XSD data model. The EMPIAR data model schema consists of the main empiar.xsd XML schema file and additional requirements in empiar.sch in Schematron format (see here for more details). The BMRB (meta)data distribution format is based on NMR-STAR, an extension of the Self-defining Text Archive and Retrieval (STAR) file format.
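As an illustration of the MRC/CCP4 map layout mentioned above, the sketch below reads a few fields from the 1024-byte map header using only the Python standard library. It assumes a little-endian file and the MRC2014 word layout (grid size in words 1-3, data mode in word 4, cell dimensions in words 11-13, the "MAP " identifier in word 53). The function name is hypothetical, and real pipelines would normally rely on an established map-reading library instead.

```python
import struct

HEADER_SIZE = 1024  # bytes in the main MRC/CCP4 header


def read_map_header(header):
    """Extract grid size, data mode and cell dimensions from a map header.

    Assumes little-endian byte order; big-endian files are not handled.
    """
    if len(header) < HEADER_SIZE:
        raise ValueError("header is shorter than 1024 bytes")
    if header[208:212] != b"MAP ":
        raise ValueError("missing 'MAP ' identifier at byte offset 208")
    nx, ny, nz, mode = struct.unpack_from("<4i", header, 0)   # words 1-4
    cell = struct.unpack_from("<3f", header, 40)              # words 11-13
    return {"grid": (nx, ny, nz), "mode": mode, "cell": cell}


if __name__ == "__main__":
    # Build a synthetic header instead of reading a real file.
    demo = bytearray(HEADER_SIZE)
    struct.pack_into("<4i", demo, 0, 64, 64, 32, 2)
    struct.pack_into("<3f", demo, 40, 128.0, 128.0, 64.0)
    demo[208:212] = b"MAP "
    print(read_map_header(bytes(demo)))
```

For a file on disk, the first 1024 bytes read in binary mode can be passed to the same function.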
For image processing, users can work with FAIR-compliant workflow managers such as Scipion to obtain macromolecular models using Electron Microscopy (3DEM). Scipion integrates several software packages while taking care of formats, conversions, and data submissions to public repositories such as EMDB and EMPIAR. A cloud-compatible version of Scipion can be deployed with ScipionCloud, either on a single server or in a cluster. It can also be deployed in EOSC cloud infrastructures (see here for details).
Access to Research Infrastructure Administration (ARIA) is a platform that projects and infrastructures can use to manage access, from proposal submission to reporting. It provides tools for facilities within a distributed infrastructure to manage their equipment. ARIA will soon allow linking of output data and metadata with proposals, publications and other outputs.

More information

Tools and resources on this page

Tool or resource	Description	Related pages	Registry
3D-Beacons	Network providing unified programmatic access to experimentally determined and predicted structure models		Tool info
3DBioNotes	3DBIONOTES-WS is a web application designed to automatically annotate biochemical and biomedical information onto structural models.	Machine actionability	Tool info Training
Access to Research Infrastructure Administration (ARIA)	Access and data management platform that allows facilities to manage outputs and associate them with proposal information and other research outputs.		Tool info
BMRB	Biological Magnetic Resonance Data Bank	Intrinsically disorder...	Tool info
CAMEO	Continuous evaluation of the accuracy and reliability of protein structure prediction methods in a fully automated manner		Tool info Standards/Databases
CAPRI	Critical assessment of structure prediction methods for protein-protein interactions
CASP	Biennial critical assessment of techniques for protein structure prediction		Tool info Training
COVID-19 Structural Hub	Structural information organiser, collecting structures deposited in public repositories together with computationally predicted models from SARS-CoV-2, as well as their interactions with host proteins.		Training
Electron Microscopy Data Bank (EMDB)	A public repository for electron cryo-microscopy maps and tomograms of macromolecular complexes and subcellular structures.		Tool info
EMPIAR	Electron Microscopy Public Image Archive is a public resource for raw, 2D electron microscopy images. You can browse, upload and download the raw images used to build a 3D structure	OMERO Bioimaging data Data publication	Tool info Standards/Databases Training
IRRMC	Integrated Resource for Reproducibility in Macromolecular Crystallography. Repository of diffraction experiments used to determine protein structures in the PDB, contributed by the CSGID, SSGCID, JCSG, MCSG, SGC, and other large-scale projects, as well as individual research laboratories.
MineProt	A stand-alone server for structural proteome curation		Tool info
ModelArchive	Repository for theoretical models of macromolecular structures with DOIs for models	Biomolecular simulatio...	Tool info Standards/Databases
PDB	The Protein Data Bank (PDB)	Galaxy Intrinsically disorder... Data publication	Tool info Training
PDB-Dev	Prototype archiving system for structural models obtained using integrative or hybrid modeling	Biomolecular simulatio...
PDBx/mmCIF format and tools	Information about the standard PDB archive format PDBx/mmCIF, its dictionaries and related software tools		Standards/Databases
PDBx/mmCIF ModelCIF Extension Dictionary	Extension of the PDBx/mmCIF dictionary for theoretical models of macromolecular structures
SBGrid Data Bank	Repository of X-ray diffraction, MicroED, LLSM datasets, as well as structural models.
Scipion	Cryo-EM image processing framework. Integration, traceability and analysis.		Tool info Standards/Databases
ScipionCloud	Cloud virtual machine with the Scipion software to process EM imaging data either in a single server or in a cluster.
UniProt	Comprehensive resource for protein sequence and annotation data	Galaxy Intrinsically disorder... Proteomics Single-cell sequencing Machine actionability	Tool info Standards/Databases Training