Development build for ELIXIR-Belgium/rdmkit-sandbox@0fafbdf (branch: contribute-refactor)

Your domain: Biomolecular simulation data

Introduction

Biomolecular simulations are an important technique for understanding and designing biological molecules and their interactions. Simulation methods have a rapidly growing impact in areas as diverse as biocatalysis, drug delivery, biomaterials, biotechnology, and drug or protein design. Simulations offer uniquely detailed, atomic-level insight into mechanisms, dynamics, and processes, as well as increasingly accurate predictions of molecular properties. Yet the field has only relatively recently started to store and share (bio)simulation data for reuse in new, unexpected projects, and to discuss how to FAIRify these data (i.e. to make them Findable, Accessible, Interoperable and Reusable). Here we show several current possibilities for moving in this direction, but we stress that these guidelines are not set in stone and that the biomolecular simulation community still needs to address challenges to FAIRify its data.

Storing and sharing the data from biomolecular simulations

Description

Biomolecular simulation data comes in several forms and multiple formats, which unfortunately are not completely interoperable. Different methods also require slightly different metadata descriptions.

Considerations

  • What type of data do you have?
    • Molecular dynamics data - by far the most typical and largest biomolecular simulation data. Each molecular dynamics simulation is determined by the engine used, the force field, and multiple other, often hidden, simulation parameters, and produces trajectories that are further analysed.
    • Molecular docking data - docking provides the structures of the complex (e.g. ligand-protein, protein-protein, protein-nucleic acid, etc.) and its score/energy.
    • Virtual screening data - virtual screening is used to select active compounds from a pool of candidates; the results are usually a list of compound IDs with their scores/energies.
    • Free energies and other analysis data - data calculated by analysing the simulations.
  • Where should you store this data?
    • Since there has been no common community repository able to hold the often very large simulation data, the field has not stored them systematically. Recently, multiple options for storing the data have appeared. The repositories can be divided into two main branches:
      • Generic: Repositories that can be used to store any kind of data.
      • Specific: Repositories designed to store specific data (e.g. MD data).
    • Are you looking for long-term or short-term storage? Repositories have different options (and sometimes prices) for the storage time of your data.
    • Do you need a static reference for your data? A code (identifier) that can uniquely identify and refer to your data?
  • What data should you store?
    • Which of all the data generated in your project should you store? Again, this might vary depending on the biomolecular simulation field.
    • Consider what is essential (absolutely needed to reproduce the simulated experiment) versus what can be extracted from this data (analyses).

  • How do you want your data to be shared?
    • You should consider the terms under which other scientists can access, use, modify, or redistribute your data for other projects.
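
When weighing the storage questions above, a rough size estimate can help decide which repository is realistic. The following is a minimal sketch, assuming uncompressed single-precision Cartesian coordinates (three 4-byte values per atom per frame) and ignoring the initial frame, velocities, forces, box information, and format overhead. The function names and the example system are hypothetical; real trajectory formats often compress coordinates, so actual files may be considerably smaller.

```python
def frames_saved(length_ns: float, save_interval_ps: float) -> int:
    """Number of frames written when one frame is saved every save_interval_ps."""
    return int(round(length_ns * 1000.0 / save_interval_ps))


def raw_coordinate_bytes(n_atoms: int, n_frames: int, bytes_per_value: int = 4) -> int:
    """Uncompressed size of the x, y, z coordinates for all atoms and frames."""
    return n_atoms * n_frames * 3 * bytes_per_value


# Hypothetical system: 100,000 atoms, 1000 ns of simulation saved every 100 ps.
n_frames = frames_saved(1000.0, 100.0)
size_bytes = raw_coordinate_bytes(100_000, n_frames)
print(f"{n_frames} frames, about {size_bytes / 1e9:.1f} GB of raw coordinates")
# prints "10000 frames, about 12.0 GB of raw coordinates"
```

Comparing such an estimate with the size limits and storage terms of a candidate repository is one way to decide early whether the full trajectory, or only a reduced subset, can be deposited.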

Solutions

  • Deposit your data to a suitable repository for sharing. There is a long (and incomplete) list of repositories available for data sharing. Repositories are divided into two main categories, general-purpose and discipline-specific, and both categories are utilised in the domain of biomolecular modeling and simulation. For a general introduction to repositories, you are advised to read the data publication page.
  • Based on the type of data to be shared, pay attention to which data and metadata should be deposited in repositories. Below are some suggested examples of essential and optional data describing biomolecular simulation data:
    • Molecular Dynamics:
      • Essentials:
        • Metadata (Temperature, pressure, program, version, …)
        • Complete set of input files that were used in the simulations
        • Trajectory(ies)
        • Topology(ies)
      • Optionals:
        • Analysis data (Free energy, snapshots, clustering)
    • Docking poses:
      • Essentials:
        • The complete set of molecules tested as well as the scoring functions used and the high-ranking, final poses (3D-structures)
        • Metadata (Identifiers (SMILES, InChI-Key), target (PDBID), energies/scores, program, version, box definition)
      • Optionals:
        • Complete ensemble of poses
    • Virtual Screening:
      • Essentials:
        • Sorted list of molecules
        • Metadata (identifiers of ligands and decoy molecules, target, program+version, type of VS (QSAR, ML, Docking,…))
      • Optionals:
        • Details of the method, scores, …
    • Free energies and other analyses:
      • Essentials:
        • Metadata (model, method, program, version, force field(s), etc.)
        • Values (Free energy values, channels, etc.)
      • Optionals:
        • Link to Trajectory (Dynamic PDB?)
  • Associate a license with the data and/or source code (e.g. models). Licenses mainly differ on openness vs restrictiveness, and it is crucial to understand the differences among licenses before sharing your research outputs. The RDMkit licensing page lists resources that can help you understand licensing and choose an appropriate license.
  • File formats: The biomolecular simulation field tends to produce a multitude of input/output formats, each mainly tied to one software package. That makes interoperability and reproducibility really difficult. You can share your data, but it will only be useful if the interested scientist has access to the tool that generated it. The field is working on possible standards (e.g. the TNG trajectory format).

  • Metadata standards: There is no existing standard defining the type and format of the metadata needed to describe a particular project and its associated data. How to store the program, version, parameters used, input files, etc., is still an open question, which has been addressed in many ways and using many formats (JSON, XML, TXT, etc.). Again, different initiatives are trying to address this issue (see further references).

  • Data size: Data generated in the biomolecular simulation field is growing at an alarming pace. Making this data available to the scientific community sometimes means transferring it to long-term storage, and even this seemingly straightforward process can be cumbersome because of the large data size.
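
Since no metadata standard exists, one pragmatic option is to ship a small machine-readable record alongside the deposited files. The sketch below is only an illustration under stated assumptions: the field names are hypothetical and simply mirror the molecular dynamics "essentials" suggested above, JSON is just one of the formats mentioned, and all values are placeholders rather than data from a real project.

```python
import json

# Hypothetical field names: no community standard defines them, so these
# simply mirror the suggested "essentials" for molecular dynamics data.
ESSENTIAL_MD_FIELDS = (
    "program", "version", "temperature_K", "pressure_bar",
    "input_files", "trajectories", "topologies",
)


def missing_essentials(record: dict) -> list:
    """Return the essential fields that are absent or empty in the record."""
    missing = []
    for field in ESSENTIAL_MD_FIELDS:
        value = record.get(field)
        if value is None or value == "" or value == []:
            missing.append(field)
    return missing


# Placeholder values for illustration only, not taken from a real project.
record = {
    "program": "example-md-engine",
    "version": "1.0",
    "temperature_K": 300.0,
    "pressure_bar": 1.0,
    "input_files": ["run.in"],
    "trajectories": ["run.traj"],
    "topologies": [],  # deliberately left empty to show the check
    "analysis": {"free_energy_kJ_per_mol": None},  # optional data
}

print(missing_essentials(record))  # prints ['topologies']

# Serialise the record so it can be deposited next to the data files.
metadata_json = json.dumps(record, indent=2, sort_keys=True)
```

Running such a check before deposition catches incomplete records early. Analogous field lists could be written for docking, virtual screening, or free energy data, following the essentials listed for each data type.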

More information

Tool or resource | Description | Related pages | Registry
BigNASim | Repository for Nucleic Acids MD simulations | | Tool info
BindingDB | Public, web-accessible database of measured binding affinities | | Tool info; Standards/Databases
Bioactive Conformational Ensemble | Platform designed to efficiently generate bioactive conformers and speed up the drug discovery process. | | Tool info
BioExcel COVID-19 | Platform designed to provide web-access to atomistic-MD trajectories for macromolecules involved in the COVID-19 disease. | |
Dryad | Open-source, community-led data curation, publishing, and preservation platform for CC0 publicly available research data | Bioimaging data; Data publication | Standards/Databases
Dynameomics | Database of folding / unfolding pathway of representatives from all known protein folds by MD simulation | |
FigShare | Data publishing platform | Data publication; Identifiers; Documentation and meta... | Standards/Databases; Training
GPCRmd | Repository of GPCR protein simulations | | Tool info
MemProtMD | Database of over 5000 intrinsic membrane protein structures | | Tool info
Mendeley data | Multidisciplinary, free-to-use open repository specialized for research data | Data publication; Existing data | Standards/Databases
MoDEL | Database of Protein Molecular Dynamics simulations representing different structural clusters of the PDB | | Tool info; Training
MoDEL-CNS | Repository for Central Nervous System-related mainly membrane protein MD simulations | |
ModelArchive | Repository for theoretical models of macromolecular structures with DOIs for models | Structural Bioinformatics | Tool info; Standards/Databases
MolMeDB | Database about interactions of molecules with membranes | | Tool info; Standards/Databases
MolSSI - BioExcel COVID-19 therapeutics hub | Aggregating critical information to accelerate COVID-19 drug discovery for the molecular modeling and simulation community. | |
NMRlipids | Repository for lipid MD simulations to validate force fields with NMR data | |
OpenScienceFramework | Free and open source project management tool that supports the entire research lifecycle: planning, execution, reporting, archiving, and discovery | Documentation and meta... | Standards/Databases
PDB-Dev | Prototype archiving system for structural models obtained using integrative or hybrid modeling | Structural Bioinformatics |
Zenodo | Generalist research data repository built and developed by OpenAIRE and CERN | FAIRtracks; Plant Phenomics; Bioimaging data; Plant sciences; Single-cell sequencing; Data publication; Identifiers | Standards/Databases; Training