Introduction
Biomolecular simulations are an important technique for understanding and designing biological molecules and their interactions. Simulation methods have a rapidly growing impact in areas as diverse as biocatalysis, drug delivery, biomaterials, biotechnology, and drug or protein design. Simulations offer the potential of uniquely detailed, atomic-level insight into mechanisms, dynamics, and processes, as well as increasingly accurate predictions of molecular properties. Yet the field has only relatively recently started to store and share (bio)simulation data for reuse in new, unexpected projects, and to discuss the FAIRification of biomolecular simulation data (i.e. making them Findable, Accessible, Interoperable and Reusable). Here we show several current possibilities moving in this direction, but we stress that these guidelines are not set in stone and that the biomolecular simulation community still needs to address challenges to FAIRify its data.
Storing and sharing the data from biomolecular simulations
Description
Biomolecular simulation data come in several forms and multiple formats, which unfortunately are not completely interoperable. Different methods also require slightly different metadata descriptions.
Considerations
- What type of data do you have?
- Molecular dynamics data - by far the most typical and largest biomolecular simulation data. Each molecular dynamics simulation is determined by the engine used, the force field, and multiple other, often hidden, simulation parameters, and produces trajectories that are further analysed.
- Molecular docking data - docking provides the structures of the complex (e.g. ligand-protein, protein-protein, protein-nucleic acid, etc.) and its score/energy.
- Virtual screening data - virtual screening is used to select active compounds from a pool of candidates; the data usually take the form of a compound ID and its score/energy.
- Free energies and other analysis data - data calculable from the analysis of the simulations.
- Where should you store this data?
- Since there has been no common community repository able to gather the often very large simulation data, the field has not stored them systematically. Recently, several options for storing the data have become available. The repositories can be divided into two main branches:
- Generic: Repositories that can be used to store any kind of data.
- Specific: Repositories designed to store specific data (e.g. MD data).
- Are you looking for a long-term or short-term storage? Repositories have different options (and sometimes prices) for the storage time of your data.
- Do you need a static reference for your data? A code (identifier) that can uniquely identify and refer to your data?
- What data should you store?
- Decide which of the data generated in your project should be stored. Again, the type of data might vary depending on the biomolecular simulation field.
- Consider what is essential (absolutely needed to reproduce the simulated experiment) versus what can be extracted from this data (analyses).
- How do you want your data to be shared?
- You should consider the terms under which other scientists can access, use, modify, or redistribute your data in other projects.
Solutions
- Deposit your data in a suitable repository for sharing. There is a long (and incomplete) list of repositories available for data sharing. Repositories are divided into two main categories, general-purpose and discipline-specific, and both categories are utilised in the domain of biomolecular modeling and simulation. For a general introduction to repositories, you are advised to read the data publication page.
- General-purpose repositories such as Zenodo, FigShare, Mendeley data, Dryad, and OpenScienceFramework can be used.
- Discipline-specific repositories can be used when the repository supports the type of data to be shared e.g. molecular dynamics data. Repositories for various data types and models are listed below:
- Molecular Dynamics repositories
- GPCRmd - for GPCR protein simulations, with submission process.
- MoDEL - (https://bio.tools/model) specific database for protein MD simulations.
- BigNASim - (https://bio.tools/bignasim) specific database for Nucleic Acids MD simulations, with submission process.
- MoDEL-CNS - specific database for Central Nervous System-related, mainly membrane protein, MD simulations.
- NMRlipids - project to validate lipid force fields with NMR data, with submission process.
- MolSSI - BioExcel COVID-19 therapeutics hub - database with COVID-19 related simulations, with submission process.
- Molecular Dynamics databases - allow access to precalculated data
- BioExcel COVID-19 - database and associated web server offering graphical analyses of the COVID-19 related MD trajectories stored in the MolSSI - BioExcel COVID-19 therapeutics hub.
- Dynameomics - database of folding/unfolding pathways
- MemProtMD - database of automatically generated membrane proteins from PDB inserted into simulated lipid bilayers
- Docking repositories
- MolSSI - BioExcel COVID-19 therapeutics hub - database with COVID-19 related simulations, with submission process.
- PDB-Dev - prototype archiving system for structural models using integrative or hybrid modeling, with submission process.
- ModelArchive - theoretical models of macromolecular structures, with submission process.
- Virtual Screening repositories:
- Bioactive Conformational Ensemble - small molecule conformations, with submission process.
- BindingDB - database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be drug targets with small, drug-like molecules, with submission process.
- Repositories for the analyzed data from simulations:
- MolMeDB - for molecule-membrane interactions and free energy profiles, with submission process.
- ChannelsDB - resource of channels, pores and tunnels found in biomacromolecules, with submission process.
- Depending on the type of data to be shared, consider which data and metadata should be deposited in the repository. Some suggested examples of essential and optional data describing biomolecular simulation data are listed below:
- Molecular Dynamics:
- Essentials:
- Metadata (Temperature, pressure, program, version, …)
- Complete set of input files that were used in the simulations
- Trajectory(ies)
- Topology(ies)
- Optionals:
- Analysis data (Free energy, snapshots, clustering)
- Docking poses:
- Essentials:
- The complete set of molecules tested as well as the scoring functions used and the high-ranking, final poses (3D-structures)
- Metadata (Identifiers (SMILES, InChI-Key), target (PDBID), energies/scores, program, version, box definition)
- Optionals:
- Complete ensemble of poses
- Virtual Screening:
- Essentials:
- List of molecules sorted
- Metadata (identifiers of ligands and decoy molecules, target, program+version, type of VS (QSAR, ML, Docking,…))
- Optionals:
- Details of the method, scores, …
- Free energies and other analyses:
- Essentials:
- Metadata (model, method, program, version, force field(s), etc.)
- Values (Free energy values, channels, etc.)
- Optionals:
- Link to Trajectory (Dynamic PDB?)
- Associate a license with the data and/or source code e.g. models. Licenses mainly differ on openness vs restrictiveness, and it is crucial to understand the differences among licenses before sharing your research outputs. The RDMkit licensing page lists resources that can help you understand licensing and choose an appropriate license.
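The essential and optional items suggested above for a molecular dynamics deposit can be checked mechanically before upload. The following is a minimal sketch only: the key names (`temperature_K`, `input_files`, and so on), the example values, and the file names are illustrative assumptions, not a community standard or the schema of any repository listed on this page.

```python
import json

# Essential entries suggested above for a molecular dynamics deposit.
# The key names are illustrative assumptions, not a community standard.
MD_ESSENTIALS = ("metadata", "input_files", "trajectories", "topologies")
MD_METADATA_KEYS = ("temperature_K", "pressure_bar", "program", "version")


def missing_md_fields(record):
    """Return the names of essential entries that are absent or empty."""
    missing = [key for key in MD_ESSENTIALS if not record.get(key)]
    metadata = record.get("metadata") or {}
    missing += [
        f"metadata.{key}"
        for key in MD_METADATA_KEYS
        if metadata.get(key) in (None, "")
    ]
    return missing


# Hypothetical example record; all values and file names are placeholders.
record = {
    "metadata": {
        "temperature_K": 300.0,
        "pressure_bar": 1.0,
        "program": "GROMACS",
        "version": "2023.1",
    },
    "input_files": ["md.mdp"],
    "trajectories": ["traj.xtc"],
    "topologies": ["topol.top"],
    "analysis": [],  # optional: free energies, snapshots, clustering
}

if __name__ == "__main__":
    print("Missing essentials:", missing_md_fields(record))
    print(json.dumps(record, indent=2))
```

A record like this could be stored next to the deposited files as a JSON document; the same pattern could be adapted to the docking, virtual screening, and free energy lists by swapping the tuples of required keys.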
Related problems
- File formats: The biomolecular simulation field tends to produce a multitude of input/output formats, each mainly tied to one software package. This makes interoperability and reproducibility difficult. You can share your data, but they will only be useful if the interested scientist has access to the tool that generated them. The field is working on possible standards (e.g. TNG trajectory).
- Metadata standards: There is no existing standard defining the type and format of the metadata needed to describe a particular project and its associated data. How to store the program, version, parameters used, input files, etc., is still an open question, which has been addressed in many ways and using many formats (json, xml, txt, etc.). Again, different initiatives exist that try to address this issue (see further references).
- Data size: Data generated in the biomolecular simulation field are growing at an alarming pace. Making these data available to the scientific community sometimes means transferring them to long-term storage, and even this a priori straightforward process can be cumbersome because of the large data size.
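One common way to make transfers of large trajectory files less error-prone is to record a checksum before the transfer and compare it afterwards. The sketch below is an illustration under stated assumptions, not a procedure prescribed by any repository named here: the function names and the 1 MiB chunk size are hypothetical choices, and only the Python standard library is used.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 digest of a file without loading it into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte trajectories stay manageable.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_intact(original_digest, copied_path):
    """Return True if the copied file has the digest recorded before transfer."""
    return sha256_of_file(copied_path) == original_digest
```

Reading in chunks keeps memory use constant regardless of file size, which matters when a single trajectory does not fit in memory.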
Tools and resources on this page
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
BigNASim | Repository for Nucleic Acids MD simulations | | Tool info |
BindingDB | Public, web-accessible database of measured binding affinities | | Tool info, Standards/Databases |
Bioactive Conformational Ensemble | Platform designed to efficiently generate bioactive conformers and speed up the drug discovery process. | | Tool info |
BioExcel COVID-19 | Platform designed to provide web-access to atomistic-MD trajectories for macromolecules involved in the COVID-19 disease. | | |
Dryad | Open-source, community-led data curation, publishing, and preservation platform for CC0 publicly available research data | Bioimaging data, Data publication | Standards/Databases |
Dynameomics | Database of folding / unfolding pathway of representatives from all known protein folds by MD simulation | | |
FigShare | Data publishing platform | Data publication, Identifiers, Documentation and meta... | Standards/Databases, Training |
GPCRmd | Repository of GPCR protein simulations | | Tool info |
MemProtMD | Database of over 5000 intrinsic membrane protein structures | | Tool info |
Mendeley data | Multidisciplinary, free-to-use open repository specialized for research data | Data publication, Existing data | Standards/Databases |
MoDEL | Database of Protein Molecular Dynamics simulations representing different structural clusters of the PDB | | Tool info, Training |
MoDEL-CNS | Repository for Central Nervous System-related mainly membrane protein MD simulations | | |
ModelArchive | Repository for theoretical models of macromolecular structures with DOIs for models | Structural Bioinformatics | Tool info, Standards/Databases |
MolMeDB | Database about interactions of molecules with membranes | | Tool info, Standards/Databases |
MolSSI - BioExcel COVID-19 therapeutics hub | Aggregating critical information to accelerate COVID-19 drug discovery for the molecular modeling and simulation community. | | |
NMRlipids | Repository for lipid MD simulations to validate force fields with NMR data | | |
OpenScienceFramework | Free and open-source project management tool that supports the entire research lifecycle: planning, execution, reporting, archiving, and discovery | Documentation and meta... | Standards/Databases |
PDB-Dev | Prototype archiving system for structural models obtained using integrative or hybrid modeling | Structural Bioinformatics | |
Zenodo | Generalist research data repository built and developed by OpenAIRE and CERN | FAIRtracks, Plant Phenomics, Bioimaging data, Plant sciences, Single-cell sequencing, Data publication, Identifiers | Standards/Databases, Training |