What features do you need in a storage solution when collecting data?
Description
The need for data storage arises early in a research project, because you need space for your data as soon as collection or generation starts. It is therefore good practice to think about storage solutions during the data management planning phase, and to request and/or pay for storage in advance.
The storage solution for your data should fulfil certain criteria (e.g. space, access and transfer speed, duration of storage), which should be discussed with the IT team. You may choose a tiered storage system, which assigns data to various types of storage media based on requirements for access, performance, recovery and cost. Using tiered storage allows you to classify data according to levels of importance and assign it to the appropriate storage tier, or to move it to a different tier later. For example, once analysis is completed, you have the option to move data to a lower tier for preservation or archiving.
Tiered storage is classified as “cold” or “hot” storage. “Hot” storage is associated with fast access speed, high access frequency and high-value data, and consists of faster drives such as Solid State Drives (SSDs). This storage is usually located close to the user, such as on campus, and incurs high costs. “Cold” storage is associated with low access speed and frequency, and consists of slower drives or tapes. This storage is usually off-premises and incurs low costs.
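As a rough illustration of the hot/cold distinction, the sketch below suggests a tier for a dataset from how often it is accessed and whether analysis has finished. The `Dataset` class, the `suggest_tier` function and the threshold of 10 accesses per month are hypothetical choices made for this example; the actual criteria should be agreed with your IT team.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    accesses_per_month: int  # how often the data is read
    analysis_complete: bool  # True once active analysis has finished


def suggest_tier(ds: Dataset, hot_threshold: int = 10) -> str:
    """Return "hot" or "cold" for a dataset.

    Illustrative rule only: the threshold is an assumption, not a standard.
    """
    # Data that is no longer being analysed can move to a lower tier.
    if ds.analysis_complete:
        return "cold"
    # Frequently accessed data benefits from fast (hot) storage.
    return "hot" if ds.accesses_per_month >= hot_threshold else "cold"


if __name__ == "__main__":
    for ds in [Dataset("raw_reads", 40, False), Dataset("pilot_2019", 0, True)]:
        print(ds.name, "->", suggest_tier(ds))
```

A rule like this is only a starting point for the discussion; real tiering policies usually also weigh cost, recovery time and data value.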
Considerations
When looking for solutions to store your data during the collection or generation phase, you should consider the following aspects.
- The volume of your data is an important discerning factor to determine the appropriate storage solution. At the minimum, try to estimate the volume of raw data that you are going to generate or collect.
- Consider what access/transfer speed and access frequency your data will require.
- Knowing where the data will come from is also crucial. If the data comes from an external facility or needs to be transferred to a different server, you should think about an appropriate data transfer method.
- It is a good practice to have a copy of the original raw data in a separate location, to keep it untouched and unchanged (not editable).
- Knowing how long the raw data, as well as data processing pipelines and analysis workflows, need to be stored, especially after the end of the project, is also relevant when choosing storage.
- It is highly recommended to have metadata, such as an identifier and file description, associated with your data (see Documentation and metadata page). This is useful if you want to retrieve the data years later or if your data needs to be shared with your colleagues for collaboration. Make sure to keep metadata together with the data or establish a clear link between data and metadata files.
- In addition to the original “read-only” raw (meta)data files, you need storage for files used for data processing and analysis as well as the workflows/processes used to produce the data. For these, you should consider:
- who is allowed to access the data (in case of collaborative projects), how do they expect to access the data and for what purpose;
- check if you have the rights to give access to the data, in case of legal limitations or third party rights (for instance, collaboration with industry);
- consult policy for data sharing outside the institute/country (see Compliance monitoring page).
- Keeping track of the changes (version control), conflict resolution and back-tracing capabilities.
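One way to keep an untouched copy of the original raw data, as recommended above, is to copy each file to a separate location, verify the copy with a checksum, and remove write permission. The following is a minimal sketch under those assumptions; `sha256sum` and `archive_raw_copy` are hypothetical helper names, and file permissions alone do not replace snapshots or backups offered by your IT support.

```python
import hashlib
import shutil
import stat
from pathlib import Path


def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def archive_raw_copy(source: Path, archive_dir: Path) -> Path:
    """Copy a raw data file to archive_dir, verify it, and make it read-only."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256sum(source) != sha256sum(target):
        raise IOError(f"checksum mismatch for {target}")
    # Leave read permission only (no write bit for user, group or others).
    target.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return target
```

Storing the checksum next to the archived file (for example in the README) lets you confirm years later that the raw data is still unchanged.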
Solutions
- Provide an estimate about the volume of your raw data (i.e., is it in the order of Megabytes, Gigabytes or Terabytes?) to the IT support in your institute when consulting for storage solutions.
- Clarify if your data needs to be transferred from one location to another. Try to provide IT with as much information as possible about the system where the data will come from. See our Data transfer page for additional information.
- Ask for a tiered storage solution that gives you easy and fast access to the data for processing and analysis. Explain to the IT support what machine or infrastructure you need to access the data from and if other researchers should have access as well (in case of collaborative projects).
- Ask if the storage solution includes automatic management of versioning, conflict resolution and back-tracing capabilities (see also our Data organisation page).
- Ask the IT support in your institute if they offer technical solutions to keep a copy of your (raw) data secure and untouched (snapshot, read-only access, backup…). You could also keep a copy of the original data file in a separate folder as “read-only”.
- For small data files and private or collaborative projects within your institute, commonly accessible cloud storage is usually provided by the institute, such as Nextcloud (on-premises), Microsoft OneDrive, Dropbox, Box, etc. Do not use personal accounts on similar services for this purpose; adhere to the policies of your institute.
- For large data sets, consider cloud storage services (such as ScienceMesh or OpenStack) and cloud synchronization and sharing services (CS3), such as CERNBox or SeaFile.
- Funders or universities usually require you to store raw data and data analysis workflows (for reproducible results) for a certain amount of time after the end of the project (see our Preserve page). Check the data policy for your project or institute to know if a copy of the data should also be stored at your institute for a specific time after the project. This helps you budget for storage costs and helps your IT support estimate the storage resources needed.
- Make sure to generate good documentation (e.g., a README file) and metadata together with the data. Follow best practices for folder structure, file naming and versioning systems (see our Data organisation page). Check if your institute provides a (meta)data management system, such as iRODS, DATAVERSE, FAIRDOM-SEEK or OSF.
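To give IT support an order-of-magnitude estimate of your raw data volume, you could measure a pilot dataset and extrapolate to the planned number of runs. The sketch below assumes that every run produces roughly the same volume, which may not hold for your project; the function names are hypothetical and decimal units (1 GB = 1000 MB) are used.

```python
from pathlib import Path

UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files below root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def human_readable(n_bytes: float) -> str:
    """Format a byte count using decimal units, e.g. 1500 -> '1.5 KB'."""
    size = float(n_bytes)
    for unit in UNITS:
        if size < 1000 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


def extrapolate(pilot_bytes: int, pilot_runs: int, planned_runs: int) -> int:
    """Scale a pilot measurement linearly to the planned number of runs."""
    if pilot_runs <= 0:
        raise ValueError("pilot_runs must be positive")
    return pilot_bytes * planned_runs // pilot_runs


if __name__ == "__main__":
    # Hypothetical numbers: 2 pilot runs produced 3 GB, 100 runs are planned.
    print(human_readable(extrapolate(3_000_000_000, 2, 100)))
```

In practice you would pass the result of `directory_size_bytes` on your pilot data folder as `pilot_bytes`.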
How do you estimate computational resources for data processing and analysis?
Description
In order to process and analyse your data, you will need access to computational resources. These range from your laptop and local compute clusters to High Performance Computing (HPC) infrastructures. However, it can be difficult to estimate the amount of computational resources needed for a process or an analysis.
Considerations
Below, you can find some aspects that you need to consider to estimate the computational resources needed for data processing and analysis.
- The volume of total data is an important discerning factor to estimate the computational resources needed.
- Consider how much data volume you need “concurrently or at once”. For example, consider the possibility to analyse a large dataset by downloading or accessing only a subset of the data at a time (e.g., stream 1 TB at a time from a big dataset of 500 TB).
- Define the expected speed and the reliability of connection between storage and compute.
- Determine which software you are going to use. If it is proprietary software, you should check possible licensing issues. Check if it only runs on specific operating systems (Windows, macOS, Linux,…).
- Establish if and what reference datasets you need.
- In the case of collaborative projects, define who can access the data and the computational resources for analysis (specify from what device, if possible). Check policies about data access between different countries. Try to establish a versioning system.
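The idea of working on only a subset of the data at a time can be sketched as reading a file in fixed-size chunks, so that memory use stays bounded regardless of the total file size. This is a minimal illustration, not a recommendation for a specific tool; `iter_chunks`, `count_lines` and the default chunk size of 64 MiB are assumptions made for the example.

```python
from pathlib import Path
from typing import Iterator


def iter_chunks(path: Path, chunk_bytes: int = 64 * 1024 * 1024) -> Iterator[bytes]:
    """Yield the file content piece by piece, at most chunk_bytes at a time."""
    with path.open("rb") as fh:
        while True:
            chunk = fh.read(chunk_bytes)
            if not chunk:  # empty bytes object means end of file
                break
            yield chunk


def count_lines(path: Path, chunk_bytes: int = 64 * 1024 * 1024) -> int:
    """Count newline characters without loading the whole file into memory."""
    return sum(chunk.count(b"\n") for chunk in iter_chunks(path, chunk_bytes))
```

The same pattern applies at larger scale: the amount of data held "at once" is set by the chunk size, not by the size of the full dataset.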
Solutions
- Try to estimate the volume of:
- raw data files necessary for the process/analysis;
- data files generated during the computational analysis as intermediate files;
- results data files.
- Communicate your expectations about speed and the reliability of connection between storage and compute to the IT team. This could depend on the communication protocols that the compute and storage systems use.
- It is recommended to ask colleagues or bioinformatics support staff who have done similar work before how long the analysis is likely to take. This could save you money and time.
- If you need some reference datasets (e.g., reference genomes such as the human genome), ask IT if they provide them, or consult bioinformaticians who can set up automated retrieval of public reference datasets.
- For small data files and private projects, using the computational resources of your own laptop might be fine, but make sure to preserve the reproducibility of your work by using data analysis software such as Galaxy or R Markdown.
- For small data volume and small collaborative projects, a commonly accessible cloud storage, such as Nextcloud (on-premises) or ownCloud might be fine. Adhere to the policies of your institute.
- For large data volume and bigger collaborative projects, you need a large storage volume on fast hardware that is closely tied to a computational resource accessible to multiple users, such as Rucio, tranSMART, Semares or Research Data Management Platform (RDMP).
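When discussing volumes and connection speed with the IT team, a back-of-the-envelope calculation can help. The sketch below adds up raw, intermediate and results data, and estimates how long it would take to move a given volume over a network link. The `efficiency` factor of 0.7 is an assumption standing in for protocol overhead and shared links; real throughput depends on the protocols and systems involved.

```python
def total_volume_gb(raw_gb: float, intermediate_gb: float, results_gb: float) -> float:
    """Total storage needed for raw, intermediate and results files (in GB)."""
    return raw_gb + intermediate_gb + results_gb


def transfer_hours(volume_gb: float, link_gbit_per_s: float, efficiency: float = 0.7) -> float:
    """Estimate the hours needed to move volume_gb over a link.

    1 GB is 8 Gbit, so seconds = volume_gb * 8 / effective Gbit/s.
    The default efficiency is an illustrative assumption.
    """
    if link_gbit_per_s <= 0 or efficiency <= 0:
        raise ValueError("link speed and efficiency must be positive")
    seconds = volume_gb * 8 / (link_gbit_per_s * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical project: 500 GB raw, 1500 GB intermediate, 50 GB results.
    total = total_volume_gb(500, 1500, 50)
    print(f"{total:.0f} GB, about {transfer_hours(total, 1.0):.1f} h on a 1 Gbit/s link")
```

Intermediate files are often larger than the raw data, so it is worth estimating them separately rather than assuming they are negligible.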
Where should you store the data after the end of the project?
Description
After the end of the project, all the relevant (meta)data (to guarantee reproducibility) should be preserved for a certain amount of time, which is usually defined by funder or institutional policy. However, where to preserve data that are no longer needed for active processing or analysis is a common question in data management.
Considerations
- Data preservation doesn’t refer to a place nor to a specific storage solution, but rather to the way or “how” data can be stored. As described in our Preservation page, numerous precautions need to be implemented by people with a variety of technical skills to preserve data.
- Estimate the volume of the (meta)data files that need to be preserved after the end of the project. Consider using a compressed file format to minimize the data volume.
- Define the amount of time (hours, days…) that you could wait in case the data needs to be reanalysed in the future.
- It is a good practice to publish your data in public data repositories. Usually, data publication in repositories is a requirement for scientific journals and funders. Repositories preserve your data for a long time, sometimes for free. See our Data publication page for more information.
- Institutes or universities could have specific policies for data preservation. For example, your institute can ask you to preserve the data internally for 5 years after the project, even if the same data is available in public repositories.
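To see how much a compressed file format could reduce the volume to be preserved, you could compress a representative file and compare sizes. The sketch below uses gzip from the Python standard library purely as an example; which format is appropriate depends on your data type and on what the archive or repository accepts. `gzip_file` and `compression_ratio` are hypothetical helper names.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(source: Path) -> Path:
    """Write a gzip-compressed copy next to source and return its path."""
    target = source.with_name(source.name + ".gz")
    with source.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams the data, so large files are fine
    return target


def compression_ratio(source: Path, compressed: Path) -> float:
    """Compressed size divided by original size (smaller is better)."""
    original_size = source.stat().st_size
    if original_size == 0:
        raise ValueError("source file is empty")
    return compressed.stat().st_size / original_size
```

Text-based formats usually compress well, whereas files that are already compressed (many image and binary formats) gain little, so test on your own data before budgeting.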
Solutions
- Based on the funder or institutional policy about data preservation, the data volume and the retrieval time span, discuss with the IT team what preservation solutions they can offer (e.g., data archiving services in your country) and the costs, so that you can budget for them in your DMP.
- Publish your data in public repositories, and they will preserve the data for you.
Related pages
More information
Links to FAIR Cookbook
FAIR Cookbook is an online, open and live resource for the Life Sciences with recipes that help you to make and keep data Findable, Accessible, Interoperable and Reusable; in one word FAIR.
Links to DSW
With Data Stewardship Wizard (DSW), you can create, plan, collaborate, and bring your data management plans to life with a tool trusted by thousands of people worldwide — from data management pioneers, to international research institutes.
Training
Tools and resources on this page
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
Box | Cloud storage and file sharing service | Data transfer | Training |
CERNBox | CERNBox cloud data storage, sharing and synchronization | ||
CS3 | Cloud Storage Services for Synchronization and Sharing (CS3) | ||
DATAVERSE | Open source research data repository software. | Plant Phenomics Plant sciences Machine actionability | Training |
Dropbox | Cloud storage and file sharing service | Data transfer Documentation and meta... | |
FAIRDOM-SEEK | A data Management Platform for organising, sharing and publishing research datasets, models, protocols, samples, publications and other research outcomes. | NeLS Plant Phenomics Microbial biotechnology Plant sciences Documentation and meta... | Tool info |
Galaxy | Open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses. | Marine Metagenomics Single-cell sequencing Data analysis Data provenance | Tool info Training |
iRODS | Integrated Rule-Oriented Data System (iRODS) is open source data management software for a cancer genome analysis workflow. | TransMed Bioimaging data | Tool info |
Microsoft OneDrive | Cloud storage and file sharing service from Microsoft | Data transfer | |
Nextcloud | As fully on-premises solution, Nextcloud Hub provides the benefits of online collaboration without the compliance and security risks | ||
OpenStack | OpenStack is an open source cloud computing infrastructure software project and is one of the three most active open source projects in the world | Data analysis | Training |
OSF | OSF (Open Science Framework) is a free, open platform to support your research and enable collaboration. | Training | |
ownCloud | Cloud storage and file sharing service | Data transfer | |
R Markdown | R Markdown documents are fully reproducible. Use a productive notebook interface to weave together narrative text and code to produce elegantly formatted output. Use multiple languages including R, Python, and SQL. | Training | |
Research Data Management Platform (RDMP) | Data management platform for automated loading, storage, linkage and provision of data sets | Tool info | |
Rucio | Rucio - Scientific Data Management | Data transfer | |
ScienceMesh | ScienceMesh - frictionless scientific collaboration and access to research services | Data transfer | |
SeaFile | SeaFile File Synchronization and Share Solution | Data transfer | |
Semares | All-in-one platform for life science data management, semantic data integration, data analysis and visualization | Documentation and meta... | |
tranSMART | Knowledge management and high-content analysis platform enabling analysis of integrated data for the purposes of hypothesis generation, hypothesis validation, and cohort discovery in translational research. | TransMed | Tool info |
National resources
Tools and resources tailored to users in different countries.
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
Flemish Supercomputing Center (VSC) | VSC is Flanders’ most highly integrated high-performance research computing environment, providing world-class services to government, industry, and researchers. | Data Steward Research Software Engi... Data analysis | |
OLOS | OLOS is a Swiss-based data management portal, to help Swiss researchers safely manage, publish and preserve their data. | Data publication | |
SWISSUbase | SWISSUbase is a national cross-disciplinary solution for Swiss universities and other research organizations in need of local institutional data repositories for their researchers. The platform relies on international archiving standards and processes to ensure that data are preserved and accessible in the long-term. | Data publication | |
Czech National Repository | National Repository (NR) is a service provided to the scientific and research communities in the Czech Republic to store their generated research data together with persistent DOI identifier. NR service is currently under the pilot program. | Researcher Data Steward Research Software Engi... Existing data Identifiers Data management plan | |
e-INFRA CZ (Supercomputing and Data Services) | e-INFRA CZ provides integrated high-performance research computing/data storage environment, providing world-class services to government, industry, and researchers. It also cooperates with European Open Science Cloud (EOSC) implementation in the Czech Republic. | Data Steward Research Software Engi... Data analysis | |
ownCloud@CESNET | CESNET-hosted ownCloud provides 100 GB of cloud storage, freely available to Czech scientists for managing data from any research project. (Related tool: ownCloud) | Researcher Research Software Engi... Data organisation | |
GHGA | The German Human Genome-Phenome Archive. | Documentation and meta... Researcher Data Steward | |
Fairdata.fi | With the Fairdata Services you can store, share and publish your research data with easy-to-use web tools. | CSC Researcher Data Steward Data publication Existing data | |
Sensitive Data Services for Research | CSC Sensitive Data Services for Research are designed to support secure sensitive data management through web-user interfaces accessible from the user’s own computer. | CSC Researcher Data Steward Data sensitivity Data analysis Data publication Human data | |
BBMRI catalogue | Biobanking Netherlands makes biosamples, images and data findable, accessible and usable for health research. | Human data Researcher Data analysis Existing data | |
cBioPortal for Cancer Genomics | cBioPortal provides a web-based resource for researchers to explore, visualize, analyze, and share multidimensional cancer genomic datasets, as well as other studies involving multidimensional genomic data. | Human data Researcher Data analysis Existing data | |
Health-RI Service Catalogue | Health-RI provides a set of tools and services available to the biomedical research community. | Human data Researcher Data analysis Existing data | |
Educloud Research | Educloud Research is a platform provided by the Centre for Information Technology (USIT) at the University of Oslo (UiO). This platform provides access to a work environment accessible to collaborators from other institutions or countries. This service provides a storage solution and a low-threshold HPC system that offers batch job submission (SLURM) and interactive nodes. Data up to the red classification level can be stored/analysed. | Data analysis Data sensitivity | |
HUNTCloud | The HUNT Cloud, established in 2013, aims to improve and develop the collection, accessibility and exploration of large-scale information. HUNT Cloud offers cloud services and lab management. It is a key service that has established a framework for data protection, data security, and data management. HUNT Cloud is owned by NTNU and operated by HUNT Research Centre at the Department of Public Health and Nursing at the Faculty of Medicine and Health Sciences. | Human data Data analysis Data sensitivity | |
NIRD | The National Infrastructure for Research Data (NIRD) infrastructure offers storage services, archiving services, and processing capacity for computing on the stored data. It offers services and capacities to any scientific discipline that requires access to advanced, large-scale, or high-end resources for storing, processing, publishing research data or searching digital databases and collections. This service is owned and operated by Sigma2 NRIS, which is a joint collaboration between UiO, UiB, NTNU, UiT, and UNINETT Sigma2. | Data transfer NeLS FAIRtracks | |
Norwegian Research and Education Cloud (NREC) | NREC is an Infrastructure-as-a-Service (IaaS) project, commonly referred to as a cloud infrastructure, between the University of Bergen and the University of Oslo, with additional contributions from NeIC (Nordic e-Infrastructure Collaboration) and Uninett. An IaaS is a self-service infrastructure where you spawn standardized servers and storage instantly, as needed, from a given resource quota. (Related tool: OpenStack) | Data analysis | |
SAFE | SAFE (secure access to research data and e-infrastructure) is the solution for the secure processing of sensitive personal data in research at the University of Bergen. SAFE is based on the “Norwegian Code of conduct for information security in the health and care sector” (Normen) and ensures confidentiality, integrity, and availability are preserved when processing sensitive personal data. Through SAFE, the IT department offers a service where employees, students and external partners get access to dedicated resources for processing of sensitive personal data. | Human data Data analysis Data sensitivity | |
TSD | The TSD – Service for Sensitive Data, is a platform for collecting, storing, analysing and sharing sensitive data in compliance with the Norwegian privacy regulation. TSD is developed and operated by UiO. | Human data Data analysis Data sensitivity TSD | |
BioData.pt Data Management Portal (DMPortal) | This instance of DataVerse is provided by BioData.pt. We can help you write and maintain data management plans for your research. (Related tool: DATAVERSE) | Researcher Data Steward | |
BioData.pt Service Hub | BioData.pt Service Hub includes several data management resources, tools and services available for researchers in Life Sciences. | Researcher Data Steward Data analysis | |
NAISS | The National Academic Infrastructure for Supercomputing in Sweden (NAISS) is a national research infrastructure that makes available large-scale high-performance computing resources, storage capacity, and advanced user support, for Swedish research. | Data analysis | |