What are the best practices for data analysis?
Description
When carrying out your analysis, keep in mind that it has to be reproducible. This complements your research data management approach: not only your data, but also your tools and analysis environments should be FAIR. In other words, you should be able to tell exactly which data and which code or tools were used to generate your results.
This helps to tackle reproducibility problems and also increases the impact of your research, for instance through collaborations with scientists who reproduce your in silico experiments.
Considerations
There are many ways to make your data analysis reproducible. You can act at several levels:
- by providing your code;
- by providing your execution environment;
- by providing your workflows;
- by providing your data analysis execution.
Solutions
- Make your code available. If you have to develop software for your data analysis, it is always a good idea to publish your code. The Git version control system both tracks the history of your code and lets you release tagged versions of it, and Git-based hosting platforms also let you interact with your software's users. Be sure to specify a license for your code (see the licensing section); a minimal release sketch is given after this list.
- Use a package and environment management system. By using package and environment management systems like Conda and its bioinformatics-specialized channel Bioconda, researchers who have access to your code will be able to easily install specific versions of tools, even older ones, in an isolated environment. They will be able to compile and run your code in an equivalent computational environment, including dependencies such as the correct version of R or the particular libraries and command-line tools your code uses. You can also share and preserve your setup by listing the tools you installed in an environment file.
- Use container environments. As an alternative to package management systems you can consider container environments like Docker or Singularity.
- Use workflow management systems. Scientific workflow management systems help you organize and automate the execution of computational tools. Compared with composing tools in a standalone script, workflow systems also document the different computational analyses applied to your data and can help with scalability, such as cloud execution. They also enhance reproducibility: workflows typically have bindings for specifying the software packages or containers used by each step, allowing others to re-run your workflow without needing to pre-install every piece of software it needs. It is a flourishing field: some systems are general-purpose (handling any command-line tool), while others are domain-specific and have tighter tool integration. Among the many workflow management systems available, one can mention:
- Workflow platforms that manage your data and provide an interface (web, GUI, APIs) to run complex pipelines and review their results, for instance the open source platforms Galaxy and Arvados (based on the Common Workflow Language (CWL)).
- Workflow runners that take a workflow written in a proprietary or standardized format (such as the Common Workflow Language (CWL)) and execute it locally or on a remote compute infrastructure. For instance, CWL in Toil, the reference CWL runner (cwltool), Nextflow, Snakemake, Cromwell.
- Use notebooks. With notebooks you can create reproducible documents that mix text and code, which both helps explain your analysis choices and serves as an exploratory way to examine data in detail. Notebooks can be used in conjunction with the other solutions mentioned above, as a notebook can typically be converted to a script. Some of the best-known notebook systems are Jupyter, with built-in support for code in Python, R and Julia as well as many other Jupyter kernels, and RStudio, based on R. See the table below for additional tools.
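As a minimal sketch of the first point above (assuming an existing project directory; the file names, version tag and repository URL are placeholders):

```bash
# Put the analysis code under version control; the LICENSE file states the terms of reuse
git init
git add analysis.R README.md LICENSE
git commit -m "Analysis code used to generate the results"

# Tag the exact state of the code that produced the published results
git tag -a v1.0.0 -m "Version used for the manuscript"

# Publish to a Git hosting service (placeholder URL) so others can fetch this exact version
git remote add origin https://example.org/lab/my-analysis.git
git push origin main --tags
```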
How can you use package and environment management systems?
Description
By using package and environment management systems like Conda and its bioinformatics-specialized channel Bioconda, you will be able to easily install specific versions of tools, even older ones, in an isolated environment. You can also share and preserve your setup by listing the tools you installed in an environment file.
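A minimal sketch of what such an environment file and the accompanying commands might look like (the environment name, channels and pinned versions are only illustrative):

```bash
# Describe the environment in a file; tool versions shown are examples only
cat > environment.yaml <<'EOF'
name: my-analysis
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10
  - samtools=1.17
  - r-base=4.2
EOF

# Recreate the same set of tools on any machine that has Conda installed
conda env create -f environment.yaml
conda activate my-analysis

# Alternatively, capture an existing environment so that it can be shared
conda env export > environment.yaml
```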
Considerations
Conda works by making a nested folder containing the traditional UNIX directory structure (`bin/`, `lib/`, ...) but installed from Conda's repositories instead of from a Linux distribution.
- As such, Conda enables consistent installation of computational tools independent of your distribution or operating system version. Conda is available for Linux, macOS and Windows, giving a consistent experience across operating systems (although not all software is available for all of them).
- Package management systems work particularly well for installing free and open source software, but can also be useful for creating an isolated environment for installing commercial software packages, for instance if one requires an older Python version than you have pre-installed.
- Conda is one example of a generic package management system, but individual programming languages typically have their own environment management tools and package repositories.
- You may want to consider submitting a release of your own code, or at least the general bits of it, to the package repositories for your programming language.
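As an illustration of that last point for a Python project, a minimal sketch is shown below; the package name, metadata and dependency are placeholders, and it assumes the package code itself is already in place and that the `build` and `twine` tools are installed:

```bash
# Minimal packaging metadata (names and versions are placeholders)
cat > pyproject.toml <<'EOF'
[build-system]
requires = ["setuptools>=61", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "my-analysis-utils"
version = "0.1.0"
dependencies = ["pandas>=1.5"]
EOF

# Build source and wheel distributions, then upload them to a package index
python -m build
python -m twine upload dist/*
```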
Solutions
- MacOS-specific package management systems: Homebrew, MacPorts.
- Windows-specific package management systems: Chocolatey and Windows Package Manager (`winget`).
- Linux distributions also have their own package management systems (`rpm`/`yum`/`dnf`, `deb`/`apt`) that offer a wide variety of tools, but at the cost of less flexibility in tool versions, to ensure the packages can be co-installed.
- Language-specific virtual environments and repositories, including: rvm and RubyGems for Ruby, pip and venv for Python, npm for NodeJS/Javascript, renv and CRAN for R, Apache Maven or Gradle for Java.
- Tips and tricks to navigate the landscape of software package management solutions:
- Manage the software you need in an OS-independent way by listing all relevant packages in your Conda environment via the `environment.yaml` file.
- If you need conflicting versions of some tools/libraries for different operations, make separate Conda environments.
- If you need a few open source libraries for your Python script, none of which require compiling, make a `requirements.txt` and reference `pip` packages (see the sketch after this list).
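For the pure-Python case in the last tip, a minimal sketch using `venv`, `pip` and a `requirements.txt` (package names and versions are only examples):

```bash
# Record the needed libraries with pinned versions (examples only)
cat > requirements.txt <<'EOF'
pandas==1.5.3
matplotlib==3.7.1
EOF

# Create an isolated environment and install exactly those versions into it
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```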
How can you use container environments?
Description
Container environments like Docker or Singularity allow you to easily install specific versions of tools, even older ones, in an isolated environment.
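A minimal sketch of running a pinned tool version through a container runtime; the image name and tag below are illustrative, so check the registry (e.g. Docker Hub or BioContainers) for the exact tag you need:

```bash
# Pull a specific, versioned image rather than the moving "latest" tag
docker pull quay.io/biocontainers/samtools:1.17--h00cdaf9_0

# Run the tool on files in the current directory, mounted into the container
docker run --rm -v "$PWD":/data -w /data \
  quay.io/biocontainers/samtools:1.17--h00cdaf9_0 samtools --version

# The same image can be used without root privileges via Singularity/Apptainer
singularity exec docker://quay.io/biocontainers/samtools:1.17--h00cdaf9_0 samtools --version
```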
Considerations
In short, a container works almost like a virtual machine (VM), in that it re-creates a whole Linux distribution with separation of processes, files and networking.
- Containers are more lightweight than VMs since they don’t virtualize hardware. This allows a container to run with a fixed version of the distribution independent of the host, and have just the right, minimal dependencies installed.
- Container isolation also adds a layer of security which, although not as strong as that of VMs, can reduce the attack vectors. For instance, if the database container were compromised by unwelcome visitors, they would not be able to modify the web server configuration, and the container would not be able to expose additional services to the Internet.
- A big advantage of containers is that there are large registries of community-provided container images.
- Note that modifying things inside a container is harder than in a usual machine, as changes from the image are lost when a container is recreated.
- Typically a container runs just one tool or application; for service deployment this is useful, for instance, to run a MySQL database in a separate container from a NodeJS application (see the sketch below).
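A minimal sketch of that last scenario using Docker Compose; the images, port, credentials and mounted application code are placeholders:

```bash
# Describe the two services in a compose file (values are placeholders)
cat > docker-compose.yml <<'EOF'
services:
  db:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: change-me
    volumes:
      - db-data:/var/lib/mysql   # keep the database files outside the container
  app:
    image: node:20
    command: node /app/server.js
    volumes:
      - ./app:/app
    ports:
      - "3000:3000"              # only the application is exposed
    depends_on:
      - db
volumes:
  db-data:
EOF

# Start both containers together; each one runs a single service
docker compose up -d
```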
Solutions
- Docker is the most well-known container runtime, followed by Singularity. These require (and could be used to access) system administrator privileges to be set up.
- udocker and Podman are user-space alternatives with compatible command-line usage.
- Large registries of community-provided container images include Docker Hub and Red Hat Quay.io. These images are often ready to go, not requiring any additional configuration or installation, allowing your application to quickly make use of open source server solutions.
- BioContainers provides a large selection of bioinformatics tools.
- To customize a Docker image, it is possible to use techniques such as Volumes (to store data) and a `Dockerfile`. This is useful for installing your own application inside a new container image: starting from a suitable base image, you can do your `apt install` and software setup in a reproducible fashion and share your own application as an image on Docker Hub (see the sketch after this list).
- Container linkage can be done by container composition, using tools like Docker Compose.
- More advanced container deployment solutions like Kubernetes and Computational Workflow Management systems can also manage cloud instances and handle analytical usage.
- OpenStack is an open-source platform that uses pooled virtual resources to build and manage private and public clouds. It provides a stable base for deploying and managing containers, allowing for faster application deployment and simplified management.
- Tips and tricks to navigate the landscape of container solutions:
- If you just need to run a database server, describe how to run it as a Docker/Singularity container.
- If you need several servers running, connected together, set up containers in Docker Compose.
- If you need to install many things, some of which are not available as packages, make a new `Dockerfile` recipe to build a container image.
- If you need to use multiple tools in a pipeline, find Conda or container images and compose them in a computational workflow.
- If you need to run tools in a cloud instance that has nothing preinstalled, use Conda or containers to ensure the installation on the cloud VM matches your local machine.
- If you just need a particular open source tool installed, e.g. ImageMagick, check its documentation for how to install it; for Ubuntu 20.04, try `apt install imagemagick`.
- Domain specific solutions that make use of containers to benchmark and reproducibly deploy workflows exist, including BIAFLOWS for bioimage data.
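As referenced in the customization tip above, a minimal `Dockerfile` sketch; the base image, installed packages, script name and image name are placeholders:

```bash
# Write the recipe for the image (base image and packages are examples only)
cat > Dockerfile <<'EOF'
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y --no-install-recommends imagemagick \
    && rm -rf /var/lib/apt/lists/*
COPY analysis.sh /usr/local/bin/analysis.sh
ENTRYPOINT ["/usr/local/bin/analysis.sh"]
EOF

# Build the image and, if you wish, publish it to a registry such as Docker Hub
docker build -t myuser/my-analysis:1.0 .
docker push myuser/my-analysis:1.0
```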
How can you use workflow management systems for reproducible data analysis?
Description
Using containerization together with workflow management systems provides several benefits for data analysis, including:
- Reproducibility: By using containerized environments and workflow management systems, you can ensure that your analysis is reproducible, as the environment in which the analysis is executed is exactly the same each time.
- Portability: Containerized environments can be easily moved between different computing environments, allowing you to execute your analysis on different computing resources or share your analysis with collaborators.
- Scalability: Workflow management systems can be used to execute analyses on large computing clusters (like the EuroHPC supercomputer LUMI) or cloud computing resources, enabling you to scale your analysis as needed.
Considerations
Creating an analysis workflow involves several steps that require careful consideration. The following steps can help you create a workflow and run it locally or in the cloud:
- Before creating a workflow, it is important to define the scope and objectives of the analysis. This will help you to determine the type of data to collect, the analysis methods to use, and the resources required for the analysis.
- After defining the scope and objectives, the next step is to determine the tools and software to use. You need to choose software that is compatible with the type of data you want to analyze and the analysis methods you plan to use.
- Once you have determined the tools and software to use, the next step is to create the workflow. This involves breaking down the analysis process into small, manageable steps that can be automated. Each step should be clearly defined, and the inputs and outputs of each step should be documented.
- If you want to use containers, you can now define the container images for the execution of the entire workflow or for the individual steps.
- After creating the workflow, it is important to test it to ensure that it works as expected. You can do this by running a test dataset through the workflow and checking the outputs to ensure they match the expected results.
- Once you have tested the workflow, the next step is to run it on your dataset. Depending on the size of your data, you can run the workflow locally on your computer or on a remote workflow management system.
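The steps above can be made concrete with a minimal sketch, here written for Snakemake with a per-rule Conda environment; the rule, file paths and environment file are illustrative, and other systems such as Nextflow or CWL offer equivalent mechanisms:

```bash
# A one-rule workflow: count the lines of a FASTQ file (paths are examples only)
cat > Snakefile <<'EOF'
rule count_reads:
    input:
        "data/sample1.fastq"
    output:
        "results/sample1.count"
    conda:
        "envs/count.yaml"       # pins the tools this step needs
    shell:
        "wc -l {input} > {output}"
EOF

# Test locally on a small dataset first; Snakemake builds the Conda environment itself
snakemake --use-conda --cores 1
```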
Solutions
- Most workflow management systems provide detailed tutorials and documentation for creating workflows and for including containerization technologies; see the documentation for Nextflow, Snakemake, Cromwell and CWL.
- The BioContainers project provides a platform for storing and sharing containers that can be used in your workflows.
- The bio.tools repository lists state of the art tools and databases from the field of bioinformatics ordered by collections and communities.
- OpenEBench is a framework for monitoring and benchmarking analysis tools and workflows.
- WorkflowHub and Dockstore are two popular services for sharing and re-using workflows.
- Life-Monitor is a service designed to facilitate the long-term viability and reusability of published computational workflows.
- The ELIXIR Cloud and AAI project supports a framework for executing workflows in the cloud via the standards developed by the GA4GH community.
Related pages
More information
Links to FAIR Cookbook
FAIR Cookbook is an online, open and live resource for the Life Sciences with recipes that help you to make and keep data Findable, Accessible, Interoperable and Reusable; in one word FAIR.
Links to DSW
With Data Stewardship Wizard (DSW), you can create, plan, collaborate, and bring your data management plans to life with a tool trusted by thousands of people worldwide — from data management pioneers, to international research institutes.
Training
Tools and resources on this page
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
Arvados | With Arvados, bioinformaticians run and scale compute-intensive workflows, developers create biomedical applications, and IT administrators manage large compute and storage resources. | ||
BIAFLOWS | BIAFLOWS is an open-source web framework to reproducibly deploy and benchmark bioimage analysis workflows. | Tool info | 
bio.tools | Essential scientific and technical information about software tools, databases and services for bioinformatics and the life sciences. | Tool info Standards/Databases Training | |
Bioconda | Bioconda is a bioinformatics channel for the Conda package manager | Tool info Training | |
BioContainers | Registry of container images for bioinformatics tools | Single-cell sequencing | Tool info Training |
Chocolatey | The Package Manager for Windows | ||
Common Workflow Language (CWL) | An open standard for describing workflows that are built from command line tools | Standards/Databases Training | 
Conda | Open source package management system | Training | |
Cromwell | Cromwell is a Workflow Management System geared towards scientific workflows. | ||
CWL in Toil | The Common Workflow Language CWL is an emerging standard for writing workflows that are portable across multiple workflow engines and platforms. Toil has full support for the CWL v1.0, v1.1, and v1.2 standards. | ||
cwltool | This is the reference implementation of the Common Workflow Language open standards. It is intended to be feature complete and provide comprehensive validation of CWL files as well as provide other tools related to working with CWL. | ||
Docker | Docker is software for the execution of applications in virtualized environments called containers. It is linked to Docker Hub, a library for sharing container images. | Single-cell sequencing | Standards/Databases Training |
Docker Compose overview | Compose is a tool for defining and running multi-container Docker applications. | ||
Docker Hub | Docker Hub is the world's easiest way to create, manage, and deliver your team's container applications. | Standards/Databases Training | |
Dockerfile reference | Docker can build images automatically by reading the instructions from a Dockerfile | ||
Dockstore | Dockstore is a free and open source platform for sharing reusable and scalable analytical tools and workflows. It’s developed by the Cancer Genome Collaboratory and used by the GA4GH. | Tool info Training | |
Galaxy | Open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses. | Marine Metagenomics Single-cell sequencing Data provenance Data storage | Tool info Training |
Homebrew | The Missing Package Manager for macOS or Linux | ||
Jupyter | Jupyter notebooks allow sharing of code and documentation | Training | 
Jupyter kernels | Kernel Zero is IPython, which you can get through ipykernel, and is still a dependency of jupyter. The IPython kernel can be thought of as a reference implementation, as CPython is for Python. | ||
Kubernetes | Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications. | Training | |
Life-Monitor | LifeMonitor is a service to support the sustainability and reusability of published computational workflows. | Training | |
LUMI | EuroHPC world-class supercomputer | Tool info | |
MacPorts | The MacPorts Project is an open-source community initiative to design an easy-to-use system for compiling, installing, and upgrading either command-line, X11 or Aqua based open-source software on the Mac operating system. | ||
Nextflow | Nextflow is a framework for data analysis workflow execution | Tool info Training | |
OpenEBench | ELIXIR benchmarking platform to support community-led scientific benchmarking efforts and the technical monitoring of bioinformatics resources | Tool info | 
OpenStack | OpenStack is an open source cloud computing infrastructure software project and is one of the three most active open source projects in the world. | Data storage | Training |
Podman | Manage containers, pods, and images with Podman. Seamlessly work with containers and Kubernetes from your local environment. | ||
RStudio | RStudio notebooks allow sharing of code and documentation | Tool info Training | 
Singularity | Singularity is a container platform. | Training | |
Snakemake | Snakemake is a framework for data analysis workflow execution | Tool info Training | |
udocker | udocker is a basic user tool to execute simple docker containers in user space without requiring root privileges. | ||
Volumes | Volumes are the preferred mechanism for persisting data generated by and used by Docker containers. | Training | |
Windows Package Manager | Windows Package Manager is a comprehensive package manager solution that consists of a command line tool and set of services for installing applications on Windows 10 and Windows 11. | ||
WorkflowHub | WorkflowHub is a registry for describing, sharing and publishing scientific computational workflows. | Galaxy Data provenance | Tool info Standards/Databases Training |
National resources
Tools and resources tailored to users in different countries.
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
Flemish Supercomputing Center (VSC) | VSC is the Flanders’ most highly integrated high-performance research computing environment, providing world-class services to government, industry, and researchers. | Data Steward Research Software Engi... Data storage | |
Galaxy Belgium | Galaxy Belgium is a Galaxy instance managed by the Belgian ELIXIR node, funded by the Flemish government, which utilizes infrastructure provided by the Flemish Supercomputer Center (VSC). | Galaxy Researcher | |
BioMedIT | A secure IT network for the responsible processing of health-related data. | Human data Data sensitivity | |
OpenRDM.swiss | openRDM.swiss offers research data management as a service to the scientific community, based on the powerful openBIS platform. | | |
Renku | An open-source knowledge infrastructure for collaborative and reproducible data science. | | |
e-INFRA CZ (Supercomputing and Data Services) | e-INFRA CZ provides an integrated high-performance research computing/data storage environment, providing world-class services to government, industry, and researchers. It also cooperates with the European Open Science Cloud (EOSC) implementation in the Czech Republic. | Data Steward Research Software Engi... Data storage | |
Galaxy MetaCentrum | Galaxy MetaCentrum is a Galaxy instance managed by the Czech ELIXIR node and e-INFRA. It provides extra support for the RepeatExplorer tool for plant genomic analysis. | Galaxy Researcher | Tool info |
Galaxy Estonia | This is the Estonian instance of Galaxy, which is an open source, web-based platform for data intensive biomedical research. | Galaxy Researcher | |
Chipster | Chipster is a user-friendly analysis software for high-throughput data such as RNA-seq and single cell RNA-seq. It contains analysis tools and a large reference genome collection. | CSC Researcher Research Software Engi... | |
Cloud computing | CSC offers a variety of cloud computing services: the Pouta IaaS services and the Rahti container cloud service. | CSC Researcher Data Steward | |
High performance computing | The performance of the CSC supercomputers Puhti, Mahti and LUMI ranges from medium-scale simulations to one of the most competitive supercomputers in the world. | CSC Researcher Data Steward | |
Sensitive Data Services for Research | CSC Sensitive Data Services for Research are designed to support secure sensitive data management through web user interfaces accessible from the user’s own computer. | CSC Researcher Data Steward Data sensitivity Data storage Data publication Human data | |
BBMRI catalogue | Biobanking Netherlands makes biosamples, images and data findable, accessible and usable for health research. | Human data Researcher Existing data Data storage | |
cBioPortal for Cancer Genomics | cBioPortal provides a web-based resource for researchers to explore, visualize, analyze, and share multidimensional cancer genomic datasets, as well as other studies involving multidimensional genomic data. | Human data Researcher Existing data Data storage | |
Health-RI Service Catalogue | Health-RI provides a set of tools and services available to the biomedical research community. | Human data Researcher Existing data Data storage | |
Educloud Research | Educloud Research is a platform provided by the Centre for Information Technology (USIT) at the University of Oslo (UiO). It provides a work environment accessible to collaborators from other institutions or countries, as well as a storage solution and a low-threshold HPC system that offers batch job submission (SLURM) and interactive nodes. Data up to the red classification level can be stored/analysed. | Data sensitivity Data storage | |
HUNTCloud | The HUNT Cloud, established in 2013, aims to improve and develop the collection, accessibility and exploration of large-scale information. HUNT Cloud offers cloud services and lab management. It is a key service that has established a framework for data protection, data security, and data management. HUNT Cloud is owned by NTNU and operated by the HUNT Research Centre at the Department of Public Health and Nursing at the Faculty of Medicine and Health Sciences. | Human data Data sensitivity Data storage | |
Meta-pipe | META-pipe is a pipeline for annotation and analysis of marine metagenomics samples, which provides insight into phylogenetic diversity, metabolic and functional potential of environmental communities. | Marine Metagenomics | |
Norwegian Research and Education Cloud (NREC) | NREC, commonly referred to as a cloud infrastructure, is an Infrastructure-as-a-Service (IaaS) project between the University of Bergen and the University of Oslo, with additional contributions from NeIC (Nordic e-Infrastructure Collaboration) and Uninett. An IaaS is a self-service infrastructure where you spawn standardized servers and storage instantly, as needed, from a given resource quota. | OpenStack Data storage | |
SAFE | SAFE (secure access to research data and e-infrastructure) is the solution for the secure processing of sensitive personal data in research at the University of Bergen. SAFE is based on the “Norwegian Code of conduct for information security in the health and care sector” (Normen) and ensures that confidentiality, integrity, and availability are preserved when processing sensitive personal data. Through SAFE, the IT department offers a service where employees, students and external partners get access to dedicated resources for the processing of sensitive personal data. | Human data Data sensitivity Data storage | |
Sigma2 HPC systems | The current Norwegian academic HPC infrastructure consists of three systems for different purposes. The Norwegian academic high-performance computing and storage infrastructure is maintained by Sigma2 NRIS, a joint collaboration between UiO, UiB, NTNU, UiT, and UNINETT Sigma2 (SIKT). | | |
TSD | The TSD (Service for Sensitive Data) is a platform for collecting, storing, analysing and sharing sensitive data in compliance with the Norwegian privacy regulation. TSD is developed and operated by UiO. | Human data Data sensitivity Data storage TSD | |
usegalaxy.no | Galaxy is an open-source, web-based platform for data-intensive biomedical research. This instance of Galaxy is coupled with NeLS for easy data transfer. | Galaxy Data sensitivity Existing data Data publication NeLS | |
BioData.pt Service Hub | BioData.pt Service Hub includes several data management resources, tools and services available for researchers in Life Sciences. | Researcher Data Steward Data storage | |
NAISS | The National Academic Infrastructure for Supercomputing in Sweden (NAISS) is a national research infrastructure that makes available large-scale high-performance computing resources, storage capacity, and advanced user support for Swedish research. | Data storage | |