Development build for ELIXIR-Belgium/rdmkit-sandbox@0fafbdf (branch: contribute-refactor)

Your tasks: Data transfer

How do you transfer large data files?

Description

Often, research in Life Sciences generates massive amounts of digital data, such as output files of ‘omics’ techniques (genomics, transcriptomics, metabolomics, proteomics, etc.). Large data files cannot be sent by email because they exceed the file size limit of most common email servers. Moreover, some data cannot be sent by email due to its sensitive nature. So, how can large data files be transferred from a local computer to a distant one?

Considerations

There are many aspects to consider when dealing with data transfer.

  • The size or volume of the data and the capacity or bandwidth of the network that links your local computer with the distant computer are crucial aspects. Data size and bandwidth are tightly linked: transferring large volumes of data over a low-bandwidth network can be so time-consuming that it is simpler to send the data on a hard drive through carrier services.

  • You need to be aware of the legal and ethical implications of your data transfer.
    • For personal data, you have to ensure compliance with various legal and ethical frameworks, including the GDPR. You might have to establish a data processing or joint data controller agreement before you can transfer the data. We highly recommend checking the human data pages of the RDMkit.
    • For data relevant for later patenting or other types of commercialization you might want to establish a non-disclosure or other type of agreement with the other party to protect your interest.
    • You might also have to consider other laws and regulations, for instance regarding biosecurity of data affecting pathogens or other aspects of potential dual-use.
    • The technical protocol you choose for your data transfer should meet the data security requirements resulting from these implications. You can work with the IT departments at both locations to establish your strategy.
  • If you have the technical skills and knowledge, consider using appropriate File Transfer Protocols.

  • Consider using cloud storage services (see the Data storage page) that provide data sharing solutions, or specialised data transfer services available in your institute or country.

  • Consider the pros and cons of transferring data by shipping hard disks through carrier services (time, costs, security). This is not a recommended method unless a good internet connection is not available.

  • During a transfer, some data might become corrupted. Thus, it is important to check whether the files you transferred have kept their integrity. This can be done with hash algorithms. A checksum is calculated for each file before transfer and compared with the checksum calculated on the transferred copy. If the checksums match, the file was not corrupted.

  • Since data transfer involves so many technical aspects, it is a good idea to interact with your technical/IT team in order to avoid any problem if you want to transfer a large amount of data.
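
The link between data size and bandwidth described above can be made concrete with a rough estimate. The following is a minimal sketch, not an RDMkit tool: the function name `transfer_time_hours` and the `efficiency` parameter (the share of the nominal bandwidth that is actually achieved) are assumptions made for illustration, and real transfers also depend on latency, protocol overhead and disk speed.

```python
def transfer_time_hours(size_gb: float, bandwidth_mbps: float,
                        efficiency: float = 0.8) -> float:
    """Estimate how many hours a network transfer would take.

    size_gb        -- data volume in gigabytes (decimal, 1 GB = 1000 MB)
    bandwidth_mbps -- nominal link speed in megabits per second
    efficiency     -- assumed fraction of the nominal speed actually reached
    """
    if size_gb < 0 or bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, bandwidth > 0, 0 < efficiency <= 1")
    size_megabits = size_gb * 1000 * 8          # gigabytes -> megabits
    seconds = size_megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical case: 1 TB of sequencing output over a 100 Mbit/s link.
    hours = transfer_time_hours(size_gb=1000, bandwidth_mbps=100)
    print(f"Estimated transfer time: {hours:.1f} hours")
```

If such an estimate runs into days, that is a signal to discuss alternatives with your IT team, such as a faster link or a dedicated transfer service, or shipping disks as a last resort.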

Solutions

The preferable transfer channel depends on the volume of your data and the number of files. However, there are several general approaches to help you with the task.

  • Try to optimise and ease your data transfer by archiving your data in a single file. This can be done with two tools available on most systems.
    • tar (tape archive) will create an archive, a single file containing several files or directories.
    • gzip: since tar does not compress the archive created, a compression tool such as gzip is often used to reduce the size of the archive.
  • Ask the IT team of your institution or organisation about available services for data transfer. Usually, for small data volumes or a limited number of files, universities and professional organisations can provide:
    • Secure server- or cloud-based applications where you should store work-related data files, synchronize files from different computers and share files by sending a link for access or download. This solution suits a small number of files, since files need to be downloaded one by one, which becomes inconvenient for many files. Examples of these kinds of applications are NextCloud, Box, ownCloud (see Data storage page).
    • Access to Office 365 (Software as a Service, or SaaS) that includes cloud storage on Microsoft OneDrive, and SharePoint for collaboration and file sharing - you can “transfer” your data with these services by generating and sending a link for access or download of specific files.
    • Cloud synchronization and sharing services (CS3) that can be used in science, education and research have been implemented by companies (e.g. SeaFile), institutions such as CERN (e.g. Reva, Rucio) and initiatives (e.g. ScienceMesh).
  • Usually, universities and institutions strongly discourage the use of personal accounts on Google Drive, Amazon Drive, Dropbox and similar services to share and transfer work-related data, and especially sensitive or personal data. Moreover, it is not allowed to store human data in clouds which are not hosted in the EU.

  • Institutions and professional organisations could also make use of Infrastructure as a Service (IaaS), such as Microsoft Azure, Amazon Web Services (Amazon Simple Storage Service or S3), Oracle Cloud Infrastructure or Google Cloud Platform.

  • A useful comparison of cloud-computing software and providers is on Wikipedia. Cloud-computing infrastructures, services and platforms offer a variety of file hosting services; a comparison of file hosting services is available on Wikipedia.

  • If you are considering transferring data from or to cloud-based services (Microsoft Azure or Amazon S3) by shipping hard disks through carrier services, it is useful to know that services such as Amazon Snowball and Azure Data Box Disk will help you with the shipping of hard disks or appliances through carrier services.

  • Countries could provide national file sender services (browser-based or other) which could be useful for a one-time transfer of data files, limited in number and volume (for instance, up to 100 GB or 250 GB), from person to person. Importantly, an academic account is usually needed to use these kinds of services, so contact the IT team in your institute for more information.

  • If you have the technical skills and the knowledge, you can use the most common data transfer protocols. These protocols are useful for data volumes larger than 50 GB or for hundreds of data files.
    • Protocols suitable for small to mid-size data, available on any operating system and usable either through the command line (directly or with tools like cURL) or through a graphical interface, are:
      • FTP (File Transfer Protocol) will transfer files between a client and an FTP server, which will require an account in order to transfer the files. Plain FTP is not encrypted, so be sure to use a secure alternative, such as FTPS (FTP over TLS) or SFTP (SSH File Transfer Protocol, a separate protocol that runs over SSH). A possible tool with graphical interface is FileZilla.
      • HTTP (HyperText Transfer Protocol).
      • Rsync (remote synchronization) can be used to transfer files between two computers and to keep the files synchronized between these two computers.
      • SCP (secure copy protocol) will securely transfer files between a client and a server. It will require an account on the server and can use SSH key based authentication. A possible tool with graphical interface is WinSCP.
    • For massive amounts of data, additional protocols have been developed that parallelize the flow of data. These transfer solutions require specific tools and as such they are available mostly at large computational centres.
  • Several algorithms can be used for checksum calculation.
    • MD5 checksums can be generated and verified on the command line of all operating systems or through tools with a graphical interface, e.g. MD5Summer for Windows.
    • The SHA-2 family (e.g. SHA-256) is more secure but slower than MD5. SHA checksums can also be generated and verified on the command line of all operating systems.
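
The archiving and checksum steps described in this list can be sketched with the Python standard library, which provides `tarfile` (tar with gzip compression) and `hashlib` (MD5 and SHA-2). This is an illustrative sketch under stated assumptions rather than a recommended implementation: the helper names `make_archive`, `sha256sum` and `is_intact`, and the paths in the usage example, are hypothetical.

```python
import hashlib
import tarfile
from pathlib import Path


def make_archive(source_dir: Path, archive_path: Path) -> Path:
    """Pack a directory into a single gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the directory name, not the full local path.
        tar.add(source_dir, arcname=source_dir.name)
    return archive_path


def sha256sum(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 checksum of a file as a hex string.

    The file is read in chunks so that large files do not need to
    fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, expected_hex: str) -> bool:
    """Compare a file's checksum with the one computed before transfer."""
    return sha256sum(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    archive = make_archive(Path("results"), Path("results.tar.gz"))
    print(sha256sum(archive), archive.name)
```

In this sketch the sender would compute the checksum before the transfer and share it alongside the archive; the recipient would then call `is_intact` on the received file with that value. SHA-256 is used here as one member of the SHA-2 family; `hashlib.md5` could be substituted where only accidental corruption needs to be detected.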

More information

FAIR Cookbook is an online, open and live resource for the Life Sciences with recipes that help you to make and keep data Findable, Accessible, Interoperable and Reusable; in one word FAIR.

With Data Stewardship Wizard (DSW), you can create, plan, collaborate, and bring your data management plans to life with a tool trusted by thousands of people worldwide — from data management pioneers, to international research institutes.

Tool or resource | Description | Related pages | Registry
Amazon Web Services | Amazon Web Services | - | Training
Box | Cloud storage and file sharing service | Data storage | Training
cURL | Command line tool and library for transferring data with URLs | - | Training
Dropbox | Cloud storage and file sharing service | Documentation and meta..., Data storage | -
FileZilla | A free FTP (FTPS and SFTP) solution with graphical interface | - | Training
Globus | High-performance data transfers between systems within and across organizations | - | -
Google Drive | Cloud Storage for Work and Home | - | -
IBM Aspera | With fast file transfer and streaming solutions built on the award-winning IBM FASP protocol, IBM Aspera software moves data of any size across any distance | - | -
Microsoft Azure | Cloud storage and file sharing service from Microsoft | - | -
Microsoft OneDrive | Cloud storage and file sharing service from Microsoft | Data storage | -
ownCloud | Cloud storage and file sharing service | Data storage | -
Reva | Reva connects cloud storages and application providers | - | Tool info, Training
Rucio | Rucio - Scientific Data Management | Data storage | -
ScienceMesh | ScienceMesh - frictionless scientific collaboration and access to research services | Data storage | -
SeaFile | SeaFile File Synchronization and Share Solution | Data storage | -
WinSCP | WinSCP is a popular SFTP client and FTP client for Microsoft Windows! Copy file between a local computer and remote servers using FTP, FTPS, SCP, SFTP, WebDAV or S3 file transfer protocols. | - | -

Tools and resources tailored to users in different countries.

Tool or resource Description Related pages Registry
Belnet

Belnet is the privileged partner of higher education, research and administration for connectivity. We provide high-bandwidth internet access and related services for our specific target groups.

Data Steward Researcher Research Software Engi...
NIRD

The National Infrastructure for Research Data (NIRD) infrastructure offers storage services, archiving services, and processing capacity for computing on the stored data. It offers services and capacities to any scientific discipline that requires access to advanced, large-scale, or high-end resources for storing, processing, publishing research data or searching digital databases and collections. This service is owned and operated by Sigma2 NRIS, which is a joint collaboration between UiO, UiB, NTNU, UiT, and UNINETT Sigma2.

Data storage NeLS FAIRtracks