Your tasks: Data provenance

How to document and track your data provenance?

Description

Provenance is the documentation of why and how the data (but also datasets, computational analyses and other research outputs) were produced, where, when and by whom. Data provenance is often used interchangeably with the term “data lineage”, although their definitions might differ slightly in some contexts. Data provenance/lineage means tracing the movements of the data, and the changes made to them, between their origin and their destination system.

Well-documented data provenance is essential for assessing the authenticity, credibility, trustworthiness, quality (it helps to find errors) and reusability of data, as well as the reproducibility of the results.

However, knowing the best way to document provenance can be challenging because of the large amount and variety of information that needs to be recorded.

Considerations

Provenance is part of documentation and metadata.
Many aspects of data documentation and metadata are related to provenance information, such as history log, versioning, licence, citation, identifiers, etc. Moreover, data provenance is related to several other aspects of data management, namely data access rights, governance, privacy and security.
Provenance information can be recorded:
- as free text and unstructured information (mainly readable by humans, not by machines/software), describing data collection and processing methods.
- according to metadata schemas or standards, which can be generic (e.g. Dublin Core) or discipline-specific (e.g. ISO 19115-2).
- according to the Provenance Data Model (PROV-DM: The PROV Data Model) and the PROV ontology (PROV-O).
As for documentation and metadata, the medium used to capture provenance information can also vary. Provenance trails can be captured:
- in text files or spreadsheets
- in registries or databases
- in dedicated software/platforms (such as LIMS)
- internally and automatically by software tools during their processing activity (such as workflow management systems)
As for documentation and metadata, provenance information can be recorded and displayed/visualised in machine-readable (see Machine actionability page) and/or human-readable form.
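As an illustration of automatic capture during processing, the sketch below writes a small JSON "sidecar" file next to an output file, recording who produced it, when, and from which inputs (with checksums). It is a minimal sketch and not a standard: the function names, the field names and the `.prov.json` suffix are assumptions made for this example only.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_provenance(inputs, output, description, agent):
    """Write a JSON provenance record next to the output file.

    All field names below are hypothetical and chosen for readability;
    they do not follow any particular metadata schema.
    """
    output = Path(output)
    record = {
        "description": description,                            # why / how
        "created_at": datetime.now(timezone.utc).isoformat(),  # when
        "created_by": agent,                                   # by whom
        "python_version": platform.python_version(),           # with what
        "inputs": [
            {"path": str(p), "sha256": sha256_of(Path(p))} for p in inputs
        ],
        "output": {"path": str(output), "sha256": sha256_of(output)},
    }
    sidecar = output.with_name(output.name + ".prov.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

A processing script could call `record_provenance(["raw.csv"], "clean.csv", "Removed duplicate rows", "Jane Doe")` as its last step, once the output file exists. The resulting record is both human-readable and machine-readable, but it only becomes interoperable when its fields are mapped to an agreed schema or to PROV.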

Solutions

Record provenance according to schemas or defined profiles. These can be generic or domain-specific, and can be found in RDA Standards or FAIRsharing. Use metadata schemas that include provenance information in your README file and in any other data documentation or metadata file. Best practices for documentation and metadata, and for data organisation, should be applied to provenance files as well.
Implement a serialisation of the PROV model in your data management tools to record provenance in a machine-actionable format (RDF, Linked Data, OWL, XML, etc.).
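To make this concrete, the following sketch builds a small provenance description with PROV-O terms and serialises it as JSON-LD, one of several possible RDF serialisations. It uses only the Python standard library. The `ex:` identifiers, the `https://example.org/` namespace and the function name are placeholders assumed for illustration; a real tool would use its own persistent identifiers.

```python
import json

PROV_NS = "http://www.w3.org/ns/prov#"
XSD_NS = "http://www.w3.org/2001/XMLSchema#"


def build_prov_document(raw_id, clean_id, activity_id, agent_id, started, ended):
    """Describe one processing step (raw -> clean) using PROV-O terms."""
    def ref(identifier):
        # JSON-LD node reference
        return {"@id": identifier}

    def timestamp(value):
        # Typed literal for xsd:dateTime values
        return {"@value": value, "@type": "xsd:dateTime"}

    return {
        "@context": {
            "prov": PROV_NS,
            "xsd": XSD_NS,
            "ex": "https://example.org/",  # placeholder namespace (assumption)
        },
        "@graph": [
            {"@id": raw_id, "@type": "prov:Entity"},
            {"@id": agent_id, "@type": "prov:Agent"},
            {
                "@id": activity_id,
                "@type": "prov:Activity",
                "prov:used": ref(raw_id),
                "prov:wasAssociatedWith": ref(agent_id),
                "prov:startedAtTime": timestamp(started),
                "prov:endedAtTime": timestamp(ended),
            },
            {
                "@id": clean_id,
                "@type": "prov:Entity",
                "prov:wasDerivedFrom": ref(raw_id),
                "prov:wasGeneratedBy": ref(activity_id),
                "prov:wasAttributedTo": ref(agent_id),
            },
        ],
    }


document = build_prov_document(
    raw_id="ex:raw.csv",
    clean_id="ex:clean.csv",
    activity_id="ex:cleaning-run-1",
    agent_id="ex:jane-doe",
    started="2024-01-15T09:00:00Z",
    ended="2024-01-15T09:05:00Z",
)
print(json.dumps(document, indent=2))
```

The three PROV core classes appear here: entities (the files), an activity (the cleaning run) and an agent (the person responsible). Because the output is plain JSON-LD, it can be loaded by any RDF library and converted to Turtle or RDF/XML if a different serialisation is preferred.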
Use RO-Crate specifications and/or specific profiles for provenance (e.g., RO-Crate profiles to capture the provenance of workflow runs).
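As a rough illustration, the sketch below generates a minimal `ro-crate-metadata.json` in which a `CreateAction` links an input file to the output derived from it. It assumes RO-Crate 1.1 and uses only the Python standard library; the file names, the `#cleaning-run` identifier, the ORCID-style person identifier and the function name are hypothetical. It does not claim conformance to any specific workflow-run profile, which would add further requirements.

```python
import json


def build_ro_crate(name, description, date_published, license_url,
                   input_file, output_file, person_id, person_name, run_end):
    """Return RO-Crate 1.1 metadata describing one provenance step."""
    run_id = "#cleaning-run"  # hypothetical local identifier for the action
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # Metadata descriptor: identifies this file as RO-Crate metadata
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # Root dataset: the crate as a whole
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "description": description,
                "datePublished": date_published,
                "license": {"@id": license_url},
                "hasPart": [{"@id": input_file}, {"@id": output_file}],
                "mentions": [{"@id": run_id}],
            },
            {"@id": input_file, "@type": "File"},
            {"@id": output_file, "@type": "File"},
            {"@id": person_id, "@type": "Person", "name": person_name},
            {   # Provenance: who turned which input into which result, and when
                "@id": run_id,
                "@type": "CreateAction",
                "name": "Data cleaning",
                "endTime": run_end,
                "agent": {"@id": person_id},
                "object": [{"@id": input_file}],
                "result": [{"@id": output_file}],
            },
        ],
    }


crate = build_ro_crate(
    name="Example cleaned dataset",
    description="Raw measurements and the cleaned table derived from them.",
    date_published="2024-01-15",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    input_file="raw.csv",
    output_file="clean.csv",
    person_id="https://orcid.org/0000-0000-0000-0000",  # placeholder identifier
    person_name="Jane Doe",
    run_end="2024-01-15T09:05:00Z",
)
print(json.dumps(crate, indent=2))
```

Saving the printed JSON as `ro-crate-metadata.json` in the same folder as `raw.csv` and `clean.csv` would turn that folder into a crate whose provenance can be read by both people and software.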
Make use of tools and software that help you record provenance in a manual or an automated way. Use:
- Electronic Data Capture (EDC) systems, Laboratory Information Management Systems (LIMS) or similar tools.
- Workflow management systems (such as Kepler, Galaxy, Taverna, VisTrails); the provenance information embedded in such software or tools is usually available to users of the same tool, or can be exported as a separate file in several formats, such as Research Object Crate (RO-Crate).
- Registries such as WorkflowHub.

More information

Tools and resources on this page

Tool or resource	Description	Related pages	Registry
FAIRsharing	A curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.	FAIRtracks Health data Microbial biotechnology Plant sciences Data publication Existing data Machine actionability Documentation and meta...	Standards/Databases Training
Galaxy	Open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses.	Marine Metagenomics Single-cell sequencing Data analysis Data storage	Tool info Training
PROV-DM: The PROV Data Model	PROV-DM is the conceptual data model that forms a basis for the W3C provenance (PROV) family of specifications.
RDA Standards	Directory of standard metadata, divided into different research areas	Documentation and meta...
Research Object Crate (RO-Crate)	RO-Crate is a lightweight approach to packaging research data with their metadata, using schema.org. An RO-Crate is a structured archive of all the items that contributed to the research outcome, including their identifiers, provenance, relations and annotations.	Galaxy Microbial biotechnology	Standards/Databases
WorkflowHub	WorkflowHub is a registry for describing, sharing and publishing scientific computational workflows.	Galaxy Data analysis	Tool info Standards/Databases Training

National resources

Tools and resources tailored to users in different countries.

Tool or resource	Description	Related pages	Registry
eLab BioData.pt	An electronic lab notebook (ELN) for the BioData.pt community.	Researcher Data Steward Principal Investigator... Documentation and meta... Data quality Project data managemen... Machine actionability

Contributors