Taverna has now moved to the Apache Software Foundation. For updated information, see Apache Taverna (incubating).

Provenance management

Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness.

W3C PROV-Overview

For a scientific workflow system, provenance can have several aspects:

  1. Provenance of the workflow definition
  2. Provenance of a workflow run
  3. Provenance of data

Provenance of workflow definitions

Taverna does not capture provenance of editing a workflow definition, but assumes the scientist manages the evolution of workflow definitions through existing means for versioning files, such as filenames and folders, version control systems like git, or workflow sharing websites like myExperiment.

Within Taverna, a workflow can be annotated to give attribution to the Authors of a workflow (or nested workflow). We recommend separating multiple authors with commas or line feeds.

Taverna’s workflow file format has an internal workflow identifier (UUID) which is updated for every workflow change. A log of previous workflow identifiers is included within the workflow definition formats t2flow and Taverna 3 workflow bundle, allowing detection of workflows with common ancestry.
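The ancestry check described above can be sketched as a set comparison. This is a minimal illustration, not Taverna's own implementation: it assumes the current identifier and the log of previous identifiers have already been read out of the workflow file, and the function names and sample identifiers are hypothetical.

```python
from typing import Iterable, Set


def identifier_history(current_id: str, previous_ids: Iterable[str]) -> Set[str]:
    """Return every identifier a workflow has had, including its current one."""
    return {current_id, *previous_ids}


def share_ancestry(history_a: Set[str], history_b: Set[str]) -> bool:
    """Two workflows share ancestry if any identifier occurs in both histories."""
    return not history_a.isdisjoint(history_b)


# Hypothetical identifiers; real ones would be UUIDs taken from the files.
# Alice and Bob each edited a copy of the same original workflow ("id-original").
alice = identifier_history("id-alice-2", ["id-original", "id-alice-1"])
bob = identifier_history("id-bob-1", ["id-original"])
unrelated = identifier_history("id-other-1", ["id-other-0"])

alice_and_bob_related = share_ancestry(alice, bob)
alice_and_unrelated_related = share_ancestry(alice, unrelated)
```

Because every edit adds a new identifier while keeping the old ones in the log, two copies that diverged from the same starting point still have at least one identifier in common.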

Provenance of workflow runs

Taverna can capture provenance of workflow runs, including individual processor iterations and their inputs and outputs. This provenance is kept in an internal database, which is used to populate Previous runs and Intermediate results in the Results perspective in the Taverna Workbench.

The provenance trace can be used by the Taverna-PROV plugin to export the workflow run, including the output and intermediate values, and the provenance trace as a PROV-O RDF graph which can be queried using SPARQL and processed with other PROV tools, such as the PROV Toolbox.

We are planning to extend myExperiment to handle uploading of such provenance traces, which would give a mechanism to present and browse values and details of a workflow run within the browser.

This presentation about Taverna’s provenance support gives an overview of the model and software architecture.

Provenance of data

Scientists using Taverna to perform analysis are often less concerned about the detailed provenance of a workflow run, which semantically just describes inputs and outputs to a chain of processes, and are more interested in the derivation and attribution of the data involved in a workflow. For instance, a workflow might perform text-mining on a biomedical article to extract gene names, and then retrieve the genome sequences for those genes by looking them up in a database. The sequences can then be said to be derived from that database and should (according to the license of the web service) also be attributed to its maintainers. The list of sequences can be said to be derived from the biomedical article.
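The text-mining example can be written down as a handful of statements using the PROV-O properties prov:wasDerivedFrom and prov:wasAttributedTo. The sketch below is only an illustration: the entity names are hypothetical, the statements are plain Python tuples rather than an RDF graph, and following derivation links through several steps is an assumption made here to mirror the example (PROV itself does not require derivation to be transitive).

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical entity names standing in for the data in the example.
TRIPLES: List[Triple] = [
    ("gene_names", "prov:wasDerivedFrom", "article"),
    ("sequences", "prov:wasDerivedFrom", "gene_names"),
    ("sequences", "prov:wasDerivedFrom", "sequence_database"),
    ("sequences", "prov:wasAttributedTo", "database_maintainers"),
]


def objects_of(entity: str, predicate: str, triples: List[Triple]) -> Set[str]:
    """Return the objects directly linked to `entity` by `predicate`."""
    return {o for s, p, o in triples if s == entity and p == predicate}


def all_sources(entity: str, triples: List[Triple]) -> Set[str]:
    """Follow prov:wasDerivedFrom links step by step and collect every source."""
    seen: Set[str] = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for source in objects_of(current, "prov:wasDerivedFrom", triples):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


sequence_sources = all_sources("sequences", TRIPLES)
sequence_credit = objects_of("sequences", "prov:wasAttributedTo", TRIPLES)
```

Here `sequence_sources` contains the database, the gene names and the article, while `sequence_credit` names the maintainers who should be attributed. None of this information is present in a trace that only records which processor consumed and produced which value.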

The typical use of Taverna workflows is to combine web services “in the wild” (say, found on BioCatalogue, http://www.biocatalogue.org/) with local tools. Neither of these will typically have any facility to provide such “science-level provenance”. myGrid is planning to support such data provenance in several ways:

  1. Merging and propagation of PROV-AQ provided provenance traces for REST services (including matching data identity) — “white-box service”
  2. A provenance “backchannel” for Components, which can be populated either by the underlying service directly or by shims within the component. This allows higher-level provenance that is meaningful for a set of components instead of service-specific execution details.
  3. Annotation of workflow fragments by common motifs, which can provide higher-level provenance for data generated by the workflow

The paper Enhancing and Abstracting Scientific Workflow Provenance for Data Publishing (doi 10.1145/2457317.2457370) details these approaches.

Collaborations

myGrid actively participated in the W3C Provenance Working Group, which developed the PROV family of standards. The Taverna-PROV plugin, developed for Taverna, allows the export of workflow run provenance as PROV-O RDF.

The wf4ever project is investigating the sharing of workflows and workflow runs as research objects. Of particular relevance to Taverna is the development of the Research Object Bundle, which will form a single archive of a workflow run, including run provenance, inputs, outputs, intermediate values, workflow definition and (for Taverna 3) information about the run environment.

Past collaborations

Since early 2010, we have been invited partners of the NSF DataONE project, dedicated to large-scale preservation of scientific data, and founding members of the Workflow and Provenance Working Group promoted by the project, along with Prof. Ludaescher at UC Davis, USA, and Juliana Freire at University of Utah, USA.

Historically, work on provenance within the myGrid consortium and Taverna team has focused on multiple aspects, beginning with the design and implementation of Janus, a data model and software component for provenance capture and analysis for Taverna. Our research in this area is often pursued in collaboration with external partners:

  • A model and architecture for capturing provenance. We have designed a data model for Janus that is specific to Taverna but can also be exported to other models, notably the Open Provenance Model (OPM), to enable interoperability with third-party provenance-generating systems. Taverna has been retrofitted with provenance generation capabilities.
  • An expressive provenance query language and efficient query processing model for large provenance graphs.
  • Investigation into provenance interoperability and exchange, using the OPM. The Taverna provenance component now exports data as OPM graphs, and can also import OPM graphs (with basic features) received from third parties. We have also been working with the Kepler group on a project to promote provenance interoperability, in collaboration with Prof. Ludaescher at UC Davis, CA, and Ilkay Altintas at UCSD, CA.
  • Investigation into the role of semantics and of Linked Open Data (LOD) in provenance modelling and management, in collaboration with the Knoesis Centre at Wright State University, Ohio (Prof. Amit Sheth, Dr. Satya Sahoo) and with Jun Zhao of Oxford University.

Other past collaborations on the topic of provenance include: