Provenance management

A growing community of scientists from many disciplines, notably the life sciences, has been reaping the benefits of e-science IT infrastructures for the design and automatic execution of in silico experiments. Workflow management systems such as Taverna have been at the forefront of this technology. It is becoming increasingly clear to e-scientists that, in addition to producing interesting results from their experiments, the computing infrastructure should also support further investigation into the nature of those results. In bioinformatics, for example, scientists may run a workflow to correlate genes with other types of biological objects, such as metabolic pathways. In this case, the system should help the scientist understand why a particular pathway appears in the output of the experiment, and with which of the input genes it is associated.

The role of provenance collection and analysis is to help answer this type of question, as well as others that concern how the experiment results were obtained. At a technical level, this type of analysis involves: (a) collecting and persistently storing as much detailed information about a workflow run as possible, and (b) querying this trace information to answer the scientist’s questions about the provenance of their data.
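The two steps can be illustrated with a minimal Python sketch. It is not Taverna's or Janus's actual data model: the `Invocation` and `Trace` classes, the processor name `lookup_pathways`, and the gene and pathway identifiers are all hypothetical, and the trace is kept in memory rather than persistently stored. It shows only the general idea of recording which values each workflow step consumed and produced, then walking those records backwards to find the inputs behind a given output.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Invocation:
    """One execution of a workflow step (hypothetical record layout)."""
    processor: str
    inputs: tuple
    outputs: tuple


@dataclass
class Trace:
    """(a) Collects the invocations observed during a single workflow run."""
    invocations: list = field(default_factory=list)

    def record(self, processor, inputs, outputs):
        self.invocations.append(Invocation(processor, tuple(inputs), tuple(outputs)))

    def lineage(self, value):
        """(b) Return the set of workflow inputs that `value` depends on."""
        # Index every value by the invocations that produced it.
        producers = {}
        for inv in self.invocations:
            for out in inv.outputs:
                producers.setdefault(out, []).append(inv)

        sources, seen, stack = set(), set(), [value]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current not in producers:
                # No step produced this value, so it is a workflow input.
                sources.add(current)
            else:
                for inv in producers[current]:
                    stack.extend(inv.inputs)
        return sources


# Hypothetical run: three genes are mapped to pathways, two of them to the same one.
trace = Trace()
trace.record("lookup_pathways", ["gene:A"], ["pathway:P1"])
trace.record("lookup_pathways", ["gene:B"], ["pathway:P2"])
trace.record("lookup_pathways", ["gene:C"], ["pathway:P1"])

print(sorted(trace.lineage("pathway:P1")))
```

Asking for the lineage of `pathway:P1` answers the question posed in the gene and pathway example: which of the input genes account for that pathway appearing in the output.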

More detailed documentation about provenance management in Taverna is available from our wiki. Technical documentation on:

is also available.

Current collaborations

Work on provenance within the myGrid consortium and Taverna team has been focusing on multiple aspects, beginning with the design and implementation of Janus, a data model and software component for provenance capture and analysis for Taverna. Our research in this area is often pursued in collaboration with external partners:

  • A model and architecture for capturing provenance. We have designed a data model for Janus that is specific to Taverna, yet can also be exported to other models, notably the Open Provenance Model (OPM), to enable interoperability with third-party provenance-generating systems. Taverna has been retrofitted with provenance generation capabilities.
  • An expressive provenance query language and efficient query processing model for large provenance graphs [1].
  • Investigation into provenance interoperability and exchange, using the OPM. The Taverna provenance component now exports data as OPM graphs, and can also import OPM graphs (with basic features) received from third parties. We have also been working with the Kepler group on a project to promote provenance interoperability, in collaboration with Prof. Ludaescher at UC Davis, CA, and Ilkay Altintas at UCSD, CA [2], [3].
  • Investigation into the role of semantics and of Linked Open Data (LOD) in provenance modelling and management, in collaboration with the Knoesis Centre at Wright State University, Ohio (Prof. Amit Sheth, Dr. Satya Sahoo) and with Jun Zhao of Oxford University [4].
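As a rough illustration of what exporting a trace to OPM involves, the sketch below maps one recorded step onto the two OPM edge types that link processes and artifacts: `used` (a process consumed an artifact) and `wasGeneratedBy` (an artifact was produced by a process). The function name `to_opm_edges`, the tuple layout, and the identifiers are assumptions made for this example. It does not reproduce the format emitted by the Taverna provenance component, and it omits OPM roles, accounts and agents.

```python
def to_opm_edges(processor, inputs, outputs):
    """Map one workflow step onto OPM-style causal edges.

    Each edge is a (kind, effect, cause) tuple, following OPM's convention
    that edges point from the effect back to its cause.
    """
    edges = []
    for value in inputs:
        # The process (effect) used the artifact (cause).
        edges.append(("used", processor, value))
    for value in outputs:
        # The artifact (effect) was generated by the process (cause).
        edges.append(("wasGeneratedBy", value, processor))
    return edges


# Hypothetical step: one gene identifier in, one pathway identifier out.
for edge in to_opm_edges("lookup_pathways", ["gene:A"], ["pathway:P1"]):
    print(edge)
```

Because both edge types point from effect to cause, following them from an output artifact leads back through the processes to the inputs they consumed, which is what makes such a graph usable for lineage queries across systems.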

We also actively participate in the W3C Provenance Incubator group. Since early 2010, we have been invited partners of the NSF DataONE project, dedicated to large-scale preservation of scientific data, and founding members of the Workflow and Provenance Working Group promoted by the project, along with Prof. Ludaescher at UC Davis, USA, and Juliana Freire at the University of Utah, USA.

Past collaborations

Past collaborations on the topic of provenance include:

References

References to published material are available on-line.