Workflow preservation

The Wf4Ever project developed models and infrastructure for preserving scientific workflows and their supporting material. In particular, Wf4Ever seeks to promote workflow sustainability by collecting workflow metadata (such as author, title, purpose of the workflow, description of steps and references) and provenance from example and complete workflow runs, and by proactively monitoring workflows for decay, for example by detecting resources, such as web services, that have become unavailable.

Central to Wf4Ever is the Research Object model, a suite of ontologies that extend existing models to enable aggregation and annotation of research objects and their constituent resources. A research object is a collection of resources that together form a unit of research, for instance a workflow definition (like a t2flow file), input data sets, workflow runs, provenance, intermediate data and output data. In addition, a research object can cover the scientific process around a workflow, so it can include a hypothesis, results, citations, articles, presentations and any other resource.
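The idea of aggregation plus annotation can be pictured with a small in-memory structure. This is only a sketch under simplifying assumptions: the actual Research Object model is a set of ontologies, not a Python class, and the class name, method names and file names used here are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ResearchObject:
    """Toy stand-in for a research object: resources plus annotations."""

    title: str
    resources: List[str] = field(default_factory=list)
    annotations: Dict[str, List[str]] = field(default_factory=dict)

    def aggregate(self, path: str) -> None:
        """Add a resource to the research object (no duplicates)."""
        if path not in self.resources:
            self.resources.append(path)

    def annotate(self, path: str, note: str) -> None:
        """Attach a note to a resource that is already aggregated."""
        if path not in self.resources:
            raise KeyError(f"{path} is not aggregated by this research object")
        self.annotations.setdefault(path, []).append(note)


# Hypothetical example content; the file names are placeholders.
ro = ResearchObject("Example analysis")
ro.aggregate("workflow.t2flow")
ro.aggregate("inputs/sequences.fasta")
ro.annotate("workflow.t2flow", "Purpose: example workflow definition")
```

Keeping annotations separate from the resources themselves mirrors the description above: the resources stay untouched, while metadata about them is layered on top.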

In practice, a research object (RO) can be viewed as an extension of myExperiment packs. Wf4Ever is exploring how to present and collect the additional metadata that covers a research object, as well as how to integrate Wf4Ever infrastructure such as analytics and quality reports, stability checklists, pack/workflow recommendations and tracking of the evolution of the RO and its components. One of the most interesting additions for Taverna users will be the ability to upload a workflow run in order to browse and share the run data (including intermediate values) with other users. A mockup of how this would be presented is available.

Wf4Ever supports Taverna workflows, but also workflows from other systems, such as Wings, for forming use cases and testing the models and infrastructure. For Taverna, Wf4Ever has developed a plugin called taverna-prov, which exports W3C PROV-O compliant workflow run provenance as RDF. This provenance is based on two extensions of PROV-O: the Wf4Ever wfprov model and a Taverna-specific model, TavernaProv. The plugin has been used to generate a publicly available Provenance Corpus, which contains traces of workflow runs (including inputs, outputs and intermediate values) of 134 Taverna workflows together with 70 Wings workflows. The Taverna workflows were harvested from the more than 1500 publicly available Taverna 1 and Taverna 2 workflows shared on the myExperiment website, and were carefully selected to cover different domains, years and authors. The corpus includes traces of KEGG workflows that use the now decommissioned KEGG SOAP API, preserving last-known-working snapshots in time for observing real-life service decay, and hence workflow decay.

Ongoing research, in addition to the above, covers a classification of workflow patterns into motifs that describe the task performed by a series of steps in a workflow, for instance “data cleaning”, “retrieval”, “visualization” and “interaction”. This work is culminating in a motif ontology, which can be used together with ongoing myGrid and BioVel work to support Taverna workflow components.

Wf4Ever software and models are all open source and available on GitHub. The Wf4Ever infrastructure communicates using a set of specified REST service APIs, which are deployed for testing purposes in a sandbox environment that will be publicly available as a virtual machine image. Development is driven by the specific requirements of users in bioinformatics, astronomy and biodiversity. The project is also attracting interest from wider communities such as digital libraries, journal publishers and health economics.

For more information about the Wf4Ever project and workflow preservation, please contact support@mygrid.org.uk.
