***********************
Provenance Data Capture
***********************

.. toctree::
   :maxdepth: 2
   :caption: Contents:

The process of understanding, optimizing, and reproducing complex Edge-to-Cloud
workflows may be assisted by **provenance data capture**. **Provenance data**
refer to a record trail that accounts for **the origin of a piece of data**,
together with descriptions of the computational processes that explain
**how and why it was generated**.

**Capturing provenance data during workflow execution** helps users track
inputs, outputs, and processing history, allowing them to steer workflows
precisely.

ProvLight
=========

ProvLight is an `open-source tool `_ that allows researchers to efficiently
capture provenance data of workflows running on IoT/Edge infrastructures.
ProvLight presents **low capture overhead** in terms of: capture time; CPU and
memory usage; network usage; and power consumption.

Architecture
------------

The architecture of ProvLight is presented in :ref:`provlight_archi`. It
follows a **client/server model**:

.. _provlight_archi:
.. figure:: ../figures/ProvLight-architecture.png
   :width: 60%
   :align: center

   Figure 1: ProvLight architecture

- **Server:** the ProvLight server is composed of a `broker `_ and a
  `provenance data translator `_ (`MQTT-SN client lib `_).

  - **(i) Broker:** refers to an MQTT-SN broker (MQTT for Sensor Networks).
    During workflow execution, clients connect to the broker and start
    transmitting the captured data. This data is then forwarded to the
    provenance data translator, which is subscribed to the broker.

  - **(ii) Provenance Data Translator:** translates the captured data into
    the format used by the provenance system. The provenance data translator
    may be extended by users to translate to the particular data model of a
    provenance system (compatible with **W3C PROV-DM**). After translating,
    it sends the data to the provenance system service, which stores the
    captured data.
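To make the translator's role concrete, the sketch below maps a compact message, as a client might publish it, to a W3C PROV-DM-style record. This is a minimal illustration: the message fields (``wf``, ``task``, ``data``) and the ``translate`` function are hypothetical, not the actual ProvLight wire format or API.

```python
import json

# Hypothetical example: map a compact message captured on an Edge device
# to a W3C PROV-DM-style record. Field names ("wf", "task", "data") are
# illustrative, not the actual ProvLight wire format.
def translate(raw_message: bytes) -> dict:
    msg = json.loads(raw_message)
    return {
        # PROV-DM: the workflow acts as an Agent, each task as an Activity
        "agent": {"prov:id": msg["wf"]},
        "activity": {
            "prov:id": f'{msg["wf"]}:{msg["task"]}',
            "prov:startTime": msg.get("start"),
            "prov:endTime": msg.get("end"),
        },
        # input/output values become Entities used/generated by the Activity
        "entities": [{"prov:id": k, "prov:value": v}
                     for k, v in msg.get("data", {}).items()],
    }

record = translate(b'{"wf": "training", "task": "epoch_1", '
                   b'"start": 0.0, "end": 1.5, "data": {"accuracy": 0.93}}')
```

A real translator would run this mapping for every message received from the broker and forward the resulting records to the provenance system's ingestion endpoint.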
- **Client:** the ProvLight client aims to efficiently capture provenance
  data on resource-limited devices. `ProvLight provides a client library `_
  that follows the **W3C PROV-DM** provenance model. This library allows
  users to **instrument their workflow code** to decide what data to
  capture. A client is configured to transmit, at runtime, the captured
  data to the **remote broker**. This allows users to **track workflow
  execution at runtime** (e.g., started and finished tasks, input and
  output data, etc.) through provenance systems supporting data ingestion
  at runtime.

Data exchange model
-------------------

The ProvLight data exchange model is presented in :ref:`provlight_provdm`.
The goal is to have a data exchange specification (domain-agnostic PROV
modeling) for capturing data in the IoT/Edge and making sure these captured
data are compatible with W3C PROV-based workflow provenance systems, such
as ProvLake, DfAnalyzer, and PROV-IO, among many others.

Figure 2 presents the ProvLight classes (right side) and their
relationships, and maps them to the PROV-DM core elements. The main classes
of our model are **Workflow**, **Task**, and **Data**. These classes are
derived from the **Agent**, **Activity**, and **Entity** PROV-DM types,
respectively. The ProvLight classes aim to provide a simplified abstraction
allowing users to track the **workflow** (Workflow class), **input and
output parameters** (Data class), and **processing history** (Task class).

.. _provlight_provdm:
.. figure:: ../figures/ProvLight-prov-dm.png
   :width: 100%
   :align: center

   Figure 2: ProvLight provenance data exchange model follows the W3C PROV-DM recommendation

- The Workflow class may be used to refer to the application workflow
  (e.g., Federated Learning training).
- The Task class refers to the tasks executed in the workflow (e.g., each
  epoch or model update of the model training).
- The Data class represents the input data attributes and values (e.g.,
  hyperparameters of the learning algorithm) or the output attributes
  (e.g., training time and accuracy).
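The mapping between the three classes and the PROV-DM core types can be sketched as below. The constructors and attribute names are simplified stand-ins for illustration only; the actual class signatures are defined by the ProvLight capture library.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the three core classes and their PROV-DM types;
# constructors and attributes are simplified, not the actual ProvLight API.
@dataclass
class Workflow:             # PROV-DM Agent: the application workflow
    name: str

@dataclass
class Data:                 # PROV-DM Entity: input/output attributes
    values: dict

@dataclass
class Task:                 # PROV-DM Activity: one executed task
    workflow: Workflow
    name: str
    inputs: list = field(default_factory=list)    # Data used by the task
    outputs: list = field(default_factory=list)   # Data generated by the task

# e.g., one training epoch of a Federated Learning workflow
task = Task(Workflow("model_training"), "epoch_1",
            inputs=[Data({"kernel_size": 3})],
            outputs=[Data({"accuracy": 0.93, "training_time": 12.4})])
```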
These classes are implemented in the `ProvLight Python capture library `_.

Capture library
---------------

The design choices of the ProvLight client library provide a series of
features targeting resource-limited IoT/Edge devices:

- **Simplified data models:** simplified classes for provenance modeling
  that allow users to represent workflows, data derivations (e.g.,
  input/output data of tasks), and tasks (e.g., status, dependencies, data
  derivations).
- **Data compression & grouping:** compresses the captured data before
  transmitting it over the network; and optionally lets users group the
  data of finished tasks, so they may still track, at workflow runtime,
  the tasks that have already started.
- **Lightweight transmission protocol:** MQTT for Sensor Networks (MQTT-SN).
- **Asynchronous communication model:** MQTT-SN QoS level 2 (exactly once).

The provenance manager (ProvLight in E2Clab)
============================================

The integration of ProvLight into E2Clab allows users to **capture
end-to-end provenance data** of Edge-to-Cloud workflows.
:ref:`provlight_e2clab` shows the extended E2Clab architecture with the new
components (the Provenance Manager) in red.

.. _provlight_e2clab:
.. figure:: ../figures/E2Clab-provenance.png
   :width: 100%
   :align: center

   Figure 3: ProvLight in E2Clab

The Provenance Manager is composed of:

- **ProvLight:** to efficiently capture provenance data of workflows
  running on IoT devices. It also allows users to capture provenance in
  Cloud/HPC environments. ProvLight translates the captured data to the
  DfAnalyzer data model.
- **DfAnalyzer:** to ``store`` and ``query`` the provenance captured by
  ProvLight during workflow runtime (e.g., compare the provenance of
  multiple workflow evaluations to understand how they impact performance).
  Furthermore, it allows users to ``visualize`` dataflow specifications
  (i.e., the data attributes of each dataset). `DfAnalyzer `_ is available
  as open-source software.
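The "data compression & grouping" feature above can be sketched as follows. This is a minimal illustration using ``zlib`` and JSON as stand-ins; the function names and the codec the actual client library uses are assumptions, not the ProvLight implementation.

```python
import json
import zlib

def compress_payload(records: list) -> bytes:
    """Group several finished-task records into one message and compress
    it before handing it to the transport (illustrative sketch, not the
    actual ProvLight codec)."""
    return zlib.compress(json.dumps(records).encode("utf-8"))

def decompress_payload(payload: bytes) -> list:
    """Inverse operation, as the server side would apply it."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))

# Grouping: records of *ended* tasks are buffered and sent together, while
# "started" notifications can still be sent immediately, so users keep
# runtime visibility of which tasks have already started.
ended = [{"task": f"epoch_{i}", "status": "ended"} for i in range(50)]
payload = compress_payload(ended)

assert decompress_payload(payload) == ended
# fewer bytes on the wire than the uncompressed records
assert len(payload) < len(json.dumps(ended).encode("utf-8"))
```

The trade-off behind this design is spending a little CPU time on the device to save network bandwidth and power, which the document identifies as the scarce resources on IoT/Edge nodes.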
How to capture provenance data?
-------------------------------

To enable provenance data capture in E2Clab, users must **(1) define the
dataflow specification**, **(2) instrument their application code**, and
**(3) configure the layers_services.yaml** file.

- **(1) Defining the dataflow specification:** next, we illustrate how
  users can define their dataflow specification using **DfAnalyzer** as the
  provenance system to store the data. In this example, the **Dataflow**
  refers to the ``model_training`` and the **Transformation** refers to the
  ``training`` of a model. Each ``training`` has input and output data:

  - input: the **Set** refers to the ``training_input``, such as the
    hyperparameters (``kernel_size``, ``num_kernels``,
    ``length_of_strides``, and ``pooling_size``).
  - output: the **Set** refers to the ``training_output``, such as
    ``accuracy`` and ``training_time``.

  .. literalinclude:: ../examples/provenance_capture/my-dataflow-specification.py
     :language: python
     :linenos:

- **(2) Instrumenting the application code:** based on the dataflow
  specification, next we show how users can instrument their application
  code using the **Workflow**, **Task**, and **Data** classes. Note that
  users can easily instrument their code to decide what to capture. In this
  example, the user wants to capture the **model hyperparameters** and the
  respective **model performance** (e.g., accuracy and training time).

  .. literalinclude:: ../examples/provenance_capture/user-application.py
     :language: python
     :linenos:

- **(3) Configuring layers_services.yaml:** next, we show how to enable
  provenance data capture in E2Clab.

  .. literalinclude:: ../examples/provenance_capture/provenance.yaml
     :language: yaml
     :linenos:

.. note:: E2Clab will look for the **dataflow specification** file in the
   ``artifacts_dir`` you used in the ``E2Clab command line``.

Try some examples
=================

We provide a `toy example here <../examples/provenance_capture.html>`_.