***********************
Provenance Data Capture
***********************

.. toctree::
   :maxdepth: 2
   :caption: Contents:

The process of understanding, optimizing, and reproducing complex Edge-to-Cloud
workflows may be assisted by **provenance data capture**. **Provenance data**
refer to a record trail that accounts for **the origin of a piece of data**,
together with descriptions of the computational processes that explain
**how and why it was generated**.

**Capturing provenance data during workflow execution** helps users track
inputs, outputs, and processing history, allowing them to steer workflows
precisely.

ProvLight
=========

ProvLight is an `open-source tool `_ that allows researchers to efficiently
capture provenance data of workflows running on IoT/Edge infrastructures.
ProvLight presents **low capture overhead** in terms of: capture time; CPU and
memory usage; network usage; and power consumption.

Architecture
------------

The architecture of ProvLight is presented in :ref:`provlight_archi`. It
follows a **client/server model**:

.. _provlight_archi:
.. figure:: ../figures/ProvLight-architecture.png
   :width: 60%
   :align: center

   Figure 1: ProvLight architecture

- **Server:** the ProvLight server is composed of a `broker `_ and a
  `provenance data translator `_ (`MQTT-SN client lib `_).

  - **(i) Broker:** refers to an MQTT-SN broker (MQTT for Sensor Networks).
    During workflow execution, clients connect to the broker and start
    transmitting the captured data. This data is then forwarded to the
    provenance data translator, which is subscribed to the broker.

  - **(ii) Provenance Data Translator:** translates the captured data into
    the format used by the provenance system. The provenance data translator
    may be extended by users to translate to the particular data model of a
    provenance system (compatible with **W3C PROV-DM**). After translating,
    it sends the data to the provenance system service, which stores the
    captured data.
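To make the translator's role concrete, the sketch below maps a compact message, as a client might publish it, to a W3C PROV-DM-style record. This is a minimal illustration: the message fields (``wf``, ``task``, ``data``) and the ``translate`` function are hypothetical, not the actual ProvLight wire format or API.

```python
import json

# Hypothetical example: map a compact message captured on an Edge device
# to a W3C PROV-DM-style record. Field names ("wf", "task", "data") are
# illustrative, not the actual ProvLight wire format.
def translate(raw_message: bytes) -> dict:
    msg = json.loads(raw_message)
    return {
        # PROV-DM: the workflow acts as an Agent, each task as an Activity
        "agent": {"prov:id": msg["wf"]},
        "activity": {
            "prov:id": f'{msg["wf"]}:{msg["task"]}',
            "prov:startTime": msg.get("start"),
            "prov:endTime": msg.get("end"),
        },
        # input/output values become Entities used/generated by the Activity
        "entities": [{"prov:id": k, "prov:value": v}
                     for k, v in msg.get("data", {}).items()],
    }

record = translate(b'{"wf": "training", "task": "epoch_1", '
                   b'"start": 0.0, "end": 1.5, "data": {"accuracy": 0.93}}')
```

A real translator would run this mapping for every message received from the broker and forward the resulting records to the provenance system's ingestion endpoint.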
- **Client:** the ProvLight client aims to efficiently capture provenance
  data on resource-limited devices. `ProvLight provides a client library `_
  that follows the **W3C PROV-DM** provenance model. This library allows
  users to **instrument their workflow code** to decide what data to
  capture. A client is configured to transmit, at runtime, the captured
  data to the **remote broker**. This allows users to **track workflow
  execution at runtime** (e.g., started and finished tasks, input and
  output data, etc.) through provenance systems supporting data ingestion
  at runtime.

Data exchange model
-------------------

The ProvLight data exchange model is presented in :ref:`provlight_provdm`.
The goal is to have a data exchange specification (domain-agnostic PROV
modeling) for capturing data in the IoT/Edge and making sure these captured
data are compatible with W3C PROV-based workflow provenance systems, such
as ProvLake, DfAnalyzer, and PROV-IO, among many others.

Figure 2 presents the ProvLight classes (right side) and their
relationships, and maps them to the PROV-DM core elements. The main classes
of our model are **Workflow**, **Task**, and **Data**. These classes are
derived from the **Agent**, **Activity**, and **Entity** PROV-DM types,
respectively. The ProvLight classes aim to provide a simplified abstraction
allowing users to track the **workflow** (Workflow class), **input and
output parameters** (Data class), and **processing history** (Task class).

.. _provlight_provdm:
.. figure:: ../figures/ProvLight-prov-dm.png
   :width: 100%
   :align: center

   Figure 2: ProvLight provenance data exchange model follows the W3C PROV-DM recommendation

- The Workflow class may be used to refer to the application workflow
  (e.g., Federated Learning training).
- The Task class refers to the tasks executed in the workflow (e.g., each
  epoch or model update of the model training).
- The Data class represents the input data attributes and values (e.g.,
  hyperparameters of the learning algorithm) or the output attributes
  (e.g., training time and accuracy).
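The mapping between the three classes and the PROV-DM core types can be sketched as below. The constructors and attribute names are simplified stand-ins for illustration only; the actual class signatures are defined by the ProvLight capture library.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the three core classes and their PROV-DM types;
# constructors and attributes are simplified, not the actual ProvLight API.
@dataclass
class Workflow:             # PROV-DM Agent: the application workflow
    name: str

@dataclass
class Data:                 # PROV-DM Entity: input/output attributes
    values: dict

@dataclass
class Task:                 # PROV-DM Activity: one executed task
    workflow: Workflow
    name: str
    inputs: list = field(default_factory=list)    # Data used by the task
    outputs: list = field(default_factory=list)   # Data generated by the task

# e.g., one training epoch of a Federated Learning workflow
task = Task(Workflow("model_training"), "epoch_1",
            inputs=[Data({"kernel_size": 3})],
            outputs=[Data({"accuracy": 0.93, "training_time": 12.4})])
```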
These classes are implemented in the `ProvLight Python capture library `_.

Capture library
---------------

The design choices of the ProvLight client library provide a series of
features targeting resource-limited IoT/Edge devices:

- **Simplified data models:** simplified classes for provenance modeling
  that allow users to represent workflows, data derivations (e.g.,
  input/output data of tasks), and tasks (e.g., status, dependencies, data
  derivations).
- **Data compression & grouping:** compresses the captured data before
  transmitting it over the network; and optionally lets users group the
  data of finished tasks, so they may still track, at workflow runtime,
  the tasks that have already started.
- **Lightweight transmission protocol:** MQTT for Sensor Networks (MQTT-SN).
- **Asynchronous communication model:** MQTT-SN QoS level 2 (exactly once).

The provenance manager (ProvLight in E2Clab)
============================================

The integration of ProvLight into E2Clab allows users to **capture
end-to-end provenance data** of Edge-to-Cloud workflows.
:ref:`provlight_e2clab` shows the extended E2Clab architecture with the new
components (the Provenance Manager) in red.

.. _provlight_e2clab:
.. figure:: ../figures/E2Clab-provenance.png
   :width: 100%
   :align: center

   Figure 3: ProvLight in E2Clab

The Provenance Manager is composed of:

- **ProvLight:** to efficiently capture provenance data of workflows
  running on IoT devices. It also allows users to capture provenance in
  Cloud/HPC environments. ProvLight translates the captured data to the
  DfAnalyzer data model.
- **DfAnalyzer:** to ``store`` and ``query`` the provenance captured by
  ProvLight during workflow runtime (e.g., compare the provenance of
  multiple workflow evaluations to understand how they impact performance).
  Furthermore, it allows users to ``visualize`` dataflow specifications
  (i.e., the data attributes of each dataset). `DfAnalyzer `_ is available
  as open-source software.
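The "data compression & grouping" feature above can be sketched as follows. This is a minimal illustration using ``zlib`` and JSON as stand-ins; the function names and the codec the actual client library uses are assumptions, not the ProvLight implementation.

```python
import json
import zlib

def compress_payload(records: list) -> bytes:
    """Group several finished-task records into one message and compress
    it before handing it to the transport (illustrative sketch, not the
    actual ProvLight codec)."""
    return zlib.compress(json.dumps(records).encode("utf-8"))

def decompress_payload(payload: bytes) -> list:
    """Inverse operation, as the server side would apply it."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))

# Grouping: records of *ended* tasks are buffered and sent together, while
# "started" notifications can still be sent immediately, so users keep
# runtime visibility of which tasks have already started.
ended = [{"task": f"epoch_{i}", "status": "ended"} for i in range(50)]
payload = compress_payload(ended)

assert decompress_payload(payload) == ended
# fewer bytes on the wire than the uncompressed records
assert len(payload) < len(json.dumps(ended).encode("utf-8"))
```

The trade-off behind this design is spending a little CPU time on the device to save network bandwidth and power, which the document identifies as the scarce resources on IoT/Edge nodes.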
How to capture provenance data?
-------------------------------

To enable provenance data capture in E2Clab, users must **(1) define the
dataflow specification**, **(2) instrument their application code**, and
**(3) configure the layers_services.yaml** file.

- **(1) Defining the dataflow specification:** next, we illustrate how
  users can define their dataflow specification using **DfAnalyzer** as the
  provenance system to store the data. In this example, the **Dataflow**
  refers to the ``model_training`` and the **Transformation** refers to the
  ``training`` of a model. Each ``training`` has input and output data:

  - input: the **Set** refers to the ``training_input``, such as the
    hyperparameters (``kernel_size``, ``num_kernels``,
    ``length_of_strides``, and ``pooling_size``).
  - output: the **Set** refers to the ``training_output``, such as
    ``accuracy`` and ``training_time``.

  .. literalinclude:: ../examples/provenance_capture/my-dataflow-specification.py
     :language: python
     :linenos:

- **(2) Instrumenting the application code:** based on the dataflow
  specification, next we show how users can instrument their application
  code using the **Workflow**, **Task**, and **Data** classes. Note that
  users can easily instrument their code to decide what to capture. In this
  example, the user wants to capture the **model hyperparameters** and the
  respective **model performance** (e.g., accuracy and training time).

  .. literalinclude:: ../examples/provenance_capture/user-application.py
     :language: python
     :linenos:

- **(3) Configuring layers_services.yaml:** next, we show how to enable
  provenance data capture in E2Clab.

  .. literalinclude:: ../examples/provenance_capture/provenance.yaml
     :language: yaml
     :linenos:

.. note:: E2Clab will look for the **dataflow specification** file in the
   ``artifacts_dir`` you used in the ``E2Clab command line``.

Try some examples
=================

We provide a `toy example here <../examples/provenance_capture.html>`_.