Provenance Capture: Edge-to-Cloud workflow

In this tutorial, we show how to capture provenance data of a toy application (e.g., AI model training) executed on the Edge-to-Cloud continuum (the G5K and FIT IoT LAB testbeds). The goal is to show how provenance data capture can help users answer research questions like:

  • What are the model hyperparameters that obtained an accuracy value above 90%?

In this example you will learn how to:

  • Enable provenance data capture in E2Clab (Edge-to-Cloud: G5K and FIT IoT LAB testbeds)

  • Create a dataflow specification

  • Instrument the application code to decide what to capture

  • Query the database to answer the research questions

Experiment Artifacts

$ cd ~/git/
$ git clone https://gitlab.inria.fr/E2Clab/examples/provenance-tutorial

The structure of the experimental setup looks like this:

provenance-tutorial/ # SCENARIO_DIR
├── artifacts/
│   ├── my-dataflow-specification.py
│   └── user-application.py
├── layers_services.yaml
├── network.yaml
└── workflow.yaml

In this repository you will find:

  • the E2Clab configuration files such as layers_services.yaml, network.yaml, and workflow.yaml.

  • the my-dataflow-specification.py and the toy application user-application.py.

Defining the Experimental Environment

Layers & Services Configuration

This configuration file presents the layers and services that compose this example: the Master (one machine, quantity: 1, in the Grid’5000 environment: g5k) and the Worker (one A8-M3 device, quantity: 1, in the FIT IoT LAB environment: iotlab).

To enable the E2Clab provenance service, we add the provenance: attribute with the following configuration:

  • provider: g5k to deploy the provenance service on a G5K machine in the gros cluster (cluster: gros).

  • dataflow_spec: my-dataflow-specification.py to define the attributes and value types of the dataflow, used to create the provenance database tables. This file must be in the artifacts_dir directory defined on the E2Clab command line.

  • ipv: 6 to allow the FIT IoT LAB device to use its IPv6 network to send data.

  • parallelism: 2 to parallelize the provenance data translator (which translates from the ProvLight to the DfAnalyzer data format) and the broker topic.

Finally, we add roles: ['provenance'] in the Master and Worker services to enable data capture on them (e.g., install ProvLight capture library and set up environment variables to enable the connection with the provenance service).

---
environment:
  job_name: provenance-tutorial
  walltime: "00:59:00"
  g5k:
    cluster: gros
    job_type: ["deploy"]
    env_name: "debian11-x64-big"
    ssh_key: "your_g5k_key.pub"
    firewall_rules:
      - services: ["provenance_service"]
        ports: [1883]
  iotlab:
    cluster: grenoble
provenance:
  provider: g5k
  cluster: gros
  dataflow_spec: my-dataflow-specification.py
  ipv: 6
  parallelism: 2
layers:
- name: cloud
  services:
  - name: Master
    environment: g5k
    cluster: gros
    quantity: 1
    roles: ['provenance']
- name: edge
  services:
  - name: Worker
    environment: iotlab
    cluster: grenoble
    archi: a8:at86rf231
    quantity: 1
    roles: ['provenance']

Note

We create a firewall rule on Grid’5000 to allow the Worker (FIT IoT LAB device) to send the captured data to the E2Clab provenance service deployed on G5K on port 1883 (MQTT protocol).
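Before launching the workflow, it can be useful to sanity-check that the broker port is actually reachable through the firewall. Below is a minimal, hypothetical sketch (not part of the tutorial artifacts); replace the host with your provenance node's address from the validation file:

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical placeholder: substitute your provenance node's hostname
    print(port_open("localhost", 1883))
```

If this returns False from the Worker, the firewall rule or the IPv6 setup is the first thing to check.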

The toy application

To emulate model training with various hyperparameters, we implement the application below. The model inputs are the hyperparameters, and the training outputs are model performance metrics such as the accuracy and the training time.

import random
import time


def model_training(training_time):
    time.sleep(training_time)
    accuracy = round(random.uniform(0, 1), 2)
    return training_time, accuracy


if __name__ == "__main__":
    dataflow_id = "model_training"
    transformation_id = "training"
    training_input = "training_input"
    training_output = "training_output"

    # training 10x with different hyperparameters
    for training_id in range(1, 11):
        # model hyperparameters
        kernel_size = random.randint(1, 10)
        num_kernels = random.randint(8, 16)
        length_of_strides = random.randint(1, 5)
        pooling_size = random.randint(8, 16)
        # training input: model hyperparameters
        model_hyperparameters = {
            "model_hyperparameters": [
                kernel_size,
                num_kernels,
                length_of_strides,
                pooling_size,
            ]
        }
        # START training... time to train the model with hyperparameter set
        _training_time, _accuracy = model_training(training_time=random.randint(1, 5))
        # training output: model performance
        print(
            f"hyperparameters = {model_hyperparameters['model_hyperparameters']} | "
            f"accuracy = {_accuracy} / training_time = {_training_time}"
        )

Network Configuration

The file below presents the network configuration between the cloud and edge infrastructures: delay: 28ms, loss: 0.1%, and rate: 1gbit.

networks:
- src: cloud
  dst: edge
  delay: "28ms"
  rate: "1gbit"
  loss: "0.1%"

Workflow Configuration

This configuration file presents the application workflow configuration.

  • For the Master cloud.* and the Worker edge.*:

prepare copies the application from the local machine to the remote machine.

launch executes the application.

- hosts: cloud.*
  prepare:
    - copy:
        src: "{{ working_dir }}/user-application.py"
        dest: "/tmp/user-application.py"
  launch:
    - shell: python /tmp/user-application.py
      async: 120
      poll: 0
- hosts: edge.*
  prepare:
    - copy:
        src: "{{ working_dir }}/user-application.py"
        dest: "/tmp/user-application.py"
  launch:
    - shell: source ~/.bashrc && python /tmp/user-application.py

User-Defined Provenance Data Capture

Next, we show how we used the ProvLight client library to instrument the application code to capture the model hyperparameters and the model performance results. The Workflow, Task, and Data classes are used to capture data.

import os
import random
import time

from provlight.workflow import Workflow
from provlight.task import Task
from provlight.data import Data

client_id = os.environ.get('PROVLIGHT_SERVER_TOPIC', "")


def model_training(training_time):
    time.sleep(training_time)
    accuracy = round(random.uniform(0, 1), 2)
    return training_time, accuracy


if __name__ == "__main__":
    # IDs defined in the dataflow specification
    dataflow_id = "model_training"
    transformation_id = "training"
    training_input = "training_input"
    training_output = "training_output"

    wf = Workflow(dataflow_id)
    wf.begin()

    # training 10x with different hyperparameters
    for training_id in range(1, 11):
        # model hyperparameters
        kernel_size = random.randint(1, 10)
        num_kernels = random.randint(8, 16)
        length_of_strides = random.randint(1, 5)
        pooling_size = random.randint(8, 16)
        # training input: model hyperparameters
        model_hyperparameters = {'model_hyperparameters': [
            kernel_size,
            num_kernels,
            length_of_strides,
            pooling_size,
        ]}
        task = Task(int(str(client_id) + str(training_id)), wf, transformation_id, dependencies=[])
        data_in = Data(training_input, dataflow_id, [], model_hyperparameters)
        task.begin([data_in])
        # START training... time to train the model with hyperparameter set
        _training_time, _accuracy = model_training(training_time=random.randint(1, 5))
        # training output: model performance
        data_out = Data(training_output, dataflow_id, [], {'model_performance': [
            _accuracy,
            _training_time,
        ]})
        task.end([data_out])

    wf.end()
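Note how each task ID is built: the client ID (read from the PROVLIGHT_SERVER_TOPIC environment variable) is concatenated as a string with the loop counter before converting back to an integer, so tasks captured by different clients never collide. A small illustration with hypothetical values:

```python
# Hypothetical values: client "42" running its 7th training iteration
client_id = "42"
training_id = 7

# String concatenation, then back to int: yields 427, not 42 + 7 = 49
task_id = int(str(client_id) + str(training_id))
print(task_id)  # 427
```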

Running & Verifying Experiment Execution

Find below the commands to deploy this application and check its execution.

$ e2clab layers-services ~/git/provenance-tutorial/ ~/git/provenance-tutorial/artifacts/

You can find the command to access the provenance GUI in the layers_services-validate file, like:

  • ssh -NL 22000:localhost:22000 PROVENANCE_NODE

The GUI will then be accessible at http://localhost:22000.

../_images/prov_gui.png

Fig. 21 Figure 1: Provenance Service GUI (DfAnalyzer)

$ e2clab workflow ~/git/provenance-tutorial/ prepare
$ e2clab workflow ~/git/provenance-tutorial/ launch

Deployment Validation & Experiment Results

We can access the database as follows:

# Find PROVENANCE_NODE in layers_services-validate
$ ssh root@PROVENANCE_NODE
# OR select the provenance node in
$ e2clab ssh .
$ docker exec -it dfanalyzer bash
$ monetdb status
$ mclient dataflow_analyzer

With \d you can list all tables.

../_images/prov_tables.png

Fig. 22 Figure 2: Tables in the provenance database

After model training on the G5K node and the FIT IoT LAB device, we can visualize the training results as presented in Figure 3: Model input (hyperparameters) and Figure 4: Model output (accuracy and training time).

../_images/prov_in.png

Fig. 23 Figure 3: Model input (hyperparameters)

../_images/prov_out.png

Fig. 24 Figure 4: Model output (accuracy and training time)

After multiple model evaluations, thanks to provenance data capture during model training, users can easily answer the following research question:

  • What are the model hyperparameters that obtained an accuracy value above 90%?

../_images/prov_rq.png

Fig. 25 Figure 5: What are the model hyperparameters that obtained an accuracy value above 90%?
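The query behind Figure 5 filters the captured training records on the accuracy attribute. The same logic, sketched in plain Python over hypothetical records (in the real database, table and column names come from the dataflow specification):

```python
# Hypothetical captured records: one dict per training task
records = [
    {"hyperparameters": [3, 12, 2, 9], "accuracy": 0.95, "training_time": 2},
    {"hyperparameters": [7, 10, 1, 14], "accuracy": 0.62, "training_time": 4},
    {"hyperparameters": [5, 9, 3, 11], "accuracy": 0.91, "training_time": 1},
]

# Keep only hyperparameter sets whose accuracy exceeds 90%
best = [r["hyperparameters"] for r in records if r["accuracy"] > 0.90]
print(best)  # [[3, 12, 2, 9], [5, 9, 3, 11]]
```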

Saving the Experiment Results

$ e2clab finalize ~/git/provenance-tutorial/

The experiment results will be saved at:

$ ls ~/provenance-tutorial/OUTPUT_DIR/
layers_services-validate.yaml
provenance-data/                  # contains the 'provenance_database.sql' file.
workflow-validate.out