Provenance Capture: Edge-to-Cloud workflow

In this tutorial, we show how to capture provenance data of a toy application (e.g., AI model training) executed on the Edge-to-Cloud continuum (the G5K and FIT IoT LAB testbeds). The goal is to show how provenance data capture can help users answer research questions like:

  • What are the model hyperparameters that obtained an accuracy value above 90%?

In this example you will learn how to:

  • Enable provenance data capture in E2Clab (Edge-to-Cloud: G5K and FIT IoT LAB testbeds)

  • Create a dataflow specification

  • Instrument the application code to decide what to capture

  • Query the database to answer the research questions

Experiment Artifacts

$ cd ~/git/
$ git clone https://gitlab.inria.fr/E2Clab/examples/provenance-tutorial

The structure of the experimental setup looks like this:

provenance-tutorial/ # SCENARIO_DIR
├── artifacts/
│   ├── my-dataflow-specification.py
│   └── user-application.py
├── layers_services.yaml
├── network.yaml
└── workflow.yaml

In this repository you will find:

  • the E2Clab configuration files such as layers_services.yaml, network.yaml, and workflow.yaml.

  • the my-dataflow-specification.py and the toy application user-application.py.

Defining the Experimental Environment

Layers & Services Configuration

This configuration file presents the layers and services that compose this example: the Master (one machine, quantity: 1, in the Grid’5000 environment: g5k) and the Worker (one A8-M3 device, quantity: 1, in the FIT IoT LAB environment: iotlab).

To enable the E2Clab provenance service, we add the provenance: attribute with the following configuration:

  • provider: g5k to deploy the provenance service on a G5K machine in the gros cluster (cluster: gros).

  • dataflow_spec: my-dataflow-specification.py to define the attributes and value types of the dataflow, used to create the provenance database tables. This file must be in the artifacts_dir directory defined on the E2Clab command line.

  • ipv: 6 to allow the FIT IoT LAB device to use its IPv6 network to send data.

  • parallelism: 2 to parallelize the provenance data translator (which translates from the ProvLight to the DfAnalyzer data format) and the broker topic.

Finally, we add roles: ['provenance'] in the Master and Worker services to enable data capture on them (e.g., install ProvLight capture library and set up environment variables to enable the connection with the provenance service).

---
environment:
  job_name: provenance-tutorial
  walltime: "00:59:00"
  g5k:
    cluster: gros
    job_type: ["deploy"]
    env_name: "debian11-x64-big"
    ssh_key: "your_g5k_key.pub"
    firewall_rules:
      - services: ["provenance_service"]
        ports: [1883]
  iotlab:
    cluster: grenoble
provenance:
  provider: g5k
  cluster: gros
  dataflow_spec: my-dataflow-specification.py
  ipv: 6
  parallelism: 2
layers:
- name: cloud
  services:
  - name: Master
    environment: g5k
    cluster: gros
    quantity: 1
    roles: ['provenance']
- name: edge
  services:
  - name: Worker
    environment: iotlab
    cluster: grenoble
    archi: a8:at86rf231
    quantity: 1
    roles: ['provenance']

Note

We create a firewall rule on Grid’5000 to allow the Worker (FIT IoT LAB device) to send the captured data to the E2Clab provenance service deployed on G5K on port 1883 (MQTT protocol).
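Before launching the workflow, it can be useful to sanity-check that the broker port is actually reachable through the firewall. Below is a minimal, hypothetical sketch (not part of the tutorial artifacts); replace the host with your provenance node's address from the validation file:

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical placeholder: substitute your provenance node's hostname
    print(port_open("localhost", 1883))
```

If this returns False from the Worker, the firewall rule or the IPv6 setup is the first thing to check.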

The toy application

To emulate model training with various hyperparameters, we implement the application below. The model inputs are the hyperparameters, and the training outputs are model performance metrics such as the accuracy and the training time.

import random
import time


def model_training(training_time):
    time.sleep(training_time)
    accuracy = round(random.uniform(0, 1), 2)
    return training_time, accuracy


if __name__ == "__main__":
    dataflow_id = "model_training"
    transformation_id = "training"
    training_input = "training_input"
    training_output = "training_output"

    # training 10x with different hyperparameters
    for training_id in range(1, 11):
        # model hyperparameters
        kernel_size = random.randint(1, 10)
        num_kernels = random.randint(8, 16)
        length_of_strides = random.randint(1, 5)
        pooling_size = random.randint(8, 16)
        # training input: model hyperparameters
        model_hyperparameters = {
            "model_hyperparameters": [
                kernel_size,
                num_kernels,
                length_of_strides,
                pooling_size,
            ]
        }
        # START training... time to train the model with hyperparameter set
        _training_time, _accuracy = model_training(training_time=random.randint(1, 5))
        # training output: model performance
        print(
            f"hyperparameters = {model_hyperparameters['model_hyperparameters']} | "
            f"accuracy = {_accuracy} / training_time = {_training_time}"
        )

Network Configuration

The file below presents the network configuration between the cloud and edge infrastructures: delay: 28ms, loss: 0.1%, and rate: 1gbit.

networks:
- src: cloud
  dst: edge
  delay: "28ms"
  rate: "1gbit"
  loss: "0.1%"

Workflow Configuration

This configuration file presents the application workflow configuration.

  • For the Master cloud.* and the Worker edge.*:

prepare copies the application from the local machine to the remote machine.

launch executes the application.

- hosts: cloud.*
  prepare:
    - copy:
        src: "{{ working_dir }}/user-application.py"
        dest: "/tmp/user-application.py"
  launch:
    - shell: python /tmp/user-application.py
      async: 120
      poll: 0
- hosts: edge.*
  prepare:
    - copy:
        src: "{{ working_dir }}/user-application.py"
        dest: "/tmp/user-application.py"
  launch:
    - shell: source ~/.bashrc && python /tmp/user-application.py

User-Defined Provenance Data Capture

Next, we show how we used the ProvLight client library to instrument the application code to capture the model hyperparameters and the model performance results. The Workflow, Task, and Data classes are used to capture data.

import os
import random
import time

from provlight.workflow import Workflow
from provlight.task import Task
from provlight.data import Data

client_id = os.environ.get('PROVLIGHT_SERVER_TOPIC', "")


def model_training(training_time):
    time.sleep(training_time)
    accuracy = round(random.uniform(0, 1), 2)
    return training_time, accuracy


if __name__ == "__main__":
    # IDs defined in the dataflow specification
    dataflow_id = "model_training"
    transformation_id = "training"
    training_input = "training_input"
    training_output = "training_output"

    wf = Workflow(dataflow_id)
    wf.begin()

    # training 10x with different hyperparameters
    for training_id in range(1, 11):
        # model hyperparameters
        kernel_size = random.randint(1, 10)
        num_kernels = random.randint(8, 16)
        length_of_strides = random.randint(1, 5)
        pooling_size = random.randint(8, 16)
        # training input: model hyperparameters
        model_hyperparameters = {'model_hyperparameters': [
            kernel_size,
            num_kernels,
            length_of_strides,
            pooling_size,
        ]}
        task = Task(int(str(client_id) + str(training_id)), wf, transformation_id, dependencies=[])
        data_in = Data(training_input, dataflow_id, [], model_hyperparameters)
        task.begin([data_in])
        # START training... time to train the model with hyperparameter set
        _training_time, _accuracy = model_training(training_time=random.randint(1, 5))
        # training output: model performance
        data_out = Data(training_output, dataflow_id, [], {'model_performance': [
            _accuracy,
            _training_time,
        ]})
        task.end([data_out])

    wf.end()
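Note how each task ID is built: the client ID (read from the PROVLIGHT_SERVER_TOPIC environment variable) is concatenated as a string with the loop counter before converting back to an integer, so tasks captured by different clients never collide. A small illustration with hypothetical values:

```python
# Hypothetical values: client "42" running its 7th training iteration
client_id = "42"
training_id = 7

# String concatenation, then back to int: yields 427, not 42 + 7 = 49
task_id = int(str(client_id) + str(training_id))
print(task_id)  # 427
```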

Running & Verifying Experiment Execution

Find below the commands to deploy this application and check its execution.

$ e2clab layers-services ~/git/provenance-tutorial/ ~/git/provenance-tutorial/artifacts/

You can find the command to access the provenance GUI in the layers_services-validate file, like:

  • ssh -NL 22000:localhost:22000 PROVENANCE_NODE

The GUI will then be accessible at http://localhost:22000.

../_images/prov_gui.png

Fig. 21 Figure 1: Provenance Service GUI (DfAnalyzer)

$ e2clab workflow ~/git/provenance-tutorial/ prepare
$ e2clab workflow ~/git/provenance-tutorial/ launch

Deployment Validation & Experiment Results

We can access the database as follows:

# Find PROVENANCE_NODE in layers_services-validate
$ ssh root@PROVENANCE_NODE
# OR select the provenance node in
$ e2clab ssh .
$ docker exec -it dfanalyzer bash
$ monetdb status
$ mclient dataflow_analyzer

With \d you can list all tables.

../_images/prov_tables.png

Fig. 22 Figure 2: Tables in the provenance database

After model training on the G5K node and the FIT IoT LAB device, we can visualize the training results as presented in Figure 3: Model input (hyperparameters) and Figure 4: Model output (accuracy and training time).

../_images/prov_in.png

Fig. 23 Figure 3: Model input (hyperparameters)

../_images/prov_out.png

Fig. 24 Figure 4: Model output (accuracy and training time)

After multiple model evaluations, thanks to provenance data capture during model training, users can easily answer the following research question:

  • What are the model hyperparameters that obtained an accuracy value above 90%?

../_images/prov_rq.png

Fig. 25 Figure 5: What are the model hyperparameters that obtained an accuracy value above 90%?
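The query behind Figure 5 filters the captured training records on the accuracy attribute. The same logic, sketched in plain Python over hypothetical records (in the real database, table and column names come from the dataflow specification):

```python
# Hypothetical captured records: one dict per training task
records = [
    {"hyperparameters": [3, 12, 2, 9], "accuracy": 0.95, "training_time": 2},
    {"hyperparameters": [7, 10, 1, 14], "accuracy": 0.62, "training_time": 4},
    {"hyperparameters": [5, 9, 3, 11], "accuracy": 0.91, "training_time": 1},
]

# Keep only hyperparameter sets whose accuracy exceeds 90%
best = [r["hyperparameters"] for r in records if r["accuracy"] > 0.90]
print(best)  # [[3, 12, 2, 9], [5, 9, 3, 11]]
```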

Saving the Experiment Results

$ e2clab finalize ~/git/provenance-tutorial/

The experiment results will be saved at:

$ ls ~/provenance-tutorial/OUTPUT_DIR/
layers_services-validate.yaml
provenance-data/                  # contains the 'provenance_database.sql' file.
workflow-validate.out