Monitoring

There are three main ways to monitor computing resources during experiment execution:

  • Dstat

  • TIG stack: Telegraf/InfluxDB/Grafana

  • TPG stack: Telegraf/Prometheus/Grafana

To enable monitoring in E2Clab, users have to configure the layers_services.yaml file as follows:

  • Define the monitoring by adding the monitoring attribute (more details in the next sections).

  • Add roles: [monitoring] on each Service the user wants to monitor.
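
As a minimal sketch of the two steps above combined in one layers_services.yaml fragment: the layer and Service names (cloud, fog, Server, Producer) and the cluster are illustrative assumptions, and dstat is used only because it is the simplest monitoring type.

```yaml
# Illustrative fragment: names and cluster are placeholders
monitoring:
  type: dstat
layers:
- name: cloud
  services:
  - name: Server
    roles: [monitoring]   # this Service is monitored
    cluster: "paradoxe"
- name: fog
  services:
  - name: Producer        # no monitoring role, so not monitored
    cluster: "paradoxe"
```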

In addition, you can monitor energy consumption:

  • Monitoring profile in FIT IoT LAB

To enable monitoring of FIT IoT LAB nodes in E2Clab, users have to configure the layers_services.yaml file as follows:

  • Define the monitoring profile by adding the monitoring_iotlab attribute (for more details refer to Section Set up a monitoring profile in FIT IoT LAB).

Set up Dstat

G5K, FIT IoT LAB, or Chameleon Cloud

Setting up Dstat is simple (see the example below).

monitoring:
  type: dstat

Set up TIG stack: Telegraf/InfluxDB/Grafana

The TIG stack requires a monitoring provider: a dedicated machine hosting InfluxDB and Grafana. To visualize the monitoring data in Grafana, follow the instructions in layers_services-validate.yaml (a file located in the experiment directory).

Once deployed, the monitoring provider is available at an address such as http://paradoxe-10.rennes.grid5000.fr:3000. You can access it from your local machine through an SSH tunnel, for example: ssh -NL 3000:localhost:3000 paradoxe-10.rennes.grid5000.fr. Use admin as both the username and the password.

G5K

monitoring:
  type: tig
  provider: g5k
  # you can use `cluster` or `servers` to deploy the monitoring provider
  cluster: paradoxe
  servers: ["paradoxe-10.rennes.grid5000.fr"]
  # `shared` or `private`
  # if `private`, a new network is created for the monitoring traffic.
  # if `private`, it requires at least 2 NICs in the machine.
  network: shared
  # IP version of the network used by the monitoring provider: 4 or 6
  ipv: 4
  # you can provide a config file (must be in `artifacts_dir`) for the telegraf agents.
  agent_conf: telegraf.conf.j2
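
Since the monitoring provider can be placed with either `cluster` or `servers`, a variant that pins it to one specific machine might look like the sketch below. It assumes that `servers` alone is sufficient when `cluster` is omitted, which follows from the comment in the example but is worth checking against your E2Clab version. The host name is the one used elsewhere on this page.

```yaml
monitoring:
  type: tig
  provider: g5k
  # pin the monitoring provider to one machine instead of naming a cluster
  servers: ["paradoxe-10.rennes.grid5000.fr"]
  network: shared
  ipv: 4
```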

Chameleon Cloud

monitoring:
  type: tig
  provider: chameleoncloud
  cluster: compute_cascadelake

G5K + FIT IoT LAB

For G5K + FIT IoT LAB, a firewall rule is needed. The reconfigurable Firewall API resource URLs are of the form https://api.grid5000.fr/stable/sites/<site>/firewall/<jobid> where <site> and <jobid> are the Grid’5000 site and the OAR job number for which one requests openings. For instance: https://api.grid5000.fr/stable/sites/rennes/firewall/1961803.

In the example below, we open a firewall rule for the monitoring_service (the monitoring provider) on port 8086 (InfluxDB). It allows the telegraf agents on FIT IoT LAB nodes to send their data to the monitoring service on G5K.

environment:
  g5k:
    cluster: paradoxe
    job_type: ["allow_classic_ssh"]
    firewall_rules:
      - services: ["monitoring_service"]
        ports: [8086]
  iotlab:
    cluster: grenoble
monitoring:
  type: tig
  provider: g5k
  cluster: paradoxe
  network: shared
  ipv: 6

Set up TPG stack: Telegraf/Prometheus/Grafana

G5K

monitoring:
  type: tpg
  provider: g5k
  # you can use `cluster` or `servers` to deploy the monitoring provider
  cluster: paradoxe
  servers: ["paradoxe-10.rennes.grid5000.fr"]
  # `shared` or `private`
  # if `private`, a new network is created for the monitoring traffic.
  # if `private`, it requires at least 2 NICs in the machine.
  network: shared
  # IP version of the network used by the monitoring provider: 4 or 6
  ipv: 4

Chameleon Cloud

monitoring:
  type: tpg
  provider: chameleoncloud
  cluster: compute_cascadelake

G5K + FIT IoT LAB

Prometheus uses a pull model to scrape metrics from the telegraf agents, so no firewall rule is needed in this case. IPv6 connections from Grid’5000 to FIT IoT LAB are allowed (the reverse is not true unless you open the firewall port, as presented earlier).

monitoring:
  type: tpg
  provider: g5k
  cluster: paradoxe
  network: shared
  ipv: 6

Set up a monitoring profile in FIT IoT LAB (energy consumption)

Next, we show how to set up a monitoring profile to monitor current, voltage, and power of FIT IoT LAB nodes (in this case, a8 and rpi3 nodes). You can manage the monitoring profiles in the dashboard through this link https://www.iot-lab.info/testbed/resources/monitoring.

monitoring_iotlab:
  profiles:
    - name: test_capture_a8
      archi: a8               # ['a8', 'custom']
      current: True           # [True, False]
      power: True             # [True, False]
      voltage: True           # [True, False]
      period: 8244            # [140, 204, 332, 588, 1100, 2116, 4156, 8244]
      average: 4              # [1, 4, 16, 64, 128, 256, 512, 1024]
    - name: test_capture_rpi
      archi: custom
      current: True
      power: True
      voltage: True
      period: 8244
      average: 4

Monitoring Grid’5000 using Kwollect metrics

Setting up your G5K reservation

Grid’5000 has a monitoring service embedded in the testbed, which gives access to environmental and performance metrics on nodes. Feel free to get familiar with this service through the official documentation.

Some metrics may not be activated by default on the clusters that you want to use (see documentation). To activate them, modify your Grid’5000 reservation in your layers_services.yaml file. The monitor parameter takes either the name of the metric you wish to activate or a regular expression matching all the metrics you wish to activate.

environment:
  job_name: e2clab-test
  walltime: 01:00:00
  g5k:
    job_type: ["allow_classic_ssh"]
    cluster: paradoxe
    # activate all temperature-related metrics
    monitor: ".*temp.*"
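
Because `monitor` also accepts a single metric name, the reservation can be narrowed to one metric. The sketch below reuses the metric name wattmetre_power_watt from this page purely as an illustration. Whether that metric needs explicit activation on a given cluster is an assumption to check against the Grid’5000 documentation.

```yaml
environment:
  job_name: e2clab-test
  walltime: 01:00:00
  g5k:
    job_type: ["allow_classic_ssh"]
    cluster: paradoxe
    # activate one metric by name (illustrative choice of metric)
    monitor: "wattmetre_power_watt"
```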

Configuring metrics collection

Next, to configure which metrics to pull from the API and on which nodes, add the following configuration to layers_services.yaml.

E2Clab allows the user to pull metrics from the Grid’5000 API for a given period of their workflow, e.g. “prepare”, “launch” or “finalize”.

kwollect:
  # Metrics to pull from the API
  metrics:
    - wattmetre_power_watt
    - bmc_ambient_temp_celcius
  # Workflow step to monitor
  step: "launch" # or "prepare", "finalize", "infra", "wait"

metrics

This is where you specify the list of metrics you want to fetch from the monitoring API. If this list contains the value all, E2Clab will fetch all available metrics, e.g.

kwollect:
  metrics:
    - all

step

This is the time period for which E2Clab fetches the data. Valid values are:

  • infra: from the beginning of the Grid’5000 reservation to the job deletion

  • prepare

  • launch

  • wait: the waiting time of a deployment run with the --duration parameter

  • finalize

It defaults to “launch” if not specified.
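
As a small illustration of that default, the sketch below omits `step` entirely, so metrics would be fetched for the “launch” step. The metric name is taken from the earlier example on this page.

```yaml
kwollect:
  metrics:
    - wattmetre_power_watt
  # no `step` given: E2Clab falls back to "launch"
```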

start / end

To fetch metrics over a wider time range than a single step, specify the step whose start opens the range and the step whose end closes it. For example, the configuration below fetches data from the start of the “launch” step to the end of the “finalize” step.

kwollect:
  # Metrics to pull from the API
  metrics:
    - wattmetre_power_watt
    - bmc_ambient_temp_celcius
  start: "launch"
  end: "finalize" # If end not set, current time will be used
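
Since `end` is optional, an open-ended range can be sketched as follows. This assumes that “prepare” is accepted as a `start` value in the same way as “launch”; the metric name is reused from the example above.

```yaml
kwollect:
  metrics:
    - wattmetre_power_watt
  # fetch from the start of the "prepare" step; `end` is omitted,
  # so the current time is used as the end of the range
  start: "prepare"
```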

Configuring nodes / services to monitor

With the previous configuration, E2Clab will fetch data for every node reserved on Grid’5000 for your experiment. You can specify which services you wish to monitor with kwollect using the k_monitor role.

Here, only the Server service will be monitored with kwollect.

layers:
- name: cloud
  services:
  - name: Server
    roles: ["k_monitor"]
    cluster: "paradoxe"
- name: fog
  services:
  - name: Producer
    cluster: "paradoxe"

Output

The data is written as CSV files in the kwollect-data directory.

Starting monitoring and saving captured data

For Dstat, TIG, and TPG, the monitoring is started during the launch step of the workflow.yaml file or when the user executes the following command:

$ e2clab workflow /path/to/scenario/ launch

This allows monitoring data to be captured before each Service starts.

Energy monitoring in FIT IoT LAB starts after reservation.

All the monitoring data is saved in the /path/to/scenario_dir/monitoring-data/ directory. It is saved in the finalize step with the following command:

$ e2clab finalize /path/to/scenario_dir/

Besides saving the monitoring data, the finalize step also stops the monitoring services and agents started on each machine.

Try some examples

We provide a few tutorials: