Kwollect monitoring on Grid’5000

In this tutorial, we show how to use E2Clab with kwollect, the monitoring service deployed on Grid’5000 (see the kwollect documentation).

Kwollect is particularly useful because it gives access to many metrics that are already available on the hosts, such as:

  • wattmetre power consumption

  • metrics from the Board Management Controller (bmc)

  • metrics from Prometheus node exporter

  • etc.

You can find all the metrics available on each Grid’5000 cluster in the kwollect documentation.

In this short example, we show how to:

  • Fetch some of those metrics for your experiments

  • Analyze them

  • Push your own custom metrics to the monitoring stack

Experiment Artifacts

The artifacts repository contains the E2Clab configuration files:

$ git clone https://gitlab.inria.fr/E2Clab/examples/monitoring-kwollect.git
$ cd monitoring-kwollect

The structure of the experimental setup looks like this:

monitoring-kwollect/        # SCENARIO_DIR
├── artifacts/              # ARTIFACTS_DIR
│   └── push_metrics.sh
├── .e2c_env
├── layers_services.yaml
├── networks.yaml
├── workflow.yaml
├── workflow_env.yaml
├── analysis.py
├── requirements.txt
├── README.md
└── ...

Defining the Experimental Environment

Experiment definition

Notice that the experiment artifacts contain a .e2c_env file:

.e2c_env
E2C_DEBUG=false
E2C_SCENARIO_DIR="./"
E2C_ARTIFACTS_DIR="./artifacts/"

This file defines environment variables that are passed to the e2clab CLI. As described in the documentation (E2Clab command-line interface), environment variables can be defined for the SCENARIO_DIR and ARTIFACTS_DIR arguments, as well as for other CLI arguments and options.

In this example, we define E2C_SCENARIO_DIR and E2C_ARTIFACTS_DIR, and set whether we want debug logs. This way, we do not have to pass the SCENARIO_DIR and ARTIFACTS_DIR arguments every time we launch a command.
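E2Clab loads this file itself. The minimal Python sketch below only illustrates the KEY=value format by parsing such a file into a dictionary. You could use it, for instance, in a helper script that checks which directories the CLI will use. read_e2c_env is a hypothetical helper and is not part of E2Clab. E2Clab's own parser may handle quoting or comments differently.

```python
from pathlib import Path


def read_e2c_env(path: Path) -> dict[str, str]:
    """Parse KEY=value lines from an .e2c_env-style file (illustrative sketch)."""
    env: dict[str, str] = {}
    for raw in path.read_text().splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=value line, ignore it
        value = value.strip()
        # Drop one pair of surrounding quotes, as in E2C_SCENARIO_DIR="./"
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        env[key.strip()] = value
    return env
```

For the file shown above, read_e2c_env(Path(".e2c_env"))["E2C_ARTIFACTS_DIR"] returns "./artifacts/".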

Payload

The payload, found in the artifacts directory, is a simple bash script that pushes a my_e2clab_metric metric to the kwollect monitoring stack.

push_metrics.sh
#!/bin/bash

for val in {1..10}; do
  echo "kwollect_custom{_metric_id=\"my_e2clab_metric\"} $val" >/var/lib/prometheus/node-exporter/kwollect.prom
  sleep 15
done
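If you prefer Python to bash, the sketch below does the same job as push_metrics.sh. It uses the same file path and the same kwollect_custom line format, and writes values 1 to 10, one every 15 seconds. There is one deliberate difference: it writes to a temporary file and then renames it. The Prometheus node exporter textfile collector recommends this pattern so that a scrape never reads a half-written file. We assume the file is read by that collector, based on its location. The function names are hypothetical.

```python
import os
import time
from pathlib import Path

# Path taken from push_metrics.sh
PROM_FILE = Path("/var/lib/prometheus/node-exporter/kwollect.prom")


def format_custom_metric(metric_id: str, value: float) -> str:
    """Build one kwollect_custom line, in the same format as push_metrics.sh."""
    return f'kwollect_custom{{_metric_id="{metric_id}"}} {value}\n'


def push_metric(metric_id: str, value: float, prom_file: Path = PROM_FILE) -> None:
    """Write the metric atomically: write a temporary file, then rename it.

    The temporary name ends in .tmp, not .prom, so it is not read
    while it is still being written.
    """
    tmp = prom_file.with_name(prom_file.name + ".tmp")
    tmp.write_text(format_custom_metric(metric_id, value))
    os.replace(tmp, prom_file)  # atomic rename on POSIX (same filesystem)


def push_loop(metric_id: str = "my_e2clab_metric", count: int = 10,
              period_s: float = 15.0, prom_file: Path = PROM_FILE) -> None:
    """Mimic push_metrics.sh: push values 1..count, one every period_s seconds."""
    for val in range(1, count + 1):
        push_metric(metric_id, val, prom_file)
        time.sleep(period_s)
```

Like the bash version, each write replaces the previous value, so kwollect only sees the latest one at each collection.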

Note

If you change the name of the metric from my_e2clab_metric to any other name, make sure that the new name matches the regular expression in the monitor option of the layers_services.yaml file; otherwise, it will not be collected.

To learn more about custom metrics and at what timestamp they are recorded, see the Pushing custom metrics section of the kwollect documentation.

Layers & Services Configuration

The layers_services.yaml file below defines our experiment’s deployment. We make a reservation on the paradoxe cluster and define a new section called kwollect (its schema is described in Schema).

Within the kwollect section we define two things:

  • The metrics we want to pull from kwollect, in the metrics option:
    • wattmetre_power_watt

    • bmc_node_power_watt

    • Our custom my_e2clab_metric that we are going to push during the experiment

  • The timeframe to pull the metrics from, in the step option:
    • We pull the metrics recorded while the launch step of the workflow was running

Learn more about those options in the documentation section dedicated to Monitoring Grid’5000 using Kwollect metrics.

Another important detail: the kwollect documentation mentions that not all metrics are recorded on the nodes by default. To make sure that we have access to the bmc_node_power_watt metric, and that our custom my_e2clab_metric is collected by the kwollect monitoring stack, we have to activate them with the monitor option in the g5k section of the environment.

layers_services.yaml
environment:
  job_name: e2clab-kwollect
  walltime: 01:00:00
  g5k:
    job_type: ["allow_classic_ssh"]
    cluster: paradoxe
    # Activating all metrics containing 'power' or 'metric'
    monitor: ".*metric.*|.*power.*"
kwollect:
  metrics:
    # wattmetre_power_watt is activated by default
    - wattmetre_power_watt
    - bmc_node_power_watt
    - my_e2clab_metric
    # If you want to pull all metrics possible:
    # - all
  step: "launch"
layers:
- name: cloud
  services:
  - name: Server
    # You can specify which services to pull metrics from with the "k_monitor" role
    # If "kwollect" is defined but none are set, it will pull from all Grid'5000 nodes by default
    # roles: ["k_monitor"]
    cluster: "paradoxe"
- name: fog
  services:
  - name: Producer
    cluster: "paradoxe"
    # roles: ["k_monitor"]
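A metric that does not match the monitor pattern is silently left out. It can therefore be worth checking your metric names offline before reserving nodes. The minimal Python sketch below does this for the pattern ".*metric.*|.*power.*", which matches names containing "metric" or "power", as the comment in the file describes.

The sketch makes two assumptions. First, it uses re.fullmatch, the strictest interpretation: a name accepted here is also accepted if Grid’5000 only looks for a matching substring. Second, the helper names are hypothetical and not part of E2Clab.

```python
import re

# Pattern matching names that contain "metric" or "power"
MONITOR_REGEX = ".*metric.*|.*power.*"
# Metrics requested in the kwollect section
REQUESTED = ["wattmetre_power_watt", "bmc_node_power_watt", "my_e2clab_metric"]
# Collected even without the monitor option (per the comment in the file)
DEFAULT_ON = {"wattmetre_power_watt"}


def is_activated(metric: str, monitor_regex: str) -> bool:
    """Conservative check: the whole metric name must match the pattern."""
    return re.fullmatch(monitor_regex, metric) is not None


def missing_metrics(requested, monitor_regex, default_on=frozenset()):
    """Return requested metrics that are neither on by default nor matched."""
    return [
        m for m in requested
        if m != "all" and m not in default_on and not is_activated(m, monitor_regex)
    ]
```

With the values above, missing_metrics(REQUESTED, MONITOR_REGEX, DEFAULT_ON) returns an empty list. A renamed custom metric such as my_e2clab_value would be reported as missing.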

Network Configuration

No network emulation needed:

networks.yaml
networks:

Workflow Configuration

In this simple monitoring demonstration, we just run some stress tests on the hosts with the stress command. We also run the push_metrics.sh script in the background on the fog host to demonstrate how to publish and collect custom metrics with kwollect.

In this case, the push_metrics.sh script simply sets my_e2clab_metric to a new value every 15 seconds. This can be useful if you need to monitor some values in real time, or if you want to fetch your own metrics together with the data pulled from the kwollect API.

The {{ env_time }} variables refer to the time value of the application configurations defined in the workflow_env.yaml file. With the long configuration, each stress command lasts 60s; with the short configuration, it lasts 30s.

To learn more about how workflow.yaml and workflow_env.yaml work together, see the following documentation: workflow_env.yaml and application configuration.

workflow.yaml
- hosts: cloud.*
  launch:
    # Running a stress command
    - shell: "stress -c 8 --timeout {{ env_time }} && sleep 10 && stress -c 8 --timeout {{ env_time }}"

- hosts: fog.*
  prepare:
    - copy:
        src: "{{ working_dir }}/push_metrics.sh"
        dest: "/opt/push_metrics.sh"
        mode: preserve
  launch:
    # Pushing our custom metrics in the background
    - shell: "/opt/push_metrics.sh"
      async: 3600
      poll: 0
    # Running a stress command
    - shell: "stress -c 8 --timeout {{ env_time }} && sleep 10 && stress -c 8 --timeout {{ env_time }}"

workflow_env.yaml
short:
  time: "30s"

long:
  time: "60s"
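E2Clab performs the {{ env_time }} substitution itself. The double-brace syntax is that of Jinja2, which Ansible uses for its templates. The sketch below only mimics the substitution with a regular expression, to show how each application configuration turns into a concrete command. render and the env_ prefix handling are illustrative assumptions, not E2Clab code.

```python
import re

# Application configurations, mirroring workflow_env.yaml
WORKFLOW_ENV = {"short": {"time": "30s"}, "long": {"time": "60s"}}

# Launch command from workflow.yaml
LAUNCH_CMD = ("stress -c 8 --timeout {{ env_time }} && sleep 10 && "
              "stress -c 8 --timeout {{ env_time }}")

# Matches {{ env_<key> }}, with optional spaces inside the braces
_VAR = re.compile(r"\{\{\s*env_(\w+)\s*\}\}")


def render(template: str, conf: dict) -> str:
    """Replace each {{ env_<key> }} with conf[<key>] (KeyError if undefined)."""
    return _VAR.sub(lambda m: str(conf[m.group(1)]), template)


for name, conf in WORKFLOW_ENV.items():
    print(f"{name}: {render(LAUNCH_CMD, conf)}")
```

With these values, each host's launch command takes about 2 × 30 s + 10 s = 70 s with the short configuration, and 2 × 60 s + 10 s = 130 s with the long one.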

Running & Verifying Experiment Execution

Run the experiment

Use the command below to run this example:

$ e2clab deploy --app_conf short,long

This command runs the whole workflow for both application configurations, short and long.

Check metrics in real-time

During the experiment’s execution, you can follow the metrics on the kwollect dashboard of the rennes site, which hosts the paradoxe cluster, at https://api.grid5000.fr/stable/sites/rennes/metrics/dashboard. Enter the ID of the Grid’5000 job corresponding to the experiment’s deployment.

You can find this job ID in the logs printed while the experiment runs, or in the e2clab.log file:

[E2C,KWOLLECT] Access kwollect metric dashboard for job <JOB ID>: https://api.grid5000.fr/stable/sites/rennes/metrics/dashboard
Figure: Example of visualization on the kwollect dashboard interface
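If you script your experiments, you may want to retrieve the job ID without reading the logs by hand. The minimal Python sketch below extracts it from log text. It assumes that the log line keeps exactly the format shown above (a prefix such as a timestamp is tolerated) and that OAR job IDs are integers. find_dashboard_jobs and jobs_from_log_file are hypothetical helpers.

```python
import re
from pathlib import Path

# Matches the "[E2C,KWOLLECT] Access kwollect metric dashboard ..." line shown above
_JOB_LINE = re.compile(
    r"\[E2C,KWOLLECT\] Access kwollect metric dashboard for job (\d+): (\S+)"
)


def find_dashboard_jobs(log_text: str) -> list[tuple[int, str]]:
    """Return (job_id, dashboard_url) pairs found anywhere in the log text."""
    return [(int(m.group(1)), m.group(2)) for m in _JOB_LINE.finditer(log_text)]


def jobs_from_log_file(path: Path) -> list[tuple[int, str]]:
    """Same as find_dashboard_jobs, reading the text from a log file."""
    return find_dashboard_jobs(path.read_text(errors="replace"))
```

For example, jobs_from_log_file(Path("YYYYMMDD-hhmmss/e2clab.log")) lists the jobs announced in that run's log.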

Analyze experiment metrics

At the end of the deployment, the metrics you requested are fetched from the kwollect API and saved in a csv file in your results directory, which should look like:

YYYYMMDD-hhmmss/
├── long/
│   ├── monitoring-data/
│   │   └── kwollect-data/
│   └── workflow-validate.out
├── short/
│   └── ...
├── e2clab.err
├── e2clab.log
├── layers_services-validate.yaml
└── workflow-validate.out

You will get output data for both the long and short configurations of the experiment.
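To process every configuration with analysis.py or your own scripts, you can list the kwollect-data files programmatically. The minimal sketch below relies only on the directory layout shown above. It makes no assumption about the format of the files. kwollect_files is a hypothetical helper.

```python
from pathlib import Path


def kwollect_files(result_dir: Path) -> dict[str, list[Path]]:
    """Map each application configuration to its kwollect-data files.

    Expects the layout <result_dir>/<conf>/monitoring-data/kwollect-data/<files>.
    """
    found: dict[str, list[Path]] = {}
    for data_dir in sorted(result_dir.glob("*/monitoring-data/kwollect-data")):
        conf = data_dir.parent.parent.name  # e.g. "long" or "short"
        found[conf] = sorted(p for p in data_dir.iterdir() if p.is_file())
    return found
```

For example, iterate over kwollect_files(Path("YYYYMMDD-hhmmss")).items() and pass each file to your analysis.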

Power consumption

We provide a simple Python script, analysis.py, to visualize the data pulled from the kwollect API.

Note

It is best to run the following commands inside a Python virtual environment.

$ python3 -m venv venv
$ source venv/bin/activate
(venv)$ pip install -r requirements.txt
(venv)$ python analysis.py YYYYMMDD-hhmmss/long/monitoring-data/kwollect-data/<site>.yaml
Figure: Example of power consumption monitoring on Grid’5000 nodes

We can clearly see the rise in power consumption caused by the stress command.
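The plot shows the rise qualitatively. To put a number on it, you can integrate the power samples over time with the trapezoidal rule. This sketch assumes you have already extracted (timestamp in seconds, power in watts) pairs for one node and one metric, for instance wattmetre_power_watt, from the kwollect-data file. It does not parse that file. Comparing the energy of a stress window with that of an idle window of the same length gives the extra energy attributable to the stress command.

```python
def energy_joules(samples: list[tuple[float, float]]) -> float:
    """Trapezoidal integral of power (W) over time (s), returned in joules.

    samples: (timestamp_s, power_w) pairs; they are sorted by time first.
    """
    pts = sorted(samples)
    return sum(
        (t1 - t0) * (p0 + p1) / 2.0
        for (t0, p0), (t1, p1) in zip(pts, pts[1:])
    )


def mean_power_watts(samples: list[tuple[float, float]]) -> float:
    """Time-weighted mean power over the sampled interval."""
    pts = sorted(samples)
    duration = pts[-1][0] - pts[0][0]
    if duration <= 0:
        raise ValueError("need samples spanning a non-zero duration")
    return energy_joules(pts) / duration
```

For samples [(0, 100), (10, 100), (20, 200)], energy_joules returns 2500.0 J and mean_power_watts returns 125.0 W.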

Note

To know more about the caveats of power monitoring on Grid’5000, check the following link: https://www.grid5000.fr/w/Unmaintained:Power_Monitoring_Devices#measurement_artifacts_and_pitfalls

Custom metric

We can also check that our custom metric my_e2clab_metric has been captured by kwollect. First, look at the online dashboard and enter my_e2clab_metric in the Metric name selection:

Figure: my_e2clab_metric displayed on the kwollect dashboard

We can also check that it was indeed pulled from the API and saved in our experiment results:

$ grep "my_e2clab_metric" YYYYMMDD-hhmmss/long/monitoring-data/kwollect-data/<site>.yaml

Note

To learn more about custom metrics and at what timestamp they are recorded, see the Pushing custom metrics section of the kwollect documentation.

Free the computing resources

Once you are done, you may kill your job on the Grid’5000 platform using the following command:

$ e2clab destroy